Datasets
These are datasets for use in the NLP from scratch training workshops and webinars.
Amazon Reviews
A 25K-record sample of reviews in the Software category from the Amazon Reviews Dataset.
Amazon_SoftwareReviews_25K_sample.csv (25K records, 14 MB)
Amazon/Rotten Tomatoes/Yelp
Manually assembled dataset of Amazon reviews, Rotten Tomatoes reviews, and Yelp reviews for a multi-class NLP classification task. The dataset is perfectly balanced, with 5K reviews from each source (the class label).
amazon_rt_yelp.csv (15K records, 5 MB)
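A balanced multi-class file like this is easy to sanity-check before training. The sketch below uses only the standard library and an inline stand-in for a few rows of the CSV; the column names (`text`, `source`) and the review strings are assumptions for illustration, so check the real `amazon_rt_yelp.csv` header before relying on them.

```python
# Minimal class-balance check for a multi-class review dataset.
# NOTE: column names ("text", "source") and the rows below are assumed
# placeholders standing in for amazon_rt_yelp.csv, not actual file content.
import csv
import io
from collections import Counter

sample_csv = """text,source
"Great product, works as advertised.",amazon
"A moving film with a standout lead.",rotten_tomatoes
"Service was slow but the food was great.",yelp
"Stopped working after a week.",amazon
"Predictable plot, flat characters.",rotten_tomatoes
"Best tacos in the neighborhood.",yelp
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
counts = Counter(row["source"] for row in rows)
# In the full file, each class should appear exactly 5,000 times.
print(dict(counts))
```

To run this against the real file, replace `io.StringIO(sample_csv)` with `open("amazon_rt_yelp.csv", newline="", encoding="utf-8")`.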
Cornell Movie Dialog
This is a reduced version of the Cornell Movie Dialog Corpus by Cristian Danescu-Niculescu-Mizil.
The original dataset contains 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies, for a total of 304,713 utterances.
This reduced version of the dataset contains only the character tags and utterances from the movie_lines.txt file, with one utterance per line, suitable for training generative text models.
Emoji SMS
This is the emoji_sms dataset: 50 hypothetical SMS messages including emojis, generated by ChatGPT on 2023-11-16.
- train.csv (50 records, 6 KB)
IMDB
This is the Large Movie Review Dataset, as provided by Andrew Maas of Stanford; the original comprises 50,000 “highly polar” movie reviews from IMDB.com.
The “IMDB data from 2006 to 2016” dataset from Kaggle, as originally provided by PromptCloud, covers the 1,000 most popular movies on IMDB in the 10 years before its creation. Its fields are: Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue, and Metascore.
- IMDB Dataset.csv: The consolidated dataset as downloaded from Kaggle (50K records, 64 MB)
- imdb_reviews_sample.csv: A subset of 10% of the above (5K records, 6 MB)
Peripheral
This is a cleaned and reformatted version of transcriptions of the dialogue from the Amazon TV series The Peripheral.
The transcripts were originally sourced from the site SubsLikeScript.
The data in the raw folder are the original text with blank lines removed and expletives censored.
The processed folder contains three files:
- all.csv - all dialogue, unshuffled and concatenated from the raw files (9976 records, 236 KB)
- train.csv - a 70% sample of the combined files, lines shuffled (6983 records, 166 KB)
- test.csv - a 30% sample of the combined files, lines shuffled (2993 records, 70 KB)
Each file contains a single column, text, and is suitable for generative text tasks.
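Since each file has just the one text column, assembling a training corpus is a one-liner over the CSV rows. The sketch below uses an inline stand-in with the same one-column layout; the dialogue lines themselves are invented placeholders, not actual transcript content.

```python
# Read the single "text" column and concatenate lines into one corpus string,
# as a generative-text pipeline might. The rows below are placeholder lines
# standing in for train.csv, which has the same one-column layout.
import csv
import io

sample_csv = """text
Where are you right now?
I told you not to come back here.
It's not a game anymore.
"""

lines = [row["text"] for row in csv.DictReader(io.StringIO(sample_csv))]
corpus = "\n".join(lines)
print(len(lines), "lines,", len(corpus), "characters")
```

Swapping in `open("processed/train.csv", newline="", encoding="utf-8")` reads the real file the same way.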
tinyshakespeare
This is the Tiny Shakespeare dataset, as provided by Andrej Karpathy in 2015 as part of char-rnn.
The data is a “subset of the works of Shakespeare” in freeform text, ~1 MB in size and roughly 40K lines long.
- Data: input.txt (1 MB)
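Because input.txt is raw freeform text, it is typically consumed character by character. Below is a minimal character-level encode/decode sketch in the style commonly used with this dataset (as in char-rnn-derived tutorials); the string is a short stand-in for the file's contents, not a quote from input.txt.

```python
# Character-level tokenization sketch for a Tiny Shakespeare-style corpus.
# "text" is a placeholder standing in for the contents of input.txt.
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

chars = sorted(set(text))                      # vocabulary of unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)
assert decode(ids) == text                     # round-trip check
print("vocab size:", len(chars))
```

For the real file, `text = open("input.txt", encoding="utf-8").read()` gives the full ~1 MB corpus with the same treatment.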
Yelp
Sample of reviews from the Yelp Dataset, sourced from GitHub.
- Data: yelp_reviews_sample.csv (91K records, 65 MB)
Yoda
This is a reduced version of the Yoda Speech Corpus from Kaggle by Stefano Coretta.
The original dataset contains 370 lines of dialog from all characters, as well as other movie metadata. This reduced file contains only Yoda's lines of dialog and comprises 102 lines of raw text.
- Data: yoda.csv (103 records, 7 KB)