Datasets

These are datasets for use in the NLP from scratch training workshops and webinars.

Table of contents

Amazon Reviews

A 25K-record sample of reviews in the Software category from the Amazon Reviews Dataset.

Amazon_SoftwareReviews_25K_sample.csv (25K records, 14 MB)

Amazon/Rotten Tomatoes/Yelp

Manually assembled dataset of Amazon reviews, Rotten Tomatoes reviews, and Yelp reviews for a multi-class NLP classification task. The dataset is perfectly balanced, with 5K reviews from each source (the class label).

amazon_rt_yelp.csv (15K records, 5 MB)
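Since the source acts as the class label, a quick sanity check after loading is to count rows per class. The sketch below uses a synthetic in-memory stand-in for a few rows of amazon_rt_yelp.csv; the column names `text` and `source` are assumptions, not confirmed by the file.

```python
import csv
import io
from collections import Counter

# Synthetic stand-in for a few rows of amazon_rt_yelp.csv.
# The column names "text" and "source" are assumptions.
sample = io.StringIO(
    "text,source\n"
    "Works as advertised.,amazon\n"
    "A well-acted film.,rotten_tomatoes\n"
    "Friendly staff and good food.,yelp\n"
)

rows = list(csv.DictReader(sample))
counts = Counter(row["source"] for row in rows)
print(counts)  # in the full file, each source should appear 5,000 times
```

With the real file, replace the `StringIO` object with `open("amazon_rt_yelp.csv", newline="")`.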

Cornell Movie Dialog

This is a reduced version of the Cornell Movie Dialog Corpus by Cristian Danescu-Niculescu-Mizil.

The original dataset contains 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies, for a total of 304,713 utterances.

This reduced version of the dataset contains only the character tags and utterances from the movie_lines.txt file, with one utterance per line, suitable for training generative text models.
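Parsing the reduced file then amounts to splitting each line into its character tag and utterance. In this sketch the input is a small synthetic stand-in, and the tab delimiter between tag and utterance is an assumption, not confirmed by the file.

```python
# Synthetic stand-in for a few lines of the reduced Cornell file.
# The tab delimiter between character tag and utterance is an assumption.
raw = "BIANCA\tThey do not!\nCAMERON\tThey do to!\n"

pairs = []
for line in raw.splitlines():
    speaker, utterance = line.split("\t", 1)
    pairs.append((speaker, utterance))

# For generative text training, you might keep only the utterance text:
utterances = [u for _, u in pairs]
print(utterances)
```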

Emoji SMS

This is the emoji_sms dataset, 50 hypothetical SMS messages including emojis, generated by ChatGPT on 2023-11-16.

IMDB

This is the Large Movie Review Dataset, as provided by Andrew Maas of Stanford. The original dataset comprises 50,000 “highly polar” movie reviews from IMDB.com.

Also included is the “IMDB data from 2006 to 2016” dataset from Kaggle, as originally provided by PromptCloud. It covers the 1,000 most popular movies on IMDB over the 10 years prior to its creation. The data points included are: Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue, Metascore.
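When working with the Kaggle file, note that numeric fields such as Rating and Votes are read as strings and need casting before analysis. The example below uses an invented two-row stand-in with a subset of the columns listed above.

```python
import csv
import io

# Synthetic stand-in using column names from the schema above;
# the values are invented for illustration. The real file has 1,000 rows.
sample = io.StringIO(
    "Title,Genre,Year,Runtime,Rating,Votes\n"
    "Example Movie,Drama,2012,120,7.8,12345\n"
    "Another Movie,Comedy,2015,95,6.4,6789\n"
)

rows = list(csv.DictReader(sample))

# Numeric fields arrive as strings; cast before comparing.
best = max(rows, key=lambda r: float(r["Rating"]))
print(best["Title"])
```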

Peripheral

This is a cleaned and reformatted version of transcriptions of the dialogue from the Amazon TV series The Peripheral.

The transcripts were originally sourced from the site SubsLikeScript.

The files in the raw folder contain the original text with blank lines removed and expletives censored.

The processed folder contains three files:

  • all.csv - all dialogue, unshuffled and concatenated from the raw files (9976 records, 236 KB)
  • train.csv - a 70% sample of the combined files, lines shuffled (6983 records, 166 KB)
  • test.csv - a 30% sample of the combined files, lines shuffled (2993 records, 70 KB)

All three files contain a single column, text, and are suitable for generative text tasks.
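A 70/30 shuffled split like the one described above can be sketched with the standard library alone. The exact shuffling procedure and seed used for train.csv and test.csv are not documented, so this is an illustrative reconstruction on stand-in data, not the original script.

```python
import random

# Stand-in for the rows of all.csv; the real file has 9,976 records.
lines = [f"utterance {i}" for i in range(10)]

random.seed(42)  # any fixed seed makes the split reproducible
shuffled = lines[:]
random.shuffle(shuffled)

# 70% of the shuffled lines go to train, the rest to test.
cut = int(len(shuffled) * 0.7)
train, test = shuffled[:cut], shuffled[cut:]
print(len(train), len(test))
```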

tinyshakespeare

This is the Tiny Shakespeare dataset, as provided by Andrej Karpathy in 2015 as part of char-rnn.

The data is a “subset of the works of Shakespeare” in freeform text, ~1 MB in size and about 40K lines long.
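A freeform text file like this is typically consumed at the character level for char-rnn-style models: build a character vocabulary, then map text to integer indices and back. The snippet below shows that preprocessing pattern on a short stand-in string rather than the full ~1 MB file.

```python
# Stand-in for the tinyshakespeare text; the same steps apply to the
# full file read via open("input.txt").read().
text = "To be, or not to be, that is the question."

chars = sorted(set(text))                      # character vocabulary
stoi = {c: i for i, c in enumerate(chars)}     # char -> index
itos = {i: c for c, i in stoi.items()}         # index -> char

encoded = [stoi[c] for c in text]              # model-ready integer ids
decoded = "".join(itos[i] for i in encoded)    # round-trip back to text
print(len(chars))
```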

Yelp

Sample of reviews from the Yelp Dataset, sourced from GitHub.

Yoda

This is a reduced version of the Yoda Speech Corpus from Kaggle by Stefano Coretta.

The original dataset contains 370 lines of dialog from all characters, as well as other movie metadata. This reduced file contains only the lines of dialog spoken by Yoda, comprising 102 lines of raw text.