Master NLP and LLM Resource List

This is the master resource list for NLP from scratch. It is a living document that will be updated continually, so it should always be considered a work in progress.

This document is quite large, so you may wish to use the Table of Contents to find what you are looking for:

Table of contents

Thanks, and enjoy!

Traditional NLP

Datasets

  • nlp-datasets: Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)
  • awesome-public-datasets - Natural Language: Natural language section of the awesome public datasets github page
  • SMS Spam Dataset: The “Hello World” of NLP datasets, ~5.5K SMS messages labeled spam/not spam for binary classification. Hosted on the UC Irvine Machine Learning Repository.
  • IMDB dataset: The other “Hello World” of datasets for NLP, 50K “highly polar” movie reviews scraped from IMDB and compiled by Andrew Maas of Stanford (see the loading sketch after this list).
  • Twitter Airline Sentiment: Tweets about major US airlines from February 2015, with associated sentiment labels - hosted on Kaggle (~3.5MB)
  • CivilComments: Dataset from the Civil Comments platform, which shut down in 2017. 2M public comments with labels for toxicity, obscenity, threat, insult, etc.
  • Cornell Movie Dialog: ~220K conversations from 10K pairs of characters across 617 popular movies, compiled by Cristian Danescu-Niculescu-Mizil of Cornell. Tabular compiled format available on Hugging Face.
  • CNN Daily Mail: “Hello World” dataset for summarization, consisting of articles from CNN and Daily Mail and accompanying summaries. Also available through Tensorflow and via Hugging Face.
  • Entity Recognition Datasets: Very large list of named entity recognition (NER) datasets (on Github).
  • WikiNER: 7,200 manually-labelled Wikipedia articles across nine languages: English, German, French, Polish, Italian, Spanish, Dutch, Portuguese and Russian.
  • OntoNotes: Large corpus comprising various genres of text in three languages with structural information and shallow semantic information.
  • Flores-101: Multilingual, multi-task dataset from Meta for machine translation research, focusing on “low resource” languages. Associated Github repo.
  • CulturaX: Open dataset covering 167 languages with over 6T tokens, one of the largest multilingual datasets released to date
  • Amazon Review Datasets: Massive datasets of reviews from Amazon.com, compiled by Julian McAuley of University of California San Diego
  • Yelp Open Dataset: 7M reviews, 210K businesses, and 200K images released by Yelp. Note the educational license.
  • Google Books N-grams: Very large dataset (2.2TB) of all the n-grams from Google Books. Also available hosted in an S3 bucket by AWS.
  • Sentiment Analysis @ Stanford NLP: Includes a link to the dataset of movie reviews used for Stanford Sentiment Treebank 2 (SST2). Also available on Hugging Face.
  • CoNLL-2003: Language-independent entity recognition dataset from the Conference on Computational Natural Language Learning (CoNLL-2003) shared task. Foundational datasets for named entity recognition (NER).
  • LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset: Large-scale dataset of 1M conversations with LLMs, collected from the Chatbot Arena website.
  • TabLib: Largest publicly available dataset of tabular tokens (627M tables, 867B tokens), to encourage the community to build Large Data Models that better understand tabular data
  • LAION 5B: Massive dataset of images and captions from Large-scale Artificial Intelligence Open Network (LAION), used to train Stable Diffusion.
  • Databricks Dolly 15K: Instruction dataset compiled internally by Databricks, used to train the Dolly models based on the Pythia LLMs.
  • Conceptual Captions: Large image & caption pair dataset from Google research.
  • Instruction Tuning Volume 1: List of popular instruction-tuning datasets from Sebastian Ruder
  • Objaverse: Massive dataset of annotated 3D objects (with associated text labels) from Allen Institute. Comes in two sizes: 1.0 (800K objects) and XL (~10M objects).
  • Gretel Synthetic Text to SQL Dataset: Open dataset of synthetically generated natural language and SQL query pairs for LLM training, from Gretel AI.
  • FineWeb: 15T-token dataset of cleaned and deduplicated data from CommonCrawl, by Hugging Face.
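
Many of these datasets are mirrored on the Hugging Face Hub, so they can be pulled down in a couple of lines. A minimal loading sketch using the datasets library (assuming it is installed via pip install datasets), shown here for the IMDB reviews above:

```python
from datasets import load_dataset

# Load the IMDB movie review dataset from the Hugging Face Hub
imdb = load_dataset("imdb")

# DatasetDict with train/test (25K labeled examples each) plus an unsupervised split
print(imdb)

example = imdb["train"][0]
print(example["text"][:200])  # first 200 characters of the review
print(example["label"])       # 0 = negative, 1 = positive
```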

Back to Top ⬆️

Data Acquisition

Back to Top ⬆️

Libraries

  • Natural Language Toolkit (NLTK): Essential NLP python library, originally put together for teaching purposes at the University of Pennsylvania and now fundamental to NLP work.
  • spaCy: Fundamental python NLP library for “industrial-strength natural language processing”, focused on building production systems (see the usage sketch after this list).
  • Gensim: open-source python library with a focus on topic modeling, semantic similarity, and embeddings. Also contains implementations of word2vec and doc2vec.
  • fastText: Open-source, free, lightweight library that allows users to learn text representations (embeddings) and text classifiers. Includes pre-trained word vectors from Wikipedia and Common Crawl. From Meta’s FAIR Group.
  • KerasNLP: Natural language processing with deep learning and LLMs in Keras using Tensorflow, Pytorch, or JAX. Includes models such as BERT, GPT, and OPT.
  • Tensorflow Text: Lower level than KerasNLP, text manipulation built into Tensorflow.
  • Stanford CoreNLP: Java-based NLP library from Stanford, still important and in use
  • TextBlob: Easy to use NLP library in Python, including simple sentiment scoring and part-of-speech (POS) tagging.
  • Scikit-learn (sklearn): The essential library for machine learning in python, including utilities for working with text data (e.g. CountVectorizer and TfidfVectorizer).
  • SparkNLP: Essential Big Data library for NLP work from John Snow Labs. Take a look at their extensive model repo. Github repo with lots of resources here. Medium post here on using the T5 model for classification with SparkNLP.
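
As a quick illustration of the spaCy entry above, a minimal sketch of tagging and named entity recognition (assuming the small English model has been installed via python -m spacy download en_core_web_sm):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Part-of-speech tags and lemmas for each token
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Named entities detected in the text
for ent in doc.ents:
    print(ent.text, ent.label_)
```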

Back to Top ⬆️

Neural Networks / Deep Learning

Back to Top ⬆️

Sentiment Analysis

Back to Top ⬆️

Optical Character Recognition (OCR)

Back to Top ⬆️

Information Extraction and NERD

  • RAKE: Rapid Automatic Keyword Extraction, a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and co-occurrence with other words in the text.
  • YAKE: Yet Another Keyword Extractor is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text.
  • Pytextrank: Python implementation of TextRank and associated algorithms as a spaCy pipeline extension, for information extraction and extractive summarization.
  • PKE (Python Keyphrase Extraction): open source python-based keyphrase extraction toolkit, implementing a variety of algorithms. Uses spaCy.
  • KeyBERT: Keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document (see the sketch after this list).
  • UniversalNER: Targeted distillation model for named entity recognition from Microsoft Research and USC, based on data generated by ChatGPT.
  • SpanMarker: Framework for NER models based on transformers such as BERT, RoBERTa and ELECTRA using Hugging Face Transformers (HF page)
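
As an illustration of the KeyBERT entry above, a minimal sketch (assuming pip install keybert, which pulls in a default sentence-transformers embedding model):

```python
from keybert import KeyBERT

doc = (
    "Supervised learning is the machine learning task of learning a function "
    "that maps an input to an output based on example input-output pairs."
)

# Embeds the document and candidate phrases with a sentence-transformers model,
# then ranks candidates by cosine similarity to the document embedding
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), top_n=5)
print(keywords)  # list of (phrase, similarity score) tuples
```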

Back to Top ⬆️

Semantics and Syntax

  • Treebank: Definition at Wikipedia
  • Universal Dependencies: Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages.
  • UDPipe: UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files (see the parsing sketch after this list).
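
UDPipe ships its own bindings, but the same UD-style token / relation / head annotations can be illustrated with spaCy. A minimal sketch (not UDPipe itself):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Print each token with its dependency relation and syntactic head,
# similar to the columns of a CoNLL-U file
for token in doc:
    print(f"{token.text}\t{token.dep_}\t{token.head.text}")
```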

Back to Top ⬆️

Topic Modeling & Embedding

Back to Top ⬆️

Multilingual NLP and Machine Translation

  • fastText language identification models: Language identification models for use with fastText
  • SeamlessM4T: Multimodal translation and transcription model based on the transformer architecture from Meta research.
  • Helsinki NLP Translation Models: Well-known and used translation models in Hugging Face from the University of Helsinki Language Technology Research Group, based on the OPUS neural machine translation framework.
  • ACL 2023 Multilingual Models Tutorial: Microsoft’s presentations from ACL 2023 - a lot of dense content here on low resource languages, benchmarks, prompting, and bias.
  • ROUGE: Wikipedia page for ROUGE score for summarization and translation tasks.
  • BLEU: Wikipedia page for the BLEU score for machine translation tasks.
  • sacreBLEU: Python library for hassle-free and reproducible BLEU scores (see the sketch after this list).
  • XTREME: Comprehensive benchmark for cross-lingual transfer learning on a diverse set of languages and tasks from researchers at Google and Carnegie Mellon
  • Belebele: Multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants from Meta, based upon the Flores dataset
  • OpenNMT: Open neural machine translation models in Pytorch and Tensorflow. Documentation for python here.
  • FinGPT-3: GPT model trained in Finnish, from a research group at the University of Turku, Finland.
  • Jais 13-B: Bilingual Arabic/English model based on GPT-3 architecture, from Inception AI / Core42 group in UAE.
  • EvoLLM-JP: Japanese LLM from AI startup Sakana.ai created using evolutionary model merging. There is a chat model, a vision model, and a stable diffusion model, all of which can be prompted and converse in Japanese. On Hugging Face here.
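
As an example of the sacreBLEU entry above, a minimal sketch scoring one hypothesis against one reference (pip install sacrebleu):

```python
import sacrebleu

hypotheses = ["The cat sat on the mat."]
# One inner list per reference set, parallel to the hypotheses
references = [["The cat is sitting on the mat."]]

# Corpus-level BLEU with sacreBLEU's standard, reproducible tokenization
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```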

Back to Top ⬆️

Natural Language Inference (NLI) and Natural Language Understanding (NLU)

  • Adversarial NLI: Benchmark for NLI from Meta research and associated dataset.

Back to Top ⬆️

Interviewing

Back to Top ⬆️

Large Language Models (LLMs) and Gen AI

Introductory LLMs

Back to Top ⬆️

Foundation Models

Back to Top ⬆️

Text Generation

Back to Top ⬆️

Web-based Chat Clients

  • ChatGPT: Obviously. From OpenAI. Free, but requires an account.
  • Perplexity Labs: Free, web-based LLM chat client, no account required. Includes popular models such as versions of LLaMA and Mistral as well as Perplexity’s own pplx model.
  • HuggingChat: Chat client from HuggingFace, includes LLaMA and Mistral models as well as OpenChat. Free for short conversations (in guest mode), account required for longer use.
  • DeepInfra Chat: Includes LLaMA and Mistral, even Mixtral 8x7B! Free to use.
  • Pi: Conversational LLM from Inflection. No account required.
  • Poe: AI assistant from Quora, allows interacting with OpenAI, Anthropic, LLaMA and Google models. Account required.
  • Copilot: Or is it Bing Chat? The lines are blurry. Backed by GPT, allows using GPT-4 on mobile (iOS, Android) for free! Requires a Microsoft account.

Back to Top ⬆️

Summarization

Back to Top ⬆️

Fine-tuning LLMs

Back to Top ⬆️

Model Quantization

Back to Top ⬆️

Data Labeling

  • Label Studio: Open source python library / framework for data labelling

Back to Top ⬆️

Code Examples and Cookbooks

  • OpenAI Cookbook: Recipes and tutorial posts for working and building with OpenAI, all in one place. Example code in the Github repo.
  • Cohere Guides: Example notebooks for working with Cohere for various LLM usage cases.

Back to Top ⬆️

Local LLM Development

  • GPT4All: Locally-hosted LLM from Nomic for offline development.
  • LM Studio: Software framework for local LLM development and usage.
  • Jan: Offline GUI for working with LLMs. Mobile app under development.
  • Open WebUI: Self-hosted WebUI for LLMs to operate entirely offline - formerly Ollama Web UI.
  • TransformerLab: Open source project for GUI interface for working with LLMs locally.
  • SuperWhisper: Local usage of the Whisper model on macOS, allowing you to speak commands to your machine and have them transcribed (all locally).
  • Cursor: Locally installable code editor with autocomplete, chat, etc. backed by OpenAI GPT3.5/4.
  • llama.cpp: Inference of Meta’s LLaMA model in pure C/C++. Python integration available through llama-cpp-python (see the sketch after this list).
  • Ollama: Host LLMs locally, includes models like LLaMA, Mistral, Zephyr, Falcon, etc.
  • Exploring Ollama for On-Device AI: Comprehensive tutorial on Ollama from PyImageSearch
  • llamafile: Framework for packaging LLMs as single executable files for local execution and development work; examples of one-liners and usage from its creator in Bash One-Liners for LLMs
  • PowerInfer: CPU/GPU LLM inference engine leveraging activation locality for fast on-device generation and serving of results from LLMs locally.
  • MLC LLM: Native deployment of LLMs with native APIs with compiler acceleration. Includes WebLLM for serving LLMs through the browser and examples of locally developed Android and iPhone LLM apps.
  • DSPy: Framework for algorithmically optimizing LLM prompts and weights from Stanford NLP.
  • AnythingLLM: Docker-based framework for offline LLM usage with RAG.
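
As a sketch of local inference via llama-cpp-python (pip install llama-cpp-python; the GGUF filename below is a hypothetical placeholder for whatever model you have downloaded):

```python
from llama_cpp import Llama

# Path to a GGUF-format model downloaded separately (hypothetical filename)
llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=64,
    stop=["Q:"],  # stop generating when the model starts a new question
)
print(output["choices"][0]["text"])
```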

Back to Top ⬆️

Multimodal LLMs

Images

Back to Top ⬆️

Audio

  • wav2vec 2.0 and w2v-BERT: Explanations of the technical details behind these multimodal models from Meta’s FAIR group and Google Brain, by Mohamed Anwar
  • MuseNet: Older research from OpenAI, which applied the GPT architecture to MIDI files to compose music.
  • AudioCraft: Multiple models from Meta research, for music (MusicGen), sound effect (AudioGen), and a codec and diffusion model for recovering compressed audio (EnCodec and Multi-band Diffusion). Demo also available in a Hugging Face space, and a sample Colab notebook here.
  • Audiobox: Text-to-audio and speech prompt to audio from Meta. Interactive demo site here.
  • StableAudio: Diffusion-based music generation model from Stability AI. Blog post with technical details.
  • SALMONN: Speech Audio Language Music Open Neural Network from researchers at Tsinghua University and ByteDance. Allows for things like inquiring about the content of audio files, multilingual speech recognition & translation and audio-speech co-reasoning.
  • Real-time translation and lip-synching: https://blog.invgate.com/video-translator
  • HeyGen: Startup creating AI generated avatars and multimedia content, e.g. for instructional videos. Video demo of lip-synching (dubbing) and translation.
  • Whisper: OpenAI’s open source multilingual speech-to-text transcription model (see the sketch after this list). Official Github repo with lots of details.
  • whisper_real_time: Example of real-time audio transcription using Whisper
  • whisper.cpp: High-performance plain C/C++ implementation of inference using OpenAI’s Whisper without dependencies
  • Deepgram: Audio AI company with enterprise offerings for speech-to-text, including both their own Nova-2 model as well as Whisper or custom models.
  • AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios: Model for realistic audio generation (text-to-speech / TTS) from researchers at Microsoft.
  • Project Gutenberg Audio Collection Project: Thousands of free audiobooks transcribed using AdaSpeech4, brought to you by Project Gutenberg, MIT, and Microsoft
  • ElevenLabs: Well-known American software company with AI voice cloning and translation products.
  • Projects: Create High-Quality Audiobooks in Minutes: Tool for creating high-quality audiobooks via TTS from ElevenLabs.
  • Brain2Music: Research from Google for using fMRI scans to reconstruct audio perceived by the listener.
  • WavJourney: Compositional Audio Creation with Large Language Models: An approach for generating audio combining generative text for scriptwriting plus audio generation models.
  • XTTS: Voice cloning model specifically designed with game creators in mind from coqui.ai. Available in a Hugging Face space here.
  • The Future of Music - How Generative AI Is Transforming the Music Industry: Blog post from Andreessen Horowitz covering a lot of recent developments at the intersection of the music industry and GenAI tools.
  • StyleTTS2: Diffusion and adversarial model for realistic speech synthesis (TTS). Audio samples and comparisons with previous models are here.
  • Qwen-Audio: Multimodal audio understanding LLM from Alibaba Group
  • Audio Diffusion Pytorch: A fully featured audio diffusion library in PyTorch, from researchers at ElevenLabs.
  • MARS5-TTS: English TTS model from Camb.ai. With just 5 seconds of audio and a snippet of text, MARS5 can generate speech even for prosodically hard and diverse scenarios like sports commentary, anime and more.
  • IMS Toucan: IMS Toucan is a toolkit for teaching, training and using state-of-the-art Speech Synthesis models, developed at the Institute for Natural Language Processing (IMS), University of Stuttgart, Germany.
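
As a sketch of the Whisper entry above (assuming pip install openai-whisper and ffmpeg on the path; the audio filename is a placeholder):

```python
import whisper

# Load one of the pretrained checkpoints (tiny/base/small/medium/large)
model = whisper.load_model("base")

# Transcribe a local audio file; the language is auto-detected by default
result = model.transcribe("audio.mp3")
print(result["text"])
```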

Back to Top ⬆️

Video and Animation

  • Generative Image Dynamics: Model from researchers at Google for creating looping images or interactive images from still ones.
  • IDEFICS: Open multimodal text and image model from Hugging Face based on Flamingo, similar to GPT4-V. Updated version IDEFICS 2 released 04/2024 with demo here.
  • NeRF: Neural Radiance Fields create novel views of a scene from a set of input images.
  • ZipNeRF: Building on NeRF with more advanced techniques and impressive results, generating drone-style “fly-by” videos from still images of settings.
  • Pegasus-1: Multimodal model from TwelveLabs for describing videos and video-to-text generation.
  • Gen-2 by RunwayML: Video-generating multimodal model from Runway ML that takes text or images as input.
  • Replay: Video (animated picture) generating model from Genmo AI
  • Hotshot XL: Text to animated GIF generator based on Stable Diffusion XL. Github and Hugging Face model page.
  • ModelScope: Open model for text-to-video generation from Alibaba research
  • Stable Video Diffusion: Generative video diffusion model from Stability AI.
  • VideoPoet: Synthetic video generation from Google Research, taking a variety of inputs (text, image, video).
  • Pika Labs: AI startup for video creation with $55 million in backing.
  • Assistive Video: Video generation from text from AI startup Assistive
  • Haiper: Text-to-video for short clips (2-4s) from Google Deepmind alumni. Free to use with an account.
  • MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation. Text-to-video model from ByteDance research.
  • Video-LLaVA: Open model for visual question answering in images, video, and between video and image data.

Back to Top ⬆️

3D Model Generation

  • Stable Zero123: 3D image generation model from Stability AI building on the Zero123-XL model. Weights available for non-commercial use on HF here.
  • DreamBooth3D: Approach for generating high-quality custom 3D models from source images.
  • MVDream: 3D model generation from Diffusion from researchers at ByteDance.
  • TADA! Text to Animatable Digital Avatars: Research on models for synthetic generation of 3D avatars from text prompts, from researchers in China and Germany
  • TripoSR: Image to 3D generative model jointly developed by Tripo AI & Stability AI
  • Microdreamer: Github repo for implementation of Zero-shot 3D Generation in ~20 Seconds from researchers at Renmin University of China

Back to Top ⬆️

Powerpoint and Presentation Creation

  • Tome: Startup for AI-generated slides (Powerpoint). Free to signup.
  • Decktopus: “World’s #1 AI-Powered Presentation Generator”. Paid signup
  • Beautiful.ai: Another AI-based slide deck generator (paid)

Back to Top ⬆️

Domain-specific LLMs

Code

  • Github Copilot: Github’s AI coding assistant, based on OpenAI’s Codex model.
  • GitHub Copilot Fundamentals - Understand the AI pair programmer: Introductory online training / short course on Copilot from Microsoft (https://learn.microsoft.com/en-us/training/paths/copilot/).
  • Gemini Code Assist: Code assistant from Google based on Gemini. Available in Google Cloud or in local IDEs via a plugin (requires subscription).
  • CodeCompose: (TechCrunch article): Meta’s internal coding LLM / answer to Copilot
  • CodeInterpreter: Experimental ChatGPT plugin that provides it with access to executing python code.
  • StableCode: Stability AI’s generative LLM coding model. Hugging Face collection here. Github here.
  • Starcoder: Coding LLM from Hugging Face. Github is here. Update: Starcoder 2 has been released as of Feb 2024!
  • CodeQwen-1.5: Code-specific version of Alibaba’s Qwen model.
  • Codestral: 22B coding model from Mistral AI, supports 80+ languages.
  • Ghostwriter: an AI-powered programming assistant from Replit AI.
  • DeciCoder 1B: Code completion LLM from Deci AI, trained on Starcoder dataset.
  • SQLCoder: Open text-to-SQL query models fine-tuned on Starcoder, from Defog AI. Demo is here.
  • CodeLLama: Fine-tuned version of LLaMA 2 for coding tasks, from Meta.
  • Refact Code LLM: 1.6B coding LLM with fill-in-the-middle (fim) capability, trained by Refact AI.
  • Tabby: Open source, locally-hosted coding assistant framework. Can use Starcoder or CodeLLaMA.
  • DuetAI for Developers: Coding assistance based on PaLM as part of Google’s DuetAI offering.
  • Gorilla LLM: LLM model from researchers at UC Berkeley trained to generate API calls across many different platforms and tools.
  • Deepseek Coder: Series of bilingual English/Chinese coding LLMs from DeepSeek AI, trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language.
  • Codestral Mamba: Open coding model from Mistral based on the MAMBA architecture.
  • Phind 70B: Code generation model from AI startup Phind, purported to rival GPT-4.
  • Granite: Open-sourced family of code-specific LLMs from IBM Research. On Hugging Face here.

Back to Top ⬆️

Mathematics

Back to Top ⬆️

Finance

  • BloombergGPT: LLM trained by Bloomberg from scratch based on code / approaches from BLOOM
  • FinGPT: Finance-specific family of models trained with RLHF, fine-tuned from various base foundation models.
  • DocLLM: Layout-aware large language model from JPMorgan

Back to Top ⬆️

Science and Health

  • Galactica: (MIT Blog Post) Learnings from Meta’s Galactica LLM, trained on scientific research papers.
  • BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, open LLM from Microsoft Research trained on PubMed papers.
  • MedPALM: A large language model from Google Research, designed for the medical domain. Google has continued this work with MedLM.
  • Meditron: Fine-tuned LLaMAs on medical data from Swiss university EPFL. HuggingFace space here. Github here. Llama3 version released 2024/04/19.
  • MedicalLLM: Evaluation benchmark for medical LLMs from Hugging Face including leaderboard.

Back to Top ⬆️

Law

  • SaulLM-7B: Legal LLM from researchers at Equall.ai and other universities. A fine-tune of Mistral-7B trained on a legal corpus of over 30B tokens.

Back to Top ⬆️

Time Series

  • TimeGPT: Transformer-based time series prediction models from NIXTLA. Requires using their service / an API token.
  • Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting. Open-source foundation model for time series forecasting based on the transformer architecture.
  • Granite: Time-series versions of open-sourced family of LLMs from IBM Research. On Hugging Face here.

Back to Top ⬆️

Vector Databases and Frameworks

  • DocArray: python library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, and so on.
  • Faiss: Library for efficient similarity search and clustering of dense vectors, from Meta Research (see the sketch after this list).
  • Pinecone: Managed vector database offering high-performance search and similarity matching.
  • Weaviate: Open-source vector database to store data objects and vector embeddings from your favorite ML-models.
  • Chroma: Open-source vector store used for storing and retrieving vector embeddings and metadata for use with large language models.
  • Milvus: Vector database built for scalable similarity search.
  • AstraDB: Datastax’s vector database offering built atop of Apache Cassandra.
  • Activeloop: Database for AI powered by a unique storage format optimized for deep-learning and Large Language Model (LLM) based applications.
  • OSS Chat: Demo of RAG from Zilliz, allowing chat with OSS documentation.
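
Most of these stores implement the same core operation that Faiss exposes directly: nearest-neighbour search over dense vectors. A minimal Faiss sketch, with random vectors standing in for real embeddings:

```python
import faiss
import numpy as np

d = 64                                              # embedding dimensionality
xb = np.random.random((1000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")     # query vectors

# Exact (brute-force) L2 search; approximate indexes such as IndexIVFFlat
# trade a little recall for much better scaling
index = faiss.IndexFlatL2(d)
index.add(xb)

distances, ids = index.search(xq, 4)  # 4 nearest neighbours per query
print(ids)
```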

Back to Top ⬆️

Evaluation

  • The Stanford Natural Language Inference (SNLI) Corpus: Foundational dataset for NLI-based evaluation, 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral.
  • GLUE: General Language Understanding Evaluation Benchmark from NYU, University of Washington, and Google - model evaluation using Natural Language Inference (NLI) tasks.
  • SuperGLUE: The Super General Language Understanding Evaluation, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
  • SQuAD (Stanford Question Answering Dataset): Reading comprehension question answering dataset for LLM evaluation.
  • BigBench: The Beyond the Imitation Game Benchmark (BIG-bench) from Google, a collaborative benchmark with over 200 tasks.
  • BigBench Hard: Subset of BigBench tasks considered to be the most challenging, with associated paper.
  • MMLU: Massive Multitask Language Understanding, a benchmark developed by researchers at UC Berkeley and others to specifically measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings.
  • HeLM: Holistic Evaluation of Language Models, a “living” benchmark designed to be comprehensive, from the Center for Research on Foundation Models (CRFM) at Stanford.
  • HellaSwag: a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy).
  • Dynabench: A “platform for dynamic data collection and benchmarking”. Sort of a Kaggle / collaborative site for benchmarks and data collaboration, an effort of researchers from Meta and American universities.
  • LMSys Chatbot Arena: Leaderboard from the LMSys group based upon human evaluation and Elo scores. The only evaluation that Andrej Karpathy trusts.
  • Hugging Face Open LLM Leaderboard: Leaderboard from H4 (alignment) Group at Hugging Face. Largely open and fine-tuned models, though this can be filtered.
  • AlpacaEval Leaderboard: AlpacaEval is an LLM-based automatic evaluation based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions.
  • OpenCompass: Leaderboard for Chinese LLMs.
  • Evaluating LLMs is a minefield: Popular deck from researchers at Princeton (and authors of AI Snake Oil) on the pitfalls and intricacies of evaluating LLMs.
  • LM Contamination Index: The LM Contamination Index is a manually created database of contamination of LLM evaluation benchmarks.
  • The Curious Case of LLM Evaluation: In depth blog post, examining some of the finer nuances and sticking points of evaluating LLMs.
  • LLM Benchmarks: Dynamic dataset of crowd-sourced prompts that changes weekly, for more realistic LLM evaluation.
  • Language Model Evaluation Harness: EleutherAI’s language model evaluation harness, a unified framework to test generative language models on over 200 different evaluation tasks (see the sketch after this list).
  • PromptBench: Unified framework for LLM evaluation from Microsoft.
  • HarmBench: Standardized evaluation framework for automated red teaming for mitigating risks associated with malicious use of LLMs. Paper on arxiv.
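
As a sketch of running the Language Model Evaluation Harness above from Python (assuming pip install lm-eval and its simple_evaluate entry point; the model and task names are illustrative):

```python
import lm_eval

# Evaluate a small Hugging Face causal LM on HellaSwag, zero-shot;
# any HF model ID can be swapped into model_args
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"]["hellaswag"])
```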

Back to Top ⬆️

Agents

  • AutoGPT: One of the most popular frameworks for using LLM agents, using the OpenAI API / GPT4.
  • ThinkGPT: python library implementing Chain of Thought prompting for LLMs, prompting the model to think, reason, and create generative agents.
  • AutoGen: Multi-agent LLM framework for building applications from Microsoft.
  • XAgent: Open-source experimental agent, designed to be a general-purpose and applied to a wide range of tasks. From students at Tsinghua University.
  • Thought Cloning: Github repo for implementation of Thought Cloning (TC), an imitation learning framework by training agents to think like humans.
  • Demonstrate-Search-Predict (DSP): framework for solving advanced tasks with language models (LMs) and retrieval models (RMs).
  • ReAct Framework: Prompting method whose examples include actions, the observations gained from taking those actions, and transcribed thoughts (reasoning), enabling LLMs to take complex actions and reason through problems (see the sketch after this list).
  • Tree of Thoughts (ToT): LLM reasoning process as a tree, where each node is an intermediate “thought” or coherent piece of reasoning that serves as a step towards the final solution.
  • GPT Engineer: Python framework for attempting to get GPT to write code and build software.
  • MetaGPT - The Multi-Agent Framework: Agent framework where different assigned roles (product managers, architects, project managers, engineers) are used for building different products (user stories, competitive analysis, requirements, data structures, etc.) given a requirement.
  • OpenGPTs: Open source effort from Langchain to create a similar experience to OpenAI’s GPTs with greater flexibility and choice.
  • Devin: “AI software engineer” from startup Cognition Labs.
  • SWE-Agent: Open source software engineering agent framework from researchers at Princeton.
  • GATO: Generalist agent from Google Deepmind research for many tasks and media types
  • WebLLaMa: Fine-tuned version of LLaMA 3 from McGill University, optimized for web browsing tasks.
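
The ReAct pattern mentioned above is simple enough to sketch without any framework. A toy, self-contained loop where a stub stands in for the real LLM call and a canned function stands in for a real search tool:

```python
import re

def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    if "Observation:" in prompt:
        return "Thought: I now know the answer.\nFinal Answer: Paris"
    return 'Thought: I should look this up.\nAction: search["capital of France"]'

# Toy tool registry; a real agent would wire in search APIs, calculators, etc.
TOOLS = {
    "search": lambda query: "Paris is the capital of France.",
}

def react_loop(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        response = call_llm(prompt)
        prompt += response + "\n"
        # The model either emits a final answer or an Action to execute
        if "Final Answer:" in response:
            return response.split("Final Answer:")[-1].strip()
        match = re.search(r'Action: (\w+)\["(.+)"\]', response)
        if match:
            tool, arg = match.groups()
            # Feed the tool output back to the model as an Observation
            prompt += f"Observation: {TOOLS[tool](arg)}\n"
    return "No answer found"

print(react_loop("What is the capital of France?"))  # -> Paris
```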

Back to Top ⬆️

Application Frameworks

  • LlamaIndex: LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. Used for RAG and building LLM applications that work with stored data (see the sketch after this list).
  • LangChain: LangChain is a framework for developing applications powered by language models.
  • Chainlit: Chainlit is an open-source Python package that makes it incredibly fast to build ChatGPT-like applications with your own business logic and data.
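
As a sketch of the LlamaIndex entry above (assuming a recent llama-index release with the llama_index.core namespace, an OPENAI_API_KEY in the environment, and a hypothetical ./data directory of documents):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Ingest local documents, embed them, and build an in-memory vector index
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Ask a question over the indexed documents (uses OpenAI models by default,
# unless another LLM/embedding model is configured)
query_engine = index.as_query_engine()
response = query_engine.query("What does this document say about pricing?")
print(response)
```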

Back to Top ⬆️

LLM Training, Training Frameworks, Training at Scale

  • Deepspeed: Deep learning optimization software suite that enables unprecedented scale and speed for DL Training and Inference from Microsoft.
  • Megatron-LM: From NVIDIA, Megatron-LM enables pre-training large transformer language models with efficient tensor, pipeline, and sequence-based model parallelism.
  • GPT-NeoX: Eleuther AI’s library for large scale GPU training of LLMs, based on Megatron.
  • TRL (Transformer Reinforcement Learning): Library for Reinforcement Learning of Transformer and Stable Diffusion models built atop of the transformers library.
  • Autotrain Advanced: In-development offering and python library from Hugging Face for easy and fast auto-training of LLMs and Stable Diffusion models.
  • Transformer Math: Detailed blog post from Eleuther AI on the mathematics of compute requirements for training LLMs

Back to Top ⬆️

Reinforcement Learning from Human Feedback (RLHF)

Back to Top ⬆️

Embeddings

Back to Top ⬆️

LLM Serving

Back to Top ⬆️

Preprocessing and Tokenization

  • Tiktoken: OpenAI’s BPE-based tokenizer (see the sketch after this list).
  • SentencePiece: Unsupervised text tokenizer and detokenizer for text generation systems from Google (but not an official product).
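
A minimal sketch of the Tiktoken entry above (pip install tiktoken):

```python
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo / gpt-4 era models
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Natural language processing from scratch")
print(tokens)              # token IDs
print(len(tokens))         # token count, useful for estimating API costs
print(enc.decode(tokens))  # round-trips back to the original string
```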

Back to Top ⬆️

Open LLMs

  • LLaMA 2: Incredibly popular open-weights (with license) model from Meta AI which spawned a generation of offspring and fine-tunes. Comes in 7B, 13B, and 70B versions.
  • Mistral 7B: Popular open model from French startup Mistral with no fine-tuning (only pretraining); see the loading sketch after this list. See also the mixture-of-experts successors Mixtral 8x7B and Mixtral 8x22B.
  • Mistral NeMo: Open 12B-parameter model from Mistral with a 128K context window, trained in partnership with NVIDIA, with a new updated tokenizer (Tekken). Model on Hugging Face.
  • Gemma: Lightweight open models from Google based on the same architecture as Gemini. Comes in 2B and 7B base and instruction-tuned versions.
  • GPT-J and GPT-NeoX: Open models trained from scratch by Eleuther AI.
  • Falcon 40B: Open text generation LLM from UAE’s Technology Innovation Institute (TII). Available on Hugging Face here.
  • Falcon 2 11B: Second set of models in the series from TII, released May 2024, including a multimodal model. On Hugging Face here.
  • StableLM: Open language model from Stability AI. Succeeded by StableLM 2, in 1.6B (Jan 2024) and 12B versions (April 2024, try live demo here)
  • OLMo: Open Language Models from the Allen Institute for AI (AI2)
  • DCLM-7B: 7 billion parameter language model from Apple designed to showcase the effectiveness of systematic data curation techniques for improving language model performance.
  • Snowflake Arctic: Open LLM from Snowflake, released April 2024. Github here and on Hugging Face here.
  • Minotaur 15B: Fine-tuned version of Starcoder on open code datasets from the OpenAccess AI Collective
  • MPT: Family of open models free for commercial use from MosaicML. Includes MPT Storywriter which has a 65K context window.
  • DBRX: Family of mixture-of-experts (MoE) large language models trained from scratch by Databricks Mosaic Research. Try it out in the Hugging Face playground here.
  • Qwen: Open LLM models from Alibaba Cloud in 7B and 14B sizes, including chat versions. Model family 1.5 released Feb 2024 and Qwen1.5-MoE Mixture of Experts model released 03/28/2024.
  • Command-R / Command-R+: Open LLM from Cohere for AI for long-context tasks such as retrieval augmented generation (RAG) and tool use. Available on HuggingFace Command-R, Command-R+
  • Aya: Massively multilingual models from Cohere for AI: Aya 101 and Aya 23, supporting 101 and 23 languages respectively. Aya 23 comes in 8B and 35B versions.
  • Grok-1: X.ai’s LLM, an MoE with 314B parameters, weights available via torrent. This is the (pre-trained) base model only, and not fine-tuned for chat.
  • SmolLM: Family of small language models (SLMs) from Huggingface in 135M, 360M, and 1.7B parameters. On Hugging Face here.
  • Jamba: Hybrid SSM-Transformer model from AI21 Labs - “world’s first production grade Mamba based model”. Weights on Hugging Face here.
  • Fuyu-8B: Open multimodal model from Adept AI, a smaller version of the model that powers their commercial product.
  • Yi: Bilingual open LLM from Chinese startup 01.AI founded by Kai-Fu Lee, with two versions Yi-34B & 6B. Also Yi-9B open-sourced in March 2024.
  • OpenHermes: Popular series of open (and uncensored) LLMs from Nousresearch, fine tunes of models such as LLaMA, Mixtral, Yi, and SOLAR.
  • Poro 34B: Fully open-source bilingual Finnish & English model trained in collaboration between Finnish startup Silo AI and the TurkuNLP group of the University of Turku.
  • Nemotron-3 8B: Family of “semi-open” (requires accepting a license) LLMs from NVIDIA, optimized for their Nemo framework. Find them all on the collections page on HF.
  • ML Foundations: Github repo for Ludwig Schmidt from University of Washington, includes open versions of multimodal models Flamingo & CLIP
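
Most of the open models above can be loaded the same way through Hugging Face transformers. A minimal sketch (the model ID is illustrative, and 7B-class weights need a suitable GPU or quantization):

```python
from transformers import pipeline

# Any open model above with a Hugging Face repo can be swapped in here
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

out = generator("Explain what a mixture-of-experts model is.", max_new_tokens=100)
print(out[0]["generated_text"])
```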

Back to Top ⬆️

Visualization

Back to Top ⬆️

Prompt Engineering

Back to Top ⬆️

Costing

Back to Top ⬆️

Books, Courses and other Resources

Communities

  • MLOps Community: Community of machine learning operations (MLOps) practitioners, but lately very much focused on LLMs.
  • LLMOps Space: global community for LLM practitioners & enthusiasts, focused on topics related to deploying LLMs into production
  • Aggregate Intellect Socratic Circles (AISC): Online community of ML and AI practitioners based in Toronto, with Slack server, journal club, and free talks
  • /r/LanguageTechnology: Reddit community on Natural Language Processing and LLMs with over 40K members
  • /r/LocalLLaMA: Subreddit to discuss training Llama and development around it, though also contains a lot of good general LLM discussion.

Back to Top ⬆️

MOOCS and Courses

Back to Top ⬆️

Books

Back to Top ⬆️

Surveys

Back to Top ⬆️

Aggregators and Online Resources

Back to Top ⬆️

Newsletters

These are not referral links.

  • GPTRoad: Daily no-nonsense newsletter covering developments in the AI / LLM space. They also have a site following the HackerNews template.
  • TLDR AI: Daily newsletter with little fluff, covering developments in AI news.
  • AI Tool Report: Newsletter from Respell, with AI headlines and job postings.
  • The Memo from Lifearchitect.ai: Bi-weekly newsletter with future-focused updates on developments in the LLM-space.
  • AI Breakfast: Curated weekly analysis of the latest AI projects, products, and news
  • The Rundown AI: Another daily AI newsletter (400K+ readers)
  • Interconnects: LLM / AI newsletter for more technical readers.
  • The Neuron: Another AI newsletter with a cutesy, light tone.

Back to Top ⬆️

Papers (WIP)

Back to Top ⬆️

Conferences and Societies

Back to Top ⬆️