It’s been an incredibly transformative and disruptive year in AI; there is no denying that. As 2023 draws to a close, let’s take some time to look back on some of the key developments and milestones we saw, and look forward to what’s to come in 2024 and beyond.
This year represented a pivot from “the AI that came before” into the age of generative AI. In fact, I would argue that 2023 was the year of generative AI. We may talk in the future of 2023, or perhaps 2022, as being the “year zero”: the time prior to which we could actually trust that content on the web was not generated by a model but had to be written, drawn, or otherwise created by a human being. Now, with how fast things are moving, it’s becoming increasingly difficult to even know what is “real” anymore, and I find myself thinking about this on a daily basis.
How did we get here?
So, 2023. In case you’re not aware, this is a hype cycle, as popularized by Gartner.
For most of the past year, we have been precisely in the left half of the figure. In fact, Gartner put us squarely in the peak of the hype cycle for Generative AI earlier this year, and we’ve been there since. The rise of generative AI and everything around it has been meteoric. It’s shown absolutely no signs of abating.
And so people have been commenting that it’s becoming very hard to keep up with the pace of everything that’s happening. It’s been a year where the pace of advancement in the field really accelerated, and we heard about new breakthroughs and research and startups and models and rounds of funding literally every day. So how did we get here?
Realistically, we all got to where we are because of this thing, right? ChatGPT was the “Innovation Trigger” that made generative AI and large language models household terms, and it was, and continues to be, an incredibly disruptive force in the data space, across industries, and in the broader world at large.
Sam Altman went from being well-known in tech to a globally known celebrity of sorts outside of it, further fueled by the controversy on social media and in the news around his firing and nearly immediate rehiring at OpenAI in November.
And if we look back at the release of ChatGPT prior to this year, as many folks have already done, its adoption really was unprecedented. It reached a million users in just five days, and went on to reach 100 million users in about two months, making it the fastest-adopted service or product in tech, if not in history.
So the question is: why? Why was the adoption of ChatGPT so meteoric? Because large language models are actually not that new. The “Attention Is All You Need” paper, which introduced the Transformer architecture on which the vast majority of LLMs are based, came out of Google Brain in 2017. Other organizations were already doing work with LLMs and integrating them into products and services, though in some cases consumers may not have even been aware of it. For instance, Google published a blog post about integrating BERT, the “foundation model of foundation models”, into search to improve results back in the fall of 2019, which seems like ages ago. And GPT-3 was first introduced in 2020.
So, why? Why was ChatGPT so disruptive?
I think the answer is twofold. On the one hand, there was the application of reinforcement learning from human feedback, or RLHF: using a separate reward model and keeping “humans in the loop” to steer and shape the model’s behavior after pretraining. OpenAI did not invent RLHF, but they were among the first to apply it to language models, and definitely the first to make the big bet of doing so at scale. That bet ended up paying off. This was really the key innovation that resulted in ChatGPT being so convincing and blowing so many people’s minds.
The second, which I think a lot of people overlook, has absolutely nothing to do with data and AI but instead with the way it was presented. It is really about the end user. Sure, LLMs existed in research and were gaining steam in industry, but the importance of the way ChatGPT was presented from a user experience perspective cannot be overstated. The ability to interact with a language model in a very conversational, human way in the browser was, I would argue, a huge factor in ChatGPT’s success, and it set the experience miles apart from what we were used to with previous “chatbots”.
So that’s how we got here. Clearly, ChatGPT was the “innovation trigger”, or at least the most publicly noticeable one, that catapulted us into the Year of Generative AI. And with the new paradigm – if I can use that word – of generative AI, also came new language around it.
So, now let’s look at the key developments and milestones we’ve experienced across three areas: large language models (or LLMs), foundation models, and multimodal models.
Large Language Models
So let’s talk about large language models, or LLMs. As noted above, the vast majority of the models we’ve seen released over the last year (with the exception of image and video generation models) are based on the Transformer architecture. In fact, many use the term “large language model” almost synonymously with “large Transformer model”, though we are starting to see some other model architectures being scaled up and gaining popularity, or at the very least being actively researched and applied alongside the Transformer.
Many of the original authors of the paper have gone on to other AI companies: for example, Aidan Gomez to Cohere and Noam Shazeer to Character.ai, the latter of which Andreessen Horowitz listed as #2 on their list of Top 50 Consumer Uses of Generative AI this fall (ChatGPT is #1). The original use for the Transformer architecture was a fairly simple translation task, the scale of which is rather trivial by today’s standards. But it turned out that this type of model architecture, and the self-attention mechanism it introduced, was very powerful and generalizable. Then we saw OpenAI and other companies place a bet on what would happen if you threw enormous volumes of data at these models and scaled them to incredible size, and as I mentioned before, that bet turned out to pay off.
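For the curious, the self-attention operation at the heart of the Transformer is surprisingly compact. Here’s a minimal single-head sketch in NumPy, with toy dimensions chosen purely for illustration (real models stack many heads and layers on top of this):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every token attends to every token, weighted by query-key similarity
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))          # toy "embedded" sequence
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

The key property is that the attention weights are computed from the data itself, which is what lets the same mechanism generalize from translation to essentially everything.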
So that has brought us to where we are. And the scale at which these things have grown really is amazing, if you look back a little in time.
We see in June of 2018, just after Attention Is All You Need, OpenAI releasing GPT-1 with 117 million parameters, then about eight months later, in February of 2019, GPT-2 with 1.5 billion parameters. That’s more than a tenfold increase in under a year. We see other models from the usual suspects in the space following what appears to be a kind of power-law curve.
And then if we look ahead to the release of GPT-3 in 2020, it’s off the charts with 175 billion parameters, more than another hundredfold increase. And GPT-4, released in March of this year, would be really off the charts: no one outside OpenAI knows exactly how many parameters it has, as the company has declined to disclose this, but it is rumored to be a mixture-of-experts model with about 1.8 trillion parameters.
So we’re seeing a more than ten-thousand-fold increase in the size of these models over five years. The scale of these models, in terms of both parameter count and training data, is growing exponentially at a pace that makes Moore’s Law look sluggish, and it’s interesting to ponder whether the sky actually is the limit, to think about just how far these things will go.
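Just to sanity-check that growth claim, here are the fold increases worked out from the reported (and, for GPT-4, rumored) parameter counts:

```python
# Parameter counts: reported for GPT-1 through GPT-3, rumored for GPT-4
params = {
    "GPT-1 (2018)": 117e6,
    "GPT-2 (2019)": 1.5e9,
    "GPT-3 (2020)": 175e9,
    "GPT-4 (2023, rumored)": 1.8e12,
}

baseline = params["GPT-1 (2018)"]
for name, n in params.items():
    # Fold increase relative to GPT-1
    print(f"{name}: {n / baseline:,.0f}x GPT-1")
```

Working it through, the rumored GPT-4 figure comes out to roughly 15,000 times the size of GPT-1, which is the “over ten thousand fold in five years” I’m referring to.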
It’s getting a bit dated now, but from April of this year there’s a good paper charting the evolutionary tree of LLM development. You can see there on the right, in gray, the decoder branch of models, of which GPT is one. A lot of past research around encoder-only models (for example, BERT) or full encoder-decoder models has slowed, and now almost everything is focused on decoder-only models, because of the success of ChatGPT and OpenAI.
Basically, all these models do is generate text by predicting the next token based on the previous tokens. We’ve seen a lot of the amazing things they can do, but there are still some significant problems arising in their application in industry, of which hallucination is one.
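To make “predicting the next token” concrete, here’s the crudest possible language model: a bigram frequency table over a toy corpus. Real LLMs learn vastly richer representations, but the interface is the same, context in, most likely next token out.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat ran".split()

# Count which token follows which: a minimal "model" of next-token frequencies
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Greedy decoding: return the most frequent continuation seen in training."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat": it follows "the" twice, "mat" only once
```

Swap the frequency table for a trillion-parameter Transformer and the greedy lookup for temperature sampling, and you have the shape of how these systems generate text.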
Here, I fed the LLaMA 2 model from Meta, a very popular model, a completely made-up URL of a Wired article that doesn’t exist. This is not a real article; if you go to that URL, it’s a 404. And the model very happily generates a summary of this fictitious article. When people do this, we call it bullshit, but when models do it, we call it hallucination. There’s a lot of discussion around how to “solve this problem”, or whether this is just an intrinsic property of generative models, an uncertainty we’ll simply have to live with. People like Andrej Karpathy have noted that hallucination is not a bug in these models; hallucinating is essentially all they do.
There are approaches for addressing this, like retrieval-augmented generation, or RAG, where you combine traditional information retrieval, search against a data store or a known knowledge base, with a large language model. You’re giving the model generative capability on top of some sort of data store. However, this doesn’t completely solve the hallucination problem: in many cases your model will just happily hallucinate around the facts or data that you retrieve for it.
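The basic RAG pattern is simple enough to sketch in a few lines. Everything below is a toy illustration: the keyword-overlap retriever stands in for a real vector search, and the final LLM call is left as a placeholder rather than any particular library’s API.

```python
documents = [
    "The EU AI Act was passed in December 2023.",
    "ChatGPT reached 100 million users in about two months.",
    "Stable Diffusion is a text-to-image diffusion model.",
]

def retrieve(query, docs, k=1):
    """Rank documents by naive keyword overlap with the query (a stand-in for vector search)."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query, docs):
    """Ground the model by pasting the retrieved context into the prompt."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How many users did ChatGPT reach?", documents)
print(prompt)
# In a real system, `prompt` would now be sent to an LLM for generation.
```

The grounding happens entirely in the prompt, which is exactly why it’s no silver bullet: nothing stops the model from embellishing beyond the context it was handed.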
In addition to hallucinations, evaluation is also still an open problem, though many people try to pretend that it is not. A deck from researchers at Princeton in October called “Evaluating LLMs Is a Minefield” was quite popular, as it showed that many of the benchmarks people use to evaluate these models are, in fact, contaminated. That is to say, the models have actually been trained on data that appears in the evaluation sets. So take every benchmark you see from companies or researchers claiming that their new model is the latest and greatest and “outperforms” other LLMs with a generous shaker of salt. I would argue that evaluation is very much broken, and that this is another problem which is still very much open in the space.
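The basic idea behind a contamination check is also easy to sketch: see whether an evaluation example’s n-grams already appear verbatim in the training corpus. Real contamination audits are far more sophisticated; this toy version just illustrates the principle.

```python
def ngrams(text, n=3):
    """All word-level n-grams of a string, as a set of tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(eval_example, train_corpus, n=3, threshold=0.5):
    """Flag an eval example if most of its n-grams appear verbatim in the training data."""
    eval_grams = ngrams(eval_example, n)
    if not eval_grams:
        return False
    train_grams = set().union(*(ngrams(doc, n) for doc in train_corpus))
    overlap = len(eval_grams & train_grams) / len(eval_grams)
    return overlap >= threshold

train = ["the quick brown fox jumps over the lazy dog"]
print(is_contaminated("the quick brown fox jumps", train))              # True: seen verbatim
print(is_contaminated("a completely novel benchmark question", train))  # False
```

When a benchmark question fails a check like this, any score on it tells you more about memorization than about capability.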
Foundation Models
So now let’s talk about foundation models. This is a fairly new term: it came from researchers at Stanford’s newly formed Center for Research on Foundation Models in a 2021 paper. What they were trying to say is that the types of models we’re now seeing, at the size they’ve reached, are having societal impacts and disrupting industry, research, and even day-to-day life to a degree where the language of traditional machine learning models just isn’t sufficient anymore.
“We introduce the term foundation models to fill a void in describing the paradigm shift we are witnessing…
Existing terms… fail to capture the significance of the paradigm shift in an accessible manner for those beyond machine learning.
In particular, foundation model designates a model class that are distinctive in their sociological impact and how they have conferred a broad shift in AI research and deployment.”
– On the Opportunities and Risks of Foundation Models, Bommasani et al., August 2021, Center for Research on Foundation Models (CRFM), Stanford
So this new term, foundation models, is one they came up with to talk about these new kinds of models.
The original foundation models back in the day, that is to say, around 2018 and 2019, were the BERT model from Google, an encoder-only model; GPT from OpenAI, a decoder-only model for text generation; and the T5 model, also from Google, a sort of Swiss army knife of a model trained to handle many, many different types of tasks.
These models remain openly available, weights and all (at least GPT-1 and GPT-2 on the GPT side; obviously not GPT-3 and beyond), and people are still using and fine-tuning them and models like them.
And now we are seeing foundation models as product, from companies like OpenAI; Cohere, with their Command model; Anthropic, with their Claude model; and Stability AI, an outlier here, as they’re doing text-to-image generation and using not a Transformer but a different type of model, a diffusion model.
Inflection has their Pi chatbot and will soon be offering the Inflection-2 model via API; and then there are companies like AI21 Labs with their Jurassic model, among others. So we’re seeing foundation models offered as products that companies are building. But we’re also seeing them offered as a service.
The usual suspects, companies like Microsoft with Azure, AWS, and Google, are offering foundation models, both proprietary and open, through the cloud services they provide. So we’re seeing foundation models not only offered as a service but also integrated into those services and the products built on them.
And so the future of machine learning we find ourselves in looks a lot different from the past of data science and AI. There are a lot of APIs, and the work looks a lot more like traditional software development than data science. And there’s a problem with this, which I call the double black-box problem. In the past we had deep learning models whose inner workings, how exactly they made their predictions, were unclear, and we saw the rise of fields like explainable AI to deal with this.
But now we have the double black-box problem: not only are the deep learning models themselves not interpretable, they’re also buried behind an API. They are a product owned by a company, and in many cases the data used to train these models, their architecture, how they function, and so on are not being made known, or are being omitted from the research papers that are released.
And we see a disturbing trend where this is becoming more and more opaque. It’s ironic that a company called OpenAI is becoming less and less open over time. I sometimes myself forget that they were originally started as a non-profit.
You can see that in the Foundation Models Transparency Index which was released by the CRFM earlier this year. Across their 100 different indicators, none of the foundation models being offered scored very highly, and it is very worrying when a company like Meta is at the top of any sort of transparency score.
On the other side of this, we see companies like Hugging Face championing open source and open models as part of their platform, promoting open alternatives. That being said, Hugging Face is still a for-profit company, backed by many of the big players who have sunk significant money into its funding rounds and raised its valuation.
We also see open models coming out of many of the companies I’ve already mentioned. There’s LLaMA 2 from Meta, an incredibly popular model released in July that has spawned a whole family of offspring. GPT from OpenAI is now proprietary, but their Whisper model remains open and is very popular for speech-to-text transcription. I already mentioned Stability AI with Stable Diffusion, released back in 2022; this year they built on it with Stable Diffusion XL and Turbo. EleutherAI built on their GPT-J and GPT-NeoX models by releasing the Pythia series this year. Mistral released their very, very popular Mistral 7B model, and very recently, this month, the Mixtral 8x7B mixture-of-experts model. MosaicML released their MPT models and has since been acquired by Databricks.
Salesforce had their XGen-7B model. The LMSYS group released the Vicuna series of models, which Guanaco built upon as part of the QLoRA research; both are fine-tunes of LLaMA from Meta. Deci has their DeciLM and coder models, and just recently released a new version. Databricks released the Dolly model, a fine-tune of Pythia.
Microsoft is focusing on smaller models with their Phi series, and just recently released Phi-2. Hugging Face recently released Distil-Whisper, a distilled Whisper model which is faster and smaller than OpenAI’s full Whisper.
Earlier this year, the Technology Innovation Institute (TII) from Abu Dhabi released the very popular Falcon models. Replit made their coding model openly available, and from Asia we see models like Qwen from Alibaba, Yi from 01.AI, and DeepSeek LLM, all highly bilingual in English and Chinese. So, an explosion of open models, and expect this space to continue expanding into next year and beyond.
Multimodal Models
Finally, we can talk about multimodal models. I already mentioned Whisper, one of the models that remains open from OpenAI. It’s a family of highly accurate multilingual speech-to-text transcription models, or what in the domain we would call models for automatic speech recognition, or ASR. Because it’s open and freely available, people are building products and doing pretty amazing things with it. There are libraries like Insanely Fast Whisper, released recently, with which you can do things like transcribe hours of spoken audio locally on your machine in a matter of minutes.
As mentioned, Hugging Face released Distil-Whisper, a smaller version that is also faster to use.
From Meta, in the translation space, we saw the SeamlessM4T model, which stands for Massively Multilingual and Multimodal Machine Translation. This model not only does machine translation, it does it in a multimodal way: it can transcribe and translate speech to text, but it also does text to text, text to speech, and speech to speech.
It’s a highly multilingual model, the number of input languages for each modality combination varies, but is around 100 languages for each.
This is a space that Meta is really interested in, doing a lot of active research, working with what are called low resource languages. I believe their ultimate goal is to eventually have a universal translator through our smartphones like in Star Trek. We’re already seeing things move in this direction with real-time translation and it’s pretty exciting.
In the text-to-speech domain, ElevenLabs is a leader, doing a lot of great work. They released a very multilingual model in August, and the voices that come out of it are very lifelike, very human-sounding, very expressive; it’s not like the text-to-speech of the old days you hear on public transit. There are other efforts in this space as well, like Microsoft working with Project Gutenberg to programmatically create audiobooks that sound expressive and lifelike even though they are generated by a model. It’s really amazing how far this technology has come.
In the audio modality as well, we now see music being generated by large generative models. Stability AI released Stable Audio, where you can type different text prompts and get short music clips back. Google, also later this year, released their Lyria model, which does the same sort of thing; they have worked to integrate it with YouTube creator tools in Dream Track, and we can assume they will continue to integrate it into other products they offer. So the future of being a creative, and of creative production, is going to continue to evolve and look a bit different. I wouldn’t be worried yet if I were a musician, but I would be if I were in the business of making elevator music.
Now moving over to images: I’ve already mentioned Stability AI a number of times. They were the leader here with their release of Stable Diffusion back in the fall of 2022, and now with Stable Diffusion XL, which actually uses two models in a pipeline, a backbone image generation model and then a refiner model. They also now have SDXL Turbo, which generates images almost as fast as you can type. OpenAI has DALL-E and released DALL-E 3, a new version, this year. It is also a diffusion model, now somewhat overshadowed by GPT-4’s multimodal capabilities and the fact that it is integrated into ChatGPT for image generation.
Google had the Imagen model earlier this year, which did images and short video as well, and they’ve very recently released a new version of it. But again, Gemini has multimodal capability, and just as with GPT and DALL-E, as far as the public is concerned, there’s some overlap there. Meta has their Emu series of models, which they just recently released as a standalone product they call their Imagine platform; if you have a Facebook account, you can use this model on its own. It was trained on billions of Instagram and Facebook posts.
In the multimodal understanding domain, we see GPT-4V being really exciting for a lot of people. There was a popular paper from Microsoft Research exploring the different abilities of GPT-4V and what it could and could not do, with lots of interesting insights, including identifying some issues in the multimodal domain around hallucination that are the analog of what we see on the generative text side.
On the open-source and academic research side, we see LLaVA, another multimodal model, and others like MiniGPT-4, which came out of the Middle East; and from startups and smaller companies there are models like Fuyu-8B, which was released as an open model.
People are even putting these different types of models together, and so we see composite models like Video-LLaMA, with which you can do things like interrogate video clips using the multimodal capabilities of these models.
Speaking of video, we also now see synthetic video being generated by these models. Stability AI released Stable Video Diffusion, where you can input an image and get back short video clips, and Meta expanded their Emu model with Emu Video and Emu Edit.
A leader in this space is RunwayML. On the left, we see synthetic images generated from a text prompt by Midjourney, fed into Runway as input; on the right, synthetic videos generated from those images. So what we see here is, basically, one hundred percent AI-generated media.
They just released a new model, Gen-2. One clip was very popular on social media as well, showing that you can do things like take an image and in-paint a region, and it will animate those parts to create a video from the image.
Now, combining audio and video generation together, we see things like the rise of digital clones, from companies like HeyGen. On the left, this is not a real person; it’s a generated video of him speaking and moving with what they call a digital avatar.
Basically, there are companies that will willingly deepfake you, and if you want to do things like create videos of yourself for social media or instructional purposes in a programmatic way, you can do so with generative AI. Microsoft released a version of this available on Azure in alpha at their Ignite Conference this year, although admittedly their current offering is a little more wooden than some others, so there is a ways to go in terms of this technology evolving.
We see other, similar research being done, like the Animate Anyone paper from Alibaba, in which researchers created a generative model that takes a still image and a sequence of poses as input and generates video of the person following those poses. They make the point that there are obvious use cases for this, like in retail or fashion, as you can see above. When it was released, someone commented on social media, “RIP TikTok dancers”, and then Alibaba came under fire on social media when it was pointed out that they had scraped a lot of TikTok videos and used them to train the model without the individuals’ permission. So this work, like much else, raises questions about data ownership and privacy, and about whether data being publicly available makes it fair game for training models. We’ll continue to see those discussions evolve, just as they have in the image generation domain.
Furthermore, in multimodal generation we see applications like 3D models being created synthetically. Here we see an example of text-to-3D avatar creation from the TADA! model from the Max Planck Institute, where you type a text prompt and get back a textured 3D asset. You can imagine in the future making video games with 100% synthetically generated assets, or using these in animated films and other media. We’re going to start seeing these kinds of models integrated into software and becoming a regular part of creative workflows.
And then finally, this last one I think is very interesting. Neural radiance fields, or NeRFs, are an area that’s been researched for some time, but they’re getting better and better, as we see in this paper, Zip-NeRF, from Google Research. This video is not a real drone fly-by; it’s a synthetic video generated from still images taken in this house. The model uses the directional information and the still images to create this very lifelike video.
You can see very real use cases here, you can imagine that in the future if you want to put your home up for sale on Zillow, or for rent on Airbnb or VRBO or wherever, you just take a few photos in your house with your smartphone and these sorts of fly-through videos can be generated completely with AI.
2024 and Beyond
So, looking ahead into the wave of AI that is coming in the future, what can we expect to see?
As most people know, Google very recently released their Gemini model, basically to go directly up against OpenAI, given that things with Bard weren’t going so well.
And once again they bungled it a bit, at least somewhat, because after the release it very quickly came to light that their very slick marketing demo was not 100% real and live in the way it was presented. And I, for one, am shocked. Very shocked that any marketing for an AI product is not 100% authentic and true. But basically, there was some negative pushback about this, and things continue to be a bit of a struggle for Google.
People see this as Google going head-to-head against OpenAI, but I actually don’t see it that way, I see this as them being in direct competition with the other big tech giant, Microsoft, because as most folks know, Microsoft has a very, very close relationship with OpenAI, with GPT powering Copilot, and OpenAI models being trained and hosted on Azure Infrastructure.
So, the future is going to be interesting. We’ll see how this boxing match plays out. It’s also interesting to note that while these two tech giants are going head to head in a very public way, certain other large tech companies are reluctant, or have simply refused, to do so. Apple is notably reticent on this front, even avoiding the term AI; their big announcement at the end of this year was MLX, a new ML framework for Apple Silicon. And we even see companies like Amazon seemingly content to be a provider of services and foundation models through AWS Bedrock; they do have their own Titan models that they’ve trained, but they haven’t made a lot of fuss about them. So it’ll be interesting to see how things continue to play out.
Looking ahead into 2024 and beyond, my predictions:
- We’re going to continue to see companies trying to make sense of all the world’s information using large language models and generative AI. As part of that, we’ll continue to see approaches develop, maybe building on RAG, or maybe new approaches we haven’t even thought of yet, to deal with problems like hallucinations, and with the fact that this type of computing is one we cannot always trust, and is not deterministic in the way much of the traditional computing we’re used to has been.
- Secondly, we are going to see legislation begin to catch up with the pace of development in AI. Right now the robots are winning, but very recently the EU passed the EU AI Act, Biden issued an executive order on AI, and Canada has AIDA in draft, so expect more to come very soon. We can also expect the pace of innovation to slow considerably as people who do not understand how AI works put in place rules about how it can and cannot be used.
- In 2024 or later, we will also see the very first feature film entirely generated by AI. Whether that film will be any good is another question, but we are going to see it at some point in the near future. As part of that, expect the tension and the back-and-forth litigation between creative industries, Hollywood, record labels, publishing houses, and the AI companies to increase, and we’ll get to see how that plays out as generative AI becomes a part of everyday life. If it’s going to be anything like the battle between technologists and the record labels in the past, I know which side I’ll be betting on.
- Finally, there are no signs of things slowing down, so expect the scale, investment, and competition in hardware to continue to explode as the size of these models grows. Nvidia is obviously very dominant in the space, making money hand over fist despite the GPU shortages. AMD is going to throw their hat into that ring with greater gusto next year, though I’m not sure I have a lot of faith in that. We’re even seeing startups in the AI infrastructure space, which is not something that used to be common. There’s a ton of demand there, so expect that to continue to evolve into next year and beyond.
In the future, will generative AI models replace professionals like doctors or lawyers? No. I think the work that people do using these tools will be different: they’ll be able to do the same work they used to do, only better or faster. A good example is nurses perhaps being able to provide directions, make decisions, or do certain things that in the past typically only doctors would do. But I don’t believe we’ll ever see any of these tools, whether text models, models for generating synthetic images or video, or anything else, fully replace those kinds of roles, especially in very specialized domains that require deep knowledge and where you need to be very certain and confident.
There are blog posts out there right now where people are saying things like: we’re entering a new era of computing, it’s the end of programming, we’re just going to be talking to these models, it’s a whole new paradigm, and so on. I don’t believe that; I think it’s a bit over the top.
But what I do believe is that we’re in a bit of a weird time, in that we get answers from these sorts of models, and from the services built on top of them, but we cannot entirely trust them. Then again, that’s kind of the way you interact with people, right? You go to a doctor and trust that they are knowledgeable, but people get second opinions for a reason. Or we have bodies that make decisions, governments, boards of directors, doing that sort of “wisdom of crowds” thing, each member bringing different inputs to the decisions they must collectively make. You can imagine this sort of thing in the future with “councils” of models.
I’m much more bullish on that sort of thing happening than on AI entirely taking anyone’s job, or anyone’s job being directly replaced by a model. And I think the way we interact with these models is different now, too, from the way we used to trust everything a computer told us. When you use ChatGPT, you should basically treat it like a really bad assistant: you can’t trust its responses, but you at least have something to start from, which is better than starting from zero and doing all the work yourself.
So we’re going to figure these things out as we go along and the technology progresses. The hype will fade, and we’ll find a sort of steady state of where LLMs belong, how they fit in, and how we’ll continue to use them in the near future and beyond. This will be driven by a mixture of innovation on the side of the companies releasing these models and consumers figuring out the use cases they actually want, what does and doesn’t work, and where these models are actually useful and provide value. 2023 was a truly remarkable year, and I’m looking forward to 2024.