Read & Repeat
 

Understanding the basic concepts & building blocks of modern LLMs

About words, tokens, vectors, context and layers. An easy-to-understand overview of how LLMs "learn" and how all these building blocks are joined together.

  • Article: A jargon-free explanation of how AI large language models work
  • Summary: "Want to really understand large language models? Here’s a gentle primer. Today, almost everyone has heard about LLMs, and tens of millions of people have tried them out. But not very many people understand how they work."
  • By: Timothy B. Lee & Sean Trott, 2023 (via Ars Technica)

Published in May 2024

Many insights start with basic admissions. This is also true for learning about Large Language Models (LLMs):

“[…] no one on Earth fully understands the inner workings of LLMs. Researchers are working to gain a better understanding, but this is a slow process that will take years—perhaps decades—to complete.”

With that out of the way, the article does a great job in making this fairly complex topic quite approachable by breaking it down and adding straightforward examples.

“To understand how language models work, you first need to understand how they represent words. Humans represent English words with a sequence of letters, like C-A-T for “cat.” Language models use a long list of numbers called a ‘word vector’.”
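To make that concrete: a word vector is literally just a list of numbers. Here is a minimal, made-up sketch (the tiny 4-dimensional values are invented for illustration; real models learn these values during training and use hundreds or thousands of dimensions):

```python
# A "word vector" is just a list of numbers the model associates with a word.
# These 4-dimensional values are invented for illustration; real models learn
# them during training and use far more dimensions (GPT-3: 12,288 per word).
word_vectors = {
    "cat": [0.007, -0.081, 0.210, 0.045],
    "dog": [0.012, -0.074, 0.195, 0.050],
    "car": [-0.230, 0.310, -0.020, 0.110],
}

print(word_vectors["cat"])  # this list of numbers is what the model "sees" for "cat"
```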

Of course, it helps if you know what a vector is. Vector databases and vector fields have been in the technology news a lot lately, and this connection explains why: they are a core building block of LLMs and AI applications.

“Words are too complex to represent in only two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions. The human mind can’t envision a space with that many dimensions, but computers are perfectly capable of reasoning about them and producing useful results.”
“For example, the most powerful version of GPT-3 uses word vectors with 12,288 dimensions—that is, each word is represented by a list of 12,288 numbers.”
“[…] word vectors are a useful building block for language models because they encode subtle but important information about the relationships between words.”
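One way to see what "relationships between words" means in practice is to compare the directions of two vectors. A minimal sketch with made-up numbers (a trained model would supply real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two word vectors: close to 1.0 = similar direction, near 0 or negative = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional vectors; a trained model would learn real ones.
cat = [0.007, -0.081, 0.210, 0.045]
dog = [0.012, -0.074, 0.195, 0.050]
car = [-0.230, 0.310, -0.020, 0.110]

print(cosine_similarity(cat, dog))  # high: related words point in similar directions
print(cosine_similarity(cat, car))  # low: unrelated words point in different directions
```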

After vectors, the article explains the idea and function of layers:

“GPT-3, a 2020 predecessor to the language models that power ChatGPT, is organized into dozens of layers. Each layer takes a sequence of vectors as inputs—one vector for each word in the input text—and adds information to help clarify the meaning of that word and better predict which word might come next.”
“Research suggests that the first few layers focus on understanding the sentence’s syntax and resolving ambiguities […]. Later layers […] work to develop a high-level understanding of the passage as a whole.”
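Schematically, that layer-by-layer refinement can be sketched as a simple loop. The dummy "layers" below just nudge each vector and stand in for real transformer layers, which read the notes left by earlier layers and add their own:

```python
def run_model(hidden_states, layers):
    """Pass one vector per word through the model's stack of layers.

    Each layer reads the per-word vectors and writes back a refined version:
    early layers tend to resolve syntax and ambiguity, later layers build up
    the meaning of the passage as a whole.
    """
    for layer in layers:
        hidden_states = layer(hidden_states)
    return hidden_states  # used to predict which word comes next

# Dummy stand-ins: each "layer" just adds a tiny annotation to every vector,
# mimicking the idea that layers add information rather than replace it.
dummy_layers = [lambda hs: [[x + 0.01 for x in vec] for vec in hs] for _ in range(4)]
words = [[0.1, 0.2], [0.3, 0.4]]  # one (tiny) vector per word
print(run_model(words, dummy_layers))
```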

I found one section particularly enlightening about the relationship between vectors and layers:

“You can think of all those extra dimensions as a kind of “scratch space” that GPT-3 can use to write notes to itself about the context of each word. Notes made by earlier layers can be read and modified by later layers, allowing the model to gradually sharpen its understanding of the passage as a whole.”

Another important concept is “matching”:

“In the attention step, words “look around” for other words that have relevant context and share information with one another.”
“You can think of the attention mechanism as a matchmaking service for words. Each word makes a checklist (called a query vector) describing the characteristics of words it is looking for. Each word also makes a checklist (called a key vector) describing its own characteristics. The network compares each key vector to each query vector (by computing a dot product) to find the words that are the best match.”
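That query/key "matchmaking" boils down to dot products followed by a softmax. A minimal sketch with made-up 3-dimensional vectors (a real attention head also scales the scores and uses value vectors to actually move information between words):

```python
import math

def attention_scores(queries, keys):
    """Compare every word's query vector to every word's key vector via dot products.

    Returns, for each word, a set of weights over the other words saying how well
    each one matches what the word is "looking for" (a softmax over the dot products).
    """
    scores = []
    for q in queries:
        dots = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        exps = [math.exp(d) for d in dots]     # softmax: turn raw match scores
        total = sum(exps)                       # into weights that sum to 1
        scores.append([e / total for e in exps])
    return scores

# Made-up 3-dimensional query/key vectors for a 3-word input.
queries = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.1, 0.1, 0.9]]
keys    = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

for row in attention_scores(queries, keys):
    print([round(w, 2) for w in row])
```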

That’s why matrix multiplication is so important for LLM training and inference. And these models are doing a lot of these multiplications:

“The largest version of GPT-3 has 96 layers with 96 attention heads each, so GPT-3 performs 9,216 attention operations each time it predicts a new word.”
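The 9,216 figure is simply the bookkeeping of layers times attention heads:

```python
layers = 96
heads_per_layer = 96
print(layers * heads_per_layer)  # 9216 attention operations per predicted word
```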

  • “What makes the feed-forward layer powerful is its huge number of connections. We’ve drawn this network with three neurons in the output layer and six neurons in the hidden layer, but the feed-forward layers of GPT-3 are much larger: 12,288 neurons in the output layer (corresponding to the model’s 12,288-dimensional word vectors) and 49,152 neurons in the hidden layer.” (A rough sketch of these dimensions follows after this list.)

  • ”[…] patterns got more abstract in the later layers. The early layers tended to match specific words, whereas later layers matched phrases that fell into broader semantic categories […]”

  • “Attention heads retrieve information from earlier words in a prompt, whereas feed-forward layers enable language models to “remember” information that’s not in the prompt.”

  • “Indeed, one way to think about the feed-forward layers is as a database of information the model has learned from its training data.”

  • “Many early machine learning algorithms required training examples to be hand-labeled by human beings.”
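Going back to the feed-forward layer quoted above, here is a rough NumPy sketch using the article's toy sizes (six hidden neurons, three output neurons). The weights here are random placeholders; in a real model they are learned from training data and act as the "database" of stored knowledge the article describes:

```python
import numpy as np

# Toy sizes from the article's drawing: 6 hidden neurons, 3 output neurons.
# GPT-3's real feed-forward layers use 49,152 hidden and 12,288 output neurons.
d_model, d_hidden = 3, 6

rng = np.random.default_rng(0)
W_in = rng.standard_normal((d_model, d_hidden))    # placeholder weights; learned in a real model
W_out = rng.standard_normal((d_hidden, d_model))   # and acting as the "database" mentioned above

def feed_forward(word_vector):
    hidden = np.maximum(0.0, word_vector @ W_in)   # expand to the hidden layer (simple ReLU nonlinearity here)
    return hidden @ W_out                          # project back down to a word vector

print(feed_forward(np.array([0.1, -0.2, 0.3])))
```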

“A key innovation of LLMs is that they don’t need explicitly labeled data. Instead, they learn by trying to predict the next word in ordinary passages of text. Almost any written material—from Wikipedia pages to news articles to computer code—is suitable for training these models.”
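A minimal sketch of what "learning by trying to predict the next word" means for the training data. Splitting on spaces stands in for real subword tokenization, but the point is visible either way: the next word itself is the label, so no hand-labeling is needed:

```python
# Turn ordinary text into (context, next-word) training pairs.
# Real models operate on subword tokens and much longer contexts;
# splitting on spaces is just to keep the idea visible.
text = "the cat sat on the mat"
words = text.split()

training_pairs = [(words[:i], words[i]) for i in range(1, len(words))]
for context, target in training_pairs:
    print(f"given {context!r} -> predict {target!r}")
```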

I like the attempt to illustrate the complexity with a network of pipes and valves:

“Obviously, this example quickly gets silly if you take it too literally. It wouldn’t be realistic or useful to build a network of pipes with 175 billion valves. But thanks to Moore’s Law, computers can and do operate at this kind of scale.”

Many people, maybe even the researchers exploring the capabilities of LLMs, were surprised by how suddenly these models went from being useless to solving actual tasks that could make life easier.

“Nonetheless, the near-human performance of GPT-3 on several tasks designed to measure theory of mind would have been unthinkable just a few years ago— and is consistent with the idea that bigger models are generally better at tasks requiring high-level reasoning.”
“At the moment, we don’t have any real insight into how LLMs accomplish feats like this. Some people argue that such examples demonstrate that the models are starting to truly understand the meanings of the words in their training set. Others insist that language models are “stochastic parrots” that merely repeat increasingly complex word sequences without truly understanding them.”

The article adds one more observation that might help put these advances into context:

“Regularities in language are often (though not always) connected to regularities in the physical world. So when a language model learns about relationships among words, it’s often implicitly learning about relationships in the world, too.”

I’m still in the camp that tends to regard LLMs as “stochastic parrots”, but maybe the next few iterations of LLMs will convince me otherwise. In any case, this article really helped me understand some of the internal workings of modern LLMs.

(Prompt for Craiyon V3 to generate the header image: “Large Language Model in action: analyzing text, tokenizing words, calculating vectors, using multiple layers for processing to answer questions.” / Style: Drawing.)