Back when ChatGPT came out, I was so shocked by how _good_ it was for an “AI” product that I simply had to know how it worked. Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step on the board. I’d puzzle over each step along the way, and the triumph of completing the drawing came with a sense of deep understanding. I kept that drawing up for many months after, and would often gaze at it in wonder during meetings and idle moments.
This is to say: the autoregressive decoder-only transformer LLM architecture as pioneered by OpenAI is wildly simple for how revolutionary its results are. I was reading about non-learned classical SLAM systems (which use video + handcrafted math to produce 3D mappings of physical spaces while also locating the camera in those spaces) at the time, and comparatively speaking I’d say the math is about as complicated as ONE of the components in those complex formulations. The only reason frontier LLMs need 6-figure computers to run is that the model designers made the middle bit in those models REALLY BIG, dimensionally speaking. They just took the steam engine, made a few gargantuan versions of it, and are selling them as the ultimate source of power.
This was OpenAI’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities like being able to pick the best ending to a story/set of instructions or answer questions about broad factual knowledge. Meanwhile, I’ve been watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.
miki123211
There's one thing I wish people understood about LLMs, and it doesn't really have anything to do with what's inside the neural network part. It's the fact that LLMs can only write in one direction — forward.
When you are writing an essay and realize midway through a sentence that what you've written doesn't make sense, you go back and edit. An LLM can't do that; the only thing it can do is keep on generating. Because training data typically contains full essays and not half-finished sentences which were then edited, LLMs have a strong preference for "saving face" and producing grammatically correct, internally coherent outputs. They will often do so even if the only way to write themselves out of the corner they wrote themselves into is to lie. To maintain internal coherence, they'll then repeat that lie for the rest of the response.
This is also why changing response structure used to affect LLM performance so dramatically. If you asked an LLM to solve a math problem and all-but-forced it to start with the answer, it would have had to calculate that answer before emitting any tokens, something which it very often wasn't able to do. If it was told to follow up the answer with an explanation, it would produce a plausible-sounding explanation to maintain coherence.
If, on the other hand, it was told to start by "thinking step by step", it would often be able to solve the first step, and then the next one given the results of the first, and so on, until it was able to reach the answer. Because the answer came last, it wasn't committing to anything up front, so it had no reason to "save face" and lie.
This part of the problem is basically solved now with reasoning; reasoning is where all the step-by-step stuff happens, even if users aren't always able to see it. In the process of RLVR, models even train themselves into outputting phrases like "let me check my answer once again" in the chain-of-thought; those serve as their "life rafts" which they can use to both save face and change their answer.
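A rough sketch of that forward-only loop, in case it helps. This is a toy under loud assumptions: a hand-written bigram table stands in for the neural network, decoding is greedy, and `BIGRAMS`, `next_token` and `generate` are made-up names rather than anything from a real library. The only point is that the loop appends tokens and never goes back to revise one.

```python
# Toy illustration only: a hand-written bigram table stands in for the model.
# BIGRAMS, next_token and generate are hypothetical names for this sketch.
BIGRAMS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"answer": 0.7, "cat": 0.3},
    "a": {"cat": 1.0},
    "answer": {"is": 1.0},
    "is": {"42": 0.6, "7": 0.4},
    "42": {"</s>": 1.0},
    "7": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def next_token(context):
    """Greedy choice: return the most likely token after the last one."""
    dist = BIGRAMS[context[-1]]  # this toy "model" only looks at the last token
    return max(dist, key=dist.get)


def generate(max_tokens=10):
    tokens = ["<s>"]
    for _ in range(max_tokens):
        tok = next_token(tokens)
        tokens.append(tok)  # append-only: earlier tokens are never edited
        if tok == "</s>":
            break
    return tokens


print(generate())
```

A real model conditions on the whole context rather than just the last token, and usually samples instead of taking the argmax, but the append-only shape of the loop is the same.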
helloplanets
The part about positional encoding is not correct.
> The intuition: instead of adding position info to each token’s vector, RoPE rotates the vector by an angle that depends on its position
You can't rotate the token's entire vector (or all three vectors; it's unclear which is being implied). You rotate each token's Query and Key vectors only, so that the dot product between token 1's Query vector and token 2's Key vector can tell how far apart the two tokens are.
Positional embedding should just be explained after the Query, Key and Value vectors. Because the article explains those only afterward, the reader builds on a wrong intuition and it gets confusing.
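To make the Query/Key point concrete, here is a minimal pure-Python sketch. The assumptions: adjacent dimensions are paired (some implementations pair the first half with the second half instead), the base is 10000, and `rope` and `dot` are illustrative helper names, not a library API. It rotates a query and a key by their positions and checks that the score depends only on the offset between them.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # one rotation frequency per pair
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]  # a made-up Query vector
k = [1.0, 0.4, -0.7, 0.9]  # a made-up Key vector

score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
score_b = dot(rope(q, 105), rope(k, 102))  # positions 105 and 102, offset 3
print(abs(score_a - score_b) < 1e-9)
```

The Value vectors are left unrotated here, which matches the point above: position only needs to enter the Query-Key comparison.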
10GBps
I learned TCP/IP by watching and reading raw packets over packet radio at 1200 baud.
I've noticed the same thing is possible if you watch the output of a slow LLM. Eventually you start to see the machinery. input tokens = output tokens, it's math. I can't exactly predict the tokens generated but I can see how they are formed. It's a lot like chess. You can't see every possible move but the mechanism is understandable.
oceansky
Out of curiosity, I wondered if you could break a tokenizer by introducing weird characters not mapped to an id.
But apparently, they either just emit a [UNK] token or translate the unrecognized character into raw UTF-8 bytes.
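A toy version of the byte-fallback behaviour, as a sketch only. The three-entry vocabulary, the id layout and the names `WORDS`, `BYTE_OFFSET`, `encode` and `decode` are all invented for illustration and do not correspond to any real tokenizer. Known strings map to their own ids, and any character without an id is emitted as its raw UTF-8 bytes, so nothing is ever "unmapped".

```python
# Hypothetical toy vocabulary: a few whole-string tokens plus 256 byte tokens.
WORDS = ["hello", "world", " "]
VOCAB = {w: i for i, w in enumerate(WORDS)}
BYTE_OFFSET = len(WORDS)  # ids 3..258 stand for raw bytes 0..255


def encode(text):
    ids = []
    i = 0
    while i < len(text):
        # Greedy longest match against the known tokens.
        match = None
        for w in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(w, i):
                match = w
                break
        if match is not None:
            ids.append(VOCAB[match])
            i += len(match)
        else:
            # Byte fallback: an unknown character becomes its UTF-8 bytes.
            ids.extend(BYTE_OFFSET + b for b in text[i].encode("utf-8"))
            i += 1
    return ids


def decode(ids):
    out = bytearray()
    for t in ids:
        if t < BYTE_OFFSET:
            out += WORDS[t].encode("utf-8")
        else:
            out.append(t - BYTE_OFFSET)
    return out.decode("utf-8")


ids = encode("hello \u00e9")  # "\u00e9" is not in the toy vocabulary
print(ids)
print(decode(ids) == "hello \u00e9")
```

A tokenizer without byte tokens would have to map the unknown character to a single [UNK] id instead, and the round trip back to the original text would be lost.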
Saying an article is of inferior quality just because editing was AI-assisted is like saying a book is lower quality just because it was printed rather than written by hand
andai
I couldn't load the article directly due to an SSL issue, so here's the archive link:
Nice article but chain of thought is what makes frontier LLMs smart, not really the token loop
agumonkey
Nice intro, gonna help me dig further a lot now. Thanks a ton.
whyage
Style nit: the transitions between dark-mode text and large diagrams with a snow white background are jarring.
AltruisticGapHN
I don't like how most LLM explainer articles and videos say that an LLM essentially "predicts the next word".
I'm a developer but not very good at maths and I still don't understand any of it.
An LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. Like recently I wanted a checkbox that has like a gradient flowing around the edge. It figured out it could use a radial gradient from the center of the checkbox, and overlay that with a small inner div so you only see the edge, which looks like the gradient is circling around the checkbox.
How is that "predicting the next word"?
Not saying AI is intelligent or conscious or anything like that, but the algorithm clearly is far more complex than "predicting words".
What I mean is that the LLM is able to represent things in space. That part I don't understand.
I also still don't understand the relationship between the chat-based LLM and the multimodal stuff. I think I read somewhere that when an image is generated it is also tokens?
yukIttEft
> so the model figures out during training what each token should look for and what it should offer
But how does it learn this token-relationship?
All it has is many text samples, but nowhere does it say how the tokens relate to each other, so where does this information come from?
melvinroest
I thought Karpathy’s microgpt explains how LLMs work.
stalfie
This article describes how Transformers work, but not really how LLMs work. Explaining the underlying architecture gives you about as much insight into how a modern LLM behaves as a breakdown of neuronal biochemistry and a few pathways does for the brain. Meaning, almost no insight at all.
rishbz
Great insights. RL training is the key
spacebacon
But how do they “think”? This is the only repo that can tell you that.
mathisdev7
very interesting and useful!
lhd1
I find it difficult to engage with AI generated text. What am I getting here that I couldn't get from a chatbot?
cubefox
We are living in a crazy science fiction world where on the top of the HN frontpage there is an article on how LLMs work which is likely itself LLM generated, and the only way to tell is its writing style rather than its factual accuracy.
lateral_cloud
I don't understand how these AI written articles get so many votes.
singpolyma3
Next do "why LLMs work"
codeakki
What's the point of this? I'm not here to engage with AI bots.
whateveracct
accidentally quadratic
lionkor
It sucks that this article is clearly LLM edited, with common phrases like "same shape as", "the intuition: ", and the "tiny explainer" which clearly generalized from a prompt accidentally.
Good article, but when sharing it I will have to preface "yes it's slop, but it's a good explanation".
Absolutely embarrassing that the author didn't catch that these LLM-isms are a (and here I'll use one) bad signal.
In fact, I would go so far as to say that publishing in this style stems from a lack of reading experience and writing experience, which does not bode well for someone pretending to be an expert. I gave this article to someone highly intelligent who doesn't know the first thing about how LLMs work internally, and she immediately called out that it reads like AI text.
A better blog on Transformers: https://www.aleksagordic.com/blog/transformer
I couldn't load the article directly due to an SSL issue, so here's the archive link:
https://archive.ph/aWtFG
But how do they “think”? This is the only repo that can tell you that.
https://github.com/space-bacon/SRT
this is hard to read...
it goes all over the place.
i'm not actually sure who your target audience is.
there's too many side tangents.
just like, structure it plz.
1. customer feels bad cuz they don't understand how llms work
2. provide high level abstracted explanation (don't dive into concepts yet)
3. provide breakdown guide of overall set of components.
4. walk through each component. don't side track. no need to explain RoPE, GQA, etc. it just distracts.
i.e. customers don't know how llms work, leading them to feel bad about their own intelligence.
at a high level llms take in words, do some math on them, and then produce words, one by one.
inside llms have these different components. we walk through them step by step.
1. tokenizer
2. embedding
3. attention
4. heads
5. ffn
6. sampling
## tokenizer