An aside: please use proper capitalization. With this article I found myself backtracking, thinking I'd missed a word, which was very annoying. Not sure what the author's intention was with that decision, but please reconsider.
maurits
For people interested in softmax, log-sum-exp, and energy-based models, have a look at "Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One" [1].
[1] https://arxiv.org/abs/1912.03263
There are many useful tricks - like cosine distance.
In contrast, softmax has a very deep grounding in statistical physics - where it is called the Boltzmann distribution. In fact, this connection between statistical physics and machine learning was so fundamental that it was a key part of the 2024 Nobel Prize in Physics awarded to Hopfield and Hinton.
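Concretely, the correspondence is just softmax with a temperature knob: treat the logits as negative energies and divide by the partition function. A minimal numpy sketch of that equivalence (my own illustration, not from the article):

    import numpy as np

    def softmax(logits):
        z = np.asarray(logits, dtype=float)
        e = np.exp(z - z.max())          # subtract the max for numerical stability
        return e / e.sum()

    def boltzmann(energies, T=1.0):
        w = np.exp(-np.asarray(energies, dtype=float) / T)
        return w / w.sum()               # divide by the partition function Z

    logits = np.array([2.0, 1.0, -0.5])
    print(softmax(logits))
    print(boltzmann(-logits, T=1.0))     # identical: energies = -logits, T = 1
    print(boltzmann(-logits, T=0.1))     # low temperature sharpens toward argmax
    print(boltzmann(-logits, T=10.0))    # high temperature flattens toward uniform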
creakingstairs
Because the domain is a Korean name, I half-expected this to be about an old Korean game company[1] with the same name. They made some banger RPGs at the time and had really great art books.
[1] https://en.m.wikipedia.org/wiki/ESA_(company)
How to sample from a categorical: https://news.ycombinator.com/item?id=42596716
Note: I am the author
I think that log-sum-exp should actually be the function that gets the name "softmax", because it's actually a soft maximum over a set of values. And what we call "softmax" should be called "grad softmax" (since the gradient of log-sum-exp is softmax).
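Both halves of that are easy to check numerically. A quick sketch using scipy (my own check, not from the article):

    import numpy as np
    from scipy.special import logsumexp, softmax

    x = np.array([1.0, 3.0, 2.5])

    # logsumexp is a smooth upper bound on max; sharpening with a
    # temperature t -> 0 recovers the hard max.
    print(x.max(), logsumexp(x))          # 3.0 vs ~3.55
    t = 0.01
    print(t * logsumexp(x / t))           # ~3.0

    # The gradient of logsumexp is softmax: compare central differences
    # against scipy's softmax.
    eps = 1e-6
    grad = np.array([(logsumexp(x + eps * e) - logsumexp(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
    print(grad)
    print(softmax(x))                     # same values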
janalsncm
This is a really intuitive explanation, thanks for posting. I think everyone’s first intuition for “how do we turn these logits into probabilities” is to use weights proportional to the absolute values of the numbers (divide each by their sum). The unjustified complexity of softmax annoyed me in college.
The author gives a really clean explanation for why that’s hard for a network to learn, starting from first principles.
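One concrete problem with that scheme, though I don't know whether it's the article's exact argument: it ignores sign and isn't shift-invariant, while softmax only depends on differences between logits. A rough sketch:

    import numpy as np

    def abs_norm(x):
        a = np.abs(np.asarray(x, dtype=float))
        return a / a.sum()

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    logits = np.array([-2.0, 0.5, 2.0])
    print(abs_norm(logits))    # the -2 logit gets as much mass as +2, and more than +0.5
    print(softmax(logits))     # mass ordered by the logits themselves

    # Softmax only depends on differences between logits; abs-normalization does not.
    print(softmax(logits + 100))   # identical to the above
    print(abs_norm(logits + 100))  # completely different answer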
calebm
Funny timing, I just used softmax yesterday to turn a list of numbers (some of which could be negative) into a probability distribution (summing up to 1). So useful. It was the perfect tool for the job.
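For anyone reaching for the same tool: scipy ships this directly, and it copes with negative inputs (a minimal example, assuming scipy is available):

    import numpy as np
    from scipy.special import softmax

    scores = np.array([-3.2, 0.0, 1.7, -0.4])   # negatives are fine
    p = softmax(scores)
    print(p, p.sum())                            # all positive, sums to 1.0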
AnotherGoodName
As for the question "is softmax the only way to turn unnormalized real values into a categorical distribution": you can just use statistics.
E.g., using Bayesian stats: if I assume a uniform prior (pretend I have no assumptions about how biased the coin is) and I see it flip heads 4 times in a row, what's the probability of the next flip being heads?
Via a long-winded proof using the Dirichlet distribution, Bayesian stats says "add one to the top and two to the bottom". Here we saw 4/4 heads, so we guess a 5/6 chance of heads on the next flip (+1 to the top, +2 to the bottom), or a 1/6 chance of tails. This reflects that the statistical model still allows for some bias in the coin.
That's normalized to sum to 1, which is what we want. It works for multiple outcomes as well: you add to the bottom as many different outcomes as you have (see the sketch below). The Dirichlet distribution also allows real-valued counts, so you can support that too. If you feel this gives too much weight to the possibility of the coin being biased, you can simply add more to the top and bottom, which is the same as accounting for it in your prior, e.g. add 100 to the top and 200 to the bottom instead.
Now, this behaves quite differently from softmax. It gives everything a non-zero chance, rather than the classic sigmoid that softmax has underneath, which pushes things toward almost absolute 0 or 1. But distributions like this are very helpful in many circumstances: do you actually think the chance of tails becomes 0 after seeing heads flipped 100 times in a row? Of course not.
So anyway, softmax fits things to one particular type of distribution, but you can fit pretty much anything to any distribution with good old statistics. Choose the right one for your use case.
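A rough sketch of that rule and its multi-outcome generalization, i.e. Laplace/Dirichlet smoothing (the pseudo-count name alpha is mine, not from the comment):

    import numpy as np

    def dirichlet_mean(counts, alpha=1.0):
        # Posterior mean under a symmetric Dirichlet(alpha) prior:
        # (count_k + alpha) / (N + K * alpha).  alpha=1 is "add one to the
        # top and K to the bottom"; larger alpha trusts the prior more.
        counts = np.asarray(counts, dtype=float)
        return (counts + alpha) / (counts.sum() + alpha * len(counts))

    # 4 heads, 0 tails  ->  P(heads) = 5/6, P(tails) = 1/6
    print(dirichlet_mean([4, 0]))

    # Even 100 heads in a row never drives P(tails) to exactly zero.
    print(dirichlet_mean([100, 0]))

    # A stronger prior (alpha=100, i.e. +100 top / +200 bottom) shrinks toward fair.
    print(dirichlet_mean([4, 0], alpha=100))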
yorwba
The author admits they "kinda stopped reading this paper" after noticing that it only used one hyperparameter configuration, which I agree is a flaw in the paper, but that's not an excuse for sloppy treatment of the rest of the paper. (It would, however, be an excuse to ignore it entirely.)
In particular, the assumption that |a_k| ≈ 0 initially is incorrect, since in the original paper https://arxiv.org/abs/2502.01628 the a_k are distances from one vector to multiple other vectors, and they're unlikely to be initialized in such a way that the distance is anywhere close to zero. So while the gradient divergence near 0 could certainly be a problem, it doesn't have to be as fatal as the author seems to think it is.
nobodywillobsrv
Softmax’s exponential comes from counting occupation states. Maximize the ways to arrange things with logits as energies, and you get exp(logits) over a partition function, pure Boltzmann style. It’s optimal because it’s how probability naturally piles up.
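Maximizing the number of arrangements at fixed average energy is, via Stirling's approximation, the same as maximizing entropy, and the maximizer is exactly exp(logits)/Z. A small numerical sanity check of that claim (my own sketch, using scipy; not from the article):

    import numpy as np
    from scipy.optimize import minimize

    logits = np.array([2.0, 1.0, -1.0])
    E = -logits                              # treat logits as negative energies
    boltz = np.exp(-E) / np.exp(-E).sum()    # softmax(logits), i.e. exp(logits)/Z
    mean_E = boltz @ E

    # Among all distributions with that same average energy, find the one
    # with maximum entropy (minimize negative entropy under constraints).
    neg_entropy = lambda p: np.sum(p * np.log(p + 1e-12))
    cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
            {"type": "eq", "fun": lambda p: p @ E - mean_E}]
    res = minimize(neg_entropy, x0=np.ones(3) / 3,
                   bounds=[(0, 1)] * 3, constraints=cons)

    print(res.x)    # approximately the Boltzmann/softmax weights
    print(boltz)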
littlestymaar
Off topic: Unlike many out there I'm not usually bothered by lack of capitalization in comments or tweets, but for an essay like this, it makes the paragraphs so hard to read!
bambax
OT: refusing to capitalize the first word of each sentence is an annoying posture that makes reading what you write more difficult. I tend to do it too when taking notes for myself, because I'm the only reader and it saves picoseconds of typing; but I wouldn't dream of inflicting it on others.
xchip
The author is trying to show off: you can tell because his explanation makes no sense and he overcomplicated it to look smart.