Softmax forever, or why I like softmax

177 points | 110 comments | 5 days ago
roger_

An aside: please use proper capitalization. With this article I found myself backtracking, thinking I’d missed a word, which was very annoying. Not sure what the author’s intention was with that decision, but please reconsider.

maurits

For people interested in softmax, log-sum-exp, and energy-based models, have a look at "Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One" [1]

[1]: https://arxiv.org/abs/1912.03263
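
For context, the paper's core observation (as I read it) is that the same logits f_\theta(x)[y] that define a classifier via softmax also define an unnormalized density over x via log-sum-exp:

    p_\theta(y \mid x) = \frac{e^{f_\theta(x)[y]}}{\sum_{y'} e^{f_\theta(x)[y']}},
    \qquad
    p_\theta(x) \propto \sum_y e^{f_\theta(x)[y]} = e^{\operatorname{LogSumExp}_y f_\theta(x)[y]},

i.e. the energy of x is E_\theta(x) = -\operatorname{LogSumExp}_y f_\theta(x)[y], which is what lets you treat the classifier as an energy-based model.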

stared

There are many useful tricks - like cosine distance.

In contrast, softmax has a very deep grounding in statistical physics - where it is called the Boltzmann distribution. In fact, this connection between statistical physics and machine learning was so fundamental that it was a key part of the 2024 Nobel Prize in Physics awarded to Hopfield and Hinton.
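
For concreteness, the correspondence is just a relabeling: with energies E_i and temperature T, the Boltzmann distribution is the softmax of z_i = -E_i/(k_B T):

    p_i = \frac{e^{-E_i/(k_B T)}}{\sum_j e^{-E_j/(k_B T)}} = \operatorname{softmax}(z)_i,
    \qquad z_i = -\frac{E_i}{k_B T}.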

creakingstairs

Because the domain is a Korean name, I half-expected this to be about an old Korean game company[1] with the same name. They made some banger RPGs at the time and had really great art books.

[1] https://en.m.wikipedia.org/wiki/ESA_(company)

incognito124

How to sample from a categorical: https://news.ycombinator.com/item?id=42596716

Note: I am the author
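
For a quick taste, a rough Python sketch of one common way to do it (the Gumbel-max trick; see the link for more detail):

    import numpy as np

    def sample_categorical(logits, rng=None):
        # Gumbel-max trick: argmax(logits + Gumbel noise) is distributed
        # according to softmax(logits), so no explicit normalization is needed.
        if rng is None:
            rng = np.random.default_rng()
        u = rng.uniform(size=len(logits))
        gumbel = -np.log(-np.log(u))
        return int(np.argmax(np.asarray(logits, dtype=float) + gumbel))

    print(sample_categorical([2.0, 0.0, -1.0]))  # returns index 0 most of the time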

semiinfinitely

I think that log-sum-exp should actually be the function that gets the name "softmax", because it's actually a soft maximum over a set of values. And what we call "softmax" should be called "grad softmax" (since the grad of log-sum-exp is softmax).
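
Spelled out: log-sum-exp is a smooth approximation to the max, and its gradient is exactly softmax:

    \max_j x_j \;\le\; \log\sum_j e^{x_j} \;\le\; \max_j x_j + \log n,
    \qquad
    \frac{\partial}{\partial x_i} \log\sum_j e^{x_j} = \frac{e^{x_i}}{\sum_j e^{x_j}} = \operatorname{softmax}(x)_i .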

janalsncm

This is a really intuitive explanation, thanks for posting. I think everyone’s first intuition for “how do we turn these logits into probabilities” is to just normalize the absolute values of the numbers so they sum to one. The seemingly unjustified complexity of softmax annoyed me in college.

The author gives a really clean explanation for why that’s hard for a network to learn, starting from first principles.
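
A toy illustration of that point (my own sketch, not code from the article): normalizing by the sum of absolute values throws away the sign, while softmax respects it and stays well-defined everywhere.

    import numpy as np

    def naive_probs(logits):
        # Divide by the sum of absolute values: -3 and +3 get the same
        # probability, and an all-zero input divides by zero.
        a = np.abs(np.asarray(logits, dtype=float))
        return a / a.sum()

    def softmax(logits):
        z = np.asarray(logits, dtype=float)
        z = z - z.max()              # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = [3.0, -3.0, 0.5]
    print(naive_probs(logits))  # ~[0.46, 0.46, 0.08]: -3 ties with +3
    print(softmax(logits))      # ~[0.92, 0.00, 0.08]: the sign matters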

calebm

Funny timing, I just used softmax yesterday to turn a list of numbers (some of which could be negative) into a probability distribution (summing up to 1). So useful. It was the perfect tool for the job.
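
A couple of properties worth knowing when you use it this way (rough sketch): softmax is invariant to adding a constant to every input, which is exactly why negative values are no problem, and SciPy ships an implementation.

    import numpy as np
    from scipy.special import softmax  # available since SciPy 1.2

    x = np.array([-4.0, -1.0, 2.0])            # negatives are fine
    p = softmax(x)
    print(p, p.sum())                          # a proper distribution summing to 1
    print(np.allclose(p, softmax(x + 100.0)))  # True: shift-invariant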

AnotherGoodName

For the question "is softmax the only way to turn unnormalized real values into a categorical distribution?", you can just use statistics.

E.g., using Bayesian stats: if I assume a uniform prior (pretend I have no assumptions about how biased the coin is) and I see it flip heads 4 times in a row, what's the probability of the next flip being heads?

Via a long-winded proof using the Dirichlet distribution, Bayesian stats will say "add one to the top and two to the bottom". Here we saw 4/4 heads, so we guess a 5/6 chance of heads (+1 to the top, +2 to the bottom) the next time, or a 1/6 chance of tails. This reflects the model's assumption that the coin may be biased.

That's normalized as a probability summing to 1, which is what we want. It works for multiple outcomes as well: you add to the bottom as many different outcomes as you have. The Dirichlet distribution also allows for real-valued (non-integer) counts, so you can support those too. If you feel this gives too much weight to the possibility of the coin being biased, you can simply add more to the top and bottom, which is the same as encoding this in your prior - e.g. add 100 to the top and 200 to the bottom instead.

Now this behaves quite differently from softmax. It gives everything a meaningfully non-zero chance, rather than the classic sigmoid-style squashing that softmax has underneath, which pushes things toward nearly absolute 0 or 1. But... other distributions like this are very helpful in many circumstances. Do you actually think the chance of tails becomes 0 if you see heads flipped 100 times in a row? Of course not.

So anyway, the softmax function fits things to one particular type of distribution, but you can fit pretty much anything to any distribution with good old statistics. Choose the right one for your use case.
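
A rough Python sketch of the "add to the top and bottom" rule above (the posterior mean under a symmetric Dirichlet prior, i.e. additive smoothing):

    import numpy as np

    def dirichlet_mean(counts, prior=1.0):
        # Posterior mean with a symmetric Dirichlet(prior) prior: add `prior`
        # to each count and renormalize. With two outcomes and prior=1 this is
        # Laplace's rule of succession.
        c = np.asarray(counts, dtype=float)
        return (c + prior) / (c.sum() + prior * len(c))

    print(dirichlet_mean([4, 0]))             # [5/6, 1/6] after 4 heads, 0 tails
    print(dirichlet_mean([4, 0], prior=100))  # ~[0.51, 0.49]: stronger prior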

yorwba

The author admits they "kinda stopped reading this paper" after noticing that its authors only used one hyperparameter configuration, which I agree is a flaw in the paper, but that's not an excuse for sloppy treatment of the rest of the paper. (It would, however, be an excuse to ignore it entirely.)

In particular, the assumption that |a_k| ≈ 0 initially is incorrect, since in the original paper https://arxiv.org/abs/2502.01628 the a_k are distances from one vector to multiple other vectors, and they're unlikely to be initialized in such a way that the distance is anywhere close to zero. So while the gradient divergence near 0 could certainly be a problem, it doesn't have to be as fatal as the author seems to think it is.

nobodywillobsrv

Softmax’s exponential comes from counting occupation states. Maximize the ways to arrange things with logits as energies, and you get exp(logits) over a partition function, pure Boltzmann style. It’s optimal because it’s how probability naturally piles up.
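
One standard way to spell that out is the maximum-entropy version of the same argument: maximize entropy subject to a fixed mean energy, and the Lagrange conditions force the Boltzmann/softmax form:

    \max_{p}\; -\sum_i p_i \log p_i
    \quad\text{s.t.}\quad \sum_i p_i = 1,\;\; \sum_i p_i E_i = \bar{E}
    \;\;\Longrightarrow\;\;
    p_i = \frac{e^{-\beta E_i}}{\sum_j e^{-\beta E_j}} = \operatorname{softmax}(z)_i,
    \quad z_i = -\beta E_i .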

littlestymaar

Off topic: unlike many out there, I'm not usually bothered by a lack of capitalization in comments or tweets, but for an essay like this it makes the paragraphs so hard to read!

bambax

OT: refusing to capitalize the first word of each sentence is an annoying posture that makes reading what you write more difficult. I tend to do it too when taking notes for myself, because I'm the only reader and it saves picoseconds of typing; but I wouldn't dream of inflicting it upon others.

xchip

The author is trying to show off; you can tell because his explanation makes no sense and he overcomplicated it to look smart.