RNNs have two huge issues:
- Long context. Recurrence degrades the signal for the same reason that 'deep' NN architectures don't go much past 3-4 layers before you need residual connections and the like.
- (This is the big one.) Training performance is terrible, since you can't parallelize them across a sequence like you can with causal masked attn in transformers.
On the huge benefit side though you get:
- Guaranteed state size, so perfect batch packing, perfect memory use, easy load/unload from a batch, and O(1) cost per generated token, so generally massive performance gains in inference.
- Unlimited context (well, no need for a concept of a position embedding or similar system).
Taking the best of both worlds is definitely where it is at for the future. An architecture that can train parallelized, has a fixed state size so you can load/unload and pack batches perfectly, unlimited context (with perfect recall), etc etc. That is the real architecture to go for.
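To make the "train parallelized, fixed state size" combination concrete, here is a minimal sketch using causal linear attention, one known family with both forms. It is an illustration under assumptions, not the ideal architecture described above: it does not give perfect recall, the feature map `elu(x) + 1` is just one common choice, and all function names are made up for this example. The first function computes each position independently from the whole sequence (the training-style view); the second carries a fixed-size state token by token (the inference-style view).

```python
import math
import random

def feature_map(x):
    # elu(x) + 1: a positive feature map (assumed choice; any positive map works here)
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def parallel_form(qs, ks, vs):
    """Causal linear attention where every position is computed independently."""
    dv = len(vs[0])
    outs = []
    for i, q in enumerate(qs):
        fq = feature_map(q)
        # Similarity of position i to every position j <= i (causal mask)
        weights = [dot(fq, feature_map(ks[j])) for j in range(i + 1)]
        total = sum(weights)
        outs.append([sum(w * vs[j][c] for j, w in enumerate(weights)) / total
                     for c in range(dv)])
    return outs

def recurrent_form(qs, ks, vs):
    """Same outputs as an RNN whose state is d_k * d_v + d_k numbers, whatever the length."""
    dk, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(dk)]  # running sum of phi(k) v^T
    z = [0.0] * dk                       # running sum of phi(k)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = feature_map(k)
        for a in range(dk):
            z[a] += fk[a]
            for c in range(dv):
                S[a][c] += fk[a] * v[c]
        fq = feature_map(q)
        denom = dot(fq, z)
        outs.append([sum(fq[a] * S[a][c] for a in range(dk)) / denom
                     for c in range(dv)])
    return outs

def random_seq(n, d, rng):
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

rng = random.Random(0)
qs, ks, vs = random_seq(6, 4, rng), random_seq(6, 4, rng), random_seq(6, 3, rng)
par = parallel_form(qs, ks, vs)
rec = recurrent_form(qs, ks, vs)
# Largest elementwise disagreement between the two forms (floating-point noise only)
max_diff = max(abs(a - b) for pa, pb in zip(par, rec) for a, b in zip(pa, pb))
```

The state `(S, z)` never grows with sequence length, which is what makes batch packing and load/unload simple. The price is that everything the model remembers has to fit into that fixed state.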

> Industry OTOH has gone all-in on Transformers.
It's so annoying. Transformers keep improving and recurrent networks are harder to train, so until we hit some real wall, companies don't seem eager to diverge. It's like lithium batteries improving so fast that it wasn't profitable to work on sodium ones, even though we unfortunately want the sodium ones to be better.

> Who out there is working on ... infinite history?
Many people are still working on improving RNNs, mostly in academia. Examples off the top of my head:
* RWKV: <a href="https://arxiv.org/abs/2305.13048" rel="nofollow">https://arxiv.org/abs/2305.13048</a> / <a href="https://arxiv.org/abs/2404.05892" rel="nofollow">https://arxiv.org/abs/2404.05892</a> / <a href="https://arxiv.org/abs/2503.14456" rel="nofollow">https://arxiv.org/abs/2503.14456</a>
* Linear attention: <a href="https://arxiv.org/abs/2006.16236" rel="nofollow">https://arxiv.org/abs/2006.16236</a>
* State space models: <a href="https://arxiv.org/abs/2312.00752" rel="nofollow">https://arxiv.org/abs/2312.00752</a> / <a href="https://arxiv.org/abs/2405.21060" rel="nofollow">https://arxiv.org/abs/2405.21060</a>
* Linear RNNs: <a href="https://arxiv.org/abs/2410.01201" rel="nofollow">https://arxiv.org/abs/2410.01201</a>
Industry OTOH has gone all-in on Transformers.

I built guided window attn (literally predict the position of the window) a while ago and that works great. Why are we still stuck on any form of attn that looks at the entire context in any meaningful way? Do humans work this way? Do I need a whole book to predict the next word? Who out there is working on really new unique ways to deal with infinite history, other than me of course :)

What you're missing is that there's no need to do extra work in the kernel smoothing step (what attention essentially is), because all the fancy transformation work is already happening in learning the kernel.
The feedforward networks prior to the attention layer are effectively learning sophisticated kernels. For those unfamiliar, a kernel is just a generalization of the dot product, which is the most fundamental way of defining "similarity" between two points.
By learning a kernel, the transformer is learning the best way to define what "similar" means for the task at hand, and then we simply apply some basic smoothing over the data. This will handle all sorts of interesting ways to compare points, and that comparison will allow all points to provide a little bit of information.
Anything you could hope to achieve by performing more comparisons would be better solved by a better similarity function.
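The kernel-smoothing view can be sketched in a few lines. This is a minimal illustration, assuming the queries and keys are already the outputs of the learned layers (so the "learned kernel" is whatever produced them); the function names are hypothetical. It shows that a generic kernel smoother (a weighted average with weights given by a similarity function) reproduces softmax attention exactly when the kernel is the exponentiated scaled dot product.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exp_kernel(q, k):
    # Exponentiated scaled dot product: the similarity used by softmax attention
    return math.exp(dot(q, k) / math.sqrt(len(q)))

def kernel_smooth(query, keys, values, kernel):
    """Generic kernel smoother: average the values, weighted by similarity to the query."""
    weights = [kernel(query, k) for k in keys]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))]

def softmax_attention(query, keys, values):
    """Single-query softmax attention, written the usual way."""
    scores = [dot(query, k) / math.sqrt(len(query)) for k in keys]
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
query = [0.5, -0.5]

smoothed = kernel_smooth(query, keys, values, exp_kernel)
attended = softmax_attention(query, keys, values)
# A constant kernel treats every point as equally similar, giving the plain mean
uniform = kernel_smooth(query, keys, values, lambda q, k: 1.0)
```

Swapping `exp_kernel` for another similarity function changes what "similar" means without touching the smoothing step, which is the point being made above.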

You can find papers discussing "cubic" attention, i.e. each token gets to interact with each pair of other tokens, but always in very theoretical settings with single-layer transformers on contrived synthetic tasks.
Keep in mind that LLMs have many, many layers, so they have plenty of opportunity to model higher-order interactions without needing to brute-force every possible combination of 10 previous tokens, of which the vast majority will be useless. Empirically, even full "quadratic" attention is not always necessary, as evidenced by the existence of linear/sparse attention variants that perform almost as well.
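For a feel of what "each token interacts with each pair of other tokens" means, here is a naive sketch of one possible third-order attention. The specific choices are assumptions made for illustration and are not taken from any particular paper: a trilinear score between a query and two keys, a softmax over all (j, l) pairs, the two values averaged, and no causal mask. The function name is hypothetical.

```python
import math
from itertools import product

def third_order_attention(qs, k1s, k2s, vs):
    """Naive 'cubic' attention: every query scores every ordered pair of tokens."""
    n, d, dv = len(qs), len(qs[0]), len(vs[0])
    pairs = list(product(range(n), repeat=2))   # n^2 pairs per query
    outs, n_scores = [], 0
    for q in qs:
        # Trilinear score: sum over features of q * k1_j * k2_l (assumed form)
        scores = [sum(q[c] * k1s[j][c] * k2s[l][c] for c in range(d))
                  for j, l in pairs]
        n_scores += len(scores)
        m = max(scores)                          # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out = [0.0] * dv
        for (j, l), e in zip(pairs, exps):
            w = e / total
            for c in range(dv):
                # Combine the pair's values by averaging (assumed choice)
                out[c] += w * 0.5 * (vs[j][c] + vs[l][c])
        outs.append(out)
    return outs, n_scores

qs = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.5]]
k1s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k2s = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
vs = [[1.0], [2.0], [3.0]]
outs, n_scores = third_order_attention(qs, k1s, k2s, vs)
```

With `n` tokens this computes `n**3` scores per layer and head, compared with `n**2` for standard attention, which is where the cost concern comes from.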

Aren't layers basically doing n^k attention? The attention block is n^2 because it allows 1 number per input/output pair. But nothing prevents you from stacking these on top of each other and getting k-th order "attentioness", with each layer encoding a different order.

This is a common way of thinking. In practice this type of thing is more like optimizing FLOP allocation. Surely with an infinite compute and parameter budget you could have a better model with more intensive operations.
Another thing to consider is that transformers are very general computers. You can encode many, many more complex architectures in simpler, multi-layer transformers.

Yes, and it works in theory.
Less so in practice. You saturate the memory of a B200 with a few dozen tokens on attentions higher than order 4. Training is even worse.
To paraphrase Knuth: high-order polynomials are much more unimaginably large than mere infinity.
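The scaling behind this can be checked with back-of-the-envelope arithmetic. The sketch below assumes the naive case where every order-k score is materialized at once, 2 bytes per score, and a 192 GB memory budget; all three are assumptions to swap for your own numbers, and the function names are made up. It ignores activations, weights, and gradients, so real limits are tighter.

```python
def score_tensor_bytes(n_tokens, order, bytes_per_score=2, n_heads=1):
    """Bytes needed to hold every order-`order` attention score for one layer."""
    return n_heads * (n_tokens ** order) * bytes_per_score

def max_tokens_that_fit(budget_bytes, order, bytes_per_score=2, n_heads=1):
    """Largest sequence length whose full score tensor fits in the budget."""
    n = 0
    while score_tensor_bytes(n + 1, order, bytes_per_score, n_heads) <= budget_bytes:
        n += 1
    return n

# Assumed budget: 192e9 bytes; substitute the capacity of your own device.
BUDGET = 192 * 10**9

# Sequence length that fits for orders 3..10, single head vs an assumed 32 heads
fits_one_head = {k: max_tokens_that_fit(BUDGET, k) for k in range(3, 11)}
fits_32_heads = {k: max_tokens_that_fit(BUDGET, k, n_heads=32) for k in range(3, 11)}
```

Printing `fits_one_head` and `fits_32_heads` shows how quickly the feasible sequence length collapses as the order grows, before training overheads are even counted.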

There are lots of more complicated operations than comparing every token to every other token, and the complexity increases when you start comparing not just token pairs but token bigrams, trigrams, and so on. There is no obvious proof that all those comparisons would be equivalent to the standard attention mechanism of comparing every token to every other one.

n^2 isn't a setting someone chose, it's a mathematical consequence of what attention is.
Here's what attention does: every token looks at every other token to decide what's relevant. If you have n tokens, and each one looks at n others, you get n * n = n^2 operations.
Put another way: n^2 is when every token gets to look at every other token. What would n^3 be? n^10?
(Sibling comment has the same interpretation as you, then handwaves that transformers can emulate more complex systems.)

OT, but instead of quadratic attention can we not have n^10 or something crazier? I feel like we are limiting the intelligence just to save cost. But I can imagine that there might be some questions that may be worth paying a higher cost for.
I feel like n^10 attention can capture patterns that lower-complexity attention may not. So it seems arbitrary that we have n^2 attention.

<a href="https://hazyresearch.stanford.edu/blog/2025-03-15-tk-blackwell" rel="nofollow">https://hazyresearch.stanford.edu/blog/2025-03-15-tk-blackwe...</a>
cooperative execution, yeah
as you can tell I do not do CUDA for a living :D

I do CUDA for a living (not inference) and for the life of me (and a couple of LLMs for that matter) I cannot figure out what you mean by "SM pairs".
Do you mean the coupled dies on stuff like the B200? An NVIDIA chip die has many SMs if so.
Do you mean TMEM MMA cooperative execution? I'm guessing that must be it given what the paper is about.

Look at the email addresses. If you'll recall, there's an embargo on China.

I still have 2x NVLinked A6000 and they aren't that bad compared to a single RTX 6000 Pro.

yep, <a href="https://github.com/poad42/cuda-fp8-ampere" rel="nofollow">https://github.com/poad42/cuda-fp8-ampere</a> is another recent attempt at squeezing whatever's left from Ampere

Oh wow, there's still work being done on Ampere?
I was wondering - I've been thinking about switching to AI systems programming (I know, easy task), but from what I understand, industry cloud GPUs are the main winners, right? Nobody's going to pay me (assuming I even had the skills) to optimize for consumer GPUs?
From what I understand, it's not just number + capacity + performance, it's literal core primitives. I don't think any of the "Blackwell" chips like the Grace one or RTX 5090 have, for example, SM pairs in their ISA? And likewise similar fundamental differences between consumer and cloud Hopper (where the majority of the perf is the cloud one's ISA?)
So I guess I'm wondering if I should buy a GPU myself or should I just rent on the cloud if I wanted to start getting some experience in this field. How do you even get experience in this normally anyways, do you get into really good schools and into their AI labs which have a lot of funding?

Tri Dao isn't on the paper. Is it even allowed to call it "FlashAttention"???

Link if you don't want to automatically download files: <a href="https://dl.acm.org/doi/pdf/10.1145/3774934.3786425" rel="nofollow">https://dl.acm.org/doi/pdf/10.1145/3774934.3786425</a>

Less annoying link directly to the paper: <a href="https://dl.acm.org/doi/pdf/10.1145/3774934.3786425?download=true" rel="nofollow">https://dl.acm.org/doi/pdf/10.1145/3774934.3786425?download=...</a>

TL;DR: 5% - 17% speedup from removing a bottleneck by juggling where on a GPU/compute core a computation is done during FlashAttention.

FlashAttention-T: Towards Tensorized Attention