abeppu

Diffusion model papers are always interesting to read, but I always feel like they need some mechanism to insert or delete tokens. In the example in the figure in this post, once it has fixed "British munchkin cats _ _ and ...", you _can't_ get to "British munchkin cats are a new and controversial breed." because there isn't the right number of tokens between "cats" and "and". In a coding context, if your model samples a paren or a comma or something that is entirely plausible at that position, it can still close off an expansion that would have been syntactically correct.
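A toy sketch of the fixed-length limitation being described (hypothetical tokens and helper, not any real model's API): in masked-diffusion decoding the sequence length is fixed up front, so unmasking can only fill existing slots, and a continuation that needs more tokens than the reserved slots is unreachable.

```python
# The post's example: two masked slots were reserved between "cats" and "and".
template = ["British", "munchkin", "cats", "[MASK]", "[MASK]", "and", "controversial"]

def unmask(seq, pos, token):
    """Fill one masked slot; the number of positions never changes."""
    assert seq[pos] == "[MASK]", "can only fill an existing slot"
    out = list(seq)
    out[pos] = token
    return out

step1 = unmask(template, 3, "are")
step2 = unmask(step1, 4, "new")
print(" ".join(step2))
# prints: British munchkin cats are new and controversial
# The five-token continuation "are a new and controversial breed" can never
# appear: there is no operation that inserts a position between "cats" and "and".
```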

MASNeo

I wish there were more of this research into speeding things up, rather than building ever-larger models.

yjftsjthsd-h

Is anyone doing any form of diffusion language model that's actually practical to run today on the actual machine under my desk? There are loads of more "traditional" .gguf options (well, quants) that are practical even on shockingly weak hardware, and I've been seeing things that give me hope that diffusion is the next step forward, but so far it's all been early research prototypes.

simonw

I'd love to know what's going on with the Gemini Diffusion model: they had a preview last May and it was crazy fast, but I've not heard anything since then.

fumeux_fume

Seeing half of an AR LLM's output tokens go to generating a predefined JSON schema bothers me so much. I would love to have an option to use diffusion for infilling.

nl

Releasing this on the same day as Taalas's 16,000-token-per-second acceleration of the roughly comparable Llama 8B model must hurt!

I wonder how far down they can scale a diffusion LM? I've been playing with in-browser models, and the speed is painful.

https://taalas.com/products/

LarsDu88

A lot of this post-training recipe feels reminiscent of DINO training (teacher/student, use of stop-gradients). I wonder if the more recent LeJEPA SIGReg regularization research might be relevant here for simpler post-training.

bjt12345

I do wonder why diffusion models aren't used alongside constrained decoding for programming; surely it makes better sense than using an auto-regressive model.
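A rough sketch of the constrained-decoding idea (toy grammar and vocabulary, all hypothetical): mask the model's token probabilities down to grammar-legal options and renormalize. An AR decoder applies this left to right; a diffusion sampler could in principle apply such a mask per masked position at every denoising step.

```python
vocab = ["(", ")", "x", "+"]

def legal_next(prefix):
    """Toy grammar for balanced arithmetic over 'x': tokens that may follow."""
    depth = prefix.count("(") - prefix.count(")")
    last = prefix[-1] if prefix else None
    legal = set()
    if last in (None, "(", "+"):
        legal |= {"(", "x"}          # an operand may start here
    if last in ("x", ")"):
        legal.add("+")               # an operator may follow an operand
        if depth > 0:
            legal.add(")")           # close only if something is open
    return legal

def mask_probs(probs, prefix):
    """Zero out grammar-illegal tokens, then renormalize the rest."""
    allowed = legal_next(prefix)
    masked = {t: (p if t in allowed else 0.0) for t, p in probs.items()}
    z = sum(masked.values())
    return {t: p / z for t, p in masked.items()} if z else masked

probs = {"(": 0.4, ")": 0.3, "x": 0.2, "+": 0.1}
print(mask_probs(probs, ["(", "x"]))  # only ")" and "+" survive the mask
```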

LarsDu88

Google is working on a similar line of research. I wonder why they haven't rolled out a GPT-4o-scale version of this yet.

WiSaGaN

I think diffusion makes much more sense than auto-regressive (AR) generation specifically for code generation, as compared to chatbot use.

hanifbbz

Is this available as open source anywhere to try?

cubefox

This doesn't mention the main drawback of diffusion language models, the reason nobody is using them: they score significantly lower on benchmarks than autoregressive models of similar size.

LoganDark

Can't wait for the day I can actually try a diffusion model on my own machine (128GB M4 Max) rather than as a hosted service. So far I haven't seen a single piece of software that supports it.

refulgentis

If this means there’s a 2x-7x speedup available to a scaled diffusion model like Inception Mercury, that’ll be a game changer. It feels 10x faster already…
