To my knowledge this connection was first noted in 2021 in https://arxiv.org/abs/2107.03006 (page 5). We wanted to do text diffusion where you'd corrupt words into semantically similar words (like "quick brown fox" -> "speedy black dog"), but we kept finding that masked tokens were easier for the model to recover. Historically this goes back even further, to https://arxiv.org/abs/1904.09324, which built a generative MLM without framing it in diffusion math.
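For concreteness, here's a minimal sketch of the absorbing-state ("masking") forward process this kind of text diffusion uses. The mask id, shapes, and per-sequence noise level below are illustrative assumptions, not anything taken from the paper.

```python
import torch

MASK_ID = 0  # hypothetical id reserved for the [MASK] / absorbing token

def absorbing_corrupt(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Forward ("noising") process of absorbing-state discrete diffusion.

    Each token is independently replaced by MASK_ID with probability t,
    where t in [0, 1] plays the role of the diffusion timestep: t=0 keeps
    the sequence clean, t=1 masks everything. A masked LM trained to undo
    this corruption is the reverse (denoising) model.
    """
    drop = torch.rand(x0.shape) < t                  # broadcast t over positions
    return torch.where(drop, torch.full_like(x0, MASK_ID), x0)

x0 = torch.randint(1, 50_000, (4, 16))               # a batch of pretend token ids
t = torch.rand(4, 1)                                  # one noise level per sequence
xt = absorbing_corrupt(x0, t)                         # partially masked sequences
```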
thatguysaguy
Back when BERT came out, everyone was trying to get it to generate text. These attempts generally didn't work; here's one for reference, though: https://arxiv.org/abs/1902.04094
This doesn't have an explicit diffusion tie-in, but Savinov et al. at DeepMind figured out that doing two denoising steps at training time and randomizing the masking probability is enough to get it to work reasonably well.
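A rough sketch of that two-step ("unrolled") training idea, written with masking corruption and a hypothetical `model` that maps token ids to per-position logits; it shows the shape of the trick rather than Savinov et al.'s exact recipe.

```python
import torch
import torch.nn.functional as F

def unrolled_denoising_loss(model, x0, mask_id=0):
    """Two denoising steps per training example, with a random corruption rate.

    Step 1 denoises a corruption of the clean text x0; step 2 denoises the
    model's own step-1 sample, so the model also learns to fix its own
    mistakes. Both steps are supervised against the clean text.
    """
    rate = torch.rand(())                              # randomized masking probability
    corrupt = torch.rand(x0.shape) < rate
    x1 = torch.where(corrupt, torch.full_like(x0, mask_id), x0)

    logits1 = model(x1)                                # step 1: denoise the corruption
    loss1 = F.cross_entropy(logits1.flatten(0, 1), x0.flatten())

    sample = torch.distributions.Categorical(logits=logits1).sample()
    logits2 = model(sample)                            # step 2: denoise our own sample
    loss2 = F.cross_entropy(logits2.flatten(0, 1), x0.flatten())

    return loss1 + loss2
```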
kibwen
To me, the diffusion-based approach "feels" more akin to what's going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words; I start with some fuzzy idea in my head, and the challenge is serializing it into language coherently.
bonoboTP
It feels like it would make more sense to let the model do Levenshtein-like edits instead of just masking and filling in the masked tokens. Intuitively, it seems really hard in this diffusion setup to swap one word near the end for a longer but better synonym, because there's no way to shift everything to the right afterwards.
briandw
I love seeing these simple experiments. Easy to read through quickly and understand a bit more of the principles.
One of my stumbling blocks with text diffusers is that ideally you wouldn't treat the tokens as discrete values but rather as probability fields. Image diffusers have the natural property that a pixel is a continuous value: you can smoothly transition from one color to another. Not so with tokens; in this case they just do a full replacement. You can't add noise to a token, you have to work in the embedding space. But how can you train embeddings directly? I found a bunch of different approaches that have been tried, but they are all much more complicated than the image-based diffusion process.
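For illustration, here's a sketch of the embedding-space route alluded to above (roughly in the spirit of Diffusion-LM-style approaches): embed the tokens, add Gaussian noise there, and round denoised vectors back to the nearest token. The sizes and the rounding rule are assumptions, and the hard part raised in the comment, training the embedding table jointly without it collapsing, is not solved by this snippet.

```python
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 256                 # illustrative sizes
embed = nn.Embedding(vocab_size, dim)         # the table you'd have to train jointly

def noise_in_embedding_space(token_ids, sigma):
    """Continuous corruption: embed the tokens, then add Gaussian noise.

    Unlike masking, this interpolates smoothly, the way pixel noise does.
    """
    e0 = embed(token_ids)                     # (B, L, dim) clean embeddings
    return e0 + sigma * torch.randn_like(e0)

def round_to_tokens(denoised):
    """Map (de)noised vectors back to discrete tokens by nearest embedding."""
    sims = denoised @ embed.weight.T          # (B, L, vocab) similarities
    return sims.argmax(dim=-1)
```

Approaches along these lines typically train the embedding table jointly and anchor it with a reconstruction term so the continuous space stays decodable, which is part of why they end up more complicated than pixel diffusion.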
zaptrem
When text diffusion models started popping up I thought the same thing as this guy ("wait, this is just MLM"), though I was thinking more of MaskGIT. The only thing I could think of that would make it "diffusion" is if the model had to learn to replace incorrect tokens with correct ones (since continuous diffusion's big thing is noise resistance). I don't think anyone has done this because it's hard to come up with good incorrect tokens.
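The crudest version of that idea is easy to write down: corrupt with uniformly random substitutions instead of masks and supervise every position, so the model has to detect and fix wrong tokens rather than just fill blanks. Uniform replacements are exactly the "not very good incorrect tokens" worried about here; `model` is again a hypothetical token-ids-to-logits network.

```python
import torch
import torch.nn.functional as F

def substitution_corrupt(x0, vocab_size, rate):
    """Replace a random subset of tokens with uniformly random (wrong) tokens."""
    corrupt = torch.rand(x0.shape) < rate
    random_tokens = torch.randint_like(x0, vocab_size)
    return torch.where(corrupt, random_tokens, x0)

def correction_loss(model, x0, vocab_size):
    """Train the model to restore x0 from a substitution-corrupted copy.

    Unlike plain MLM, the loss covers every position, so the model must
    learn to spot and correct incorrect tokens, not just fill in masks.
    """
    xt = substitution_corrupt(x0, vocab_size, rate=torch.rand(()).item())
    logits = model(xt)                            # assumed shape: (B, L, vocab)
    return F.cross_entropy(logits.flatten(0, 1), x0.flatten())
```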
I'm more excited about approaches like this one:
https://openreview.net/forum?id=c05qIG1Z2B
They're doing continuous latent diffusion combined with autoregressive transformer-based text generation. The autoencoder and transformer are (or can be) trained in tandem.
alansaber
Interested in how this compares to ELECTRA.
notsylver
I've really wanted to fine-tune an inline code completion model to see if I could get at all close to Cursor (I can't, but it would be fun), but as far as I know there are no open diffusion models to use as a base, and especially none that would be good as a base. Hopefully something viable comes out soon.
BoiledCabbage
To me, part of the appeal of image diffusion models was starting with random noise to produce an image. Why do text diffusion models start with a blank slate (i.e., all "masked" tokens) instead of with random tokens?
schopra909
Very cool parallel. Never thought about it this way — but makes complete sense
skeptrune
Fun writeup! It's amazing how flexible an architecture can be to different objectives.
rafaelero
The problem with this approach to text generation is that it's still not flexible enough. If during inference the model changes its mind and wants to output something considerably different, it can't, because there are too many tokens already in place.
nodja
I think another easy improvement to this diffusion model would be for the logprobs to also affect the chance of a token being turned back into a mask: higher-confidence tokens should have less of a chance of being pruned, so it should converge faster. I wonder if backprop would be able to exploit that. (I'm not an ML engineer.)
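A sketch of that confidence-weighted remasking (close in spirit to how MaskGIT schedules which positions to keep at each sampling step). The hard top-k cutoff, the mask id, and the shapes are assumptions; a softer variant could instead sample each position's remask probability from its logprob.

```python
import torch

MASK_ID = 0  # hypothetical mask token id

def remask_low_confidence(token_ids, logprobs, keep_fraction):
    """Keep the most confident predictions, turn the rest back into masks.

    token_ids: (B, L) tokens sampled at this step
    logprobs:  (B, L) log-probability the model assigned to each sampled token
    keep_fraction: share of positions to leave unmasked at this step
    """
    B, L = token_ids.shape
    n_keep = max(1, int(keep_fraction * L))
    keep_idx = logprobs.topk(n_keep, dim=-1).indices        # most confident positions
    keep = torch.zeros_like(token_ids, dtype=torch.bool)
    keep.scatter_(1, keep_idx, torch.ones_like(keep_idx, dtype=torch.bool))
    return torch.where(keep, token_ids, torch.full_like(token_ids, MASK_ID))
```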