I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.
roger_
Can anyone explain why Mamba models start with a continuous time SSM (and discretize) vs discrete time?
I know the step size isn’t fixed, though I'm not sure why that matters. Is that the only reason? There also seems to be a parameterization advantage with the continuous formulation.
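One common answer (as I understand it from the S4/Mamba line of work): the continuous-time form lets the model learn a per-token step size Δ, and zero-order-hold discretization gives Ā = exp(ΔA), which stays in (0, 1) whenever A is negative — stability you'd have to enforce by hand if you parameterized the discrete transition directly. A minimal numpy sketch for a diagonal A; dimensions and values here are made up for illustration:

```python
import numpy as np

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization of a diagonal continuous-time SSM.

    A: (d,) diagonal of the continuous state matrix (typically negative)
    B: (d,) input projection
    delta: scalar step size (in Mamba, predicted per token from the input)
    """
    dA = delta * A
    A_bar = np.exp(dA)              # in (0, 1) for A < 0 -> stable decay
    B_bar = (A_bar - 1.0) / A * B   # (dA)^-1 (exp(dA) - 1) * (delta B); delta cancels
    return A_bar, B_bar

# One recurrence step per token: h' = A_bar * h + B_bar * x
d = 4
A = -np.abs(np.random.randn(d))    # negative => decaying state
B = np.random.randn(d)
h = np.zeros(d)
for x, delta in [(1.0, 0.1), (0.5, 2.0)]:   # a large delta pushes h toward the new input
    A_bar, B_bar = discretize_zoh(A, B, delta)
    h = A_bar * h + B_bar * x
```

The per-token Δ is what makes the recurrence input-selective: Δ → 0 leaves the state untouched, large Δ mostly overwrites it.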
Havoc
Is there a reason we don’t switch halfway through? I.e., start with a classic LLM and switch to something linear like Mamba as the context grows.
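One reason this question comes up: attention's per-token state (the KV cache) grows linearly with context, while an SSM layer carries a fixed-size state. A toy comparison — all layer/head/dimension numbers below are illustrative assumptions, not any particular model:

```python
# Toy memory comparison: KV cache grows with context, SSM state does not.
# Sizes (layers, heads, dims) are made-up illustrative values.
def kv_cache_bytes(ctx_len, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V tensors per layer, one entry per past token
    return ctx_len * layers * kv_heads * head_dim * 2 * dtype_bytes

def ssm_state_bytes(layers=32, d_model=4096, state_dim=16, dtype_bytes=2):
    # fixed-size recurrent state, independent of context length
    return layers * d_model * state_dim * dtype_bytes

for ctx in (1_000, 100_000):
    print(f"ctx={ctx}: kv={kv_cache_bytes(ctx):,} B, ssm={ssm_state_bytes():,} B")
```

As far as I know, published hybrids interleave attention and SSM layers throughout the stack rather than switching per position, but the asymmetry above is the same motivation.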
jychang
I'm not sure that I buy their conclusion that more compute during inference is good.
Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.
With a fused kernel, that means the GPU streams the tensors from VRAM, and does a bunch of compute on different conversations in the batch, at the same time.
If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this does mean each GPU can serve fewer users. Providers normally aren't leaving GPU cores idle during inference.
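To put rough numbers on the bandwidth-vs-compute point, here is a back-of-the-envelope roofline for batched decoding. All hardware and model numbers are illustrative assumptions, not measurements of any real GPU:

```python
# Back-of-the-envelope roofline for batched decoding.
# All numbers are illustrative assumptions, not measurements.
FLOPS_PEAK = 1.0e15          # ~1 PFLOP/s dense compute (hypothetical accelerator)
BW_PEAK = 3.0e12             # ~3 TB/s memory bandwidth

PARAMS = 7e9                 # hypothetical 7B-parameter dense model
bytes_per_step = PARAMS * 2              # fp16 weights streamed once per decode step
flops_per_token = 2 * PARAMS             # ~2 FLOPs per parameter per token

results = {}
for batch in (1, 64, 512):
    t_mem = bytes_per_step / BW_PEAK                   # amortized over the whole batch
    t_compute = batch * flops_per_token / FLOPS_PEAK   # grows with batch size
    results[batch] = "bandwidth" if t_mem > t_compute else "compute"

print(results)
```

With these numbers the crossover sits in the low hundreds: small batches are bandwidth-bound, large ones compute-bound, which is exactly why extra per-token compute eats into the batch size a provider can run.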
fudged71
This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?
jeffhwang
I'm glad I clicked through because I thought the article was about Mamba, the package manager I associate with Python (similar to conda).
I'm looking forward to the fifth iteration of this model.
robofanatic
> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.
Why can’t they simply say:
Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.
https://github.com/mamba-org/mamba
More here https://news.ycombinator.com/item?id=47423208
https://arxiv.org/abs/2603.15569