Mamba-3

257 points | 50 comments | 4 days ago
nl

I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.

roger_

Can anyone explain why Mamba models start with a continuous time SSM (and discretize) vs discrete time?

I know the step size isn't fixed, though I'm not sure why that's important. Is that the only reason? There also seems to be a parameterization advantage with the continuous formulation.
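For concreteness, here is a minimal sketch of what "discretizing" a continuous-time SSM means, assuming a diagonal transition matrix A (as in S4D/Mamba) and zero-order hold; all numeric values are made up and this is not Mamba's actual code:

```python
import numpy as np

# Continuous-time SSM: x'(t) = A x(t) + B u(t). With diagonal A the
# matrix exponential is an elementwise exp, so zero-order-hold (ZOH)
# discretization is cheap to recompute per step.
def discretize_zoh(A_diag, B, dt):
    # A_bar = exp(dt * A);  B_bar = A^{-1} (A_bar - I) B  (elementwise)
    A_bar = np.exp(dt * A_diag)
    B_bar = ((A_bar - 1.0) / A_diag) * B
    return A_bar, B_bar

# The step size dt enters only through discretization, so an
# input-dependent (per-token) dt just means redoing this each step:
A_diag = np.array([-1.0, -0.5])   # made-up stable eigenvalues
B = np.array([1.0, 1.0])
x = np.zeros(2)
for u, dt in [(1.0, 0.1), (0.5, 0.3)]:  # token-dependent step sizes
    A_bar, B_bar = discretize_zoh(A_diag, B, dt)
    x = A_bar * x + B_bar * u           # discrete recurrence step
```

One commonly cited benefit of this framing is exactly what's shown in the loop: the selectivity mechanism (varying dt per token) falls out of the continuous parameterization for free.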

Havoc

Is there a reason we don't switch halfway through? i.e. start with a classic LLM and switch to something linear like Mamba as the context grows

jychang

I'm not sure that I buy their conclusion that more compute during inference is good.

Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.

With a fused kernel, that means the GPU streams the tensors from VRAM and does a bunch of compute on different conversations in the batch at the same time.

If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this does mean each GPU can serve fewer users. Providers aren't normally leaving GPU cores idle during inference.

fudged71

This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?

jeffhwang

I'm glad I clicked through because I thought the article was about Mamba, the package manager I associate with Python (similar to conda).

https://github.com/mamba-org/mamba

manlymuppet

I'm looking forward to the fifth iteration of this model.

robofanatic

> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.

Why can’t they simply say -

Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.
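For readers put off by the jargon in the quoted paragraph, here is a minimal sketch of what the "MIMO" (multi-input, multi-output) part means for an SSM recurrence; the shapes and values are hypothetical, not Mamba-3's actual parameterization:

```python
import numpy as np

# SISO: each state channel consumes one input scalar per step.
# MIMO: each step mixes a whole input vector into the state via a
# matrix product, doing more work per step for the same state size.
rng = np.random.default_rng(0)
n, m = 4, 3                             # state size, input size (made up)
A = np.diag(rng.uniform(0.5, 0.9, n))   # stable diagonal transition
B = rng.standard_normal((n, m))         # multi-input projection
C = rng.standard_normal((m, n))         # multi-output readout

x = np.zeros(n)
for _ in range(5):
    u = rng.standard_normal(m)   # a token's input features
    x = A @ x + B @ u            # one MIMO recurrence step
    y = C @ x                    # multi-output per step
```

The state update is still a single matrix-vector recurrence, which is why (per the quote) the variant can add accuracy without slowing decoding.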
