These videos are worth a watch. There are tons of impressive moments, but they had me at the very first one where a woman says: "I'm going to tell you a story," and then pauses for a long, luxurious sip from a cup of coffee, and the model ... does nothing, just waits. Take my money.
Speaking of taking my money, what's the economic model for a company like this? They've published a fair amount about their architecture - enough that I imagine the frontier labs could implement it. Patents? Trade secrets? It's hard for me to understand how you'd beat the training compute and know-how at Anthropic/GOOG/oAI/Meta without some sort of legal protection.
I can't wait to see what these model architectures do with like 30-40% lower latency and more model intelligence. Very appealing. For reference, these look to be roughly 1/10 the size of the Opus 4.7 / GPT 5.x series -- 275B total parameters, 12B active. So there's lots of room to add intelligence, and lots of hope that we could see lower latency.
alyxya
The noteworthy things to me are that the architecture is a transformer that takes text, image, and audio input and produces text and audio output, all trained together, and that it works in near real-time by interleaving inputs and outputs rather than generating a complete response from a complete prompt.
> Time-Aligned Micro-Turns. The interaction model works with micro-turns continuously interleaving the processing of 200ms worth of input and generation of 200ms worth of output. Rather than consuming a complete user-turn and generating a complete response, both input and output tokens are treated as streams. Working with 200ms chunks of these streams enables near real-time concurrency of multiple input and output modalities.
That's probably the main thing that distinguishes it from the multimodal models from other frontier labs as far as I can tell.
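To make that micro-turn loop concrete, here is a minimal Python sketch of the interleaving described in the quoted passage. The `model`, `mic`, and `speaker` objects and their methods are hypothetical placeholders rather than the actual API; the only details taken from the post are the 200 ms chunk size and the idea that each micro-turn consumes one chunk of input and emits one chunk of output, which may be silence.

```python
CHUNK_MS = 200                      # micro-turn length described in the post
SAMPLE_RATE = 16_000                # assumed audio sample rate
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000

def duplex_loop(model, mic, speaker):
    """Continuously interleave 200 ms of input with 200 ms of output.

    `model`, `mic`, and `speaker` are hypothetical interfaces, not a real
    API: `mic.read(n)` returns n audio samples, `model.ingest(...)` folds
    them into the running context, `model.emit(...)` produces the next
    chunk of audio (possibly silence) plus any text, and `speaker.play(...)`
    plays it back.
    """
    state = model.initial_state()
    while True:
        # 1. Consume the next 200 ms of the user's audio as input.
        audio_in = mic.read(CHUNK_SAMPLES)
        state = model.ingest(state, audio_in)

        # 2. Generate the next 200 ms of output for the same wall-clock slot.
        audio_out, text_out = model.emit(state, duration_ms=CHUNK_MS)

        # 3. Silence is a valid output: the model can simply keep listening.
        speaker.play(audio_out)
        if text_out:
            print(text_out, end="", flush=True)
```

Under this framing, choosing not to speak is just another 200 ms output chunk, which is presumably what lets the model sit quietly through the coffee-sip pause in the first demo.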
rohitpaulk
Aside from how impressive the model is, the demos here are very well done! Quirky and short, unlike what we're used to from Anthropic and OpenAI.
tedsanders
Very cool! The demos felt fairly contrived - e.g., count things while I talk. I wonder what more useful or commercial applications look like.
lostathome
This looks similar to things people are already building locally with Gemma4 and TTS; just a bit fancier.
Local models will catch up soon.
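For contrast, the local setups that comment is describing are usually a cascaded pipeline: speech-to-text, then a local LLM, then text-to-speech, each stage run to completion before the next begins. Here is a rough sketch assuming commonly used components (openai-whisper, the Ollama Python client, and pyttsx3; the model names are placeholders for whatever is installed locally). The key difference from the interleaved micro-turn approach above is that nothing streams: the assistant can only respond after the user finishes talking.

```python
import whisper   # pip install openai-whisper
import ollama    # pip install ollama (requires a running Ollama server)
import pyttsx3   # pip install pyttsx3

def cascaded_turn(wav_path: str) -> str:
    # 1. Speech-to-text: transcribe the full user utterance first.
    stt = whisper.load_model("base")
    user_text = stt.transcribe(wav_path)["text"]

    # 2. LLM: generate a complete reply. "gemma" is a placeholder for
    #    whichever local Gemma variant is actually installed.
    reply = ollama.chat(
        model="gemma",
        messages=[{"role": "user", "content": user_text}],
    )["message"]["content"]

    # 3. Text-to-speech: only now does any audio come back to the user.
    tts = pyttsx3.init()
    tts.say(reply)
    tts.runAndWait()
    return reply
```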
abhik24
Very cool demo. I wonder what the billion-dollar applications of a thing like this would be.
nasreddin
Very cool tech. I think people are underrating how this will be used.
kburman
Simultaneous speech is the best part.
suriya-ganesh
Incredibly impressive demos. I wonder what the training data for these models looks like.
Is it separate batches of special "skills" added post-training? How can they guarantee the models won't eventually lose a skill?
emsign
That's neat and definitely the next step. But to be honest, I don't want an AI talking to me like that.
Nimitz14
Really really cool. If they can serve this efficiently it would disrupt a lot of things.
zuzululu
Am I the only person not impressed by this? It still feels awkward, with the pauses, and doesn't OpenAI already offer voice cadence?
modeless
This deserves to be at the top of HN, shame it seems like it's not going to make it. Some of the demos are hilarious. Clearly having the model appropriately choose when to speak is a major thing that has been missing from voice models to date. It seems like the latency is still a touch too high to be truly human-like though.