I struggle with these world models from the perspective of video games (so this post is a particular perspective).
I'm not a game developer myself, but some of my favorite games carry a deep sense of intentionality. In a FromSoftware game (or, more recently, Lies of P), there is typically not a single item misplaced; almost every object is placed deliberately.
Games which lack this intentionality often feel dead in contrast. You run into experiences which break immersion, or pull you out of the experience that the developer is trying to convey to you.
It's difficult for me to imagine world models getting to a place where this sort of intentionality is captured. The best frontier LLMs routinely fail at this in writing, and even in code, and the surface of experiences for those mediums often feels "smaller" than the user interaction profile of a video game.
It's not clear how these world models could be used modularly by humans hoping to develop intentional experiences? I don't know much about their usage (LLMs are somewhat modular: they can produce text, humans can work on it, other LLMs can work on it). Is the same true for the video output here?
All this to say, I'm impressed with these world models, but similar to LLMs with writing, it's not really clear what it is that we are building towards? We are able to create less satisfying, less humane experiences faster? Perhaps the most immediate benefit is the ability for robotic systems to simulate actions (by conjuring a world, and imagining the implications).
In general, I have the feeling that we are hurtling towards a world with less intentionality behind all the things we experience. Everything becomes impersonal, more noisy, etc.
jubilanti
Model weights coming "soon" == currently vaporware. So the weights aren't even open, how can this be "open-source"?
Everyone is right to be skeptical of this coming from a 2.8B model. Weights or it didn't happen.
w10-1
Gist:
> 720p, 1-min video generation with 6-DoF camera control
As nl said,
> The model is out here: https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_7...
README says "intended for research use only"
Code license is Apache 2.0
Model license (nvidia open...) says models are commercially usable and you are free to create and distribute Derivative Models.
(As usual: model output is unrestricted, and also unprotectable absent human authoring)
mejutoco
They all look like video games. I guess Unreal Engine is used to create synthetic data for training.
resist_futility
warning: the videos that autoplay on that page shot my download rate up to 350 Mbps
Fischgericht
So, where is the download? I can't find it on GitHub, and on your web page the download button is disabled.
Also, will this run on RTX 4090 with 24GB memory?
Thank you!
agentifysh
Running this on a GPU is quite impressive. I see some people expressing discontent and worries, but we are early and this is the worst it's going to be. I am very excited to see the impact this will have on games
alloyed
silly question: what's "world" about what's being generated here? is there an actual abstract representation of physical space (like, e.g., a game-engine style scene graph?) or does it just mean "this video generator is more physically coherent than other video generators"
Incipient
Outputting video of that quality/consistency at 1 minute, for a 2.6B model seems insane?
pferdone
The first video, with the guy walking up the mountain in snow, has consistency issues with the cave entrance. Which is "expected" at this model size?!
joenot443
What's the long-term utility of world models?
There's no doubt they're technically impressive, but what does one do with them?
mkl
2.6B, but then:
> A dedicated 17B long-video refiner sharpens texture, motion, and late-window quality on top of the long-rollout backbone.
bobkb
The trouble is the lack of training data available to these models compared to ones like Seedance and Kling, which seem to be tapping into their unlimited video inventory. Models like LTX are technically good, but they struggle when it comes to slightly different camera movements or the subject interacting with objects. For a recent example, we had to use sample videos generated by closed-source models and then use those for the final video.
PyWoody
I tried watching the cave video and I was immediately overcome with nausea. I've never experienced anything like that before in my life. Wild.
I can't say I'm looking forward to an AI video future.
maxignol
I can't seem to grasp why everyone says only slop gets produced by AI models (and particularly these world models).
Imo it's shit in -> shit out. Great work can be achieved using them. Slop gets produced by careless users.
bilsbie
What would it take to get this on VR? Anyone looking into it?
CommanderData
All video models are terrible at consistency. Even closed-source ones.
Seedance 2.0 and Kling 3 are regarded as the best closed-source video models we have. I've subscribed to a few AI video subreddits; the consensus atm is they are good for anything but long-form videos with humans.
No surprise that we're very good at spotting even the most subtle differences when looking at other people.
trunkiedozer
It ain’t open source until it’s released.
It’s baitware.
agus4nas
Has anyone actually tested this for robotics simulation? Curious how it handles edge cases in physical environments.
ionwake
i survived flash, jquery, svn, soap, xml, microservices and crypto
now some norwegian teenager is generating netflix-quality worlds during lunch break from a jpeg of a forest
EDIT> dont ask how I came up with this quote
yieldcrv
Really great for visuals during a DJ set at a festival, or on YouTube
sebringj
i see this and think about Suno's playbook and where this could go... survival of the fittest rules the boards, where you have user-generated dynamic video games, not just static ones where the design is fixed. the design will be adaptive: based off several prompt input boxes for various things, ad hoc while playing, higher-tier design boards, and the like. this is all going toward user-gen commercial / vanity / personal enjoyment.
agus4nas
Incredible results
utopiah
Nice, now instead of just reading slop you'll soon be able to experience slop Worlds, in 3D! /s
It's honestly impressive, on the surface. The visuals are gorgeous... but it's still empty. What makes a "World" a world is precisely its coherence. It's not about how it looks but rather how it "works". The plants in an ecosystem are a certain way because of the available resources, all the way down to forces like gravity. It doesn't just "look" like that. To echo Konrad Lorenz, a fish doesn't just swim in the water; rather, the fish IS an efficient representation of the water it lives within. Here, in such "worlds", there is nothing happening. There is minimal superficial coherence, no logic, nothing.
The ultimate liminal spaces.
ugly slop