aew2120

Interestingly, a small company called Ogma already did something very similar back in 2021 (on an embedded system, no less). This (https://ogma.ai/2021/07/unsupervised-behavioral-learning-ubl...) is a description/video of how they got a small RC car to predict the next frame of its video feed given the action it was about to take, and thereby made the car navigate to a given location when fed with a still frame of that location (all this with online learning, and no backprop).

Instead of VICReg, they induced their latent state with sparse auto-encoding. They also predicted in pixel space, as opposed to latent space. The white paper describing their tech is a bit of a mess, but schematically, at least, the hierarchical architecture they describe bears a strong resemblance to the hierarchical JEPA models LeCun outlined in his big paper from a few years ago. A notable difference, though, is that their thing is essentially a reflex agent, as opposed to possessing a planning/optimization loop.
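Schematically (with names and dimensions I'm making up, and ignoring that they do all of this online without backprop), a single step would look something like this in PyTorch:

    # Sketch of an Ogma-style step, as I read their write-up: encode the current
    # frame into a sparse latent, then predict the next frame in pixel space
    # conditioned on the action the car is about to take. All names/dims are mine.
    import torch
    import torch.nn as nn

    class SparseEncoder(nn.Module):
        def __init__(self, frame_dim=1024, latent_dim=256, k=16):
            super().__init__()
            self.enc = nn.Linear(frame_dim, latent_dim)
            self.k = k  # keep only the k most active units (hard sparsity)

        def forward(self, frame):
            h = self.enc(frame)
            top = torch.topk(h, self.k, dim=-1)
            return torch.zeros_like(h).scatter(-1, top.indices, top.values)

    class ActionConditionedDecoder(nn.Module):
        def __init__(self, latent_dim=256, action_dim=2, frame_dim=1024):
            super().__init__()
            self.dec = nn.Linear(latent_dim + action_dim, frame_dim)

        def forward(self, latent, action):
            return self.dec(torch.cat([latent, action], dim=-1))

    encoder, decoder = SparseEncoder(), ActionConditionedDecoder()
    frame = torch.rand(1, 1024)            # flattened camera frame
    action = torch.tensor([[0.3, -0.1]])   # e.g. steering, throttle
    pred_next = decoder(encoder(frame), action)
    loss = nn.functional.mse_loss(pred_next, torch.rand(1, 1024))  # vs. the actual next frame

The top-k step here is just a crude stand-in for whatever sparse coding scheme they actually use.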

TheAceOfHearts

> With these visual subgoals, V-JEPA 2 achieves success rates of 65% – 80% for pick-and-placing new objects in new and unseen environments.

How does this compare with existing alternatives? Maybe I'm just lacking proper context, but a failure rate of at least 20% sounds pretty bad? The paper compares their results with older approaches, which apparently had something like a 15% success rate, so jumping to 65–80% does seem significant. If I'm reading the paper correctly, the time required to compute and execute each action also went down from about 4 minutes to 16 seconds, which seems significant as well.

Having to specify an end goal as an image seems pretty limited, but at least the authors acknowledge it in the paper:

> Second, as mentioned in Section 4, V-JEPA 2-AC currently relies upon tasks specified as image goals. Although this may be natural for some tasks, there are other situations where language-based goal specification may be preferable. Extending the V-JEPA 2-AC to accept language-based goals, e.g., by having a model that can embed language-based goals into the V-JEPA 2-AC representation space, is another important direction for future work. The results described in Section 7, aligning V-JEPA 2 with a language model, may serve as a starting point.

I think it would be interesting if the authors answered whether they think there's a clear trajectory towards a model that can be trained to achieve a >99% success rate.

siavosh

Does someone know how the "semantic" embeddings are learned? That seems like perhaps the main technical challenge here.

fidotron

You have to wonder whether the model is going to end up recreating Verlet integration in there somewhere, or whether it's generating a pile of those optical acceleration cancellation-type heuristics in neural-net form.

It's one of those ideas I've had kicking around for a while: if you fused decent object tracking with an understanding of Verlet integration, you should, in principle, be able to measure all sorts of physical quantities quite easily.
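For concreteness, a toy sketch of what I mean (my own, nothing to do with V-JEPA 2): position Verlet steps a trajectory forward from accelerations, and running the same relation backwards over three consecutive tracked positions recovers the acceleration.

    # Position Verlet: x_{n+1} = 2*x_n - x_{n-1} + a*dt^2, and its inverse
    # applied to a tracked trajectory to estimate acceleration.
    import numpy as np

    def verlet_step(x_prev, x_curr, accel, dt):
        return 2 * x_curr - x_prev + accel * dt**2

    def accel_from_track(x_prev, x_curr, x_next, dt):
        return (x_next - 2 * x_curr + x_prev) / dt**2

    dt = 1 / 30                                        # e.g. 30 fps video
    x = [np.array([0.0, 0.0]), np.array([0.1, 0.05])]  # two tracked positions
    g = np.array([0.0, -9.81])
    for _ in range(5):
        x.append(verlet_step(x[-2], x[-1], g, dt))     # simulate a falling object

    print(accel_from_track(x[0], x[1], x[2], dt))      # recovers ~[0, -9.81]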

cubefox

I think the fundamental idea behind JEPA (not necessarily this concrete Meta implementation) will ultimately be correct: predicting embeddings instead of concrete tokens. That's arguably what animals do. Next-token prediction (a probability distribution over the possible next tokens) works well for the discrete domain of text, but it doesn't work well for a continuous domain like video, which would be needed for real-time robotics.

For text, a two-byte tokenizer gives you 2^16 (65,536) possible next tokens, and computing a probability distribution over them is very much doable. But the number of possible next frames in a video feed is astronomically larger: if one frame is 1 megabyte uncompressed (instead of just 2 bytes for a text token), there are 2^(8*2^20) possible next frames, which is far too many. So we somehow need to predict only an embedding of the next frame, i.e., roughly how the next frame of the video feed will look.
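To put rough numbers on that (dimensions below are illustrative, not from the paper): the text case is an ordinary softmax head, while the video case has to become a regression onto a fixed-size embedding of the next frame.

    import torch
    import torch.nn as nn

    hidden = 1024

    # Text: a 2-byte vocabulary means a 2^16-way classification per step.
    vocab_head = nn.Linear(hidden, 2**16)
    next_token_logits = vocab_head(torch.rand(1, hidden))      # (1, 65536)

    # Video: a distribution over raw 1 MB frames would need 2^(8 * 2^20)
    # outcomes, so instead predict a fixed-size embedding of the next frame.
    embed_dim = 1024
    frame_head = nn.Linear(hidden, embed_dim)
    pred_embedding = frame_head(torch.rand(1, hidden))         # (1, 1024)
    target_embedding = torch.rand(1, embed_dim)                # from some frozen encoder
    loss = 1 - nn.functional.cosine_similarity(pred_embedding, target_embedding).mean()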

Moreover, for robotics we don't want to predict just the next (approximate) frame of a video feed. We want to predict future sensory data more generally. That's arguably what animals, including humans, do. We constantly anticipate, approximately, what will happen to us in the future, with the farther future predicted progressively less exactly. We are relatively sure of what happens in a second, but less and less sure of what happens in a minute, a day, or a year.

rar00

the robot arm demonstration video jumps at the 0:28 mark...

artificialprint

Throw ARC-AGI 2 at it!

jcelerier

> That kind of physical intuition isn’t something adults obtain after years of education—young children develop this intuition by observing the world around them before they can even speak in full sentences.

I mean, it still takes them much more time than it takes to train even the largest LLMs we use (a couple of months).

nlitened

I imagine that Russian-speaking team members had fun with naming the model V-JEPA

iLoveOncall

"World model" and "physical reasoning" is such a lie.

Those models don't have any understanding of physics, they just regurgitate what they see in their vision-based training set, just like any image or video generation model does.

Monkey see other monkey cannot go through wall, monkey don't try go through wall.

momojo

Why is Meta investing into this research? What's the potential payoff?

ldjkfkdsjnv

Leadership at Meta is dropping the ball with these non-LLM AI model side quests.
