The First Fully General Computer Action Model

159 points | 54 comments | 2 days ago
segmondy

Nice, I have always felt the computer was the ultimate environment and screen capture the ultimate training data. Nice to see it in practice; now we have to wait and see whether folks will argue about whether your model could really learn a world model. I'm surprised this post doesn't have more comments; their site is worth checking out. Rooting for them, they are gritty. Check out their storage buildout story.

mcint

Congratulations! I’ll be interested to see the next steps in alignment. Do you plan to start selling access, or collect more data to train bigger & better? What tasks or benchmarks are your biggest guide stars, or what was unexpectedly tricky—a few are hinted in the post.

It would be pretty interesting to see activation maps for the encoder on video, confidence building to see the compression derived from so much training.

nee1r

Hey guys! I’m Neel, been holed up in our South Park office for the past year working on model training. Excited to share our research!

This is a preview of a very different type of computer use model: we train on the internet. Specifically, we have 11 million hours of computer video stored on our storage cluster (previously shared https://news.ycombinator.com/item?id=45438496 !) and the model can work at 30 FPS. Since we match the fundamental form factor of computer use, we can get our model to do CAD, browse websites, and even drive a car using arrow keys. I’m super excited to see what our model can do as we scale more; it's a fun frontier to work on (not language models :) ).

The team and I will be online responding to the comments, so drop any questions.

kylenessen

This seems like really great research, and the first time I’ve seen overwhelming praise on HN. Congrats!

I wanted to comment though that your title is not doing you any favors, and I suspect that is why this is not getting more traction (which it deserves). I fully expected some half baked GitHub repo, but instead found something truly awesome.

To use your own words, Neel, “a very different type of computer use model” would have had me clicking faster. I’m not great at titles, however, and maybe there are better ideas out there.

Anyway, can’t wait to see how this develops! Especially looking forward to the CAD work.

clemvonstengel

I really liked the point about ctrl-c only being able to be labelled retrocausally. I do think that with enough past context you should be able to know what was copied (in some sense the past does encode the future), but an agentic decision is precisely the kind where the future is more informative than the past for reconstructing that decision.

It does make me wonder if you should have the inverse dynamics model split into specifically retrocausal and causal. You kind of do this already with the inverse and forward dynamics model, but the idea of a model that knows only about the future training in a feedback loop with a model that knows only about the past is kind of interesting.

I think you could just do a clever masking regime in your diffusion model to achieve the same effect without a whole architecture change.
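
One possible reading of "masking regime" here is an attention mask that limits which frames each action position can see. The sketch below is only an illustration of that reading, not the method from the post; the function name `attention_mask` and the mode names are hypothetical.

```python
def attention_mask(n, mode):
    """Build an n x n boolean mask; mask[i][j] is True if position i may attend to position j.

    Hypothetical helper for illustration only.
    """
    if mode == "causal":
        # Position i sees itself and the past (j <= i).
        return [[j <= i for j in range(n)] for i in range(n)]
    if mode == "retrocausal":
        # Position i sees itself and the future (j >= i).
        return [[j >= i for j in range(n)] for i in range(n)]
    if mode == "full":
        # Non-causal: every position sees every other position.
        return [[True] * n for _ in range(n)]
    raise ValueError(f"unknown mode: {mode}")


causal = attention_mask(3, "causal")
retro = attention_mask(3, "retrocausal")
```

Under this reading, a single model could be trained with a randomly chosen mode per batch, so the "past-only" and "future-only" views share weights instead of living in two architectures.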

theredsix

This is one of those Hacker News posts that you stumble upon and see 2 genius ideas within the span of as many paragraphs. Thanks again for sharing the diffusion-based labeling algorithm. Truly demonstrates a mastery and understanding of what diffusion is capable of.

cs702

At first glance, this looks incredible to me. The authors train one model on 40K hours of computer-use video, previously labeled by contractors with keyboard and mouse actions, then use that model, in effect, to label 11M hours of computer-use video, which they use to train the computer-action model. The key advance is in compression. Quoting from the OP:

> [previous models] burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder.

While I was already aware that there are people working on new, more efficient "world models," this is the first one I've seen in action. I'm a bit in shock at how good it is, quite frankly.

I've added the OP, as well as a related 2018 paper on Behavioral Cloning from Observation (BCO), to my reading list.[a] So far, I've only skimmed the 2018 paper, but it's already evident that it's well-written. I'm no expert in deep RL, and I can understand it. BTW, "Behavioral Cloning from Observation" is a really good name, with an easy-to-remember acronym.

Thank you for sharing this on HN.

[a] https://arxiv.org/abs/1805.01954
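
To put the quoted compression figures in per-frame terms, here is a quick back-of-the-envelope calculation. It assumes the "million tokens for one minute" figure is the baseline for the 100x claim, which the quote does not state explicitly; under that assumption the same budget covers 100 minutes, consistent with "nearly 2 hours". The function name is hypothetical.

```python
FPS = 30                  # frame rate quoted in the post
TOKEN_BUDGET = 1_000_000  # "a million tokens"


def tokens_per_frame(tokens, minutes, fps=FPS):
    """Average tokens spent per frame for a clip of the given length."""
    frames = minutes * 60 * fps
    return tokens / frames


# Baseline from the quote: one minute of 30 FPS video per million tokens.
baseline = tokens_per_frame(TOKEN_BUDGET, minutes=1)

# Assumption: a 100x gain over that baseline means 100 minutes per million tokens.
compressed = tokens_per_frame(TOKEN_BUDGET, minutes=100)

print(f"baseline:   {baseline:.1f} tokens/frame")
print(f"compressed: {compressed:.2f} tokens/frame")
```

On these assumptions the baseline works out to roughly 556 tokens per frame, and the compressed encoding to between 5 and 6 tokens per frame.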

alyxya

This looks extremely impressive, really deserves more attention here.

Are the inverse dynamics and forward dynamics models trained separately? It sounds like if the inverse dynamics model is meant to extrapolate more training data, then perhaps all that means is it takes very little data to generalize directly with the forward dynamics model assuming the right architecture.

nextzck

I think you guys are on the right track here. I’d love to learn more about the math behind the FDM. I don’t think folks realize how behind we are on vision. Thank you for your work here.

ripped_britches

Looks extremely impressive! Genuine question: why are you sharing your methods openly? I am grateful for it, but just curious about your motivations.

vessenes

dammmmmmnnnn - lots to like here. I'm impressed with the 80,000 parallel website fuzzing desktops. And the 30 Hz (everything). Amazing.

aakashks

The video compression is very cool. And the small tricks like binning the mouse movements.

Wonder how much data is generalizable across different UIs? I.e., how good will the model be at using Figma if it’s never seen it before but has seen a lot of Photoshop?
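
For readers unfamiliar with the binning trick mentioned above, here is a minimal sketch of how mouse deltas could be quantized into a small discrete vocabulary. The bin edges, the function names, and the token layout are all assumptions for illustration; the post's actual scheme may differ.

```python
from bisect import bisect_right

# Hypothetical magnitude edges in pixels: fine near zero, coarse for large jumps.
MAG_EDGES = [1, 4, 16, 64]
BINS_PER_AXIS = 2 * len(MAG_EDGES) + 1  # negative bins + zero bin + positive bins


def bin_axis(delta):
    """Map a signed pixel delta on one axis to a bin index in [0, BINS_PER_AXIS)."""
    magnitude_bin = bisect_right(MAG_EDGES, abs(delta))  # 0 means "no movement"
    signed = magnitude_bin if delta > 0 else -magnitude_bin
    return signed + len(MAG_EDGES)  # shift so the zero bin sits in the middle


def mouse_token(dx, dy):
    """Combine the two axis bins into a single token id."""
    return bin_axis(dx) * BINS_PER_AXIS + bin_axis(dy)


print(mouse_token(0, 0), mouse_token(5, -20))
```

Log-spaced edges like these keep precision for small cursor adjustments while still covering large jumps with only 81 tokens in total, at the cost of losing exact pixel distances for big moves.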

user-

Really really cool. I appreciate the article style a lot too.

piva00

Just wanted to say: this is mighty impressive research.

Really interesting breakdown, proper nerdsniped into this, thanks for the refreshing AI news outside of language models :)

sp1nningaway

May I suggest a driving demo in a parking lot with a mannequin instead of a real world video where it drives way too close to a pedestrian?

Otherwise, very cool and exciting!

rio_popper

Curious about the masked diffusion IDM choice. They mention CTC loss and cross-entropy both underperformed — I'd love to see ablations on that. The claim that typos were "extremely common" with non-causal cross-entropy is interesting but hand-wavy without numbers.

ennucore

The car thing is very impressive. By the way, do you have plans to handle the computer’s audio output?

ClaireBookworm

What sort of fine-tuning data was needed to allow the model to self-drive? One hour of video of someone driving, or extra labeling?

wasmainiac

Can it defeat captchas?

bananzamba

Very impressive stuff!

Can you prompt it or is it strictly Copilot-style prediction?

kdrag0n

What tasks can the model do out of the box? Was each of the examples a different fine-tuned model?

LorenDB

Nice that it can drive a car, but you could just use openpilot.

ennucore

How do you tokenize the mouse inputs?

bitwize

Looks like it's playing the special stages from Knuckles' Chaotix?

152334H

holy crap, this is so good. How did it get buried?

Obscura-

Amazing!

akoboldfrying

My tech-informed but ML-ignorant take: This will soon be the biggest thing since ChatGPT.