NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

95 points13 comments5 hours ago

linolevan

There was this very interesting paper out of Stanford this last September about pretraining under the unlimited compute but limited data paradigm[0]. Pretty much exactly the same thing but with ~200M training tokens instead.

[0] https://www.alphaxiv.org/abs/2509.14786

show comments

kseniamorph

Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?

show comments

archermarks

Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset? i.e. instead of generalizing you lean more toward memorization? Obviously you leave out a validation set but since you're meta-optimizing the model itself by its performance on the validation dataset you're still at risk of over-fitting.

show comments

lzaborowski

I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.

If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.

suddenlybananas

Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and see how this challenge differs.

show comments

navvyeanand

Amazing job!

riajain2525

Super cool!