jjcm

1 bit per weight with an FP16 scale factor every 128 bits. Fascinating that this works so well.
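
For intuition, a scheme like the one described (this is my reconstruction, not their actual code; the group size of 128 matches the comment, but using the mean absolute value as the scale is an assumption) can be sketched in NumPy:

```python
import numpy as np

def quantize_1bit_g128(w, group=128):
    """Pack float weights into 1 sign bit each, plus one FP16 scale per group."""
    n = len(w)
    pad = (-n) % group
    w = np.pad(w, (0, pad)).reshape(-1, group)
    # One FP16 scale per group of 128 weights (assumed: mean absolute value)
    scales = np.abs(w).mean(axis=1).astype(np.float16)
    # 1 bit per weight: just the sign
    bits = np.packbits((w > 0).astype(np.uint8), axis=1)
    return bits, scales, n

def dequantize_1bit_g128(bits, scales, n, group=128):
    """Expand sign bits back to +/-1 and multiply by each group's scale."""
    signs = np.unpackbits(bits, axis=1)[:, :group].astype(np.float32) * 2 - 1
    w = signs * scales.astype(np.float32)[:, None]
    return w.reshape(-1)[:n]
```

Storage works out to 128 + 16 = 144 bits per 128 weights, i.e. 1.125 bits per weight.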

I tried a few things with it. Got it driving Cursor, which in itself was impressive - it handled some tool usage. Via Cursor I had it generate a few web page tests.

On a Monte Carlo simulation of pi, it got the logic correct but failed to build an interface to start the test. Requesting changes mostly worked, but left behind some symbols that caused things to fail, so it required a bit of manual editing.

Tried a Simon Willison pelican as well - very abstract, not recognizable at all as a bird or a bicycle.

Pictures of the results here: https://x.com/pwnies/status/2039122871604441213

There doesn't seem to be a demo link on their webpage, so here's a llama.cpp running on my local desktop if people want to try it out. I'll keep this running for a couple hours past this post: https://unfarmable-overaffirmatively-euclid.ngrok-free.dev

nl

I ran my custom agentic SQL debugging benchmark against it and I'm impressed.

Results: 8 passed, 0 failed, 17 errored out of 25

That puts it right between Qwen3.5-4B (7/25) and Nanbeige4.1-3B (9/25), for example, but it took only 200 seconds for the whole test. Qwen3.5 took 976 seconds and Nanbeige over 2000 (although both of those ran on my 1070, so not quite the same hardware).

Granite 7B 4bit does the test in 199 seconds but only gets 4/25 correct.

See https://sql-benchmark.nicklothian.com/#all-data (click on the cells for the trace of each question)

Errors are bad tool calls (vs. failures, which are incorrect SQL).

I used @freakynit's runpod[1] (thanks!)

[1] https://news.ycombinator.com/item?id=47597268

simonw

You can run this model on an iPhone via the latest update to this Locally AI app: https://apps.apple.com/us/app/locally-ai-local-ai-chat/id674...

For its size (1.2GB download) it's very impressive.

Here's a pelican it drew me running on my phone - the SVG comments are good, the image not so much: https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...

freakynit

Open access for the next 5 hours (8B model, running on an RTX 3090), or until the server crashes or this spot instance gets taken away :) =>

https://ofo1j9j6qh20a8-80.proxy.runpod.net

  ./build/bin/llama-server \
   -m ../Bonsai-8B.gguf \
   -ngl 999 \
   --flash-attn on \
   --host 0.0.0.0 \
   --port 80 \
   --ctx-size 65500 \
   --batch-size 512 \
   --ubatch-size 512 \
   --parallel 5 \
   --cont-batching \
   --threads 8 \
   --threads-batch 8 \
   --cache-type-k q4_0 \
   --cache-type-v q4_0 \
   --log-colors on
The server can serve 5 parallel requests, with each request capped at around `13K` tokens...
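
For anyone wondering where the ~13K figure comes from: llama-server splits the context window across its parallel slots, so with the flags above it's (assuming no per-slot overhead) just:

```python
# Flags from the llama-server command above
ctx_size = 65500   # --ctx-size
parallel = 5       # --parallel

per_slot = ctx_size // parallel
print(per_slot)  # 13100 tokens per concurrent request
```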

A few benchmarks I did:

1. Input: 700 tokens, ttfs: ~0 seconds, output: 1822 tokens at ~190 t/s

2. Input: 6400+ tokens, ttfs: ~2 seconds, output: 2012 tokens at ~135 t/s

Vram usage was consistently at ~4GiB.

wild_egg

Don't have a GPU so tried the CPU option and got 0.6t/s on my old 2018 laptop using their llama.cpp fork.

Then found out they didn't implement AVX2 for their Q1_0_g128 CPU kernel. Added that and now I'm getting ~12 t/s, which isn't shabby for this old machine.
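
For the curious, the inner loop being vectorized here is essentially a sign-flip dot product. A scalar reference in NumPy (my reconstruction, not their kernel; names and layout are assumptions) looks like:

```python
import numpy as np

def dot_q1_g128(bits, scales, x, group=128):
    """Dot product of 1-bit (sign) weights against float activations.

    bits:   packed sign bits, one row of group/8 bytes per group
    scales: one float scale per group
    x:      activations, length n_groups * group

    An AVX2 kernel would do the same work many lanes at a time,
    using the sign bits as masks instead of multiplying by +/-1.
    """
    signs = np.unpackbits(bits, axis=1)[:, :group].astype(np.float32) * 2 - 1
    x = x.reshape(-1, group)
    return float((scales * (signs * x).sum(axis=1)).sum())
```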

Cool model.

WaterRun

Feels a bit like gradually moving back toward analog circuits, step by step. There is less and less need for the precision that digital circuits provide.

alyxya

I expect large machine learning models to trend toward operating on bits rather than floats. There's a lot of inefficiency in floats: weights are typically something like normally distributed, so most values cluster in a small range, which makes storing and computing with them wasteful. The foundation of neural networks may be rooted in real-valued functions, which are simulated with floats, but float operations are just bitwise operations underneath. The only issue is that GPUs operate on floats and standard ML theory works over real numbers.

drob518

I’m really curious how this scales up. Bonsai delivers an 8B model in 1.15 GB. How large would a 27B or 35B model be? Would it still retain the accuracy of those large models? If the scaling holds, we could see 100+B models in 64 GB of RAM.
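
Back-of-the-envelope, assuming 1 sign bit per weight plus an FP16 scale per 128 weights applied to every parameter (in practice some layers usually stay higher precision, which would explain the gap to the observed 1.15 GB):

```python
def size_gb(params_billions):
    # 1 sign bit + 16/128 bits of scale overhead = 1.125 bits per weight
    bits_per_weight = 1 + 16 / 128
    # params * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billions * bits_per_weight / 8

for p in (8, 27, 35, 100):
    print(f"{p}B -> ~{size_gb(p):.2f} GB")
```

By this estimate a 27B model is under 4 GB, 35B under 5 GB, and 100B around 14 GB - so 100+B in 64 GB of RAM would hold with plenty of room for KV cache, if accuracy scales too.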

druskacik

The 8B model response to my "Harry Potter knowledge-bench" question is too funny not to share.

> *Fathers of Harry and James Potter*: - Sirius Black is the *father* of *James Potter* (the older brother of Harry).

> - James Potter is *Harry's uncle* and the *older brother* of *Luna Lovegood*.

> - This means *Sirius and James are Harry's uncles*, though they are *father and brother*.

https://pastebin.com/WAAmFKfX

fxwin

I'm very skeptical of the advantage they're claiming here. The whitepaper [0] only compares these to full precision models, when the more interesting (and probably more meaningful) comparison would be with other quantized models with a similar memory footprint.

Especially considering that these models seem to more or less just be quantized variants of Qwen3 with custom kernels and other inference optimizations (?) rather than fine-tuned or trained from scratch with a new architecture, I am very surprised (or rather, suspicious) that they didn't do the obvious comparison with a quantized Qwen3.

Their (to my knowledge) new measure/definition of intelligence seems reasonable, but introducing something like this without thorough benchmarking + model comparison is even more of a red flag to me.

[0] https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-b...

vx_r

Eagerly waiting for mlx to merge 1bit quantization pr to try this out.

andai

Does anyone know how to run this on CPU?

Do I need to build their llama.cpp fork from source?

Looks like they only offer CUDA builds on the release page; I think those might support CPU mode, but they refuse to even run without CUDA installed. Seems a bit odd to me, I thought the whole point was supporting low-end devices!

Edit: 30 minutes of C++ compile time later, I got it running. Although it uses 7GB of RAM and then hangs at "Loading model". I thought this thing was less memory hungry than 4-bit quants?

Edit 2: Got the 4B version running, but at 0.1 tok/s, and the output seemed to be nonsensical. For comparison, on the same machine I can run the Qwen 3.5 4B model (at 4-bit quant) correctly and about 50x faster.

_fw

What’s the trade-off? If it’s smaller, faster and more efficient - is the performance worse? A layman here, curious to know.

Archit3ch

Doesn't Jevons paradox dictate larger 1-bit models?

ycui1986

I hope someone does a 100B-parameter 1-bit model. That should fit into most 16GB graphics cards. Local AI, democratized.

plombe

Interesting post. Curious to know how they arrived at intelligence density = negative log of the model's error rate, divided by the model size.
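
As stated, the measure would look something like the following (the units are my guess; the whitepaper may use a different log base or size unit):

```python
import math

def intelligence_density(error_rate, size_gb):
    # -log(error) rewards accuracy; dividing by size penalizes bulk
    return -math.log(error_rate) / size_gb

# Hypothetical numbers: 20% error in 1.15 GB vs 10% error in 16 GB
print(intelligence_density(0.20, 1.15))
print(intelligence_density(0.10, 16.0))
```

With these made-up numbers the small model scores far higher despite being less accurate, which is presumably the point of the metric: it favors accuracy per gigabyte, not accuracy alone.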

andai

The site says 14x less memory usage. I'm a bit confused about that situation. The model file is indeed very small, but on my machine it used roughly the same RAM as 4 bit quants (on CPU).

Though I couldn't get actual English output from it, so maybe something went wrong while running it.

kent8192

Oh, boy. This good tool hates my LM Studio... The following message appears when I run Bonsai in my LM Studio. I think my settings have done something wrong.

  Failed to load the model
  Error loading model. (Exit code: null). Please check the settings and try loading the model again.

afaik69

Any tutorial on how to run this on Linux, CPU only?

syntaxing

Super interesting, building their llama.cpp fork on my Jetson Orin Nano to test this out.

bilsbie

I can’t see how this is possible. You’re losing so much information.

ggamezar

Misses a comparison with Qwen 3.5, though it mentions Qwen 3. Is there a reason why?

robonot

Really impressive for the size. Curious to see what happens when someone trains a 100B+ model natively at 1-bit.

keyle

Extremely cool!

Can't wait to give it a spin with Ollama; if Ollama could list it as a model, that would be helpful.

ariwilson

Very cool and works pretty well!

est

Is this somewhat similar to Microsoft's BitNet?

yodon

Is Bonsai 1 Bit or 1.58 Bit?
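
For context on the terminology: 1.58-bit (as in Microsoft's BitNet b1.58) refers to ternary weights {-1, 0, +1}, which carry log2(3) bits of information each, whereas a true 1-bit scheme is binary {-1, +1}:

```python
import math

# Information content per weight for each quantization alphabet
print(math.log2(3))  # ternary {-1, 0, +1}: ~1.585 bits
print(math.log2(2))  # binary  {-1, +1}:    exactly 1 bit
```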

wshell

What would be a good TTS to run with this?

stogot

What is the value of a 1 bit? For those that do not know.

simian1983

Ran this prompt: Share an interesting scientific discovery or historical fact that most people don't know about. Please explain why it's significant and its impact on our understanding.

Got this nonsense answer: Interesting Scientific Discovery: The Discovery of the "Boson" in 1924 by Charles K. Alton ("Chuck") Alton

What it is: In 1924, American physicist Charles K. Alton discovered the "boson", a particle that would later be named after the Italian physicist Enrico Fermi. This discovery was made while he was working at the University of Chicago, where he was studying the properties of neutrons and protons.

Why it's significant: The boson Alton discovered was the meson, a subatomic particle that plays a crucial role in the strong nuclear force that holds atomic nuclei together. However, Alton's original work was more focused on the neutron, and he mistakenly identified the boson as a neutron—a common mistake at the time. This was not a scientific error, but rather a misunderstanding of the data. His work, however, laid the groundwork for the later discovery of mesons by Enrico Fermi and Richard Feynman, which was a major breakthrough in the field of particle physics.

Impact on Understanding:

• Mesons are now known to be the carriers of the strong force, which is one of the four fundamental forces of nature. • This discovery helped physicists understand the structure of the nucleus and the nature of nuclear binding. • It was a key step in the development of quantum chromodynamics (QCD), which describes the strong force at the subatomic level. • Alton's work is often overlooked in the history of particle physics, yet it was a critical early step in the field.

Why it's interesting: This story highlights how even early scientific discoveries can be misinterpreted or overlooked—sometimes not because of a lack of intelligence, but due to the complexity of the data and the evolving nature of scientific understanding. It also shows how one scientist's work can be built upon by others, leading to major breakthroughs.

Bonus: Alton was not the only person involved. His work was built upon earlier research by Edward Teller and Leopold Rabi, and it was ultimately Fermi who discovered the meson in 1932 using the "storm cloud" method, where he used a cloud chamber to observe the effects of high-energy particles.

So, while the boson Alton discovered was initially misidentified, it was a crucial piece of the puzzle in the development of modern particle physics.

marak830

It's been a hell of a morning for llama heads - first this, then the claude drop and turboquant.

I'm currently setting this one up; if it works well with a custom LoRA on top, I'll be able to run two at once for my custom memory management system :D

OutOfHere

How do I run this on Android?

hatthew

I feel like it's a little disingenuous to compare against full-precision models. Anyone concerned about model size and memory usage is surely already using at least an 8-bit quantization.

Their main contribution seems to be hyperparameter tuning, and they don't compare against other quantization techniques of any sort.