Ternary Bonsai: Top Intelligence at 1.58 Bits

157 points | 46 comments | 3 days ago
freakynit

Open access for the next 5 hours (Ternary-Bonsai-8B-Q2_0.gguf, running on an RTX 3090), or until the server crashes or this spot instance gets taken away :) =>

https://uklkyvetsjf7qt-80.proxy.runpod.net

    ./build/bin/llama-server \
     -m ../Ternary-Bonsai-8B-Q2_0.gguf \
     -ngl 999 \
     --flash-attn on \
     --host 0.0.0.0 \
     --port 80 \
     --ctx-size 65500 \
     --batch-size 512 \
     --ubatch-size 512 \
     --parallel 5 \
     --cont-batching \
     --threads 8 \
     --threads-batch 8 \
     --cache-type-k q8_0 \
     --cache-type-v q8_0 \
     --log-colors on
# llama.cpp is a forked one: https://github.com/PrismML-Eng/llama.cpp.git

# The server can serve 5 parallel requests, with each request capped at around `13K` tokens.
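The ~13K cap falls out of how llama.cpp splits the context window: with `--parallel 5`, the total `--ctx-size` is divided evenly across slots. A quick sanity check (a sketch of the arithmetic, not llama.cpp's actual code):

```python
# llama.cpp gives each parallel slot an equal share of the total context,
# so per-request context is roughly ctx_size / parallel.
ctx_size = 65500   # --ctx-size from the launch command above
parallel = 5       # --parallel

per_slot = ctx_size // parallel
print(per_slot)    # 13100, i.e. the ~13K-token cap per request
```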

# A few benchmarks I did:

# 1. Input: 1001 tokens, ttft: 0.3 s, output: 1618 tokens at ~140 t/s

# 2. Input: 9708 tokens, ttft: 2.4 s, output: 2562 tokens at ~106 t/s

# VRAM usage was consistently at ~7 GiB.
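That ~7 GiB roughly adds up. A back-of-envelope sketch, where the model dimensions (32 layers, 8 KV heads, head dim 128) and the ~2 bits/weight figure are my assumptions, not published Bonsai specs:

```python
# Rough VRAM budget for an 8B ternary model with a large q8_0 KV cache.
# ASSUMED values: bits_per_weight, layers, kv_heads, head_dim are guesses.
params = 8e9
bits_per_weight = 2.06                      # assumed for a ternary Q2-style quant
weights_gib = params * bits_per_weight / 8 / 2**30

layers, kv_heads, head_dim, ctx = 32, 8, 128, 65500
kv_bytes_per_tok = 2 * layers * kv_heads * head_dim * 1.06  # K+V, q8_0 ~8.5 bits/value
kv_gib = kv_bytes_per_tok * ctx / 2**30

print(round(weights_gib + kv_gib, 1))       # lands in the same ballpark as ~7 GiB
```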

> https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resol...

armanj

I did a quick benchmark & compared it with Qwen3.5: https://github.com/ArmanJR/PrismML-Bonsai-vs-Qwen3.5-Benchma...

In my results, Ternary-Bonsai-8B is on par with Qwen3.5-4B accuracy-wise. But on accuracy-per-byte, Bonsai is the clear winner:

=> Ternary-Bonsai-1.7B achieved 65.1% from 462 MiB, beating Qwen3.5-0.8B by 12 points while being ~5% smaller on disk.

=> Ternary-Bonsai-4B is the accuracy-per-byte winner above 1 GiB: 83.0% from only 1.1 GiB, within 2 points of Qwen3.5-4B at 40% of the weight size.
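From those figures the accuracy-per-byte gap can be made concrete. In this sketch the Qwen3.5-4B accuracy (85%) and size (1.1 GiB / 0.40) are inferred from "within 2 points" and "40% of the weight size", so treat them as approximations:

```python
# Accuracy per GiB on disk, from the numbers quoted above.
# Qwen3.5-4B's 85.0% / 2.75 GiB are INFERRED, not measured here.
models = {
    "Ternary-Bonsai-1.7B": (65.1, 462 / 1024),  # 462 MiB
    "Ternary-Bonsai-4B":   (83.0, 1.1),
    "Qwen3.5-4B":          (85.0, 1.1 / 0.40),  # ~2 points higher at ~2.5x the size
}
for name, (acc_pct, size_gib) in models.items():
    print(f"{name}: {acc_pct / size_gib:.1f} %/GiB")
```

Even granting Qwen the higher absolute accuracy, Bonsai-4B comes out far ahead on the per-byte metric.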

They show strong promise on edge devices and anywhere disk space is limited. I think this lab is worth watching.

philipp-gayret

Nice work, I applied my own benchmarking tools to it.

On my single NVIDIA Spark I get 173.3 tokens/s with the baseline config and 372.4 tokens/s with added tuning/parallel options. Most notably, time to first token is incredibly low: similar models take ~6000 ms, while Bonsai was 70 ms (almost a 100x reduction) with flash attention.

Having said all that, gemma4-e4b-q4km did much better for me, and I can achieve 70% of the tokens/s on the same machine, specifically in the context of tool use and running agents.

usernametaken29

I think it’s exciting to live in this quirky universe where we have simply accepted that our hardware does weird and nonlinear stuff, that this powers some math, and that’s why your transform function works. Many people thought quantisation would not be viable to the extent we now see, but we clearly underestimated the effect of hardware on the actual non-linearity of the models. Cool to see this pushed to the limits.

Animats

This makes sense. The 1-bit model implies needing 2x as many neurons, because you need an extra level to invert. But the ternary model still has a sign, just really low resolution.
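For reference, the 1.58 in the title is just the information content of a trit: log2(3) ≈ 1.585 bits, and in practice 5 ternary weights pack neatly into one byte:

```python
import math

# A ternary weight in {-1, 0, +1} carries log2(3) bits of information,
# and 5 trits fit in one byte since 3**5 = 243 <= 256.
print(math.log2(3))   # ~1.585, hence "1.58 bits"
print(3 ** 5)         # 243, so 8 bits / 5 trits = 1.6 bits per weight in practice
```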

(I've been reading the MMLU-Redux questions for electrical engineering. They're very funny. Fifty years ago they might have been relevant. The references to the Intel 8085 date this to the mid-1970s. Moving coil meters were still a big thing back then. Ward-Leonard drives still drove some elevators and naval guns. This is supposed to be the hand-curated version of the questions. Where do they get this stuff? Old exams?)

[1] https://github.com/aryopg/mmlu-redux/blob/main/outputs/multi...

swiftcoder

Does this sort of thing scale? Would a 30B or higher model see similar performance/memory gains under this scheme?

tiagod

> Fig IV: Throughput (toks/sec) and energy consumption (mWh/tok) across various hardware platforms.

I don't see any mWh/token figures in that chart.

zkmon

The raw math: file size is hard-linked to parameter count and quant type. Intelligence is loosely linked to parameter count. Parameter count dictates the hardware requirement. What's left for the labs is compressing more intelligence into a lower parameter count, packing in more specialized intelligence, or buying up more hardware. Those are the only 3 directions models/labs are heading in.

yodon

So excited to see this - the big advantage of 1.58 bits is there are no multiplications at inference time, so you can run them on radically simpler and cheaper hardware.
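Concretely, a ternary matrix-vector product reduces to adds and subtracts. This toy sketch shows the idea (illustrative only, nothing like a real packed-trit kernel):

```python
# Ternary matvec: for each weight in {-1, 0, +1}, add, skip, or subtract
# the activation -- no multiplications needed.
def ternary_matvec(W, x):
    out = []
    for row in W:                      # one output per weight row
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi              # +1: add the activation
            elif w == -1:
                acc -= xi              # -1: subtract it
            # 0: skip entirely
        out.append(acc)
    return out

print(ternary_matvec([[1, 0, -1], [-1, 1, 1]], [2.0, 3.0, 5.0]))  # [-3.0, 6.0]
```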

mchusma

Ever since I saw the first one of these one-bit models made by Microsoft, I thought this was a fascinating route. I assume that in practice, this is less helpful than it seems, just because there's every economic incentive in the world for the big AI labs to produce small, powerful, fast models. None of them seem to be using this technique, so it's interesting, but I suspect it's not quite working.

I also have yet to see any of these at a larger scale. For example, can you try one of these at 100 billion parameters?

mungoman2

This is very interesting and exciting, but IMHO the comparisons read as a bit disingenuous when the other models are at 16-bit weights. The 16-bit releases of the other models are not optimized for size, making it difficult to take the comparison seriously.

Would be interesting to see a comparison to quantized versions of the other models. If this model also beats the others in a fair comparison, that would give it more credibility.

WatchDog

All of their benchmarks are against 16 bit models right?

Why aren't they comparing to 2/3/4 bit quants?

londons_explore

How is the research on training these models directly in their quantized state going?

That'll be the real game changer.
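For context, the usual approach here is quantization-aware training with a straight-through estimator (the BitNet-style recipe): the optimizer keeps full-precision latent weights, the forward pass uses their ternary projection, and the backward pass treats the quantizer as identity. A toy one-weight sketch, not any lab's actual recipe:

```python
# Straight-through estimator (STE) sketch for ternary QAT.
def quantize_ternary(w, threshold=0.5):
    # Project a latent fp weight onto {-1, 0, +1} (bool arithmetic -> int).
    return (w > threshold) - (w < -threshold)

latent = 0.8                     # full-precision "shadow" weight
x, target, lr = 2.0, 2.0, 0.1    # toy 1-weight regression: want y == 2.0
for _ in range(10):
    y = quantize_ternary(latent) * x     # forward pass uses the ternary weight
    grad = 2 * (y - target) * x          # STE: d(quant)/d(latent) taken as 1
    latent -= lr * grad                  # update flows into the latent weight
print(quantize_ternary(latent))  # 1, the ternary weight that fits the data
```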

ericb

This is pretty cool! I would love to see even larger models shrunk down.

If you got that into a couple of gigs, what could you stuff into 20 gigs?
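At ternary bit-widths the answer is roughly a 100B-class model. A quick estimate (weights only, ignoring KV cache, embeddings, and quant overhead):

```python
# How many ternary parameters fit in a 20 GiB weight budget at ~1.58 bits each?
budget_bits = 20 * 2**30 * 8
params = budget_bits / 1.58
print(round(params / 1e9))   # 109 -> roughly a 100B-class model
```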

wmf

Yet again they're comparing against unquantized versions of other models. They would probably still win, but by a much smaller size margin.

syntex

hallucinates in pretty much every answer

est

Installed it since the last HN post. So Bonsai (1-bit) and Ternary-Bonsai are different?

Can it be run on browsers with WASM/WebGPU?

TimorousBestie

This model tends to be annoyingly literal. An example from earlier today:

>> What are some names like Llewelyn?

> Some names like Llewelyn are Llewelyn, Llewelyn, Llewelyn, (repeats several times), and Llewelyn.

gbgarbeb

When do we get 1100B Kimi K2.6 in 160 GB of memory at 1.125 bpw?
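For what it's worth, the arithmetic on that wish checks out; a quick estimate (weights only, ignoring KV cache and overhead):

```python
# 1100B weights at 1.125 bits each:
size_gib = 1100e9 * 1.125 / 8 / 2**30
print(round(size_gib))   # 144 -> ~144 GiB of weights, fitting in 160 GB
```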

goofy_lemur

> On M4 Pro, Ternary Bonsai 8B runs at 82 toks/sec, roughly 5x faster than a 16-bit 8B model

Wow, if this is true, I am extremely impressed and excited!

I wonder how much better the KV cache usage is as well!