glerk

If you play with these models long enough, you realize there is more to them than just "model X is smarter than model Y" or "model Y is cheaper than model Z". They are different tools and the prompting technique is different. It is very much like playing an instrument.

With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.

With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.

With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.

This is not scientific at all, just vibes, YMMV.

skipants

I feel like it's the Emperor's new clothes reading this article and seeing the praise it's getting. This sentence doesn't even make sense:

> These products use very low level Linux primitives like containers, Kubernetes, Firecracker microVMs, and networked protocols.

Out of everything listed there as a "low level Linux primitive", I could maybe argue that networking protocols fit the bill.

And it's obviously fully AI-generated! Which I wouldn't even care about if I could actually trust the content, which I can't!

stego-tech

I still believe that the strength of AI is when it can be applied locally in a secure and private manner, rather than yet another cloud-based service you must pay for indefinitely even as it gets progressively worse to satiate the greed of corporate shareholders.

ChatGPT and Anthropic will never, ever get me to tie my Health Data to their systems, but I still believe in the capabilities of AI in identifying patterns from data I would otherwise overlook, and sorely want a local-only ecosystem where I can expose this data safely, privately, and securely to something like Qwen or Gemma for processing.

Same goes for Smart Homes, and Personal Assistants. The corporate approach of letting Company A access your data stored at Company B and processed by Companies D and E while also sold to Advertisers and Data Brokers with no way for you to extract or view it on your local hardware - just isn’t tenable for these sorts of intimate use cases. I want my data to be owned and controlled and exposed on my terms, to be used to improve my life first rather than someone else’s bottom line. I want technology to give me back more of my time and improve my outcomes again, and I’ve been burned enough by Big Tech in the past that I flatly reject any presumption of nobility or public good from their AI-as-a-Service business model.

The capability is there, and I definitely think the folks working to build local tooling that supports and unlocks the potential for local models are the ones in the right. I love seeing what they build.

selfawareMammal

I am not a worse player than Messi, I'm just a different player.

ttsiodras

Interesting article.

IMHO, the author could have done two things better:

- vllm instead of llama.cpp. With NVIDIA HW, there is a huge difference in multi-user loads and caching with vllm; when he was complaining about what happens when more than one user uses the model, and about losing caching, my reaction was "well, duh".

- The budget he used for a single card could have instead been put to far, far better use with SPARKs. I have access to a cluster of 2 x GX10 - total cost less than half what he paid, even today - and I am running vllm and Deepseek v4 Flash. The difference compared to any Qwen is tremendous - I've NEVER seen it loop, and in all my experiments so far, it's the most Sonnet-y model I've ever tried (antirez seems to agree, hence his ds4 fork).

If you're wondering about how I set it up in the 2 GX10s: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...

Performance: 2K t/s prefill (very useful for feeding tons of source code into its massive context window) and around 50-60 tg/s in my coding sessions in the pi.dev harness. With the money the author paid, he could have bought 4 GX10s and doubled both numbers (vllm scales almost linearly with tensor parallelism).

zmmmmm

That's a great write up.

The one thing I feel it underestimates is the likelihood of improvement. Even the authors acknowledge it's not even worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year - 8 months ago - as the first time agentic coding became broadly viable, even with frontier hosted models.

So why would we lock in hard on any concept at this point of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long horizon tasks with models that run on consumer / pro grade hardware. But so far the naive optimists are winning.

hypfer

That was a lot of text, and I still have no idea what the author's point was (besides what I can infer from the headline, that is).

I do however now know that they're a totally cool dude building stuff physically and as software + that other people give them money for it.

Does that have anything to do with the topic suggested by the headline? Not sure.

gpt5

This article is a good summary of local models. Contrary to the way they are sometimes hyped, as fantastic tools for coding and agentic local work, the reality is that they are rather limited, would not do well on a long or complex task, and are prone to fall into loops, forget their tasks, etc. Not mentioned in the article is that they are also rather expensive - not just for the hardware cost, but also electricity. These 3090 and 5090 machines are pretty power hungry, and these models are pretty slow on these machines, making them consume more power per token.

Where they shine is in your ability to control them, their privacy, their predictability (e.g. if you are doing a repetitive task, like classifying your photo/video library), and depending on your energy bill - their costs.

mistercheese

I’m not sure if I missed it, but I’m curious how you feel about cloud hosted models with ZDR policies? GLM5.2 or even Minimax M3 on Fireworks or Together ai should still be relatively and consistently cheap and private, but a lot more capable and easier to set up?

barrkel

I found it interesting that vLLM was dismissed as slower than llama.cpp.

IME vLLM is quite a bit faster than llama.cpp but where it really wipes the floor with it is in batching concurrent load. The downside is that it is dramatically less flexible in terms of tweaking. It gives you very few options for running quantized weights. It takes a lot longer to start up because it optimizes the compute graph. So for single user experimentation on a model that's a bit too big for your box, vLLM is just going to be frustrating.

eurekin

> The model is running so hot, that it shoots past the goal and starts looping

later:

> My latest experiment was setting up vLLM (the gold standard for production and concurrent serving) and even with an NVLink (175GBP) and tensor parallelism turned on, it was 3 tokens/second slower than llama.cpp during generation for an equivalent setup.

In all my tests, getting vllm to run is worth it. It was the single biggest thing that helped with looping issues, agents going out of whack and losing focus on the task, and long context being essentially useless.

With an FP8 model and an unquantized cache in vllm, you get a far better overall experience than with any other stack I tested. Then you can actually focus on using the model for other things and stop tinkering with settings.

nessex

This is a great post that covers a lot of the recent ground. I have a very similar setup after a very similar journey, minus the RTX6000. Worth noting though that a lot of the recent changes make a single 3090/4090 much more viable here too. MTP and the recent improvements to kv quantization in particular, as well as model-specific template & quant fixes. I run a 4090 with the 4-bit quantized variant of the same model now and have had a great experience. Qwen3.5 was already a big step up, but with 3.6 and the rest of the improvements it's substantially more reliable as a daily use tool and I find myself reaching for hosted models a lot less. Feels like I could work entirely without them if they were to disappear without going back to typing every line of code myself.

To make 4-bit fit on one card with reasonable (100k+) context needs a bit more care though. And tuning can be highly specific to your machine, gpu and use-case. But I use a headless server, offload multi-modal to CPU, use fit-target to reduce wasted memory and use q8_0 kv since the 4090 performs well with it... In addition to most of the same config as the author elsewhere. I get 50-60tps generation with a power limit of 275W (450W is default), more than enough to offer roughly an Opus-speed feedback loop.

I haven't seen many of the issues with looping the author mentions. But I did with Qwen3.5 and in particular other 4-bit quants in the past. But the difference is probably a mix of the improvements above, as well as habits changing to avoid cases where models will loop. For what I'm doing, it seems like I loop Qwen3.6 on the same kind of prompts I'll make Haiku or Sonnet loop on (the latter hide some of their existential loops behind "thinking"). Usually it's cause I was too vague about some aspect of what I'm wanting them to do or I forgot to include some context that smaller models just don't have access to in their smaller knowledge base. But at least for what I'm doing (Rust, React, kubernetes) it's not been a notable problem at all with the latest iteration of this whole stack. And knowledge of standard libraries and default k8s resource kinds has been almost flawless.

There's still plenty of more complex stuff where I'll choose to jump straight to Claude or GLM-5.2, but if it's not worth that jump I've stopped paying for the middle ground as it's usually not much better than just one more iteration through qwen.

All this to say, if you have a 3090/4090, feel free to give the same setup a go. It's come a long way in recent weeks.
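
As a rough back-of-the-envelope check on the 275W / 50-60tps figures above, here is a minimal sketch of the energy-per-token arithmetic. It assumes the card draws its full power limit the whole time and ignores the rest of the system, so treat it as an upper bound for the GPU alone. The function names and the electricity price parameter are made up for illustration.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token, assuming constant power draw."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Watts are joules per second, so W / (tokens/s) = joules per token.
    return power_watts / tokens_per_second


def kwh_per_million_tokens(power_watts: float, tokens_per_second: float) -> float:
    """Energy for one million generated tokens, in kWh (1 kWh = 3.6e6 J)."""
    return joules_per_token(power_watts, tokens_per_second) * 1_000_000 / 3_600_000


def cost_per_million_tokens(
    power_watts: float, tokens_per_second: float, price_per_kwh: float
) -> float:
    """Electricity cost per million tokens; price_per_kwh is a hypothetical input."""
    return kwh_per_million_tokens(power_watts, tokens_per_second) * price_per_kwh


# 275 W power limit and ~55 tokens/s (midpoint of 50-60) from the comment above.
energy = joules_per_token(275, 55)      # 5.0 J per token
kwh = kwh_per_million_tokens(275, 55)   # about 1.39 kWh per million tokens
```

Plug in your own tariff via `cost_per_million_tokens` to compare against hosted per-token pricing; prefill tokens and idle draw are not counted here.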

piterrro

This is amazing, but for everyone out there wanting to buy and build your own AI rig, I recommend connecting to one of the many inference providers and trying out different models yourself for a while. It costs pennies but can give you a nice preview of what you can get with your own rig. Just a friendly tip.

bee_rider

Tangential question (since they brought it up in the article) from someone not involved in AI performance optimization:

How big of a deal is looping, practically? Or, I mean, I see thinking models loop occasionally. But it seems to me that every token in the loop should be in the KV cache already, is there really no way to either power through a loop because of the 100% cache hit rate, or identify that you are in a loop that way? (As a human, when thinking hard I sometimes loop, but it is easy enough to identify…)
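
To make the "identify that you are in a loop" idea concrete, here is a minimal sketch of one simple approach: check whether the tail of the generated token sequence is the same block repeated several times in a row. This is only an illustration, not how any particular inference engine does it; the function name and thresholds are made up, and it only catches exact repeats, not paraphrased loops.

```python
from typing import Optional, Sequence


def find_repeating_suffix(
    tokens: Sequence[int],
    min_period: int = 4,
    max_period: int = 64,
    min_repeats: int = 3,
) -> Optional[int]:
    """Return the loop period if the sequence ends in an exact repeat, else None.

    A loop of period p is reported when the last p tokens are identical to
    each of the (min_repeats - 1) blocks of p tokens directly before them.
    """
    n = len(tokens)
    for period in range(min_period, max_period + 1):
        if period * min_repeats > n:
            break  # not enough history to see this many repeats
        tail = list(tokens[n - period:])
        if all(
            list(tokens[n - (k + 1) * period : n - k * period]) == tail
            for k in range(1, min_repeats)
        ):
            return period
    return None


# Example: three prefix tokens followed by the same 4-token block three times.
looping = [1, 2, 3] + [7, 8, 9, 10] * 3
print(find_repeating_suffix(looping))          # 4
print(find_repeating_suffix(list(range(100))))  # None
```

A harness could run a check like this every few dozen tokens and stop or re-prompt when it fires. It does not make the loop itself any cheaper to generate.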

krzyk

3090 and 2x3090 are quite popular. But if you use a gigantic (for local models) context of 200k, it will go south pretty quickly - any quantization of the context quickly becomes an issue.

teh

I sometimes wonder how much of intelligence is being good with tools.

I feel pretty averagely smart but give me some good tooling like a good editor, a good type system, semantic grep, good testing and some solvers and I can actually deliver some work.

Maybe the trick isn't 500 billion parameters but a model super integrated with the task at hand for iteration and debugging?

FWIW the article really mirrors my own experience. I can run a small gemma4 for quick edits (and it's fast!) or data cleanup but for other tasks you do need a different tool (claude).

whazor

Would be interesting to use local models for:

- tool calling

- code base exploration

- anonymizing / abstracting your request

Such that your local AI communicates with a frontier model the way you would with an expensive consultant giving high level advice.

I think that, due to the lower latency of a local model, this could be faster.

cptskippy

I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.

https://github.com/cptskippy/battlemage-llm-gateway

Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow with pretty good success. One is a personal assistant who manages my backlog and keeps me on task, follows up with me on items, and will put together research briefs. The other I use as a general purpose coder and researcher, and it's about 50:50 with the tasks I've given it. In fairness though, the task it failed at left me scratching my head to figure out as well.

zkmon

The article seems to talk a lot about 27B. In my experience, 35B-A3B was equally good in quality and the MoE gave more tg/s.

watt

I find it strange that software people will accept this level of flakiness from the hardware. Normally you would just send the card back, and request a replacement.

> One of the cards would only show up if I crossed my fingers when turning it on. Even reboots wouldn't cure it - I had to A/C power off and remove the power cable each time for 30 seconds.

This is ridiculous. Of course we are living through a supply crunch, but that card is clearly defective hardware.

wallkroft

>Local Qwen isn't a worse Opus
>looks inside
>local Qwen is not "near Opus levels"

itsthecourier

The author wanted sovereignty, bought a Blackwell for USD 12k, discovered a billing issue with some customer, and explains that this will cover the cost of the card.

I don't follow how that supports the decision to buy the card. I would even say using online SOTA models would have caught it earlier, without USD 12k and monthly electricity being spent.

bethekidyouwant

“This rock is not a worse hammer, it's a different tool”

mystraline

> We've all heard people say that local Qwen 27B or 35-A3B is "near-Opus level"

Uh, so, yeah. I'm running local Qwen, but Qwen3.5-122B using Krasis https://github.com/brontoguana/krasis

It's by far better than Opus.

In fact, during a phone migration, I was using an OLD Android 2FA app, "andOTP". The backup files it emitted were JSON, but not in any sort of standard format.

I needed the standard version using otpauth:// to upload into my current 2FA app. So I gave the backup to my local qwen3.5-122b.

It responded with a scary "you uploaded credentials to a public instance LLM!" warning, and it emitted standards-compliant URLs. The new app "Tokn" ingested them just fine. When tested side by side, everything was 100% correct.

I could have done it myself, but it was a one-off. And asking local Qwen worked perfectly. Took like 6 minutes. Would have taken me 1h.
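
For anyone curious what that conversion involves, here is a minimal sketch. It assumes an unencrypted andOTP backup that is a JSON array of entries with `secret`, `issuer`, `label`, `type`, `algorithm`, `digits` and `period` (or `counter` for HOTP) fields; those field names are an assumption about the backup layout, so check them against your own file. The output follows the common otpauth:// key URI layout. The function names are made up for illustration.

```python
import json
from typing import List
from urllib.parse import quote, urlencode


def entry_to_otpauth(entry: dict) -> str:
    """Convert one backup entry (assumed field names) to an otpauth:// URI."""
    otp_type = str(entry.get("type", "TOTP")).lower()  # "totp" or "hotp"
    issuer = entry.get("issuer", "")
    label = entry.get("label", "")
    # Key URI convention: the label is "Issuer:account" when an issuer is known.
    if issuer and not label.startswith(issuer + ":"):
        full_label = f"{issuer}:{label}"
    else:
        full_label = label

    params = {
        "secret": entry["secret"],  # base32 secret; required
        "algorithm": entry.get("algorithm", "SHA1"),
        "digits": entry.get("digits", 6),
    }
    if issuer:
        params["issuer"] = issuer
    if otp_type == "hotp":
        params["counter"] = entry.get("counter", 0)
    else:
        params["period"] = entry.get("period", 30)

    return f"otpauth://{otp_type}/{quote(full_label)}?{urlencode(params)}"


def convert_backup(json_text: str) -> List[str]:
    """Convert a whole plain-text backup (a JSON array) to a list of URIs."""
    return [entry_to_otpauth(e) for e in json.loads(json_text)]


# Made-up example entry; the secret is a well-known dummy value.
sample = json.dumps([{
    "secret": "JBSWY3DPEHPK3PXP", "issuer": "Example",
    "label": "alice@example.com", "type": "TOTP",
    "algorithm": "SHA1", "digits": 6, "period": 30,
}])
uris = convert_backup(sample)
```

Whatever does the conversion, comparing the generated codes side by side in the old and new app before wiping the old one is the real test.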

rsrsrs86

Chasing models is, for me, a big yellow flag

Means underinvesting in engineering

Look into it