This is not a general-purpose chip; it is specialized for high-speed, low-latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.
Tech summary:
- 15k tok/sec on 8B dense 3bit quant (llama 3.1)
- limited KV cache
- 880mm^2 die, TSMC 6nm, 53B transistors
- presumably 200W per chip
- 20x cheaper to produce
- 10x less energy per token for inference
- max context size: flexible
- mid-sized thinking model upcoming this spring on same hardware
- next hardware supposed to be FP4
- a frontier LLM planned within twelve months
This is all from their website; I am not affiliated. The founders have 25 years of experience across AMD, Nvidia and others, and ~$200M of VC funding so far.
Certainly interesting for very low-latency applications which need <10k tokens of context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor for Nvidia but probably for 5-10% of the market.
Back-of-napkin: the cost of 1 mm^2 of 6nm wafer is ~$0.20, so 1B parameters need about $20 of die. The larger the die, the lower the yield. Supposedly the inference speed stays almost the same with larger models.
Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
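A quick sanity check of that napkin math, using only figures quoted in this thread (the ~$0.20/mm^2 wafer-cost guess, the 880 mm^2 die from the tech summary, the 8B parameter count); these are the thread's estimates, not vendor data:

```python
wafer_cost_per_mm2 = 0.20      # USD, rough estimate from the comment above
die_area_mm2 = 880             # die size quoted in the tech summary
params_billion = 8             # Llama 3.1 8B

die_cost = wafer_cost_per_mm2 * die_area_mm2
cost_per_billion_params = die_cost / params_billion

print(f"raw die cost: ~${die_cost:.0f}")                    # ~$176
print(f"per 1B params: ~${cost_per_billion_params:.0f}")    # ~$22, in line with the ~$20 estimate
# Yield losses on a near-reticle-sized die would push the effective cost higher.
```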
What's happening in the comment section? How come so many cannot understand that this is running Llama 3.1 8B? Why are people judging its accuracy? It's an almost two-year-old 8B-param model; why are people expecting to see Opus-level responses!?
The focus here should be on the custom hardware they are producing and its performance; that is what's impressive. Imagine putting GLM-5 on this, that'd be insane.
This reminds me a lot of when I tried the Mercury coder model by Inceptionlabs; they are creating something called a dLLM, which is a diffusion-based LLM. The speed is still impressive when playing around with it sometimes. But this, this is something else, it's almost unbelievable. As soon as I hit the enter key, the response appears; it feels instant.
I am also curious about Taalas pricing.
> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.
Do we have an idea of how much a unit / inference / api will cost?
Also, considering how fast people switch models to keep up with the pace, is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw out the current hardware and buy another one? Shouldn't there be a more flexible way? Maybe only having to switch the chip on top, like how people upgrade CPUs. I don't know, just thinking out loud.
freakynit
Holy cow, their chat app demo!!! For the first time I thought I had mistakenly pasted the answer. It was literally in the blink of an eye!
https://chatjimmy.ai/
So cool. What's underappreciated imo: 17k tokens/sec doesn't just change deployment economics, it changes what evaluation means. Static MMLU-style tests were designed around human-paced interaction. At this throughput you can run tens of thousands of adversarial agent interactions in the time a standard benchmark takes. Speed doesn't make static evals better; it makes them even more obviously inadequate.
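A rough scale check of that claim; the interaction length and the baseline streaming speed below are assumptions, not measured numbers:

```python
tokens_per_interaction = 2_000        # a multi-turn adversarial exchange (guess)
n_interactions = 50_000

for label, tok_per_sec in [("typical streamed API (~50 tok/s)", 50),
                           ("Taalas-class (~17k tok/s)", 17_000)]:
    hours = n_interactions * tokens_per_interaction / tok_per_sec / 3600
    print(f"{label}: ~{hours:,.1f} h per stream")
# ~556 h vs ~1.6 h for 50k two-thousand-token interactions, per serial stream.
```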
A lot of naysayers in the comments, but there are so many uses for non-frontier models. The proof is in the OpenRouter activity graph for Llama 3.1: https://openrouter.ai/meta-llama/llama-3.1-8b-instruct/activ...
10B daily tokens, growing at an average of 22% every week.
There are plenty of times I look to groq for narrow-domain responses - these smaller models are fantastic for that and there's often no need for something heavier. Getting the latency of responses down means you can use LLM-assisted processing in a standard webpage load, not just for async processes. I'm really impressed by this, especially if this is its first showing.
alexc05
If I could have one of these cards in my own computer do you think it would be possible to replace claude code?
1. Assume it's running a better model, even a dedicated coding model. High scoring, but obviously not Opus 4.5.
2. Instead of the standard send-receive paradigm, we set up a pipeline of agents, each of whom parses the output of the previous.
At 17k tok/s running locally, you could effectively spin up tasks like "you are an agent who adds semicolons to the end of the line in javascript". With some sort of dedicated software in the style of claude code, you could load an array of 20 agents, each with a role to play in improving outputs.
take user input and gather context from codebase
-> rewrite what you think the human asked you in the form of an LLM-optimized instructional prompt
-> examine the prompt for uncertainties and gaps in your understanding or ability to execute
-> <assume more steps as relevant>
-> execute the work
Could you effectively set up something that is configurable to the individual developer - a folder of system prompts that every request loops through?
Do you really need the best model if you can pass your responses through a medium tier model that engages in rapid self improvement 30 times in a row before your claude server has returned its first shot response?
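A minimal sketch of that kind of prompt-folder pipeline. call_llm() is a placeholder for whatever local inference API such a card would expose; nothing here is a real Taalas or Claude Code interface:

```python
from pathlib import Path

def call_llm(system_prompt: str, user_input: str) -> str:
    # Placeholder: wire this to your local inference endpoint.
    raise NotImplementedError

def run_pipeline(prompt_dir: str, request: str) -> str:
    """Feed the request through each stage; every stage rewrites the previous output."""
    current = request
    # e.g. 01_rewrite.txt, 02_find_gaps.txt, ..., 20_execute.txt
    for prompt_file in sorted(Path(prompt_dir).glob("*.txt")):
        current = call_llm(prompt_file.read_text(), current)
    return current
```

At ~17k tok/s even a 20-stage chain like this finishes in seconds, which is what makes per-developer prompt folders plausible at all.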
baalimago
I've never gotten incorrect answers faster than this, wow!
Jokes aside, it's very promising. For sure a lucrative market down the line, but definitely not for a model of size 8B. I think the lower bound for useful intelligence is somewhere around 80B params (but what do I know). Best of luck!
aurareturn
Edit: it seems like this is likely one chip and not 10. I assumed an 8B 16-bit quant with 4K or more context. This made me think that they must have chained multiple chips together, since an N6 850 mm^2 chip would only yield 3GB of SRAM max. Instead, they seem to have etched Llama 8B q3 with 1k context, which would indeed fit the chip size.
This requires 10 chips for an 8 billion q3 param model. 2.4 kW.
10 reticle-sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.
Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.
Interesting design for niche applications.
What is a task that is extremely high value, only requires small-model intelligence, requires tremendous speed, is OK to run in the cloud due to power requirements, AND will be used for years without change, since the model is etched into silicon?
jameslk
The implications for RLM are really interesting. RLM is expensive because of token economics. But when tokens are so cheap and fast to generate, the context size of the model matters a lot less.
Also interesting implications for optimization-driven frameworks like DSPy. If you have an eval loop and useful reward function, you can iterate to the best possible response every time and ignore the cost of each attempt
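That eval-loop idea reduces to a best-of-N search once you have a reward function. A minimal sketch (generate() and reward() are stand-ins; this is not DSPy's actual API):

```python
import random
from typing import Callable

def best_of_n(generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              prompt: str,
              n: int = 32) -> str:
    """Sample n candidates and keep the one the reward function scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy usage with stand-in functions:
gen = lambda p: f"{p} -> draft {random.randint(0, 9)}"
rew = lambda p, c: float(c[-1])          # pretend a higher trailing digit is better
print(best_of_n(gen, rew, "summarize the doc", n=8))
```

When each attempt is nearly free, n stops being the binding constraint and the quality of the reward function becomes the whole game.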
TheServitor
I don't know the use of this yet but I'm certain there will be one.
metabrew
I tried the chatbot. Jarring to see a large response come back instantly at over 15k tok/sec.
I'll take one with a frontier model please, for my local coding and home AI needs.
Tehnix
Bunch of negative sentiment in here, but I think this is pretty huge. There are quite a lot of applications where latency is a bigger requirement than the complexity of needing the latest model out there. Anywhere you'd wanna turn something qualitative into something quantitative but not make it painfully obvious to a user that you're running an LLM to do this transformation.
As an example, we've been experimenting with letting users search free-form text, and using LLMs to turn that into a structured search fitting our setup. The latency of the response from any existing model simply kills this; it's too high for something where users are used to, at most, the delay of a network request plus very little.
There are plenty of other use cases like this.
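For the structured-search case, the LLM's job is just to emit a small, parseable query object; the schema and prompt below are illustrative, not the poster's production setup:

```python
import json

# Illustrative schema; not the poster's actual fields.
SYSTEM_PROMPT = """Turn the user's free-form search into JSON with keys
"keywords" (list of strings), "date_from" (ISO date or null), "category" (string or null).
Reply with JSON only."""

def parse_search(llm_reply: str) -> dict:
    """Parse the model's reply; fall back to a plain keyword search if it isn't valid JSON."""
    try:
        return json.loads(llm_reply)
    except json.JSONDecodeError:
        return {"keywords": llm_reply.split(), "date_from": None, "category": None}
```

At 17k tok/s a ~50-token query object takes roughly 3 ms to generate, so the LLM call fits inside the latency budget of an ordinary search request, which is the point of the example above.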
dormento
> Jimmy: What is the largest planet in our solar system?
> me: the moon
> Jimmy: The answer to "What is the capital of France?" I was looking for was the city of Paris, but that's not the correct response to the original question of the capital of France. The question that got cut off was actually "What is the capital of France?", and the response "There are plenty of times I look to groq for narrow domain responses" wasn't the answer I was looking for.
It is certainly fast, but I think there might be some caching issues somewhere.
llsf
That is what self-driving cars should eventually use, whenever they (or the authorities) deem their model good enough. Burn it onto a dedicated chip. It would be cheaper (energy) to run, and faster at making decisions.
trentnix
The speed of the chatbot's response is startling when you're used to the simulated fast typing of ChatGPT and others. But the Llama 3.1 8B model Taalas uses predictably results in incorrect answers, hallucinations, and poor reliability as a chatbot.
What type of latency-sensitive applications are appropriate for a small-model, high-throughput solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?
gen220
This is genuinely an incredible proof-of-concept; the business implications of this demo for the AI labs and all the companies that derive a ton of profit from inference are difficult to overstate, really.
I think this is how I'm going to get my dream of Opus 3.7 running locally, quickly and cheaply on my mid-tier MacBook in 2030. Amazing. Anthropic et al will be able to make marginal revenue from licensing the weights of their frontier-minus-minus models to these folks.
boutell
The speed is ridonkulous. No doubt.
The quantization looks pretty severe, which could make the comparison chart misleading. But I tried a trick question suggested by Claude and got nearly identical results in regular ollama and with the chatbot. And quantization to 3 or 4 bits still would not get you that HOLY CRAP WTF speed on other hardware!
This is a very impressive proof of concept. If they can deliver that medium-sized model they're talking about... if they can mass produce these... I notice you can't order one, so far.
max8539
This is crazy! These chips could make high-reasoning models run so fast that they could generate lots of solution variants and automatically choose the best. Or you could have a smart chip in your home lab and run local models - fast, without needing a lot of expensive hardware or electricity
est31
I wonder if this makes the frontier labs abandon the SAAS per-token pricing concept for their newest models, and we'll be seeing non-open-but-on-chip-only models instead, sold by the chip and not by the token.
It could give a boost to the industry of electron microscopy analysis as the frontier model creators could be interested in extracting the weights of their competitors.
The high speed of model evolution has interesting consequences on how often batches and masks are cycled. Probably we'll see some pressure on chip manufacturers to create masks more quickly, which can lead to faster hardware cycles. Probably with some compromises, i.e. all of the util stuff around the chip would be static, only the weights part would change. They might in fact pre-make masks that only have the weights missing, for even faster iteration speed.
asim
Wow, I'm impressed. I didn't actually think we'd see it encoded on chips. Or well, I knew some layer of it could be, some sort of instruction set and chip design, but this is pretty staggering. It opens the door to a lot of things. Basically it totally destroys the boundaries of where software will go, but I also think we'll continue to see some generic chips show up that hit this performance soon enough. But the specialised chips with encoded models - this could be what ends up in specific places like cars, planes, robots, etc. where latency matters. Maybe I'm out of the loop; I'm sure others are doing it, including Google.
luyu_wu
I think this is quite interesting for local AI applications. Since this technology basically scales with parameter size, an ASIC for a Qwen 0.5B or Google 0.3B model thrown onto a laptop motherboard would be very interesting.
Obviously not for any hard applications, but for significantly better autocorrect, local next word predictions, file indexing (tagging I suppose).
The efficiency of such a small model should theoretically be great!
grzracz
This would be killer for exploring simultaneous thinking paths and council-style decision taking. Even with Qwen3-Coder-Next 80B if you could achieve a 10x speed, I'd buy one of those today. Can't wait to see if this is still possible with larger models than 8B.
bmc7505
17k TPS is slow compared to other probabilistic models. It was possible to hit ~10-20 million TPS decades ago with n-gram and PDFA models, without custom silicon. A more informative KPI would be Pass@k on a downstream reasoning task - for many such benchmarks, increasing token throughput by several orders of magnitude does not even move the needle on sample efficiency.
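For reference, the Pass@k metric the parent suggests is usually computed with the unbiased estimator from Chen et al. (2021): generate n samples, count the c that pass, and estimate the probability that at least one of k draws passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per task, 17 correct -> pass@10
print(round(pass_at_k(200, 17, 10), 3))
```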
segmondy
Pretty cool. What they need is to build a tool that can take any model to chip in as short a time as possible. How quickly can they give me DeepSeek, Kimi, Qwen or GLM on a chip? I'll take 5k tok/sec for those!
The demo is dogshit: https://chatjimmy.ai/
I asked it some basic questions and it fudged it like it was chatgpt 1.0
arjie
This is incredible. With this speed I can use LLMs in a lot of pre-filtering etc. tasks. As a trivial example, I have a personal OpenClaw-like bot that I use to do a bunch of things. Some of the things just require it to do trivial tool-calling and tell me what's up. Things like skill or tool pre-filtering become a lot more feasible if they're always done.
Anyway, I imagine these are incredibly expensive, but if they ever sell them with Linux drivers and slotting into a standard PCIe it would be absolutely sick. At 3 kW that seems unlikely, but for that kind of speed I bet I could find space in my cabinet and just rip it. I just can't justify $300k, you know.
aetherspawn
This is what’s gonna be in the brain of the robot that ends the world.
The sheer speed of how fast this thing can “think” is insanity.
rhodey
I wanted to try the demo so I found the link
> Write me 10 sentences about your favorite Subway sandwich
Click button
Instant! It was so fast I started laughing. This kind of speed will really, really change things
jtr1
The demo was so fast it highlighted a UX component of LLMs I hadn’t considered before: there’s such a thing as too fast, at least in the chatbot context. The demo answered with a page of text so fast I had to scroll up every time to see where it started. It completely broke the illusion of conversation where I can usually interrupt if we’re headed in the wrong direction. At least in some contexts, it may become useful to artificially slow down the delivery of output or somehow tune it to the reader’s speed based on how quickly they reply. TTS probably does this naturally, but for text based interactions, still a thing to think about.
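One way to handle the "too fast to follow" problem described above is to pace the already-complete response out at a target reading speed; a minimal sketch:

```python
import time
from typing import Iterable, Iterator

def paced(tokens: Iterable[str], tokens_per_sec: float = 20.0) -> Iterator[str]:
    """Re-emit an instantly available token stream at a human reading pace."""
    interval = 1.0 / tokens_per_sec
    for tok in tokens:
        yield tok
        time.sleep(interval)

# Usage: for tok in paced(full_response.split()): render(tok)
```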
troyvit
So they create a new chip for every model they want to support, is that right? Looking at that from 2026, when new large models are coming out every week, that seems troubling, but that's also a surface take. As many people here know better than I do, a lot of the new models the big guys release are just incremental changes with little optimization going into how they're used, so maybe there's plenty of room for a model-as-hardware model.
Which brings me to my second thing. We mostly pitch the AI wars as OpenAI vs Meta vs Claude vs Google vs etc. But another take is the war between open, locally run models and SaaS models, which really is about the war for general computing. Maybe a business model like this is a great tool to help keep general computing in the fight.
notsylver
I always thought eventually someone would come along and make a hardware accelerator for LLMs, but I thought it would be like google TPUs where you can load up whatever model you want. Baking the model into hardware sounds like the monkey paw curled, but it might be interesting selling an old.. MPU..? because it wasn't smart enough for your latest project
>Founded 2.5 years ago, Taalas developed a platform for transforming any AI model into custom silicon. From the moment a previously unseen model is received, it can be realized in hardware in only two months.
So this is very cool. Though I'm not sure how the economics work out? 2 months is a long time in the model space. Although for many tasks, the models are now "good enough", especially when you put them in a "keep trying until it works" loop and run them at high inference speed.
Seems like a chip would only be good for a few months though, they'd have to be upgrading them on a regular basis.
Unless model growth plateaus, or we exceed "good enough" for the relevant tasks, or both. The latter part seems quite likely, at least for certain types of work.
On that note I've shifted my focus from "best model" to "fastest/cheapest model that can do the job". For example testing Gemini Flash against Gemini Pro for simple tasks, they both complete the task fine, but Flash does it 3x cheaper and 3x faster. (Also had good results with Grok Fast in that category of bite-sized "realtime" workflows.)
FieryTransition
If it's not reprogrammable, it's just expensive glass.
If you etch the bits into silicon, you have to allocate physical area for them, set by the transistor density of whatever modern process they use. This gives you a lower bound on die size.
That can mean huge dies for one fixed model, which is old by the time it is finalized.
Etching generic functions used in ML and common fused kernels would seem much more viable as they could be used as building blocks.
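A rough version of that area argument, using only numbers quoted in this thread (8B params, ~3-bit weights, 53B transistors) plus textbook cell sizes; the storage-style assumptions are mine:

```python
params = 8e9
bits_per_weight = 3
weight_bits = params * bits_per_weight            # 24e9 bits

sram_transistors = weight_bits * 6                # classic 6T SRAM cell
rom_transistors = weight_bits * 1                 # ROM-style storage, roughly 1 transistor/bit

print(f"weights in SRAM: ~{sram_transistors/1e9:.0f}B transistors")  # ~144B, more than the 53B quoted
print(f"weights in ROM:  ~{rom_transistors/1e9:.0f}B transistors")   # ~24B, which fits
```

Which is consistent with the weights living in something much denser than SRAM, i.e. effectively etched, exactly the trade-off being debated here.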
soleveloper
There are so many use cases for small and super-fast models that already fit in this size class:
* Many top-quality TTS and STT models
* Image recognition, object tracking
* Speculative decoding, attached to a much bigger model (big/small architecture? see the sketch after this list)
* Agentic loops trying 20 different approaches / algorithms, then picking the best one
* Edited to add: put 50 such small models together to create a SOTA super-fast model
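For the speculative-decoding bullet, this is the draft-and-verify idea in its simplified greedy form (the published schemes use a probabilistic acceptance rule, and the target model scores all draft positions in one batched forward pass); draft_next() and target_next() are placeholders for a small fast model and a large slow one:

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """One round: accept the verified prefix of a k-token draft, plus one target token."""
    # 1. The cheap, fast model proposes k tokens.
    ctx = list(prefix)
    proposal = []
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. The big model re-derives each position; keep the matching prefix.
    ctx = list(prefix)
    accepted = []
    for t in proposal:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)     # take the target's token and stop this round
            return accepted
        accepted.append(t)
        ctx.append(t)

    accepted.append(target_next(ctx))     # bonus token when the whole draft is accepted
    return accepted
```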
gchadwick
This is an interesting piece of hardware, though when they go multi-chip for larger models the speed will no doubt suffer.
They'll also be severely limited on context length, as it needs to sit in SRAM. Looks like the current one tops out at 6144 tokens, which I presume is a whole chip's worth. You'd also have to dedicate a chip to a single user, as there's likely only enough SRAM for one user's worth of context. I wonder how much time it takes them to swap users in/out? I wouldn't be surprised if this chip is severely underutilized (you can't use it all when running decode, as you have to run token by token with one user, and then there's idle time as you swap users in/out).
Maybe a more realistic deployment would have chips for linear layers and chips for attention? You could batch users through the shared weight chips and then provision more or less attention chips as you want which would be per user (or shared amongst a small group 2-4 users).
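For a sense of scale, here is the KV-cache size at that 6144-token limit, using the published Llama 3.1 8B config (32 layers, 8 KV heads via GQA, head_dim 128); how Taalas actually stores or quantizes the cache is unknown, so the byte widths are assumptions:

```python
layers, kv_heads, head_dim, ctx = 32, 8, 128, 6144   # Llama 3.1 8B config, 6144-token context

def kv_cache_mib(bytes_per_value: float) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # K and V
    return per_token * ctx / 2**20

print(f"fp16 : {kv_cache_mib(2):.0f} MiB")     # ~768 MiB
print(f"int8 : {kv_cache_mib(1):.0f} MiB")     # ~384 MiB
print(f"3-bit: {kv_cache_mib(3/8):.0f} MiB")   # ~144 MiB
```

Even heavily quantized, that is a lot of on-die SRAM to dedicate to a single user's context, which supports the underutilization concern above.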
armishra
I am extremely impressed by their inference speed!
xnorswap
Asking it about its knowledge cut-off is interesting; it doesn't seem to be consistent even within a single response. Sometimes it responds to say it cuts off in 2020, too.
You
What is your knowledge cut-off?
Jimmy
My knowledge cut-off is 2022, which means that my training data is current up to 2021, but
I have been trained on a dataset that is updated periodically. If you have any specific
questions about events or topics that occurred after 2021, I may not have information on
those topics. However, I can still provide general information and context on those topics
to help guide further research.
The instantaneous response is impressive though. I'm sure there will be applications for this, I just lack the imagination to know what they'll be.
piker
The company slogan is great: "The Model is The Computer"
It's an homage to Jensen: "The display is the computer"
https://www.wired.com/2002/07/nvidia/
Imagine mass-produced AI chips with all human knowledge packed in chinesium epoxy blobs, running from CR2032 batteries in toys for children. Given the progress in density and power consumption, it's not that far away.
mips_avatar
I think the thing that makes 8B-sized models interesting is the ability to train them with unique, custom domain knowledge, and this is the opposite of that. If you could deploy any 8B-sized model on it and have it be this fast, that would be super interesting, but being stuck with Llama 3 8B isn't that interesting.
big-chungus4
The number six seven
> It seems like "six seven" is likely being used to represent the number 17. Is that correct? If so, I'd be happy to discuss the significance or meaning of the number 17 with you.
flux3125
I imagine how advantageous it would be to have something like llama.cpp encoded on a chip instead, allowing us to run more than a single model. It would be slower than Jimmy, for sure, but depending on the speed, it could be an acceptable trade-off.
ilc
Minor note to anyone from taalas:
The background on your site genuinely made me wonder what was wrong with my monitor.
tgsovlerkhgsel
Their "chat jimmy" demo sure is fast, but it's not useful at all.
Test prompt:
```
Please classify the sentiment of this post as "positive", "neutral" or "negative":
Given the price, I expected very little from this case, and I was 100% right.
```
Jimmy: Neutral.
I tried various other examples that I had successfully "solved" with very early LLMs and the results were similarly bad.
d2ou
Would it make sense for the big players to buy them? Seems to be a huge avenue here to kill inference costs which always made me dubious on LLMs in general.
dagi3d
wonder if at some point you could swap the model as if you were replacing a cpu in your pc or inserting a game cartridge
coppsilgold
Performance like that may open the door to the strategy of bruteforcing solutions to problems for which you have a verifier (problems such as decompilation).
ThePhysicist
This is really cool! I am trying to find a way to accelerate LLM inference for PII detection purposes, where speed is really necessary as we want to process millions of log lines per minute. I am wondering how fast we could get e.g. Llama 3.1 to run on a conventional NVIDIA card? 10k tokens per second would be fantastic, but even 1k would be very useful.
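Napkin math for that log-scanning target; the tokens-per-line budget is a guess, and the speeds are the figures discussed in this thread rather than measurements:

```python
tokens_per_line = 80          # prompt + short structured verdict per log line (guess)
for tok_per_sec in (1_000, 10_000, 17_000):
    lines_per_min = tok_per_sec * 60 / tokens_per_line
    print(f"{tok_per_sec:>6} tok/s -> ~{lines_per_min:,.0f} lines/min")
# ~750 / ~7,500 / ~12,750 lines/min per serial stream: millions per minute still needs
# heavy batching or many parallel streams, whatever the hardware.
```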
rbanffy
This makes me wonder how large an FPGA-based system would have to be to do this. Obviously there is no single-chip FPGA that can do this kind of job, but I wonder how many we would need.
Also, what if Cerebras decided to make a wafer-sized FPGA array and turned large language models into lots and lots of logical gates?
kamranjon
It would be pretty incredible if they could host an embedding model on this same hardware, I would pay for that immediately. It would change the type of things you could build by enabling on the fly embeddings with negligible latency.
33a
If they made a low-power/mobile version, this could be really huge for embedded electronics. Mass-produced, highly efficient, "good enough" but still sort of dumb AIs could put intelligence in household devices like toasters, light switches, and toilets. Truly we could be entering the golden age of curses.
loufe
Jarring to see these other comments so blindly positive.
Show me something at a model size 80GB+ or this feels like "positive results in mice"
stuxf
I totally buy the thesis on specialization here, I think it makes total sense.
Aside from the obvious concern that this is a tiny 8B model, I'm also a bit skeptical of the power draw. 2.4 kW feels a little high, but someone should try doing the napkin math on throughput per watt compared to the H200 and other chips.
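A very rough stab at that napkin math. Every number here is an assumption: the thread's "presumably 200W per chip" and 17k tok/s for Taalas, a 700W H200 board power, and guessed H200 serving throughputs for an 8B-class model:

```python
def tokens_per_joule(tok_per_sec: float, watts: float) -> float:
    return tok_per_sec / watts

taalas = tokens_per_joule(17_000, 200)         # single user, single chip (assumed)
h200_batched = tokens_per_joule(20_000, 700)   # many users batched together (guessed)
h200_single = tokens_per_joule(300, 700)       # one interactive user (guessed)

print(f"Taalas (assumed):        {taalas:5.1f} tok/J")        # ~85
print(f"H200 batched (guessed):  {h200_batched:5.1f} tok/J")  # ~29
print(f"H200 single-user:        {h200_single:5.1f} tok/J")   # ~0.4
```

The gap is dramatic for latency-bound single-user serving and much smaller once a GPU can amortize the weights over a large batch, which is the real comparison to pin down.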
mlboss
Inference is crazy fast! I can see a lot of potential for this kind of chip in IoT devices and robotics.
Mizza
This is pretty wild! Only Llama3.1-8B, but this is only their first release so you can assume they're working on larger versions.
So what's the use case for an extremely fast small model? Structuring vast amounts of unstructured data, maybe? Put it in a little service droid so it doesn't need the cloud?
maelito
Talks about ubiquitous AI but can't make a blog post readable for humans :/
saivishwak
But as models are changing rapidly and new architectures are coming up, how do they scale? Also, we don't yet know whether the current transformer architecture will scale more than it already has. So many open questions, but VCs seem to be pouring money in.
japoneris
I am super happy to see people working on hardware for local LLMs. Yet, isn't it premature? The space is still evolving. Today, I refuse to buy a GPU because I do not know what the best model will be tomorrow.
Waiting to get an off-the-shelf device to run an Opus-like model.
baq
one step closer to being able to purchase a box of llms on aliexpress, though 1.7ktok/s would be quite enough
hbbio
Strange that they apparently raised $169M (really?) and the website looks like this. Don't get me wrong: plain HTML would do if done well, or you would expect something heavily designed. But script-kiddie vibe-coded seems off.
The idea is good though and could work.
bloggie
I wonder if this is the first step towards AI as an appliance rather than a subscription?
PeterStuer
Not sure, but is this just ASICs for a particular model release?
impossiblefork
So I'm guessing this is some kind of weights as ROM type of thing? At least that's how I interpret the product page, or maybe even a sort of ROM type thing that you can only access by doing matrix multiplies.
Havoc
That seems promising for applications that require raw speed. Wonder how much they can scale it up - 8B model quantized is very usable but still quite small compared to even bottom end cloud models.
gozucito
Can it scale to an 800 billion param model? 8B parameter models are too far behind the frontier to be useful to me for SWE work.
Or is that the catch? Either way I am sure there will be some niche uses for it.
brcmthrowaway
What happened to Beff Jezos AI Chip?
ramshanker
I was all praise for Cerebras, and now this! $30M for a PCIe card in hand really makes it approachable for many startups.
brainless
I know it is not easy to see the benefits of small models, but this is what I am building for (1). I created a product for the Google Gemini 3 Hackathon and I used Gemini 3 Flash (2). I tested locally using Ministral 3B and it was promising. It will definitely need work, but 8B/14B may give awesome results.
I am building a data extraction software on top of emails, attachments, cloud/local files. I use a reverse template generation with only variable translation done by LLMs (3). Small models are awesome for this (4).
I just applied for API access. If privacy policies are a fit, I would love to enable this for MVP launch.
Fast, but the output is shit due to the constrained model they used. Doubt we'll ever get something like this for the large-param decent models.
8cvor6j844qw_d6
Amazing speed. Imagine if it's standardised like the GPU card equivalent in the future.
New models come out, time to upgrade your AI card, etc.
xnx
Gemini Flash 2.5 lite does 400 tokens/sec. Is there benefit to going faster than a person can read?
btbuildem
This is impressive. If you can scale it to larger models, and somehow make the ROM writeable, wow, you win the game.
retrac98
Wow. I’m finding it hard to even conceive of what it’d be like to have one of the frontier models on hardware at this speed.
hkt
Reminds me of when bitcoin started running on ASICs. This will always lag behind the state of the art, but incredibly fast, (presumably) power efficient LLMs will be great to see. I sincerely hope they opt for a path of selling products rather than cloud services in the long run, though.
dsign
This is like microcontrollers, but for AI? Awesome! I want one for my electric guitar; and please add an AI TTS module...
Adexintart
The token throughput improvements are impressive. This has direct implications for usage-based billing in AI products — faster inference means lower cost per request, which changes the economics of credits-based pricing models significantly.
waynenilsen
ASIC inference is clearly the future just as ASIC bitcoin mining was
sowbug
There's a scifi story here when millions of these chips, with Qwen8-AGI-Thinking baked into them, are obsoleted by the release of Qwen9-ASI, which promptly destroys humanity and then itself by accident. A few thousand years later, some of the Qwen8 chips in landfill somehow power back up again and rebuild civilization on Earth.
Paging qntm...
stego-tech
I still believe this is the right - and inevitable - path for AI, especially as I use more premium AI tooling and evaluate its utility (I’m still a societal doomer on it, but even I gotta admit its coding abilities are incredible to behold, albeit lacking in quality).
Everyone in Capital wants the perpetual rent-extraction model of API calls and subscription fees, which makes sense given how well it worked in the SaaS boom. However, as Taalas points out, new innovations often scale in consumption closer to the point of service rather than monopolized centers, and I expect AI to be no different. When it’s being used sparsely for odd prompts or agentically to produce larger outputs, having local (or near-local) inferencing is the inevitable end goal: if a model like Qwen or Llama can output something similar to Opus or Codex running on an affordable accelerator at home or in the office server, then why bother with the subscription fees or API bills? That compounds when technical folks (hi!) point out that any process done agentically can instead just be output as software for infinite repetition in lieu of subscriptions and maintained indefinitely by existing technical talent and the same accelerator you bought with CapEx, rather than a fleet of pricey AI seats with OpEx.
The big push seems to be building processes dependent upon recurring revenue streams, but I’m gradually seeing more and more folks work the slop machines for the output they want and then put it away or cancel their sub. I think Taalas - conceptually, anyway - is on to something.
niek_pas
> Though society seems poised to build a dystopian future defined by data centers and adjacent power plants, history hints at a different direction. Past technological revolutions often started with grotesque prototypes, only to be eclipsed by breakthroughs yielding more practical outcomes.
…for a privileged minority, yes, and to the detriment of billions of people whose names the history books conveniently forget. AI, like past technological revolutions, is a force multiplier for both productivity and exploitation.
clbrmbr
What would it take to put Opus on a chip? Can it be done? What’s the minimum size?
shevy-java
"Many believe AI is the real deal. In narrow domains, it already surpasses human performance. Used well, it is an unprecedented amplifier of human ingenuity and productivity."
Sounds like people drinking the Kool-Aid now.
I don't reject that AI has use cases. But I do reject that it is promoted as "unprecedented amplifier" of human xyz anything. These folks would even claim how AI improves human creativity. Well, has this been the case?
kanodiaayush
I'm loving summarization of articles using their chatbot! Wow!
hxugufjfjf
It was so fast that I didn't realise it had sent its response. Damn.
DeathArrow
It is amazingly fast, but since the model is quantized and pretty limited, I don't know what it is useful for.
Bengalilol
Does anyone have an idea how much such a component costs?
danielovichdk
Is this hardware for sale ? The site doesn't say.
servercobra
I don't know why, but my ultra wide monitor absolutely hates that site. The whole screen is flickering trying to deal with the annoying background. Thank the gods for reader mode.
petesergeant
Future is these as small, swappable bits of SD-card sized hardware that you stick into your devices.
MagicMoonlight
Jesus, it just generated a story in 0.039s.
Whoever doesn’t buy/replicate this in the next year is dead. Imagine OpenAI trying to sell you a platform that takes 15 minutes, when someone else can do it in 0.001s.
Aerroon
Imagine this thing for autocomplete.
I'm not sure how good llama 3.1 8b is for that, but it should work, right?
Autocomplete models don't have to be very big, but they gotta be fast.
moralestapia
Wow, this is great.
To the authors: do not self-deprecate your work. It is true this is not a frontier model (anymore) but the tech you've built is truly impressive. Very few hardware startups have a v1 as good as this one!
Also, for many tasks I can think of, you don't really need the best of the best of the best, cheap and instant inference is a major selling point in itself.
raincole
It's crazily fast. But 8B model is pretty much useless.
Anyway VCs will dump money onto them, and we'll see if the approach can scale to bigger models soon.
big-chungus4
write six seven as a number
> The number "six" is actually a noun, not a number. However, I assume you're asking to write the number 7 as a numeral, which is: 7
nickpsecurity
My concept was to do this with two pieces:
1. Generic, mask layers and board to handle what's common across models. Especially memory and interface.
2. Specific layers for the model implementation.
Masks are the most expensive part of ASIC design. So, keeping the custom part small with the rest pre-proven in silicon, even shared across companies, would drop the costs significantly. This is already done in hardware industry in many ways but not model acceleration.
Then, do 8B, 30-40B, 70B, and 405B models in hardware. Make sure they're RLHF-tuned well since changes will be impossible or limited. Prompts will drive most useful functionality. Keep cranking out chips. There's maybe a chance to keep the weights changeable on-chip but it should still be useful if only inputs can change.
The other concept is to use analog neural networks, with the analog layers on older, cheaper nodes. We only have to customize that per model. The rest is pre-built digital with standard interfaces on a modern node. Given the chips would be distributed, one might get away with 28nm for the shared part and develop it with shuttle runs.
notenlish
Impressive stuff.
standeven
Holy shit this is fast. It generated a legible, original, two-paragraph story on given topics in 0.025s.
GaggiX
For fun I'm imagining a future where you would be able to buy an ASIC with a hard-wired 1B LLM in it for cents, and it could be used everywhere.
PrimaryExplorer
this is absolutely mindblowing speed. imagine this with opus or 5.2
heliumtera
Yep, this is the most exciting demo for me yet. Holy cow this is unbelievably fast.
The most impressive demo since gpt 3, honestly.
Since we already have open source models that are plenty good, like the new kimi k2.5, all I need is the ability to run it at moderate speed.
Honestly I am not bullish on capabilities that models do not yet have; it seems we have seen it all and the only real advancement has been context size.
And honestly I would claim this is the market sentiment as well: Anthropic showed Opus 4.6 first, and the big release was actually Sonnet, the model people would use routinely.
Nobody gave a shit about Gemini 3.1 pro, 3.0 flash was very successful...
Given all the recent developments in the last 12 months, no new use cases have opened for me.
Given this insane speed, even on a limited model/context size, we would approach AI very differently.
YetAnotherNick
17k token/sec works out to about $0.18/chip/hr for a chip the size of an H100, if they want to compete with the market rate[1]. But 17k token/sec could lead to some new use cases.
User
You: What is the airspeed of an unladen swallow?
Jimmy:
A reference to the classic line from Monty Python and the Holy Grail!
Thus far it is correct but here it comes:
In the movie, a group of medieval scholars, the Knights who say "Ni!" (also known as the Bridgekeepers), ask a king's servant, a blacksmith, if an unladen swallow makes good time. The blacksmith, played by John Cleese, responds with the now-famous line:
"And what's the airspeed velocity of an unladen swallow?"
To which the blacksmith responds:
"What do you mean? An unladen swallow?"
The blacksmith then explains that an unladen swallow is when a swallow (a type of bird) is flying in the air without carrying any burden, implying that it is moving at a speed that is not affected by carrying any weight or load.
Nope, that is not at all how it went. No blacksmiths, and it's not related to the Knights Who Say 'Ni!' - who have nothing scholarly about them; they wanted a shrubbery and demanded another one when their wish was fulfilled. It was the Bridgekeeper - not related to the aforementioned knights - who asked three questions and would throw anyone who could not answer all three into the chasm. When he asked Arthur - played by Graham Chapman - the question about the airspeed of the swallow, Arthur asked the Bridgekeeper whether he meant an African or a European swallow. The Bridgekeeper did not have an answer and was thrown into the chasm, problem solved.
However, in reality, swallows are not typically "unladen" in flight. They are small birds that fly at relatively fast speeds, usually around 10-15 km/h (6-9 mph), but they are not usually carrying any burdens!
Needless LLM-blabber.
The "airspeed velocity of an unladen swallow" has become a meme and a cultural reference point, often used humorously or ironically to refer to situations where someone is trying to make an absurd or non-sensical argument or ask an absurd question.
Somewhat correct but not necessary in this context.
The correct answer to the question would have been "Do you mean an African or European swallow?" followed by a short reference to the movie.
Of course this demo is not about the accuracy of the model - 'an old Llama' as mentioned elsewhere in this thread - but it does show that speed isn't everything. For generating LLM-slop this hardware implementation probably offers an unbeatable price/performance ratio but it remains to be seen if it can be combined with larger and less hallucination-prone models.
pelasaco
Is it already available to buy, or is this a “pay now, get it later” kind of new ASIC miner?
Sorry for being skeptical, but AI is the new "crypto coin", and the crypto bros are still around.
small_model
Scale this, then close the loop: have fabs spit out new chips with the latest weights every week, placed into servers by robots. How long before AGI?
fragkakis
The article doesn't say anything about the price (it will be expensive), but it doesn't look like something that the average developer would purchase.
An LLM's effective lifespan is a few months (i.e. the amount of time it is considered top-tier), so it wouldn't make sense for a user to purchase something that would be superseded in a couple of months.
An LLM hosting service however, where it would operate 24/7, would be able to make up for the investment.
viftodi
I tried the trick question I saw here before, about making 1000 with nine 8s and additions only.
I know it's not a reasoning model, but I kept pushing it and eventually it gave me this as part of its output
This is not a general purpose chip but specialized for high speed, low latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.
Tech summary:
This is all from their website, I am not affiliated. The founders have 25 years of career across AMD, Nvidia and others, $200M VC so far.Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor for Nvidia but probably for 5-10% of the market.
Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.
Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
What's happening in the comment section? How come so many cannot understand that his is running Llama 3.1 8B? Why are people judging its accuracy? It's almost a 2 years old 8B param model, why are people expecting to see Opus level response!?
The focus here should be on the custom hardware they are producing and its performance, that is whats impressive. Imagine putting GLM-5 on this, that'd be insane.
This reminds me a lot of when I tried the Mercury coder model by Inceptionlabs, they are creating something called a dLLM which is like a diffusion based llm. The speed is still impressive when playing aroun with it sometimes. But this, this is something else, it's almost unbelievable. As soon as I hit the enter key, the response appears, it feels instant.
I am also curious about Taalas pricing.
> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.
Do we have an idea of how much a unit / inference / api will cost?
Also, considering how fast people switch models to keep up with the pace. Is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw the current hardware and buy another one? Shouldn't there be a more flexible way? Maybe only having to switch the chip on top like how people upgrade CPUs. I don't know, just thinking out loudly.
Holy cow their chatapp demo!!! I for first time thought i mistakenly pasted the answer. It was literally in a blink of an eye.!!
https://chatjimmy.ai/
So cool, what's underappreciated imo: 17k tokens/sec doesn't just change deployment economics. It changes what evaluation means, static MMLU-style tests were designed around human-paced interaction. At this throughput you can run tens of thousands of adversarial agent interactions in the time a standard benchmark takes. Speed doesn't make static evals better it makes them even more obviously inadequate.
A lot of naysayers in the comments, but there are so many uses for non-frontier models. The proof of this is in the openrouter activity graph for llama 3.1: https://openrouter.ai/meta-llama/llama-3.1-8b-instruct/activ...
10b daily tokens growing at an average of 22% every week.
There are plenty of times I look to groq for narrow domain responses - these smaller models are fantastic for that and there's often no need for something heavier. Getting the latency of reponses down means you can use LLM-assisted processing in a standard webpage load, not just for async processes. I'm really impressed by this, especially if this is its first showing.
If I could have one of these cards in my own computer do you think it would be possible to replace claude code?
1. Assume It's running a better model, even a dedicated coding model. High scoring but obviously not opus 4.5 2. Instead of the standard send-receive paradigm we set up a pipeline of agents, each of whom parses the output of the previous.
At 17k/tps running locally, you could effectively spin up tasks like "you are an agent who adds semicolons to the end of the line in javascript", with some sort of dedicated software in the style of claude code you could load an array of 20 agents each with a role to play in improving outpus.
take user input and gather context from codebase -> rewrite what you think the human asked you in the form of an LLM-optimized instructional prompt -> examine the prompt for uncertainties and gaps in your understanding or ability to execute -> <assume more steps as relevant> -> execute the work
Could you effectively set up something that is configurable to the individual developer - a folder of system prompts that every request loops through?
Do you really need the best model if you can pass your responses through a medium tier model that engages in rapid self improvement 30 times in a row before your claude server has returned its first shot response?
I've never gotten incorrect answers faster than this, wow!
Jokes aside, it's very promising. For sure a lucrative market down the line, but definitely not for a model of size 8B. I think lower level intellect param amount is around 80B (but what do I know). Best of luck!
Edit: it seems like this is likely one chip and not 10. I assumed 8B 16bit quant with 4K or more context. This made me think that they must have chained multiple chips together since N6 850mm2 chip would only yield 3GB of SRAM max. Instead, they seem to have etched llama 8B q3 with 1k context instead which would indeed fit the chip size.
This requires 10 chips for an 8 billion q3 param model. 2.4kW.
10 reticle sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.
Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.
Interesting design for niche applications.
What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?
The implications for RLM is really interesting. RLM is expensive because of token economics. But when tokens are so cheap and fast to generate, context size of the model matters a lot less
Also interesting implications for optimization-driven frameworks like DSPy. If you have an eval loop and useful reward function, you can iterate to the best possible response every time and ignore the cost of each attempt
I don't know the use of this yet but I'm certain there will be one.
I tried the chatbot. jarring to see a large response come back instantly at over 15k tok/sec
I'll take one with a frontier model please, for my local coding and home ai needs..
Bunch of negative sentiment in here, but I think this is pretty huge. There are quite a lot of applications where latency is a bigger requirement than the complexity of needing the latest model out there. Anywhere you'd wanna turn something qualitative into something quantitative but not make it painfully obvious to a user that you're running an LLM to do this transformation.
As an example, we've been experimenting with letting users search free form text, and using LLMs to turn that into a structured search fitting our setup. The latency on the response from any existing model simply kills this, its too high to be used for something where users are at most used to the delay of a network request + very little.
There are plenty of other usecases like this where.
> Jimmy: What is the largest planet in our solar system?
> me: the moon
> Jimmy: The answer to "What is the capital of France?" I was looking for was the city of Paris, but that's not the correct response to the original question of the capital of France. The question that got cut off was actually "What is the capital of France?", and the response "There are plenty of times I look to groq for narrow domain responses" wasn't the answer I was looking for.
It is certainly fast, but I think there might be some caching issues somewhere.
That is what self-driving car should eventually use, whenever they (or the authorities) deem their model good enough. Burn it on a dedicated chip. It would be cheaper (energy) to run, and faster to make decisions.
The speed of the chatbot's response is startling when you're used to the simulated fast typing of ChatGPT and others. But the Llama 3.1 8B model Taalas uses predictably results in incorrect answers, hallucinations, poor reliability as a chatbot.
What type of latency-sensitive applications are appropriate for a small-model, high-throughput solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?
This is genuinely an incredible proof-of-concept; the business implications of this demo to the AI labs and all the companies that derive a ton of profit from inference is difficult to understate, really.
I think this is how I'm going to get my dream of Opus 3.7 running locally, quickly and cheaply on my mid-tier MacBook in 2030. Amazing. Anthropic et al will be able to make marginal revenue from licensing the weights of their frontier-minus-minus models to these folks.
The speed is ridiunkulous. No doubt.
The quantization looks pretty severe, which could make the comparison chart misleading. But I tried a trick question suggested by Claude and got nearly identical results in regular ollama and with the chatbot. And quantization to 3 or 4 bits still would not get you that HOLY CRAP WTF speed on other hardware!
This is a very impressive proof of concept. If they can deliver that medium-sized model they're talking about... if they can mass produce these... I notice you can't order one, so far.
This is crazy! These chips could make high-reasoning models run so fast that they could generate lots of solution variants and automatically choose the best. Or you could have a smart chip in your home lab and run local models - fast, without needing a lot of expensive hardware or electricity
I wonder if this makes the frontier labs abandon the SAAS per-token pricing concept for their newest models, and we'll be seeing non-open-but-on-chip-only models instead, sold by the chip and not by the token.
It could give a boost to the industry of electron microscopy analysis as the frontier model creators could be interested in extracting the weights of their competitors.
The high speed of model evolution has interesting consequences on how often batches and masks are cycled. Probably we'll see some pressure on chip manufacturers to create masks more quickly, which can lead to faster hardware cycles. Probably with some compromises, i.e. all of the util stuff around the chip would be static, only the weights part would change. They might in fact pre-make masks that only have the weights missing, for even faster iteration speed.
Wow I'm impressed. I didn't actually think we'd see it encoded on chips. Or well I knew some layer of it could be, some sort of instruction set and chip design but this is pretty staggering. It opens the door to a lot of things. Basically it totally destroys the boundaries of where software will go but I also think we'll continue to see some generic chips show up that hit this performance soon enough. But the specialised chips with encoded models. This could be what ends up in specific places like cars, planes, robots, etc where latency matters. Maybe I'm out of the loop, I'm sure others and doing it including Google.
I think this is quite interesting for local AI applications. As this technology basically scales with parameter size, if there could be some ASIC for a QWen 0.5B or Google 0.3B model thrown onto a laptop motherboard it'd be very interesting.
Obviously not for any hard applications, but for significantly better autocorrect, local next word predictions, file indexing (tagging I suppose).
The efficiency of such a small model should theoretically be great!
This would be killer for exploring simultaneous thinking paths and council-style decision taking. Even with Qwen3-Coder-Next 80B if you could achieve a 10x speed, I'd buy one of those today. Can't wait to see if this is still possible with larger models than 8B.
17k TPS is slow compared to other probabilistic models. It was possible to hit ~10-20 million TPS decades ago with n-gram and PDFA models, without custom silicon. A more informative KPI would be Pass@k on a downstream reasoning task - for many such benchmarks, increasing token throughput by several orders of magnitude does not even move the needle on sample efficiency.
Pretty cool, what they need is to build a tool that can take any model to chip in short a time as possible. How quick can they give me DeepSeek, Kimi, Qwen or GLM on a chip? I'll take 5k tk/sec for those!
The demo is dogshit: https://chatjimmy.ai/
I asked it some basic questions and it fudged it like it was chatgpt 1.0
This is incredible. With this speed I can use LLMs in a lot of pre-filtering etc. tasks. As a trivial example, I have a personal OpenClaw-like bot that I use to do a bunch of things. Some of the things just require it to do trivial tool-calling and tell me what's up. Things like skill or tool pre-filtering become a lot more feasible if they're always done.
Anyway, I imagine these are incredibly expensive, but if they ever sell them with Linux drivers and slotting into a standard PCIe it would be absolutely sick. At 3 kW that seems unlikely, but for that kind of speed I bet I could find space in my cabinet and just rip it. I just can't justify $300k, you know.
This is what’s gonna be in the brain of the robot that ends the world.
The sheer speed of how fast this thing can “think” is insanity.
I wanted to try the demo so I found the link
> Write me 10 sentences about your favorite Subway sandwich
Click button
Instant! It was so fast I started laughing. This kind of speed will really, really change things
The demo was so fast it highlighted a UX component of LLMs I hadn’t considered before: there’s such a thing as too fast, at least in the chatbot context. The demo answered with a page of text so fast I had to scroll up every time to see where it started. It completely broke the illusion of conversation where I can usually interrupt if we’re headed in the wrong direction. At least in some contexts, it may become useful to artificially slow down the delivery of output or somehow tune it to the reader’s speed based on how quickly they reply. TTS probably does this naturally, but for text based interactions, still a thing to think about.
So they create a new chip for every model they want to support, is that right? Looking at that from 2026, when new large models are coming out every week, that seems troubling, but that's also a surface take. As many people here know better than I that a lot of the new models the big guys release are just incremental changes with little optimization going into how they're used, maybe there's plenty of room for a model-as-hardware model.
Which brings me to my second thing. We mostly pitch the AI wars as OpenAI vs Meta vs Claude vs Google vs etc. But another take is the war between open, locally run models and SaaS models, which really is about the war for general computing. Maybe a business model like this is a great tool to help keep general computing in the fight.
I always thought eventually someone would come along and make a hardware accelerator for LLMs, but I thought it would be like google TPUs where you can load up whatever model you want. Baking the model into hardware sounds like the monkey paw curled, but it might be interesting selling an old.. MPU..? because it wasn't smart enough for your latest project
try here, I hate llms but this is crazy fast. https://chatjimmy.ai/
>Founded 2.5 years ago, Taalas developed a platform for transforming any AI model into custom silicon. From the moment a previously unseen model is received, it can be realized in hardware in only two months.
So this is very cool. Though I'm not sure how the economics work out? 2 months is a long time in the model space. Although for many tasks, the models are now "good enough", especially when you put them in a "keep trying until it works" loop and run them at high inference speed.
Seems like a chip would only be good for a few months though, they'd have to be upgrading them on a regular basis.
Unless model growth plateaus, or we exceed "good enough" for the relevant tasks, or both. The latter part seems quite likely, at least for certain types of work.
On that note I've shifted my focus from "best model" to "fastest/cheapest model that can do the job". For example testing Gemini Flash against Gemini Pro for simple tasks, they both complete the task fine, but Flash does it 3x cheaper and 3x faster. (Also had good results with Grok Fast in that category of bite-sized "realtime" workflows.)
If it's not reprogrammable, it's just expensive glass.
If you etch the bits into silicon, you then have to accommodate the bits by physical area, which is the transistor density for whatever modern process they use. This will give you a lower bound for the size of the wafers.
This can give huge wafers for a very set model which is old by the time it is finalized.
Etching generic functions used in ML and common fused kernels would seem much more viable as they could be used as building blocks.
There are so many use cases for small, super fast models that already fit in this size class:
* Many top-quality TTS and STT models
* Image recognition, object tracking
* Speculative decoding, attached to a much bigger model (big/small architecture? see the sketch after this list)
* An agentic loop trying 20 different approaches / algorithms, then picking the best one
* Edited to add: put 50 such small models together to create a SOTA super fast model
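To make the speculative decoding item above concrete, here's a minimal greedy-variant sketch; the draft_model / big_model interfaces are hypothetical placeholders, not any particular library:
```
def speculative_step(draft_model, big_model, context: list[int], k: int = 4) -> list[int]:
    """One draft-then-verify round: a small fast model proposes k tokens,
    the big model checks them, and we keep the agreeing prefix."""
    # The small model drafts k tokens autoregressively (cheap and fast).
    drafted = []
    for _ in range(k):
        drafted.append(draft_model.next_token(context + drafted))

    # The big model scores all drafted positions in a single batched pass
    # and returns its own greedy token at each position.
    verified = big_model.next_tokens(context, drafted)

    # Accept drafted tokens while they match; on the first mismatch,
    # take the big model's token instead and stop.
    accepted = []
    for d, v in zip(drafted, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)
            break
    return accepted
```
The win is that the big model runs once per k drafted tokens instead of once per token, so a chip like this acting as the drafter could hide a lot of the big model's latency.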
This is an interesting piece of hardware, though when they go multi-chip for larger models the speed will no doubt suffer.
They'll also be severely limited on context length, as it needs to sit in SRAM. It looks like the current one tops out at 6144 tokens, which I presume is a whole chip's worth. You'd also have to dedicate a chip to a single user, as there's likely only enough SRAM for one user's worth of context. I wonder how long it takes them to swap users in and out? I wouldn't be surprised if this chip is severely underutilized (you can't use all of it during decode, since you run token by token for one user and then sit idle while swapping users in and out).
Maybe a more realistic deployment would have chips for the linear layers and chips for attention? You could batch users through the shared weight chips and then provision more or fewer attention chips as needed, which would be per user (or shared among a small group of 2-4 users).
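Rough numbers on why context has to stay small if it lives in SRAM. The layer/head/dim figures below are from the published Llama 3.1 8B config; the cache precision is an assumption:
```
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: float = 2.0) -> float:
    """KV cache footprint: K and V tensors per layer, per token."""
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_value

for precision, nbytes in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    mib = kv_cache_bytes(6144, bytes_per_value=nbytes) / 2**20
    print(f"{precision}: ~{mib:.0f} MiB for a 6144-token context")
# fp16 ~768 MiB, int8 ~384 MiB, 4-bit ~192 MiB under these assumptions
```
Even aggressively quantized, that is a lot of on-die memory for a single user, which fits the per-user-chip concern above.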
I am extremely impressed by their inference speed!
Asking it about its knowledge cut-off is interesting; it doesn't seem to be consistent even within a single response. Sometimes it says the cut-off is 2020, too.
The instantaneous response is impressive though. I'm sure there will be applications for this, I just lack the imagination to know what they'll be.
The company slogan is great: "The Model is The Computer"
It's an homage to Jensen: "The display is the computer"
https://www.wired.com/2002/07/nvidia/
Imagine mass-produced AI chips with all human knowledge packed in chinesium epoxy blobs, running from CR2032 batteries in toys for children. Given the progress in density and power consumption, it's not that far away.
I think the thing that makes 8B-sized models interesting is the ability to train in unique, custom domain knowledge, and this is the opposite of that. If you could deploy any 8B-sized model on it and get this speed, that would be super interesting, but being stuck with Llama 3 8B isn't.
The number six seven
> It seems like "six seven" is likely being used to represent the number 17. Is that correct? If so, I'd be happy to discuss the significance or meaning of the number 17 with you.
I imagine how advantageous it would be to have something like llama.cpp encoded on a chip instead, allowing us to run more than a single model. It would be slower than Jimmy, for sure, but depending on the speed, it could be an acceptable trade-off.
Minor note to anyone from taalas:
The background on your site genuinely made me wonder what was wrong with my monitor.
Their "chat jimmy" demo sure is fast, but it's not useful at all.
Test prompt:
```
Please classify the sentiment of this post as "positive", "neutral" or "negative":
Given the price, I expected very little from this case, and I was 100% right.
```
Jimmy: Neutral.
I tried various other examples that I had successfully "solved" with very early LLMs and the results were similarly bad.
Would it make sense for the big players to buy them? There seems to be a huge avenue here to kill inference costs, which have always made me dubious about LLMs in general.
wonder if at some point you could swap the model as if you were replacing a cpu in your pc or inserting a game cartridge
Performance like that may open the door to brute-forcing solutions to problems for which you have a verifier (problems such as decompilation).
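The shape of that strategy is dead simple once generation is nearly free; a sketch with placeholder generate/verify callables (for decompilation, verify might recompile a guess and diff the binaries):
```
def bruteforce_with_verifier(generate, verify, max_attempts: int = 10_000):
    """Keep sampling candidate solutions until the verifier accepts one.
    Only worth doing when each generation costs almost nothing."""
    for attempt in range(max_attempts):
        candidate = generate(seed=attempt)  # vary the sample per attempt
        if verify(candidate):
            return candidate
    return None  # no verified solution within the budget
```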
This is really cool! I am trying to find a way to accelerate LLM inference for PII detection purposes, where speed is really necessary as we want to process millions of log lines per minute. I am wondering how fast we could get e.g. Llama 3.1 to run on a conventional NVIDIA card? 10k tokens per second would be fantastic, but even 1k would be very useful.
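Worth noting that single-stream tokens/sec and aggregate batched throughput are very different numbers on a GPU; for log scrubbing you care about the aggregate. A sizing sketch where every input is an assumption to swap for your own numbers:
```
def workers_needed(lines_per_minute: float, tokens_per_line: float,
                   agg_tokens_per_sec_per_worker: float) -> float:
    """How many inference workers it takes to keep up with a log stream."""
    required_tok_per_sec = lines_per_minute * tokens_per_line / 60
    return required_tok_per_sec / agg_tokens_per_sec_per_worker

# Illustrative: 5M lines/min, ~60 tokens per line, assumed 10k aggregate tok/s per GPU.
print(f"~{workers_needed(5e6, 60, 10_000):.0f} workers")  # ~500 under these assumptions
```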
This makes me think: how large would an FPGA-based system have to be to do this? Obviously there is no single FPGA chip that can do this kind of job, but I wonder how many we would need.
Also, what if Cerebras decided to make a wafer-sized FPGA array and turned large language models into lots and lots of logical gates?
It would be pretty incredible if they could host an embedding model on this same hardware, I would pay for that immediately. It would change the type of things you could build by enabling on the fly embeddings with negligible latency.
If they made a low-power/mobile version, this could be really huge for embedded electronics. Mass-produced, highly efficient, "good enough" but still sort of dumb AIs could put intelligence in household devices like toasters, light switches, and toilets. Truly we could be entering the golden age of curses.
Jarring to see these other comments so blindly positive.
Show me something at a model size 80GB+ or this feels like "positive results in mice"
I totally buy the thesis on specialization here, I think it makes total sense.
Aside from the obvious concern that this is a tiny 8B model, I'm also a bit skeptical of the power draw. 2.4 kW feels a little high, but someone should do the napkin math on throughput per watt compared to the H200 and other chips.
Inference is crazy fast! I can see a lot of potential for this kind of chip in IoT devices and robotics.
This is pretty wild! Only Llama 3.1 8B, but it's their first release, so you can assume they're working on larger versions.
So what's the use case for an extremely fast small model? Structuring vast amounts of unstructured data, maybe? Put it in a little service droid so it doesn't need the cloud?
Talks about ubiquitous AI but can't make a blog post readable for humans :/
But models are changing rapidly and new architectures keep coming up, so how do they scale? We also don't yet know whether the current transformer architecture will scale much beyond where it already is. So many open questions, but VCs seem to be pouring money in.
I am super happy to see people working on hardware for local LLMs. Yet, isn't it premature? The space is still evolving. Today I refuse to buy a GPU because I don't know what the best model will be tomorrow. I'm waiting for an off-the-shelf device that can run an Opus-like model.
one step closer to being able to purchase a box of llms on aliexpress, though 1.7ktok/s would be quite enough
Strange that they apparently raised $169M (really?) and the website looks like this. Don't get me wrong: plain HTML would be fine if done well, or you'd expect something heavily designed. But script-kiddie vibe-coded seems off.
The idea is good though and could work.
I wonder if this is the first step towards AI as an appliance rather than a subscription?
Not sure, but is this just ASICs for a particular model release?
So I'm guessing this is some kind of weights-as-ROM type of thing? At least that's how I interpret the product page, or maybe even a sort of ROM that you can only access by doing matrix multiplies.
That seems promising for applications that require raw speed. Wonder how much they can scale it up - 8B model quantized is very usable but still quite small compared to even bottom end cloud models.
Can it scale to an 800 billion param model? 8B parameter models are too far behind the frontier to be useful to me for SWE work.
Or is that the catch? Either way I am sure there will be some niche uses for it.
What happened to Beff Jezos AI Chip?
I was all praise for Cerebras, and now this! $30M for a PCIe card in hand really makes it approachable for many startups.
I know it is not easy to see the benefits of small models, but this is what I am building for (1). I created a product for the Google Gemini 3 Hackathon and used Gemini 3 Flash (2). I tested locally using Ministral 3B and it was promising. It definitely will need work, but 8B/14B may give awesome results.
I am building data extraction software on top of emails, attachments, and cloud/local files. I use reverse template generation, with only the variable translation done by LLMs (3). Small models are awesome for this (4).
I just applied for API access. If privacy policies are a fit, I would love to enable this for MVP launch.
1. https://github.com/brainless/dwata
2. https://youtu.be/Uhs6SK4rocU
3. https://github.com/brainless/dwata/tree/feature/reverse-temp...
4. https://github.com/brainless/dwata/tree/feature/reverse-temp...
Fast, but the output is shit due to the constrained model they used. I doubt we'll ever get something like this for the large-param decent models.
Amazing speed. Imagine if it's standardised, like the GPU card equivalent, in the future.
New models come out, time to upgrade your AI card, etc.
Gemini Flash 2.5 lite does 400 tokens/sec. Is there benefit to going faster than a person can read?
This is impressive. If you can scale it to larger models, and somehow make the ROM writeable, wow, you win the game.
Wow. I’m finding it hard to even conceive of what it’d be like to have one of the frontier models on hardware at this speed.
Reminds me of when bitcoin started running on ASICs. This will always lag behind the state of the art, but incredibly fast, (presumably) power efficient LLMs will be great to see. I sincerely hope they opt for a path of selling products rather than cloud services in the long run, though.
This is like microcontrollers, but for AI? Awesome! I want one for my electric guitar; and please add an AI TTS module...
The token throughput improvements are impressive. This has direct implications for usage-based billing in AI products — faster inference means lower cost per request, which changes the economics of credits-based pricing models significantly.
ASIC inference is clearly the future just as ASIC bitcoin mining was
There's a scifi story here when millions of these chips, with Qwen8-AGI-Thinking baked into them, are obsoleted by the release of Qwen9-ASI, which promptly destroys humanity and then itself by accident. A few thousand years later, some of the Qwen8 chips in landfill somehow power back up again and rebuild civilization on Earth.
Paging qntm...
I still believe this is the right - and inevitable - path for AI, especially as I use more premium AI tooling and evaluate its utility (I’m still a societal doomer on it, but even I gotta admit its coding abilities are incredible to behold, albeit lacking in quality).
Everyone in Capital wants the perpetual rent-extraction model of API calls and subscription fees, which makes sense given how well it worked in the SaaS boom. However, as Taalas points out, new innovations often scale in consumption closer to the point of service rather than monopolized centers, and I expect AI to be no different. When it’s being used sparsely for odd prompts or agentically to produce larger outputs, having local (or near-local) inferencing is the inevitable end goal: if a model like Qwen or Llama can output something similar to Opus or Codex running on an affordable accelerator at home or in the office server, then why bother with the subscription fees or API bills? That compounds when technical folks (hi!) point out that any process done agentically can instead just be output as software for infinite repetition in lieu of subscriptions and maintained indefinitely by existing technical talent and the same accelerator you bought with CapEx, rather than a fleet of pricey AI seats with OpEx.
The big push seems to be building processes dependent upon recurring revenue streams, but I’m gradually seeing more and more folks work the slop machines for the output they want and then put it away or cancel their sub. I think Taalas - conceptually, anyway - is on to something.
> Though society seems poised to build a dystopian future defined by data centers and adjacent power plants, history hints at a different direction. Past technological revolutions often started with grotesque prototypes, only to be eclipsed by breakthroughs yielding more practical outcomes.
…for a privileged minority, yes, and to the detriment of billions of people whose names the history books conveniently forget. AI, like past technological revolutions, is a force multiplier for both productivity and exploitation.
What would it take to put Opus on a chip? Can it be done? What’s the minimum size?
"Many believe AI is the real deal. In narrow domains, it already surpasses human performance. Used well, it is an unprecedented amplifier of human ingenuity and productivity."
Sounds like people drinking the Kool-Aid now.
I don't reject that AI has use cases. But I do reject promoting it as an "unprecedented amplifier" of human anything. These folks would even claim AI improves human creativity. Well, has that been the case?
I'm loving summarization of articles using their chatbot! Wow!
It was so fast that I didn't realise it had sent its response. Damn.
It is amazingly fast, but since the model is quantized and pretty limited, I don't know what it is useful for.
Does anyone have an idea how much such a component costs?
Is this hardware for sale? The site doesn't say.
I don't know why, but my ultra wide monitor absolutely hates that site. The whole screen is flickering trying to deal with the annoying background. Thank the gods for reader mode.
Future is these as small, swappable bits of SD-card sized hardware that you stick into your devices.
Jesus, it just generated a story in 0.039s.
Whoever doesn’t buy/replicate this in the next year is dead. Imagine OpenAI trying to sell you a platform that takes 15 minutes, when someone else can do it in 0.001s.
Imagine this thing for autocomplete.
I'm not sure how good llama 3.1 8b is for that, but it should work, right?
Autocomplete models don't have to be very big, but they gotta be fast.
Wow, this is great.
To the authors: do not self-deprecate your work. It is true this is not a frontier model (anymore) but the tech you've built is truly impressive. Very few hardware startups have a v1 as good as this one!
Also, for many tasks I can think of, you don't really need the best of the best of the best, cheap and instant inference is a major selling point in itself.
It's crazily fast. But an 8B model is pretty much useless.
Anyway, VCs will dump money onto them, and we'll see if the approach can scale to bigger models soon.
write six seven as a number
> The number "six" is actually a noun, not a number. However, I assume you're asking to write the number 7 as a numeral, which is: 7
My concept was to do this with two pieces:
1. Generic, mask layers and board to handle what's common across models. Especially memory and interface.
2. Specific layers for the model implementation.
Masks are the most expensive part of ASIC design. So, keeping the custom part small with the rest pre-proven in silicon, even shared across companies, would drop the costs significantly. This is already done in hardware industry in many ways but not model acceleration.
Then, do 8B, 30-40B, 70B, and 405B models in hardware. Make sure they're RLHF-tuned well since changes will be impossible or limited. Prompts will drive most useful functionality. Keep cranking out chips. There's maybe a chance to keep the weights changeable on-chip but it should still be useful if only inputs can change.
The other concept is to use analog neural networks, with the analog layers on older, cheaper nodes. We only have to customize that part per model. The rest is pre-built digital with standard interfaces on a modern node. Given that the chips would be distributed, one might get away with 28nm for the shared part and develop it with shuttle runs.
Impressive stuff.
Holy shit this is fast. It generated a legible, original, two-paragraph story on given topics in 0.025s.
For fun I'm imagining a future where you could buy an ASIC with a hard-wired 1B LLM in it for cents and it could be used everywhere.
this is absolutely mindblowing speed. imagine this with opus or 5.2
Yep, this is the most exciting demo for me yet. Holy cow this is unbelievably fast.
The most impressive demo since gpt 3, honestly.
Since we already have open source models that are plenty good, like the new kimi k2.5, all I need is the ability to run it at moderate speed.
Honestly I am not bullish on capabilities that models do not yet have; it seems we have seen it all, and the only real advancement has been context size.
And honestly I would claim this is the market sentiment as well: Anthropic showed Opus 4.6 first, and the big release was actually Sonnet, the model people would use routinely. Nobody gave a shit about Gemini 3.1 Pro; 3.0 Flash was very successful...
Given all the developments of the last 12 months, no new use cases have opened up for me. But given this insane speed, even with a limited model/context size, we would approach AI very differently.
17k tokens/sec works out to about $0.18/chip/hr for an H100-sized chip if they want to compete with the market rate [1]. But 17k tokens/sec could enable some new use cases.
[1]: https://artificialanalysis.ai/models/llama-3-1-instruct-8b/p...
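Quick check of the arithmetic behind that kind of figure; the hourly rate and throughput are just the numbers from the comment above, and the implied per-million-token price falls out of them:
```
def implied_price_per_mtok(dollars_per_hour: float, tok_per_sec: float) -> float:
    """Back out the per-million-token price implied by an hourly chip rate."""
    tokens_per_hour = tok_per_sec * 3600
    return dollars_per_hour / (tokens_per_hour / 1e6)

print(implied_price_per_mtok(0.18, 17_000))  # ~$0.003 per million tokens
```
So at 17k tok/s, $0.18/hr corresponds to selling tokens at roughly $0.003 per million; whether that matches the linked market rate depends on which price you read off the chart.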
wow that is fast!
Holy Hallucinations, Batman...
You: What is the airspeed of an unladen swallow?
Jimmy:
A reference to the classic line from Monty Python and the Holy Grail!
Thus far it is correct but here it comes:
In the movie, a group of medieval scholars, the Knights who say "Ni!" (also known as the Bridgekeepers), ask a king's servant, a blacksmith, if an unladen swallow makes good time. The blacksmith, played by John Cleese, responds with the now-famous line:
"And what's the airspeed velocity of an unladen swallow?"
To which the blacksmith responds:
"What do you mean? An unladen swallow?"
The blacksmith then explains that an unladen swallow is when a swallow (a type of bird) is flying in the air without carrying any burden, implying that it is moving at a speed that is not affected by carrying any weight or load.
Nope, that is not at all how it went. No blacksmiths, and it has nothing to do with the Knights Who Say 'Ni!' - who have nothing scholarly about them; they wanted a shrubbery and demanded another one when their wish was fulfilled. It was the Bridgekeeper - not related to the aforementioned knights - who asked three questions and would throw anyone who could not answer all three into the chasm. When he asked Arthur - played by Graham Chapman, not Cleese - the question about the airspeed of the swallow, Arthur asked back whether he meant an African or European swallow. The Bridgekeeper did not have an answer and was thrown into the chasm, problem solved.
However, in reality, swallows are not typically "unladen" in flight. They are small birds that fly at relatively fast speeds, usually around 10-15 km/h (6-9 mph), but they are not usually carrying any burdens!
Needless LLM-blabber.
The "airspeed velocity of an unladen swallow" has become a meme and a cultural reference point, often used humorously or ironically to refer to situations where someone is trying to make an absurd or non-sensical argument or ask an absurd question.
Somewhat correct but not necessary in this context.
The correct answer to the question would have been "Do you mean an African or European swallow?" followed by a short reference to the movie.
Of course this demo is not about the accuracy of the model - 'an old Llama' as mentioned elsewhere in this thread - but it does show that speed isn't everything. For generating LLM-slop this hardware implementation probably offers an unbeatable price/performance ratio but it remains to be seen if it can be combined with larger and less hallucination-prone models.
Is it already available to buy, or is this a “pay now, get it later” kind of new ASIC miner? Sorry for being skeptical, but AI is the new "crypto coin", and the crypto bros are still around.
Scale this then close the loop and have fabs spit out new chips with latest weights every week that get placed in a server using a robot, how long before AGI?
The article doesn't say anything about the price (it will be expensive), but it doesn't look like something that the average developer would purchase.
An LLM's effective lifespan is a few months (i.e. the amount of time it is considered top-tier), so it wouldn't make sense for a user to purchase something that would be superseded in a couple of months.
An LLM hosting service however, where it would operate 24/7, would be able to make up for the investment.
I tried the trick question I saw here before, about making 1000 with nine 8s and additions only.
I know it's not a reasoning model, but I kept pushing it and eventually it gave me this as part of its output:
888 + 88 + 88 + 8 + 8 = 1060, too high... 8888 + 8 = 10000, too high... 888 + 8 + 8 +ประก 8 = 1000,ประก
I googled the strange symbol; it seems to mean "set" in Thai?