8B coefficients are packed into 53B transistors, i.e. 6.5 transistors per coefficient. A two-input NAND gate takes 4 transistors and a register takes about as many. So each coefficient gets processed (multiplied, with the result added to a running sum) in less than the area of two two-input NAND gates.
I think they used block quantization: enumerate all possible blocks of (sorted) coefficient values, and for each layer place only the blocks that are actually needed there. For 3-bit coefficients and a block size of 4, only 330 distinct blocks exist (combinations with repetition: 8 values taken 4 at a time).
Matrices in Llama 3.1 are 4096x4096, about 16M coefficients each. They can be compressed down to those 330 blocks, if we assume all the needed combinations occur and add a routing network that applies the correct permutations of inputs and outputs.
Assuming the blocks are the most area-consuming part, the per-block transistor budget is about 250 thousand transistors, or roughly 30 thousand two-input NAND gates per block.
250K transistors per block * 330 blocks / 16M coefficients ≈ 5 transistors per coefficient.
Looks very, very doable.
It does look doable even for FP4 - these are 3-bit coefficients in disguise.
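A minimal sketch of that back-of-the-envelope arithmetic (the 330 comes out as combinations with repetition; the 250K-per-block and 4096x4096 figures are the ones assumed above, not measured values):

```python
from math import comb

# Number of distinct sorted blocks: combinations with repetition of
# `levels` coefficient values taken `block_size` at a time.
levels = 2 ** 3              # 3-bit coefficients -> 8 distinct values
block_size = 4               # coefficients per block
distinct_blocks = comb(levels + block_size - 1, block_size)
print(distinct_blocks)       # 330

# Transistors per coefficient if one 4096x4096 matrix is served by
# 330 physical blocks of ~250K transistors each.
transistors_per_block = 250_000
coeffs_per_matrix = 4096 * 4096                  # ~16.8M
per_coeff = transistors_per_block * distinct_blocks / coeffs_per_matrix
print(round(per_coeff, 1))                       # ~4.9
```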
kop316
Ohh neat! A generalized version of this was the topic of my PhD dissertation:
https://kilthub.cmu.edu/articles/thesis/Modern_Gate_Array_De...
And they are likely doing something similar to put their LLMs in silicon. I would believe a 10x power-efficiency boost along with it being much faster.
The idea is that you can create a sea of generalized standard cells, which makes for a gate array at the manufacturing layer. This was also done 20 or so years ago; it was called a "structured ASIC".
I'd be curious to see if they use the LUT design of traditional structured ASICs or figured out what I did: you can use standard cells to do the same thing and use regular tools/PDKs to make it.
MarcLore
The form factor discussion is fascinating but I think the real unlock is latency. Current cloud inference adds 50-200ms of network overhead before you even start generating tokens. A dedicated ASIC sitting on PCIe could serve first token in microseconds.
For applications like real-time video generation or interactive agents that need sub-100ms response loops, that difference is everything. The cost per inference might be higher than a GPU cluster at scale, but the latency profile opens up use cases that simply aren't possible with current architectures.
Curious whether Taalas has published any latency benchmarks beyond the throughput numbers.
Hello9999901
This would be a very interesting future. I can imagine Gemma 5 Mini running locally on hardware, or a hard-coded "AI core" like an ALU or media processor that supports particular encoding mechanisms like H.264, AV1, etc.
Other than the obvious costs (though Taalas seems to be bringing back the structured ASIC era, so costs shouldn't be that high [1]), I'm curious why this isn't getting much attention from larger companies. Of course, this wouldn't be useful for training models, but as models keep improving I can totally see this inside fully local + ultrafast + ultra-efficient processors.
[1] https://en.wikipedia.org/wiki/Structured_ASIC_platform
I'm surprised people are surprised. Of course this is possible, and of course this is the future. It has been demonstrated already: why do you think we even have GPUs at all? Because we made this exact same transition, from running in software to largely running in hardware, for all 2D and 3D computer graphics. These LLMs are practically the same math, so it's all obvious and inevitable if you're paying attention to what we have and how we got it.
owenpalmer
> Kinda like a CD-ROM/Game cartridge, or a printed book, it only holds one model and cannot be rewritten.
Imagine a slot on your computer where you physically pop out and replace the chip with different models, sort of like a Nintendo DS.
brainless
If we can print ASICs at low cost, this will change how we work with models.
Models would be available as USB plug-in devices. A dense < 20B model may be the best assistant we need for personal use. It is like graphics cards all over again.
I hope lots of vendors take note. Open-weight models are abundant now. Even at a few thousand tokens/second, with a low purchase cost and low operating cost, this is massive.
cpldcpu
I wonder how well this works with MoE architectures?
For dense LLMs, like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware.
With MoE, it is rather like a memory lookup. Instead of a 1:1 pairing of MACs to stored weights, you suddenly are forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain by using a highly optimized memory process for the memory instead of mask ROM.
At that point we are back to a chiplet approach...
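A rough way to see the MAC-to-weight mismatch described above, using a hypothetical expert configuration (the parameter counts below are invented for illustration, not Llama or Taalas numbers):

```python
# Fraction of stored weights that are actually used per token: a proxy for
# how much compute you can justify placing next to each stored weight.
def active_fraction(total_experts: int, active_experts: int,
                    shared_params: float, params_per_expert: float) -> float:
    total = shared_params + total_experts * params_per_expert
    active = shared_params + active_experts * params_per_expert
    return active / total

# Dense model: every stored weight participates in every token.
print(active_fraction(1, 1, shared_params=2e9, params_per_expert=6e9))              # 1.0

# Hypothetical MoE: 64 experts, 2 routed per token.
print(round(active_fraction(64, 2, shared_params=2e9, params_per_expert=5e8), 3))   # ~0.09
```

The lower that fraction, the more the chip looks like a big memory with a little compute attached, which is the point at which a dedicated memory process starts to win over mask ROM.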
Archit3ch
The next frontier is power efficiency.
So how does this Taalas chip work? Analog compute by putting the weights/multipliers on the cross-bars? Transistors in the sub-threshold region? Something else?
odyssey7
Quick! We have to approve all the nuclear plants for AI now, before efficiency from optimization shows up
umairnadeem123
from someone who runs AI inference pipelines for video production -- the cost per inference is what actually matters to me, not raw speed. right now i'm paying ~$0.003 per image generation and ~7 cents per 10-second animation clip. a full video costs under $2 in compute.
if dedicated ASICs can drop that by 10x while keeping latency reasonable, that changes the economics of the whole content creation space. you could afford to generate way more variations and iterate more, which is where the real quality gains come from. the bottleneck isn't speed, it's cost per creative iteration.
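A tiny sketch of that economics point, using the per-clip price quoted above and an assumed (hypothetical) 10x ASIC cost reduction:

```python
# How many clip variations fit in a fixed compute budget, now vs. 10x cheaper.
cost_per_clip = 0.07      # ~7 cents per 10-second animation clip (quoted above)
budget = 2.00             # roughly one full video's compute budget

for label, cost in [("current", cost_per_clip), ("10x cheaper", cost_per_clip / 10)]:
    print(f"{label}: {int(budget // cost)} clip iterations per video budget")
# The win is in iteration count (28 -> 285), not in how fast any one clip renders.
```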
TensorToad
Super low latency inference might be helpful in applications like quant trading.
However, in an era where a frontier model becomes outdated after 6 months, I wonder how useful it can be.
ramshanker
I can imagine this becoming a mainstream PCIe expansion card. Like back in the day we had separate graphics cards, audio cards, etc. Now an AI card. So to upgrade the PC to the latest model, we could buy a new card, load up the drivers and boom, intelligence upgrade for the PC. This would be so cool.
kioku
I’m just wondering how this translates to computer manufacturers like Apple. Could we have these kinds of chips built directly into computers within three years? With insanely fast, local on-demand performance comparable to today’s models?
briansm
I wonder if you could use the same technique (baking the model into ROM instead of loading it into RAM) for something like Whisper speech-to-text, where the models are much smaller (around a gigabyte), to get a super-efficient single-chip speech recognition solution with tons of context knowledge.
snowhale
the LoRA on-chip SRAM angle is interesting but also where this gets hard. the whole pitch is that weights are physical transistors, but LoRA works by adding a low-rank update to those weights at inference time. so you're either doing it purely in SRAM (limited by how much you can fit) or you have to tape out a new chip for each fine-tune. neither is great. might end up being fast but inflexible -- good for commodity tasks, not for anything that needs customization per customer.
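For reference, the low-rank update being described, as a generic LoRA forward pass (the split between silicon-fixed W and SRAM-resident A, B is the commenter's hypothesis, and the sizes here are illustrative):

```python
import numpy as np

d, r = 4096, 16                          # hidden size, LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # base weights: fixed (etched in silicon, per the pitch)
A = rng.standard_normal((r, d)) * 0.01   # LoRA factors: tiny, so they could sit in on-chip SRAM
B = np.zeros((d, r))                     # zero-initialized, so the update starts as a no-op

x = rng.standard_normal(d)
y = W @ x + B @ (A @ x)                  # y = Wx + B(Ax); only A and B change per fine-tune
print(y.shape, A.size + B.size)          # (4096,) and ~131K extra parameters vs. ~16.8M in W
```

The appeal is that A and B are under 1% of W's size here, but the rank (and therefore the SRAM budget) caps how much behavior a fine-tune can actually change.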
peteforde
I would appreciate some clarification on the "store 4 bits of data with one transistor" part.
This doesn't sound remotely possible, but I am here to be convinced.
rustybolt
Note that this doesn't answer the question in the title, it merely asks it.
qoez
> It took them two months, to develop chip for Llama 3.1 8B. In the AI world where one week is a year, it's super slow. But in a world of custom chips, this is supposed to be insanely fast.
Llama 3.1 is like 2 years old at this point. Taking two months to convert a model that only updates every 2 years is very fast.
atentaten
Does this mean computer boards will someday have one or more slots for an AI chip? Or peripheral devices containing AI models, which can be plugged into a computer's high-speed port?
wangzhongwang
The 6.5 transistors per coefficient ratio is fascinating. At 3-bit quantization you're already losing a lot of model quality, so the real question is whether the latency gains from running directly on silicon make up for the accuracy loss.
For inference-heavy edge deployments (think always-on voice assistants or real-time video processing), this could be huge even with degraded accuracy. You don't need GPT-4 quality for most embedded use cases. But for anything that needs to be updated or fine-tuned, you're stuck with a new chip fab cycle, which kind of defeats the purpose of using neural nets in the first place.
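For a feel of the quality cost being traded away, here is a generic round-to-nearest quantization of Gaussian weights at different bit widths (this is not Taalas' actual scheme, which presumably uses per-block scaling and calibration):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000)               # stand-in for a weight tensor

def quantize(x, bits):
    levels = 2 ** bits
    scale = np.abs(x).max() / (levels / 2 - 1)   # naive symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -(levels // 2), levels // 2 - 1)
    return q * scale

for bits in (8, 4, 3):
    rel_err = np.linalg.norm(w - quantize(w, bits)) / np.linalg.norm(w)
    print(bits, "bits -> relative weight error", round(float(rel_err), 3))
```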
abrichr
ChatGPT Deep Research dug through Taalas' WIPO patent filings and public reporting to piece together a hypothesis. Next Platform notes at least 14 patents filed [1]. The two most relevant:
"Large Parameter Set Computation Accelerator Using Memory with Parameter Encoding" [2]
"Mask Programmable ROM Using Shared Connections" [3]
The "single transistor multiply" could be multiplication by routing, not arithmetic. Patent [2] describes an accelerator where, if weights are 4-bit (16 possible values), you pre-compute all 16 products (input x each possible value) with a shared multiplier bank, then use a hardwired mesh to route the correct result to each weight's location. The abstract says it directly: multiplier circuits produce a set of outputs, readable cells store addresses associated with parameter values, and a selection circuit picks the right output. The per-weight "readable cell" would then just be an access transistor that passes through the right pre-computed product. If that reading is correct, it's consistent with the CEO telling EE Times compute is "fully digital" [4], and explains why 4-bit matters so much: 16 multipliers to broadcast is tractable, 256 (8-bit) is not.
The same patent reportedly describes the connectivity mesh as configurable via top metal masks, referred to as "saving the model in the mask ROM of the system." If so, the base die is identical across models, with only top metal layers changing to encode weights-as-connectivity and dataflow schedule.
Patent [3] covers high-density multibit mask ROM using shared drain and gate connections with mask-programmable vias, possibly how they hit the density for 8B parameters on one 815mm2 die.
If roughly right, some testable predictions: performance very sensitive to quantization bitwidth; near-zero external memory bandwidth dependence; fine-tuning limited to what fits in the SRAM sidecar.
Caveat: the specific implementation details beyond the abstracts are based on Deep Research's analysis of the full patent texts, not my own reading, so could be off. But the abstracts and public descriptions line up well.
[1] https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
[2] https://patents.google.com/patent/WO2025147771A1/en
[3] https://patents.google.com/patent/WO2025217724A1/en
[4] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
So why only 30,000 tokens per second?
If the chip is designed as the article says, they should be able to do 1 token per clock cycle...
And whilst I'm sure the propagation time is long through all that logic, it should still be able to do tens of millions of tokens per second...
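For scale, the gap between the quoted 30,000 tokens/s and a one-token-per-cycle design (the 1 GHz clock is purely an assumption for illustration):

```python
clock_hz = 1e9                    # assumed clock rate; not stated in the article
observed_tok_s = 30_000           # the figure being questioned above

ideal_tok_s = clock_hz            # 1 token per cycle, fully pipelined
cycles_per_token = clock_hz / observed_tok_s
print(f"ideal: {ideal_tok_s:.0e} tok/s, implied cycles per token: {cycles_per_token:,.0f}")
# ~33,000 cycles per token at 1 GHz -- far from one token per clock cycle.
```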
punnerud
Could we all get bigger FPGAs and load the model onto it using the same technique?
coppsilgold
How feasible would it be to integrate a neural video codec into the SoC/GPU silicon?
There would be constraints on model size, and on what quality could be achieved under those constraints.
Would be interesting if it didn't make sense to develop traditional video codecs anymore.
The current video<->latents networks (part of the generative AI model for video) don't optimize just for compression. And you probably wouldn't want variable size input in an actual video codec anyway.
rustyhancock
Edit: reading the below it looks like I'm quite wrong here but I've left the comment...
The single transistor multiply is intriguing.
I'd assume they are layers of FMAs operating in the log domain.
But everything tells me that would be too noisy and error prone to work.
On the other hand my mind is completely biased to the digital world.
If they stay in the log domain and use a resistor network for multiplication, and the transistor is just exponentiating for the addition that seems genuinely ingenious.
Mulling it over, actually the noise probably doesn't matter. It'll average to 0.
It's essentially compute and memory baked together.
I don't know much about the area of research so can't tell if it's innovative but it does seem compelling!
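For reference, the log-domain identity being invoked, plus the exponential device law that would convert back (textbook analog-computing math, not a claim about what Taalas actually builds):

```latex
\[
  \log(w x) = \log w + \log x
  \qquad\Longrightarrow\qquad
  w x = \exp\bigl(\log w + \log x\bigr),
\]
\[
  I_D \approx I_0\, e^{V_{GS}/(n V_T)}
  \quad \text{(subthreshold MOSFET: an exponential element that can undo the log).}
\]
```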
albert_e
Does this offer truly "deterministic" responses when temperature is set to zero?
(Of course excluding any cosmic rays / bit flips)?
I didn't see an editable temperature parameter on their chatjimmy demo site -- only a topK.
kinduff
Very nice read, thank you for sharing; it's very well written.
m101
So if we assume this is the future, the useful life of many semiconductors will fall substantially. What part of the semiconductor supply chain would have pricing power in a world of producing many more different designs?
Perhaps mask manufacturers?
jabedude
Just me, or does this seem incredibly frightening to anyone else? Imagine printing a misaligned LLM this way and never being able to update the HW to run a different (aligned) model.
708145_
Is Taalas' approach scalable to larger models?
konaraddi
Imagine a Framework* laptop with these kinds of chips that could be swapped out as models get better over time
*Framework sells laptops and parts such that in theory users can own a ~~ship~~ laptop of Theseus over time without having to buy a whole new laptop when something breaks or needs upgrade.
trebligdivad
Hmm, I guess you'll end up with a pile of used boards, which is not great from a waste perspective; but I guess they will get reused for a few generations.
A problem is that it doesn't seem to be just the chips that would be thrown out but the whole board, which gets silly.
midnitewarrior
If model makers adopt an LTS model with an extended EOL for certain model versions, these chips would make that very affordable.
dev1ycan
Thank god, I hope this reduces prices of RAM and GPUs
throwaway85825
Few customers value tokens anywhere near what it costs the big API vendors. When the bubble pops the only survivors will be whoever can offer tokens at as close to zero cost as possible. Also whoever is selling hardware for local AI.
lm28469
Who's going to pay for custom chips when they shit out new models every two weeks and their deluded CEOs keep promising AGI in two release cycles?
moralestapia
>HOW NVIDIA GPUs process stuff? (Inefficiency 101)
Wow. Massively ignorant take. A modern GPU is an amazing feat of engineering, particularly when it comes to making computation more efficient (low power / high throughput).
Then it proceeds to explain, wrongly, how inference is supposedly implemented and draws conclusions from there ...
villgax
This read itself is slop lol, it literally dances around the term 'printing' as if it's some inkjet printer.
sargun
Isn't the highly connected nature of the model layers problematic to build into the physical layer?