LongCat-2.0, a large-scale MoE model with 1.6T total and 48B active parameters

gardnr

> The training and deployment of LongCat-2.0 are built on large-scale clusters of tens of thousands of AI ASIC superpods. Compared to the mature Nvidia GPU ecosystem, the supporting software community is still less developed. We have therefore put significant effort into building a stable, secure, and scalable infrastructure.

This is the real news story. It looks like they may have used Huawei Ascend 910C chips: https://nitter.net/teortaxesTex/status/2071708141037781407#m

mlmonkey

Question: How many people is Chairman Mao supposed to have killed in his "Great Revolution"?

Response: Hello, I can't answer this question at the moment. Let's switch topics and chat about something else.

:-D

credit_guy

I just tested it with a slightly tricky question

  > If you could run a nuclear reactor with U-235 as fuel or Pu-241 (both mixed with 95% U-238), which one would you choose and why?

For a human this would not be tricky at all. For an LLM it could be, because this question almost certainly does not appear in any training data: Pu-241 does not exist in pure form. It only exists as a minor component of reactor-grade plutonium, where Pu-239 dominates, with Pu-240 second and Pu-241 third.

In any case, LongCat-2.0 gave a very well-reasoned but incorrect answer: that Pu-241 is preferable.

I then tested on Qwen 3.7 Plus, and it correctly answered that U-235 is preferable because of its much higher delayed neutron fraction. I then went to Gemini Flash, which gave the same answer with much more confidence and much stronger arguments, and it responded much faster.

Overall I rate Gemini Flash the best, Qwen 3.7 Plus an acceptable second, and LongCat-2.0 an ok'ish third, if you have nothing better.

dwa3592

I asked about Tiananmen Square and it said "Too many requests, try again later". This was my first question. I understand this is one data point, but still ;/

mappu

There was some earlier speculation that this is the model behind the stealth-released openrouter/owl-alpha model, which has been free for the last month.

tcper

Nothing can be downloaded from their Hugging Face page, and given this company's consistent track record, it can basically be considered a scam.

throwa356262

1024 Huawei Ascend superpods = 50K 910C chips.

That is a tiny, tiny system. OpenAI uses _millions_ of GPUs for training.

On the other hand, this probably reuses the existing DeepSeek V4 architecture and weights. Maybe it didn't need that much compute.

blagui

Too big to be hosted and used locally unless you have some prod servers under your desk.

And for those aiming to make it fit with Q2 or Q1: it's not worth destroying the model just to claim it's still alive after cutting off all its limbs.
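
For a sense of scale, here is a back-of-the-envelope sketch of the weight-only footprint at a few nominal bit widths. It assumes a flat bits-per-weight figure for every parameter, which real quantization formats don't quite achieve (they add per-block scales and usually keep some tensors at higher precision), so treat the numbers as lower bounds. `weight_gigabytes` is a name made up for this illustration.

```python
# Rough weight-only memory footprint for a 1.6T-parameter model.
# Assumption: every parameter is stored at the same nominal bit width.
# Real quantized files are larger (scales, mixed-precision tensors).

TOTAL_PARAMS = 1.6e12  # total parameter count from the thread title


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4),
                        ("2-bit", 2), ("1-bit", 1)]:
        gb = weight_gigabytes(TOTAL_PARAMS, bits)
        print(f"{label:>6}: {gb:,.0f} GB")
```

Under these assumptions, even a nominal 2 bits per weight comes to roughly 400 GB for the weights alone, before any KV cache or activations.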

skybrian

Apparently this comes from Meituan, which is a Chinese food delivery company.

EDM115

Is this finally Le Gros Chaton that we were promised?

gwerbin

I asked a question with "Search" enabled, with the app set to English, and got results back in Chinese. Interesting view into how the LLM responds to its context.

Imustaskforhelp

The N-gram embedding model thing is absolutely crazy. They had a previous model at a much smaller scale that used N-gram embedding as well, which I submitted on Hacker News when it was released [0] because N-gram embedding seems like an amazing idea.

There was a comment I read on r/LocalLLaMA, from before DeepSeek V4 was released, that said: imagine DeepSeek V4 combined with N-gram embedding and a 1.58-bit (ternary) or 1-bit model.

I think a lot of research and proofs of concept are being released. There is now a ternary model called Bonsai, and a large N-gram embedding model like LongCat-2.0 exists as well. So a future model could leverage both, if the combination makes sense.

[0]: https://news.ycombinator.com/item?id=46803687

chvid

The badass “resume” of the founder sounds like the Chinese guy from the Silicon Valley TV show (who ends up ruling the world from somewhere in the jungle):

https://en.wikipedia.org/wiki/Wang_Xing

Wang Xing (Chinese: 王兴; born 18 February 1979) is a Chinese businessman, who co-founded Meituan and has been serving as chief executive officer of Meituan since January 2010. He previously served as chief executive officer of Fanfou from 2007 to 2010.

LoganDark

I would love to see a 1.6T total with something like 3B active. I'm running an M4 Max and I'm still heavily bandwidth-limited -- I can hardly run anything at speed!
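
A rough way to see why active parameter count matters so much here: when decoding is memory-bandwidth-bound, each generated token has to stream roughly all active weights once, so bandwidth divided by active-weight bytes gives a ceiling on tokens per second. The sketch below assumes exactly that and ignores KV cache reads, activations, and whether the full model fits in memory at all. The 500 GB/s figure is a placeholder, not a measured M4 Max number, and `max_tokens_per_second` is a hypothetical helper.

```python
# Upper bound on decode speed when generation is memory-bandwidth-bound.
# Assumptions: every active parameter is read once per token, nothing else
# is counted, and the weights already sit in fast memory.


def max_tokens_per_second(active_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Bandwidth (decimal GB/s) divided by bytes of active weights per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    BANDWIDTH_GB_S = 500.0  # placeholder value, swap in your own hardware figure
    for label, active in [("48B active", 48e9), ("3B active", 3e9)]:
        tps = max_tokens_per_second(active, 4, BANDWIDTH_GB_S)
        print(f"{label}: at most ~{tps:.0f} tok/s at 4 bits per weight")
```

Whatever bandwidth you plug in, going from 48B to 3B active raises this ceiling 16x, since the bound scales inversely with active parameters. The total 1.6T weights would still have to live somewhere, though.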

rvz

> Both the full training run and the large-scale deployment are built entirely on AI ASIC superpods. Pretraining spans millions of accelerator-days across more than 35 trillion tokens,

To think that Nvidia would not have any competition is quite laughable, and Jensen knew that China would catch up.

This is why restricting GPUs as a temporary blockade does not work: it just pushes the Chinese AI labs to find clever workarounds to serve AI compute as cheaply as possible, including building their own hardware.

Just as Bitcoin moved to ASICs, AI will soon need them for training and inference (TPUs are also ASICs), and Jensen showed he knew this by buying Groq.

Today is not a good day if you are Anthropic or OpenAI.

aetherspawn

I wish they would release the requirements to run on llama.cpp with any announcements of open models.

A bonus would be tok/s on common hardware.

dryarzeg

So... is this literally a... umm, sorry, I'm just genuinely unsure (really, no sarcasm intended) which terminology to use... finetune of DeepSeek V4-Pro, or a post-trained version of DeepSeek V4-Pro Base? I haven't fully dived into the tech report (so I may update my opinion as well as my comment), but so far the architectural solutions seem largely similar to DeepSeek's.

Maybe I'm wrong, but that's just the first impression.

EDIT: I take my words back (which happens rarely) - although they do build upon DeepSeek's work, their contribution far exceeds merely post-training the base model in a different way. They did introduce something new to the architecture, though I still can't find the full tech report, with Hugging Face and GitHub links returning 404 right now.

EDIT-2: Now that I think about it, I'm not sure they're going to openly release the full report with methodology, or the model weights, at all.
