Alibaba Cloud claims to reduce Nvidia GPUs used for serving unpopular models by 82% (emphasis mine)
> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found
Instead of 1,192 GPUs, they now use 213 to serve those requests.
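Quick arithmetic check on those numbers (mine, not from the paper):

```python
# Sanity check of the figures quoted above.
gpus_before = 1192   # GPUs previously dedicated to serving the long-tail models
gpus_after = 213     # GPUs after the beta deployment

reduction = 1 - gpus_after / gpus_before
print(f"GPU reduction: {reduction:.1%}")   # ~82.1%, matching the headline 82%

# The imbalance being fixed: 17.7% of GPUs handled only 1.35% of requests.
gpu_share, request_share = 0.177, 0.0135
print(f"GPUs vs. requests: {gpu_share / request_share:.0f}x over-provisioned")
```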
djoldman
Key paragraph:
> However, a small handful of models such as Alibaba’s Qwen and DeepSeek are most popular for inference, with most other models only sporadically called upon. This leads to resource inefficiency, with 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found.
The US attempt to slow down China's technological development succeeds in preventing China from directly following the same path, but may backfire by forcing China to innovate in a different direction. The overall outcome for us all may be increased efficiency as a result of this forced innovation, especially if Chinese companies continue to open-source their advances, so in the end we may have reason to thank the US for its civilisational gatekeeping.
braza
Does anyone know if there's an equivalent of those engineering/research blogs for Chinese companies?
I used to follow the ones from Western companies, but honestly, at this point I'd like to see some cases from what I consider a good engineering benchmark for everyone who doesn't work at a FAANG.
ddelnano
Does anyone know how their KV cache sync mechanism compares to newer P2P communication layers like nixl, uccl p2p, etc.?
The authors mention that NCCL and Ray initialization were too slow (see quote below), but from the description it sounds like they’ve reimplemented a layer that’s increasingly being standardized by frameworks like nixl and uccl.
> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.
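For anyone unfamiliar with what such a layer does: it moves KV-cache blocks from the GPU that ran prefill to the GPU that will run decode. A rough sketch of that data movement using plain torch.distributed point-to-point ops (my own illustration, not the paper's mechanism and not the NIXL/UCCL APIs):

```python
# Illustration only: shipping KV-cache blocks between a prefill worker and a
# decode worker with torch.distributed point-to-point ops. Real layers (NIXL,
# UCCL P2P, or whatever the paper built) use RDMA/GPUDirect transports and keep
# communicators warm instead of paying initialization costs per request.
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(backend="nccl") has already run on both ranks.

def send_kv_blocks(kv_blocks: list[torch.Tensor], dst_rank: int) -> None:
    """Prefill side: push each KV block (e.g. [num_heads, block_len, head_dim])."""
    for block in kv_blocks:
        dist.send(block.contiguous(), dst=dst_rank)

def recv_kv_blocks(shapes, dtype, src_rank: int, device: str = "cuda"):
    """Decode side: receive blocks into preallocated buffers."""
    blocks = []
    for shape in shapes:
        buf = torch.empty(shape, dtype=dtype, device=device)
        dist.recv(buf, src=src_rank)
        blocks.append(buf)
    return blocks
```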
checker659
They are working with tiny models. Not sure how well it'd scale to bigger models (if at all).
jeffybefffy519
I still think Nvidia has the most to lose in the AI race; optimisations like this will continue, coupled with better ASICs.
ibejoeb
Sounds like this virtual GPU is a separate scheduler. I wonder what kind of latency is introduced by marshaling all that data around.
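As a toy illustration of where that latency shows up (my own sketch, not the paper's design): once many models share a pooled scheduler, every request pays at least one extra queue-and-dispatch hop before it touches a GPU.

```python
# Toy pooled scheduler: many models, one dispatch queue, a small worker pool.
# The measured "dispatch latency" is the extra hop a virtual-GPU layer adds.
import asyncio, time

async def gpu_worker(name: str, queue: asyncio.Queue) -> None:
    while True:
        model, enqueued_at, done = await queue.get()
        dispatch_ms = (time.perf_counter() - enqueued_at) * 1e3
        await asyncio.sleep(0.01)          # stand-in for loading/running `model`
        done.set_result((name, model, dispatch_ms))
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(gpu_worker(f"gpu{i}", queue)) for i in range(2)]
    loop = asyncio.get_running_loop()
    futures = []
    for i in range(8):                     # 8 requests spread over 4 models
        fut = loop.create_future()
        await queue.put((f"model-{i % 4}", time.perf_counter(), fut))
        futures.append(fut)
    for fut in futures:
        gpu, model, ms = await fut
        print(f"{model} ran on {gpu}, dispatch latency {ms:.2f} ms")
    for w in workers:
        w.cancel()

asyncio.run(main())
```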
catigula
Sounds like they stopped doing something stupid.
shoeb00m
Would this make cloud providers running low-volume fine-tuned models more economically viable?
lnxg33k1
Lots of shareholders here, move along, there is nothing to read
throwaway48476
It's easy enough for a well-resourced entity to take a pre-trained model and deploy it on new hardware to save on the NVDA tax. It's far less likely for research and model training to happen outside the mature NVDA ecosystem.
mighmi
To what extent is this practice applicable to other workloads?
wslh
How feasible is it that, over a horizon of five years, new optimized "equations" will cut the need for more GPUs?
nickysielicki
> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.
I mean, it really shouldn't take tens of seconds for those initialization(s) to occur. There's no good fundamental reason that it should take that long. It's just bloat.
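If anyone wants to check that on their own cluster, here's a rough timing harness (launched with torchrun; the script name is just an example):

```python
# init_timing.py: time process-group setup plus the first collective, which is
# where NCCL's lazy communicator creation usually hides. Example launch:
#   torchrun --nproc_per_node=2 init_timing.py
import os, time
import torch
import torch.distributed as dist

t0 = time.perf_counter()
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
t1 = time.perf_counter()

x = torch.ones(1, device="cuda")
dist.all_reduce(x)                 # first collective triggers NCCL comm setup
torch.cuda.synchronize()
t2 = time.perf_counter()

if dist.get_rank() == 0:
    print(f"init_process_group: {t1 - t0:.2f}s, first all_reduce: {t2 - t1:.2f}s")
dist.destroy_process_group()
```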
t0lo
Is this another nail in the gpu/ai stock market bubble coffin?
better link https://www.tomshardware.com/tech-industry/semiconductors/al...
paper https://dl.acm.org/doi/10.1145/3731569.3764815