Alibaba Cloud claims to reduce the Nvidia GPUs used for serving unpopular models by 82% (emphasis mine):
> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found
Instead of 1,192 GPUs, they now use 213 to serve those requests.
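A quick sanity check of those numbers (the GPU counts and percentages are the ones quoted above; the implied pool size is my own back-of-envelope inference):

```python
# Sanity check of the figures quoted above: 1,192 GPUs before, 213 after.
gpus_before = 1192
gpus_after = 213

reduction = 1 - gpus_after / gpus_before
print(f"GPU reduction: {reduction:.1%}")  # -> 82.1%, consistent with the claimed 82%

# If those 1,192 GPUs are the same "17.7 per cent of GPUs" from the quote,
# that would imply a marketplace pool of very roughly 1192 / 0.177 GPUs.
print(f"Implied total pool: ~{gpus_before / 0.177:.0f} GPUs")
```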
djoldman
Key paragraph:
> However, a small handful of models such as Alibaba’s Qwen and DeepSeek are most popular for inference, with most other models only sporadically called upon. This leads to resource inefficiency, with 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found.
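The comment doesn't describe the scheduling mechanism itself, but the inefficiency it points at is easy to illustrate: dedicating a GPU to each rarely-called model leaves almost all of that capacity idle, while pooling sizes the fleet for aggregate tail traffic instead. A minimal sketch with made-up traffic numbers (only the skew, not the values, mirrors the quoted statistic):

```python
import math

# Illustrative only: hypothetical long-tail traffic, not figures from the paper.
GPU_CAPACITY_RPS = 10.0                                 # assumed requests/sec one GPU can sustain
tail_models = {f"model-{i}": 0.05 for i in range(100)}  # 100 rarely-called models

# Dedicated serving: one GPU pinned per model, regardless of its traffic.
dedicated_gpus = len(tail_models)

# Pooled serving: share GPUs across the tail, sized for aggregate traffic.
# (Ignores model-switching overhead, which a real pooling system must manage.)
aggregate_rps = sum(tail_models.values())
pooled_gpus = max(1, math.ceil(aggregate_rps / GPU_CAPACITY_RPS))

utilization = aggregate_rps / (dedicated_gpus * GPU_CAPACITY_RPS)
print(f"dedicated: {dedicated_gpus} GPUs at {utilization:.1%} utilization")
print(f"pooled:    {pooled_gpus} GPU(s) for the same traffic")
```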
The US attempt to slow down China's technological development succeeds in preventing China from directly following the same path, but it may backfire by forcing China to innovate in a different direction. The overall outcome for us all may be increased efficiency as a result of this forced innovation, especially if Chinese companies continue to open-source their advances, so in the end we may have reason to thank the US for its civilisational gatekeeping.
braza
Does anyone know if there's an equivalent of those engineering/research blogs for Chinese companies?
I used to follow the ones from Western companies, but honestly, at this point I'd like to see engineering cases from what I consider a good benchmark for everyone who doesn't work at a FAANG.
shoeb00m
Would this make it more economically viable for cloud providers to run low-volume fine-tuned models?
checker659
They are working with tiny models. Not sure how well it'd scale to bigger models (if at all).
ibejoeb
Sounds like this virtual GPU is a separate scheduler. I wonder what kind of latency is introduced by marshaling all that data around.
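One way to put a rough number on that worry: if scheduling a cold model onto a shared GPU means moving its weights over the host interconnect first, the transfer time alone is easy to estimate. A back-of-envelope sketch with assumed sizes and bandwidth (none of these figures come from the paper):

```python
# Back-of-envelope: time to copy model weights onto a GPU when a cold model is
# scheduled. All numbers are assumptions for illustration, not from the paper.
def load_seconds(params_billions: float,
                 bytes_per_param: int = 2,         # FP16 weights
                 bandwidth_gb_s: float = 25.0):    # rough practical PCIe Gen4 x16
    weight_gb = params_billions * bytes_per_param  # billions of params * bytes = GB
    return weight_gb / bandwidth_gb_s

for size in (7, 14, 72):
    print(f"{size}B params: ~{load_seconds(size):.1f} s to load")
```

The same arithmetic also speaks to the scaling concern above: the bigger the model, the more each cold switch costs, unless weights are cached in host memory or sharded across the pool.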
mighmi
To what extent is this practice applicable to other workloads?
throwaway48476
It's easy enough for a well-resourced entity to take a pre-trained model and deploy it on new hardware to save on the NVDA tax. It's far less likely for research and model training to happen outside the mature NVDA ecosystem.
catigula
Sounds like they stopped doing something stupid.
lnxg33k1
Lots of shareholders here; move along, there is nothing to read.
t0lo
Is this another nail in the coffin of the GPU/AI stock market bubble?
wslh
How feasible is it that, over a horizon of five years, new optimized "equations" will cut the need for more GPUs?
Better link: https://www.tomshardware.com/tech-industry/semiconductors/al...
Paper: https://dl.acm.org/doi/10.1145/3731569.3764815