Nowadays you get TTS, STT, and text and image generation, and image editing should also be possible. It can run via ROCm or Vulkan, on CPU, GPU, and NPU. Quite a lot of options. They have a good, pragmatic pace of development. I really recommend this for AMD hardware!
Edit: The OpenAI-compatible (and I think nowadays also Ollama-compatible) endpoints let me use it in VS Code Copilot as well as in e.g. Open WebUI. More options are shown in their docs.
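For anyone who wants to point their own client at it: a minimal sketch of building a chat-completions request for an OpenAI-compatible endpoint. The base URL, port, and model name below are assumptions, not Lemonade's documented defaults; check their docs for the real values on your install.

```python
import json

# Hypothetical local endpoint; adjust host/port/path to your server.
BASE_URL = "http://localhost:8000/api/v1"

def build_chat_request(model, messages):
    """Build the URL, headers, and JSON body for a chat-completions call."""
    url = f"{BASE_URL}/chat/completions"
    headers = {
        "Content-Type": "application/json",
        # Local servers typically ignore the key, but clients require one.
        "Authorization": "Bearer not-needed-locally",
    }
    body = json.dumps({"model": model, "messages": messages})
    return url, headers, body

url, headers, body = build_chat_request(
    "some-local-model",
    [{"role": "user", "content": "Hello from a local client"}],
)
# Sending it is one urllib call; any OpenAI-compatible client (VS Code
# Copilot with a custom endpoint, Open WebUI) does the same under the hood:
#   urllib.request.urlopen(urllib.request.Request(url, body.encode(), headers))
```

This is all "OpenAI compatible" means in practice: same path, same JSON shape, so any client that lets you override the base URL can talk to the local server.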
Caum
Been running local LLMs on my 7900 XTX for months and the ROCm experience has been... rough. The fact that AMD is backing an official inference server that handles the driver/dependency maze is huge. My biggest question is NPU support - has anyone actually gotten meaningful throughput from the Ryzen AI NPU vs just using the dGPU? In my testing the NPU was mostly a bottleneck for anything beyond tiny models.
moconnor
Is... is this named because they have a lemon they're trying to make the most of?
sensitiveCal
Feels like this is sitting somewhere between Ollama and something like LM Studio, but with a stronger focus on being a unified “runtime” rather than just model serving.
The interesting part to me isn’t just local inference, but how much orchestration it’s trying to handle (text, image, audio, etc). That’s usually where things get messy when running models locally.
Curious how much of this is actually abstraction vs just bundling multiple tools together. Also wondering if the AMD/NPU optimizations end up making it less portable compared to something like Ollama in practice.
zozbot234
Note that the NPU models/kernels this uses are proprietary and not available as open source. It would be nice to develop more open support for this hardware.
rpdillon
Been running Lemonade for some time on my Strix Halo box. It dispatches out to the other backends it bundles, like diffusion and llama.cpp. I actually don't like their combined server; what I use instead is their llama.cpp build for ROCm:
https://github.com/lemonade-sdk/llamacpp-rocm
But I'm not doing anything with images or audio. I get about 50 tokens per second with GPT-OSS 120B. As others have pointed out, the NPU is used for low-power, small models that are "always on", so it's not a huge win for the standard chatbot use case.
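That 50 tok/s figure lines up with a memory-bandwidth back-of-envelope for decode on an MoE model. All the figures below are assumptions for illustration (active parameter count, quantization width, and Strix Halo bandwidth), not measurements:

```python
# Back-of-envelope: decode speed on a memory-bandwidth-bound MoE model.
# Assumed figures:
#   - roughly 5.1e9 parameters activated per decoded token (MoE routing)
#   - ~4.25 bits/weight for an MXFP4-style quantization
#   - LPDDR5X peak bandwidth on the order of 256 GB/s

active_params = 5.1e9          # parameters read per decoded token (assumed)
bits_per_weight = 4.25         # effective quantized size (assumed)
bandwidth = 256e9              # bytes/s, theoretical peak (assumed)

bytes_per_token = active_params * bits_per_weight / 8
ceiling_tok_s = bandwidth / bytes_per_token   # ideal upper bound

print(f"{bytes_per_token / 1e9:.2f} GB read per token")
print(f"~{ceiling_tok_s:.0f} tok/s at 100% bandwidth efficiency")
```

Under those assumptions the ideal ceiling is roughly 95 tok/s, so an observed ~50 tok/s is about half of theoretical peak, which is a plausible real-world efficiency for decode.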
JSR_FDED
I’ve read the website and the news announcement, and I still don’t understand what it is. An alternative to LM Studio? Does it support MLX or Metal on Macs? I’m assuming it will optimize things for AMD, but are you at a disadvantage using other GPUs?
jmillikin
Surprising that the Linux setup instructions for the server component don't include Docker/Podman as an option, only Snap/PPA for Ubuntu and RPM for Fedora.
Maybe the assumption is that container-oriented users can build their own images if given native packages?
bravetraveler
A fun observation: pulling models sends ~200 Mbit of progress updates to your browser
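For what it's worth, the usual fix for this is to coalesce progress events instead of forwarding every downloaded chunk. A generic sketch of that pattern (not Lemonade's actual code):

```python
import time

def throttled(events, min_interval=0.1, now=time.monotonic):
    """Forward at most one progress event per min_interval, always
    letting the final event through so the bar still reaches 100%."""
    last = float("-inf")
    events = list(events)
    for i, ev in enumerate(events):
        t = now()
        if i == len(events) - 1 or t - last >= min_interval:
            last = t
            yield ev

# Demo with a fake integer millisecond clock: call i returns i ms.
clock = iter(range(1000))
sent = list(throttled(range(1000), min_interval=100, now=lambda: next(clock)))
# 1000 chunk events spread over "1 second" collapse to 11 updates:
# one per 100 ms, plus the final event.
```

In real use you would leave `now` as `time.monotonic` and pick an interval around 100 ms, which is more than enough for a smooth progress bar and cuts the update traffic by orders of magnitude.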
nijave
Anyone compared it to Ollama? I had good success with the latest Ollama with ROCm 7.4 on a 9070 XT a few days ago
cpburns2009
Just in case anyone isn't aware: NPUs are low-power, slow, and meant for small models.
spencer9714
I’m currently optimizing FLUX to run on a cluster of consumer 8GB VRAM cards (RTX 4060s). I noticed Lemonade emphasizes NPU and GPU orchestration. Have you found that offloading the 'aesthetic scoring' or 'text encoding' to the NPU significantly frees up VRAM for the main diffusion process, or is the overhead of moving tensors back and forth too high on consumer hardware?
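On the transfer-overhead question, the conditioning tensors involved are tiny relative to link bandwidth, so the cost is usually per-call latency and synchronization rather than bytes. A rough sketch of the arithmetic; the tensor shape and effective PCIe rate are assumptions:

```python
# Is moving a text-encoder output across devices expensive in bytes?
# Assumed FLUX-like shape: text embedding of 512 tokens x 4096 dims in
# fp16, over a PCIe link at ~16 GB/s effective. All figures assumed.

tokens, dims, bytes_per_el = 512, 4096, 2
pcie_bw = 16e9                         # bytes/s (assumed effective rate)

payload = tokens * dims * bytes_per_el  # bytes per conditioning pass
transfer_s = payload / pcie_bw

print(f"payload: {payload / 1e6:.1f} MB, transfer: {transfer_s * 1e6:.0f} us")
```

Roughly 4 MB and a few hundred microseconds: negligible next to a multi-second diffusion loop. What does hurt is one device sitting idle waiting on the other, so whether NPU offload frees VRAM usefully depends more on pipelining than on raw transfer cost.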
freedomben
Neat, they have rpm, deb, and a companion AppImage desktop app[1]! Surprised I wasn't aware of this project before. Definitely going to give it a try.
[1]: https://github.com/lemonade-sdk/lemonade/releases/tag/v10.0....
Maybe it's a language barrier problem, but "by AMD" makes me think it's a project distributed by AMD. Is that actually the case? I'm not seeing any reason to believe it is.
kouunji
I’m looking forward to trying this. Currently, Strix Halo’s NPU isn’t accessible if you’re running Linux, and previously I don’t think Lemonade was either. If this opens up the NPU that would be great! Resolute Raccoon is adding NPU support as well.
ilaksh
Cool but is there a reason they can't just make PRs for vLLM and llama.cpp? Or have their own forks if they take too long to merge?
syntaxing
Wow, this is super interesting. This creates a local “Gemini” front end and all. It’s more or less a generative AI aggregator that installs multiple services for different generation modes. I’m excited to try this out on my Strix Halo. The biggest issue I had was image and audio generation, so this seems like a great option.
pantalaimon
It's pretty annoying that you need vendor-specific APIs and a large vendor-specific stack to do anything with those NPUs.
This way software adoption will be very limited.
LowLevelKernel
Which specific NPUs?
robotswantdata
Forget all the vibe-coded slop, or Ollama. Lemonade is the real deal and very good; I've been using it for about a year now.
AMD are doing God's work here
metalliqaz
My most powerful system is Ryzen+Radeon, so if there are tools that do all the hard work of making AI tools work well on my hardware, I'm all for it. I find it very frustrating to get LLMs, diffusion, etc. working fast on AMD; it's way too much work.
9dc
so... what does it do? I don't get it, lol
shubhamgarg86
The unified API is interesting, but I've found that "OpenAI compatible" can be leaky. When I switched a RAG agent from OpenAI to a local server, my function calling broke even though
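A cheap guard when switching backends is to validate the tool-call shape before trusting it. A sketch against the OpenAI chat-completions response layout (field names follow the public spec; the sample payloads here are made up):

```python
def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments-json) pairs, or [] if the shape is off.
    Local "OpenAI-compatible" servers sometimes omit fields, or emit the
    call as plain text in `content` instead of structured `tool_calls`."""
    calls = []
    for choice in response.get("choices", []):
        msg = choice.get("message", {})
        for tc in msg.get("tool_calls") or []:
            fn = tc.get("function", {})
            if isinstance(fn.get("name"), str) and isinstance(fn.get("arguments"), str):
                calls.append((fn["name"], fn["arguments"]))
    return calls

# Made-up responses: one in the upstream shape, one "leaky" variant.
ok = {"choices": [{"message": {"tool_calls": [
    {"type": "function",
     "function": {"name": "search", "arguments": '{"q": "lemonade"}'}}]}}]}
leaky = {"choices": [{"message": {"content": 'search({"q": "lemonade"})'}}]}

assert extract_tool_calls(ok) == [("search", '{"q": "lemonade"}')]
assert extract_tool_calls(leaky) == []   # fails closed instead of crashing
```

Failing closed like this turns a silent agent misbehavior into an explicit "no tool call returned" branch you can log and handle.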
luxuryballs
This is funny; I’m working on building an AI project called lemonade right now.
I have been using Lemonade for nearly a year already. On Strix Halo I am using nothing else, although kyuz0's toolboxes are also nice (https://kyuz0.github.io/amd-strix-halo-toolboxes/)