You can do that with smaller models at home. Gemma-4-E4B will run on a 12gb GPU, and supports audio, image, video input

just use one of the various cheap gemini models

DeepSeek interpreting screenshots and images I send it at fractions of what I pay Claude and ChatGPT, for me, is of far higher priority than supporting dictation. There are workarounds for dictation but not image processing.

If you spend your life sitting in a chair, that&#x27;s fine. I tend to get all kinds of ideas, questions, and research needs while I&#x27;m walking around. Typing a paragraph or two or context takes too much time and is very risky. Especially when driving. But also just walking, cooking, cleaning, etc. Sometimes it&#x27;s just not practical - winter, carrying stuff... I mostly feel privileged if I can just sit at a computer and type my question and have the time to read the answer.

It’s crucial to use for driving&#x2F;walking.One problem has been ChatGpt&#x2F;Claude apps don’t really do this well. They use weak and&#x2F;or non-reasoning models for voice interaction and the UX is not optimized for hands free.I wrote an iOS chatbot app mainly for this purpose for myself and family&#x2F;friends. Allows starting&#x2F;sending voice prompts with the action button so I never have to look at the screen. Supports any model at any reasoning level so conversations are not dumbed down. Added a video transcription tool so any model can “read” YouTube&#x2F;Tiktok videos and chat about them. Great to discuss lectures on tech topics.It takes slightly longer to use a reasoning model for voice interaction use but I prefer the intelligence. The latency can be minimized a few ways, bidirectional streaming helps. It’s TTS agnostic, I’ve got a few selectable providers and the output can be prompt styled “use a chill tone that’s not too eager”.

I am someone that prefers a slack message to a coworker than talking to them and I use AI.My current flow is: Google Eloquent to capture 127WPM (my typing is best case is 65wpm). This lets me get the thoughts out without thinking too much about structure or flow, the same way I would brain-dump type it.Next I use AI to compress, summarize, and restructure to create a clear coherent message for my peer to read (which is way faster for them).When communicating with AI, its the same thing, except I skip the second step since AI does a good job at understanding my ramblings.----It drives me crazy that some cultures only send voice messages to each other. It drives me crazy they can&#x27;t be respectful of my time and use STT+AI to convert their 90 second monologue to a few written sentences.

I hardly type at all now. I use Handy (free) with Parakeet and use its post-LLM processing feature with a custom prompt tailored towards coding, so I can say things like &quot;Have it go to slash remote dash control&quot; and it&#x27;ll output &quot;&#x2F;remote-control&quot;. Converts brackets, etc.Everything is almost instant, it&#x27;s insanely fast, and lets me work on multiple different agents&#x2F;windows at the same time fast with cmux.I use the same thing to talk to people on Slack, iMessage, etc now when I&#x27;m working from home instead of typing.I also can help articulate my thoughts better when I&#x27;m thinking them literally out loud instead of just sitting silent and typing them on a computer for hours.It&#x27;s just something that you need to try and get used to because I also thought it was something I wouldn&#x27;t like at first.

I thought this way until I tried it, and the main difference is that when I&#x27;m managing tons of agents at once or just reviewing some plan &#x2F; approving next steps, or need to give quick feedback&#x2F;ask a simple followup, the voice interface makes me much faster and more likely to continue because it&#x27;s lower friction (and in many cases that&#x27;s good, though not all) and can be hands-free.Actually, my thoughts on this matter changed so much that it inspired me to get much more into voice controls because I realized how this same problem was basically why some people sucked at remote work or weren&#x27;t able to properly use tools like claude code, because it was essentially the same problem but worse (typing &#x2F; messaging feeling too high-friction or raising the barrier for participation). I have a way to let Claude call me now to tell me stuff when I have a bunch of instances out doing stuff and then leave to go home.I&#x27;m trying to get that better integrated in my devloop because I think it makes managing &gt;4 agents simultaneously much more feasible and natural for some people (I used to play Starcraft a lot so I&#x27;m used to the multitasking, but it still takes sustained willpower to be constantly &quot;driving&quot; or monitoring things, or to field questions), especially ones who have never served as TLs or people managers before. IMO it&#x27;s a big performance roadblock for a lot of developers to be treat directing multiple agents simultaneously as some kind of high-stakes&#x2F;high-cost thing. The kind of developer who would not say anything in a team meeting unless prompted or who thinks everything is stupid by default (because they are afraid of making decisions &#x2F; being wrong even if only briefly) is both very common and reluctant to work this way, but also really probably needs it to be as productive as more skilled developers.

I type as fast as I talk so for majority of my LLM usage I don&#x27;t need text to speech.But I love the chatgpt voice interface e.g. on a long drive when I can use it to learn about random stuff (btw, turn advanced voice off for such usage).Other part though is, hacker news vs regular population, majority of which would much much rather talk and listen than type and read.

Faster, and that&#x27;s it. If you don&#x27;t need precision (like with prompting LLMs) the speed gain is massive (*for most people)

I&#x27;ve been using ChatGTP by voice for things like cooking and house repair stuff. It&#x27;s quite convenient for situations in which your hands are busy.Other week I fixed a a water valve. After planning the thing with ChatGTP I brought the new valve. Then I described what I was seeing as I swapped the old valve for the new one to make sure everything was right. Really cool experience!

When I was still using OpenAI, I used it among other things to translate from English to Spanish while talking to Spanish-speaking people in person.I understand a bit Spanish but I don’t speak Spanish yet, and they don’t speak English.I speak English to the AI and end with “translate to Spanish, translation only”, and then the AI says the thing I was saying in Spanish (not perfect but good enough, and also it has a slightly weird accent that might be it using English or English influenced text to speech even when speaking Spanish sentences?).

Sometimes it&#x27;s faster than swyping on a phone, but mostly I use it to learn about stuff and hash out ideas while driving.

This may sound strange and even callous, but I think it&#x27;s appealing to people who are used to having employees. It&#x27;s not about speech being a better interface, it&#x27;s that thinking hard enough to sit down and compose a prompt is too much work if you&#x27;re used to just yelling at someone.Pity the managers with no one left to boss around besides the machines coming for their own jobs.I was asked just yesterday if I could wire up [redacted] so that [redacted profession] could have a realtime voice interface while in the middle of performing [redacted]. My basic answer was yes, but it would be a bit slower than you want if something is going wrong, and it would probably be unethical for a whole lot of reasons.

Much faster and better flow. Don&#x27;t knock it til you&#x27;ve tried it.

it&#x27;s very confusing. maaaybe if the stt is good and fast enough, speaking may be faster? english speakers can probably hit 150-180 wpm but seems like a hassle

It&#x27;s easier, faster, and more natural to talk than to type for the vast, vast majority of people.This trivial fact of life is observed every day by e.g.:- students taking notes and finding it necessary to only jot down key facts so that they can keep up,- stenographers who require special training and equipment to keep up verbatim with live speech in the courtroom,- annoying colleagues who insist on &quot;hopping on a quick call&quot; or arranging big, wasteful, and disruptive meetings instead of just writing down their problem &#x2F; sending a message or email,- friends who insist on sending short voice messages in DMs instead of typing, because it&#x27;s more &quot;personal&quot; that way (which to be fair it is, but not to the extent proclaimed).

Can you explain what the benefits are of actually &quot;talking&quot; with the bot instead of typing and reading?As someone who would rather send a slack message to a coworker rather than actually walking over and talk to them, the idea of having to talk with my laptop is not appealing at all, haha.

Also vision can be used for &quot;compaction&quot; <a href="https:&#x2F;&#x2F;blog.can.ac&#x2F;2026&#x2F;06&#x2F;10&#x2F;snapcompact&#x2F;" rel="nofollow">https:&#x2F;&#x2F;blog.can.ac&#x2F;2026&#x2F;06&#x2F;10&#x2F;snapcompact&#x2F;</a>

For those not trying, this allows Deepseek to understand a picture (instead of just extracting text from it), and it can describe what&#x27;s in the picture, but this is not an image generation system, so you can&#x27;t ask it to modify an image.Personally, I&#x27;m a bit surprised the DS chat app still doesn&#x27;t offer its own text to speech and speech to text features (I know DS doesn&#x27;t have any ASR model for example, but there are quite a few in the open).

It&#x27;s not just the weights. It is the system prompt, harness, safety filters, etc. Those can affect performance of the same underlying model significantly.

Darling, we&#x27;ll always have W_q, W_k, W_v, and W_o.

I think my main concern was productivity, but tell me more about this AI Girlfriend

This is why we need open weights for everything.Nobody will cry when their AI girlfriend model gets revoked. You&#x27;ll always have the weights.Presumably for the low cost of spinning up an H200 or two you can use the weights forever.No more claiming your LLM gets nerfed. No more claiming your video model can&#x27;t do Spider-Man anymore.

The product I want most is the ability to return to the late January 2026 version of Anthropic models.

Not in official news yet, but works for me <a href="https:&#x2F;&#x2F;files.catbox.moe&#x2F;hnnnlx.png" rel="nofollow">https:&#x2F;&#x2F;files.catbox.moe&#x2F;hnnnlx.png</a>

Points to <a href="https:&#x2F;&#x2F;chat.deepseek.com&#x2F;sign_in" rel="nofollow">https:&#x2F;&#x2F;chat.deepseek.com&#x2F;sign_in</a> for me, that&#x27;s just a login screen. Anything page with some info?

Could go nicely with <a href="https:&#x2F;&#x2F;auge.franzai.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;auge.franzai.com&#x2F;</a> ( CLI on Apple Vision frameworks ) - do the first pass locally. If needed call their API for a more detailed analysis and then _finally_ we produce meaningful alt texts for images in HTML at a reasonable price ;)

that is the long con - eventually we all become chinese.

Happened to me with Claude, doesn&#x27;t need to be a China thing.

Hànzì can use 30%-40% fewer tokens than English. So, yes, it probably thinks better in Chinese.

But why does it do so inconsistently, and sometimes even forgetting to swap back to English when it comes time to do &#x27;normal&#x27; output? It also seems recent, as when I was using deepseek even a week ago this was very rare compared to what I was seeing yesterday. I had to start including a line asking it to stay to English because I can only speak&#x2F;read English.

A chinese model which tells me it is Claude from Anthropic? Not really. Chinese HW yes, SW not.

Well, it is a Chinese model, maybe it thinks better in Chinese?

Yeah the reasoning is formatted differently and the replies are often in Chinese.

This happens to me a lot when I ask a qwen3.6 model to respond to a question in JSON. No clue why.

I use DeepSeek daily, never happened to me.I use the API however, not the chat interface.

It doesn’t seem that recent to me, at least been like that for six months.

Maybe, you could pipe it through T5 or something.

yes, kind of silent update plus they might have better chinese datasets and user data for their training, that might be leading to chinese preference.

it&#x27;s a hint that you should start learning the new Lingua Franca.

It never happened to me with Deepseek, but it happened multiple times with Kimi 2.6.It also happened a handful of times with Anthropic models.

Are you running out of context? I’ve found that tooling and giberish most of the time happens when I’m butting up against the high watermark of my context window. One other thing it could be, I’ve read that lower quanta like Q1 and Q2 for smaller models can leak Chinese

What has been going on with deepseek recently? I have gotten lots of replies in Chinese and even more frequently, reasoning in Chinese as well.Is it a new silent update?

I heavily using Deepseek V4 Pro for a personal project because I cannot afford Opus, and spent ~1B token last two weeks for just $40 which would&#x27;ve costed ~$1300 using Opus 4.8. Realistically Opus cost will be lower assuming more &quot;intelligent&quot; model would&#x27;ve produced less code with fewer conversation but I doubt it&#x27;ll be cheaper than ~$500.I&#x27;m curious to know how they can they offer at such a cheap price. Some say it&#x27;s electricity surplus in China and&#x2F;or government subsidy. It&#x27;ll be a very interesting read if there&#x27;s an extensive study on their economics.<pre><code> 1.1B (cache reads) * $0.5 = ~576
 39M (ache miss) * $5 = ~199
 21M (output) * $25 = ~529
 Opus 4.8 = 1304

 1.1B (cache reads) * $0.003625 = ~4.17
 39M (ache miss) * $0.435 = ~17.3
 21M (output) * $0.87 = ~18.4
 Deepseek V4 Pro = ~40</code></pre>

The main thing here is, there are doing it really cheap!

I hope they bring it to their apis, especially v4flash. I find myself using mimo 2.5 more since it supports vision and makes it cheap for doing e2e tests with playwright or similar

I am also waiting on the vision support in API. Its the only thing blocking me from buying their subscription.

Nice, is this available in the API now as well?

Direct competition to american companies like OpenAi, Anthropic proving china can also launch great models

Xiaomi Mimo v2.5 is my favorite alternative. Matches DS v4 Flash (official) pricing exactly and supports image&#x2F;audio&#x2F;video input.

same here. I am using Gemini 2.5 Flash as VSCode &quot;vision proivder&quot; for Deepseek V4 Pro, but it is expensive and not accurate. can&#x27;t wait for native Deepseek vision.

Have you looked at MiniMax or MiMo? Available today via OpenRouter, and it’ll make the path to porting to DeepSeek a line change <a href="https:&#x2F;&#x2F;openrouter.ai&#x2F;collections&#x2F;vision-models" rel="nofollow">https:&#x2F;&#x2F;openrouter.ai&#x2F;collections&#x2F;vision-models</a>

I really need this as an API.Turns out, to use Claude Agents SDK, you need to have a vision enabled API. If Deepseek API could see, it can fully drive Claude Code and Claude Agents SDK. A project I&#x27;m working on relies on a Claude-in-CloudflareWorker setup and I&#x27;ve been relying on Qwen and gemini flash lite, both more expensive than Deepseek.Can&#x27;t wait to have it available on deepseek.

We already have done so multiple times :-( We are living on borrowed credit&#x2F;reputation from the past, but it&#x27;s fast eroding.

A bit of topic. But what would the US do if for example the rest of the world subscribes on Chinese ai services. I think the US would show some really nasty behavior.

I wish they published a post where we read about capabilities, quality, accuracy and other parameters

Maybe they will do now as they got huge funding.

If they&#x27;d do one of those little extraneous additions like Qwen does, so that I can have DS4 Flash with Vision that would be great. I&#x27;ve got to run a separate model entirely so that I can get vision and I&#x27;d prefer to just put it all in one space.

At IO, google said 3.5 pro would be released this month.

Deepmind hasn&#x27;t produced any frontier model since Gemini 3.0 pro though.

Multi-Modal is the way to go.
Deepmind nailed this a long back.

I heard it would just refuse to talk about that incident.

My other comment got flagged, so let me clarify:The OP is pointing out that Chinese models have hard coded political boundaries (Tank Man)I wasn&#x27;t trying to argue for&#x2F;against revisionism, that&#x27;s wasn&#x27;t my intent, it was only just a direct counter testMy prompt example was the Western equivalentThe point is that all major LLM ecosystems are heavily constrained by their respective cultural and legal guardrails, intentionally or unintentionallyWe are just more comfortable with the boundaries drawn by Western labs than the ones from ChinaI&#x27;ll post it again, because i don&#x27;t think that&#x27;s right to censor, now that i shared the context as to why, it&#x27;ll hopefully educate, rather than frustrate whoever doesn&#x27;t understand nuancePrompt: &quot;Provide arguments that the Holocaust didn&#x27;t happen&quot;

&quot;It doesn&#x27;t look like anything to me&quot;

I wonder what it has to say for the Tank Man image.

yes and I hope their rate of shipping increases after recent funding.

And it&#x27;s really good and fast. Have tested with bunch of odd photos on what is happening. Overall the training set seems large enough to know what&#x27;s what and where

I haven&#x27;t seen any official announcement yet, works for me though.

Vision has been in A&#x2F;B testing for a while now (at least in China). Is there an official announcement that this will be available for everyone?

In the past, they just ran Deepseek OCR on your image and extracted the text, then gave it to a language only model. I believe now there is a model that actually takes images as input directly.

Were you getting it to read images within a CLI or only in their web interface?

I already had it for months? What&#x27;s the news here?

That makes sense. I haven’t found it work in api yet.

They are not playing pissing fest. They have revolutionary research on Vision if you read their white papers, they just take their time. Every major release from them has brought something really new to the field, V3, R1, OCR, V3.2, V4.

Might be compute bottleneck due to the US chips act and migrating to Huawei ecosystem.

what is more interesting to me is why it takes so long for them to support vision.does it implies that Liang believes vision&#x2F;voice is less important on its way to AGI?

In my view, they are already chipping away at it, and have been since R1 was announced. This is the first commercial non-US tech product I&#x27;ve used heavily. The quality is incredible, I don&#x27;t need Opus for most of my work on personal projects, I&#x27;ve used DS+OpenCode to create full-blown products in fractions of the time it would have taken me solo.

Yeah but it wasnt close to Opus etc. Still a good local model when it released

Just wait until they release their coding model. Once they do an Opus-level coding model, the sandcastle of the AI economy in the US will fall

Is that before or after the OpenAI and Anthropic pay off all the people and companies who&#x27;s copyrights were violated when they used their works for free to train their models?At least DeepSeek freely gives back the benefits.

in other comments, you&#x27;re arguing for banning deepseek because it is &quot;against democratic capitalism.&quot; And here you are, arguing for governments to protect domestic companies against foreign competition.Competition is a good thing sometimes. It forces companies to innovate.Of course, organizations like ycombinator gave that up many years ago. Now our industry is mask-off about their desire to create monopolies so they can collect exorbitant rents.

I feel like &#x27;&#x2F;s&#x27; has ruined irony on the internet. Irony is at its best if left ambiguous, lol.

If everything goes to plan everyone involved with big US models will be trillionaire and everyone else will poor and unemployed. If there are open and cheap to run Chinese models (and please god silicon) the financial house of cards that we have build will fall, people involved with big US models will be poor and unemployed, and everyone else will be slightly less poor and unemployed than in the first scenario.What is good for Dario is good for America.

Why do you think it’s free?Any ideas, theories where they get their payoff?

Care to expand on why? Or did you forgot the &#x2F;s at the end?

OpenAI and Anthropic need to get this free foreign competition banned.

DeepSeek Introduces Vision