This is great, and I'm not knocking it, but every time I see these apps it reminds me of my phone.
My 2021 Google Pixel 6, when offline, can transcribe speech to text and also corrects things contextually. It can make a mistake, and as I continue to speak, it will go back and correct something earlier in the sentence. What tech does Google have shoved in there that predates Whisper and Qwen by five years? And why do we now need 1 GB of transformers to do it on a more powerful platform?
atlgator
This thread is a support group for people who have each independently built the same macOS speech-to-text app.
On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for clean up.
Have you compared WhisperKit to faster-whisper or similar? You might be able to run turbov3 successfully and negate the need for cleanup.
Incidentally, waiting for Apple to blow this all up with native STT any day now. :)
primaprashant
Speech-to-text has become an integral part of my dev flow, especially for dictating detailed prompts to LLMs and coding agents.
I have collected the best open-source voice typing tools categorized by platform in this awesome-style GitHub repo. Hope you all find this useful!
I got it to transcribe this: "Create tests and ensure all tests pass" and instead of transcribing exactly what I said it outputs nonsense around "I am a large language model and I cannot create and execute tests".
Other than that issue I like it.
charlietran
Thank you for sharing, I appreciate the emphasis on local speed and privacy. As a current user of Hex (https://github.com/kitlangton/Hex), which has similar goals, what are your thoughts on how they compare?
parhamn
I see a lot of whisper stuff out there. Are these the same old OpenAI whispers or have they been updated heavily?
I've been using parakeet v3 which is fantastic (and tiny). Confused why we're still seeing whisper out there, there's been a lot of development.
ericmcer
I see quite a few of these, the killer feature to me will be one that fine tunes the model based on your own voice.
E.g., if your name is `Donold` (pronounced like Donald), there is not a transcription model in existence that will transcribe your name correctly. That means you can forget about ever dictating your name or email; it will never output it correctly.
Combine that with any subtleties of speech you have, or industry jargon you frequently use and you will have a much more useful tool.
We have a ton of options for "predict the most common word that matches this audio data" but I haven't found any "predict MY most common word" setups.
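A rough sketch of what a "predict MY most common word" layer could look like, short of actually fine-tuning a model: a post-processing pass that snaps near-miss words onto a personal vocabulary. This is an assumption-laden illustration, not something any of the apps in this thread is known to do. `PERSONAL_VOCAB`, `apply_personal_vocab`, and the `cutoff` value are all hypothetical names and settings; only `difflib` from the Python standard library is real.

```python
import difflib

# Hypothetical personal vocabulary: words this user says often that
# generic models tend to get wrong.
PERSONAL_VOCAB = ["Donold", "Parakeet", "WhisperKit"]


def apply_personal_vocab(transcript, vocab=PERSONAL_VOCAB, cutoff=0.8):
    """Replace each word with its closest personal-vocabulary entry,
    but only when the similarity ratio is at least `cutoff`."""
    lowered = {v.lower(): v for v in vocab}
    fixed = []
    for word in transcript.split():
        # Compare without trailing/leading punctuation, then put it back.
        core = word.strip(".,!?")
        match = difflib.get_close_matches(
            core.lower(), list(lowered), n=1, cutoff=cutoff
        )
        if core and match:
            fixed.append(word.replace(core, lowered[match[0]]))
        else:
            fixed.append(word)
    return " ".join(fixed)


print(apply_personal_vocab("my name is Donald."))
```

The obvious limitation is that this rewrites every "Donald" it sees, including ones that really are Donald, which is part of why a model adapted to your own voice would be the better tool.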
konaraddi
That’s awesome! Do you know how it compares to Handy? Handy is open source and local only too. It’s been around a while and what I’ve been using.
Speech-to-text is basically the AI version of the to-do app we used to build every week whenever a new frontend framework was released.
jwr
I currently use MacWhisper and it is quite good, but it's great to see an alternative, especially as I've been looking to use more recent models!
I hope there will be a way to plug in other models: I currently work mostly with Whisper Large. Parakeet is slightly worse for non-English languages. But there are better recent developments.
ipsum2
Parakeet is significantly more accurate and faster than Whisper if it supports your language.
ianmurrays
I had Claude make this hammerspoon config + daemon that does pretty much the same, in case anyone is interested.
What do you actually use for STT, particularly if you prize performance over privacy and are comfortable using your own API keys?
I was on Wispr Flow for a while until the trial ran out, and I'm really tempted to subscribe. I don't think I can go back to a local solution after that; the performance difference is insane.
snickell
Can somebody help me understand how they use these, I feel like I'm missing something or I'm bad at something?
I only spent 10 minutes with Handy, and a similar amount of time with SuperWhisper, so I'm pretty ignorant. I tried both while composing this comment and in a programming session with Codex. I was slightly frustrated to not be hands free: instead of typing, my hands were pressing and releasing a talk button (option-space in Handy, right-command in SuperWhisper), and then I couldn't submit, so I still had to press enter in Codex.
Additionally, for composing this message, I'm using the keyboard a ton because there's no way I can find to correct text I've dictated. Do other people get results reliable enough that they don't need backspace anymore? Or... what text do you not care enough about to edit? Notes maybe?
My point of comparison is using Dragon like 15 years ago. TBH, while the recognition is better (much better) on handy/superwhisper, everything else felt MUCH worse. With dragon, you are (were?) totally hands free, you see text as you say it, and you could edit text really easily vocally when it made a mistake (which it did a fair bit, admittedly). And you could press enter and pretty functionally navigate w/o a keyboard too.
It's weird to see all these apps, and they all have the same limitations?
bambushu
nice to see this running fully local. what model size are you shipping as default, and what's the cold-start time on Apple Silicon? I've been using Whisper locally for meeting transcription and the biggest friction point is always endpoint detection - knowing when you've stopped talking vs pausing to think. curious how you handle that with hold-to-talk.
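For anyone unfamiliar with the endpoint-detection problem described here, a common baseline is an energy threshold plus a "hangover" window: a pause only counts as the end of speech once it has lasted long enough. The sketch below is generic and is not how the app in this thread or Whisper handles it; `find_endpoint`, the threshold, and the frame counts are hypothetical, and the per-frame energies are assumed to be computed elsewhere.

```python
def find_endpoint(frame_energies, threshold=0.01, hangover_frames=25):
    """Return the index of the first frame of the silence that ends the
    utterance, or None if no silence lasts `hangover_frames` frames."""
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            # Speech frame: any pause so far was just a pause.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                # The silence started hangover_frames - 1 frames ago.
                return i - hangover_frames + 1
    return None


# Speech, a short thinking pause, more speech, then a long silence.
energies = [0.0] * 5 + [0.5] * 10 + [0.0] * 3 + [0.5] * 5 + [0.0] * 30
print(find_endpoint(energies))
```

The trade-off sits entirely in `hangover_frames`: too short and a thinking pause cuts you off, too long and the app feels sluggish. Hold-to-talk sidesteps the question by letting the key release mark the endpoint.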
fiatpandas
The clean up prompt needs adjusting. If your transcription is first person and in the voice of talking to an AI assistant, it really wants to “answer” you, completely ignoring its instructions. I fiddled with the prompt but couldn’t figure out how to make it not want to act like an AI assistant.
__mharrison__
Cool, I've been doing a lot of "coding" (and other typing tasks) recently by tapping a button on my Stream Deck. It starts recording me until I tap it again. At which point, it transcribes the recording and plops it into the paste buffer.
The button next to it pastes when I press it. If I press it again, it hits the enter command.
You can get a lot done with two buttons.
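The two-button flow described above is a small state machine, which can be sketched as follows. This is only a model of the described behaviour, not the commenter's actual Stream Deck setup: `TwoButtonDictation`, the action names, and the stand-in `transcribe` callable are hypothetical, and it assumes the paste button goes back to "paste" after it has sent enter.

```python
from dataclasses import dataclass, field


@dataclass
class TwoButtonDictation:
    recording: bool = False
    pasted: bool = False
    clipboard: str = ""
    actions: list = field(default_factory=list)

    def press_record(self, transcribe=lambda: "hello world"):
        """First press starts recording; second press stops, transcribes,
        and puts the text in the paste buffer."""
        if not self.recording:
            self.recording = True
            self.actions.append("start_recording")
        else:
            self.recording = False
            self.clipboard = transcribe()  # stand-in for a real STT call
            self.pasted = False
            self.actions.append("copy_to_clipboard")

    def press_paste(self):
        """First press pastes; pressing again sends enter (assumed to
        reset the button so the next press pastes again)."""
        if not self.pasted:
            self.pasted = True
            self.actions.append("paste")
        else:
            self.pasted = False
            self.actions.append("enter")


deck = TwoButtonDictation()
deck.press_record()
deck.press_record()
deck.press_paste()
deck.press_paste()
print(deck.actions)
```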
ghm2199
I've been using Handy for a month and it's awesome. I mainly use it with coding agents or when I don't want to type into text boxes. How is this different?
Part of the reason handy is awesome is because it uses some of the same rust infra for integrating with the model, so that actually makes it possible to use the code as a library in android or iOS. I have an android app that runs on a local model on the phone too using this.
I like that OpenWhispr lets me run on-device and also set a remote provider.
mathis
If you don't feel like downloading a large model, you can also use `yap dictate`. Yap leverages the built-in models exposed through Speech.framework on macOS 26 (Tahoe).
Not sure why I should use this instead of the baked-in OS dictation features (which I use almost daily--just double-tap the world key, and you're there). What's the advantage?
hyperhello
Feature request or beg: let me play a speech video and transcribe it for me.
pdyc
interesting, i wanted something like this but i am on linux so i modified the whisper example to run on cli. It's quite basic, uses ctrl+alt+s to start/stop, when you stop it copies text to clipboard, that's it. Now it's my daily driver https://github.com/newbeelearn/whisper.cpp
jannniii
Oh dear, why does it not use apfel for cleanup? No model download necessary…
janalsncm
I think the jab at the bottom of the readme is referring to Wispr Flow?
love seeing more local-first tools like this. feels like there's been a real shift since the codebeautify breach last year, people are actually thinking about where their data goes now. nice work on keeping it all on device
tito
This is great. I'm typing this message now using Ghost Pepper. What benefits have you seen from the OCR screen sharing step?
Supercompressor
I've been looking for the opposite - wanting to dump text and it be read to me, coherently. Anyone have good recommendations?
imazio
is this the support group for people building speech-to-text apps?
How does this compare with Superwhisper, which is otherwise excellent but not cheap?
guzik
Sadly the app doesn't work. There is no popup asking for microphone permission.
EDIT: I see there is an open issue for that on github
aristech
Great job.
How about the supported languages?
Do system languages get recognised?
gegtik
how does this compare to macOS built-in Siri STT, in quality and in privacy?
purplehat_
Hi Matt, there's lots of speech-to-text programs out there with varying levels of quality. 100% local is admirable but it's always a tradeoff and users have to decide for themselves what's worth it.
Would you consider making available a video showing someone using the app?
vaulpann
very cool - huge open source drop!
thatxliner
why isn't the cleanup done on the transcription (as opposed to screen record)
dakila5
MacWhisper is also a good one
douglaswlance
does it input the text as soon as it hears it? or does it wait until the end?
sorkhabi
Well done
romeroej
always mac. when windows? why can't you just make things multi-OS
Nice one! For Linux folks, I developed https://github.com/goodroot/hyprwhspr.
I have collected the best open-source voice typing tools categorized by platform in this awesome-style GitHub repo. Hope you all find this useful!
https://github.com/primaprashant/awesome-voice-typing
https://handy.computer/ already exists?
That’s awesome! Do you know how it compares to Handy? Handy is open source and local only too. It’s been around a while and what I’ve been using.
https://github.com/cjpais/handy
I had Claude make this hammerspoon config + daemon that does pretty much the same, in case anyone is interested.
https://github.com/ianmurrays/hammerspoon/blob/main/stt.lua
Would also like to know how it compares to https://github.com/openwhispr/openwhispr
I like that OpenWhispr lets me run on-device and also set a remote provider.
If you don't feel like downloading a large model, you can also use `yap dictate`. Yap leverages the built-in models exposed through Speech.framework on macOS 26 (Tahoe).
Project repo: https://github.com/finnvoor/yap
Interesting, I'm surprised you went with Whisper, I found Parakeet (v2) to be a lot more accurate and faster, but maybe it's just my accent.
I implemented fully local hands free coding with Parakeet and Kokoro: https://github.com/getpaseo/paseo
I think the jab at the bottom of the readme is referring to Wispr Flow?
https://wisprflow.ai/new-funding
is this the support group for people building speech-to-text apps?
I built https://yakki.ai
No regrets so far! XP