Removing 'um' from a recording is harder than it sounds

alyssamazz

Doug is a friend, but I actually use this so figured I’d chime in.

I make online course content and used to lose close to a full day cutting filler out of every hour or so of recording. This gets me maybe 70% of that time back. On whether you should even cut them, I don’t think it’s clear cut. With non-native English speakers especially, the um is usually a real pause before they say something that matters, and cutting it makes them choppy or changes what they meant. Most of the time though it’s just padding. That matters more for courses than it sounds like it should, because a common complaint I get is how long courses are, so any dead air I can pull out is time I give back to people.

Anyway, this is in my workflow now. I'm still messing with the settings to get it right, but I like tinkering with my stack, and this takes care of that one step for me.

wzdd

It’s a nice engineering approach, but I’m interested in the motivation. Ums and ahs are distracting in a transcript, where you can naturally pause to take in information; in speech, however, they can serve as a focusing point to indicate that the next part is important. See https://medium.com/better-humans/dont-worry-about-saying-um-... for example. The obsessive zeal that orgs like Toastmasters have about eliminating them is weird.

Disfluencies aren’t necessarily bad even if the word starts with “dis”!

heroprotagonist

Not to promote something, but Wispr Flow does that for me automatically if I enable a setting for it.

While it's a commercial product with a subscription, I spent a long time on the free tier not even hitting their limits until I started using it so extensively that I wanted to pay for it.

And I've used Whisper in the past, mostly for tinkering. I tried it for a couple of use cases but haven't touched the base project in a while. But I do regularly use Faster-Whisper-XXL, an open source project based on Whisper, for subtitle generation.

Though, for subtitle generation, I decided to support the project and mainly use the non-public build of Faster-Whisper-XXL Pro built for donors to the open source project.

The extra features smooth out the subtitle editing process very substantially. Add "--roformer_overlap 0.125 --roformer_vram 16 --best_of 15 --ff_vocal_extract mb-roformer --vad_method pyannote_v3" to the CLI parameters (and sometimes --realign) and you have much less work to do in SubtitleEdit or Tero Subtitler afterwards to clean it up.

ghaff

When I was doing podcasts regularly, it made me acutely aware of various people's speech mannerisms. (Somewhat similarly, recording a lot of videos during COVID made me very aware of a variety of my own mannerisms--especially overactive hand motions.)

supernes

This approach seems kind of backwards to me. Why try to detect everything except the thing you're trying to remove instead of either sampling a few uhs and ums and treating them as noise to be silenced (with a sharp crossfade to the noise floor that doesn't interrupt speech flow) or finetuning a model to detect them specifically for full automation?

1317

Looks interesting. It would be a nicer article, though, if there were a before/after demo to show the results, plus an explanation of why the previous ideas didn't work.

For something dealing with audio, you really do need to play the audio.

chrismorgan

I think the “What it won’t touch” section shows why the entire concept is unsound. Here it is with a different first sentence, and (other than the third sentence no longer matching erm’s reality) it’s perfectly coherent:

> It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone. Those sound like fillers but they’re doing real work in the sentence, and cutting them automatically would change what someone said. The rule erm follows: only remove things that are sound, not language.

> It also doesn’t touch repeated words, false starts, or long thinking pauses. Those aren’t noise on top of the speech; they are the speech, just messier than the speaker would like. Cleaning them up is an editorial decision about which take to keep, and erm doesn’t have an opinion about that.

Think about it. Cleaning these things-that-can-be-just-sounds-but-can-also-very-much-be-load-bearing up is an editorial decision. At the very least, you need to judge based on the surrounding content whether the removal of an um would change the meaning at all; and I don’t think text alone is adequate for that.

rbbydotdev

I wonder if, with enough input data and transcription, you could “fingerprint” where a particular speaker habitually interjects “ums”, leading to more robust analysis. Novel approach, and it gets me thinking.

rindalir

This is fascinating! I'm going to try this on a certain clip from Jurassic Park.

boodleboodle

This resonates with our crusade to eradicate Ums once and for all.

- Ums Considered Harmful: https://hamanlp.org/research/ums/

- Related paper: https://hamanlp.org/SIGBOVIK_2026.pdf

ralferoo

The title of the article is wrong. It's not that removing 'um' from a recording is hard, it's that not removing everything else in the recording while doing so is.

alok-g

I would love to see support for videos and removal of custom filler words (I say 'basically' and 'like' a lot and have so far failed to improve myself on this).

lavaman131

This is great. I've tried automated podcast editing tools before, and in my experience they cut too aggressively. Now that you've gotten the alignment snapping working cleanly for 'um' and 'ah', what are you thinking about doing next? Are you planning to expand the tool?

cadamsdotcom

What an awesome tool and idea. I’d be keen to see if it can integrate with video editing tools.

Ideally it would slice the video in the timeline without actually removing anything, so you can scrub through your video and try with and without each disfluency (thank you - awesome word) & decide case by case which to keep!

BugsJustFindMe

I find the crusade against 'um' to be annoyingly misplaced. It frustrates the shit out of me that iOS speech-to-text dictation refuses to write my 'um's and 'uh's with no way to change that behavior. If a person asks to remove them, fine, but don't fucking alter my speech patterns when I'm sending messages to people.

AaronAPU

I accidentally learned how disgusting people’s mouth noises are while developing an audio leveler. The lip smacking and snot noises between sentences are the stuff of nightmares if you don’t do anything to exclude them from amplification.

The best approach I could come up with was to maintain a sliding histogram of loudness and exclude the low-level outliers.

You can do more in the noise/frequency domain but those were outside the scope of this tool.
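
For illustration only, here is a minimal Python sketch of the sliding-loudness idea described above. It is not the commenter's code: the function names, the window size, the 20% cutoff, and the -20 dB target are all assumptions, and a sorted sliding window stands in for a true histogram.

```python
from collections import deque
import math

def frame_db(frame):
    """RMS level of one frame in dBFS (samples are floats in [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-9))

def speech_frames(levels_db, window=200, low_pct=0.2):
    """Yield (level, keep) for each frame level in dB.

    A sliding window of recent levels acts as the loudness histogram.
    A frame is excluded (keep=False) when it falls below the `low_pct`
    quantile of that window, i.e. it is a low-level outlier.
    """
    recent = deque(maxlen=window)
    for level in levels_db:
        recent.append(level)
        ordered = sorted(recent)
        cutoff = ordered[int(low_pct * (len(ordered) - 1))]
        yield level, level >= cutoff

def leveling_gain_db(levels_db, target_db=-20.0, **kwargs):
    """Gain (dB) that moves the mean of the kept frames to target_db."""
    kept = [lvl for lvl, keep in speech_frames(levels_db, **kwargs) if keep]
    if not kept:
        return 0.0
    return target_db - sum(kept) / len(kept)
```

With this shape, quiet lip smacks between sentences do not drag the measured level down, so they are not what drives the amplification.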

__mharrison__

Interesting. I make a bunch of video content and I went another way.

When I want to redo a section, I say it again. But I have a magic word, "mistake", that I insert before it. Previously I transcribed and just removed the sentence (or section) before "mistake".

I recently automated this and used AI to determine what to cut and to drive DaVinci Resolve to make the edit. Saves a lot of time in my workflow.
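
For illustration only, the manual version of the magic-word rule can be sketched in a few lines of Python. This is not the commenter's tool: the `(start, end, text)` segment format and the assumption that the transcriber emits "mistake" as its own segment are both hypothetical.

```python
import re

# Hypothetical rule: a segment that is only the word "mistake"
# (ignoring case and punctuation) marks the previous segment as a bad take.
MARKER = re.compile(r"^\W*mistake\W*$", re.IGNORECASE)

def drop_retakes(segments):
    """Return segments with each marker and the segment before it removed.

    `segments` is a list of (start_seconds, end_seconds, text) tuples.
    """
    kept = []
    for start, end, text in segments:
        if MARKER.match(text):
            if kept:
                kept.pop()  # discard the take that preceded the marker
            continue        # the marker itself is never kept
        kept.append((start, end, text))
    return kept
```

The surviving time ranges are what an editor or script would then keep on the timeline.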

HeavyStorm

What a very cool utility.

josefritzishere

I used to do this with a razor and an aluminum cutting block.

npodbielski

I think it is harder to remove those from your own speech. I have been doing that for a few months now, and I still slip back into it when I am in a hurry or stressed.

sciencesama

There is an "ah" counter in Toastmasters!! This is the software that helps!!

cryptoz

Really cool stuff and definitely going to try it; I’m also finding it wild that Google put effort into adding ums and erms into their text to speech model a while back. AI puts it in, AI helps take it out.

cyberax

BTW, any recommendations for AI tools that remove the laugh track? I don't even mind the awkward acting without the laughter.

fragmede

...

No, you run an entire second-pass LLM over the output of Whisper. "no uhhh three no four." should just output four, the numeral, not even f.o.u.r.

Hi, my name is fragmede. Judging by the date on my computer it's been four months since it's since I've t touched the transcription directory on computer and tried to improve on the state of wisprflow. Mines pretty good but it just doesn't... ah you can't drag me back in.

sublinear

Disfluencies are not necessarily "filler". They can convey mood or hesitation. Cutting them can change the meaning.

A trivial example is "umm... well... (sigh) okay" versus just "okay". Not okay!

slhck

> Two small fixes, in order. First, each cut endpoint is allowed to slide a tiny bit (up to 60ms) to land in the quietest spot nearby. If there’s a momentary lull in the audio just before or after the original cut point, slide there. The slide is bounded so it can’t cross into a neighboring word, otherwise you’d chew off real speech. Second, from that quiet spot, the endpoint snaps to the nearest moment when the waveform is exactly crossing zero.

Oh, Claudish striking again.

dougcalobrisi

This post is mostly about how surprisingly hard it is to cut filler words out of speech cleanly. Apparently, stripping ums isn't a find and replace type thing, because Whisper's timestamps are off by up to a few hundred ms and cutting on them chops syllables or leaves stutters. So, I built a tool, erm, that starts from Whisper's guess, finds where each word actually starts and stops in the audio, and snaps the cuts to silence so there's no click, with ffmpeg doing the splicing.

https://github.com/dougcalobrisi/erm
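
For illustration only, here is a minimal Python sketch of the "slide to the quietest spot, then snap to a zero crossing" idea quoted earlier in the thread. It is not erm's actual code: the function names, the 2 ms energy window, and the plain list of float samples are assumptions, and it expects the original cut to lie inside the given bounds.

```python
def quietest_index(samples, cut, max_slide, lo_bound, hi_bound, win):
    """Index near `cut` with the lowest mean energy in a small window.

    The search stays within `max_slide` samples of `cut` and inside
    [lo_bound, hi_bound], so it cannot cross into a neighboring word.
    """
    lo = max(lo_bound, cut - max_slide, 0)
    hi = min(hi_bound, cut + max_slide, len(samples) - 1)

    def energy(i):
        a, b = max(0, i - win), min(len(samples), i + win + 1)
        return sum(s * s for s in samples[a:b]) / (b - a)

    # Ties are broken in favor of the spot closest to the original cut.
    return min(range(lo, hi + 1), key=lambda i: (energy(i), abs(i - cut)))

def nearest_zero_crossing(samples, idx):
    """Nearest index where the waveform is zero or has just changed sign."""
    n = len(samples)
    for d in range(n):
        for i in (idx - d, idx + d):
            if 1 <= i < n and (samples[i] == 0 or samples[i - 1] * samples[i] < 0):
                return i
    return idx  # no crossing found; leave the cut where it is

def snap_cut(samples, cut, sample_rate, lo_bound, hi_bound,
             max_slide_ms=60, win_ms=2):
    """Slide a cut to a quiet spot, then snap it to a zero crossing."""
    max_slide = int(sample_rate * max_slide_ms / 1000)
    win = max(1, int(sample_rate * win_ms / 1000))
    quiet = quietest_index(samples, cut, max_slide, lo_bound, hi_bound, win)
    return nearest_zero_crossing(samples, quiet)
```

In this sketch the zero-crossing search is not bounded by the word limits; a real tool would presumably constrain it as well.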

monster_truck

It takes about 30 seconds in Audacity and will give an infinitely better result. Also works on any other sound
