How to stop Linux threads cleanly

214 points | 73 comments | 6 days ago
leeter

I'm reminded of Raymond Chen's many, many blog posts [1][2][3] (there are a lot more) on why TerminateThread is a bad idea. I'm not surprised at all that the same is true elsewhere. I will say that in my own code this is why I tend to prefer cancellable system calls that are alertable. That way the thread can wake up, check whether it needs to die, and then GTFO.

[1] https://devblogs.microsoft.com/oldnewthing/20150814-00/?p=91...

[2] https://devblogs.microsoft.com/oldnewthing/20191101-00/?p=10...

[3] https://devblogs.microsoft.com/oldnewthing/20140808-00/?p=29...

There are a lot more; I'm not linking them all here.

kazinator

> Well, since thread cancellation is implemented using exceptions, and thread cancellation can happen in arbitrary places

No, thread cancelation cannot happen in arbitrary places. Or doesn't have to.

There are two kinds of cancelation: asynchronous and deferred.

POSIX provides an API to configure this for a thread, dynamically: pthread_setcanceltype.

Furthermore, cancelation can also be enabled and disabled.

  int pthread_setcancelstate(int state, int *oldstate); // PTHREAD_CANCEL_ENABLE, PTHREAD_CANCEL_DISABLE
  int pthread_setcanceltype(int type, int *oldtype);    // PTHREAD_CANCEL_DEFERRED, PTHREAD_CANCEL_ASYNCHRONOUS
Needless to say, a thread would only turn on asynchronous cancelation over some code where it is safe to do so, where it won't be caught in the middle of allocating resources, or manipulating data structures that will be in a bad state, and such.
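A minimal sketch of how those two calls might be combined with a cleanup handler, assuming glibc on Linux. The names worker, release_buffer and cancel_demo are made up for illustration and are not from the article or the comment. Cancelation is disabled while the buffer is acquired, then re-enabled in deferred mode so it can only take effect at a cancellation point such as sleep().

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static int cleanup_ran; /* set by the cleanup handler, read after join */

/* Cleanup handler: runs when the thread is cancelled at a cancellation point. */
static void release_buffer(void *p) {
    free(p);
    cleanup_ran = 1;
}

static void *worker(void *arg) {
    int old_state, old_type;
    (void)arg;

    /* Hold off cancellation while the resource is being acquired. */
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old_state);
    char *buf = malloc(4096);
    pthread_cleanup_push(release_buffer, buf);

    /* Deferred is already the default type; it is set here only to be explicit. */
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old_type);
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &old_state);

    for (;;)
        sleep(1); /* sleep() is a cancellation point */

    pthread_cleanup_pop(1); /* never reached, but push/pop must pair lexically */
    return NULL;
}

/* Hypothetical driver: returns 0 if the worker was cancelled and cleaned up. */
int cancel_demo(void) {
    pthread_t t;
    void *res = NULL;
    struct timespec pause = { 0, 50 * 1000 * 1000 }; /* 50 ms */
    int rc = pthread_create(&t, NULL, worker, NULL);

    assert(rc == 0);
    nanosleep(&pause, NULL); /* give the worker time to reach its loop */
    if (pthread_cancel(t) != 0 || pthread_join(t, &res) != 0)
        return -1;
    return (res == PTHREAD_CANCELED && cleanup_ran) ? 0 : 1;
}
```

Calling cancel_demo() from main (built with -pthread) should return 0. This only covers code the thread itself controls; it does nothing for library code that allocates before hitting a cancellation point.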
scottlamb

This article does a nice job of explaining why pthread cancellation is hopeless.

> If we could know that no signal handler is ran between the flag check and the syscall, then we’d be safe.

If you're willing to write assembly, you can accomplish this without rseq. I got it working many years ago on a bunch of platforms. [1] It's similar to what they did in this article: define a "critical region" between the initial flag check and the actual syscall. If the signal happens here, ensure the instruction pointer gets adjusted in such a way that the syscall is bypassed and EINTR returned immediately. But it doesn't need any special kernel support that's Linux-only and didn't exist at the time, just async signal handlers.

(rseq is a very cool facility, btw, just not necessary for this.)

[1] Here's the Linux/x86_64 syscall wrapper: https://github.com/scottlamb/sigsafe/blob/master/src/x86_64-... and the signal handler: https://github.com/scottlamb/sigsafe/blob/master/src/x86_64-...

the_duke

For interrupting long-running syscalls there is another solution:

Install an empty SIGINT signal handler (without SA_RESTART), then run the loop.

When the thread should stop:

* Set stop flag

* Send a SIGINT to the thread, using pthread_kill or tgkill

* Syscalls will fail with EINTR

* Check for EINTR and the stop flag; if both are set, we know we have to clean up and stop

Of course a lot of code will just retry on EINTR, so that requires having control over all the code that does syscalls, which isn't really feasible when using any libraries.

EDIT: The post describes exactly this method, and what the problem with it is, I just missed it.

loeg

If you can swing it (don't need to block on IO indefinitely), I'd suggest just the simple coordination model.

  * Some atomic bool controls if the thread should stop or not;
  * The thread doesn't make any unbounded wait syscalls;
  * And the thread uses pthread_cond_wait (or equivalent C++ std wrappers) in place of sleeping while idle.
To kill the thread, set the stop flag and cond_signal the condvar. (Under the hood on Linux, this uses futex.)
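A small sketch of this coordination model, with one deliberate deviation: the stop flag here is a plain bool protected by the condvar's mutex instead of a bare atomic, since a flag set outside the mutex can race with the waiter and lose the wakeup. The struct and function names (worker_ctl, idle_worker, condvar_stop_demo) are made up for the example, and in this version the worker drains queued work before honoring the stop request.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct worker_ctl {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    bool stop;     /* protected by mu */
    int pending;   /* queued work items, protected by mu */
    int processed; /* protected by mu */
};

static void *idle_worker(void *arg) {
    struct worker_ctl *c = arg;

    pthread_mutex_lock(&c->mu);
    for (;;) {
        /* Sleep until there is work or a stop request. */
        while (!c->stop && c->pending == 0)
            pthread_cond_wait(&c->cv, &c->mu);
        if (c->pending == 0)
            break; /* woken by the stop request with nothing left to do */
        c->pending--;
        c->processed++; /* stand-in for a bounded chunk of real work */
    }
    pthread_mutex_unlock(&c->mu);
    return NULL;
}

/* Hypothetical driver: queues three items, requests a stop, returns the
   number of items the worker processed before exiting. */
int condvar_stop_demo(void) {
    struct worker_ctl c = { .stop = false, .pending = 0, .processed = 0 };
    pthread_t t;
    int rc;

    rc = pthread_mutex_init(&c.mu, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&c.cv, NULL);
    assert(rc == 0);
    rc = pthread_create(&t, NULL, idle_worker, &c);
    assert(rc == 0);

    pthread_mutex_lock(&c.mu);
    c.pending = 3;
    pthread_cond_signal(&c.cv);
    pthread_mutex_unlock(&c.mu);

    pthread_mutex_lock(&c.mu);
    c.stop = true; /* the stop request itself */
    pthread_cond_signal(&c.cv);
    pthread_mutex_unlock(&c.mu);

    pthread_join(t, NULL);
    pthread_cond_destroy(&c.cv);
    pthread_mutex_destroy(&c.mu);
    return c.processed;
}
```

Whether to drain pending work or drop it on stop is a design choice; swapping the order of the two checks in the loop gives the "stop immediately" variant.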
Ericson2314

Hi Francesco :)

I think a good shorthand for this stuff is

> One can either preemptively or cooperatively schedule threads, and one can also either preemptively or cooperatively cancel processes, but one can only cooperatively cancel threads.

ethin

This was a fun read, I didn't know about rseq until today! And before this I reasonably assumed that the naive busy-wait thing would typically be what you'd do in a thread in most circumstances. Or that at least most threads do loop in that manner. I knew that signals and such were a problem but I didn't think just wanting to stop a thread would be so hard! :)

Hopefully this improves eventually? Who knows?

ibejoeb

libcurl dealt with this a few months ago, and the sentiment is about the same: thread cancellation in glibc is hairy. The short summary (which I think is accurate) is that a hostname query via libnss ultimately had to read a config file, and glibc's `open` is a thread cancellation point, so if the thread is canceled there, it won't free memory that was allocated before the `open`.

The write-up on how they're dealing with it starts at https://eissing.org/icing/posts/pthread_cancel/.

Asmod4n

When you are on Linux the easiest way is to use signalfd. No unsafe async signal handling, just handling signals by reading from a fd.

quietbritishjim

This is just doubling down on the wrong approach.

The right approach is to avoid simple syscalls like sleep() or recv(), and instead use multiplexing interfaces like epoll or io_uring. These natively support being interrupted by some other thread because you can pass, at minimum, two things for them to wait for: the thing you're actually interested in, and some token that can be signalled from another thread. For example, you could create a Unix socket pair which you do a read wait on alongside the real work, then write to it from another thread to signal cancellation. Of course, by the time you're doing that you really could multiplex useful IO too.

You also need to manually check this mechanism from time to time even if you're doing CPU bound work.
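A minimal Linux-only sketch of that pattern. It swaps in an eventfd as the cancellation token instead of the socket pair mentioned above (either works; eventfd is just shorter), uses a pipe as a stand-in for the real work descriptor, and all names (loop_fds, event_loop, epoll_cancel_demo) are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct loop_fds {
    int work_fd;   /* the descriptor the thread actually cares about */
    int cancel_fd; /* eventfd written by another thread to request a stop */
};

/* Waits on both fds; returns the number of work reads handled before cancel. */
static void *event_loop(void *arg) {
    struct loop_fds *f = arg;
    struct epoll_event ev = { .events = EPOLLIN };
    intptr_t handled = 0;
    bool cancelled = false;
    int ep = epoll_create1(EPOLL_CLOEXEC);

    if (ep < 0)
        return (void *)(intptr_t)-1;
    ev.data.fd = f->work_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, f->work_fd, &ev);
    ev.data.fd = f->cancel_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, f->cancel_fd, &ev);

    while (!cancelled) {
        struct epoll_event out[2];
        int n = epoll_wait(ep, out, 2, -1);
        if (n < 0 && errno != EINTR)
            break;
        for (int i = 0; i < n; i++) {
            char buf[256];
            if (out[i].data.fd == f->cancel_fd)
                cancelled = true; /* finish this batch, then leave */
            else if (read(f->work_fd, buf, sizeof buf) > 0)
                handled++;
        }
    }
    close(ep);
    return (void *)handled;
}

/* Hypothetical driver: sends one unit of work, then cancels the loop. */
int epoll_cancel_demo(void) {
    struct loop_fds f;
    int pipefd[2];
    uint64_t one = 1;
    pthread_t t;
    void *res = NULL;
    int rc = pipe(pipefd);

    assert(rc == 0);
    f.work_fd = pipefd[0];
    f.cancel_fd = eventfd(0, EFD_CLOEXEC);
    assert(f.cancel_fd >= 0);
    rc = pthread_create(&t, NULL, event_loop, &f);
    assert(rc == 0);

    if (write(pipefd[1], "x", 1) != 1)
        return -1;
    if (write(f.cancel_fd, &one, sizeof one) != (ssize_t)sizeof one)
        return -1;
    pthread_join(t, &res);
    close(pipefd[0]);
    close(pipefd[1]);
    close(f.cancel_fd);
    return (int)(intptr_t)res;
}
```

For CPU-bound stretches, the same cancel_fd could be polled with a zero timeout between chunks of work.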

If you're using an async framework like asyncio/Trio in Python or ASIO in C++, you can request a callback to be run from another thread (this is the real foothold because it's effectively interrupting a long sleep/recv/whatever to do other work in the thread) at which point you can call cancellation on whatever IO is still outstanding (e.g. call task.cancel() in asyncio). Then you're effectively allowing this cancellation to happen at every await point.

(In C# you can pass around a CancellationToken, which you can cancel directly from another thread to save that extra bit of indirection.)

teddyh

Off-Topic: I surprised myself by liking the web site design. Especially the font.

vlovich123

This seems like a lot of work to do when you have signalfd, no? That + async and non blocking I/O should create the basis of a simple thread cancellation mechanism that exits pretty immediately, no?

HarHarVeryFunny

If you just want to stop and/or kill all child threads, you can read the list of thread IDs from /proc/pid/task, and send a signal to them with tgkill().
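A rough sketch of the enumeration step, assuming Linux with /proc mounted. The function name signal_all_threads is made up, tgkill is invoked through syscall() because older glibc versions have no wrapper for it, and the example passes signal 0, which only probes whether each thread exists. A real version would pass an actual signal and probably skip the caller's own TID. The directory listing is only a snapshot: threads can start or exit while it is being read.

```c
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends `sig` to every thread of the calling process via tgkill.
   Returns the number of threads reached, or -1 if /proc is unavailable.
   With sig == 0 nothing is delivered; the call only checks for existence. */
int signal_all_threads(int sig) {
    DIR *dir = opendir("/proc/self/task");
    struct dirent *ent;
    pid_t tgid = getpid();
    int count = 0;

    assert(sig >= 0);
    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(ent->d_name, &end, 10);
        if (end == ent->d_name || *end != '\0')
            continue; /* skips "." and ".." */
        if (syscall(SYS_tgkill, (long)tgid, tid, sig) == 0)
            count++;
    }
    closedir(dir);
    return count;
}
```

Note that delivering a signal this way still leaves each thread to notice it and shut itself down; it does not make the stop any less cooperative.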

kscarlet

Ah, the eternal problem of asynch unwind!

  (without-interrupts 
    (acquire-resource)
    (unwind-protect
        (with-local-interrupts
          (do-jobs-might-block-or-whatever))
      (release-resource)))
... and to cancel:

  (interrupt-thread thread (lambda () (abort-thread)))
I think what is really needed is just an exception (unwind cleanup) mechanism and a cheap way to mask interrupts. A signal deferral mechanism does exactly that, so that with(out)-interrupts simply sets a variable and doesn't need to go through a syscall.
BobbyTables2

One does not simply stop a thread…

EbEsacAig

I claim that this is a solved problem, without rseq.

1. Any given thread in an application waits for "events of interest", then performs computations based on those events (= keeps the CPU busy for a while), then goes back to waiting for more events.

2. There are generally two kinds of events: one kind that you can wait for, possibly indefinitely, with ppoll/pselect (those cover signals, file descriptors, and timing), and another kind you can wait for, possibly indefinitely, with pthread_cond_wait (or even pthread_cond_timedwait). pthread_cond_wait cannot be interrupted by signals (by design), and that's a good thing. The first kind is generally used for interacting with the environment through non-blocking syscalls (you can even notice SIGCHLD when a child process exits, and reap it with a WNOHANG waitpid()), while the second kind is used for distributing computation between cores.

3. The two kinds of waits are generally not employed together in any given thread, because while you're blocked on one kind, you cannot wait for the other kind (e.g., while you're blocked in ppoll(), you can't be blocked in pthread_cond_wait()). Put differently, you design your application in the first place such that threads wait like this.

4. The fact that pthread_mutex_lock in particular is not interruptible by signals (by design!) is no problem, because no thread should block on any mutex indefinitely (or more strongly: mutex contention should be low).

5. In a thread that waits for events via ppoll/pselect, use a signal to indicate a need to stop. If the CPU processing done in this kind of thread may take long, break it up into chunks, and check sigpending() every once in a while, during the CPU-intensive computation (or even unblock the signal for the thread every once in a while, to let the signal be delivered -- you can act on that too).

6. In a thread that waits for events via pthread_cond_wait, relax the logical condition "C" that is associated with the condvar to ((C) || stop), where "stop" is a new variable protected by the mutex that is associated with the condvar. If the CPU processing done in this kind of thread may take long, then break it up into chunks, and check "stop" (bracketed by acquiring and releasing the mutex) every once in a while.

7. For interrupting the ppoll/pselect type of thread, send it a signal with pthread_kill (EDIT: or send it a single byte via a pipe that the thread monitors just for this purpose; but then the periodic checking in that thread has to use a nonblocking read or a distinct ppoll, for that pipe). For interrupting the other type of thread, grab the mutex, set "stop", call pthread_cond_signal or pthread_cond_broadcast, then release the mutex.

8. (edited to add:) with both kinds, you can hierarchically reap the stopped threads with pthread_join.

a-dub

this stuff always seemed a mess. in practice i've always just used async io (non-blocking) and condition variables with shutdown flags.

trying to preemptively terminate a thread in a reliable fashion under linux always seemed like a fool's errand.

fwiw. it's not all that important, they get cleaned up at exit anyway. (and one should not be relying on operating system thread termination facilities for this sort of thing.)

quotemstr

pthread cancelation ends up not being the greatest, but it's important to represent it accurately. It has two modes: asynchronous and deferred. In asynchronous mode, a thread can be canceled any time, even in the middle of a critical section with a lock held. However, in deferred mode, a thread's cancelation can be delayed to the next cancelation point (a subset of POSIX function calls basically) and so it's possible to make that do-stuff-under-lock flow safe with cancelation after all.

That's not to say people do or that it's a good idea to try.

hulitu

> How to stop Linux threads cleanly

kill -HUP ?

harvie

while (true) { if (stop) { break; } }

If only there were a way to stop a while loop without having to use an extra conditional with break...

f1shy

Should this code:

  while (true) {
    if (stop) { break; }
    // Perform some work completing in a reasonable time
  }

be just:

  while (!stop) {
    do_the_thing();
  }
Anyway, the last part:

>> It’s quite frustrating that there’s no agreed upon way to interrupt and stack unwind a Linux thread and to protect critical sections from such unwinding. There are no technical obstacles to such facilities existing, but clean teardown is often a neglected part of software.

I think it is a “design feature”. In C everything is low level, so I have no expectation of a high-level feature like “stop this thread and clean up the mess”. IMHO, asking for that is similar to asking for GC in C.

dathinab

If your threads run "cooperative multithreading" tasks (e.g. Rust's Tokio runtime, JS in general, etc.) then this is kind of a non-problem.

Because tasks frequently return to the scheduler, the scheduler can do the "should stop" check there, and then properly shut down the tasks. (Since it might be possible to squeeze the flag into other atomic state bitmaps, the check might have zero relevant performance overhead: a single is-bit-set check.) Now, "properly shut down tasks" isn't as trivial as the "cleaning up local resources" part normally is, because for graceful shutdown you normally also want to allow cleaning up remote resources, e.g. transaction state. But this comes from the difference between "somewhat forced shutdown" and "graceful shutdown". In very many cases you want graceful shutdown, and want to force it only if that doesn't work. Another reason not to use "naive" forced-only shutdown...

Interpreted languages can do something similar in a very transparent manner (if they want to), but they run into similar issues as C with regard to locking and forced unwinding/panics from arbitrary places.

Sure, a very broken task might block long term. But in that case you are often better off killing it as part of process termination instead, and if that doesn't seem like an option for "resilience" reasons, then you are already in "better use multiple processes for resilience" (potentially across different servers) territory, IMHO.

So as much as forced thread termination looks tempting, I found that any time I thought I needed it, it was because I did something very wrong elsewhere.
