I'm reminded of Raymond Chen's many, many blog posts [1][2][3] (there are a lot more) on why TerminateThread is a bad idea. I'm not at all surprised the same is true elsewhere. In my own code, this is why I tend to prefer cancellable system calls that are alertable. That way the thread can wake up, check whether it needs to die, and then GTFO.
[1] https://devblogs.microsoft.com/oldnewthing/20150814-00/?p=91...
[2] https://devblogs.microsoft.com/oldnewthing/20191101-00/?p=10...
[3] https://devblogs.microsoft.com/oldnewthing/20140808-00/?p=29...
there are a lot more, I'm not linking them all here.
For interrupting long-running syscalls there is another solution:
Install an empty SIGINT signal handler (without SA_RESTART), then run the loop.
When the thread should stop:
* Set stop flag
* Send a SIGINT to the thread, using pthread_kill or tgkill
* Syscalls will fail with EINTR
* Check for EINTR and the stop flag; if both hold, we know we have to clean up and stop
Of course a lot of code will just retry on EINTR, so that requires having control over all the code that does syscalls, which isn't really feasible when using any libraries.
EDIT: The post describes exactly this method, and what the problem with it is, I just missed it.
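For readers who want to see the shape of the approach in the comment above, here is a minimal sketch in C. It is an illustration under assumptions, not the article's code: the names `stop_requested`, `worker_exited`, `idle_pipe` and `run_eintr_demo` are made up, and SIGUSR1 is swapped in for SIGINT so that Ctrl-C keeps its usual meaning. The stopper resends the signal until the worker reports that it has exited, which papers over the lost-wakeup race the post describes (a signal that lands just before the blocking call is simply lost).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static atomic_bool stop_requested;  /* hypothetical stop flag */
static atomic_bool worker_exited;   /* tells the stopper when to quit signalling */
static int idle_pipe[2];            /* never written to, so read() blocks */

/* Empty on purpose: its only job is to make blocking syscalls fail with EINTR. */
static void on_wake_signal(int signo) { (void)signo; }

static void *worker(void *arg) {
    (void)arg;
    char byte;
    while (!atomic_load(&stop_requested)) {
        /* Race window: a signal delivered between the check above and the
           read() below is lost, and read() blocks anyway. */
        ssize_t n = read(idle_pipe[0], &byte, 1);
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted: the loop re-checks the flag */
        if (n <= 0)
            break;                  /* EOF or a real error */
    }
    /* Clean-up of thread-owned resources would go here. */
    atomic_store(&worker_exited, true);
    return NULL;
}

static void sleep_ms(long ms) {
    struct timespec ts = { 0, ms * 1000000L };
    nanosleep(&ts, NULL);
}

/* Returns 0 once the worker has been stopped and joined. */
int run_eintr_demo(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_wake_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                /* deliberately no SA_RESTART */
    if (sigaction(SIGUSR1, &sa, NULL) != 0 || pipe(idle_pipe) != 0)
        return -1;

    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    sleep_ms(20);                   /* let the worker block in read() */

    atomic_store(&stop_requested, true);
    while (!atomic_load(&worker_exited)) {
        pthread_kill(tid, SIGUSR1); /* resend in case a signal was lost */
        sleep_ms(1);
    }
    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    close(idle_pipe[0]);
    close(idle_pipe[1]);
    return 0;
}
```

The resend loop is the part that would not survive contact with library code that retries on EINTR, which is the objection raised in the comment itself.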
This option is described in detail in the blog post, along with its associated problems; see this section: https://mazzo.li/posts/stopping-linux-threads.html#homegrown...
Ah, fair, I missed it when reading the post because the approach seemed more complicated.
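If you can swing it (don't need to block on IO indefinitely), I'd suggest just the simple coordination model.
* Some atomic bool controls if the thread should stop or not;
* The thread doesn't make any unbounded wait syscalls;
* And the thread uses pthread_cond_wait (or equivalent C++ std wrappers) in place of sleeping while idle.
To kill the thread, set the stop flag and cond_signal the condvar. (Under the hood on Linux, this uses futex.)
Relying heavily on a check for an atomic bool is prone to race conditions. I think it's cleaner to structure the event loop as a message queue and have a queued message that indicates it's time to stop.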
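A minimal sketch of the coordination model discussed above, in C. One deliberate swap: the stop flag here is a plain bool read and written only while holding the mutex, rather than a free-standing atomic, which is one way to avoid the lost-wakeup race the reply worries about. The struct and function names (`worker_ctl`, `run_condvar_demo`) and the "pending work" counter are hypothetical stand-ins for a real queue.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct worker_ctl {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    bool stop;       /* guarded by lock */
    int  pending;    /* queued work items, guarded by lock */
    int  processed;  /* items handled so far, guarded by lock */
};

static void *worker_main(void *arg) {
    struct worker_ctl *ctl = arg;
    pthread_mutex_lock(&ctl->lock);
    for (;;) {
        /* Idle by waiting on the condvar instead of sleeping. */
        while (!ctl->stop && ctl->pending == 0)
            pthread_cond_wait(&ctl->wake, &ctl->lock);
        if (ctl->pending > 0) {     /* drain queued work before stopping */
            ctl->pending--;
            ctl->processed++;
            continue;
        }
        break;                      /* no work left and stop was requested */
    }
    pthread_mutex_unlock(&ctl->lock);
    return NULL;
}

/* Queues three items, requests a stop, and returns how many were processed. */
int run_condvar_demo(void) {
    struct worker_ctl ctl = { .stop = false, .pending = 0, .processed = 0 };
    pthread_mutex_init(&ctl.lock, NULL);
    pthread_cond_init(&ctl.wake, NULL);

    pthread_t tid;
    if (pthread_create(&tid, NULL, worker_main, &ctl) != 0)
        return -1;

    pthread_mutex_lock(&ctl.lock);
    ctl.pending = 3;
    pthread_cond_signal(&ctl.wake);
    pthread_mutex_unlock(&ctl.lock);

    /* To stop the thread: set the flag under the lock, then signal. */
    pthread_mutex_lock(&ctl.lock);
    ctl.stop = true;
    pthread_cond_signal(&ctl.wake);
    pthread_mutex_unlock(&ctl.lock);

    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    int processed = ctl.processed;
    pthread_cond_destroy(&ctl.wake);
    pthread_mutex_destroy(&ctl.lock);
    return processed;
}
```

Treating "stop" as just another piece of state behind the same lock as the queue is close in spirit to the message-queue suggestion: the worker only ever observes it at a well-defined point in its loop.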
Every event loop is subject to the blocked-due-to-long-running-computation issue. It bites ...
The tricky part is really point 2 there; it can be harder than it looks (e.g. even simple file I/O can hit network drives). Async IO can really shine here, though designing async cancellation isn't exactly trivial either.
This seems like a lot of work to do when you have signalfd, no? That + async and non blocking I/O should create the basis of a simple thread cancellation mechanism that exits pretty immediately, no?
libcurl dealt with this a few months ago, and the sentiment is about the same: thread cancellation in glibc is hairy. The short summary (which I think is accurate) is that a hostname query via libnss ultimately had to read a config file, and glibc's `open` is a thread cancellation point, so if the thread is canceled there, it won't free memory that was allocated before the `open`.
The write-up on how they're dealing with it is at https://eissing.org/icing/posts/pthread_cancel/.
Off-Topic: I surprised myself by liking the web site design. Especially the font.
Previously: https://news.ycombinator.com/item?id=38908556
And somehow just a day ago: https://news.ycombinator.com/item?id=45589156
This was a fun read, I didn't know about rseq until today! And before this I reasonably assumed that the naive busy-wait thing would typically be what you'd do in a thread in most circumstances. Or that at least most threads do loop in that manner. I knew that signals and such were a problem but I didn't think just wanting to stop a thread would be so hard! :)
Hopefully this improves eventually? Who knows?
this stuff always seemed a mess. in practice i've always just used async io (non-blocking) and condition variables with shutdown flags.
trying to preemptively terminate a thread in a reliable fashion under linux always seemed like a fool's errand.
fwiw. it's not all that important, they get cleaned up at exit anyway. (and one should not be relying on operating system thread termination facilities for this sort of thing.)
pthread cancelation ends up not being the greatest, but it's important to represent it accurately. It has two modes: asynchronous and deferred. In asynchronous mode, a thread can be canceled any time, even in the middle of a critical section with a lock held. However, in deferred mode, a thread's cancelation can be delayed to the next cancelation point (a subset of POSIX function calls basically) and so it's possible to make that do-stuff-under-lock flow safe with cancelation after all.
That's not to say people do or that it's a good idea to try.
Cancellation points and cancellability state are discussed in the post. In a C codebase that you fully control pthread cancellation _can_ be made to work, but if you control the whole codebase I'd argue you're better off just structuring your program so that you yield cooperatively frequently enough to ensure prompt termination.
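To make the deferred mode described above concrete, here is a small C sketch. It assumes a glibc-style pthreads implementation; the names `state_lock`, `unlock_on_cancel` and `run_cancel_demo` are made up for illustration. The worker holds a lock, registers a cleanup handler for it, and only reaches a cancellation point (`nanosleep`) after the handler is in place, so the lock is released when the cancel is acted on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int cleanup_ran;

/* Runs during cancellation unwinding (or on pthread_cleanup_pop(1)). */
static void unlock_on_cancel(void *arg) {
    cleanup_ran = 1;
    pthread_mutex_unlock(arg);
}

static void sleep_ms(long ms) {
    struct timespec ts = { 0, ms * 1000000L };
    nanosleep(&ts, NULL);           /* nanosleep is a cancellation point */
}

static void *worker(void *arg) {
    (void)arg;
    int old_type;
    /* Deferred is already the default; set here only to be explicit. */
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old_type);

    pthread_mutex_lock(&state_lock);     /* not a cancellation point */
    pthread_cleanup_push(unlock_on_cancel, &state_lock);
    for (int i = 0; i < 10000; i++)
        sleep_ms(1);                     /* a pending cancel is acted on here */
    pthread_cleanup_pop(1);              /* normal path: also unlocks */
    return NULL;
}

/* Returns 0 if the thread was canceled and the lock was released. */
int run_cancel_demo(void) {
    pthread_t tid;
    void *result = NULL;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    sleep_ms(20);                        /* let the worker take the lock */
    if (pthread_cancel(tid) != 0)
        return -1;
    int rc = pthread_join(tid, &result);
    assert(rc == 0);

    int lock_is_free = (pthread_mutex_trylock(&state_lock) == 0);
    if (lock_is_free)
        pthread_mutex_unlock(&state_lock);
    return (result == PTHREAD_CANCELED && cleanup_ran && lock_is_free) ? 0 : 1;
}
```

This only holds up if every call between acquiring a resource and registering its cleanup is not a cancellation point, which is exactly the kind of whole-codebase discipline the reply above is skeptical about.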
> How to stop Linux threads cleanly
kill -HUP ?
while (true) { if (stop) { break; } }
If there only was a way to stop while loop without having to use extra conditional with break...
Feel free to read the article before commenting.
I’ve read it, and I found nothing to justify that piece of code. Can you please explain?
The while loop surrounds the whole thread, which does multiple tasks. The conditional is there to surround some work completing in a reasonable time. That's how I understood, at least.
Does not seem so clear to me. If so it could be stated with more pseudo code. Also the eventual need for multiple exit points…
This is just doubling down on the wrong approach.
The right approach is to avoid simple blocking calls like sleep() or recv(), and instead use multiplexing interfaces like epoll or io_uring. These natively support being interrupted by some other thread because you can pass, at minimum, two things for them to wait for: the thing you're actually interested in, and some token that can be signalled from another thread. For example, you could create a unix socket pair which you do a read wait on alongside the real work, then write to it from another thread to signal cancellation. Of course, by the time you're doing that you really could multiplex useful IO too.
You also need to manually check this mechanism from time to time even if you're doing CPU bound work.
If you're using an async framework like asyncio/Trio in Python or ASIO in C++, you can request a callback to be run from another thread (this is the real foothold because it's effectively interrupting a long sleep/recv/whatever to do other work in the thread), at which point you can call cancellation on whatever IO is still outstanding (e.g. call task.cancel() in asyncio). Then you're effectively allowing this cancellation to happen at every await point.
(In C# you can pass around a CancellationToken, which you can cancel directly from another thread to save that extra bit of indirection.)
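The "wait on the real thing plus a wake-up token" idea from this comment can be sketched in C roughly as follows. This is Linux-specific and makes one swap: an eventfd serves as the wake-up token instead of a second socket pair. The names `io_loop`, `watch` and `run_epoll_demo` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/socket.h>
#include <unistd.h>

struct io_loop {
    int epoll_fd;
    int data_fd;     /* the IO we actually care about */
    int wake_fd;     /* eventfd written by another thread to request a stop */
    int bytes_seen;
};

static void *io_loop_main(void *arg) {
    struct io_loop *loop = arg;
    for (;;) {
        struct epoll_event events[2];
        int n = epoll_wait(loop->epoll_fd, events, 2, -1);
        if (n < 0) {
            if (errno == EINTR) continue;
            return NULL;
        }
        int stop = 0;
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == loop->wake_fd) {
                stop = 1;           /* finish this batch, then exit */
            } else {
                char buf[64];
                ssize_t got = read(loop->data_fd, buf, sizeof buf);
                if (got > 0) loop->bytes_seen += (int)got;
            }
        }
        if (stop) return NULL;
    }
}

static int watch(int epoll_fd, int fd) {
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev);
}

/* Sends 5 bytes of "real" IO, then the stop token; returns bytes the loop saw. */
int run_epoll_demo(void) {
    int sv[2];
    struct io_loop loop = { .bytes_seen = 0 };
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    loop.data_fd = sv[0];
    loop.wake_fd = eventfd(0, 0);
    loop.epoll_fd = epoll_create1(0);
    if (loop.wake_fd < 0 || loop.epoll_fd < 0) return -1;
    if (watch(loop.epoll_fd, loop.data_fd) != 0 ||
        watch(loop.epoll_fd, loop.wake_fd) != 0) return -1;

    pthread_t tid;
    if (pthread_create(&tid, NULL, io_loop_main, &loop) != 0) return -1;

    if (write(sv[1], "hello", 5) != 5) return -1;
    uint64_t one = 1;               /* eventfd writes must be 8 bytes */
    if (write(loop.wake_fd, &one, sizeof one) != (ssize_t)sizeof one) return -1;

    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    close(sv[0]); close(sv[1]); close(loop.wake_fd); close(loop.epoll_fd);
    return loop.bytes_seen;
}
```

The stop request is handled after the rest of the batch, so data that was already readable when the token arrived still gets processed; whether that is the desired policy depends on the application.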
This is noted in the blog post, but the problem is that sometimes you don't have the freedom to do so. See this sidenote and the section next to it: https://mazzo.li/posts/stopping-linux-threads.html#fn3
Should this code:
    while (true) {
        if (stop) { break; }
        // Perform some work completing in a reasonable time
    }
Be just:
    while (!stop) {
        do_the_thing();
    }
Anyway, the last part:
>> It’s quite frustrating that there’s no agreed upon way to interrupt and stack unwind a Linux thread and to protect critical sections from such unwinding. There are no technical obstacles to such facilities existing, but clean teardown is often a neglected part of software.
I think it is a “design feature”. In C everything is low level, so I have no expectation of a high-level feature like “stop this thread and clean up the mess”. IMHO, asking for that is similar to asking for GC in C.
Yes, maybe, except if you don't have a single tight loop and the stop checks are not done just once in the loop body but manually sprinkled through various places in your code. E.g. think of a long-running compute task split into parts 1, 2 (tight loop), 3 (loop), and 4: you probably want a stop check between each of them and in each inner iteration of 3, but probably not in each inner iteration of 2 (as each check is an atomic load).
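A rough C sketch of the layout this comment describes, with made-up work in each part. The names `stop_requested`, `should_stop` and `run_phased_task` are hypothetical, and the relaxed load assumes the flag only signals "please stop" and publishes no other data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool stop_requested;  /* set by whoever wants the task to stop */

static bool should_stop(void) {
    /* Assumption: relaxed ordering is enough for a pure stop request. */
    return atomic_load_explicit(&stop_requested, memory_order_relaxed);
}

/* Returns how many parts finished (4 means the task ran to completion). */
int run_phased_task(long *sum_out) {
    long sum = 0;

    sum += 1;                            /* part 1 */
    if (should_stop()) return 1;

    for (int i = 0; i < 1000; i++)       /* part 2: tight loop, no check inside */
        sum += i;
    if (should_stop()) return 2;

    for (int i = 0; i < 10; i++) {       /* part 3: check on every iteration */
        if (should_stop()) return 2;
        sum += i * 100L;
    }
    if (should_stop()) return 3;

    sum += 1;                            /* part 4 */
    *sum_out = sum;
    return 4;
}
```

Each early return is an extra exit point that has to leave things in a clean state.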
Maybe. But it seems to me there should be better ways to organize the code. In the case you mention there will be many places where you have to clean up (that is what the article is about), so the code will be hell to debug: multithreaded, with multiple exit points in each thread… I have done really tons and tons of multithreading and never once needed such a complicated thing. Typically the code that gets run in parallel is either for managing one resource type OR number crunching w/o resource allocation… If you are spawning threads that do lots of resource allocation, maybe you have architecture problems, or you are solving a very niche problem.
If your threads run "cooperative multithreading" tasks (e.g. the Rust tokio runtime, JS in general, etc.) then this is kind of a non-problem.
Due to tasks frequently returning to the scheduler, the scheduler can do the "should stop" check there (and since it might be possible to squeeze it into other atomic state bitmaps, it might have zero relevant performance overhead: a single is-bit-set check), and then properly shut down tasks. Now, "properly shut down tasks" isn't as trivial as the "cleaning up local resources" part normally is, because for graceful shutdown you normally also want to allow cleaning up remote resources, e.g. transaction state. But this comes from the difference between "somewhat forced shutdown" and "graceful shutdown". In very many cases you want graceful shutdown, and you only force it if that doesn't work. Another reason not to use "naive" forced-only shutdown...
Interpreted languages can do something similar in a very transparent manner (if they want to), but they run into similar issues as C wrt. locking and forced unwinding/panics from arbitrary places.
Sure, a very broken task might block long term. But in that case you are often better off killing it as part of process termination instead, and if that doesn't seem like an option for "resilience" reasons, then you are already in "better use multiple processes for resilience" (potentially across different servers) territory IMHO.
So as much as forced thread termination looks tempting, I found that any time I thought I needed it, it was because I did something very wrong elsewhere.
user-space threads have entirely different semantics from kernel threads. both have their uses, but should generally not be conflated.
Concepts like cooperative multithreading, coroutines, etc. aren't limited to user space.
Actually, they predate the whole "async" movement, or whatever you want to call it.
Also, the article is about user-space threads, i.e. OS threads, not kernel-space threads (which use kthread_* rather than pthread_*). Stopping a kthread does work by setting a flag to indicate it's supposed to stop, waking the thread, and then waiting for it to exit. I.e. it works much more like the `if(stop) exit` example than any signal usage.