My dmesg is already constantly full of
INFO: task btrfs:103945 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Until eventually
Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
So I'm looking forward to getting an actual count of how often this happens without needing to babysit the warning suppressions and count the incidents myself.
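Until a proper counter is available, the manual counting described above can be roughed out in a few lines. This is only a sketch: it assumes the report lines look exactly like the two quoted above, and `count_hung_reports` is a name made up for this example. Note that once `kernel.hung_task_warnings` runs out, further reports never reach the log, so any number obtained this way is a lower bound.

```python
import re
from collections import Counter

# Matches the report line quoted above, e.g.
# "INFO: task btrfs:103945 blocked for more than 120 seconds."
# The task name may contain spaces, so match greedily up to the last ":<pid>".
HUNG_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds\.")
SUPPRESSED = "Future hung task reports are suppressed"


def count_hung_reports(log_text):
    """Return (Counter of reports per task name, True if suppression was seen)."""
    per_task = Counter()
    suppressed = False
    for line in log_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            per_task[match.group(1)] += 1
        elif SUPPRESSED in line:
            # From here on the kernel stopped printing reports,
            # so the counts above undercount the real number of incidents.
            suppressed = True
    return per_task, suppressed
```

Feeding it the output of `dmesg` (or a saved kernel log) as a string gives a per-task tally plus a flag telling you whether the log was cut short by suppression.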
I am curious - is this message indicative of a problem in the fs? I would have assumed anything marked "INFO" is, tautologically, not an error, but surely a filesystem shouldn't be locking up? Or is it just suggestive of high system load or poor hardware performance?
In-kernel btrfs code locking up should never happen at all. There is a rumor going around that btrfs never reached maturity and suffers from design issues.
You could leave this problem behind by switching to a filesystem that isn't full of deadlock bugs.
What counts as a hung task? Blocking on unsatisfiable I/O for more than X seconds? Scheduler hasn’t gotten to it in X seconds?
If a server process is blocking on accept(), wouldn’t it count as hung until a remote client connects? or do only certain operations count?
torvalds/linux, kernel/hung_task.c:
static void check_hung_task(struct task_struct *t, unsigned long timeout) https://github.com/torvalds/linux/blob/9f16d5e6f220661f73b36...
static void check_hung_uninterruptible_tasks(unsigned long timeout) https://github.com/torvalds/linux/blob/9f16d5e6f220661f73b36...
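On the accept() question: as far as I can tell from reading those two functions, only tasks in uninterruptible sleep (without the killable or idle bits set) are passed to check_hung_task(), which then checks whether the task's context-switch count has changed within the timeout. A server blocked in accept() sleeps interruptibly (state S), so it would not be reported. A rough userspace sketch for seeing which tasks are in state D right now, assuming the usual /proc/<pid>/stat layout; the function names are made up for this example, and D as shown in /proc also covers killable sleeps, so this is a superset of what the detector looks at:

```python
import os


def task_state(stat_line):
    """Extract the state letter from one /proc/<pid>/stat line.

    The comm field is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the last ')' instead of on whitespace.
    """
    rest = stat_line.rsplit(")", 1)[1]
    return rest.split()[0]


def uninterruptible_pids(proc="/proc"):
    """Return PIDs whose main thread is currently in state D."""
    pids = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc, entry, "stat")
            with open(path, errors="replace") as f:
                if task_state(f.read()) == "D":
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were looking
    return pids
```

This only looks at thread-group leaders; other threads live under /proc/<pid>/task/ and would need the same treatment.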
Just to double check my understanding (because being wrong on the internet is perhaps the fastest way to get people to check your work):
Is this saying that regular tasks that haven't been scheduled for two minutes and tasks that are uninterruptible (truly so, not idle or also killable despite being marked as uninterruptible) that haven't been woken up for two minutes are counted?
Your and the Llama's explanations would make good comments for the source and/or the docs if true.
And there's https://en.wikipedia.org/wiki/Zombie_process too
Not the same thing by any means - zombies don't indicate that something is wrong with the kernel or hardware.
The zombie process state is a normal transient state for all exiting processes where the only remaining function of the process is as a container for the exiting process's id and exit status; they go away once the parent process calls some flavor of the "wait" system call to collect the exit status. A pileup of zombies indicates a userspace bug: a negligent parent process that isn't collecting the exit status in a timely manner.
Additionally, there are a few more process accounting things, rusage, that zombie processes hold until reaped. See wait3(2), wait4(2) and getrusage(2).
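To make the zombie lifecycle concrete, here is a small sketch (Linux assumed for the /proc part; `proc_state` and `spawn_and_reap` are names invented for this example). The child exits immediately, the parent waits for that without reaping it, looks at its state, and only then collects the exit status and rusage with wait4.

```python
import os


def proc_state(pid):
    """State letter from /proc/<pid>/stat, or None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return f.read().rsplit(")", 1)[1].split()[0]
    except OSError:
        return None


def spawn_and_reap(exit_code):
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit right away, skipping cleanup handlers

    # Block until the child has exited, but leave it unreaped (WNOWAIT),
    # so it stays around as a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before = proc_state(pid)  # expected to be "Z" on Linux

    # wait4 reaps the zombie and hands back its exit status and rusage.
    _, status, rusage = os.wait4(pid, 0)
    state_after = proc_state(pid)  # the /proc entry should be gone now

    return state_before, os.WEXITSTATUS(status), rusage, state_after
```

A parent that never gets to the wait4 step is the "negligent parent" case: the child stays in state Z, holding its pid, exit status and rusage, until the parent reaps it or exits itself.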