Crazy, so if I understand correctly, something with B200s and NVLink is causing issues where, after 66 days and 12 hours of uptime, nvidia-smi and other jobs start failing or timing out, and once you restart the cluster it starts working again.
They suspect jobs will work if you only use 1 B200, but one person had already power cycled, so wasn’t able to test it. Hopefully they won’t have to wait another 66 days for further troubleshooting.
Some 32-bit counter used somewhere in NVLink overflows?
Isn't a 32-bit counter 49 days? Assuming that one was counting milliseconds, at least.
Only remember that because that's the limit for Windows 95…
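For anyone wanting to check the 49-day figure, here is a quick sketch of the arithmetic, assuming an unsigned 32-bit counter that ticks once per millisecond (the function name is just for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def wrap_days(bits: int, ticks_per_second: float) -> float:
    """Days until an unsigned counter of `bits` width wraps at the given tick rate."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY

# 32-bit counter, 1000 ticks per second (milliseconds)
print(f"{wrap_days(32, 1000):.2f} days")  # 49.71 days
```

So a plain 32-bit millisecond counter wraps well before the 66 days 12 hours reported here.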
100ns intervals. My favorite part of that story is how long after Windows 95 was released before anybody discovered the bug.
I wonder if the process for debugging this is just to search for which power of 2 times a time unit equals ~66 days.
I think it's an overflow of a scaled counter.
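The power-of-2 search is easy to try. A rough sketch, assuming the counter is unsigned, between 16 and 64 bits wide, and ticks in one of a handful of common units (the unit list, the names, and the 5% tolerance are all my own picks, not anything from the driver):

```python
SECONDS_PER_DAY = 86400

# Candidate tick lengths in seconds (hypothetical list of "usual suspects").
UNITS = {"ns": 1e-9, "100ns": 1e-7, "us": 1e-6, "ms": 1e-3, "s": 1.0}

def candidates(target_days: float, rel_tol: float):
    """Return (unit, bits, days) for counters that wrap within rel_tol of target_days."""
    hits = []
    for name, tick_seconds in UNITS.items():
        for bits in range(16, 65):
            days = (2 ** bits) * tick_seconds / SECONDS_PER_DAY
            if abs(days - target_days) / target_days <= rel_tol:
                hits.append((name, bits, round(days, 2)))
    return hits

print(candidates(49.7, 0.01))  # the 32-bit millisecond case
print(candidates(66.5, 0.05))  # 66 days 12 hours
```

With these units the second search comes back empty: no plain power of two lands within 5% of 66.5 days. That fits the scaled-counter guess better than a raw overflow, though it proves nothing about where the real bug is.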
Also, who else immediately noticed the AI-generated comment?
NVLink postRxDetLinkMask errors show up right before the hang. Has anyone captured a bug report or stack trace while nvidia-smi is stuck to see what it's blocking on?
*China specific code leaked into mainline.
a pet peeve of mine, (along with people brigading on issues/threads e.g. posting them to unrelated news sites... op....) is woefully incorrect language.
> at day 66 all our jobs started randomly failing
if there's a definable pattern, you can call it unpredictably, but you can't call it randomly.
IMHO, what they said means that on day 65 all jobs work, on day 66, jobs work or don't, seemingly at random.
But what they seem to be indicating is that all jobs fail on day 66. There's no randomness in evidence.
Unexpectedly is probably what they meant
Seems quite predictable given the others in the bug report encountering the same.
66 days 14 hours and 24 minutes (66.6 days) would have been a far more diabolical hang...