This is great work - one of the rules for code written with Basis was not to use the system clock. I partly got around network determinism by shimming out all network calls and, rather than going single-threaded, writing a scheduler (https://basisrobotics.tech/2024/09/02/determinism/). My goal wasn't to get CI-to-dev-machine determinism, at least not immediately, because callback ordering determinism was the more important property. Love seeing other work done in this space.
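For anyone wondering what "shimming out all network calls" can look like in practice, here is a minimal sketch under assumed names (Transport and SimTransport are illustrative, not Basis's actual API): production code depends on a small trait, and the test build swaps the socket-backed implementation for an in-memory one whose delivery order is decided entirely by the scheduler rather than by the OS or the wire.

    use std::collections::{HashMap, VecDeque};

    /// Hypothetical transport abstraction: production code talks only to
    /// this trait, so the real socket-backed implementation can be swapped
    /// for a deterministic in-memory one under test.
    pub trait Transport {
        fn send(&mut self, topic: &str, payload: Vec<u8>);
        fn recv(&mut self, topic: &str) -> Option<Vec<u8>>;
    }

    /// Deterministic shim: messages are queued in memory and handed back
    /// in whatever order the test scheduler asks for them.
    #[derive(Default)]
    pub struct SimTransport {
        queues: HashMap<String, VecDeque<Vec<u8>>>,
    }

    impl Transport for SimTransport {
        fn send(&mut self, topic: &str, payload: Vec<u8>) {
            self.queues.entry(topic.to_string()).or_default().push_back(payload);
        }

        fn recv(&mut self, topic: &str) -> Option<Vec<u8>> {
            self.queues.get_mut(topic)?.pop_front()
        }
    }

The same trait is backed by real sockets in production; in the sim, callback ordering is fixed by the scheduler, which is what makes it reproducible.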
What's the advantage of integrating this at a library level instead of just compiling it and running in Shadow? https://github.com/shadow/shadow
I know it was mentioned at the end, but I was curious: what are some of the notable issues that were found using the DST approach, and how did it benefit the development of the system? I would also be curious whether an LLM system could help analyze the TRACE logs.
(I work at S2.)
> what are some of the notable issues that were found using the DST approach
We've discovered a few distributed deadlocks. And in general it's been incredibly helpful in exercising any parts of the system that involve caches or eventual consistency, as these can be really hard to reason about otherwise.
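To make the distributed-deadlock point concrete, here is a toy, self-contained sketch (generic Rust, not S2's harness or turmoil code): two nodes each send a Request and block until they receive an Ack, a seeded RNG stands in for the network and occasionally drops a message, and any failing seed replays the exact same schedule.

    enum Msg {
        Request,
        Ack,
    }

    /// Tiny xorshift RNG so the seed is the only source of nondeterminism.
    struct Rng(u64);
    impl Rng {
        fn next(&mut self) -> u64 {
            self.0 ^= self.0 << 13;
            self.0 ^= self.0 >> 7;
            self.0 ^= self.0 << 17;
            self.0
        }
    }

    fn run(seed: u64) -> Result<(), String> {
        let mut rng = Rng(seed);
        let mut waiting = [true, true]; // each node waits for an Ack
        // (destination, message) pairs currently "on the wire"
        let mut in_flight: Vec<(usize, Msg)> = Vec::new();

        // Each node sends a Request to its peer; the simulated network
        // drops a message roughly 5% of the time.
        for to in [1usize, 0] {
            if rng.next() % 100 >= 5 {
                in_flight.push((to, Msg::Request));
            }
        }

        // Deliver in-flight messages one at a time in an RNG-chosen order.
        while !in_flight.is_empty() {
            let idx = (rng.next() as usize) % in_flight.len();
            let (to, msg) = in_flight.swap_remove(idx);
            match msg {
                Msg::Request => in_flight.push((1 - to, Msg::Ack)),
                Msg::Ack => waiting[to] = false,
            }
        }

        if waiting.iter().any(|&w| w) {
            Err("deadlock: a node never received its Ack".to_string())
        } else {
            Ok(())
        }
    }

    fn main() {
        // Sweep seeds; any failing seed reproduces the exact same schedule.
        for seed in 1..1_000u64 {
            if let Err(e) = run(seed) {
                println!("seed {seed}: {e}");
                return;
            }
        }
        println!("no deadlock found in this seed range");
    }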
> whether an LLM system could help analyze the TRACE logs
Neat idea! For us, the logs typically only get dug into if the test as a whole fails. Often we'll inject additional logging or state monitoring to better understand what led to the failure (which is easy enough to do given that the failure reproduces exactly in the sim). Trace logs are also analyzed in the context of the "meta-test", but that's just looking for identical outputs. (More about that here: https://github.com/tokio-rs/turmoil/issues/19#issuecomment-2... )
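The meta-test idea can be sketched like this (run_sim here is a hypothetical stand-in for "drive the whole simulation with a given seed and collect its trace lines", not S2's actual harness): run twice with the same seed and assert the traces are byte-for-byte identical, which checks that the simulation itself is deterministic.

    /// Hypothetical: run the deterministic simulation and return its trace.
    fn run_sim(seed: u64) -> Vec<String> {
        // ... drive the simulated cluster and collect trace output ...
        vec![format!("sim completed with seed {seed}")]
    }

    #[test]
    fn meta_test_identical_traces() {
        let seed = 42u64;
        let first = run_sim(seed);
        let second = run_sim(seed);
        assert_eq!(first, second, "same seed must yield an identical trace");
    }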
Neat! It must be very satisfying to have this working now. I wonder if it's feasible to get it working on a multi-threaded runtime.