Baby's Second Garbage Collector

(matheusmoreira.com)

57 points | by matheusmoreira 4 days ago ago

17 comments

  • Morpheus_Matrix 4 hours ago

    The performance question is the interesting one to me. The main thing that makes conservative GCs hard to benchmark is false positives, where non-pointer values happen to fall in valid heap ranges and prevent collection. On 32-bit this was a genuine problem but on 64-bit with large virtual address spaces, the chance of an arbitrary integer being a valid heap pointer drops alot. Especially if your allocator isnt using the low addresses. So the false retention problem is probably less bad than you'd expect.

    For profiling before you get the compiler instrumentation working, `perf record` + `perf report` will get you pretty far on Linux. Wont give you per-allocation-site data but its more than enough to see where time is going inside the collector itself. Valgrind's massif tool is also useful if you want heap snapshot data rather than CPU time.

    Worth looking at MPS (Memory Pool System from Ravenbrook) if you havent already. They deal with the same ambiguous reference problem and their approach is basically what you described, conservatively referenced objects get pinned and dont move during compaction. They have pretty detailed writing on the trade-offs between conservative scanning and precise enumeration that might be useful for your next article.

    One thing id be curious about is how your stack depth reduction after removing the recursive evaluator affects pause time. Conservative GC pause time is often dominated by how much stack there is to scan, so getting rid of recursive eval might have already improved your worst-case pauses more than you realize.

    • matheusmoreira 4 hours ago

      Thanks for the tips! I'll take a look at perf on Linux. If it supports freestanding programs I'll definitely start using it.

      > On 32-bit this was a genuine problem but on 64-bit with large virtual address spaces, the chance of an arbitrary integer being a valid heap pointer drops alot. Especially if your allocator isnt using the low addresses. So the false retention problem is probably less bad than you'd expect.

      False retention could be a problem in spite of that. Lone has moved to an index-based tagged value representation. Heap values are now indexes into a large contiguous array of objects. Pointers are calculated on the fly from the heap base pointer.

      I did this in order to support easy zero copy heap reallocation with mremap. Uncoupling the values from their addresses also paved the way for heap compaction. It did turn the pointers into small integers though which are probably extremely common everywhere. I could devise a scheme to XOR some constant into the value in order to randomize it a bit and just undo the operation before dereferencing the actual pointers. Not sure if it's worth the trouble though. I haven't measured anything so far.

      > One thing id be curious about is how your stack depth reduction after removing the recursive evaluator affects pause time. Conservative GC pause time is often dominated by how much stack there is to scan, so getting rid of recursive eval might have already improved your worst-case pauses more than you realize.

      Indeed! I expect the C stack to remain shallow throughout the entire program. It was one of the major wins of the new evaluator design.

      I just used a debugger to trace garbage collection during lone's execution of my recursive (fibonacci 10) test program. The garbage collector fired four times. The conservative stack scanner worked up from the bottom of the stack to the top. The difference between top and bottom was always 1008 in all four cases. Accesses are 8 byte aligned so that means 126 iterations. It's scanning around 16 cache lines. Added a counter for conservatively discovered values and the results were: 40, 38, 37, 33. So around 26~31% of values scanned were hits. Not too bad I guess?

  • frutiger 11 hours ago

    I wanted to read this but I couldn’t because of all the allusions in the article that distracted me from the points the author was trying to make.

    • allthetime 11 hours ago

      On mobile all the text is about 3 times bigger than it needs to be as well making for an obnoxious amount of scrolling. Unreadable code examples.

      • matheusmoreira 10 hours ago

        Thanks for the feedback. That's due to a viewport configuration meta tag which I added recently in an attempt to make it more responsive in portrait mode. I just reverted it. Should work just like on desktop now.

    • starkeeper 8 hours ago

      Yeah it totally grossed me out too, it's quite twisted. Author has serious issues.

      • ai_critic 6 hours ago

        If the prose is too much for you, any number of AI services will pre-chew it and give it to you devoid of voice, character, or offense.

  • mananaysiempre 10 hours ago

    TL;DR: Conservative collector. Not where I would have taken things, but valid.

    Re forcing a register spill: assuming the GC is invoked via an ABI-compliant function call, you don’t actually need to save all the scalar registers manually, only the callee-save ones (as setjmp does). Alternatively, you can make the compiler do the spilling by inserting a function preserving no registers into the call chain—this is spelled __attribute__((preserve_none)) in sufficiently new Clang or GCC, but you can also accomplish this kind of thing on Watcom with #pragma aux, for example.

    Re obtaining the top and bottom of the stack: in addition to __builtin_frame_address, there’s also the age-old option of declaring a local and looking at its address, as long as strict aliasing doesn’t bite you. And if you know you’re running on Linux or something sufficiently compatible (as seen from the reference to auxv), you can treat argv as the bottom of the stack for the main thread, as the startup stack on Linux is (IIRC) argv, then the argument strings, then the environment, then the auxiliary vector.

  • PaulHoule 6 hours ago

    'Way back when the universe was formed, the first stack frames came into existence. Nobody truly knows who mapped them there, though among the well-learned, whispers of a great kernel are heard, a certain "Linux".'

    That was 1970-01-01T00:00Z right?

    • matheusmoreira 5 hours ago

      Unknown to me. My study of kernology is limited to gazing with the naked eye into the higher addresses of 64-bit address space. I have looked there for this great kernel, but have found nothing but the fog of memory protection. I was unable to determine what lurked there and ascribe a date to its conception...

      When asked such a question, the adepts of this "Linux" defend a thesis supported by archaeological expeditions which revealed the year of the oldest known mailing list scroll bearing that name: 1991. However, although it is heresy, I do know of the existence of an even older sect of powerful and wise sorcerers whose numerology includes the number you mention. They profess that this "Linux" was not the first. They speak of an age long forgotten where new kernel universes were not only routinely birthed from scratch but also diverged from one another, an age they call the Epoch. They say these galaxies exist out there to this day, unobservable to us, hidden away in corporate mainframes in lands far, far away...

  • DevelopingElk 7 hours ago

    Good? Bad? Doesn't matter as long as you had fun.

    Have you tested this GCs performance? Sometimes a baby GC can be fast enough.

    • mananaysiempre 7 hours ago

      Normally the reason to do a conservative GC is because it makes the VM’s implementation (and potentially FFI) simpler, not because it’s faster. Also for fun of course. (There are reasons a fully-conservative design might be faster, like not having to havr a ton of scanning functions if you have a ton of disparate types, but in general my perception is that production GCs are normally moving to at least some degree, which a fully conservative one can’t be.)

      • matheusmoreira 7 hours ago

        Yeah. I implemented the conservative collector because it simplified development of the language's primitive functions. Allowed me to put lisp values on the C stack without worrying about them being garbage collected.

        The current version of the collector also compacts the heap and moves values around. All conservatively discovered values get pinned.

        • mananaysiempre 3 hours ago

          One recent interesting partly-conservative, moving design I remember reading about is Whippet[1] by Andy Wingo (Guile maintainer, SpiderMonkey contributor). He’s been partly forced into it, though, by the fact that Guile’s existing external API (unlike, say, Lua’s) exposes the fact that it uses a non-moving collector (by allowing the user to keep hold of raw pointers to Scheme objects), so I’m not sure if this should serve as an inspiration or a cautionary tale.

          [1] http://wingolog.org/archives/2023/02/07/whippet-towards-a-ne... and other related posts

          • matheusmoreira 2 hours ago

            I really like his blog! I emailed him when I published my delimited continuations article because it addresses the overlapping native/lisp stacks problem he wrote about. Sadly I don't think he's seen it.

            > One way is to inform the garbage collector of the locations of all roots [...] implicitly, in the form of a side table generated by the compiler associating code locations with root locations.

            I wonder if there's a way to get GCC to do this.

            • mananaysiempre 2 hours ago

              Getting GCC to do things for you is fraught. Probably still possible; just... fraught.

              I believe Clang was intended to be able to do this[1] and I remember seeing that stuff even back when it was a particularly spunky research project. The facility doesn’t seem to have really gone anywhere even if it technically works; I wonder why.

              In general, the problem with stack maps of any kind—even the very minimal ones you need for stack unwinding—is that they’re liable to be large, slow to process, or both. Now that I’m thinking about it, I wonder if you could map the stack at runtime using those same unwinding tables. A C++ compiler does have to know where it put things so it can call their destructors, and perhaps you could make your GC root a word-sized but otherwise opaque uncopyable word-sized thingy so the compiler can’t put it in more than one place at once. (I can’t be the first to have the idea, even with how stupid it sounds and with how utterly miserable Itanium ABI unwinding is.)

              [1] https://llvm.org/docs/StackMaps.html

    • matheusmoreira 7 hours ago

      Probably bad, right? I did have fun though.

      I haven't measured the performance. I would like to. I'm especially interested in comparing it with the current version of the collector. It is now capable of heap compaction which will be the subject of a future article. I'm also interested in knowing how big a problem the conservative part of the collector is. The C stack depth has been minimized significantly since I got rid of the recursive evaluator. I don't think it's a significant problem but I could be wrong.

      I need to implement the compiler's profiling functions in lone. Documentation on the matter seems scarce. Not even sure if those instrumentation APIs are stable. I got away with not implementing the ASAN and stack protector since they added flags that make the compiler emit trapping instructions instead of function calls. Profiling is a lot more complex.