1-bit remembered sets

1. Wait, what?
2. I want my algorithm already
3. Speed
4. More algorithms and considerations

One day I was cleaning up my old documentation, when I realized I could compute a live set in 280ns

You can check the source code here

1. Wait, what?

I think we need to get some context:

A year ago, I wrote a proposed specification for a new garbage collector for Mezzano. Current version is still sitting at random-stuff/newgc.txt (stay tuned for a new version).

One of the decisions there is to make mature GC and compaction faster, by hierachally partitioning my heap and storing remembered sets inside a write barrier. This saves scanning the actual heap, greatly reducing latency, which is important when your memory could be outside of DRAM.

This technique was compared to the train algorithm, but my method differs a bit.

The optimization was simple: instead of storing a few 8 or 16-bit indeces and having to deal with edgecases, I instead store a bitmap, which tells which other blocks does the block reference, inside a bigger block.

A year later, I was re-reading the spec, made a few grammar corrections, and that's when it hit me…

2. I want my algorithm already

The idea is that, if you constain your heap into one metablock, you can use bit logic to compute a low-resolution live set.

I will be demonstrating the algorithm in an expressive, universal language, Lisp.

;; This is our bitmaps
(defvar *graph* (make-array 256 :initial-element
                            (make-array 256 :element-type 'bit)))
#| Fill it with some data here
...
|#
(defun compute-liveness ()
  (let ((active (make-array 256 :element-type 'bit :initial-element 0))
        (mask (make-array 256 :element-type 'bit :initial-element 1)))
    (dotimes (x 5)
      (dotimes (c 256)
        (unless (zerop (aref mask c))
          (bit-ior active (aref *graph* c) active)))
      (bit-and mask active mask))
    active))

The algorithm is embarrassingly simple. You can also do this horizontally instead of vertically:

(defun compute-liveness-horizontally ()
  (let ((active (make-array 256 :element-type 'bit :initial-element 0))
        (mask (make-array 256 :element-type 'bit :initial-element 1))
        (zero (make-array 256 :element-type 'bit :initial-element 0)))
    (dotimes (x 5)
      (dotimes (c 256)
        (unless (zerop (aref mask c))
          (setf (aref active c)
                ;; ANSI doesn't have a function to test if a bitvector is zero...
                (if (equal zero (aref *graph* c))
                0
                1))))
      (bit-and mask active mask))
    active))

3. Speed

I made AVX2 versions of both algorithms, as well as optimized HIP kernels that compute the latency for one core/workgroup. (timings assume a 256x256 matrix)

On my desktop, using an R5 3600 CPU, the AVX2 vertical algorithm runs 5 passes in 380ns on average, while the horizontal algorithm does in ~1us. The latter has a scalar bottleneck from VPTEST writing to a flag register. AVX512 wouldn't have the problem.

Since the algorithm is GPU-friendly, I wrote a kernel for it. Interestingly, the vertical algorithm is way slower, even when writing it in pure RDNA2 assembly; however, the horizontal algorithm, using thread ballots, is way faster, matching CPU.

The timings come to 1600ns for vertical, 400ns for horizontal on my overclocked RX6800. Remember that this measures latency, not throughput, and runs on a single WGP.

4. More algorithms and considerations

If you want to compact the heap, making an offset table from the live set takes just 20ns. You just slide the bits and compute their distance from the original location.

We've been using a points-to matrix, however, pointed-by could work too, and you can transform between the two pretty quickly. The vertical layout is actually slower for this, though. pointed-by matrices are nice for reference counting using popcount's.

Diagonals (blocks pointing to themselves) should be ignored, setting their bits to zero. This is essentially adding another pass to the algorithm (or a quarter, since you have some sparsity)

The more roots you have, the less efficient the algorithm. Handle DLG invariants with care.

The algorithm gets very conservative once you start mutating values. A bit can be set but it's hard to unset it. You can only do that by scanning the block immediately or deferring it. To make it faster, don't check whether a pointer is part of an unboxed array.

Given that, this algorithm is more beneficial to functional languages.

This is essentially doing the normal marking phase but inverse. A normal GC would walk the heap until it knows all the live objects. This GC instead peels the live set. This algorithm can be done the traditional way, but it'd require up to num-blocks - roots-size steps to guarantee that no live object is swept. By contrast, peeling a heap would never mark a live object as dead, iterations only trade speed vs space savings.

1-bit remembered sets#3

Table of Contents

1. Wait, what?

2. I want my algorithm already

3. Speed

4. More algorithms and considerations

Comments (0)