-- The Persistent, state-of-the-art collector and persistor for Mezzano/LispOS --
 --    Claimore    --
 -- Revision 1.9-1 --

Goals:
- Scalable to terabytes, if not petabytes of (virtual) heap space
- soft real-time (<7ms), matters more than throughput
- Pause is minimized, thread-local so any non-allocating thread is safe
 -- Maybe, at least optionally, not have STW at all
- Optimized for modern hardware, can sacrifice core2duo performance
- non-moving
- Uses as little (storage) bandwidth and paging as possible
- Storage overhead is critical, minimize it without sacrificing performance.
- Precise, obviously
- Using safepoints, not preemption

Important notes:
This collector is in almost completely unexplored territory. I am less than an amateur and must make up my own answer to each problem and make tons of blind asumptions.
Especially from the latter, things likely won't work the way they should the first time.
A fast, durable SSD is recommended.
This will currently assume a 16TiB limit on the heap. This is temporary and only exists because I can't solve everything yet nor have the hardware. For my consolation, this is still much larger than what ZGC will allow.

0. Terms and basics.
Nursery: The thread's own young generation heap.
Mature: An object is mature if it's not in the Nursery. Not called "old" for the frequency of GC and early promotions
Fine: An object is fine if it survived as mature and is ready to be - or is already saved. Demoted if ever written to.
Blocks: 4k slices of the heap, synonymous with pages
Metablocks: 1M slices of the heap. The nursery will want 2 of them whole
Superblocks: 256M slices of the heap. The nursery cannot cross their boundaries

1. (Virtual) memory layout
- Compared to original Mezzano, it will be far simpler
- Gone:
   Cons area of both generations
   Young generation areas
   GC mark bits
   All the unused space
   All semispaces
   Card table (see 1.3)
- Wired area is completely changed
 -- It'll only exist for bootstrapping, not for static allocations
 -- No longer tied to the max memory size, will probably be smaller
- not limited to 512G of physical memory, support at least 64TB
- Made to support 57-bit addressing
- Since DMA isn't wired, put it somewhere else
 -- Can put inside the object heap in do-not-move metablocks
New:
 - Wired area
 - Wired/normal function area
 - Stack area(s)
 - Pinned(?) (we technically don't need it since we have blocks that don't move)
 - General
 - Block information (crossing into the non-canonical region)
 - 57-bit page table
 - General extended
 - Physical memory map (crossing out of the non-canonical region)
 - 48-bit page table
 - Disk info
 - Volatile foreign memory

2. Young generation (nursery)
- Largely based on SICL's thread-local young collector
- Unlike SICL, default nursery size would be 2.25-2.75mb
 -- These are the local cache sizes of sandy bridge to broadwell, latter for LGA2011 (target platform)
 -- This benefits the ring bus on Intel CPUs
 -- Seems to be good for the newest CPUs' L2 caches also
- This means the heap will have to allocate 2 metablocks + scraps
 -- The scraps will try to be reused from previously unfilled metablocks
 -- Very likely that one metablock will be used entirely
 -- Both should be good for the cache
- CPUs will try to claim meta-metablocks for themselves to avoid synchronization
- Anything shared or pointed to by mature objects is sent to the mature heap
 -- This means no need for playing cards
- Possible optimization: nurseries will have a bitmap for which objects point inside that nursery. Simply done to reduce branches during tracing
- Addednum: the size could be dynamic, depending on active threads (i.e in single-threaded work, Zen CPUs can give a 16mb heap)

3. Mature collector
- Has hints to the Train, not really like it
- Each block (and metablock) has a bitmap of the blocks it references inside its parent
 -- 256 for blocks, 256 for metablocks and 64-128k+ for superblocks (quadratic growth pls help)
- blocks and metablocks are points-to, superblocks are pointed-by.
- Read barriers are done by page faulting, rare enough to work
- GC paging is minimised, it only fetches the superblock headers
- Mature GC is fully concurrent. It never pauses anything and relies on previously-given info
- Compaction is seldom done, while some superblocks disable it entirely

4. Persistence
- Mostly based on CLOSOS log-structured method
- Threads should hopefully be cleaned before being saved
- Like with normal Mezzano, snapshot write is done concurrently with execution, with CoW
- Will support hierarchal storage, heap size will be the size of the slowest (biggest) tier
 -- This means 1TB SSD + 2TB HDD would have a 2TB heap
 -- Hierarchal storage will use a different, more complex page replacement algorithm
 -- The lower tiers may also support compression to extend the heap even more
  --- However, (room) will be a bit more confusing
- Will also support a very simple, directly-mapped mode, and may be the only mode at first.

5. Write barrier
- As written before, writing to any fine object will immediately demote it to just mature
- And you need to evacuate any young objects that start referencing a mature one
 -- The problem with this is that you'll need to effectively do a GC cycle each time this happens
- The solution to this: another write barrier!
- A small hash-table with 16-bit double-word addressing and 2 relations per entry. At 4k entries it takes up 16k of memory and the operation is extremely cheap.
- The most important part would be accuracy. ~80% is the goal

6. Mature compaction
This thing is hard enough that it needs an entire section for it.
As said before, compaction *rarely* ever happens. Might not happen on the first month even, but it will eventually.
Many things have to be done to *reduce* the cost of this inevitable operation.
We already remember the relations between metablocks, which reduces tracing to an acceptable level
What needs to be compacted are superblocks with many partially-filled metablocks. As the nursery needs at least two full (and better contiguous) metablocks in a single superblock
The normal mature collector cannot compact but could mark the metablocks by their filled-ness with 2 bits

One approach that can speed things up is changing the pointer format: adding relative pointers or even addressing by block ID rather than pure address
That would make compaction not need copying or pointer updates, The collector may only need to change some metadata or a few outliar pages.
However this adds overhead for the entire runtime, for something that rarely if ever happens

Blocks could enjoy another bitmap, with 2 bits to tell if there's a refence from or to a different superblock. This reduces a ton of guess-work and you might deduce many dependencies with fewer reads.
Then another 2 bits per block for metablocks. That's still only a 1/8192 overhead for a big reduction in bandwidth

The block-to-block remembered bitmap has 1 bit unused for alignment reasons. It can be used for free to encode if there's an object pointed to by a lower (older) metablock, since this is an uncommon occurance.


6: virtual memory
When a page fault occurs, the system will issue a fetch.
While the request is being served, the thread would run a *neural network*, which, within the typical 20us delay, would predict the next few pages to fetch, and adds them to the queue.
Then, a global physical memory manager will use a more powerful neural net to decide evictions. (i.e more runtime, AVX allowed)

While this is technically possible and greatly improves throughput/latency, it's obviously complex and unlikely to be implemented anytime soon (barring the state of DL in Lisp). Hire me a mathematician.

Also unlike in normal Mezzano, offloading memory to disk IS cheaper than GC, you can decrease memory pressure. https://dl.acm.org/doi/10.1145/3582016.3582045


7: The actual algorithms explained
Freelists:
  I don't actually know how the freelist works yet.
  Disk doesn't need it, physical memory has been covered
  In virtual memory there is that metablock fillness bitmap

Nursery: (already explained enough in ch. 1)
  It's pretty much SICL's sliding GC, unchanged except for how it grabs memory (and also it doesn't copy to old gen)
  It always grabs 2 contiguous 1M metablocks in the same superblock and one that isn't fully filled. If the latter doesn't exist, just have a 3M heap.
  The leftover objects from "scraps" could be still considered part of the current nursery.
  There's a stronger write barrier. It tries to remember relationships between objects on the same metablock (or nursery?) by writing to a small hash table without collission detection.
  There could be another, much smaller struct for "same nursery, different metablock" relations, but can add branch overhead.
  Builds block metadata during collection
  At the end of the minor GC, it also sends the processor state and stack

Mature collector:
  Unlikely to run in the same superblock a cpu allocates in, reduces synchronization
  So in idle cases, it may not run at all even during checkpoints
  Mostly normal mark/sweep collector. Tries to find objects that died between 2 minor collections, perhaps looking at what was rewritten.
  Overall wouldn't do too much work, in most application its presence isn't critical.
  Very unlikely to fragment

Fine collector:
  An extension of the mature collector. Either seperate or part of it
  Walks the heap using the metadata vectors, which could be entirely cached in DRAM
  Superblocks are reference-counted
  Inside the superblocks does the compaction the mature collector couldn't. Fine can do moves more safely because it won't interrupt the running threads.
  Compaction gets more aggressive as storage fills up, but gets calmer when CPU is more utilized
  i.e, it will start looking into coalescing metablocks of some filledness, then pages when desperate.


New checkpointing format, based on soft scheduling (Percival)
Solving the issue of optimal checkpoint size and handling them.
The disk is still log-structured, but "segments" are no longer whole snapshots. They only hold the timestamp and modified page numbers, 511 entries just to be page-aligned (not sure a "list head" is needed)
The snapshotting policy is fully dynamic.
On each thread, there are values like "memory allocated", "GCs performed" "memory written". they are read and written non-atomically and considered against heuristic numbers. 
If positive, a timestamp will be written to where all threads should stop, copy their state and possibly (local) GC.
After that, the modified data will be concurrently copied to disk.
(If little is modified, the copy could pause instead)
There is also a daemon, which would issue snapshots if none happened in N ms
Threads can immediately continue after GC, before other threads are finished.


Databases:
Technically unrelated to this, but still want to include as it is useful to me (besides that DBs are a good benchmark)
A good use for the persisted system is making good-performing databases without the hassle of binary backing formats.
In https://rcs.uwaterloo.ca/~ali/papers/sosp21-aurora.pdf, RocksDB (a key-value store) had its persistence code rewritten for Aurora's API, cutting from >100k to just 400 LoC, while in most cases gaining performance.
The persistence remained explicit, but still shows the benefits of moving these things to the supervisor.
This could be made made implicit/automated, but the application's throughput suffered greatly under small (10-20ms) checkpoint periods. This system will address that.

What I thought of is a system for making more reliable DBs:
If you processed a transaction, sent the result but crashed at the same time, losing the data, it could be very catastrophic.
A solution I thought of: you can hold off returning until a checkpoint is complete, while you could keep processing everything else, as if it has.
This is helped by a persistent log that tracks all requests until they're complete. That would help keeping in-flight requests on failure, while also fully allowing opportunistic transaction processing above.

Result: good throughput, step latency, more bandwidth needed.

Dealing with your own network requests is a harder topic, and I don't know how a distributed database would work

Under question:
Optimizing for mobile SoCs
 - Target device: Dimensity 7300/Kirin990 i.e modern midrange, or 2019 flagship
 - Characteristics: very small caches (128k L2, 1M shared L3 + some SLC), 4 (p-) cores, very good IPC, smaller DRAM, less bandwidth (370mb/s seq write).
 - Benchmarks still rate CPU as 1.5x better than an i7-3632qm, so design may remain unchanged.