State tree / app state garbage collection #154

@aakoshh

I was thinking about implementing garbage collection for the ledger.

See the Forest and Lotus implementations.

Forest uses a "semi-space" strategy: it copies the reachable blocks in the store to a new database and deletes the old one, always keeping the last two databases, writing into the 'current' one and reading from both 'current' and 'old'. They use ParityDB with hashed keys, so they cannot enumerate keys in their original format, which would be required for filtering if they wanted to implement "mark and sweep" GC. Their strategy works because everything in their store is IPLD data, so they can analyse reachability from the tipset roots, and everything relevant gets copied.
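
For concreteness, here is a minimal sketch of the semi-space idea, with stand-in `Cid` and `Blockstore` types and a hypothetical `links()` helper; none of these names come from Forest's actual code:

```rust
use std::collections::HashSet;

// Stand-in types, not Forest's.
type Cid = [u8; 32];

trait Blockstore {
    fn get(&self, cid: &Cid) -> Option<Vec<u8>>;
    fn put(&mut self, cid: Cid, data: Vec<u8>);
}

/// Hypothetical helper: decode an IPLD block and collect its child CIDs.
fn links(_data: &[u8]) -> Vec<Cid> {
    unimplemented!("parse IPLD and return the child links")
}

/// Semi-space GC: copy everything reachable from `roots` out of `old`
/// into `new`; afterwards the old database can be deleted wholesale.
fn copy_reachable(old: &impl Blockstore, new: &mut impl Blockstore, roots: &[Cid]) {
    let mut seen: HashSet<Cid> = HashSet::new();
    let mut stack: Vec<Cid> = roots.to_vec();
    while let Some(cid) = stack.pop() {
        if !seen.insert(cid) {
            continue; // already copied
        }
        if let Some(data) = old.get(&cid) {
            stack.extend(links(&data));
            new.put(cid, data);
        }
    }
}
```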

That's not how we use RocksDB: we maintain separate column families for different types of data (sketched after the list):

  1. Application state history is not based on CIDs, just using RocksDB as a KVStore
  2. Actor state is an IPLD Blockstore in its own namespace (column family)
  3. Bitswap Store is an IPLD Blockstore reading its own namespace as well as the actor namespace, but only writing to its own
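
For illustration, opening such a layout with the Rust `rocksdb` crate looks roughly like this; the column family names here are made up for the example:

```rust
use rocksdb::{ColumnFamilyDescriptor, DB, Options};

fn open_store(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    // Illustrative names for the three namespaces described above.
    let cfs = vec![
        ColumnFamilyDescriptor::new("app_state", Options::default()),
        ColumnFamilyDescriptor::new("actor_store", Options::default()),
        ColumnFamilyDescriptor::new("bitswap_store", Options::default()),
    ];
    DB::open_cf_descriptors(&opts, path, cfs)
}
```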

The latter separation was introduced so that data arriving over Bitswap has no chance of affecting actor calculations, given that reachability analysis is not performed by the FVM. This shouldn't be an issue with EVM actors, but if we had builtin actors, it could be. NB: now that we have snapshots, and potentially garbage collection itself, the contents of the actor store can also differ between nodes, so random CID lookups cannot be considered deterministic.

I was thinking about implementing a sort of mark-and-sweep with Bloom filters on the actor state (a sketch follows the list):

  1. when we start collecting, we start entering every new key written to the store into a Bloom filter, to prevent data freshly created by contract executions from being deleted right after it is written
  2. we traverse the blocks reachable from the application history and enter their CIDs into another Bloom filter
  3. once done, we send the history Bloom filter to the store and merge it into the other one
  4. then we iterate the keys and ask the store to delete each one iff the key is not in the merged Bloom filter (if it's in the Bloom filter, it's very likely because it is reachable)
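
A sketch of the filter side of steps 1-4; the Bloom filter is hand-rolled here just to keep the example self-contained (a real implementation would use an existing crate and size the filter as discussed below):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter, just to make the scheme concrete.
struct Bloom {
    bits: Vec<u64>,
    hashes: u32,
}

impl Bloom {
    fn new(nbits: usize, hashes: u32) -> Self {
        Bloom { bits: vec![0; (nbits + 63) / 64], hashes }
    }

    fn bit(&self, key: &[u8], i: u32) -> usize {
        let mut h = DefaultHasher::new();
        (key, i).hash(&mut h);
        (h.finish() as usize) % (self.bits.len() * 64)
    }

    /// Steps 1 and 2: record keys as they are written / marked reachable.
    fn insert(&mut self, key: &[u8]) {
        for i in 0..self.hashes {
            let b = self.bit(key, i);
            self.bits[b / 64] |= 1u64 << (b % 64);
        }
    }

    fn contains(&self, key: &[u8]) -> bool {
        (0..self.hashes).all(|i| {
            let b = self.bit(key, i);
            self.bits[b / 64] & (1u64 << (b % 64)) != 0
        })
    }

    /// Step 3: merging two filters of identical geometry is a bitwise OR.
    fn merge(&mut self, other: &Bloom) {
        for (a, b) in self.bits.iter_mut().zip(&other.bits) {
            *a |= b;
        }
    }
}

/// Step 4: delete every key the merged filter has definitely not seen.
/// False positives retain a little garbage; nothing reachable is deleted.
fn sweep(keys: impl Iterator<Item = Vec<u8>>, marked: &Bloom, mut delete: impl FnMut(&[u8])) {
    for key in keys {
        if !marked.contains(&key) {
            delete(&key);
        }
    }
}
```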

This way we can collect e.g. 95% of the unused keys (the exact fraction depends on how big the Bloom filter is, i.e. on its false positive rate) using limited memory and no blocking.
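
The collected fraction is one minus the filter's false positive rate, and the standard Bloom filter sizing formulas show the memory cost stays modest:

```rust
/// Standard Bloom filter sizing: total bits m = -n * ln(p) / (ln 2)^2
/// and hash count k = (m / n) * ln 2, for `n` keys at false positive
/// rate `p`.
fn bloom_params(n: f64, p: f64) -> (u64, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-n * p.ln() / (ln2 * ln2)).ceil();
    let k = ((m / n) * ln2).round();
    (m as u64, k as u32)
}

// E.g. 10 million keys at a 5% false positive rate needs roughly
// 62 million bits (~7.8 MB) and 4 hash functions.
```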

The downside is that the database has to be wrapped in something that receives read and write requests through a channel (reads should go through the same channel if we want them to be consistent with writes) and processes them on a dedicated thread. It can potentially perform deletions in batches.
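
Roughly what I mean by the wrapper, sketched over `std::sync::mpsc` with stand-ins for the actual RocksDB calls:

```rust
use std::sync::mpsc::{Receiver, Sender};

// Stand-ins for the real RocksDB operations.
fn lookup(_key: &[u8]) -> Option<Vec<u8>> { None }
fn write(_key: &[u8], _value: &[u8]) {}
fn delete_batch(_keys: &[Vec<u8>]) {}

/// All requests go through one channel, so a read observes every write
/// and delete that was enqueued before it.
enum StoreRequest {
    Get(Vec<u8>, Sender<Option<Vec<u8>>>),
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

fn run_store(rx: Receiver<StoreRequest>) {
    let mut pending: Vec<Vec<u8>> = Vec::new();
    for req in rx {
        match req {
            StoreRequest::Get(key, reply) => {
                let _ = reply.send(lookup(&key));
            }
            StoreRequest::Put(key, value) => write(&key, &value),
            StoreRequest::Delete(key) => {
                pending.push(key);
                if pending.len() >= 1000 {
                    delete_batch(&pending); // batched deletion
                    pending.clear();
                }
            }
        }
    }
}
```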

RocksDB should eventually compact the data to reclaim the free space and heal the fragmentation. This is a problem the "semi-space" strategy doesn't have, but in return mark-and-sweep doesn't require up to 100% extra disk space during collection.
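
If waiting for background compaction turns out to be a problem, the Rust `rocksdb` crate does let us trigger it explicitly:

```rust
use rocksdb::DB;

/// Nudge compaction after a big sweep instead of waiting for RocksDB's
/// background schedule; `None, None` means the whole key range, and
/// there is a `compact_range_cf` variant for a single column family.
fn compact_all(db: &DB) {
    db.compact_range::<&[u8], &[u8]>(None, None);
}
```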

A different question is how to deal with the Bitswap store, where there are no obvious roots to anchor reachability. For example, we wanted to use Bitswap for resolving cross-messages, and there is no obvious way to tie these to any root: unlike Forest, which has the tipset, in our case the only custodian of blocks and transactions is CometBFT. If we want roots for cross-messages (checkpoints), we have to add them to the application state explicitly. Then we could run the same garbage collector on the Bitswap store as well. For now, the code paths using Bitswap haven't been enabled yet.

Maybe worth noting that it is also possible to drop an entire column family, implementing "semi-space" at the per-namespace level, but it's probably not a good idea: we don't know when RocksDB will actually reclaim the space, so it could double the storage requirement for an unknown amount of time, whereas the multi-database approach simply deletes the files no longer in use.
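
For reference, that variant would look something like this with the `rocksdb` crate (the migration of live data is elided):

```rust
use rocksdb::{DB, Options};

/// Per-namespace "semi-space": create a fresh column family, migrate
/// the reachable data into it, then drop the old one. When RocksDB
/// actually reclaims the underlying files is up to it.
fn swap_namespace(db: &mut DB, old: &str, new: &str) -> Result<(), rocksdb::Error> {
    db.create_cf(new, &Options::default())?;
    // ... copy the reachable data from `old` into `new` here ...
    db.drop_cf(old)
}
```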

We could potentially also change the application state history to be maintained as an AMT, so that we have a single blockstore. The KVStore abstraction differs in that it offers transactionality.
