Library Overview

If you have some atproto repository data, and you want to operate on it with Python, you’ve come to the right place [1]. The APIs offered here are rather low-level, but I’m planning on adding higher-level helper utilities in the future.

Block Storage

The foundations of repos are content-addressed Blocks of data, as in the IPLD data model. The abstract BlockStore() interface facilitates access to blocks, agnostic of the underlying storage medium. The following implementations are available:

MemoryBlockStore() - stores blocks in memory only (inside a dict).
ReadOnlyCARBlockStore() - accesses the contents of a CAR file.
SqliteBlockStore() - accesses blocks stored in a table of an SQLite database.

Finally, the OverlayBlockStore() class allows you to layer one BlockStore over another, with writes going to the top layer only. This is useful in several scenarios: for example, reading blocks from two CAR files at once so that you can diff them, or staging modifications in memory before committing them to persistent storage.
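To make the layering idea concrete, here is a minimal, self-contained sketch. It does not use atmst: the class names (SketchBlockStore, SketchMemoryStore, SketchOverlayStore), the put()/get() method names, and the content_key() helper are all hypothetical, and a bare SHA-256 digest stands in for a real CID.

```python
import hashlib
from abc import ABC, abstractmethod


class SketchBlockStore(ABC):
    """Hypothetical stand-in for an abstract block store (not atmst's API)."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> bytes: ...


class SketchMemoryStore(SketchBlockStore):
    """Keeps blocks in a plain dict."""

    def __init__(self):
        self._blocks = {}

    def put(self, key, value):
        self._blocks[key] = value

    def get(self, key):
        return self._blocks[key]  # raises KeyError if the block is absent


class SketchOverlayStore(SketchBlockStore):
    """Reads try the upper layer first, then the lower; writes touch only the upper."""

    def __init__(self, upper, lower):
        self.upper = upper
        self.lower = lower

    def put(self, key, value):
        self.upper.put(key, value)

    def get(self, key):
        try:
            return self.upper.get(key)
        except KeyError:
            return self.lower.get(key)


def content_key(data: bytes) -> bytes:
    """Content address for this sketch: a SHA-256 digest, not a real CID."""
    return hashlib.sha256(data).digest()


# "persistent" plays the role of existing storage; "staging" collects new writes.
persistent = SketchMemoryStore()
persistent.put(content_key(b"old block"), b"old block")

staging = SketchMemoryStore()
overlay = SketchOverlayStore(upper=staging, lower=persistent)
overlay.put(content_key(b"new block"), b"new block")
```

In this sketch, reads through overlay see both blocks, while persistent stays untouched until the staged blocks are copied across explicitly.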

Merkle Search Trees

With a BlockStore, we can read and write content-addressed blocks of data. Content-addressing is cool, but sometimes you want mutability. The Merkle Search Tree data structure builds on top of content-addressed Block storage, providing a mutable map of keys onto values. In atproto, the keys are arbitrary strings (under certain constraints), and the values are “records”.

Everything is still immutable under the hood, so modifying an MST results in a new root hash.
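The "new root hash per modification" behaviour can be illustrated without a real MST. The sketch below is an assumption-laden toy: it flattens the whole map into a single JSON block (a real MST splits it across many nodes), uses a SHA-256 hex digest in place of a CID, and the names store_map(), load_map(), and put_entry() are hypothetical.

```python
import hashlib
import json

blocks = {}  # content address (hex digest) -> serialised bytes


def store_map(mapping: dict) -> str:
    """Serialise deterministically, store under the content hash, return the root."""
    data = json.dumps(mapping, sort_keys=True).encode()
    root = hashlib.sha256(data).hexdigest()
    blocks[root] = data
    return root


def load_map(root: str) -> dict:
    return json.loads(blocks[root])


def put_entry(root: str, key: str, value: str) -> str:
    """Return the root of a new map; the block behind the old root is never changed."""
    updated = dict(load_map(root))
    updated[key] = value
    return store_map(updated)


empty_root = store_map({})
root_v1 = put_entry(empty_root, "app.example.post/1", "hello")
root_v2 = put_entry(root_v1, "app.example.post/2", "world")
```

Both root_v1 and root_v2 remain readable afterwards, which is what makes "mutable" maps on top of immutable blocks possible: the caller just keeps track of which root is current.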

atmst doesn’t have a dedicated class to represent an MST (yet?); instead, we just reference the root node by CID.

Nodes

An MST is made up of one or more Nodes. atmst represents Nodes using MSTNode(), an immutable dataclass.

Nodes are stored in a BlockStore, serialised as DAG-CBOR, and the NodeStore() class handles this transparently. A NodeStore internally maintains an LRU cache, mapping CIDs to MSTNode objects, to reduce the impact of BlockStore read latency, hash verification, and deserialisation overhead.
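As a rough illustration of why the cache helps, here is a standalone sketch of a node store with an LRU cache in front of a block mapping. It is not atmst's NodeStore: SketchNode and SketchNodeStore are hypothetical names, JSON stands in for DAG-CBOR, a SHA-256 hex digest stands in for a CID, and the block_reads counter exists only to make cache hits visible.

```python
import hashlib
import json
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchNode:
    """Immutable toy node; only holds keys and values."""
    keys: tuple
    values: tuple

    def serialise(self) -> bytes:
        return json.dumps({"k": list(self.keys), "v": list(self.values)}).encode()

    @staticmethod
    def deserialise(data: bytes) -> "SketchNode":
        obj = json.loads(data)
        return SketchNode(tuple(obj["k"]), tuple(obj["v"]))


class SketchNodeStore:
    def __init__(self, blocks: dict, capacity: int = 2):
        self.blocks = blocks          # content address -> serialised node
        self.capacity = capacity      # maximum number of cached nodes
        self.cache = OrderedDict()    # least recently used entry comes first
        self.block_reads = 0          # counts reads that missed the cache

    def put(self, node: SketchNode) -> str:
        data = node.serialise()
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        self._remember(key, node)
        return key

    def get(self, key: str) -> SketchNode:
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        data = self.blocks[key]
        self.block_reads += 1
        # Verify the block really matches its address before trusting it.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("block does not match its content address")
        node = SketchNode.deserialise(data)
        self._remember(key, node)
        return node

    def _remember(self, key: str, node: SketchNode) -> None:
        self.cache[key] = node
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry


store = SketchNodeStore(blocks={}, capacity=2)
key_a = store.put(SketchNode(("a",), ("1",)))
key_b = store.put(SketchNode(("b",), ("2",)))
key_c = store.put(SketchNode(("c",), ("3",)))  # pushes key_a out of the cache
```

A cache hit skips the block read, the hash check, and the deserialisation step in one go, which is the saving the paragraph above describes.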

The NodeWrangler() class facilitates modifications to MSTs, via the put_record() and delete_record() methods. These methods each return a CID reference to the new MST root, with any newly created Nodes tracked internally using a NodeStore().

For reading MSTs, the NodeWalker() class acts as a “cursor” for walking the tree from a given starting point (usually the root), including convenience methods for accessing records by key.
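The "cursor" idea can be sketched independently of atmst. In the toy below, TreeNode, SketchCursor, next_entry(), and find() are hypothetical names, and child nodes are held as direct object references rather than being looked up by CID, so this shows only the traversal pattern, not the library's NodeWalker interface.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeNode:
    """Toy search-tree node: subtrees[i] holds keys that sort before keys[i],
    and subtrees[-1] holds keys that sort after the last key."""
    keys: tuple
    values: tuple
    subtrees: tuple  # len(keys) + 1 entries, each a TreeNode or None


class SketchCursor:
    """Walks the tree in key order using an explicit stack of [node, next index]."""

    def __init__(self, root):
        self._stack = []
        self._descend(root)

    def _descend(self, node):
        # Follow the leftmost path, remembering each node on the way down.
        while node is not None:
            self._stack.append([node, 0])
            node = node.subtrees[0]

    def next_entry(self):
        """Return the next (key, value) pair, or None once the walk is finished."""
        while self._stack:
            frame = self._stack[-1]
            node, i = frame
            if i < len(node.keys):
                frame[1] = i + 1
                entry = (node.keys[i], node.values[i])
                self._descend(node.subtrees[i + 1])
                return entry
            self._stack.pop()
        return None


def find(node, key):
    """Look up a single key by descending from the given node."""
    while node is not None:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        node = node.subtrees[i]
    return None


left = TreeNode(("a",), ("1",), (None, None))
middle = TreeNode(("c",), ("3",), (None, None))
root = TreeNode(("b", "d"), ("2", "4"), (left, middle, None))

cursor = SketchCursor(root)
entries = []
while (entry := cursor.next_entry()) is not None:
    entries.append(entry)
```

Keeping the position on an explicit stack, rather than recursing, is what lets a cursor pause and resume from any point in the tree.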

The mst_diff() method makes use of NodeWalker() internally.

Recipes

For some examples of how all these components fit together, check out the source of cartool.py.

TODO: Improve this part of the docs!