
Improved Reprovider.Strategy for entity DAGs (HAMT/UnixFS dirs, big files) #8676

@lidel

Description


@aschmahmann @petar I remember we discussed this a while ago as low-hanging fruit for bigger data providers like Pinata, but I was unable to find an existing issue, so I created this one.

Improving provider strategies was previously discussed in: #6221, #5774, ipfs-inactive/package-managers#84. In this issue I want to propose a well-scoped improvement, a codec-aware strategy, that could be shipped without refactoring the entire system.

TLDR

  • Add a new (opt-in) strategy: when announcing a big UnixFS directory tree, only announce root blocks of directories and files, and skip all internal file data blocks.
  • Leverage full content path for finding providers of root blocks.

Problem statement

Right now, we support three values in Reprovider.Strategy, which tells the reprovider what should be announced. Valid strategies are:

  • "all" - announce all stored data (this is also the implicit default)
  • "pinned" - only announce pinned data
  • "roots" - only announce directly pinned keys and root keys of recursive pins

If the repository gets too big, all and pinned are too expensive, and folks are forced to use roots, which is codec-agnostic and will only announce the root block of a UnixFS DAG.
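For reference, the strategy is selected in Kubo's JSON config file; a minimal fragment pinning the current roots behavior might look like this (the Interval value shown is just the usual default, included for context):

```json
{
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "roots"
  }
}
```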

This means that, in the case of big UnixFS datasets, the user has to write additional orchestration code to go the extra mile: manually pin every file within a bigger DAG, and make sure those sub-pins are removed when the entire DAG is no longer needed.

Proposed solution: codec-aware (UnixFS) strategy

Depending on the codec, different blocks may have different importance. In the case of UnixFS, the important blocks are the manifest (root) blocks of directories and files. Sub-blocks of individual files, which carry the data itself, are not as critical as those manifest blocks: it is the CID of the manifest block that is looked up on the DHT first.

A big data provider may want to opt in to a codec-aware strategy as a "best-effort" way to provide something on the DHT rather than nothing: in the case of UnixFS, only provide these manifest blocks on the DHT, facilitating the initial lookup without the cost of announcing all the sub-blocks.
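To make the selection rule concrete, here is a sketch of what such a strategy would announce, using a simplified in-memory DAG. The node types and helper names below are hypothetical stand-ins, not Kubo's actual internals, which walk real IPLD blocks:

```go
package main

import "fmt"

// NodeKind is a simplified stand-in for UnixFS node types.
type NodeKind int

const (
	Directory NodeKind = iota // UnixFS (possibly HAMT-sharded) directory manifest
	FileRoot                  // root (manifest) block of a file
	FileChunk                 // raw data sub-block of a file
)

// Node is a hypothetical in-memory DAG node.
type Node struct {
	CID      string
	Kind     NodeKind
	Children []*Node
}

// announceSet walks the DAG and collects only the CIDs a codec-aware
// strategy would announce: directory and file manifest blocks. File data
// sub-blocks are skipped entirely.
func announceSet(n *Node, out *[]string) {
	switch n.Kind {
	case Directory, FileRoot:
		*out = append(*out, n.CID)
	case FileChunk:
		return // data blocks are not announced
	}
	for _, c := range n.Children {
		announceSet(c, out)
	}
}

func main() {
	file := &Node{CID: "bafy-file", Kind: FileRoot, Children: []*Node{
		{CID: "bafy-chunk1", Kind: FileChunk},
		{CID: "bafy-chunk2", Kind: FileChunk},
	}}
	dir := &Node{CID: "bafy-dir", Kind: Directory, Children: []*Node{file}}

	var cids []string
	announceSet(dir, &cids)
	fmt.Println(cids) // prints "[bafy-dir bafy-file]"
}
```

For a directory of N large files the announce set shrinks from every block in the repo to roughly N+1 records, which is the whole point of the strategy.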

Open questions

  • Is announcing of those UnixFS root blocks enough?
    • Depends. After the manifest block of a big file is fetched, the user is already connected to a peer which most likely has the rest of the blocks, and the transfer can happen over Bitswap. But if the transfer gets interrupted and the connection is lost, it is not possible to resume, because we already have the root block in the local store and we only look up the missing sub-blocks, which were not announced on the DHT.
      • A potential fix would be to do a DHT lookup not only for a specific sub-block of a file, but also for the first UnixFS root block above it (either the root of the file, or a parent directory). The rationale being that if someone has the root of a file, they most likely have the rest.
      • We track this in #10251 (Leverage Content Path Affinity in routing)
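The fallback described above can be sketched as follows. The provider lookup is stubbed with a map, and the function and parameter names are illustrative assumptions, not an existing routing API:

```go
package main

import "fmt"

// findProviders is a stub for a DHT provider lookup; real code would
// issue a routing query. Here the map only "knows" providers of blocks
// that were actually announced (the manifest blocks).
func findProviders(announced map[string][]string, cid string) []string {
	return announced[cid]
}

// providersWithAffinity looks up providers for a missing block and, when
// none are found, falls back to the ancestor roots along its content path
// (nearest first: file root, then parent directory, and so on).
func providersWithAffinity(announced map[string][]string, want string, ancestors []string) []string {
	if peers := findProviders(announced, want); len(peers) > 0 {
		return peers
	}
	for _, root := range ancestors {
		if peers := findProviders(announced, root); len(peers) > 0 {
			return peers // whoever has the root likely has the sub-blocks
		}
	}
	return nil
}

func main() {
	// Only manifest blocks were announced on the DHT.
	announced := map[string][]string{
		"bafy-file": {"peerA"},
		"bafy-dir":  {"peerA", "peerB"},
	}
	// A data chunk was never announced, so the lookup falls back to the
	// providers of the enclosing file's root block.
	peers := providersWithAffinity(announced, "bafy-chunk7", []string{"bafy-file", "bafy-dir"})
	fmt.Println(peers) // prints "[peerA]"
}
```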

Metadata


Labels: kind/enhancement, kind/feature, need/triage, topic/UnixFS, topic/dht
