IPIP 0499: CID Profiles #499

Open
wants to merge 12 commits into
base: main

mishmosh
Contributor

@mishmosh mishmosh commented Apr 3, 2025

Currently, CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID.

This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. They can be used to verify data across implementations, provide recommended settings depending on retrieval performance goals, and more.
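As a rough illustration of why the same bytes can map to different roots, the sketch below chunks one input at two sizes and hashes the chunk digests together. It is a toy stand-in, not UnixFS: `toy_root` and its flat layout are hypothetical, and real importers build dag-pb nodes rather than concatenating digests.

```python
import hashlib


def chunk(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_root(data: bytes, chunk_size: int) -> str:
    """Hash each chunk, then hash the concatenated chunk digests.

    Hypothetical stand-in for a DAG root; NOT the UnixFS/dag-pb encoding.
    """
    leaf_digests = [hashlib.sha256(c).digest() for c in chunk(data, chunk_size)]
    return hashlib.sha256(b"".join(leaf_digests)).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of sample input

root_256k = toy_root(data, 256 * 1024)   # four leaves
root_1m = toy_root(data, 1024 * 1024)    # one leaf
```

The two roots differ even though `data` is identical, while repeating a call with the same chunk size reproduces the same root. A profile pins parameters like this so that every implementation makes the same choice.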

@mishmosh mishmosh requested a review from a team as a code owner April 3, 2025 14:03
@mishmosh mishmosh changed the title Create ipip-0000.md: CID profiles IPIP 0499: CID Profiles Apr 3, 2025
lidel added a commit to ipfs/kubo that referenced this pull request Apr 15, 2025
lets make the fanout match the max links from files
and rename profile to `-wide`

this will make it easier to discuss in ipfs/specs#499
lidel and others added 2 commits April 15, 2025 23:41
Co-authored-by: Bumblefudge <bumblefudge@learningproof.xyz>
Import.* config params for controlling DAG width were added in:
ipfs/kubo#10774
@lidel
Member

lidel commented Apr 15, 2025

Thank you for kicking this off and filling in the initial state.

I've incorporated specific "dag width" settings for File, Directory and HAMTDirectory nodes,
and updated the table to reflect the state from ipfs/kubo#10774
and the profiles that exist in the Kubo master branch: legacy-cid-v0, test-cid-v1 and test-cid-v1-wide.

Next:

  • agree on what the "cid-2025" profile should look like
    • this will be new default in "Kubo v1.0"
    • we have test-cid-v1 and test-cid-v1-wide in Kubo as potential candidates
  • switch to PR from local branch (so we have build preview)
  • figure out how to render the information (currently the table is not supported by https://github.com/ipfs/spec-generator)

@SethDocherty

This comment was marked as off-topic.

@2color
Member

2color commented Aug 12, 2025

I pushed a bunch of edits to move the conversation forward. This is sorely needed in the ecosystem, and the hope is that by building consensus we can improve developer experience when working with UnixFS and the overall health of the UnixFS ecosystem.

Feedback is always appreciated.


This proposal introduces configuration profiles for CIDs used to represent files and directories with UnixFS. These ensure deterministic CID generation for the same data, regardless of the implementation.

Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithem, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs.
Contributor

Suggested change
Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithem, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs.
Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithm, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs.

This lack of determinism has a number of drawbacks:

- It is difficult to verify content across different tools and implementations, as the same content may yield different CIDs.
- Users are requires to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process.
Contributor

Suggested change
- Users are requires to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process.
- Users are required to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process.


By introducing profiles, we can benefit from both the optionality offered by UnixFS, where users are free to choose their own parameters, and determinism through profiles.

UnixFS CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file tree can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID. Profiles offer With profiles, following the same profile will produce identical CIDs for identical content, whic makes verification regardless of implementation.
Contributor

Re-iterative, typos. Last sentence does not compile in natural language.

1. CID version (currently only CIDv0 or CIDv1)
1. Hash function
1. UnixFS chunk size
1. UnixFS DAG layout (e.g. balanced, trickle)
Contributor

Suggested change
1. UnixFS DAG layout (e.g. balanced, trickle)
1. UnixFS DAG layout (e.g. balanced, trickle etc...)

Contributor

We probably need to fix the unixfs spec as it doesn't really say how the layout works. For a start, the DAG layout is not part of the unixfs "format", but rather unixfs protobufs are embedded in merkledag-pb protobufs, that could be cbors or jsons. So either we set that in stone (unixfs dag == protobuf-pb) or we need to be explicit in profiles.

Then there is the part about the subtleties of the layout. i.e. the trickle parameters, the balancing of the balanced dag etc, which are also badly specified. It may be that now you can get away with UnixFS DAG width for configuring existing DAG layouts, but I would imagine layouts with more options than the width.

1. UnixFS DAG layout (e.g. balanced, trickle)
1. UnixFS DAG width (max number of links per `File` node)
1. `HAMTDirectory` fanout (must be a power of 2)
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PBNode.Links
Contributor

If this number is dynamic based on the lengths of the actual link entries in the dag, we will need to specify what algorithm that estimation follows. I would put such things in a special "ipfs legacy" profile to be honest, along with cidv0, non-raw leaves etc. We probably should heavily discourage coming up with profiles that do weird things, like dynamically setting params or not using raw-leaves for things.
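To make the concern concrete, here is a minimal sketch of one possible estimator, assuming the estimate is the sum of each link's name length and CID byte length. The names `Link`, `estimated_links_size` and `should_use_hamt` are hypothetical, and whether a given implementation counts exactly these bytes, or compares with `>` rather than `>=`, is an assumption here and is the kind of detail a profile would have to pin down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str         # directory entry name
    cid_bytes: bytes  # binary CID of the linked node


def estimated_links_size(links: list[Link]) -> int:
    """Sum of UTF-8 name length plus CID byte length for every link.

    Assumption: protobuf framing and the Tsize field are ignored entirely.
    """
    return sum(len(link.name.encode("utf-8")) + len(link.cid_bytes) for link in links)


def should_use_hamt(links: list[Link], threshold_bytes: int) -> bool:
    """Switch to HAMTDirectory once the estimate exceeds the threshold (strict >)."""
    return estimated_links_size(links) > threshold_bytes


fake_cid = bytes(36)  # placeholder with the byte length of a CIDv1 + sha2-256 CID
small_dir = [Link(f"f{i}", fake_cid) for i in range(10)]  # 10 * (2 + 36) bytes
```

Two implementations that disagree on any of these choices would switch to a `HAMTDirectory` at different directory sizes and so produce different root CIDs for the same tree.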

Contributor

So, each layout would have its own set of layout-params:

  • balanced:
    • max-links: N
  • trickle:
    • max-leaves-per-level: N
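The per-layout parameters above could be modelled as one small type per layout, so that a profile cannot name a layout without also fixing its parameters. This is only a sketch: the class and field names are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class BalancedLayout:
    max_links: int  # max children per File node


@dataclass(frozen=True)
class TrickleLayout:
    max_leaves_per_level: int


# A profile would carry exactly one of these.
Layout = Union[BalancedLayout, TrickleLayout]


def describe(layout: Layout) -> str:
    """Render a layout and its parameters as a stable string."""
    if isinstance(layout, BalancedLayout):
        return f"balanced(max-links={layout.max_links})"
    if isinstance(layout, TrickleLayout):
        return f"trickle(max-leaves-per-level={layout.max_leaves_per_level})"
    raise TypeError(f"unknown layout: {layout!r}")
```

For example, `describe(BalancedLayout(max_links=100))` yields `balanced(max-links=100)`; the value 100 is arbitrary. A layout with more options than width would add fields to its own class without touching the others.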

Comment on lines +57 to +58
1. Whether empty directories are included in the DAG
- Some implementations apply filtering before merkleizing filesystem entries in the DAG.
Contributor

This is weird, because then we need to consider empty files, hidden files, unreadable files, symlinks and symlink follows, so probably need to mention all those as part of the profile too?

The profiles define a set of parameters that affect the resulting CID. These parameters are based on the UnixFS specification and are used to generate the CID for a given file tree. The parameters include:

1. CID version (currently only CIDv0 or CIDv1)
1. Hash function
Contributor

This subtly means that the hash function will be expected to be the same for all the nodes in the DAG in question. I'm not sure if that is a requirement that is written anywhere, so technically you can build unixfs DAGs with multiple hash functions (for fun right?).


The profiles define a set of parameters that affect the resulting CID. These parameters are based on the UnixFS specification and are used to generate the CID for a given file tree. The parameters include:

1. CID version (currently only CIDv0 or CIDv1)
Contributor

Does this matter? Or how much? Currently, in Merkledag-PB, links are the multihashes, so in principle, the cid version used (and the multibase) just decides the final presentation of the root CID. If profiles affect only unixfs, the codec is also fixed. If we have the same multihash, the same codec, the only thing that can change is the CID-encoding base if we have one.

So if we want a profile to dictate exactly the final string representation of the root CID, we need to list "multibase". And if not, if we are happy with the profile just producing equivalent CIDs (potentially in different bases), then CID version does not fully matter.
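The point about presentation can be sketched with the standard library alone. The example below takes one sha2-256 multihash and renders it as a CIDv0 (base58btc) and as a CIDv1 (dag-pb codec, base32). It assumes the root block bytes are identical in both cases; the input bytes are a stand-in, not a real dag-pb block, and the helper names are hypothetical.

```python
import base64
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58btc(data: bytes) -> str:
    """Minimal base58btc encoder (leading zero bytes become '1')."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def sha256_multihash(block: bytes) -> bytes:
    """Multihash: 0x12 (sha2-256) + 0x20 (32-byte length) + digest."""
    return bytes([0x12, 0x20]) + hashlib.sha256(block).digest()


def cid_v0(multihash: bytes) -> str:
    """CIDv0 is the bare multihash in base58btc (dag-pb is implied)."""
    return base58btc(multihash)


def cid_v1_base32(multihash: bytes, codec: int = 0x70) -> str:
    """CIDv1: 0x01 + codec + multihash, base32 lowercase with 'b' prefix.

    0x70 is dag-pb. Writing the codec as one byte only works for codes
    below 0x80; larger codes need a multi-byte varint.
    """
    raw = bytes([0x01, codec]) + multihash
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


mh = sha256_multihash(b"stand-in for an encoded dag-pb root block")
v0, v1 = cid_v0(mh), cid_v1_base32(mh)
```

Here `v0` starts with `Qm` and `v1` starts with `bafybei`, yet both wrap the same multihash. Whether a profile should fix the string form, including the multibase, or only the underlying multihash is the open question raised above.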


### Compatibility

UnixFS Data encoded with the profiles defined in this IPIP is fully compatible with existing implementations, as it is fully compliant with the UnixFS specification.
Contributor

Cannot be compliant with details that are not specified as of today.


### Alternatives

As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle proofs needed to verify the CID.
Contributor

Suggested change
As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle proofs needed to verify the CID.
As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle DAG nodes needed to verify the CID.
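A minimal sketch of what verifying from DAG nodes involves, assuming the blocks have already been read out of a CAR file by other code (not shown) and that every block is addressed by a sha2-256 digest. `verify_blocks` is a hypothetical helper, and a full check would also walk the links from the root to confirm that the blocks form the expected DAG.

```python
import hashlib


def verify_blocks(blocks: list[tuple[bytes, bytes]]) -> bool:
    """Return True only if every block hashes to its claimed sha2-256 digest.

    Each item is (claimed_digest, block_bytes).
    """
    return all(hashlib.sha256(data).digest() == digest for digest, data in blocks)


leaf = b"leaf block bytes"
good = [(hashlib.sha256(leaf).digest(), leaf)]
tampered = [(hashlib.sha256(leaf).digest(), leaf + b"!")]
```

With these inputs, `verify_blocks(good)` passes and `verify_blocks(tampered)` fails. The cost is that the verifier must receive every intermediate node, which is the overhead profiles aim to avoid.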

7 participants