IPIP 0499: CID Profiles #499
base: main
lets make the fanout match the max links from files and rename profile to `-wide` this will make it easier to discuss in ipfs/specs#499
Co-authored-by: Bumblefudge <bumblefudge@learningproof.xyz>
Import.* config params for controlling DAG width were added in: ipfs/kubo#10774
Thank you for kicking this off, and filling initial state. I've incorporated specific "dag width" settings for Next:
Co-authored-by: Christian Paul <info@jaller.de>
I pushed a bunch of edits to move the conversation forward. This is sorely needed in the ecosystem, and the hope is that by building consensus we can improve developer experience when working with UnixFS and the overall health of the UnixFS ecosystem. Feedback is always appreciated.
This proposal introduces configuration profiles for CIDs used to represent files and directories with UnixFS. These ensure that the deterministic CID generation for the same data, regardless of the implementation.

Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithem, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs.
Suggested change:
- Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithem, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs.
+ Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithm, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs.
This lack of determinism makes has a number of drawbacks:

- It is difficult to verify content across different tools and implementations, as the same content may yield different CIDs.
- Users are requires to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process.
Suggested change:
- - Users are requires to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process.
+ - Users are required to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process.
By introducing profiles, we can benefit from both the optionality offered by UnixFS, where users are free to chose their own parameters, and determinism through profiles.

UnixFS CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file tree can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID. Profiles offer With profiles, following the same profile will produce identical CIDs for identical content, whic makes verification regardless of implementation.
Re-iterative, typos. Last sentence does not compile in natural language.
1. CID version (currently only CIDv0 or CIDv1)
1. Hash function
1. UnixFS chunk size
1. UnixFS DAG layout (e.g. balanced, trickle)
Suggested change:
- 1. UnixFS DAG layout (e.g. balanced, trickle)
+ 1. UnixFS DAG layout (e.g. balanced, trickle etc...)
We probably need to fix the unixfs spec as it doesn't really say how the layout works. For a start, the DAG layout is not part of the unixfs "format"; rather, unixfs protobufs are embedded in merkledag-pb protobufs, which could instead be CBOR or JSON. So either we set that in stone (unixfs dag == protobuf-pb) or we need to be explicit in profiles.
Then there is the part about the subtleties of the layout, i.e. the trickle parameters, the balancing of the balanced dag, etc., which are also badly specified. It may be that for now you can get away with UnixFS DAG width for configuring existing DAG layouts, but I would imagine layouts with more options than the width.
1. UnixFS DAG layout (e.g. balanced, trickle)
1. UnixFS DAG width (max number of links per `File` node)
1. `HAMTDirectory` fanout (must be a power of 2)
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links
If this number is dynamic based on the lengths of the actual link entries in the dag, we will need to specify what algorithm that estimation follows. I would put such things in a special "ipfs legacy" profile to be honest, along with cidv0, non-raw leaves etc. We probably should heavily discourage coming up with profiles that do weird things, like dynamically setting params or not using raw-leaves for things.
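
To make the point about needing a specified estimation algorithm concrete, here is a minimal sketch of one possible estimate: sum the byte length of each link name plus the byte length of each link CID, then switch to `HAMTDirectory` once the total exceeds the threshold. The function names, the exact formula, and the strict comparison are assumptions for illustration only; a profile would have to pin down whatever formula the implementation it describes really uses.

```python
from typing import Iterable, Tuple

# Each link is modelled as (name, cid_bytes). This shape is an assumption
# made for the sketch, not a type taken from any implementation.
Link = Tuple[str, bytes]


def estimated_directory_size(links: Iterable[Link]) -> int:
    """Estimate a directory block size from its links only.

    Assumed formula: UTF-8 length of the name plus byte length of the CID,
    summed over all links. Protobuf framing overhead is ignored.
    """
    return sum(len(name.encode("utf-8")) + len(cid) for name, cid in links)


def should_use_hamt(links: Iterable[Link], threshold_bytes: int) -> bool:
    """Return True when the estimate is strictly above the threshold.

    Whether the comparison is strict is itself a detail a profile
    would need to state.
    """
    return estimated_directory_size(links) > threshold_bytes


def is_valid_fanout(fanout: int) -> bool:
    """Check the 'must be a power of 2' rule for the HAMT fanout."""
    return fanout > 0 and fanout & (fanout - 1) == 0
```

Two implementations that disagree on any of these choices (what is counted, strict versus non-strict comparison) would switch to `HAMTDirectory` at different directory sizes and so produce different root CIDs for the same input.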
So, each layout would have its own set of layout-params:
- balanced:
  - max-links: N
- trickle:
  - max-leaves-per-level: N
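
One way to picture the per-layout parameters suggested above is a small tagged structure in which each layout carries only its own settings. This is a minimal sketch; the class and field names (`BalancedLayout`, `TrickleLayout`, `max_links`, `max_leaves_per_level`) and the example values are illustrative assumptions, not names taken from the IPIP or from any implementation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class BalancedLayout:
    # Hypothetical field mirroring "max-links: N" from the comment above.
    max_links: int


@dataclass(frozen=True)
class TrickleLayout:
    # Hypothetical field mirroring "max-leaves-per-level: N".
    max_leaves_per_level: int


# A profile would carry exactly one of these, so a trickle-only setting
# can never be attached to a balanced layout by mistake.
Layout = Union[BalancedLayout, TrickleLayout]


def describe(layout: Layout) -> str:
    """Render a layout and its parameters as a short canonical string."""
    if isinstance(layout, BalancedLayout):
        return f"balanced(max-links={layout.max_links})"
    if isinstance(layout, TrickleLayout):
        return f"trickle(max-leaves-per-level={layout.max_leaves_per_level})"
    raise TypeError(f"unknown layout: {layout!r}")
```

For example, `describe(BalancedLayout(max_links=174))` gives `balanced(max-links=174)`. Keeping the parameters inside the layout type leaves room for future layouts that need more options than a single width.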
1. Whether empty directories are included in the DAG
   - Some implementations apply filtering before merkleizing filesystem entries in the DAG.
This is weird, because then we need to consider empty files, hidden files, unreadable files, symlinks and symlink follows, so probably need to mention all those as part of the profile too?
The profiles define a set of parameters that affect the resulting CID. These parameters are based on the UnixFS specification and are used to generate the CID for a given file tree. The parameters include:

1. CID version (currently only CIDv0 or CIDv1)
1. Hash function
This subtly means that the hash function is expected to be the same for all the nodes in the DAG in question. I'm not sure if that requirement is written down anywhere, so technically you can build unixfs DAGs with multiple hash functions (for fun, right?).
The profiles define a set of parameters that affect the resulting CID. These parameters are based on the UnixFS specification and are used to generate the CID for a given file tree. The parameters include:

1. CID version (currently only CIDv0 or CIDv1)
Does this matter? Or how much? Currently, in Merkledag-PB, links are the multihashes, so in principle, the cid version used (and the multibase) just decides the final presentation of the root CID. If profiles affect only unixfs, the codec is also fixed. If we have the same multihash, the same codec, the only thing that can change is the CID-encoding base if we have one.
So if we want a profile to dictate exactly the final string representation of the root CID, we need to list "multibase". And if not, if we are happy with the profile just producing equivalent CIDs (potentially in different bases), then CID version does not fully matter.
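
The distinction drawn here, same multihash but different final string, can be shown with a short sketch that encodes one sha2-256 multihash both as a CIDv0 string and as a base32 CIDv1 string with the dag-pb codec. The helper names are assumptions for illustration; the constants used (`0x12` for sha2-256, `0x20` for a 32-byte digest, `0x70` for dag-pb, `b` as the base32 multibase prefix) come from the multiformats tables.

```python
import base64
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58btc(data: bytes) -> str:
    """Plain base58btc encoding, as used for CIDv0 strings."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + out


def sha256_multihash(block: bytes) -> bytes:
    """Multihash = hash code (0x12) + digest length (0x20) + digest."""
    return bytes([0x12, 0x20]) + hashlib.sha256(block).digest()


def cid_v0(multihash: bytes) -> str:
    """CIDv0 is just the base58btc-encoded sha2-256 multihash."""
    return base58btc(multihash)


def cid_v1_dag_pb_base32(multihash: bytes) -> str:
    """CIDv1 = version (0x01) + codec (0x70, dag-pb) + multihash,
    shown here in lowercase unpadded base32 with the 'b' multibase prefix."""
    raw = bytes([0x01, 0x70]) + multihash
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


mh = sha256_multihash(b"example block bytes")
v0, v1 = cid_v0(mh), cid_v1_dag_pb_base32(mh)
```

Both strings wrap the same multihash, so they address the same block; only the presentation differs. This is why a profile that wants byte-identical root CID strings would also need to fix the multibase, while a profile that only wants equivalent CIDs would not.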
### Compatibility

UnixFS Data encoded with the profiles defined in this IPIP is fully compatible with existing implementations, as it is fully compliant with the UnixFS specification.
It cannot be compliant with details that are not specified as of today.
### Alternatives

As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle proofs needed to verify the CID.
Suggested change:
- As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle proofs needed to verify the CID.
+ As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle DAG nodes needed to verify the CID.
Currently, CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID.
This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. They can be used to verify data across implementations, provide recommended settings depending on retrieval performance goals, and more.
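
The effect of a single parameter on the root hash can be illustrated with a deliberately simplified model. The sketch below is not UnixFS or dag-pb encoding: it splits the input into fixed-size chunks, hashes each chunk, and hashes the concatenated chunk digests to obtain a root. The function name `toy_root_hash` and the chunk sizes are assumptions chosen for illustration.

```python
import hashlib


def toy_root_hash(data: bytes, chunk_size: int) -> str:
    """Toy two-level hash tree: hash each fixed-size chunk, then hash
    the concatenation of the chunk digests. Illustrative only."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Empty input is treated as a single empty chunk.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    leaf_digests = [hashlib.sha256(chunk).digest() for chunk in chunks]
    return hashlib.sha256(b"".join(leaf_digests)).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of sample input

root_256k = toy_root_hash(data, 256 * 1024)   # four chunks
root_1m = toy_root_hash(data, 1024 * 1024)    # one chunk
```

The same bytes produce two different roots because the chunk boundaries differ, while repeating the call with identical settings always reproduces the same root. A profile plays the role of the fixed `chunk_size` here, extended to every parameter that influences the DAG.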