Description
Discussed in #8081
Originally posted by wardi February 16, 2024
Uploaded files in CKAN are limited to at most one file per object, and they can only be attached to groups or resources.
The group or resource model stores a reference to the file in a plain text column that can be updated like any other metadata value. Resources can store the length, hash and format of an uploaded file, but these are ordinary metadata fields that users are free to update (or not), so they aren't durably linked to the file itself.
Uploaded files can leak, staying on the underlying storage and costing money even though there is no longer any way to reach them from the CKAN site.
There is no shared way to represent files that aren't yet attached to a group or resource, e.g.:
- large file uploads in progress with the ability to resume: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3/ManagedUpload.html
- files created before creating the dataset/resource metadata: Resource-first "Add Dataset" workflow (#6689)
- files that need to be checked for validity before being accepted: https://github.com/qld-gov-au/ckanext-validation
It's not possible to attach multiple files to a resource even when they represent the same data. This would be very useful for:
- sites like https://open.toronto.ca/catalogue/ where geospatial data is automatically converted into multiple formats and projections
- data that can be split across files like the Parquet format https://github.com/apache/parquet-format
- data generated by analyzing the resource: "Metadata resources: A special resource type for storing metadata" (#7856)
Model solution
Let's create a model for uploaded files in CKAN that can be linked to resources or groups or anything else that a site might need.
Files would have:
- owner type + id for permissions (e.g. resource, user, group, etc.)
- original file name
- file reference (specific to storage back end)
- total size in bytes
- format detected or determined from file name
- completion state (ranges received for background/parallel uploads)
- hash(es) (when supported by back end)
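To make the list above concrete, here is a rough sketch of what such a record could look like. It is not an existing CKAN model: the class name `FileRecord`, every field name, and the choice to track completion as merged half-open byte ranges are assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical file record; names are illustrative, not CKAN's."""

    owner_type: str      # e.g. "resource", "user", "group"
    owner_id: str
    original_name: str
    storage_ref: str     # opaque reference understood by the storage back end
    total_size: int      # total size in bytes
    file_format: str | None = None
    hashes: dict[str, str] = field(default_factory=dict)  # algorithm -> hex digest
    # Received byte ranges, stored as sorted, non-overlapping [start, end) pairs.
    received: list[tuple[int, int]] = field(default_factory=list)

    def add_range(self, start: int, end: int) -> None:
        """Record bytes [start, end) as received, merging touching ranges."""
        if not 0 <= start < end <= self.total_size:
            raise ValueError("range outside file bounds")
        merged: list[tuple[int, int]] = []
        for s, e in sorted(self.received + [(start, end)]):
            if merged and s <= merged[-1][1]:
                # Overlaps or touches the previous range: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        self.received = merged

    @property
    def is_complete(self) -> bool:
        """True once every byte of the file has been received."""
        return self.total_size == 0 or self.received == [(0, self.total_size)]


# Parts of a parallel upload may arrive out of order.
record = FileRecord("resource", "res-123", "data.csv", "uploads/abc", total_size=100)
record.add_range(50, 100)
record.add_range(0, 50)
```

Storing ranges rather than a single "bytes received" counter is one way to support both resumable and parallel uploads; a real implementation would also need to persist this state and handle concurrent writers.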
Other possibilities:
- name of back end (multiple back end support or for migrating files live)
- support for "files" that are actually links to externally managed resources so we can monitor changes to content based on hash/size when retrieved
- alternate links for redundancy when some services aren't available
- custom fields for permissions, tracking, validation reports or other plugin data
This model would make file metadata reliable, allow us to build new features and potentially save people money by better tracking hosted data in CKAN.
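As one illustration of the tracking benefit, the sketch below finds stored files whose owner no longer exists, assuming each file record carries an owner type and id as proposed. `FileRef`, `find_orphans` and the `live_owners` set are hypothetical names; how a site would gather the set of live owners is left out.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class FileRef(NamedTuple):
    """Hypothetical minimal view of a tracked file."""

    owner_type: str
    owner_id: str
    storage_ref: str
    size: int  # bytes


def find_orphans(
    files: Iterable[FileRef], live_owners: set[tuple[str, str]]
) -> list[FileRef]:
    """Return files whose (owner_type, owner_id) is not among the live owners."""
    return [f for f in files if (f.owner_type, f.owner_id) not in live_owners]


files = [
    FileRef("resource", "res-1", "uploads/a", 1_000),
    FileRef("resource", "res-2", "uploads/b", 5_000),
    FileRef("group", "grp-1", "uploads/c", 200),
]
# Assume res-2 has been deleted from the site.
live = {("resource", "res-1"), ("group", "grp-1")}

orphans = find_orphans(files, live)
wasted_bytes = sum(f.size for f in orphans)
```

A report like this could feed a cleanup job or simply show how much storage is held by unreachable files.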