@smotornyuk commented Jul 18, 2025

Phase 3 of #8920

This PR consolidates DB models and API actions into 4,000 lines of code.

Improvements to features introduced in phase 2

  • a config option to switch between the original, unchanged Uploader, which does not use the new files logic at all, and the new Uploader that relies on configurable storage. This option exists for extensions that extend and heavily override the original uploader and could break after even the slightest change in its logic.

API

I want to try a different layout of logic separation inside ckan.logic. As the new logic belongs to the "file" domain, instead of splitting actions between get, delete, and update, I created new files:

  • logic/action/file.py: file actions
  • logic/auth/file.py: file auth functions
  • logic/schema/file.py: file schemas

I added just the most generic actions, like create, delete, etc. Other operations, like resumable uploads and the ability to overwrite a file, may sound tempting, but I see a lot of problems and cannot come up with a good generic solution for them. For now, I recommend adding custom actions on top of the existing generic functionality.

There is also a "hidden" file_search action. It can be called normally; it's "hidden" only in the sense that I excluded it from the documentation. I'm not sure whether searching for files should be available via the API (for now, I locked it behind sysadmin permission). And since @amercader is working on new search specs, the syntax of the search parameters is also unstable. Even though I'm not going to index files via a search engine, using similar search parameters feels natural. I diverged slightly from the current specification; here are the rules:

Sorting is controlled by sort. It can be a plain string, sort: "name", which means sort by a single column ASC. Alternatively, it can be a list of strings, sort: ["name", "size"], to sort by multiple columns ASC simultaneously. Finally, it can be a list with nested 2-element lists, sort: [["name", "asc"], ["size", "desc"]], to control direction. The second and third styles can be mixed. There is no string processing, so "name desc" won't be split into a column and a direction; you must provide a list with two elements if you need more control.
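
Put together, the accepted shapes of sort look like this (plain Python literals, restating the rules above):

# single column, ASC
sort = "name"

# multiple columns, all ASC
sort = ["name", "size"]

# explicit direction per column; plain strings and 2-element lists can be mixed
sort = [["name", "asc"], ["size", "desc"]]
sort = ["name", ["size", "desc"]]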

Filters are similar to the specification from ckanext-search, but I decided to use the $ prefix for every operator, not only for and / or. In addition, $and / $or must be specified at the top level, which is similar to the way one builds queries in SQLAlchemy. For example, if you want to find a file with the name hello or world (and you do not want to use $in), you must use the first payload below, not the second:

# correct
filters = {
  "$or": [
    {"name": "hello"},
    {"name": "world"},
  ]
}

# INCORRECT
filters = {
  "name": {
    "$or": ["hello", "world"] # `$or` is not allowed as field-level operator
  }
}

Here's an example of a complex query that is still valid. Note that its top-level $or contains 2 items that essentially describe the same conditions. The second item is just shorter because it relies on implications (e.g. field: [1,2,3] implies field: {$in: [1,2,3]}):

{
    "$or": [
        {
            "$and": [
                {"name": {"$in": ["wsgi.py", "README.rst"]}},
                {"size": {"$gt": 3167, "$lt": 3200}},
                {"content_type": {"$eq": "text/plain"}}
            ]
        },
        {
            "name": ["wsgi.py", "README.rst"],
            "size": {"$gt": 3167, "$lt": 3200},
            "content_type": "text/plain"
        }
    ]
}

DB model

There are 3 new models: File, Owner, and OwnerTransferHistory. All of them are defined using the new declarative style of model definition from SQLAlchemy v2.0. It's different from imperative mapping and the classic declarative style, but it integrates well with the typing system, and it's worth switching to.
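
For anyone unfamiliar with the 2.0 declarative style, here is a minimal sketch (the base class and columns are illustrative, not the actual File model from this PR):

# illustrative only; the real File model has a different base class and columns
from datetime import datetime

from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class File(Base):
    __tablename__ = "files"

    # typed attributes instead of classic Column(...) declarations
    id: Mapped[str] = mapped_column(primary_key=True)
    name: Mapped[str]
    size: Mapped[int] = mapped_column(default=0)
    ctime: Mapped[datetime] = mapped_column(default=datetime.utcnow)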

File contains file details. I'm not sure if the atime and mtime (access and modification time) columns are useful, and I can drop them. The other columns are reasonable.

The Owner model is an extension of the file management logic that helps with permissions. The idea is that a file can be either unowned or owned by something that has a type (e.g. user) and an id (the user ID). When a file is accessed, we check the relationship between the owner of the file and the user who accesses it. Depending on this relationship (the user is the owner of the file; the file is owned by a dataset created by the user), CKAN decides what the user can do with the file.

OwnerTransferHistory is a hidden model. Whenever a file is moved from one owner to another, a new record appears in this table. In this way, we can find out who uploaded the file and who owned it before the current owner. This functionality could be covered by activities, but files are different from the public entities mentioned in activities, and I think keeping the ownership history separate is better. Anyway, this model is not promoted, and we can drop it in the future if it does not work in the real world.

CLI

A couple of new CLI commands under the ckan file ... route: move a file from one storage to another, stream the content of a file to STDOUT, report missing files, show statistics about used space, etc.

CKANConfig.subtree

A new method on the config object that returns options matching a given prefix and, optionally, transforms them into a nested dictionary. When you have the following config and want all options for sqlalchemy:

redis.a = 1
redis.b = 2

sqlalchemy.url = 1
sqlalchemy.nested.hello = 2
sqlalchemy.nested.world = 3

you can extract either of the following dictionaries:

# flat
{"url": 1, "hested.hello": 2, "nested.world": 3}

# nested
{
  "url": 1, 
  "hested": {
    "hello" : 2, 
    "world": 3,
  },
}

I'm using this method to collect storage configuration, but it can be used elsewhere, especially for features that share a common prefix for multiple config options.
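
For illustration, a call could look roughly like this (the exact signature and the nested toggle are assumptions, not the final API):

# hypothetical usage of CKANConfig.subtree; argument names are assumptions
from ckan.common import config

flat = config.subtree("sqlalchemy")                 # {"url": 1, "nested.hello": 2, ...}
nested = config.subtree("sqlalchemy", nested=True)  # {"url": 1, "nested": {"hello": 2, ...}}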

New config options

There are 3 of them; the most important is ckan.files.owner.cascade_access, which makes the Owner model useful. The documentation of this option explains how ownership works.

Raw storage adapters and CKAN storage adapters

File-keeper does not require CKAN, which means some features, like converting file content into a Flask response, must be provided by CKAN. Because of this, the original storage adapters defined inside file-keeper can be used in CKAN, but it's preferable to wrap them in a class that enables a few CKAN-specific operations. You can find such wrappers under ckan/lib/files/default/.

Additional adapters

I registered S3, Azure Blob, and GCS adapters in addition to Apache Libcloud. The latter supports a lot of different providers, but it gives access to a very limited set of operations, which is why I think it is worth exposing native adapters for the most popular storages. BTW, the file-keeper documentation explains how to emulate these storages locally.

There is one unsolved problem, though. These adapters rely on additional dependencies, such as boto3 or azure-storage-blob. These dependencies publish new versions very often, and those versions are backward compatible, so I don't want to pin them to fixed versions in the CKAN requirements. Moreover, these libraries are not needed unless you use the corresponding storage, so I don't want to install them by default at all.
And here's the problem: I don't want to add the packages to the requirements, because for most users they will not be used. But if somebody wants to use them, they have to find out how to install them. I'm sure that even if I document the required dependencies in big red letters, people won't read it and will wonder why the storage is not working.

Any ideas? Maybe, when the user configures a storage that is not available due to a missing driver, I should add the recommended pip install ... command to the exception text?
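
For what it's worth, a minimal sketch of that idea (the function and exception here are made up, not actual file-keeper code):

# a sketch of the "recommend the install command" idea; names are hypothetical
def make_s3_client(settings):
    try:
        import boto3  # optional dependency, intentionally not in CKAN requirements
    except ImportError:
        raise RuntimeError(
            "The S3 storage adapter requires the boto3 package. "
            "Install it with: pip install 'file-keeper[s3]'"
        ) from None
    return boto3.client("s3", **settings)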

Adapters for tests

I added a storage adapter that keeps files in memory for tests, and another adapter that just ignores all operations without raising an error (like /dev/null).
I could also register a Redis adapter, because the memory adapter does not share files between processes, so you cannot use it to test the CLI or background jobs. But this is a really rare scenario, so I suggest using the filesystem adapter + /tmp for it.

ContextCache

A cache stored inside the context, similar to context["package"], which holds the current package. But it stores items by their primary key, so it's safe to share this cache between actions as long as you invalidate items that were removed from the DB (which I do). I need it to reduce the number of DB requests.
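
Conceptually it boils down to something like this (an illustration of the idea only, not the actual ContextCache API):

# illustration only; the real ContextCache in this PR may expose a different API
class ContextCache:
    def __init__(self):
        self._items = {}

    def get(self, kind, key, compute):
        """Return the item stored under (kind, primary key), computing it once."""
        if (kind, key) not in self._items:
            self._items[(kind, key)] = compute()
        return self._items[(kind, key)]

    def invalidate(self, kind, key):
        """Drop an item after it was removed from the DB."""
        self._items.pop((kind, key), None)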

Permissions

This part looks intricate at first glance, but it's actually quite robust and simple. All API actions have auth functions. Every auth function checks whether the user has the permission that matches the nature of the operation:

  • update (can rename or transfer file to a different owner)
  • read (can read file metadata)
  • delete (can delete the file)

All permissions depend on the owner. If the file is owned by a package, the user has a permission (read/delete/update) for the file if they can do the same thing (read/delete/update) with the package itself. In this way, if a user can only read the dataset, they can download the file but cannot modify or delete it. If the user cannot see the dataset, they won't see the file either.
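
A rough sketch of that cascade for a file owned by a package (the auth function, helper, and attribute names are illustrative, not the exact ones from the PR):

# illustrative cascade; helper and attribute names are not the real ones
import ckan.plugins.toolkit as tk


def file_delete_auth(context, data_dict):
    # hypothetical helper that loads the Owner row for the requested file
    owner = _get_owner(context, data_dict["id"])

    if owner is None:
        # unowned files are left to sysadmins, who bypass auth functions anyway
        return {"success": False}

    if owner.owner_type == "package":
        # cascade: deleting the file requires update rights on the owning package
        try:
            tk.check_access("package_update", context, {"id": owner.owner_id})
            return {"success": True}
        except tk.NotAuthorized:
            return {"success": False}

    # owned directly by a user: only that user may delete it
    user = context.get("auth_user_obj")
    return {"success": user is not None and user.id == owner.owner_id}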

And if that doesn't work, apart from chaining the auth function, you can use an interface method that extends the permission checks. Using it, you can do anything crazy enough, like giving a user permission to download a file if the filename matches the user's name.

Blueprint

A new blueprint with a single route that can be used to download a file. The generic implementation just streams the content, so CKAN works as a proxy between the storage and the user. However, this point can be extended at the storage level. With ~20 lines of code you'll be able to redirect the user to a signed, temporary S3 link, and the user will download the file directly from the cloud, reducing the load on the CKAN app.
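
As a sketch of what such an extension could do (the method name and the way it plugs into the download route are not the real hooks; boto3 and flask are assumed to be available):

# a sketch of the "redirect to a signed S3 link" idea; hook names are illustrative
import boto3
from flask import redirect


class RedirectingS3Storage:
    def __init__(self, bucket):
        self.bucket = bucket
        self.client = boto3.client("s3")

    def download_response(self, key):
        url = self.client.generate_presigned_url(
            "get_object",
            Params={"Bucket": self.bucket, "Key": key},
            ExpiresIn=600,  # temporary link, valid for 10 minutes
        )
        # the user fetches the file straight from S3, bypassing the CKAN app
        return redirect(url)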

Maybe I'll add a PR that makes this possible out of the box in the future. Currently, the flow for signed/resumable/multipart uploads is not settled in file-keeper, and I'm trying to avoid unstable features.

@smotornyuk marked this pull request as ready for review August 29, 2025 21:23
Co-authored-by: Ian Ward <ian@excess.org>
smotornyuk commented Sep 2, 2025

@Zharktas, @EricSoroos just confirmed the behavior of the S3 adapter. If you install file-keeper with the S3 extra (I'm using an alpha while this PR is in review):

pip install 'file-keeper[s3]==0.1.0a4'

you can connect to S3 in a few ways. The first is explicit credentials, which is not recommended, as you mentioned on the call:

from file_keeper import make_storage
storage = make_storage(
  "test", {
    "type": "file_keeper:s3",
    "bucket": "file-keeper",
    "key": "***",
    "secret": "***", 
    # region, etc.
  }
)

The second option is to specify AWS_* envvars, and then you'll need only the bucket name:

$ AWS_ACCESS_KEY_ID=*** AWS_SECRET_ACCESS_KEY=*** python

>>> from file_keeper import make_storage
>>> storage = make_storage(
...   "test", {
...     "type": "file_keeper:s3",
...     "bucket": "file-keeper",
...   }
... );

And you can also use ~/.aws/credentials. Assuming you have a default profile like this:

[default]
aws_access_key_id = ***
aws_secret_access_key = ***
region=***

you again need only the bucket name to initialize the connection.

These examples use the raw file_keeper:s3 adapter, but the same thing will work for its CKAN version, ckan:s3, inside an initialized CKAN app.

Does it cover your concerns?

Update: thanks for the vital point; I added these examples to file-keeper's docs.


Another thought, cc @amercader and @wardi.

I think it's worth collecting such questions and building a FAQ. Is it better to create a GitHub discussion where threads describe the problem and solution, or would a wiki section suit this purpose better?
