@smotornyuk commented Jul 18, 2025

Phase 3 of #8920

This PR consolidates DB models and API actions into 4,000 lines of code.

Improvements to features introduced in phase 2

  • a config option to switch between the original, unchanged Uploader, which does not use the new files logic at all, and the new Uploader that relies on configurable storage. This option exists for extensions that extend and heavily override the original uploader and could break after even the slightest change in its logic.

API

I want to try a different layout of logic separation inside ckan.logic. As the new logic belongs to the "file" domain, instead of splitting actions between get, delete, and update, I created new files:

  • logic/action/file.py: file actions
  • logic/auth/file.py: file auth functions
  • logic/schema/file.py: file schemas

I added just the most generic actions, like create, delete, etc. Other operations, like resumable uploads and the ability to overwrite a file, may sound tempting, but I see a lot of problems and cannot come up with a good generic solution for them. For now, I recommend adding custom actions on top of the existing generic functionality.

There is also a "hidden" file_search action. It can be called normally; it's "hidden" only in the sense that I excluded it from the documentation. I'm not sure whether searching for files should be available via the API (for now, I locked it behind sysadmin permission). And since @amercader is working on new search specs, the syntax of the search parameters is also unstable. Even though I'm not going to index files via a search engine, using similar search parameters feels natural. I diverged slightly from the current specification; here are the rules:

Sorting is controlled by sort. It can be a plain string, sort: "name", which means sort by a single column ASC. Alternatively, it can be a list of strings, sort: ["name", "size"], to sort by multiple columns ASC simultaneously. Finally, it can be a list with nested 2-element lists, sort: [["name", "asc"], ["size", "desc"]], to control direction. The second and third styles can be mixed. There is no string processing, so "name desc" won't be split into a column and a direction; you must provide a list with two elements if you need more control.
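
Put together, the accepted shapes of sort look like this (plain Python literals, restating the rules above):

# single column, ASC
sort = "name"

# multiple columns, all ASC
sort = ["name", "size"]

# explicit direction per column; plain strings and 2-element lists can be mixed
sort = [["name", "asc"], ["size", "desc"]]
sort = ["name", ["size", "desc"]]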

Filters are similar to the specification from ckanext-search, but I decided to use the $ prefix for every operator, not only for and / or. In addition, $and / $or must be specified at the top level, which is similar to the way one builds queries in SQLAlchemy. For example, if you want to find a file with the name hello or world (and you do not want to use $in), you must use the first payload below, not the second:

# correct
filters = {
  "$or": [
    {"name": "hello"},
    {"name": "world"},
  ]
}

# INCORRECT
filters = {
  "name": {
    "$or": ["hello", "world"] # `$or` is not allowed as field-level operator
  }
}

Here's an example of a complex query that is still valid. Note that its top-level $or contains 2 items that essentially describe the same conditions. The second item is just shorter because it relies on implications (e.g. field: [1,2,3] implies field: {$in: [1,2,3]}):

{
    "$or": [
        {
            "$and": [
                {"name": {"$in": ["wsgi.py", "README.rst"]}},
                {"size": {"$gt": 3167, "$lt": 3200}},
                {"content_type": {"$eq": "text/plain"}}
            ]
        },
        {
            "name": ["wsgi.py", "README.rst"],
            "size": {"$gt": 3167, "$lt": 3200},
            "content_type": "text/plain"
        }
    ]
}

DB model

There are 3 new models: File, Owner, and OwnerTransferHistory. All of them are defined using the new declarative style of model definition from SQLAlchemy v2.0. It's different from imperative mapping and the classic declarative style, but it integrates well with the typing system, and it's worth switching to.
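
For anyone unfamiliar with the 2.0 declarative style, here is a minimal sketch (the base class and columns are illustrative, not the actual File model from this PR):

# illustrative only; the real File model has a different base class and columns
from datetime import datetime

from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class File(Base):
    __tablename__ = "files"

    # typed attributes instead of classic Column(...) declarations
    id: Mapped[str] = mapped_column(primary_key=True)
    name: Mapped[str]
    size: Mapped[int] = mapped_column(default=0)
    ctime: Mapped[datetime] = mapped_column(default=datetime.utcnow)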

File contains file details. I'm not sure if the atime and mtime (access and modification time) columns are useful, and I can drop them. The other columns are reasonable.

The Owner model is an extension of the file management logic that helps with permissions. The idea is that a file can be either unowned or owned by something that has a type (e.g. user) and an id (the user ID). When a file is accessed, we check the relationship between the owner of the file and the user who accesses it. Depending on this relationship (the user is the owner of the file; the file is owned by a dataset created by the user), CKAN decides what the user can do with the file.

OwnerTransferHistory is a hidden model. Whenever a file is moved from one owner to another, a new record appears in this table. In this way, we can find out who uploaded the file and who owned it before the current owner. This functionality could be covered by activities, but files are different from the public entities mentioned in activities, and I think keeping the ownership history separate is better. Anyway, this model is not promoted, and we can drop it in the future if it does not work in the real world.

CLI

A couple of new CLI commands under the ckan file ... route: move a file from one storage to another, stream the content of a file to STDOUT, report missing files, show statistics about used space, etc.

CKANConfig.subtree

A new method on the config object that returns options matching a given prefix and, optionally, transforms them into a nested dictionary. When you have the following config and want all options for sqlalchemy:

redis.a = 1
redis.b = 2

sqlalchemy.url = 1
sqlalchemy.nested.hello = 2
sqlalchemy.nested.world = 3

you can extract either of the following dictionaries:

# flat
{"url": 1, "hested.hello": 2, "nested.world": 3}

# nested
{
  "url": 1, 
  "hested": {
    "hello" : 2, 
    "world": 3,
  },
}

I'm using this method to collect storage configuration, but it can be used elsewhere, especially for features that share a common prefix for multiple config options.
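
For illustration, a call could look roughly like this (the exact signature and the nested toggle are assumptions, not the final API):

# hypothetical usage of CKANConfig.subtree; argument names are assumptions
from ckan.common import config

flat = config.subtree("sqlalchemy")                 # {"url": 1, "nested.hello": 2, ...}
nested = config.subtree("sqlalchemy", nested=True)  # {"url": 1, "nested": {"hello": 2, ...}}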

New config options

There are 3 of them; the most important is ckan.files.owner.cascade_access, which makes the Owner model useful. The documentation of this option explains how ownership works.

Raw storage adapters and CKAN storage adapters

File-keeper does not require CKAN, which means some features, like converting file content into a Flask response, must be provided by CKAN. Because of this, the original storage adapters defined inside file-keeper can be used in CKAN, but it's preferable to wrap them in a class that enables a few CKAN-specific operations. You can find such wrappers under ckan/lib/files/default/.

Additional adapters

I registered S3, Azure Blob, and GCS adapters in addition to Apache Libcloud. The latter supports a lot of different providers, but it gives access to a very limited set of operations, which is why I think it is worth exposing native adapters for the most popular storages. BTW, the file-keeper documentation explains how to emulate these storages locally.

There is one unsolved problem, though. These adapters rely on additional dependencies, such as boto3 or azure-storage-blob. These dependencies publish new versions very often, and those versions are backward compatible, so I don't want to pin them to fixed versions in the CKAN requirements. Moreover, these libraries are not needed unless you use the corresponding storage, so I don't want to install them by default at all.
And here's the problem: I don't want to add the packages to the requirements, because for most users they will not be used. But if somebody wants to use them, they have to find out how to install them. I'm sure that even if I document the required dependencies in big red letters, people won't read it and will wonder why the storage is not working.

Any ideas? Maybe, when the user configures a storage that is not available due to a missing driver, I should add the recommended pip install ... command to the exception text?
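
For what it's worth, a minimal sketch of that idea (the function and exception here are made up, not actual file-keeper code):

# a sketch of the "recommend the install command" idea; names are hypothetical
def make_s3_client(settings):
    try:
        import boto3  # optional dependency, intentionally not in CKAN requirements
    except ImportError:
        raise RuntimeError(
            "The S3 storage adapter requires the boto3 package. "
            "Install it with: pip install 'file-keeper[s3]'"
        ) from None
    return boto3.client("s3", **settings)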

Adapters for tests

I added a storage adapter that keeps files in memory for tests, and another adapter that just ignores all operations without raising an error (like /dev/null).
I could also register a Redis adapter, because the memory adapter does not share files between processes, so you cannot use it to test the CLI or background jobs. But this is a really rare scenario, so I suggest using the filesystem adapter + /tmp for it.

ContextCache

A cache stored inside the context, similar to context["package"], which holds the current package. But it stores items by their primary key, so it's safe to share this cache between actions as long as you invalidate items that were removed from the DB (which I do). I need it to reduce the number of DB requests.
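
Conceptually it boils down to something like this (an illustration of the idea only, not the actual ContextCache API):

# illustration only; the real ContextCache in this PR may expose a different API
class ContextCache:
    def __init__(self):
        self._items = {}

    def get(self, kind, key, compute):
        """Return the item stored under (kind, primary key), computing it once."""
        if (kind, key) not in self._items:
            self._items[(kind, key)] = compute()
        return self._items[(kind, key)]

    def invalidate(self, kind, key):
        """Drop an item after it was removed from the DB."""
        self._items.pop((kind, key), None)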

Permissions

This part looks intricate at first glance, but it's actually quite robust and simple. All API actions have auth functions. Every auth function checks whether the user has the permission that matches the nature of the operation:

  • update (can rename or transfer file to a different owner)
  • read (can read file metadata)
  • delete (can delete the file)

All permissions depend on the owner. If the file is owned by a package, the user has a permission (read/delete/update) for the file if they can do the same thing (read/delete/update) with the package itself. In this way, if a user can only read the dataset, they can download the file but cannot modify or delete it. If the user cannot see the dataset, they won't see the file either.
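
A rough sketch of that cascade for a file owned by a package (the auth function, helper, and attribute names are illustrative, not the exact ones from the PR):

# illustrative cascade; helper and attribute names are not the real ones
import ckan.plugins.toolkit as tk


def file_delete_auth(context, data_dict):
    # hypothetical helper that loads the Owner row for the requested file
    owner = _get_owner(context, data_dict["id"])

    if owner is None:
        # unowned files are left to sysadmins, who bypass auth functions anyway
        return {"success": False}

    if owner.owner_type == "package":
        # cascade: deleting the file requires update rights on the owning package
        try:
            tk.check_access("package_update", context, {"id": owner.owner_id})
            return {"success": True}
        except tk.NotAuthorized:
            return {"success": False}

    # owned directly by a user: only that user may delete it
    user = context.get("auth_user_obj")
    return {"success": user is not None and user.id == owner.owner_id}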

And if that doesn't work, apart from chaining the auth function, you can use an interface method that extends the permission checks. Using it, you can do anything crazy enough, like giving a user permission to download a file if the filename matches the user's name.

Blueprint

A new blueprint with a single route that can be used to download a file. The generic implementation just streams the content, so CKAN works as a proxy between the storage and the user. However, this point can be extended at the storage level. With ~20 lines of code you'll be able to redirect the user to a signed, temporary S3 link, and the user will download the file directly from the cloud, reducing the load on the CKAN app.
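
As a sketch of what such an extension could do (the method name and the way it plugs into the download route are not the real hooks; boto3 and flask are assumed to be available):

# a sketch of the "redirect to a signed S3 link" idea; hook names are illustrative
import boto3
from flask import redirect


class RedirectingS3Storage:
    def __init__(self, bucket):
        self.bucket = bucket
        self.client = boto3.client("s3")

    def download_response(self, key):
        url = self.client.generate_presigned_url(
            "get_object",
            Params={"Bucket": self.bucket, "Key": key},
            ExpiresIn=600,  # temporary link, valid for 10 minutes
        )
        # the user fetches the file straight from S3, bypassing the CKAN app
        return redirect(url)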

Maybe I'll add a PR that makes this possible out of the box in the future. Currently, the flow for signed/resumable/multipart uploads is not settled in file-keeper, and I'm trying to avoid unstable features.

@smotornyuk marked this pull request as ready for review August 29, 2025 21:23
Co-authored-by: Ian Ward <ian@excess.org>
smotornyuk commented Sep 2, 2025

@Zharktas, @EricSoroos just confirmed the behavior of the S3 adapter. If you install file-keeper with the S3 extra (I'm using an alpha while this PR is in review):

pip install 'file-keeper[s3]==0.1.0a4'

you can connect to S3 in a few ways. The first is explicit credentials, which is not recommended, as you mentioned on the call:

from file_keeper import make_storage
storage = make_storage(
  "test", {
    "type": "file_keeper:s3",
    "bucket": "file-keeper",
    "key": "***",
    "secret": "***", 
    # region, etc.
  }
)

The second option is to specify AWS_* envvars, and then you'll need only the bucket name:

$ AWS_ACCESS_KEY_ID=*** AWS_SECRET_ACCESS_KEY=*** python

>>> from file_keeper import make_storage
>>> storage = make_storage(
...   "test", {
...     "type": "file_keeper:s3",
...     "bucket": "file-keeper",
...   }
... );

And you can also use ~/.aws/credentials. Assuming you have a default profile like this:

[default]
aws_access_key_id = ***
aws_secret_access_key = ***
region=***

you again need only the bucket name to initialize the connection.

These examples use the raw file_keeper:s3 adapter, but the same thing will work for its CKAN version, ckan:s3, inside an initialized CKAN app.

Does it cover your concerns?

Update: thanks for the vital point; I added these examples to file-keeper's docs.


Another thought, cc @amercader and @wardi.

I think it's worth collecting such questions and building a FAQ. Is it better to create a GitHub discussion where threads describe the problem and solution, or would a wiki section suit this purpose better?
