-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Files: phase 3 #9026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Files: phase 3 #9026
Conversation
Co-authored-by: Ian Ward <ian@excess.org>
@Zharktas , @EricSoroos just confirmed behavior of S3 adapter. If you install file-keeper with S3 extra (I'm using alpha while this PR is in review): pip install 'file-keeper[s3]==0.1.0a4' you can connect to S3 using either explicit credentials, which is not recommended, as you mentioned on the call: from file_keeper import make_storage
storage = make_storage(
"test", {
"type": "file_keeper:s3",
"bucket": "file-keeper",
"key": "***",
"secret": "***",
# region, etc.
}
) The second option is to specify $ AWS_ACCESS_KEY_ID=*** AWS_SECRET_ACCESS_KEY=*** python
>>> from file_keeper import make_storage
>>> storage = make_storage(
... "test", {
... "type": "file_keeper:s3",
... "bucket": "file-keeper",
... }
... ); And you can also use [default]
aws_access_key_id = ***
aws_secret_access_key = ***
region=*** You can also use just the bucket name to initialize the connection. These examples are using the raw Does it cover your concerns? Updated: and thanks for the vital point, I added these examples to file-keeper's docs Another thought. Include @amercader and @wardi . I think it's worth collecting such questions and building a FAQ. Is it better to create a github discussion where threads will describe the problem and solution, or wiki section better suits this purpose? |
Phase 3 of #8920
This PR consolidates DB models and API actions into 4,000 lines of code.
Improvements of features introduced in phase 2
API
I want to try a different layout of logic separation inside
ckan.logic
. As new logic belongs to the "file" domain, instead of separating actions betweenget
,delete
, andupdate
, I created new files:logic/action/file.py
: file actionslogic/auth/file.py
: file auth functionslogic/schema/file.py
: file schemasI added just the most generic actions, like create, delete, etc. Other operations, like resumable uploads and the ability to override a file, may sound tempting, but I see a lot of problems and cannot make a good generic solution for them. For now, I recommend adding custom actions over the existing generic functionality.
And there is a "hidden"
file_search
action. It can be called normally, I just excluded it from documentation, that's why it's "hidden". I'm not sure if searching for files should be available via API(for now, I locked it behind sysadmin permission). And as @amercader is working on new search specs, the syntax of search parameters is also unstable. Even though I'm not going to index the file via a search engine, using similar search parameters is natural. I diverged slightly from the current specification, and here are the rules:Soring is controlled by
sort
. It can be either stringsort: "name"
, which means sort by a single column ASC. Alternatively, it can be a list of strings,sort: ["name", "size"]
to sort by multiple columns ASC simultaneously. Finally, it can be listed with nested 2-element lists,sort: [["name", "asc"], ["size", "desc"]]
, to control direction. The second and third styles can be mixed. There is no string processing and"name desc"
won't be split into column and direction - you must provide list with two elements if you need more control.Filters are similar to specification from
ckanext-search
, but I decided to use$
prefix for every operator, not only forand
/or
. In addition,$and
/$or
, must be specified on the top level, which is similar to the way one builds queries in SQLAlchemy. For example, if you want to find a file with the namehello
orworld
(and you do not want to use$in
), you must use the first payload, not the secondHere's an example of a complex query that is still valid. Note, it's top-level
$or
contains 2 items that are essentially describing the same conditions. The second item just shorter because it relies on implications(e.g.field: [1,2,3]
impliesfiled: {$in: [1,2,3]}
:DB model
There are 3 new models: File, Owner, and OwnerTransferHistory. All of them are defined using a new declarative style of model definition from SQLAlchemy v2.0. It's different from the imperative mapping and classic declarative style, but it integrates well into the typing system, and it's worth switching to it.
File
contains file details. I'm not sure ifatime
andmtime
(access and modification time) columns are useful, and can drop them. Other columns are reasonable.Owner
model is an extension for file management logic that helps with permissions. The idea is that a file can be either unowned or owned by something that has type (user) and id(user ID). When a file is accessed, we check the relationship between the owner of the file and the user who accesses the file. And depending on this relationship(the user itself is an owner of the file; the file is owned by the dataset that is created by the user), CKAN decides what the user can do with the file.OwnerTransferHistory
is a hidden model. Whenever a file is moved from one owner to another, a new record appears in this table. In this way, we can find out who uploaded the file and who owned it before the current owner. This functionality can be covered by activities, but files are different from public entities that are mentioned inside activities, and I think that keeping ownership history separately is better. Anyway, this model is not promoted and we can drop it in the future if it does not work in the real world.CLI
A couple of new CLI commands under
ckan file ...
route. Move file from one storage to another, stream the content of the file to STDOUT, report missing files, show statistics about used space, etc.CKANConfig.subtree
New method in the config object that returns options that match given prefix and, optionally, transforms these options into nested dictionary. When you have following config and want all options for sqlalchemy:
you can extract either of the following dictionaries:
I'm using this method to collect storage configuration, but it can be used elsewhere, especially for features that share a common prefix for multiple config options.
New config options
There are 3 of them, the most important is
ckan.files.owner.cascade_access
, which makes theOwner
model useful. Documentation of this option explains how ownership works.raw storage adapters and CKAN-storage adapters
File-keeper does not require CKAN, which means some features, like converting file content into a Flask response, must be provided by CKAN. Because of it, original storage adapters defined inside file-keeper can be used in CKAN, but it's preferable to wrap them into a class that enables a few CKAN-specific operations. You can find such wrappers under
ckan/lib/files/default/
.Additional adapters
I registered S3, Azure Blob, and GCS adapters in addition to Apache Libcloud. The latter supports a lot of different providers, but it gives access to a very limited set of operations, which is why I think it is worth exposing native adapters for the most popular storage. BTW, file-keeper documentation explains how to emulate these storages locally.
There is one unsolved problem, though. Adapters rely on additional dependencies, such as
boto3
orazure-storage-blob
. These dependencies publish new versions very often, and these versions are backward compatible. Because of this, I don't want to add a fixed version of these libraries to CKAN requirements. And even more, these libraries are not required if you are not going to use the corresponding storage, so I don't want to install them at all by default.And here's the problem. I don't want to add packages to requirements, because they will not be used. But if somebody wants to use them, they must find out how the package must be installed. I'm sure that even if I write with big red letters the required dependencies, people won't read it and wonder why the storage is not working.
Any ideas? Maybe, when the user configures storage that is not available due to the missing driver, I should add the recommended
pip install ...
command to the exception text?Adapters for tests
I added a storage adapter that keeps files in memory for tests, and another adapter that just ignores all operations without raising an error(like
/dev/null
).I can also register a Redis adapter, because the memory adapter does not share files between processes, thus you cannot use it to test CLI or background jobs. But this is a really rare scenario, so I suggest using the filesystem adapter +
/tmp
for it.ContextCache
Cache is stored inside the context. Similar to
context["package"]
that holds current package. But it stores items with their primary key, so it's safe to share this cache between actions as long as you invalidate items that were removed from DB(which I do). I need it to reduce the number of DB requests.Permissions
This part looks intricate on first glance, but it's actually quite robust and simple. All API actions have auth functions. Every auth function checks if the user has permission, depending on the nature of the operation:
All permissions depend on the owner. If the file is owned by the package, the user has permission(read/delete/update) for the file if they can do the same thing(read/delete/update) with the package itself. In this way, if a user can only read the dataset, they can download the file, but cannot modify or delete it. If the user cannot see the dataset, they won't see the file as well.
And, if it doesn't work, apart from chaining the auth function, you can use an interface method that extends permission checks. Using it, you can do anything crazy enough, like give a user permission to download the file if the filename matches the user's name.
Blueprint
A new blueprint with a single route that can be used to download the file. Generic implementation just streams the content, so CKAN works as a proxy between storage and the user. However, this point can be extended on the storage level. Via ~20 lines of code you'll be able to redirect the user to a signed temporal S3 link. And user will download file directly from the cloud, reducing load on CKAN app.
Maybe I'll add a PR that makes this possible out of the box in the future. Currently, the flow for signed/resumable/multipart uploads is not settled in file-keeper, and I'm trying to avoid unstable features.