Resource-first "Add Dataset" workflow #6689
-
In the implementation I'm currently working on, the client did extensive user research and surveyed several CKAN portals. One of their main findings is that the default CKAN "Add Dataset" workflow is not ideal, especially when there is a lot of package metadata (customized through scheming). They would like to upload the canonical resource first (they mostly have only one resource per package) and then populate the package metadata and the data dictionary up front. Because the data dictionary is inferred asynchronously by datapusher/xloader (with messytables' not-so-bulletproof type inference), this is not possible with the current workflow. Are there any other CKAN installations that have implemented a resource-first "Add Dataset" workflow?
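For context on the asynchronous step mentioned above: building a data dictionary boils down to guessing a type per column from sample values. Here is a toy sketch of that kind of inference, purely illustrative and much simpler than what messytables actually does (`infer_column_types` is a hypothetical helper, not a CKAN API):

```python
import csv
import io


def _is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_float(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_column_types(csv_text, sample_size=100):
    """Guess a simple type (integer, numeric or text) per CSV column,
    similar in spirit to the async datapusher/xloader inference step."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:sample_size + 1]
    types = []
    for col, name in enumerate(header):
        values = [r[col] for r in body if r[col] != ""]
        if not values:
            types.append((name, "text"))  # nothing to go on: fall back to text
        elif all(_is_int(v) for v in values):
            types.append((name, "integer"))
        elif all(_is_float(v) for v in values):
            types.append((name, "numeric"))
        else:
            types.append((name, "text"))
    return types
```

For example, `infer_column_types("id,score,label\n1,2.5,a\n2,3.0,b")` yields `integer`, `numeric` and `text` for the three columns. A resource-first workflow would let users review and correct exactly this kind of guess before the metadata step.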
-
Some initial thoughts on approaches:

1. **Separate application for resource uploads first.** We discussed this briefly at the dev call today and @amercader mentioned some related work that serves resource editing pages through a separate JS application that supports multiple uploads. The same approach would be possible here, but with a separate application that comes before the dataset creation page. This separate application would call `package_create` to create a draft dataset with `id` and `name` set. Then reorder the dataset editing process on the CKAN side to allow editing metadata after customizing the data dictionary, and publish the dataset by removing the draft setting.

2. **New Flask view for resource uploads first.** Do everything listed in approach 1 as part of a CKAN extension instead of as a separate application, keeping the code together with the CKAN dataset editing workflow changes.

3. **Pre-create dataset, monitor xloader/datapusher progress.** When the user clicks to create a new dataset, after resource creation send them to a page that shows the progress of the upload and of xloader/datapusher for this resource. Then reorder the dataset editing process on the CKAN side to allow editing metadata after customizing the data dictionary, and publish the dataset by removing the draft setting.

4. **As above, but with a new background job.** Do everything in approach 3, but replace/extend datapusher/xloader using a tool like https://github.com/jqnatividad/qsv to analyze the columns before loading them into the datastore.
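The progress page in approaches 3 and 4 would need to poll the load job until it settles. A minimal, generic polling sketch; `get_status` here stands in for a call to the real datapusher/xloader status endpoint, whose actual response shape may differ:

```python
import time


def wait_for_load(get_status, timeout=300, poll_interval=2):
    """Poll a datapusher/xloader-style job until it finishes.

    `get_status` is any callable returning a status string such as
    "pending", "running", "complete" or "error". Returns the terminal
    status, or raises TimeoutError if the job never settles.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("complete", "error"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("load job did not finish in time")
```

In a real progress page this loop would live client-side (JS polling an endpoint), but the shape is the same: poll, show intermediate states, and only unlock the data dictionary editor once the job reports `complete`.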
-
See #6869 for a follow-up discussion.