Skip to content

package_update: faster dataset metadata update #8421

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Oct 23, 2024

Conversation

wardi
Copy link
Contributor

@wardi wardi commented Sep 5, 2024

Fixes #8407

Proposed fixes:

  • faster dataset metadata update with IDatasetForm.resource_validation_dependencies for declaring dataset fields that may affect resource validation (default: none)
  • allow sysadmin to set metadata_modified value on datasets - useful for harvesting or mirroring ckan sites
  • only update metadata_modified when dataset or resource metadata has changed
  • remove allow_partial_update context parameter and instead allow normal API users to:
    • update dataset metadata without passing resources
    • update group/org metadata without passing users, datasets and groups
100-resource dataset action old new improvement
package_patch (dataset field) .729 .381 90%
package_patch (no change after validation) .537 .046 1000%

Features:

  • includes tests covering changes
  • includes updated documentation
  • includes user-visible changes
  • includes API changes
  • includes bugfix for possible backport

@wardi wardi marked this pull request as ready for review September 11, 2024 23:16
@wardi
Copy link
Contributor Author

wardi commented Sep 11, 2024

This just needs a few more tests

if not dataset_changed and original_package.get('resources'
) == data_dict.get('resources'):
# flag no change for caller
context['package_update_unchanged'] = 'before validation'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in love with this part but it's the simplest possible way I could think to let the activity extension skip creating useless activity entries. I don't like that this is a new Context value that's a really specific case not a general solution.

Alternatives I considered:

  • Create a generic way to return extra information in the API envelope via the context that could be used by other actions. This would be useful for API users so they would know if an update actually occurred after any of the package or resource updating actions.
  • Create a new interface for the activity plugin that could be passed both the original_package and updated data_dict from package_update so it doesn't have to call package_show itself or and the generic signal handling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are getting rid of an ugly context var and introducing an ugly context var so it evens out :)

Create a generic way to return extra information in the API envelope via the context that could be used by other actions.

You mean something like context["extra_data"]["package_update_unchanged"] = "before validation"?
This looks like encouraging sharing info in the context, so perhaps I prefer a specific key that is clear where it comes from. Unless I misunderstood?

This would be useful for API users so they would know if an update actually occurred after any of the package or resource updating actions.

Not sure what you mean here. You mean an update happening e.g. in a chained action?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amercader I'm thinking something like context['updated_entities']['package'] = [id] and then expose that updated_entities in the API response envelope or something so that API users can know when something has been modified as a result of the action they call. Wouldn't need to implement it for everything right away but it would naturally extend to resources, groups, orgs, users etc. and bulk actions that may update many things at once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed at meeting I'll implement this as context['changed_entities'] to potentially encompass creation and deletion in the future, add this value to the response envelope for API calls and document it in the changelog.

@wardi
Copy link
Contributor Author

wardi commented Sep 19, 2024

Performance measurements added to description

if not dataset_changed and original_package.get('resources'
) == data_dict.get('resources'):
# flag no change for caller
context['package_update_unchanged'] = 'before validation'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are getting rid of an ugly context var and introducing an ugly context var so it evens out :)

Create a generic way to return extra information in the API envelope via the context that could be used by other actions.

You mean something like context["extra_data"]["package_update_unchanged"] = "before validation"?
This looks like encouraging sharing info in the context, so perhaps I prefer a specific key that is clear where it comes from. Unless I misunderstood?

This would be useful for API users so they would know if an update actually occurred after any of the package or resource updating actions.

Not sure what you mean here. You mean an update happening e.g. in a chained action?

@amercader amercader merged commit 7c5c987 into master Oct 23, 2024
9 checks passed
@amercader amercader deleted the 8407-package-update-fasterer branch October 23, 2024 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Faster dataset metadata updates with package_update
2 participants