Skip to content

simpler package indexing, no system plugins #8396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Mar 21, 2025

Conversation

wardi
Copy link
Contributor

@wardi wardi commented Aug 13, 2024

Fixes #8395

Proposed fixes:

  • remove SynchronousSearchPlugin
  • add indexing code to logic layer
  • remove exception-swallowing in IDomainObjectModification notify code

Features:

  • includes tests covering changes
  • includes updated documentation
  • includes user-visible changes
  • includes API changes
  • includes bugfix for possible backport

@wardi wardi marked this pull request as ready for review August 20, 2024 04:23
@wardi
Copy link
Contributor Author

wardi commented Dec 18, 2024

@Zharktas @pdelboca merged and ready for review again

@@ -65,7 +65,7 @@ def load_environment(conf: Union[Config, CKANConfig]):
warnings.filterwarnings('ignore', msg, sqlalchemy.exc.SAWarning)

# load all CKAN plugins
p.load_all()
p.load_all(force_update=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is force_update required here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_environment is used in tests which sometimes need to clear all the plugins by having an empty plugin list

Comment on lines 143 to 146
def load(
*plugins: str
*plugins: str,
force_update: bool = False,
) -> Plugin | list[Plugin]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to document here what force_update is and why we needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added docstrings to load and load_all

Copy link
Member

@pdelboca pdelboca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the PR @wardi ! I think it makes the codebase way more explicit and easier to follow when working on indexing.

It confuses me a little bit the both parameter, I think it would be nice to document it and I also left some comments on some lines that I think could use a little bit more of explanation about what's going on.

@wardi
Copy link
Contributor Author

wardi commented Dec 31, 2024

@pdelboca thanks, how does it look now?

@amercader amercader added this to the Forgotten favorites milestone Jan 31, 2025
@amercader
Copy link
Member

Had a look at this as it directly relates to the work I'm doing on rethinking the search.

This PR looks great but I find the both option and including the validated data in with_custom_schema a bit confusing, and that it complicates the actions code. If the indexing code is the only one that requires both versions of the dict, could we perhaps make two package_show calls in index_update_package(), one with use_default_schema=True and one with use_default_schema=False (both)? Or alternatively validate the dict directly like the old code used to do

@amercader
Copy link
Member

For the current version we need to keep things as they are but see #8444 (comment) for some thoughts on data_dict/validated_data_dict going forward

@wardi
Copy link
Contributor Author

wardi commented Mar 15, 2025

@amercader it's a significant performance hit on datasets with many resources to do all the package_show work twice, that's why I went with slightly uncomfortable "both" option.

@amercader
Copy link
Member

@wardi I understand how running package_show twice may affect performance, but if we only focus on validating the data dict, that needs to happen regardless before indexing, right? So if we call package_show once with use_default_schema=True and use_cache=False, and then on index_update_package() we validate it with

 schema = package_plugin.show_package_schema()
 validated_pkg_dict, _errors = lib_plugins.plugin_validate(...)

Wouldn't that take roughly the same time?

My concern other than the added complexity in package_show (mostly, but also in package_create and package_update) is that this is exposed to the public API, so any user can pass use_default_schema=both on a normal package_show call and will get a dict with:

{
    "id": "xxxx",
    "name" "test-dataset",
    # ...
    "validated_data_dict": "<JSON dump of the validated_data_dict>",
    "with_custom_schema": {
       # A copy of the same dict
    }
}

The with_custom_schema is expected, but the validated_data_dict comes from using the cached data_dict, which surprise, surprise also contains a validated_data_dict 😅

I know context vars are frown upon but this seems like a reasonable thing to hide from the public API. And perhaps it would be the perfect time to introduce an explicit for_indexing key as we discussed.

@wardi
Copy link
Contributor Author

wardi commented Mar 17, 2025

@wardi I understand how running package_show twice may affect performance, but if we only focus on validating the data dict, that needs to happen regardless before indexing, right? So if we call package_show once with use_default_schema=True and use_cache=False, and then on index_update_package() we validate it with

 schema = package_plugin.show_package_schema()
 validated_pkg_dict, _errors = lib_plugins.plugin_validate(...)

Wouldn't that take roughly the same time?

Validating first with the default schema and then with a custom schema isn't the same as validating only with the custom schema, so I couldn't use that approach.

If package_show had a way of returning the raw "dictized" version of the package then we could do something like this.

My concern other than the added complexity in package_show (mostly, but also in package_create and package_update) is that this is exposed to the public API, so any user can pass use_default_schema=both on a normal package_show call and will get a dict with:

{
    "id": "xxxx",
    "name" "test-dataset",
    # ...
    "validated_data_dict": "<JSON dump of the validated_data_dict>",
    "with_custom_schema": {
       # A copy of the same dict
    }
}

The with_custom_schema is expected, but the validated_data_dict comes from using the cached data_dict, which surprise, surprise also contains a validated_data_dict 😅

That's a bug for sure :-)

I know context vars are frown upon but this seems like a reasonable thing to hide from the public API. And perhaps it would be the perfect time to introduce an explicit for_indexing key as we discussed.

If we go with a for_indexing option let's make it a normal parameter. There's nothing that is returned that is internal or tied to access from a plugin so why would we want to prevent someone from e.g. building an external service that indexes datasets.

If we decide to drop the default schema version of data from the search index this all gets much simpler. Do you know if the primary consumer of this data version is the harvest extension or anything else?

@wardi
Copy link
Contributor Author

wardi commented Mar 20, 2025

@amercader use_default_schema=both has been removed

@amercader
Copy link
Member

Thanks @wardi , I think this is a massive improvement and lays the foundation for a newer and better search

@amercader amercader merged commit 1cb7fc0 into master Mar 21, 2025
9 checks passed
@amercader amercader deleted the 8395-simpler-package-index branch March 21, 2025 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

simplify package search index
4 participants