simpler package indexing, no system plugins #8396

wardi · 2024-08-13T21:47:46Z

Fixes #8395

Proposed fixes:

remove SynchronousSearchPlugin
add indexing code to logic layer
remove exception-swallowing in IDomainObjectModification notify code

Features:

includes tests covering changes
includes updated documentation
includes user-visible changes
includes API changes
includes bugfix for possible backport

wardi · 2024-12-18T21:18:18Z

@Zharktas @pdelboca merged and ready for review again

pdelboca · 2024-12-31T09:55:14Z

ckan/config/environment.py

@@ -65,7 +65,7 @@ def load_environment(conf: Union[Config, CKANConfig]):
        warnings.filterwarnings('ignore', msg, sqlalchemy.exc.SAWarning)

    # load all CKAN plugins
-    p.load_all()
+    p.load_all(force_update=True)


Why is force_update required here?

load_environment is used in tests which sometimes need to clear all the plugins by having an empty plugin list
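
A minimal sketch of the situation described in this reply, using a toy registry rather than CKAN's plugin code. The `ToyRegistry` class, its `refresh` hook, and the skip-when-nothing-loaded behaviour are all assumptions made for illustration. The only details taken from the thread are that `load_all` accepts `force_update` and that tests may configure an empty plugin list.

```python
class ToyRegistry:
    """Hypothetical plugin registry; not CKAN's implementation."""

    def __init__(self):
        self.loaded = []           # plugins loaded by the last load_all call
        self.active_snapshot = []  # what the rest of the application sees

    def refresh(self):
        # Publish the currently loaded plugins to the rest of the application.
        self.active_snapshot = list(self.loaded)

    def load_all(self, configured, force_update=False):
        # Drop everything, then load whatever the configuration lists.
        self.loaded = list(configured)
        # Assumed optimisation: skip the refresh when nothing was loaded,
        # unless the caller explicitly forces it.
        if configured or force_update:
            self.refresh()


registry = ToyRegistry()
registry.load_all(["example_plugin"])

registry.load_all([])                     # empty list: refresh is skipped
stale = list(registry.active_snapshot)    # still shows the old plugin

registry.load_all([], force_update=True)  # refresh runs despite the empty list
fresh = list(registry.active_snapshot)
```

In this toy model, a test that switches to an empty plugin list would keep seeing the previously loaded plugin unless the update is forced.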

pdelboca · 2024-12-31T10:06:52Z

ckan/plugins/core.py

 def load(
-        *plugins: str
+        *plugins: str,
+        force_update: bool = False,
 ) -> Plugin | list[Plugin]:


It would be nice to document here what force_update is and why we need it.

added docstrings to load and load_all

ckan/logic/action/create.py

ckan/lib/search/__init__.py

pdelboca

I like the PR @wardi ! I think it makes the codebase way more explicit and easier to follow when working on indexing.

The both parameter confuses me a little bit; I think it would be nice to document it. I also left comments on some lines that I think could use a bit more explanation of what's going on.

wardi · 2024-12-31T16:26:10Z

@pdelboca thanks, how does it look now?

amercader · 2025-03-14T13:23:57Z

Had a look at this as it directly relates to the work I'm doing on rethinking the search.

This PR looks great, but I find the both option and including the validated data in with_custom_schema a bit confusing, and it complicates the actions code. If the indexing code is the only one that requires both versions of the dict, could we perhaps make two package_show calls in index_update_package(), one with use_default_schema=True and one with use_default_schema=False (both)? Or alternatively validate the dict directly like the old code used to do.

amercader · 2025-03-14T14:16:09Z

For the current version we need to keep things as they are but see #8444 (comment) for some thoughts on data_dict/validated_data_dict going forward

wardi · 2025-03-15T17:39:03Z

@amercader it's a significant performance hit on datasets with many resources to do all the package_show work twice; that's why I went with the slightly uncomfortable "both" option.

amercader · 2025-03-17T13:30:42Z

@wardi I understand how running package_show twice may affect performance, but if we only focus on validating the data dict, that needs to happen regardless before indexing, right? So if we call package_show once with use_default_schema=True and use_cache=False, and then on index_update_package() we validate it with

 schema = package_plugin.show_package_schema()
 validated_pkg_dict, _errors = lib_plugins.plugin_validate(...)

Wouldn't that take roughly the same time?

My concern other than the added complexity in package_show (mostly, but also in package_create and package_update) is that this is exposed to the public API, so any user can pass use_default_schema=both on a normal package_show call and will get a dict with:

{
    "id": "xxxx",
    "name": "test-dataset",
    # ...
    "validated_data_dict": "<JSON dump of the validated_data_dict>",
    "with_custom_schema": {
       # A copy of the same dict
    }
}

The with_custom_schema is expected, but the validated_data_dict comes from using the cached data_dict, which, surprise, surprise, also contains a validated_data_dict 😅

I know context vars are frowned upon but this seems like a reasonable thing to hide from the public API. And perhaps it would be the perfect time to introduce an explicit for_indexing key as we discussed.

wardi · 2025-03-17T18:44:05Z

> @wardi I understand how running package_show twice may affect performance, but if we only focus on validating the data dict, that needs to happen regardless before indexing, right? So if we call package_show once with use_default_schema=True and use_cache=False, and then on index_update_package() we validate it with
>
>  schema = package_plugin.show_package_schema()
>  validated_pkg_dict, _errors = lib_plugins.plugin_validate(...)
>
> Wouldn't that take roughly the same time?

Validating first with the default schema and then with a custom schema isn't the same as validating only with the custom schema, so I couldn't use that approach.

If package_show had a way of returning the raw "dictized" version of the package then we could do something like this.
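
To illustrate why chaining the two validations differs from validating only with the custom schema, here is a minimal sketch with toy schemas. Everything in it is hypothetical (`dictize_package`, `validate`, `show_both`, and both schema dicts); it is not CKAN's validation code, only a model in which a schema drops keys it does not know about. It also sketches the alternative mentioned above: run the expensive dictize step once, then validate separate copies of the raw dict against each schema.

```python
import copy

# Toy schemas mapping each allowed key to a converter. Keys missing from a
# schema are dropped; this is a simplifying assumption for the sketch only.
DEFAULT_SCHEMA = {"id": str, "name": str}
CUSTOM_SCHEMA = {"id": str, "name": str, "spatial": str.upper}

DICTIZE_CALLS = 0


def dictize_package(pkg_id):
    """Stand-in for the expensive step (hypothetical helper)."""
    global DICTIZE_CALLS
    DICTIZE_CALLS += 1
    return {"id": pkg_id, "name": "test-dataset", "spatial": "bbox"}


def validate(data, schema):
    """Return a new dict with only the schema's keys, each converted."""
    return {key: conv(data[key]) for key, conv in schema.items() if key in data}


def show_both(pkg_id):
    """Dictize once, then validate independent copies with each schema."""
    raw = dictize_package(pkg_id)
    default = validate(copy.deepcopy(raw), DEFAULT_SCHEMA)
    custom = validate(copy.deepcopy(raw), CUSTOM_SCHEMA)
    return default, custom


default, custom = show_both("xxxx")

# Chaining: feed the default-schema output into the custom schema.
chained = validate(default, CUSTOM_SCHEMA)

print(custom)
print(chained)
```

In this toy model the chained result has lost the `spatial` key, because the default schema discarded it before the custom schema could see it, while `show_both` still only dictizes once.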

> My concern other than the added complexity in package_show (mostly, but also in package_create and package_update) is that this is exposed to the public API, so any user can pass use_default_schema=both on a normal package_show call and will get a dict with:
>
> {
>     "id": "xxxx",
>     "name": "test-dataset",
>     # ...
>     "validated_data_dict": "<JSON dump of the validated_data_dict>",
>     "with_custom_schema": {
>        # A copy of the same dict
>     }
> }
>
> The with_custom_schema is expected, but the validated_data_dict comes from using the cached data_dict, which, surprise, surprise, also contains a validated_data_dict 😅

That's a bug for sure :-)

> I know context vars are frowned upon but this seems like a reasonable thing to hide from the public API. And perhaps it would be the perfect time to introduce an explicit for_indexing key as we discussed.

If we go with a for_indexing option, let's make it a normal parameter. Nothing that is returned is internal or tied to access from a plugin, so why would we want to prevent someone from e.g. building an external service that indexes datasets?

If we decide to drop the default schema version of data from the search index this all gets much simpler. Do you know if the primary consumer of this data version is the harvest extension or anything else?

wardi · 2025-03-20T21:43:54Z

@amercader use_default_schema=both has been removed

amercader · 2025-03-21T09:41:43Z

Thanks @wardi , I think this is a massive improvement and lays the foundation for a newer and better search

wardi added 10 commits August 13, 2024 17:44

[#8395] simpler package indexing, no system plugins

17f5c3b

[#8395] index modifies parameter, make copy

2a3e90f

[#8395] reindex on delete/purge group/org

df8a029

[#8395] package_show use_default_schema=both

cb48593

[#8395] resource_create defer_commit requires reindex

307041f

[#8395] search-index rebuild error code, simplify

a424975

[#8395] index test fixes

12f6cc8

[#8395] datastore: use resource_patch for active flag

756f133

[#8395] various test fixes

622ae3e

[#8395] remove owner_org with ''

71ed8c0

wardi marked this pull request as ready for review August 20, 2024 04:23

wardi assigned pdelboca and Zharktas Aug 20, 2024

wardi added 2 commits December 18, 2024 14:48

Merge remote-tracking branch 'origin/master' into 8395-simpler-packag…

1ac00d5

…e-index

[#8395] merge fixes

3830a0d

pdelboca reviewed Dec 31, 2024

ckan/logic/action/create.py (outdated)

pdelboca reviewed Dec 31, 2024

ckan/lib/search/__init__.py (outdated)

pdelboca reviewed Dec 31, 2024

wardi added 3 commits December 31, 2024 09:59

[#8395] improve parameter docs

a8a769d

[#8395] remove unnecessary changes

fcdf85d

[#8395] update test: before_resource_show called less

09f5b01

amercader added this to the Forgotten favorites milestone Jan 31, 2025

Merge branch 'master' into 8395-simpler-package-index

e592179

[#8395] add logic.package_show_default_and_custom_schemas

059ca87

amercader merged commit 1cb7fc0 into master Mar 21, 2025
9 checks passed

amercader deleted the 8395-simpler-package-index branch March 21, 2025 09:41

github-project-automation bot moved this from In Progress to Done in Performance, Developer & Maintainer Experience Mar 21, 2025

wardi added the Performance label Mar 25, 2025

amercader mentioned this pull request May 16, 2025

Breaking change: the before_dataset_index hook receives unvalidated dataset dict #8953

Open

amercader mentioned this pull request Jul 3, 2025

CLI functions should return non-zero exit code when something goes wrong #9011

Closed
