Replies: 20 comments 25 replies
-
CKAN Interfaces (rough)

```python
class ISearchProvider:
    def search_query(
        self,
        query: str,  # e.g. 'water data'
        filters: dict[str, str | list[str]],  # e.g. {'metadata_modified<': '2024-01-01', 'entity_type': ['package']}
        sort: list[list[str]],  # e.g. [['title'], ['metadata_modified', 'desc']]
        lang: str,  # for text query language stemming e.g. 'de'
        additional_params: dict[str, Any],  # custom parameters this provider may process or ignore
        return_ids: bool,  # True: return records as ids (may increase maximum record limit)
        return_entity_types: bool,  # True: wrap records with {'entity_type': et, 'data': record} objects
        return_facets: bool,  # True: return facet counts for available indexes
        limit: Optional[int],  # maximum records to return, None: maximum provider allows
    ) -> Optional[SearchResults]:
        '''generate search results or return None if another provider
        should be used for the query'''

    def initialize_search_provider(self, combined_schema: SearchSchema, clear: bool) -> None:
        '''create or update indexes for fields based on combined search
        schema containing all field names, types and repeating state'''

    def index_search_record(self, entity_type: str, id_: str, search_data: dict[str, str | list[str]]) -> None:
        'create or update search data record in index'

    def delete_search_record(self, entity_type: str, id_: str) -> None:
        'remove record from index'


class ISearchFeature:
    def entity_types(self) -> list[str]:
        'return list of entity types covered by this feature'

    def feature_search_schema(self) -> SearchSchema:
        '''return index field names, their types (text, str, date, numeric)
        and whether they are repeating'''

    def format_search_data(self, entity_type: str, data: dict[str, Any]) -> dict[str, str | list[str]]:
        '''convert data for this entity type to search data suitable to be
        passed to the search provider index method'''

    def existing_record_ids(self, entity_type: str) -> Iterable[str]:
        '''return a list or iterable of all record ids for the given entity type
        managed by this feature. Return an empty list for core entity
        types like 'package' or entity types managed by another feature.
        This method is used to identify missing and orphan records in the
        search index'''

    def fetch_records(self, entity_type: str, records: Optional[Iterable[str]]) -> Iterable[dict[str, Any]]:
        '''generator of all records for this entity type managed by this
        feature, or only records for the ids passed if not None.
        This method is used to rebuild all or some records in the search
        index'''
```

Thinking about how to handle cached
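To make the `ISearchFeature` side concrete, a hypothetical plugin for a "page" entity type might look like the sketch below. The `SearchSchema` shape (a flat dict of `field name -> (type, repeating)` entries) and all field names are assumptions for illustration, not part of the draft interface:

```python
from __future__ import annotations

from typing import Any, Iterable


class PagesSearchFeature:
    """Hypothetical feature plugin for a 'page' entity type,
    following the draft ISearchFeature interface above."""

    def entity_types(self) -> list[str]:
        return ['page']

    def feature_search_schema(self) -> dict[str, tuple[str, bool]]:
        # assumed schema shape: field name -> (type, repeating)
        return {
            'title': ('text', False),
            'content': ('text', False),
            'tags': ('str', True),  # repeating field
        }

    def format_search_data(self, entity_type: str, data: dict[str, Any]) -> dict[str, str | list[str]]:
        # convert the stored entity into indexable search data
        return {
            'title': data['title'],
            'content': data['content'],
            'tags': list(data.get('tags', [])),
        }

    def existing_record_ids(self, entity_type: str) -> Iterable[str]:
        # would normally query the pages table
        return []


feature = PagesSearchFeature()
record = feature.format_search_data(
    'page', {'title': 'About', 'content': 'hi', 'tags': ('a', 'b')})
```

The provider would then receive `record` via `index_search_record('page', id_, record)`.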
-
I don't see an equivalent in the aforementioned https://solr.apache.org/guide/8_1/field-types-included-with-solr.html, but object, nested and join types as defined by Elastic seem quite important; at the very least, object and nested are often used extensively in my experience with Elastic: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

It also may be worth looking at Postgres types for reference:
-
This is looking great. IIUC:
It might be useful to test these proposed interfaces against real use cases to see if they cover them:

1. Spatial search in ckanext-spatial

This currently works by leveraging custom Solr field types. There are actually two types of search with different Solr field types, but the mechanics would be the same, so let's focus on one. To provide spatial search:

A hypothetical Elasticsearch-powered spatial search would follow the same mechanism, but in that case the spatial field type is already supported out of the box (so no point 1). So to replicate the current Solr-based query:

2. Indexing and querying pages

ckanext-sitesearch supports indexing page contents if ckanext-pages is installed, and performing searches. There is no additional setup other than installing and configuring ckanext-pages, but the indexing is done via a separate command (

So in this case:

There are still some blind spots for me, but overall this sounds like a good direction to follow!
-
These are not present in the current search schema so I'd stick with the currently supported types. Additional types like the ones you mentioned would be easy to offer by the relevant
-
Trying to visualize a bit more how the proposed interfaces would play together, in this case testing how the spatial search might look.

Querying:

```python
# ckan/logic/action/get.py (or an extension for now)
def search(context, data_dict):
    # Check auth
    backend = get_search_backend()
    # Validate data_dict. Any key not in the standard schema is moved to
    # additional_params
    # Q: Do we validate just the common interface params here or get additional schema
    # entries from the search feature plugin to validate additional params like `bbox`?
    schema = default_search_query_schema()
    for plugin in PluginImplementations(ISearchFeature):
        plugin.search_schema(schema)
    # Call query method of the relevant backend
    query_dict = {
        ...
    }
    result = backend.search_query(**query_dict)
    return result


# ckanext-search-solr
class SolrSearchProvider(ISearchProvider):
    def search_query(self, query, filters, ...):
        for plugin in PluginImplementations(ISearchFeature):
            # Search feature plugins will probably want to modify the query, filters etc. parameters
            # Q: what's the best way of passing them to the search feature plugins?
            # wrapped in a query_dict? modify in place?
            plugin.before_search(query, filters, sort, lang, additional_params, ...)
        # Construct actual Solr query
        # Call Solr client
        # Parse results to adapt them to the common interface format
        return results


# ckanext-spatial
class SolrSpatialSearch(ISearchFeature):
    def before_search(self, query, filters, etc):
        # Check additional_params for a `bbox` param
        # Add a new filter based on that
        # Note: the filters format suggestion was a field_name: field_value type, but this doesn't fit
        # this convention as it defines the whole value for fq, so I'm suggesting a convention to add `fq` filters
        # directly
        filters["fq"] = "{{!field f=spatial_geom}}Intersects(ENVELOPE({minx}, {maxx}, {maxy}, {miny}))"
```

Indexing:

```python
# This could be a CLI command or an action
def index(entity_type, entity_ids):
    # Check auth
    # Get whatever data needs to be indexed, e.g. a validated data_dict for datasets or ISearchEntity.fetch_records()
    # for custom entities
    data_dict = get_search_data_for_entity()
    # Call the index method of the relevant backend
    backend = get_search_backend()
    backend.index_search_record(entity_type, id_, data_dict)


# ckanext-search-solr
class SolrSearchProvider(ISearchProvider):
    def index_search_record(entity_type, id_, data_dict):
        for plugin in PluginImplementations(ISearchFeature):
            plugin.before_index(entity_type, id_, data_dict)


# ckanext-spatial
class SolrSpatialSearch(ISearchFeature):
    def before_index(entity_type, id_, data_dict):
        # Check if entity is dataset, and if it has a `spatial` geojson field
        # if so, add the relevant field to the data to index
        data_dict["spatial_field"] = wkt_version_of_geometry
```
-
Instead of a

```python
processed = {}
for plugin in PluginImplementations(ISearchFeature):
    processed.update(plugin.process_additional_params(additional_params))
```

then the processed dict could be passed through to search containing trusted values (not arbitrary user-provided ones), and we don't have to mutate the parameters passed.

Same sort of thing for indexing. A

Let's keep the validated version of the entity in one dict and the record to be indexed in another. Something kind of like:
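End to end, the processed-parameters idea above might be sketched like this. The `process_additional_params` hook, the plugin list and the bbox parsing are all illustrative assumptions, not the real CKAN plugin machinery:

```python
class SpatialFeature:
    """Hypothetical feature plugin contributing a trusted bbox value."""

    def process_additional_params(self, additional_params):
        processed = {}
        bbox = additional_params.get('bbox')
        if bbox:
            # validate the raw user string into trusted floats
            minx, miny, maxx, maxy = (float(v) for v in bbox.split(','))
            processed['bbox'] = (minx, miny, maxx, maxy)
        return processed


def collect_processed_params(plugins, additional_params):
    # mirrors the loop above: each plugin contributes trusted values
    # without mutating the user-provided parameters
    processed = {}
    for plugin in plugins:
        processed.update(plugin.process_additional_params(additional_params))
    return processed


params = {'bbox': '-10,40,5,55', 'ignored': 'x'}
trusted = collect_processed_params([SpatialFeature()], params)
```

The search provider would then only ever see `trusted`, and `params` stays untouched.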
-
@sagargg @gavram do you feel like you have enough to create a small Proof of Concept? I would focus on small, focused POCs to avoid complicating things, and then tie it all together progressively:

How does this sound? BTW I've put together the various snippets from this discussion in a gist so it's easy to follow: https://gist.github.com/amercader/2f0f54e1fcf33bad5b7d8b10aa2a10f8
-
I've been working on an initial Proof of Concept to validate this approach here, in case anyone wants to follow along: https://github.com/amercader/ckanext-search

Things look promising, but there's still a long way to go.
-
Going forward maybe we need to re-think what we send to index and what actually gets indexed. Right now we pass the following to the function that indexes datasets in Solr:
Setting aside the issue of space and whether the indexed

What if the new search:

If a site really needs the un-validated

@wardi does this make sense?
-
Filters

We need to define a generic, provider-agnostic way of providing filters to the search methods, i.e. this param of the

Additionally, similar functionality is needed for the DataStore search filters, so let's come up with something that can work for both (#8689).

I've done some minimal exploration of the state of the art, even if just for inspiration:

Just adopting one of the above and calling it a day isn't really an option; all query languages have options tailored to the specific software, so at the least we should be looking at a subset (the one we want our searches to support and the ones that the preferred providers can support). I also considered extracting the parse/validation functionality of a Python implementation like the opensearch-py one, but it would mean pulling in a lot of extra logic. I like @wardi's suggestion in #8696 of separating fields, operators and values:
What operations should be supported initially?
Keen to hear people's thoughts on this.
-
@amercader I like this suggested syntax: it's composable, extendable and succinct. We can compose "or" clauses and operations, e.g. an "or" clause combined with ranges:
We can extend it with new operations using plugins:
Strings / numbers / bools / null are short-form for
tangent 1: filter_operations

Operations appearing only after column names is a limitation, though. There isn't a way to express operations that cover multiple columns, like "admin = true or organization = public". We also don't have a way to express pagination based on the sort order, like "records where (year, month) > (2024, 5)" where year and month are separate columns. What could these examples look like?

We could have a separate argument for filters where the column name must be a parameter, e.g. "admin = true or organization = public" could be:
"records where (year, month) > (2024, 5)" could be:
tangent 2: advanced sorting

That last year, month example almost works for pagination, but if the sort order combines ascending and descending columns we might be better off defining "after" and "before" operations that automatically apply to the sort order, e.g.
where

Or do we drop support for sorting multiple columns in different orders? I'm not sure how to represent that last example in the general case with a PostgreSQL backend; it's likely not possible in general for all search backends. If we require that sorting/pagination only be done on one unique indexed column at a time then we avoid this problem. That shouldn't be a problem for dataset search, except that we need to ensure things that might not be unique (e.g. date_updated) are combined with something that is unique, like the package id, behind the scenes.

tangent 3: language support

We also need to think about language support. It's reasonable for a user to have multiple language versions of dataset metadata, then e.g. ask for the datasets to be returned sorted by "Spanish title with es_ES collation". If the backend remembers the collation type of each field then this shouldn't affect the search interface, but it will be important to be able to declare fields to be indexed with specific collation types so that users don't get the wrong results. For general search (users passing a value to the
-
I think my tangent 1 and tangent 2 are solved by having this

We can handle the case of a field starting with

```json
"filters": {
    "$eq_fields": ["created_date", "modified_date"]
}
```

This also opens up the possibility for the datastore to allow custom advanced queries on large datasets (e.g. to support a visualization or summary) without needing to open up access to the generic

So our filters dict keys would take the form:
or
With these short forms:
and if the filters value itself is a list instead of a dict:
Filters that are not a dict or list are invalid, and the validity of parameters depends on the operator.
-
As an implementation detail of

```json
"$or": {
    "field1": "value1",
    "field2": "value2"
}
```

and treat them the same as:

```json
"$or": [
    { "field1": "value1" },
    { "field2": "value2" }
]
```

AFAICT

```json
"$and": [
    { "title": { "contains": "water" } },
    { "title": { "contains": "quality" } }
]
```

I think this is fine, but for operations where requiring

```json
"title": { "contains": [ "water", "quality" ] }
```
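A sketch of the normalization this would imply, assuming scalars are short-form for `$eq` and a dict under `$or`/`$and` is treated as a list of single-key clauses. The function name and the exact expansion rules are assumptions based on the examples in this thread, not a settled design:

```python
def normalize(filters):
    """Expand filter shorthands into a canonical form (a sketch):
    - scalar values become {"$eq": value}
    - a dict under "$or"/"$and" becomes a list of single-key clauses
    """
    out = {}
    for key, value in filters.items():
        if key in ('$or', '$and'):
            if isinstance(value, dict):
                # {"f1": v1, "f2": v2} -> [{"f1": v1}, {"f2": v2}]
                value = [{k: v} for k, v in value.items()]
            out[key] = [normalize(clause) for clause in value]
        elif isinstance(value, dict):
            # already an {operator: parameters} dict, leave as-is
            out[key] = value
        else:
            out[key] = {'$eq': value}
    return out


normalized = normalize({'$or': {'field1': 'value1', 'field2': 'value2'}})
```

Providers would then only ever see the expanded list form of `$or`/`$and` and explicit `$eq` operators.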
-
@wardi I'm working on an initial implementation of the filters validation. I think it would be great if search providers got the "expanded" form of the provided filters, i.e. with all the different shorthands expanded to the standard filter form, so they only have to worry about that particular part of the syntax. So at the CKAN core level (e.g. in the

Do you think that the expansion should take place in the validator function itself, or should it be done in two steps (first validation, then expansion)?

Also I'm wondering if it's worth expanding all filters to

So even if users pass
-
@wardi (and anyone interested!) can you check this and see if I missed something?
-
Generic search endpoint (
-
We would be forced to namespace the different entity type results if not all the entities are part of the same index, right? When they're in separate indexes there's no way to order the results against each other. But then what does it mean to paginate the results? We would be getting the next page of

That doesn't sound very intuitive or user-friendly. Instead we could choose to limit this API to only entity types that are part of the same index. It would be up to the site to configure the entity types covered, potentially even including datastore rows or unstructured documents. The entity type counts should be available as one of the facets and filters available, even though they aren't "real" fields in the entities themselves.

For results I can think of three ways to indicate the entity types, but I don't love any of them:

(1) means not modifying the results list, which is nice, but requires iterating over parallel lists when we care about the types
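For option (1), callers would `zip` a parallel list of types with the results; the alternative is the wrapped form from the `return_entity_types` flag in the draft interface (`{'entity_type': et, 'data': record}`). A small sketch of both, with made-up records:

```python
# Option 1: parallel lists; the results list stays unmodified and
# callers zip when they care about the types.
results = [{'id': 'abc'}, {'id': 'page-1'}]
entity_types = ['package', 'page']
typed_pairs = list(zip(entity_types, results))

# Wrapped form (as in the return_entity_types flag of the draft
# ISearchProvider interface): the type travels with each record.
wrapped = [{'entity_type': et, 'data': r} for et, r in typed_pairs]
```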
-
For encouraging index-based pagination the

e.g. if the request included
Then a valid value for
The search API would convert this to filters before sending it to the back end to execute, in this case the

As part of the search response we can include a
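Assuming the cursor carries the sort-key value of the last record seen, the cursor-to-filter conversion described above might be sketched as follows for the single-sort-column case (with the unique-id tie-break suggested earlier for non-unique columns like date_updated). Function and operator names are illustrative:

```python
def after_to_filter(sort_column, after_value, last_id=None):
    """Convert an 'after' cursor into a filter, assuming an ascending
    sort on one column and the $gt/$or/$and operators discussed above."""
    if last_id is None:
        # sort column is unique: everything strictly after the cursor
        return {sort_column: {'$gt': after_value}}
    # non-unique sort column: tie-break on a unique id behind the scenes
    return {'$or': [
        {sort_column: {'$gt': after_value}},
        {'$and': [{sort_column: after_value}, {'id': {'$gt': last_id}}]},
    ]}


f = after_to_filter('metadata_modified', '2024-01-01', last_id='abc')
```

The back end never sees the cursor itself, only ordinary filters it already knows how to execute.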
-
The

@EricSoroos mentioned having some success with search using trigrams instead of language-specific stemming rules; that could be another approach to include or consider.
-
It's weird to be returning overlapping values in

Can the "title" and "display name" values for facets reliably be stored by all search back ends? There's seemingly no allowance for multiple language versions of each. Maybe we're better off keeping the simpler
-
Generic Search
A search API that:

- `package_search?q=` queries
- `package_search?fq=` queries including multiple options for each filter or range queries for date/number fields
- `fq` filter values (as available by search back end)

We can adapt the existing `package_search` to call this new API by:

- `q=` queries for solr-like syntax and converting it to a general text search or solr-specific parameter
- `sort=` and `fq=` parameters
- `fl=id` is passed
- `entity_type=package`
- `include_private=false` or `include_drafts=false`
Custom search back ends #7552 would implement querying based on the common requirements of the generic `/api/action/search` but would be free to process back end-specific additional parameters in any way they choose. We document the common parameters and inform users that additional parameters are site or back end-specific.

The common parameters are enough to implement site-wide search in CKAN with facets supporting multiple-select and ranges in core. Extensions would be used to add support for additional parameters to the UI, e.g. vector or geospatial search.
Indexing
It may be useful to index entities in multiple places, e.g. elastic search + a graph database. Multiple plugins may be enabled for indexing but only a single plugin will handle implementing the generic search API.
Language stemming will be enabled based on a configuration option, falling back to the `ckan.locales_offered` setting if not provided.

CKAN will provide an interface to plugins for registering search index schemas. CKAN will merge these schemas on start-up and refuse to start if there is any conflict (e.g. one schema declares `"authors"` as a multiple-text field while another declares it as an integer). The merged schema is used to configure the search back end using a `ckan search-index` CLI command.

Search schemas are a flat dict of fields with types like the ones in https://solr.apache.org/guide/8_1/field-types-included-with-solr.html e.g.:
These fields may be repeated or single-value. Additional text fields are generally discouraged as using a common text field for all text should give better results for default searches.
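The merge-and-refuse-to-start behaviour could be sketched like this. The flat `{field: (type, repeating)}` schema shape and the example field names are assumptions for illustration:

```python
def merge_schemas(schemas):
    """Merge flat search schemas from all plugins, raising on conflict
    (e.g. one plugin declares a field as text while another declares it
    as an integer). Schema shape {field: (type, repeating)} is assumed."""
    merged = {}
    for schema in schemas:
        for field, spec in schema.items():
            if field in merged and merged[field] != spec:
                raise ValueError(
                    f'conflicting declarations for {field!r}: '
                    f'{merged[field]} != {spec}'
                )
            merged[field] = spec
    return merged


merged = merge_schemas([
    {'title': ('text', False), 'authors': ('text', True)},
    {'title': ('text', False), 'notes': ('text', False)},
])
```

Identical declarations from different plugins merge cleanly; any disagreement aborts start-up, which surfaces schema conflicts early rather than at query time.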
At the CKAN action level, when any entity that could be indexed is created, updated or deleted, it will be passed to a generic function to convert it to a text-only representation for indexing. Plugins can intercept this conversion, like the current `IPackageController.before_dataset_index` method does, but for any entity type.

This will allow extensions like `ckanext-scheming` to generate a search schema and convert the data for indexing automatically. This is the last missing piece for flexible and immediately usable custom schemas in CKAN without custom plugin code required.