Replies: 20 comments 25 replies
-
CKAN Interfaces (rough)

```python
class ISearchProvider:
    def search_query(
        self,
        query: str,  # e.g. 'water data'
        filters: dict[str, str | list[str]],  # e.g. {'metadata_modified<': '2024-01-01', 'entity_type': ['package']}
        sort: list[list[str]],  # e.g. [['title'], ['metadata_modified', 'desc']]
        lang: str,  # for text query language stemming e.g. 'de'
        additional_params: dict[str, Any],  # custom parameters this provider may process or ignore
        return_ids: bool,  # True: return records as ids (may increase maximum record limit)
        return_entity_types: bool,  # True: wrap records with {'entity_type': et, 'data': record} objects
        return_facets: bool,  # True: return facet counts for available indexes
        limit: Optional[int],  # maximum records to return, None: maximum provider allows
    ) -> Optional[SearchResults]:
        '''generate search results or return None if another provider
        should be used for the query'''

    def initialize_search_provider(self, combined_schema: SearchSchema, clear: bool) -> None:
        '''create or update indexes for fields based on combined search
        schema containing all field names, types and repeating state'''

    def index_search_record(self, entity_type: str, id_: str, search_data: dict[str, str | list[str]]) -> None:
        'create or update search data record in index'

    def delete_search_record(self, entity_type: str, id_: str) -> None:
        'remove record from index'


class ISearchFeature:
    def entity_types(self) -> list[str]:
        'return list of entity types covered by this feature'

    def feature_search_schema(self) -> SearchSchema:
        '''return index field names, their types (text, str, date, numeric)
        and whether they are repeating'''

    def format_search_data(self, entity_type: str, data: dict[str, Any]) -> dict[str, str | list[str]]:
        '''convert data for this entity type to search data suitable to be
        passed to the search provider index method'''

    def existing_record_ids(self, entity_type: str) -> Iterable[str]:
        '''return a list or iterable of all record ids for the given entity type
        managed by this feature. Return an empty list for core entity
        types like 'package' or entity types managed by another feature.
        This method is used to identify missing and orphan records in the
        search index'''

    def fetch_records(self, entity_type: str, records: Optional[Iterable[str]]) -> Iterable[dict[str, Any]]:
        '''generator of all records for this entity type managed by this
        feature, or only records for the ids passed if not None.
        This method is used to rebuild all or some records in the search
        index'''
```

Thinking about how to handle cached
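To make the `ISearchFeature` side concrete, a hypothetical plugin for a "page" entity type might look like the sketch below. The `SearchSchema` shape (a flat dict of `field name -> (type, repeating)` entries) and all field names are assumptions for illustration, not part of the draft interface:

```python
from __future__ import annotations

from typing import Any, Iterable


class PagesSearchFeature:
    """Hypothetical feature plugin for a 'page' entity type,
    following the draft ISearchFeature interface above."""

    def entity_types(self) -> list[str]:
        return ['page']

    def feature_search_schema(self) -> dict[str, tuple[str, bool]]:
        # assumed schema shape: field name -> (type, repeating)
        return {
            'title': ('text', False),
            'content': ('text', False),
            'tags': ('str', True),  # repeating field
        }

    def format_search_data(self, entity_type: str, data: dict[str, Any]) -> dict[str, str | list[str]]:
        # convert the stored entity into indexable search data
        return {
            'title': data['title'],
            'content': data['content'],
            'tags': list(data.get('tags', [])),
        }

    def existing_record_ids(self, entity_type: str) -> Iterable[str]:
        # would normally query the pages table
        return []


feature = PagesSearchFeature()
record = feature.format_search_data(
    'page', {'title': 'About', 'content': 'hi', 'tags': ('a', 'b')})
```

The provider would then receive `record` via `index_search_record('page', id_, record)`.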
-
I don't see an equivalent in the aforementioned https://solr.apache.org/guide/8_1/field-types-included-with-solr.html, but object, nested and join types as defined by Elastic seem quite important; at the very least, object and nested are often used extensively in my experience with Elastic: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

It also may be worth looking at Postgres types for reference:
-
This is looking great. IIUC:
It might be useful to test these proposed interfaces against real use cases to see if they cover them:

1. Spatial search in ckanext-spatial

This currently works by leveraging custom Solr field types. There are actually two types of search with different Solr field types, but the mechanics would be the same, so let's focus on one. To provide spatial search:

A hypothetical Elasticsearch-powered spatial search would follow the same mechanism, but in that case the spatial field type is already supported out of the box (so no point 1). So to replicate the current Solr-based query:

2. Indexing and querying pages

ckanext-sitesearch supports indexing page contents if ckanext-pages is installed, and performing searches. There is no additional setup other than installing and configuring ckanext-pages, but the indexing is done via a separate command (

So in this case:

There are still some blind spots for me, but overall this sounds like a good direction to follow!
-
These are not present in the current search schema so I'd stick with the currently supported types. Additional types like the ones you mentioned would be easy to offer by the relevant
-
Trying to visualize a bit more how the proposed interfaces would play together, in this case testing how the spatial search might look.

Querying:

```python
# ckan/logic/action/get.py (or an extension for now)
def search(context, data_dict):
    # Check auth
    backend = get_search_backend()
    # Validate data_dict. Any key not in the standard schema is moved to
    # additional_params
    # Q: Do we validate just the common interface params here or get additional schema
    # entries from the search feature plugin to validate additional params like `bbox`?
    schema = default_search_query_schema()
    for plugin in PluginImplementations(ISearchFeature):
        plugin.search_schema(schema)
    # Call query method of the relevant backend
    query_dict = {
        ...
    }
    result = backend.search_query(**query_dict)
    return result


# ckanext-search-solr
class SolrSearchProvider(ISearchProvider):
    def search_query(self, query, filters, ...):
        for plugin in PluginImplementations(ISearchFeature):
            # Search feature plugins will probably want to modify the query, filters etc. parameters
            # Q: what's the best way of passing them to the search feature plugins?
            # wrapped in a query_dict? modify in place?
            plugin.before_search(query, filters, sort, lang, additional_params, ...)
        # Construct actual Solr query
        # Call Solr client
        # Parse results to adapt them to the common interface format
        return results


# ckanext-spatial
class SolrSpatialSearch(ISearchFeature):
    def before_search(self, query, filters, etc):
        # Check additional_params for a `bbox` param
        # Add a new filter based on that
        # Note: the filters format suggestion was a field_name: field_value type, but this doesn't fit
        # this convention as it defines the whole value for fq, so I'm suggesting a convention to add `fq` filters
        # directly
        filters["fq"] = "{{!field f=spatial_geom}}Intersects(ENVELOPE({minx}, {maxx}, {maxy}, {miny}))"
```

Indexing:

```python
# This could be a CLI command or an action
def index(entity_type, entity_ids):
    # Check auth
    # Get whatever data needs to be indexed, e.g. a validated data_dict for datasets or ISearchEntity.fetch_records()
    # for custom entities
    data_dict = get_search_data_for_entity()
    # Call the index method of the relevant backend
    backend = get_search_backend()
    backend.index_search_record(entity_type, id_, data_dict)


# ckanext-search-solr
class SolrSearchProvider(ISearchProvider):
    def index_search_record(entity_type, id_, data_dict):
        for plugin in PluginImplementations(ISearchFeature):
            plugin.before_index(entity_type, id_, data_dict)


# ckanext-spatial
class SolrSpatialSearch(ISearchFeature):
    def before_index(entity_type, id_, data_dict):
        # Check if entity is dataset, and if it has a `spatial` geojson field
        # if so, add the relevant field to the data to index
        data_dict["spatial_field"] = wkt_version_of_geometry
```
-
Instead of a

```python
processed = {}
for plugin in PluginImplementations(ISearchFeature):
    processed.update(plugin.process_additional_params(additional_params))
```

then the processed dict could be passed through to search containing trusted values (not arbitrary user-provided ones), and we don't have to mutate the parameters passed.

Same sort of thing for indexing. A

Let's keep the validated version of the entity in one dict and the record to be indexed in another. Something kind of like:
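End to end, the processed-parameters idea above might be sketched like this. The `process_additional_params` hook, the plugin list and the bbox parsing are all illustrative assumptions, not the real CKAN plugin machinery:

```python
class SpatialFeature:
    """Hypothetical feature plugin contributing a trusted bbox value."""

    def process_additional_params(self, additional_params):
        processed = {}
        bbox = additional_params.get('bbox')
        if bbox:
            # validate the raw user string into trusted floats
            minx, miny, maxx, maxy = (float(v) for v in bbox.split(','))
            processed['bbox'] = (minx, miny, maxx, maxy)
        return processed


def collect_processed_params(plugins, additional_params):
    # mirrors the loop above: each plugin contributes trusted values
    # without mutating the user-provided parameters
    processed = {}
    for plugin in plugins:
        processed.update(plugin.process_additional_params(additional_params))
    return processed


params = {'bbox': '-10,40,5,55', 'ignored': 'x'}
trusted = collect_processed_params([SpatialFeature()], params)
```

The search provider would then only ever see `trusted`, and `params` stays untouched.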
-
@sagargg @gavram do you feel like you have enough to create a small Proof of Concept? I would focus on small, focused POCs to avoid complicating things, and then tie it all together progressively:

How does this sound? BTW I've put together the various snippets from this discussion in a gist so it's easy to follow: https://gist.github.com/amercader/2f0f54e1fcf33bad5b7d8b10aa2a10f8
-
I've been working on an initial Proof of Concept to validate this approach here, in case anyone wants to follow along: https://github.com/amercader/ckanext-search

Things look promising, but there's still a long way to go.
-
Going forward maybe we need to re-think what we send to index and what actually gets indexed. Right now we pass the following to the function that indexes datasets in Solr:
Setting aside the issue of space and whether the indexed

What if the new search:

If a site really needs the un-validated

@wardi does this make sense?
-
Filters

We need to define a generic, provider-agnostic way of providing filters to the search methods, i.e. this param of the

Additionally, similar functionality is needed for the DataStore search filters, so let's come up with something that can work for both (#8689).

I've done some minimal exploration of the state of the art, even if just for inspiration:

Just adopting one of the above and calling it a day isn't really an option; all query languages have options tailored to the specific software, so at the least we should be looking at a subset (the one we want our searches to support and the ones that the preferred providers can support). I also considered extracting the parse/validation functionality of a Python implementation like the opensearch-py one, but it would mean pulling in a lot of extra logic. I like @wardi's suggestion in #8696 of separating fields, operators and values:
What operations should be supported initially?
Keen to hear people's thoughts on this.
-
@amercader I like this suggested syntax: it's composable, extendable and succinct. We can compose "or" clauses and operations, e.g. an "or" clause combined with ranges:
We can extend it with new operations using plugins:
Strings / numbers / bools / null are short-form for
tangent 1: filter_operations

Operations appearing only after column names is a limitation, though. There isn't a way to express operations that cover multiple columns, like "admin = true or organization = public". We also don't have a way to express pagination based on the sort order, like "records where (year, month) > (2024, 5)" where year and month are separate columns. What could these examples look like?

We could have a separate argument for filters where the column name must be a parameter, e.g. "admin = true or organization = public" could be:
"records where (year, month) > (2024, 5)" could be:
tangent 2: advanced sorting

That last year, month example almost works for pagination, but if the sort order combines ascending and descending columns we might be better off defining "after" and "before" operations that automatically apply to the sort order, e.g.
where

Or do we drop support for sorting multiple columns in different orders? I'm not sure how to represent that last example in the general case with a PostgreSQL backend; it's likely not possible in general for all search backends. If we require that sorting/pagination only be done on one unique indexed column at a time then we avoid this problem. That shouldn't be a problem for dataset search, except that we need to ensure things that might not be unique (e.g. date_updated) are combined with something that is unique, like the package id, behind the scenes.

tangent 3: language support

We also need to think about language support. It's reasonable for a user to have multiple language versions of dataset metadata, then e.g. ask for the datasets to be returned sorted by "Spanish title with es_ES collation". If the backend remembers the collation type of each field then this shouldn't affect the search interface, but it will be important to be able to declare fields to be indexed with specific collation types so that users don't get the wrong results. For general search (users passing a value to the
-
I think my tangent 1 and tangent 2 are solved by having this

We can handle the case of a field starting with

```json
"filters": {
    "$eq_fields": ["created_date", "modified_date"]
}
```

This also opens up the possibility for the datastore to allow custom advanced queries on large datasets (e.g. to support a visualization or summary) without needing to open up access to the generic

So our filters dict keys would take the form:
or
With these short forms:
and if the filters value itself is a list instead of a dict:
Filters that are not a dict or list are invalid, and the validity of parameters depends on the operator.
-
As an implementation detail of

```json
"$or": {
    "field1": "value1",
    "field2": "value2"
}
```

and treat them the same as:

```json
"$or": [
    { "field1": "value1" },
    { "field2": "value2" }
]
```

AFAICT

```json
"$and": [
    { "title": { "contains": "water" } },
    { "title": { "contains": "quality" } }
]
```

I think this is fine, but for operations where requiring

```json
"title": { "contains": [ "water", "quality" ] }
```
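A sketch of the normalization this would imply, assuming scalars are short-form for `$eq` and a dict under `$or`/`$and` is treated as a list of single-key clauses. The function name and the exact expansion rules are assumptions based on the examples in this thread, not a settled design:

```python
def normalize(filters):
    """Expand filter shorthands into a canonical form (a sketch):
    - scalar values become {"$eq": value}
    - a dict under "$or"/"$and" becomes a list of single-key clauses
    """
    out = {}
    for key, value in filters.items():
        if key in ('$or', '$and'):
            if isinstance(value, dict):
                # {"f1": v1, "f2": v2} -> [{"f1": v1}, {"f2": v2}]
                value = [{k: v} for k, v in value.items()]
            out[key] = [normalize(clause) for clause in value]
        elif isinstance(value, dict):
            # already an {operator: parameters} dict, leave as-is
            out[key] = value
        else:
            out[key] = {'$eq': value}
    return out


normalized = normalize({'$or': {'field1': 'value1', 'field2': 'value2'}})
```

Providers would then only ever see the expanded list form of `$or`/`$and` and explicit `$eq` operators.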
-
@wardi I'm working on an initial implementation of the filters validation. I think it would be great if search providers got the "expanded" form of the provided filters, i.e. with all the different shorthands expanded to the standard filter form, so they only have to worry about that particular part of the syntax. So at the CKAN core level (e.g. in the

Do you think that the expansion should take place in the validator function itself, or should it be done in two steps (first validation, then expansion)?

Also I'm wondering if it's worth expanding all filters to

So even if users pass
-
@wardi (and anyone interested!) can you check this and see if I missed something?
-
Generic search endpoint (
-
We would be forced to namespace the different entity type results if not all the entities are part of the same index, right? When they're in separate indexes there's no way to order the results against each other. But then what does it mean to paginate the results? We would be getting the next page of

That doesn't sound very intuitive or user-friendly. Instead we could choose to limit this API to only entity types that are part of the same index. It would be up to the site to configure the entity types covered, potentially even including datastore rows or unstructured documents. The entity type counts should be available as one of the facets and filters available, even though they aren't "real" fields in the entities themselves.

For results I can think of three ways to indicate the entity types, but I don't love any of them:

(1) means not modifying the results list, which is nice, but requires iterating over parallel lists when we care about the types
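For option (1), callers would `zip` a parallel list of types with the results; the alternative is the wrapped form from the `return_entity_types` flag in the draft interface (`{'entity_type': et, 'data': record}`). A small sketch of both, with made-up records:

```python
# Option 1: parallel lists; the results list stays unmodified and
# callers zip when they care about the types.
results = [{'id': 'abc'}, {'id': 'page-1'}]
entity_types = ['package', 'page']
typed_pairs = list(zip(entity_types, results))

# Wrapped form (as in the return_entity_types flag of the draft
# ISearchProvider interface): the type travels with each record.
wrapped = [{'entity_type': et, 'data': r} for et, r in typed_pairs]
```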
-
For encouraging index-based pagination the

e.g. if the request included
Then a valid value for
The search API would convert this to filters before sending it to the back end to execute, in this case the

As part of the search response we can include a
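Assuming the cursor carries the sort-key value of the last record seen, the cursor-to-filter conversion described above might be sketched as follows for the single-sort-column case (with the unique-id tie-break suggested earlier for non-unique columns like date_updated). Function and operator names are illustrative:

```python
def after_to_filter(sort_column, after_value, last_id=None):
    """Convert an 'after' cursor into a filter, assuming an ascending
    sort on one column and the $gt/$or/$and operators discussed above."""
    if last_id is None:
        # sort column is unique: everything strictly after the cursor
        return {sort_column: {'$gt': after_value}}
    # non-unique sort column: tie-break on a unique id behind the scenes
    return {'$or': [
        {sort_column: {'$gt': after_value}},
        {'$and': [{sort_column: after_value}, {'id': {'$gt': last_id}}]},
    ]}


f = after_to_filter('metadata_modified', '2024-01-01', last_id='abc')
```

The back end never sees the cursor itself, only ordinary filters it already knows how to execute.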
-
The

@EricSoroos mentioned having some success with search using trigrams instead of language-specific stemming rules; that could be another approach to include or consider.
-
It's weird to be returning overlapping values in

Can the "title" and "display name" values for facets reliably be stored by all search back ends? There's seemingly no allowance for multiple language versions of each. Maybe we're better off keeping the simpler
-
Generic Search
A search API that:

- `package_search?q=` queries
- `package_search?fq=` queries including multiple options for each filter or range queries for date/number fields
- `fq` filter values (as available by search back end)

We can adapt the existing `package_search` to call this new API by:

- `q=` queries for solr-like syntax and converting it to a general text search or solr-specific parameter
- `sort=` and `fq=` parameters
- `fl=id` is passed
- `entity_type=package`
- `include_private=false` or `include_drafts=false`
Custom search back ends #7552 would implement querying based on the common requirements of the generic `/api/action/search` but would be free to process back end-specific additional parameters in any way they choose. We document the common parameters and inform users that additional parameters are site or back end-specific.

The common parameters are enough to implement site-wide search in CKAN with facets supporting multiple-select and ranges in core. Extensions would be used to add support for additional parameters to the UI, e.g. vector or geospatial search.
Indexing
It may be useful to index entities in multiple places, e.g. elastic search + a graph database. Multiple plugins may be enabled for indexing but only a single plugin will handle implementing the generic search API.
Language stemming will be enabled based on a configuration option, falling back to the `ckan.locales_offered` setting if not provided.

CKAN will provide an interface to plugins for registering search index schemas. CKAN will merge these schemas on start-up and refuse to start if there is any conflict (e.g. one schema declares `"authors"` as a multiple-text field while another declares it as an integer). The merged schema is used to configure the search back end using a `ckan search-index` CLI command.

Search schemas are a flat dict of fields with types like the ones in https://solr.apache.org/guide/8_1/field-types-included-with-solr.html e.g.:
These fields may be repeated or single-value. Additional text fields are generally discouraged as using a common text field for all text should give better results for default searches.
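The merge-and-refuse-to-start behaviour could be sketched like this. The flat `{field: (type, repeating)}` schema shape and the example field names are assumptions for illustration:

```python
def merge_schemas(schemas):
    """Merge flat search schemas from all plugins, raising on conflict
    (e.g. one plugin declares a field as text while another declares it
    as an integer). Schema shape {field: (type, repeating)} is assumed."""
    merged = {}
    for schema in schemas:
        for field, spec in schema.items():
            if field in merged and merged[field] != spec:
                raise ValueError(
                    f'conflicting declarations for {field!r}: '
                    f'{merged[field]} != {spec}'
                )
            merged[field] = spec
    return merged


merged = merge_schemas([
    {'title': ('text', False), 'authors': ('text', True)},
    {'title': ('text', False), 'notes': ('text', False)},
])
```

Identical declarations from different plugins merge cleanly; any disagreement aborts start-up, which surfaces schema conflicts early rather than at query time.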
At the CKAN action level, when any entity that could be indexed is created, updated or deleted, it will be passed to a generic function to convert it to a text-only representation for indexing. Plugins can intercept this conversion, like the current `IPackageController.before_dataset_index` method does, but for any entity type.

This will allow extensions like `ckanext-scheming` to generate a search schema and convert the data for indexing automatically. This is the last missing piece for flexible and immediately usable custom schemas in CKAN without custom plugin code required.