-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Description
CKAN version
2.9.1
Describe the bug
The postgres datastore creates several indices when a resource is uploaded.
A unique index for the _id
internal field, an FTS index using the _full_text
internal field (which stores all the column values in one field for FTS searches, effectively doubling the width of the table), and one index per text column.
Though this is great for ad-hoc queries using datastore_search
and datastore_search_sql
, and for interactive filtering on the UI, it greatly increases the datastore's storage requirements.
For example, a table with 1.37 million rows and 9 columns is:
- 60 mb as a csv file
- 302 mb for the database table
- 569 mb for the 8 indices (unique index for
_id
, 1 fts index, and 6 indices for the 6 text columns. It did not create indices for the one timestamp and two numeric columns)
I got this info while testing the expanded datastore_info
PR.
Expected behavior
The team should consider giving finer granular control to the CKAN administrator beyond specifying the index method.
Some indexing "knobs" to consider:
- turn on/off FTS index
- leverage the data dictionary to explicitly declare columns that should be indexed. As an added benefit, this should also allow indices for non-text columns
- or alternatively, instead of creating a mini DBA interface on CKAN (a non-trivial task), just read the postgres system catalog and selectively show some indexing info on the data dictionary (leveraging the expanded
datastore_info
), and let the DB/CKAN administrator use robust tools like PgAdmin to manage datastore indices manually. This is consistent with the keeping the Datastore API "as thin as possible to allow you to use the features you would expect from a powerful database management system."
Metadata
Metadata
Assignees
Labels
Type
Projects
Status