Skip to content

Overly aggressive indexing strategy greatly increases datastore storage requirements #5847

@jqnatividad

Description

@jqnatividad

CKAN version
2.9.1

Describe the bug
The postgres datastore creates several indices when a resource is uploaded.

A unique index for the _id internal field, an FTS index using the _full_text internal field (which stores all the column values in one field for FTS searches, effectively doubling the width of the table), and one index per text column.

Though this is great for ad-hoc queries using datastore_search and datastore_search_sql, and for interactive filtering on the UI, it greatly increases the datastore's storage requirements.

For example, a table with 1.37 million rows and 9 columns is:

  • 60 mb as a csv file
  • 302 mb for the database table
  • 569 mb for the 8 indices (unique index for _id, 1 fts index, and 6 indices for the 6 text columns. It did not create indices for the one timestamp and two numeric columns)

I got this info while testing the expanded datastore_info PR.

Expected behavior
The team should consider giving finer granular control to the CKAN administrator beyond specifying the index method.

Some indexing "knobs" to consider:

  • turn on/off FTS index
  • leverage the data dictionary to explicitly declare columns that should be indexed. As an added benefit, this should also allow indices for non-text columns
  • or alternatively, instead of creating a mini DBA interface on CKAN (a non-trivial task), just read the postgres system catalog and selectively show some indexing info on the data dictionary (leveraging the expanded datastore_info), and let the DB/CKAN administrator use robust tools like PgAdmin to manage datastore indices manually. This is consistent with the keeping the Datastore API "as thin as possible to allow you to use the features you would expect from a powerful database management system."

Metadata

Metadata

Assignees

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions