Data dictionary in ckan db #8531

wardi · 2024-11-14T16:39:52Z

wardi
Nov 14, 2024
Maintainer

Since the first introduction of the data dictionary feature #3414 there has been some discomfort with its implementation as json-encoded column comments.

Column comments have the benefit of being removed automatically when a column is removed and being accessible from within the datastore e.g. by datastore_search_sql (this hasn't been widely used AFAIK)

But there are some real drawbacks:

data dictionary contents are lost when a datastore table is deleted
we can't populate a data dictionary for data not loaded or not yet loaded into the datastore
only field data is supported, nothing table-wide can be stored in the data dictionary
data dictionaries are stored as json text, not with an enforced data schema, potentially leading to errors

Let's consider moving the column comment data from the datastore database to the main ckan database. Fields can be indexed by resource id and column name so they will retain information if columns are removed or reordered.

The datastore_create and datastore_info APIs will continue to be able to update and read the data dictionary for backwards compatibility, but new endpoints will be added for CRUD operations on data dictionaries that don't rely on the datastore.

We'll need some way to prune or clean up old data dictionary entries that no longer apply, but we have a pattern for this with other CLI commands.

Note

I'm avoiding the more complicated issue of merging data dictionaries and table schemas used in ckanext-validation in this discussion to focus on this smaller step to improve functionality while maintaining compatibility for current users

brooks-eco · 2024-11-20T00:24:05Z

brooks-eco
Nov 20, 2024

Could the resource dictionary then be preserved as a default at dataset level, so that a given resource could be deleted and replaced without losing the dictionary? Alternatively, multiple resources with the same schema could be loaded and each adopt the dataset dictionary (we could first test whether the dataset default dictionary is suitable for each resource). This would simplify datasets where annual data is appended.

0 replies

wardi · 2025-06-11T17:30:00Z

wardi
Jun 11, 2025
Maintainer Author

Suggested model to support many types of data dictionaries:

compatible with current DataStore data dictionaries
supports almost any table schema standard
even non-tabular data can be documented (e.g. explaining layers on a geojson resource)
can be owned and edited as part of a resource (like current data dictionaries)
can be shared within an organization, reused, linked and edited by org members (custom permissions w/ IAuthFunctions plugin)
can be site-wide, reused and linked by anyone, and edited by sysadmins (custom permissions w/ IAuthFunctions plugin)

`ckan/model/data_dictionary.py`:

Column('id', UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
Let's use real UUID columns on new models. We could discuss starting to use uuid7 as well if there's general agreement.
Column('name', UnicodeText, UniqueConstraint('name', name='data_dictionary_name_unique'), nullable=True)
names must be unique on the site, but not required for resource-owned dictionaries
Column('type', UnicodeText, nullable=False)
IDataDictionaryForm plugins will be able to be selectively enabled for this dictionary by the type value set at creation time
currently all IDataDictionaryForm plugins have their features merged for all data dictionaries because there are no types
Column('title', UnicodeText, nullable=False, default='')
Column('notes', UnicodeText, nullable=False, default='')

Column(
    'resource_id', UnicodeText,
    ForeignKey('resource.id', onupdate='CASCADE', ondelete='CASCADE'),
    UniqueConstraint('resource_id', name='data_dictionary_resource_id_unique'),
    nullable=True,
)

set when owned by a resource
a resource may only own one data dictionary

Column('owner_org', UnicodeText, ForeignKey('group.id', onupdate='CASCADE', ondelete='CASCADE'), nullable=True)
set when owned by an org
Column('private', Boolean, default=False, nullable=False)
only for org-owned dictionaries: apply org private-visibility permissions to this data dictionary
resource-owned dictionaries apply the dataset permissions for visibility/editing
site dictionaries are always public and only editable by sysadmin accounts (can be overridden)
Column('plugin_data', JSONB, nullable=False, default=dict)
IDataDictionaryForm will now be able to add fields at the root level of a data dictionary and store them in plugin_data
```
Column('fields', JSONB, CheckConstraint(
    """
    jsonb_typeof(fields) = 'array' and not jsonb_path_exists(fields, '$[*] ? (@.type() <> "object")')
    """,
    name='data_dictionary_fields',
), nullable=False, default=list)
```
fields is a list of objects, other constraints to be implemented by specific IDataDictionaryForm plugins
conventions include using id for the field identifier and type for the data type as is done by the DataStore, but these are enforced only by the plugins enabled for the type of dictionary

Additional Constraints:

CheckConstraint('num_nonnulls(resource_id, owner_org) <= 1', name='data_dictionary_one_owner')
ensure that the data dictionary has one clear owner: a resource (like current behavior), an org, or the whole site (both null)
CheckConstraint("resource_id is not null or name is not null", name='data_dictionary_resource_or_name')
a name is required for site or org data dictionaries
Index('idx_data_dictionary_field_id', "jsonb_path_query_array(fields,'$[*].id')", postgresql_using='gin')
index all the field ids (column names) so we can find existing data dictionaries for new resources uploaded
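
The ownership and `fields` constraints above can also be mirrored in plain Python, for example to return a readable validation error before an insert reaches the database. This is only a minimal sketch and not part of the proposal: `owner_kind` and `fields_shape_ok` are hypothetical helper names.

```python
def owner_kind(resource_id=None, owner_org=None, name=None):
    """Return 'resource', 'org' or 'site' for a data dictionary row.

    Hypothetical helper mirroring the two proposed check constraints.
    """
    # data_dictionary_one_owner: at most one owner column may be set
    if resource_id is not None and owner_org is not None:
        raise ValueError("a data dictionary cannot have two owners")
    # data_dictionary_resource_or_name: org and site dictionaries need a name
    if resource_id is None and name is None:
        raise ValueError("a name is required for org and site dictionaries")
    if resource_id is not None:
        return "resource"
    return "org" if owner_org is not None else "site"


def fields_shape_ok(fields):
    """Mirror the data_dictionary_fields constraint: a list of objects."""
    return isinstance(fields, list) and all(
        isinstance(field, dict) for field in fields
    )
```

Anything stricter than "a list of objects" (required keys, allowed types) would still be left to the IDataDictionaryForm plugins, as described above.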

`ckan/model/resource.py`:

one new resource column:

Column('data_dictionaries', ARRAY(UUID(as_uuid=True)))
resources may link to one or more data dictionaries so we can support different types of dictionaries and reuse partial dictionaries (some fields are common between many tables)
Index('idx_resource_data_dictionary', 'data_dictionaries', postgresql_using='gin')

When a resource owns a data dictionary it does not need to be referenced in the data_dictionaries values, it will be displayed automatically.

2 replies

amercader Jun 16, 2025
Maintainer

This all looks great. Maybe we could add created / updated fields to track changes on schemas?

amercader Jun 16, 2025
Maintainer

Nevermind, saw your comment in the other thread

EricSoroos · 2025-06-13T09:30:29Z

EricSoroos
Jun 13, 2025
Maintainer

We could discuss starting to use uuid7 as well if there's general agreement.

PostgreSQL's support for uuid7 is only available through extensions at the moment, so I'd say probably not, unless we know that we can provide enough functionality on the app side only, with the value passed through to a generic postgres uuid field.
This is storing the fields as a json blob, decoding to a list of field blobs. Does this help the issue of "data dictionaries are stored as json text, not with an enforced data schema, potentially leading to errors"? Understanding that if we're migrating existing data dictionaries over, then we won't have any sort of schema on the existing fields.
I'm not clear how individual fields here could be shared between tables if they're part of a JSON blob, or what it means to link to two data dictionaries for an individual resource.
Would i18n for this be stored in plugin_data?

3 replies

wardi Jun 13, 2025
Maintainer Author

Postgresql's support of uuid7 is only in extensions at the moment, so I'd say probably not unless we know that we can provide enough functionality via the app side only and it passed through on a generic postgres uuid field.

The idea is to use a generic postgres uuid field and the uuid7 python module for id generation. It's possibly worth a separate discussion, but the main benefit I see is a natural ordering of created ids (because uuid7 starts with a timestamp), and the main drawback is that all the id prefixes look the same (in some cases it's nice to only need to look at the first few hex digits when comparing uuid4 values).
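
To illustrate the app-side approach, here is a rough sketch of a version 7 UUID built with only the standard library, following the RFC 9562 bit layout (48-bit millisecond timestamp, version, variant, random remainder). `uuid7_sketch` is a hypothetical name used for illustration; it is not the API of the uuid7 module mentioned above.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a version 7 UUID: timestamp prefix followed by random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    # top 48 bits: milliseconds since the Unix epoch
    value = (unix_ms & ((1 << 48) - 1)) << 80
    # low 80 bits: random, then overwrite the version and variant bits
    value |= int.from_bytes(os.urandom(10), "big")
    value &= ~(0xF << 76)
    value |= 0x7 << 76  # version 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62  # RFC variant (binary 10)
    return uuid.UUID(int=value)


earlier = uuid7_sketch(unix_ms=1_000)
later = uuid7_sketch(unix_ms=2_000)
print(earlier < later, str(earlier) < str(later))
```

The result is an ordinary `uuid.UUID`, so it can be stored in a generic uuid column; ids created in different milliseconds sort by creation time, both as values and as strings.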

This reminds me I didn't add any created or modified date fields to the model. Those would be good to have.

This is storing the fields as a json blob, decoding to a list of field blobs. Does this help the issue of "data dictionaries are stored as json text, not with an enforced data schema, potentially leading to errors"? Understanding that if we're migrating existing data dictionaries over, then we won't have any sort of schema on the existing fields.

The current data dictionaries are stored as json in a text field and these are stored as jsonb in a column. IIUC jsonb is more efficient for querying and storing large values but does not record white space between elements (not a big deal) or key order in objects (possibly a problem)

The current data dictionaries don't have a schema for the "info" object for historical reasons, but every other field has a schema enforced by the installed IDataDictionaryForm plugins. I'm planning on keeping this behavior: schemas are enforced in python by the IDataDictionaryForm plugins and validators applied. This gives us maximum flexibility to represent any sort of data dictionary, including for things like a glossary of terms (fields don't have ids because they aren't referencing columns in the data) or json schema (stored in plugin_data as a json schema blob instead of separated into the fields)

I'm not clear how individual fields here could be shared between tables if they're part of a JSON blob, or what it means to link to two data dictionaries for an individual resource.

A resource may have exactly one "owned" resource data dictionary, and may have one or more "shared" org or site-wide data dictionary linked from its data_dictionaries field. Plugins implementing IDataDictionaryForm would include template overrides that determine how the dictionaries are combined and displayed to end-users.

e.g. a table might have 20 columns but 15 of those columns are standardized for many of an organization's datasets. This dataset could have some custom field descriptions in its "owned" resource data dictionary that is merged with a linked shared org-wide data dictionary (for fields with the same ids) when it is displayed to users. It could also include a link to a shared site-wide glossary of terms that would appear below the merged data dictionary display.
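
One possible merge rule for that example, sketched in Python: fields are matched by `id`, shared values are the base, and values from the owned dictionary win. `merge_fields` and the `notes` key are hypothetical; the actual combining logic would be up to each IDataDictionaryForm plugin and its templates.

```python
def merge_fields(shared_fields, owned_fields):
    """Merge two `fields` lists by id; owned values override shared ones."""
    merged = {}
    order = []
    for field in list(shared_fields) + list(owned_fields):
        field_id = field.get("id")
        if field_id is None:
            # e.g. glossary entries that don't reference a column
            continue
        if field_id not in merged:
            merged[field_id] = {}
            order.append(field_id)
        merged[field_id].update(field)
    return [merged[field_id] for field_id in order]


shared = [
    {"id": "year", "type": "int", "notes": "Reporting year"},
    {"id": "region", "type": "text"},
]
owned = [
    {"id": "year", "notes": "Fiscal year"},
    {"id": "extra", "type": "text"},
]
print(merge_fields(shared, owned))
```

Here `year` keeps its shared `type` but takes the custom description from the owned dictionary, and `extra` is appended after the shared fields.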

Would i18n for this be stored in plugin_data?

Yes i18n versions of title and notes could be stored in plugin_data and i18n versions of field names and descriptions could be stored as separate fields in fields (like it can be implemented in the current data dictionaries).

Here I'm following the way the rest of ckan works with translated metadata, although I did consider going straight to title_translated and notes_translated fields in the model.

amercader Jun 17, 2025
Maintainer

schemas are enforced in python by the IDataDictionaryForm plugins and validators applied

I understand the need for flexibility and that this might not be possible, but I think it would be nice to have some minimal consistency across fields schemas, so they can be used across different plugins. Say for instance that I need to describe the data fields in an ML Croissant RecordSet, not only for DataStore tables, but for GeoJSON, shapefiles, etc. It would be nice to know that I just need to check for a type key in the fields mapping (as opposed to field_type, ftype, datatype etc). Same for id to know the field name, etc.
I guess the alternative is to have IDataDictionaryForm provide a way of advertising its own fields schema and integrators need to harmonize it themselves

wardi Jun 17, 2025
Maintainer Author

Having some standard for the stored type values in addition to the id value behavior makes sense to me.

We can make this part of the IDataDictionaryForm documentation
When we're implementing json schema or data package table schema or csvw schemas (schemas defined completely in plugin_data) as data dictionaries we can map their types to one of the standard field type values and store them in the fields jsonb along with the ids for db queries.
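
A minimal sketch of what that mapping could look like for json schema and data package table schema. The target type names below are an assumption (borrowed from common DataStore column types), since the standard list has not been decided in this thread, and both function names are hypothetical.

```python
# Assumed standard type names; the discussion does not fix this list.
JSON_SCHEMA_TYPES = {
    "string": "text", "integer": "int", "number": "numeric", "boolean": "bool",
}
TABLE_SCHEMA_TYPES = dict(JSON_SCHEMA_TYPES, date="date")


def fields_from_json_schema(schema):
    """Extract {"id", "type"} entries from a json schema's properties."""
    fields = []
    for name, prop in schema.get("properties", {}).items():
        types = prop.get("type", [])
        if isinstance(types, str):
            types = [types]
        # "null" only marks the value as optional, so skip it
        base = next((t for t in types if t != "null"), None)
        if base == "string" and prop.get("format") == "date":
            standard = "date"
        else:
            standard = JSON_SCHEMA_TYPES.get(base, "text")
        fields.append({"id": name, "type": standard})
    return fields


def fields_from_table_schema(schema):
    """Extract {"id", "type"} entries from a table schema's fields."""
    return [
        {"id": field["name"],
         "type": TABLE_SCHEMA_TYPES.get(field.get("type"), "text")}
        for field in schema.get("fields", [])
    ]
```

The full schema would stay in plugin_data; only these derived id/type pairs would be copied into the fields jsonb for querying.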

wardi · 2025-06-14T03:14:30Z

wardi
Jun 14, 2025
Maintainer Author

data dictionary-related features compared between CSVW, json schema and data package table schema:

| CSVW | json schema | data package table schema |
| --- | --- | --- |
| `"primaryKey": "𝓍"` | | `"primaryKey": "𝓍"` |
| `"primaryKey": ["𝓍", "𝓎"]` | | `"primaryKey": ["𝓍", "𝓎"]` |
| `"foreignKeys": [{` `"columnReference": ["𝓍"],` `"reference": {` `"resource": "https://𝓎",` `"columnReference": ["𝓏"]` `}}]` | | `"foreignKeys": [{` `"fields": ["𝓍"],` `"reference": {` `"resource": "𝓎",` `"fields": ["𝓏"]` `}}]` |
| `"columns": [` `{"name": "𝓍", …},` `…` `]` | `"properties": {` `"𝓍": {…},` `…` `}` | `"fields": [` `{"name": "𝓍", …},` `…` `]` |
| `"titles": "𝓍"` | `"title": "𝓍"` | `"title": "𝓍"` |
| `"titles": {"en": "𝓍", "fr": ["𝓎", "𝓏"]}` | | |
| `"dc:description": "𝓍"` | `"description": "𝓍"` | `"description": "𝓍"` |
| `"datatype": "string",` `"required": true` | `"type": "string"` | `"type": "string",` `"constraints": {"required": true}` |
| `"datatype": "boolean"` | `"type": ["boolean", "null"]` | `"type": "boolean"` |
| | `"enum": [𝓍, 𝓎, 𝓏]` | `"constraints": {"enum": [𝓍, 𝓎, 𝓏]}` |
| `"datatype": {` `"base": "date", "format": "yyyy-d-M"` `}` | `"type": "string",` `"format": "date"` | `"type": "date",` `"format": "default"` |
| | `"examples": [𝓍]` | `"example": 𝓍` |

1 reply

jqnatividad Jun 18, 2025

+1 on using JSON Schema as it is widely supported, has a lot of available tooling, and can also be used for validation.

You may also want to take into account DCAT3, in particular - dcat:describedBy.

DOI-DO/dcat-us#138

IMHO - a machine-readable, standards based data dictionary is the best kind of metadata that can potentially describe not just columns' data type and description - with an "extended" dictionary, we can potentially describe what's INSIDE the dataset if summary statistics and frequency tables are included in the data dictionary, perhaps, as additional JSON properties in the JSON schema.

With summary statistics, one can even have a global "record-level" search, using a two-pass approach - the first pass will just search the data dictionary's summary statistics (as we can store enumerations, ranges, etc.) and present that to the user as "top-level" search results. Once the user clicks on a "top-level" search result, a "true" row-level datastore_search_sql search is executed on the selected dataset.

This effectively gives CKAN a very efficient, global record-level search and offers the possibility of just using PostgreSQL for search (using JSONB search for the first pass, and plain-old SQL for the second pass - which is very efficient as the summary statistics allow us to make intelligent choices for what to index)
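
As a rough illustration of the first pass only, here is a Python sketch that filters resources using summary statistics stored alongside each field. The `stats` key and its `enum`/`min`/`max` entries are hypothetical (no storage format has been proposed here), and values are assumed to be comparable with the stored range bounds.

```python
def first_pass(dictionaries, column, value):
    """Return ids of resources whose summary statistics admit `value`.

    `dictionaries` maps a resource id to its list of field objects.
    """
    hits = []
    for resource_id, fields in dictionaries.items():
        for field in fields:
            if field.get("id") != column:
                continue
            stats = field.get("stats", {})  # hypothetical key
            in_enum = value in stats.get("enum", [])
            in_range = (
                "min" in stats and "max" in stats
                and stats["min"] <= value <= stats["max"]
            )
            if in_enum or in_range:
                hits.append(resource_id)
            break  # only the first field with this id is checked
    return hits
```

Only the resources returned here would need a second-pass, row-level query such as datastore_search_sql.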

See https://github.com/dathere/qsv/blob/master/scripts/NYC_311_SR_2010-2020-sample-1M.csv.schema.json which was created with the qsv schema command from a million-row, 41-column sample of NYC's 311 data in ~1 second when an index is available (8 seconds without an index).

EricSoroos · 2025-06-18T10:19:18Z

EricSoroos
Jun 18, 2025
Maintainer

Also noting here as an argument in favor: the datastore schema info is not accessible due to table locking when the datapusher has done a truncate/load in a separate thread. For loads that take a significant amount of time, this can lead to gateway errors on the front end, as the data dictionary on the resource page is waiting on a lock that might take minutes or more in the case of a large table.

1 reply

wardi Jun 18, 2025
Maintainer Author

sounds like it's worth adding a short timeout on that query so the page doesn't time out.

wardi · 2025-06-27T13:59:06Z

wardi
Jun 27, 2025
Maintainer Author

@PatLittle mentioned this useful tool: https://github.com/WPRDC/little-lexicographer

0 replies
