Enhancement: support all parquet file metadata

Parquet metadata support was added in #1905 (per #1899), but it does not seem to expose all of a Parquet file's metadata. In particular, the key-value metadata attached to the schema is missing.

Here's a Python script that generates a Parquet file and then prints its schema metadata:

```python
from datetime import datetime
from pprint import pprint

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

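d = {'col1': [1, 2]}
df = pd.DataFrame(data=d)
# Attach a custom key-value pair to the schema, alongside the 'pandas' entry
# that pyarrow generates from the DataFrame.
schema = pa.Schema.from_pandas(df).with_metadata(
    {"updated": datetime.utcnow().isoformat() + "Z"},
)
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, "test.parquet")

# Read the file back and print the schema-level metadata.
t = pq.read_table("test.parquet")
pprint(t.schema.metadata)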
```

It outputs:

```
$ python write_parquet_test.py
{b'pandas': b'{"index_columns": [], "column_indexes": [{"name": null, "field_n'
            b'ame": null, "pandas_type": "unicode", "numpy_type": "object", "m'
            b'etadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col1", '
            b'"field_name": "col1", "pandas_type": "int64", "numpy_type": "int'
            b'64", "metadata": null}], "creator": {"library": "pyarrow", "vers'
            b'ion": "6.0.0"}, "pandas_version": "1.3.4"}',
 b'updated': b'2021-11-04T02:58:27.536549Z'}
```
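For readers unfamiliar with this structure: the metadata is a mapping from bytes keys to bytes values, and the `pandas` value happens to be a JSON document. Here is a minimal sketch of decoding such a mapping with the standard library only. The `decode_metadata` helper is a hypothetical name, and the `pandas` value is abridged from the output above.

```python
import json

# Values copied from the pyarrow output above; the 'pandas' entry is abridged.
raw_metadata = {
    b"pandas": b'{"index_columns": [], "columns": [{"name": "col1", '
               b'"pandas_type": "int64"}], "pandas_version": "1.3.4"}',
    b"updated": b"2021-11-04T02:58:27.536549Z",
}


def decode_metadata(raw):
    """Decode a bytes -> bytes mapping into str keys and str or parsed-JSON values.

    Hypothetical helper: only values that look like a JSON object or array
    are parsed; everything else is kept as a plain string.
    """
    decoded = {}
    for key, value in raw.items():
        text = value.decode("utf-8")
        if text.startswith(("{", "[")):
            try:
                decoded[key.decode("utf-8")] = json.loads(text)
                continue
            except json.JSONDecodeError:
                pass  # not valid JSON after all; fall through to the raw string
        decoded[key.decode("utf-8")] = text
    return decoded


metadata = decode_metadata(raw_metadata)
print(metadata["updated"])
print(metadata["pandas"]["pandas_version"])
```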

However, when we run `parquet_metadata` and `parquet_schema` against this file, most of that data isn't available:

```
D SELECT * FROM parquet_metadata('test.parquet');
┌──────────────┬──────────────┬────────────────────┬───────────────────────┬─────────────────┬───────────┬─────────────┬────────────┬────────────────┬───────┬───────────┬───────────┬──────────────────┬──────────────────────┬─────────────────┬─────────────────┬─────────────┬──────────────────────────────┬───────────────────┬────────────────────────┬──────────────────┬───────────────────────┬─────────────────────────┐
│  file_name   │ row_group_id │ row_group_num_rows │ row_group_num_columns │ row_group_bytes │ column_id │ file_offset │ num_values │ path_in_schema │ type  │ stats_min │ stats_max │ stats_null_count │ stats_distinct_count │ stats_min_value │ stats_max_value │ compression │          encodings           │ index_page_offset │ dictionary_page_offset │ data_page_offset │ total_compressed_size │ total_uncompressed_size │
├──────────────┼──────────────┼────────────────────┼───────────────────────┼─────────────────┼───────────┼─────────────┼────────────┼────────────────┼───────┼───────────┼───────────┼──────────────────┼──────────────────────┼─────────────────┼─────────────────┼─────────────┼──────────────────────────────┼───────────────────┼────────────────────────┼──────────────────┼───────────────────────┼─────────────────────────┤
│ test.parquet │ 0            │ 2                  │ 1                     │ 100             │ 0         │ 108         │ 2          │ col1           │ INT64 │ 1         │ 2         │ 0                │                      │ 1               │ 2               │ SNAPPY      │ PLAIN_DICTIONARY, PLAIN, RLE │ 0                 │ 4                      │ 36               │ 104                   │ 100                     │
└──────────────┴──────────────┴────────────────────┴───────────────────────┴─────────────────┴───────────┴─────────────┴────────────┴────────────────┴───────┴───────────┴───────────┴──────────────────┴──────────────────────┴─────────────────┴─────────────────┴─────────────┴──────────────────────────────┴───────────────────┴────────────────────────┴──────────────────┴───────────────────────┴─────────────────────────┘
D SELECT * FROM parquet_schema('test.parquet');
┌──────────────┬────────┬─────────┬─────────────┬─────────────────┬──────────────┬────────────────┬───────┬───────────┬──────────┬──────────────┐
│  file_name   │  name  │  type   │ type_length │ repetition_type │ num_children │ converted_type │ scale │ precision │ field_id │ logical_type │
├──────────────┼────────┼─────────┼─────────────┼─────────────────┼──────────────┼────────────────┼───────┼───────────┼──────────┼──────────────┤
│ test.parquet │ schema │ BOOLEAN │ 0           │ REQUIRED        │ 1            │ UTF8           │ 0     │ 0         │ 0        │              │
│ test.parquet │ col1   │ INT64   │ 0           │ OPTIONAL        │ 0            │ UTF8           │ 0     │ 0         │ 0        │              │
└──────────────┴────────┴─────────┴─────────────┴─────────────────┴──────────────┴────────────────┴───────┴───────────┴──────────┴──────────────┘
```

So my enhancement request is that all the information in pyarrow's `Table.schema.metadata` be available from within DuckDB. This would be very convenient for out-of-band information such as the file's creation date, a hash of the source file contents, or the git hash of the program that created the Parquet file.


Enhancement: support all parquet file metadata #2534
