Add `parquet_metadata` function

We should add a `parquet_metadata` function that lets us inspect and query the characteristics of Parquet files (e.g. the number of row groups, per-row-group statistics, etc.). Suggested example usage:

```sql
SELECT * FROM parquet_metadata('data.parquet');
```

For inspiration on what the function should return, we can look at how [Arrow handles this](https://mungingdata.com/pyarrow/parquet-metadata-min-max-statistics/) in PyArrow:

```python
> parquet_file.metadata
<pyarrow._parquet.FileMetaData object at 0x10a3d8650>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 531

> parquet_file.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x10a3dcdc0>
  num_columns: 2
  num_rows: 2
  total_byte_size: 158

> parquet_file.metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x10a413a00>
  file_offset: 78
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 2
  path_in_schema: first_name
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x10a413a50>
      has_min_max: True
      min: jon
      max: jose
      null_count: 0
      distinct_count: 0
      num_values: 2
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 35
  total_compressed_size: 74
  total_uncompressed_size: 70
```

For example, we could envision returning a table with one row per column chunk of each row group, like the following (with many more columns!):

| row_group_id | column_id | file_offset | file_path | physical_type |                 statistics                 |
|--------------|-----------|-------------|-----------|---------------|--------------------------------------------|
| 0            | 0         | 78          | NULL      | BYTE_ARRAY    | {'min': jon, 'max': jose, 'null_count': 0} |
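To make the proposed shape concrete, here is a minimal Python sketch of the flattening step: nested per-row-group, per-column metadata becomes one flat row per column chunk. The `file_metadata` dict and the `metadata_rows` helper are hypothetical stand-ins for illustration only; the values are copied from the PyArrow output above, and a real implementation would read them from the Parquet footer rather than from a literal.

```python
from typing import Any, Dict, Iterator

# Hypothetical stand-in for parsed Parquet footer metadata.
# Values are taken from the PyArrow output shown earlier in this issue.
file_metadata: Dict[str, Any] = {
    "row_groups": [
        {
            "num_rows": 2,
            "columns": [
                {
                    "file_offset": 78,
                    "file_path": None,  # rendered as NULL in SQL
                    "physical_type": "BYTE_ARRAY",
                    "statistics": {"min": "jon", "max": "jose", "null_count": 0},
                },
            ],
        },
    ],
}


def metadata_rows(metadata: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per (row group, column chunk) pair."""
    for row_group_id, row_group in enumerate(metadata["row_groups"]):
        for column_id, column in enumerate(row_group["columns"]):
            # Prefix each column chunk's fields with its position in the file.
            yield {"row_group_id": row_group_id, "column_id": column_id, **column}


rows = list(metadata_rows(file_metadata))
for row in rows:
    print(row)
```

Keeping `statistics` as a nested value mirrors the struct-like column in the table above; splitting it into separate `stats_min`, `stats_max`, and `stats_null_count` columns would be an equally plausible design.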
