Closed
Description
We should add a parquet_metadata function that allows us to inspect and query characteristics of Parquet files (e.g. number of row groups, statistics of row groups, etc.). Suggested example usage:
SELECT * FROM parquet_metadata('data.parquet');
For inspiration on what the function should return, we should look at how Arrow (pyarrow) exposes this metadata:
> parquet_file.metadata
<pyarrow._parquet.FileMetaData object at 0x10a3d8650>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 2
num_rows: 2
num_row_groups: 1
format_version: 1.0
serialized_size: 531
> parquet_file.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x10a3dcdc0>
num_columns: 2
num_rows: 2
total_byte_size: 158
> parquet_file.metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x10a413a00>
file_offset: 78
file_path:
physical_type: BYTE_ARRAY
num_values: 2
path_in_schema: first_name
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x10a413a50>
has_min_max: True
min: jon
max: jose
null_count: 0
distinct_count: 0
num_values: 2
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 35
total_compressed_size: 74
total_uncompressed_size: 70
For example, we could envision returning the following table (with many more columns!):
| row_group_id | column_id | file_offset | file_path | physical_type | statistics |
|---|---|---|---|---|---|
| 0 | 0 | 78 | NULL | BYTE_ARRAY | {'min': jon, 'max': jose, 'null_count': 0} |
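To make the shape of that table concrete, here is a rough Python sketch of how the nested metadata could be flattened into one row per (row group, column chunk) pair. It is only an illustration of the idea, not the proposed implementation: flatten_metadata is a hypothetical helper, and the demo object is a stand-in built from the values in the dump above (with a single column, since the dump only shows column 0). The attribute names follow the pyarrow objects shown above.

```python
from types import SimpleNamespace


def flatten_metadata(file_metadata):
    """Hypothetical helper: one dict per (row group, column chunk) pair.

    Works on any object that exposes the attributes seen in the pyarrow
    dump (num_row_groups, row_group(i), num_columns, column(j), ...).
    """
    rows = []
    for rg_id in range(file_metadata.num_row_groups):
        row_group = file_metadata.row_group(rg_id)
        for col_id in range(row_group.num_columns):
            chunk = row_group.column(col_id)
            statistics = None
            if chunk.is_stats_set:
                stats = chunk.statistics
                has_min_max = stats.has_min_max
                statistics = {
                    "min": stats.min if has_min_max else None,
                    "max": stats.max if has_min_max else None,
                    "null_count": stats.null_count,
                }
            rows.append({
                "row_group_id": rg_id,
                "column_id": col_id,
                "file_offset": chunk.file_offset,
                # An empty file_path becomes None, i.e. NULL in the table.
                "file_path": chunk.file_path or None,
                "physical_type": chunk.physical_type,
                "statistics": statistics,
            })
    return rows


# Stand-in objects (an assumption for illustration) mirroring the dump above.
demo_stats = SimpleNamespace(has_min_max=True, min="jon", max="jose", null_count=0)
demo_chunk = SimpleNamespace(
    file_offset=78,
    file_path="",
    physical_type="BYTE_ARRAY",
    is_stats_set=True,
    statistics=demo_stats,
)
demo_row_group = SimpleNamespace(num_columns=1, column=lambda j: demo_chunk)
demo_metadata = SimpleNamespace(num_row_groups=1, row_group=lambda i: demo_row_group)

demo_rows = flatten_metadata(demo_metadata)
for row in demo_rows:
    print(row)
```

With pyarrow installed, the same helper should also accept pyarrow.parquet.ParquetFile('data.parquet').metadata in place of the stand-in. Keeping statistics as a nested value rather than separate columns is just one option; min, max and null_count could equally be surfaced as their own columns.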