jerryshao
Contributor

What changes were proposed in this pull request?

This PR proposes the schema and type spec for the Unified Catalog. The spec describes how metadata is organized in the system.

Why are the changes needed?

This PR defines the basic metadata schema model, which the system will use for its in-memory structures, on-wire protocol, and serialization protocol.

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

N/A

@jerryshao jerryshao requested a review from xunliu May 11, 2023 08:36
@jerryshao jerryshao self-assigned this May 11, 2023
@jerryshao jerryshao requested a review from JunpingDu May 11, 2023 08:38
@jerryshao jerryshao closed this May 11, 2023
@jerryshao jerryshao reopened this May 11, 2023
@jerryshao jerryshao closed this May 11, 2023
@jerryshao jerryshao reopened this May 11, 2023
   2. We will further use Substrait to represent our logical plans (for example, views, functions, and others), so using Substrait’s type system will reduce conversion work later on.
2. We choose JSON as our user-facing protocol, which is easy for users and systems to debug.
3. We choose the Protobuf binary layout to store the schema. The main considerations are:
   1. The binary layout is much more concise than HMS’s schema layout.
Member

Do we need to perform search or other operations on the schema?
If so, binary storage would make these operations difficult to support.
Maybe we could refer to how Snowflake uses the Avro format to store its metadata?

Contributor Author

Could you please give an example of how we would search the metadata? Also, IIUC, Avro is a binary format too.

You could check the details of Snowflake's metadata design. It also says that using Avro makes the metadata harder to query than a SQL DB would, so Snowflake built a series of CLI tools for users to maintain the metadata.
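
For context on the search question in this thread, here is a minimal sketch of why a binary-encoded schema blob is harder to search than rows in a SQL table: every record has to be decoded before a predicate can be applied. Everything in it is an assumption made for illustration. The record layout, the `encode_column`, `decode_column`, and `search_by_type` helpers, and the column names are hypothetical, and Python's `struct` module stands in for Protobuf or Avro, neither of which is used here.

```python
import struct
from typing import Dict, List


def encode_column(column_id: int, name: str, type_name: str) -> bytes:
    """Pack one column as a uint32 id followed by two length-prefixed UTF-8 strings.

    Hypothetical layout, not the Protobuf or Avro wire format.
    """
    name_b = name.encode("utf-8")
    type_b = type_name.encode("utf-8")
    return (
        struct.pack("<I", column_id)
        + struct.pack("<H", len(name_b)) + name_b
        + struct.pack("<H", len(type_b)) + type_b
    )


def decode_column(blob: bytes) -> Dict[str, object]:
    """Reverse encode_column: read the id, then each length-prefixed string."""
    (column_id,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    (name_len,) = struct.unpack_from("<H", blob, offset)
    offset += 2
    name = blob[offset:offset + name_len].decode("utf-8")
    offset += name_len
    (type_len,) = struct.unpack_from("<H", blob, offset)
    offset += 2
    type_name = blob[offset:offset + type_len].decode("utf-8")
    return {"id": column_id, "name": name, "type": type_name}


def search_by_type(blobs: List[bytes], type_name: str) -> List[str]:
    """Full scan: each blob must be decoded before it can be filtered."""
    matches = []
    for blob in blobs:
        column = decode_column(blob)
        if column["type"] == type_name:
            matches.append(column["name"])
    return matches


# Hypothetical stored schema: three opaque binary records.
store = [
    encode_column(1, "user_id", "i64"),
    encode_column(2, "email", "string"),
    encode_column(3, "nickname", "string"),
]

string_columns = search_by_type(store, "string")
```

With a SQL-backed store the same lookup could be a single indexed query, whereas here the cost grows with the number of stored blobs unless a separate index or tooling is maintained alongside them.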


| Field Name | Field Type | Description | Optional |
| ------------------- | --------------- | ------------------------------------------------------------ | -------- |
| connection_id (TBD) | uint32 | The unique ID of the connector used to access the physical table | Required |
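
As a rough illustration of what the `uint32` and Required constraints in this row imply for an in-memory model, the sketch below rejects missing or out-of-range values. The `TableRef` class and the `UINT32_MAX` constant are hypothetical names and are not part of the spec under review; `connection_id` itself is still marked TBD.

```python
from dataclasses import dataclass

# Largest value representable in an unsigned 32-bit integer.
UINT32_MAX = 2**32 - 1


@dataclass(frozen=True)
class TableRef:
    """Hypothetical holder for the connection_id field described in the row above."""

    connection_id: int

    def __post_init__(self) -> None:
        value = self.connection_id
        # Required: None and non-integers are rejected. bool is excluded
        # explicitly because it is a subclass of int in Python.
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError("connection_id is required and must be an integer")
        # uint32: the value must lie in [0, 2**32 - 1].
        if not 0 <= value <= UINT32_MAX:
            raise ValueError("connection_id must fit in an unsigned 32-bit integer")


ref = TableRef(connection_id=7)
```

Validating at construction time keeps invalid IDs from reaching the serialization layer, where an out-of-range value would otherwise fail later or be truncated, depending on the encoder.
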
Member

I suggest adding a new term to distinguish the two kinds of data source connections:

  • collector: connects to a data source to get/put metadata.
  • connector: connects to a data source to get/put data.

What do you think?

Contributor Author

This is just a placeholder; I will update the doc once you finish the connection-related design.

xunliu previously approved these changes May 15, 2023
@jerryshao
Contributor Author

@xunliu, would you please review this again when you have time? Thanks.

Member

@xunliu xunliu left a comment

LGTM.

2 participants