Re-design the Schema serialization and code organization #3303

@slinkydeveloper

The current Schema design has several critical issues:

  • Schema duplicates a lot of metadata because it is serialized as is, indexes included, rather than distinguishing between the in-memory materialized view and the serialized shape. This makes the metadata bigger for no good reason, and we have hard constraints on the schema registry size due to the network message size limit and the metadata size limit.
  • Storing both the metadata and its index means that every time I need to add a new field, I have to think about backward/forward compatibility constraints for all the duplicated metadata. This makes the schema registry hard to maintain, makes it hard to apply in-flight changes to env variables, and leaves it potentially fragile.
  • Because we store more metadata for the "latest" service than for older service revisions, some metadata is missing, which prevents us from implementing new features like "rollback what is the latest service revision" or blue/green deployments.
  • On top of this, the Schema leaks the internal types it uses for storage to the Admin API. This is another extremely brittle situation, because changing a field in the Admin REST API potentially breaks backward/forward compatibility (I think this has already happened a few times).
  • Last but not least, crucial business logic of the schema registry is now split between what's inside restate_types and the updater in the restate_admin module. This makes it hard to follow what's going on, and is another potential source of bugs.

I would like to go ahead with the following plan:

  • Define a new data model for Schema, storing the tree of deployments -> services -> handlers. Indexes are built when deserializing the data structure (this happens once in a while anyway). 77939c5 -> done in 1.4
  • Reorganize the code, moving the updater inside restate_types. Doing so allows the internal representation of Schema to be private, so it can't be leaked anywhere, and makes the code more straightforward to read.
  • Have Schema use the new data structure, with new indexes, but still store the previous one
  • Before releasing 1.5, swap the default to store the new data structure. As soon as users are on 1.5 and perform the first schema update/propagation, the new data structure will be used.
  • Big cleanup
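
To make the first step of the plan concrete, here is a minimal sketch of the idea: only the tree of deployments -> services -> handlers is stored, and the index is rebuilt when the stored shape is loaded. All type and function names here (`StoredSchema`, `Schema::from_stored`, `latest_revision`, `sample_stored`) are hypothetical and simplified for illustration; they are not the actual definitions in restate_types, and the real serialization layer is left out.

```rust
use std::collections::HashMap;

// Hypothetical, simplified types: the serialized shape contains only the tree.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredHandler {
    pub name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct StoredService {
    pub name: String,
    pub revision: u32,
    pub handlers: Vec<StoredHandler>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct StoredDeployment {
    pub id: String,
    pub services: Vec<StoredService>,
}

#[derive(Debug, Clone, PartialEq, Default)]
pub struct StoredSchema {
    pub deployments: Vec<StoredDeployment>,
}

/// In-memory materialized view: the stored tree plus derived indexes.
/// The index is never serialized, so it carries no compatibility constraints.
pub struct Schema {
    stored: StoredSchema,
    // service name -> (deployment position, service position) of the latest revision
    latest_service: HashMap<String, (usize, usize)>,
}

impl Schema {
    /// Builds the index once, when the stored shape is loaded.
    pub fn from_stored(stored: StoredSchema) -> Self {
        let mut latest_service: HashMap<String, (usize, usize)> = HashMap::new();
        for (d, dep) in stored.deployments.iter().enumerate() {
            for (s, svc) in dep.services.iter().enumerate() {
                let newer = match latest_service.get(&svc.name) {
                    Some(&(pd, ps)) => svc.revision > stored.deployments[pd].services[ps].revision,
                    None => true,
                };
                if newer {
                    latest_service.insert(svc.name.clone(), (d, s));
                }
            }
        }
        Schema { stored, latest_service }
    }

    /// Only the tree goes back to storage; the index is dropped.
    pub fn into_stored(self) -> StoredSchema {
        self.stored
    }

    /// Resolves the latest revision of a service through the derived index.
    pub fn latest_revision(&self, service: &str) -> Option<(&str, &StoredService)> {
        let &(d, s) = self.latest_service.get(service)?;
        let dep = &self.stored.deployments[d];
        Some((dep.id.as_str(), &dep.services[s]))
    }
}

/// Sample data: two deployments, each carrying one revision of "Greeter".
pub fn sample_stored() -> StoredSchema {
    let greeter = |revision| StoredService {
        name: "Greeter".to_owned(),
        revision,
        handlers: vec![StoredHandler { name: "greet".to_owned() }],
    };
    StoredSchema {
        deployments: vec![
            StoredDeployment { id: "dp_1".to_owned(), services: vec![greeter(1)] },
            StoredDeployment { id: "dp_2".to_owned(), services: vec![greeter(2)] },
        ],
    }
}
```

In this sketch every revision keeps the same metadata in the tree, so an older revision is as complete as the latest one, and a round trip through `from_stored` and `into_stored` returns the stored shape unchanged.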

All in all, what's important here is the following:

  • The core change is that we don't store the index anymore, meaning a slimmer schema, with the trade-off that on deserialization we pay a small cost for building the index.
  • No semantic changes to the Schema APIs.
  • Code reorganization, hiding things that shouldn't be public and making schema updates less error-prone.
  • A good cleanup of duplicated types.
  • All of this affects only the schema registry.
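
The "hiding things that shouldn't be public" point can be illustrated with plain Rust module privacy. This is only a sketch under assumptions: `ServiceRecord`, `ServiceResponse`, `register`, and `list` are hypothetical names, not the real restate_types or Admin API types. The storage type stays private to the module, the update logic lives next to it, and callers only ever see a separate API-facing type.

```rust
mod schema {
    // Internal storage type (hypothetical). It is private to this module,
    // so it cannot appear in any public signature.
    struct ServiceRecord {
        name: String,
        revision: u32,
        // Storage-only detail that must never reach the API.
        #[allow(dead_code)]
        storage_hint: u8,
    }

    impl ServiceRecord {
        // Explicit conversion: storage fields and API fields can evolve separately.
        fn to_response(&self) -> ServiceResponse {
            ServiceResponse {
                name: self.name.clone(),
                revision: self.revision,
            }
        }
    }

    /// API-facing type (hypothetical), decoupled from the storage layout.
    #[derive(Debug, PartialEq)]
    pub struct ServiceResponse {
        pub name: String,
        pub revision: u32,
    }

    #[derive(Default)]
    pub struct Schema {
        services: Vec<ServiceRecord>,
    }

    impl Schema {
        /// Update logic lives next to the data it mutates.
        /// Registering a service again creates the next revision.
        pub fn register(&mut self, name: &str) -> ServiceResponse {
            let next = self
                .services
                .iter()
                .filter(|s| s.name == name)
                .map(|s| s.revision)
                .max()
                .unwrap_or(0)
                + 1;
            let record = ServiceRecord {
                name: name.to_owned(),
                revision: next,
                storage_hint: 0,
            };
            let response = record.to_response();
            self.services.push(record);
            response
        }

        /// Read path: only API-facing types leave the module.
        pub fn list(&self) -> Vec<ServiceResponse> {
            self.services.iter().map(ServiceRecord::to_response).collect()
        }
    }
}
```

With this layout, renaming or dropping `storage_hint` cannot change what the API returns, because the only path from storage to the outside goes through `to_response`.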

Some more context on where this came from: #3295 (comment)
