Skip to content

Refactor AST codegen #4442

@overlookmotel

Description

@overlookmotel

I've been reading through the AST codegen (most recent version as in draft PR #4404). @rzvxa and I are meeting to discuss this evening, but just to put some thoughts down on "paper" in advance...

My opinion of current state

The codegen is doing a great job of automating creating a ton of Oxc's code now. It makes it possible to make changes to AST without having to make corresponding updates in a bunch of other places (and probably making mistakes/omissions in the process). This is hugely valuable.

However, personally I find the implementation of the codegen itself complicated and hard to understand. The primary reasons for this are:

  1. Using syn types throughout, which are unfamiliar to those of us who are not macro maestros.
  2. Mixing those syn types with our own "metadata" types, but with similar names (e.g. TypeDef, TypeRef).
  3. Lack of code comments.
  4. (in my view) excessive abstraction in some places.

Proposed refactor

I propose that we refactor it along these lines:

Generate from schema

How it is now

Currently the way codegen works (roughly speaking, I may not have understood it correctly) is:

  • Parse .rs source files and generate REnum + RStruct types, which are mostly wrappers around syn types.
  • Generate new files mostly directly from these R* types.
  • Also separately generate a Schema object which is a simpler representation of the types.

Proposal: Simplify the pipeline

  1. Read + parse the input .rs files.
  2. Build schema including all info that generators need.
  3. Generate new files from the schema (not going back to the original syn types).

Arguments against

There are valid arguments against what I'm suggesting:

  1. The current implementation is more efficient than what I'm proposing - it avoids e.g. generating syn Idents where there was already an Ident in source file AST that can be reused.
  2. Rzvxa has said that advantage of working with the syn types is to make the code flexible so it could converted easily to macros if we want to do that.

However, I feel like we are optimizing for the wrong things here.

  1. The codegen runs rarely, so its performance is not really important. Making it comprehensible and maintainable is much more valuable.
  2. I think we're fairly clear what is going to be in macros, and what in codegen (very little in macros, basically). So, as I see it, current approach is buying ourselves flexibility to do things that we're pretty clear we don't intend to do, at a cost of making everything a great deal more complicated.

Arguments for

The main argument is simplicity.

I have high hopes that the codegen is going to grow and grow. We can use it to generate more code, and also pull off some more advanced tricks which would be infeasible/unsafe to do by hand. e.g.:

  • Replace the code for inherit_variants! (including calculating an optimal set of values for the enum discriminants).
  • Then generate optimized methods which use unsafe code to take advantage of patterns in these discriminants for speed.
  • Generate Serialize impls.
  • Generate more traits.
  • Calculate type layouts (current area of focus).
  • Building mechanisms for AST transfer.

As we get into more complex areas, working from a simpler model will be very valuable. Especially when generating unsafe code, it's vitally important that the codegen creating that code is easy to understand, so can validate the soundness of its logic.

Building + using the schema

I suggest the following steps (mostly as it is now, but a little bit more structured):

  1. Compile list of all source .rs files codegen need to read.
  2. Read and parse all these files.
  3. Build a Vec of (TypeId, TypeName, Item) containing all types with #[ast] attr. Throw away everything else. TypeId is index into this Vec. Simultaneously build a HashMap<TypeName, TypeId>.
  4. Pass through that Vec again, this time fully parsing the structs + enums to create StructDef / EnumDef objects for each. "Links" between types can use TypeIds (as we can resolve type names to IDs after step 3). This is our schema.
  5. Pass the schema to each generator.
  6. Write generated files to disk.

NB: If some generators need access to original syn types for some reason, that's not impossible - they can get them via the Vec built in step 3.

Code style

Again, aiming for simplicity... While we are refactoring, I think we could also move to a more imperative style in parts. In my opinion, the "depth" of abstractions is a little excessive at present. If the structure was less labyrinthine, I think we also won't need RefCells any more, which will further simplify the code. I would hope this will make the code easier to follow.

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-cleanupCategory - technical debt or refactoring. Solution not expected to change behavior

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions