Refactor AST codegen

I've been reading through the AST codegen (most recent version as in draft PR #4404). @rzvxa and I are meeting to discuss this evening, but just to put some thoughts down on "paper" in advance...

## My opinion of current state

The codegen is now doing a *great* job of automating the creation of a ton of Oxc's code. It makes it possible to change the AST without having to make corresponding updates in a bunch of other places (and probably making mistakes/omissions in the process). This is hugely valuable.

However, personally I find the implementation of the codegen itself complicated and hard to understand. The primary reasons for this are:

1. Using `syn` types throughout, which are unfamiliar to those of us who are not macro maestros.
2. Mixing those `syn` types with our own "metadata" types, but with similar names (e.g. `TypeDef`, `TypeRef`).
3. Lack of code comments.
4. (in my view) excessive abstraction in some places.

## Proposed refactor

I propose that we refactor it along these lines:

### Generate from schema

#### How it is now

Currently the way codegen works (roughly speaking, I may not have understood it correctly) is:

* Parse `.rs` source files and generate `REnum` + `RStruct` types, which are mostly wrappers around `syn` types.
* Generate new files mostly directly from these `R*` types.
* Also separately generate a `Schema` object which is a simpler representation of the types.

#### Proposal: Simplify the pipeline

1. Read + parse the input `.rs` files.
2. Build schema including *all* info that generators need.
3. Generate new files from the schema (not going back to the original `syn` types).
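To make step 3 concrete, here is a minimal sketch of a generator that works only from schema data and never touches `syn`. All names here (`Schema`, `StructDef`, `FieldDef`, `generate_field_names`) and the `FIELD_NAMES` output are hypothetical, chosen for illustration; they are not the types or output of the current codegen.

```rust
// Hypothetical, simplified schema types. Names are illustrative only,
// not the types used in the actual codegen.
#[derive(Debug)]
pub struct FieldDef {
    pub name: String,
    pub type_name: String,
}

#[derive(Debug)]
pub struct StructDef {
    pub name: String,
    pub fields: Vec<FieldDef>,
}

#[derive(Debug)]
pub struct Schema {
    pub structs: Vec<StructDef>,
}

/// A hypothetical generator. It reads plain schema data and emits Rust source
/// text: one `impl` per struct with a `FIELD_NAMES` constant listing its fields.
pub fn generate_field_names(schema: &Schema) -> String {
    let mut out = String::new();
    for def in &schema.structs {
        // `{:?}` on a `String` produces a quoted string literal.
        let names: Vec<String> = def.fields.iter().map(|f| format!("{:?}", f.name)).collect();
        out.push_str(&format!(
            "impl {} {{\n    pub const FIELD_NAMES: &'static [&'static str] = &[{}];\n}}\n",
            def.name,
            names.join(", ")
        ));
    }
    out
}
```

The point of the sketch is that the generator only ever sees strings and plain structs, so reading it requires no knowledge of `syn`.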

#### Arguments against

There are valid arguments *against* what I'm suggesting:

1. The current implementation is more efficient than what I'm proposing - it avoids e.g. generating `syn` `Ident`s where there was already an `Ident` in source file AST that can be reused.
2. @rzvxa has said that an advantage of working with the `syn` types is that it keeps the code flexible, so it could be converted easily to macros if we want to do that.

However, I feel like we are optimizing for the wrong things here.

1. The codegen runs rarely, so its performance is not really important. Making it comprehensible and maintainable is much more valuable.
2. I think we're fairly clear what is going to be in macros, and what in codegen (very little in macros, basically). So, as I see it, the current approach is buying us flexibility to do things that we're pretty clear we don't intend to do, at the cost of making everything a great deal more complicated.

#### Arguments for

The main argument is simplicity.

I have high hopes that the codegen is going to grow and grow. We can use it to generate more code, and also pull off some more advanced tricks which would be infeasible/unsafe to do by hand. e.g.:

* Replace the code for `inherit_variants!` (including calculating an optimal set of values for the enum discriminants).
* Then generate optimized methods which use unsafe code to take advantage of patterns in these discriminants for speed.
* Generate `Serialize` impls.
* Generate more traits.
* Calculate type layouts (current area of focus).
* Build mechanisms for AST transfer.

As we get into more complex areas, working from a simpler model will be very valuable. Especially when generating `unsafe` code, it's vitally important that the codegen creating that code is easy to understand, so that we can validate the soundness of its logic.
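As an illustration of the kind of logic that is easier to audit on a simple model, here is a rough sketch of a layout calculation. It assumes fields are laid out in declaration order with padding, as for `#[repr(C)]` structs (the default Rust representation makes no such guarantee). `Layout`, `align_up` and `struct_layout` are hypothetical names, not existing codegen APIs.

```rust
/// Size and alignment of a type, in bytes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Layout {
    pub size: usize,
    pub align: usize,
}

/// Round `offset` up to the next multiple of `align`.
/// `align` must be a power of two.
fn align_up(offset: usize, align: usize) -> usize {
    (offset + align - 1) & !(align - 1)
}

/// Compute each field's offset and the overall layout of a struct.
/// Assumption: fields are placed in declaration order, as for `#[repr(C)]`.
pub fn struct_layout(fields: &[Layout]) -> (Vec<usize>, Layout) {
    let mut offsets = Vec::with_capacity(fields.len());
    let mut size = 0;
    let mut align = 1;
    for field in fields {
        // Insert padding so the field starts at a multiple of its alignment.
        let offset = align_up(size, field.align);
        offsets.push(offset);
        size = offset + field.size;
        // The struct's alignment is the largest alignment of any field.
        align = align.max(field.align);
    }
    // Pad the end so the total size is a multiple of the struct's alignment.
    (offsets, Layout { size: align_up(size, align), align })
}
```

With a schema that records each field's type, a generator could feed the field layouts through a function like this and compare the result against `std::mem::size_of` / `align_of` in a generated test.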

### Building + using the schema

I suggest the following steps (mostly as it is now, but a little bit more structured):

1. Compile a list of all source `.rs` files the codegen needs to read.
2. Read and parse all these files.
3. Build a `Vec` of `(TypeId, TypeName, Item)` containing all types with `#[ast]` attr. Throw away everything else. `TypeId` is index into this `Vec`. Simultaneously build a `HashMap<TypeName, TypeId>`.
4. Pass through that `Vec` again, this time fully parsing the structs + enums to create `StructDef` / `EnumDef` objects for each. "Links" between types can use `TypeId`s (as we can resolve type names to IDs after step 3). This is our schema.
5. Pass the schema to each generator.
6. Write generated files to disk.

NB: If some generators need access to original `syn` types for some reason, that's not impossible - they can get them via the `Vec` built in step 3.
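Steps 3 and 4 could look roughly like the sketch below. To keep it self-contained, `Item` is a hypothetical stand-in for a parsed source item (the real codegen would hold a `syn` item there), the input is assumed to already be filtered to `#[ast]` types, and a `TypeId` is simply the item's index rather than being stored in a tuple. `MemberDef`, `TypeDef` and `build_schema` are likewise illustrative names.

```rust
use std::collections::HashMap;

/// Index of a type in the list collected during the first pass.
pub type TypeId = usize;

/// Hypothetical stand-in for a parsed item. Each member is a
/// `(member name, type name)` pair.
pub enum Item {
    Struct { name: String, fields: Vec<(String, String)> },
    Enum { name: String, variants: Vec<(String, String)> },
}

impl Item {
    fn name(&self) -> &str {
        match self {
            Item::Struct { name, .. } | Item::Enum { name, .. } => name,
        }
    }
}

/// A struct field or enum variant. `type_id` is `None` when the type is not
/// part of the schema (e.g. a primitive such as `bool`).
pub struct MemberDef {
    pub name: String,
    pub type_id: Option<TypeId>,
}

pub struct StructDef {
    pub name: String,
    pub fields: Vec<MemberDef>,
}

pub struct EnumDef {
    pub name: String,
    pub variants: Vec<MemberDef>,
}

pub enum TypeDef {
    Struct(StructDef),
    Enum(EnumDef),
}

pub struct Schema {
    pub defs: Vec<TypeDef>,
    pub ids: HashMap<String, TypeId>,
}

/// Resolve each member's type name to a `TypeId` using the name map.
fn resolve(ids: &HashMap<String, TypeId>, members: &[(String, String)]) -> Vec<MemberDef> {
    members
        .iter()
        .map(|(name, ty)| MemberDef { name: name.clone(), type_id: ids.get(ty).copied() })
        .collect()
}

pub fn build_schema(items: &[Item]) -> Schema {
    // Pass 1 (step 3): give every type an ID and record name -> ID.
    let ids: HashMap<String, TypeId> =
        items.iter().enumerate().map(|(id, item)| (item.name().to_string(), id)).collect();

    // Pass 2 (step 4): build the defs. All names are known by now, so
    // references to types declared later in the source resolve correctly.
    let defs = items
        .iter()
        .map(|item| match item {
            Item::Struct { name, fields } => {
                TypeDef::Struct(StructDef { name: name.clone(), fields: resolve(&ids, fields) })
            }
            Item::Enum { name, variants } => {
                TypeDef::Enum(EnumDef { name: name.clone(), variants: resolve(&ids, variants) })
            }
        })
        .collect();

    Schema { defs, ids }
}
```

Because links between types are plain `TypeId` indices rather than shared references, nothing in this sketch needs `Rc` or `RefCell`, even when types refer to each other cyclically.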

### Code style

Again, aiming for simplicity... While we are refactoring, I think we could also move to a more imperative style in parts. In my opinion, the "depth" of abstractions is a little excessive at present. If the structure were less labyrinthine, I think we also wouldn't need `RefCell`s any more, which would further simplify the code. I would hope this will make the code easier to follow.

Refactor AST codegen #4442
