-
-
Notifications
You must be signed in to change notification settings - Fork 647
Description
I've been reading through the AST codegen (most recent version as in draft PR #4404). @rzvxa and I are meeting to discuss this evening, but just to put some thoughts down on "paper" in advance...
My opinion of current state
The codegen is doing a great job of automating creating a ton of Oxc's code now. It makes it possible to make changes to AST without having to make corresponding updates in a bunch of other places (and probably making mistakes/omissions in the process). This is hugely valuable.
However, personally I find the implementation of the codegen itself complicated and hard to understand. The primary reasons for this are:
- Using
syn
types throughout, which are unfamiliar to those of us who are not macro maestros. - Mixing those
syn
types with our own "metadata" types, but with similar names (e.g.TypeDef
,TypeRef
). - Lack of code comments.
- (in my view) excessive abstraction in some places.
Proposed refactor
I propose that we refactor it along these lines:
Generate from schema
How it is now
Currently the way codegen works (roughly speaking, I may not have understood it correctly) is:
- Parse
.rs
source files and generateREnum
+RStruct
types, which are mostly wrappers aroundsyn
types. - Generate new files mostly directly from these
R*
types. - Also separately generate a
Schema
object which is a simpler representation of the types.
Proposal: Simplify the pipeline
- Read + parse the input
.rs
files. - Build schema including all info that generators need.
- Generate new files from the schema (not going back to the original
syn
types).
Arguments against
There are valid arguments against what I'm suggesting:
- The current implementation is more efficient than what I'm proposing - it avoids e.g. generating
syn
Ident
s where there was already anIdent
in source file AST that can be reused. - Rzvxa has said that advantage of working with the
syn
types is to make the code flexible so it could converted easily to macros if we want to do that.
However, I feel like we are optimizing for the wrong things here.
- The codegen runs rarely, so its performance is not really important. Making it comprehensible and maintainable is much more valuable.
- I think we're fairly clear what is going to be in macros, and what in codegen (very little in macros, basically). So, as I see it, current approach is buying ourselves flexibility to do things that we're pretty clear we don't intend to do, at a cost of making everything a great deal more complicated.
Arguments for
The main argument is simplicity.
I have high hopes that the codegen is going to grow and grow. We can use it to generate more code, and also pull off some more advanced tricks which would be infeasible/unsafe to do by hand. e.g.:
- Replace the code for
inherit_variants!
(including calculating an optimal set of values for the enum discriminants). - Then generate optimized methods which use unsafe code to take advantage of patterns in these discriminants for speed.
- Generate
Serialize
impls. - Generate more traits.
- Calculate type layouts (current area of focus).
- Building mechanisms for AST transfer.
As we get into more complex areas, working from a simpler model will be very valuable. Especially when generating unsafe
code, it's vitally important that the codegen creating that code is easy to understand, so can validate the soundness of its logic.
Building + using the schema
I suggest the following steps (mostly as it is now, but a little bit more structured):
- Compile list of all source
.rs
files codegen need to read. - Read and parse all these files.
- Build a
Vec
of(TypeId, TypeName, Item)
containing all types with#[ast]
attr. Throw away everything else.TypeId
is index into thisVec
. Simultaneously build aHashMap<TypeName, TypeId>
. - Pass through that
Vec
again, this time fully parsing the structs + enums to createStructDef
/EnumDef
objects for each. "Links" between types can useTypeId
s (as we can resolve type names to IDs after step 3). This is our schema. - Pass the schema to each generator.
- Write generated files to disk.
NB: If some generators need access to original syn
types for some reason, that's not impossible - they can get them via the Vec
built in step 3.
Code style
Again, aiming for simplicity... While we are refactoring, I think we could also move to a more imperative style in parts. In my opinion, the "depth" of abstractions is a little excessive at present. If the structure was less labyrinthine, I think we also won't need RefCell
s any more, which will further simplify the code. I would hope this will make the code easier to follow.