
Conversation

tangent-vector
Contributor


The primary goal of these changes is to reduce the total time spent in the global session's `loadBuiltinModule()`, which gets called as part of global session creation to load the core module, and thus impacts every invocation of `slangc` and every user of the Slang compiler API.
The majority of the time is spent simply deserializing the core module's AST and IR and, of those two, the AST takes significantly longer to load than the IR (in the ballpark of 5x the time).

This change is focused on the serialization infrastructure but, given the performance situation described above, the focus is first and foremost on *deserialization* performance for the Slang *AST*, when using the *fossil* format.
That focus shows through in the changes that have been implemented.

Change serialization framework to use `template` instead of `virtual`
=====================================================================

The recently-introduced serialization framework in `slang-serialize.h` was centered around a dynamically-dispatched `ISerializerImpl` interface.
As a result, every single invocation of a `serialize(...)` call ultimately went through `virtual` function dispatch.
While the overhead of the `virtual` calls themselves does not have a major impact on the total deserialization performance, those calls end up serving as a barrier to further optimization.

This change modifies operations that used to take a `Serializer const&` (which wraps an `ISerializerImpl*`) so that they instead declare a template parameter `<typename S>` and take an `S const&`.

The main consequence of the change is that `serialize()` functions for user-defined types will need to be template functions, and thus either be defined in headers (alongside the type that they serialize) or else in the specific source file that handles serialization (as is currently being done for the AST-related types in `slang-serialize-ast.cpp`).

Note that if we later decide that we want the ability to perform serialization through a dynamically-dispatched interface (e.g., to easily toggle between different serialization back-ends), it will be easier to layer a dynamically-dispatched implementation on top of the statically-dispatched `template` version than the other way around.

Generous use of `SLANG_FORCE_INLINE`
====================================

In order to unlock further optimizations, a bunch of operations were marked with `SLANG_FORCE_INLINE`.
It is important to note that forcing inlining like this is a big hammer, and needs to be approached with at least a little caution.

The simplest cases are:

* trivial wrapper functions that just delegate to another function

* functions that only have a single call site (but exist to keep abstractions clean)

Externalize Scope for `begin`/`end` Operations
==============================================

The old `ISerializerImpl` interface had a bunch of paired begin/end operations that defined the hierarchical structure of the data being read.
Most serializer implementations (whether for reading or writing) use these operations to help maintain some kind of internal stack for tracking state in the hierarchy.

The overhead of maintaining such a stack with something like a `List<T>` amortizes out over many operations, but even that overhead is unnecessary when the begin/end pairs are *already* mirroring the call stack of the code invoking serialization.

This change modifies the `ScopedSerializerFoo` types so that they each provide a piece of stack-allocated storage to the serializer back-end's `beginFoo()` and `endFoo()` operations.
Currently only the `Fossil::SerialReader` is making use of that facility, but the other implementations of readers and writers in the codebase could be adapted if we ever wanted to.

Streamline `Fossil::SerialReader`
=================================

The most significant performance gains came from changes to the `Fossil::SerialReader` type, aimed at minimizing the cycles spent in the core `_readValPtr()` routine.
That function used to contain a large-ish `switch` statement that implemented substantially different reading logic depending on the outer container/object being read from.

The new logic pushes more work back on the `begin` and `end` operations (which get invoked far less frequently than simple scalar/pointer values get read), so that they always set up the state of the reader with direct pointers to the data and layout for the next fossilized value to be read.

The remaining work in `_readValPtr()` has been factored into a separate subroutine, `_advanceCursor()`, that takes responsibility for advancing the data pointer and updating the various other fields.
The `_advanceCursor()` routine is still messier than is ideal, because it has to deal with the various different kinds of logic required for navigating to the next value.

Various other conditionals inside the `SerialReader` implementation were streamlined, mostly by collapsing the `State::Type` enumeration down to only represent the cases that are truly semantically distinct.

Evaluated: Streamline Layout Rules for Fossil
=============================================

One potential approach that I implemented but then reverted (after finding it had little to no performance impact) was changing the fossil format to always write things with 4-byte alignment/granularity.
That would mean values smaller than 4 bytes would get inflated to a full 4 bytes, and scalar values larger than 4 bytes get written with only 4-byte alignment (requiring unaligned loads to read them).

I found that the only way to take advantage of the simplified layout rules to improve read performance would be to more-or-less eliminate the use of the layout information embedded in the fossil data, which would make it very difficult to validate that the data is correctly structured.

Possible Future Work: Further Type Specialization
=================================================

As it stands, the biggest overhead remaining on the critical path of `_readValPtr()` is the way the `_advanceCursor()` logic needs to take different approaches depending on the type of the surrounding context (advancing through elements of a container is very different than advancing through fields of a `struct`, for example).

The interesting thing to note is that at the use site within a `serialize()` function, it is usually manifestly obvious which case something is in.
If the code uses `SLANG_SCOPED_SERIALIZER_ARRAY` it is in a container, while if it uses `SLANG_SCOPED_SERIALIZER_STRUCT` it is in a struct.
This means that the contextual information is statically available, but just isn't exposed in a way that lets the core reading logic take advantage of it.

A logical extension of the work here would be to expand on the `Scope` idea added in this change such that most of the serialization operations (`handleInt32`, `handleString`, etc.) are actually dispatched through the scope, and then have each of the `SLANG_SCOPED_SERIALIZER_...` macros instantiate a *different* scope type (still dependent on the serializer).
@tangent-vector tangent-vector requested a review from a team as a code owner July 25, 2025 23:49
@tangent-vector tangent-vector added the pr: non-breaking PRs without breaking changes label Jul 25, 2025
@tangent-vector
Contributor Author

To get concrete, in my own testing (on my dev machine, for a release x64 Windows build) the changes in this PR bring the total time for loadBuiltinModule() down from ~250ms to ~180ms. The time spent in readSerializedModuleAST specifically went from ~200ms to ~140ms.

@tangent-vector
Contributor Author

/format

@slangbot
Contributor

🌈 Formatted, please merge the changes from this PR

@tangent-vector
Contributor Author

/format

@slangbot
Contributor

🌈 Formatted, please merge the changes from this PR

csyonghe
csyonghe previously approved these changes Jul 29, 2025

```diff
 private:
-    Serializer _serializer;
+    S _serializer;
```
Collaborator

Should we static_assert the size is less than 64 bytes?

Contributor Author

At the moment the code is all statically specialized via templates so there is no need to enforce that constraint.

I would expect that if/when we introduce an adapter type like:

```cpp
template<typename T>
class PolymorphicSerializerImpl : public ISerializerImpl { ... };
```

that adapts a concrete serializer implementation like Fossil::SerialReader to form a type that implements the virtual functions of ISerializerImpl, that is the place where we'd inject the static_asserts that the T::Scope type is small enough, because the PolymorphicSerializerImpl would be the thing that has to implement storing a T::Scope in the one-size-fits-all structure.

@tangent-vector tangent-vector requested review from expipiplus1 and a team and removed request for kaizhangNV July 29, 2025 21:13
@tangent-vector tangent-vector enabled auto-merge July 29, 2025 21:19
csyonghe
csyonghe previously approved these changes Jul 29, 2025
expipiplus1
expipiplus1 previously approved these changes Jul 30, 2025
@tangent-vector tangent-vector added this pull request to the merge queue Jul 30, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 30, 2025
@csyonghe csyonghe added this pull request to the merge queue Jul 31, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 31, 2025
@csyonghe csyonghe added this pull request to the merge queue Jul 31, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Jul 31, 2025
tangent-vector added a commit to tangent-vector/slang that referenced this pull request Aug 6, 2025
Overview
--------

This change basically just flips a `#define` switch to enable the changes that were already checked in with PR shader-slang#7482.
That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place.
PR shader-slang#8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature.

Impact
------

Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build.
The key performance stats from a build before this change are:

```
[*]              loadBuiltinModule      1         254.29ms
[*]       checkAllTranslationUnits      1           6.14ms
```

After this change, we see:

```
[*]              loadBuiltinModule      1          91.75ms
[*]       checkAllTranslationUnits      1          11.40ms
```

This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected).
Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain.

For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized.
Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used).

Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded.
After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded.
Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further.
The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further.

Relationship to PR shader-slang#7935
------------------------

PR shader-slang#7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance.
On a branch with those optimizations in place (but not this change), the corresponding timings are:

```
[*]              loadBuiltinModule      1         176.62ms
[*]       checkAllTranslationUnits      1           6.04ms
```

It remains to be seen how performance fares when this change and the optimizations in PR shader-slang#7935 are combined.
In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem.
We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
@tangent-vector tangent-vector dismissed stale reviews from expipiplus1 and csyonghe via 74887f2 August 6, 2025 21:33
github-merge-queue bot pushed a commit that referenced this pull request Aug 6, 2025
Collaborator

@ArielG-NV ArielG-NV left a comment

re-approval

@tangent-vector tangent-vector added this pull request to the merge queue Aug 7, 2025
Merged via the queue into shader-slang:master with commit 631a4c3 Aug 7, 2025
18 checks passed