
Conversation

tangent-vector
Contributor


The primary goal of these changes is to reduce the total time spent in the global session's `loadBuiltinModule()`, which gets called as part of global session creation to load the core module, and thus impacts every invocation of `slangc` and every user of the Slang compiler API.
The majority of the time is spent simply deserializing the core module's AST and IR and, of those two, the AST takes significantly longer to load than the IR (in the ballpark of 5x the time).

This change is focused on the serialization infrastructure but, given the performance situation described above, the focus is first and foremost on *deserialization* performance for the Slang *AST*, when using the *fossil* format.
That focus shows through in the changes that have been implemented.

Change serialization framework to use `template` instead of `virtual`
=====================================================================

The recently-introduced serialization framework in `slang-serialize.h` was centered around a dynamically-dispatched `ISerializerImpl` interface.
As a result, every single invocation of a `serialize(...)` call ultimately went through `virtual` function dispatch.
While the overhead of the `virtual` calls themselves does not have a major impact on the total deserialization performance, those calls end up serving as a barrier to further optimization.

This change modifies operations that used to take a `Serializer const&` (which wraps an `ISerializerImpl*`) so that they instead declare a template parameter `<typename S>` and take an `S const&`.

The main consequence of the change is that `serialize()` functions for user-defined types will need to be template functions, and thus either be defined in headers (alongside the type that they serialize) or else in the specific source file that handles serialization (as is currently being done for the AST-related types in `slang-serialize-ast.cpp`).

Note that if we later decide that we want the ability to perform serialization through a dynamically-dispatched interface (e.g., to easily toggle between different serialization back-ends), it will be easier to layer a dynamically-dispatched implementation on top of the statically-dispatched `template` version than the other way around.

Generous use of `SLANG_FORCE_INLINE`
====================================

In order to unlock further optimizations, a bunch of operations were marked with `SLANG_FORCE_INLINE`.
It is important to note that forcing inlining like this is a big hammer, and needs to be approached with at least a little caution.

The simplest cases are:

* trivial wrapper functions that just delegate to another function

* functions that only have a single call site (but exist to keep abstractions clean)

Externalize Scope for `begin`/`end` Operations
==============================================

The old `ISerializerImpl` interface had a bunch of paired begin/end operations that defined the hierarchical structure of the data being read.
Most serializer implementations (whether for reading or writing) use these operations to help maintain some kind of internal stack for tracking state in the hierarchy.

The overhead of maintaining such a stack with something like a `List<T>` amortizes out over many operations, but even that overhead is unnecessary when the begin/end pairs are *already* mirroring the call stack of the code invoking serialization.

This change modifies the `ScopedSerializerFoo` types so that they each provide a piece of stack-allocated storage to the serializer back-end's `beginFoo()` and `endFoo()` operations.
Currently only the `Fossil::SerialReader` is making use of that facility, but the other implementations of readers and writers in the codebase could be adapted if we ever wanted to.

Streamline `Fossil::SerialReader`
=================================

The most significant performance gains came from changes to the `Fossil::SerialReader` type, aimed at minimizing the cycles spent in the core `_readValPtr()` routine.
That function used to contain a large-ish `switch` statement that implemented substantially different reading logic depending on the outer container/object being read from.

The new logic pushes more work back on the `begin` and `end` operations (which get invoked far less frequently than simple scalar/pointer values get read), so that they always set up the state of the reader with direct pointers to the data and layout for the next fossilized value to be read.

The remaining work in `_readValPtr()` has been factored into a separate subroutine, `_advanceCursor()`, that takes responsibility for advancing the data pointer and updating the various other fields.
The `_advanceCursor()` routine is still messier than is ideal, because it has to deal with the various different kinds of logic required for navigating to the next value.

Various other conditionals inside the `SerialReader` implementation were streamlined, mostly by collapsing the `State::Type` enumeration down to only represent the cases that are truly semantically distinct.

Evaluated: Streamline Layout Rules for Fossil
=============================================

One potential approach that I implemented but then reverted (after finding it had little to no performance impact) was changing the fossil format to always write things with 4-byte alignment/granularity.
That would mean values smaller than 4 bytes would get inflated to a full 4 bytes, and scalar values larger than 4 bytes get written with only 4-byte alignment (requiring unaligned loads to read them).

I found that the only way to take advantage of the simplified layout rules to improve read performance would be to more-or-less eliminate the use of the layout information embedded in the fossil data, which would make it very difficult to validate that the data is correctly structured.

Possible Future Work: Further Type Specialization
=================================================

As it stands, the biggest overhead remaining on the critical path of `_readValPtr()` is the way the `_advanceCursor()` logic needs to take different approaches depending on the type of the surrounding context (advancing through elements of a container is very different than advancing through fields of a `struct`, for example).

The interesting thing to note is that at the use site within a `serialize()` function, it is usually manifestly obvious which case something is in.
If the code uses `SLANG_SCOPED_SERIALIZER_ARRAY` it is in a container, while if it uses `SLANG_SCOPED_SERIALIZER_STRUCT` it is in a struct.
This means that the contextual information is statically available, but just isn't exposed in a way that lets the core reading logic take advantage of it.

A logical extension of the work here would be to expand on the `Scope` idea added in this change such that most of the serialization operations (`handleInt32`, `handleString`, etc.) are actually dispatched through the scope, and then have each of the `SLANG_SCOPED_SERIALIZER_...` macros instantiate a *different* scope type (still dependent on the serializer).
@tangent-vector tangent-vector requested a review from a team as a code owner July 25, 2025 23:49
@tangent-vector tangent-vector added the pr: non-breaking PRs without breaking changes label Jul 25, 2025
@tangent-vector
Contributor Author

To get concrete, in my own testing (on my dev machine, for a release x64 Windows build) the changes in this PR bring the total time for loadBuiltinModule() down from ~250ms to ~180ms. The time spent in readSerializedModuleAST specifically went from ~200ms to ~140ms.

@tangent-vector
Contributor Author

/format

@slangbot
Contributor

🌈 Formatted, please merge the changes from this PR

@tangent-vector
Contributor Author

/format

@slangbot
Contributor

🌈 Formatted, please merge the changes from this PR

csyonghe
csyonghe previously approved these changes Jul 29, 2025

```diff
 private:
-    Serializer _serializer;
+    S _serializer;
```
Collaborator

Should we static_assert the size is less than 64 bytes?

Contributor Author

At the moment the code is all statically specialized via templates so there is no need to enforce that constraint.

I would expect that if/when we introduce an adapter type like:

```cpp
template<typename T>
class PolymorphicSerializerImpl : public ISerializerImpl { ... };
```

that adapts a concrete serializer implementation like Fossil::SerialReader to form a type that implements the virtual functions of ISerializerImpl, that is the place where we'd inject the static_asserts that the T::Scope type is small enough, because the PolymorphicSerializerImpl would be the thing that has to implement storing a T::Scope in the one-size-fits-all structure.

@tangent-vector tangent-vector requested review from expipiplus1 and a team and removed request for kaizhangNV July 29, 2025 21:13
@tangent-vector tangent-vector enabled auto-merge July 29, 2025 21:19
csyonghe
csyonghe previously approved these changes Jul 29, 2025
expipiplus1
expipiplus1 previously approved these changes Jul 30, 2025
@tangent-vector tangent-vector added this pull request to the merge queue Jul 30, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 30, 2025
@csyonghe csyonghe added this pull request to the merge queue Jul 31, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 31, 2025
@csyonghe csyonghe added this pull request to the merge queue Jul 31, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Jul 31, 2025
tangent-vector added a commit to tangent-vector/slang that referenced this pull request Aug 6, 2025
Overview
--------

This change basically just flips a `#define` switch to enable the changes that were already checked in with PR shader-slang#7482.
That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place.
PR shader-slang#8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature.

Impact
------

Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build.
The key performance stats from a build before this change are:

```
[*]              loadBuiltinModule      1         254.29ms
[*]       checkAllTranslationUnits      1           6.14ms
```

After this change, we see:

```
[*]              loadBuiltinModule      1          91.75ms
[*]       checkAllTranslationUnits      1          11.40ms
```

This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected).
Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain.

For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized.
Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used).

Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded.
After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded.
Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further.
The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further.

Relationship to PR shader-slang#7935
------------------------

PR shader-slang#7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance.
On a branch with those optimizations in place (but not this change), the corresponding timings are:

```
[*]              loadBuiltinModule      1         176.62ms
[*]       checkAllTranslationUnits      1           6.04ms
```

It remains to be seen how performance fares when this change and the optimizations in PR shader-slang#7935 are combined.
In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem.
We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
@tangent-vector tangent-vector dismissed stale reviews from expipiplus1 and csyonghe via 74887f2 August 6, 2025 21:33
github-merge-queue bot pushed a commit that referenced this pull request Aug 6, 2025
Collaborator

@ArielG-NV ArielG-NV left a comment

re-approval

@tangent-vector tangent-vector added this pull request to the merge queue Aug 7, 2025
Merged via the queue into shader-slang:master with commit 631a4c3 Aug 7, 2025
18 checks passed