Improve performance of AST deserialization #7935
The primary goal of these changes is to reduce the total time spent in the global session's `loadBuiltinModule()`, which gets called as part of global session creation to load the core module, and thus impacts every invocation of `slangc` and every user of the Slang compiler API. The majority of the time is spent simply deserializing the core module's AST and IR and, of those two, the AST takes significantly longer to load than the IR (in the ballpark of 5x the time).

This change is focused on the serialization infrastructure but, given the performance situation described above, the focus is first and foremost on *deserialization* performance for the Slang *AST*, when using the *fossil* format. That focus shows through in the changes that have been implemented.

Change serialization framework to use `template` instead of `virtual`
=====================================================================

The recently introduced serialization framework in `slang-serialize.h` was centered around a dynamically dispatched `ISerializerImpl` interface. As a result, every single invocation of a `serialize(...)` call ultimately went through `virtual` function dispatch. While the overhead of the `virtual` calls themselves does not have a major impact on the total deserialization performance, those calls end up serving as a barrier to further optimization.

This change modifies operations that used to take a `Serializer const&` (which wraps an `ISerializerImpl*`) to instead declare a template parameter `<typename S>` and take an `S const&`.

The main consequence of the change is that `serialize()` functions for user-defined types will need to be template functions, and thus either be defined in headers (alongside the type that they serialize) or else in the specific source file that handles serialization (as is currently being done for the AST-related types in `slang-serialize-ast.cpp`).
Note that if we later decide that we want the ability to perform serialization through a dynamically dispatched interface (e.g., to easily toggle between different serialization back-ends), it will be easier to layer a dynamically dispatched implementation on top of the statically dispatched `template` version than the other way around.

Generous use of `SLANG_FORCE_INLINE`
====================================

In order to unlock further optimizations, a number of operations were marked with `SLANG_FORCE_INLINE`. It is important to note that forcing inlining like this is a big hammer, and needs to be approached with at least a little caution. The simplest cases are:

* trivial wrapper functions that just delegate to another function
* functions that only have a single call site (but exist to keep abstractions clean)

Externalize Scope for `begin`/`end` Operations
==============================================

The old `ISerializerImpl` interface had a number of paired begin/end operations that define the hierarchical structure of the data being read. Most serializer implementations (whether for reading or writing) use these operations to help maintain some kind of internal stack for tracking state in the hierarchy. The overhead of maintaining such a stack with something like a `List<T>` amortizes out over many operations, but even that overhead is unnecessary when the begin/end pairs are *already* mirroring the call stack of the code invoking serialization.

This change modifies the `ScopedSerializerFoo` types so that they each provide a piece of stack-allocated storage to the serializer back-end's `beginFoo()` and `endFoo()` operations. Currently only the `Fossil::SerialReader` is making use of that facility, but the other implementations of readers and writers in the codebase could be adapted if we ever wanted to.
Streamline `Fossil::SerialReader`
=================================

The most significant performance gains came from changes to the `Fossil::SerialReader` type, aimed at minimizing the cycles spent in the core `_readValPtr()` routine. That function used to have a large-ish `switch` statement that implemented superficially very different reading logic depending on the outer container/object being read from.

The new logic pushes more work back on the `begin` and `end` operations (which get invoked far less frequently than simple scalar/pointer values get read), so that they always set up the state of the reader with direct pointers to the data and layout for the next fossilized value to be read.

The remaining work in `_readValPtr()` has been factored into a different subroutine, `_advanceCursor()`, that takes responsibility for advancing the data pointer and updating the various other fields. The `_advanceCursor()` routine is still messier than is ideal, because it has to deal with the various different kinds of logic required for navigating to the next value.

Various other conditionals inside the `SerialReader` implementation were streamlined, mostly by collapsing the `State::Type` enumeration down to only represent the cases that are truly semantically distinct.

Evaluated: Streamline Layout Rules for Fossil
=============================================

One potential approach that I implemented but then reverted (after finding it had little to no performance impact) was changing the fossil format to always write things with 4-byte alignment/granularity. That would mean values smaller than 4 bytes would get inflated to a full 4 bytes, and scalar values larger than 4 bytes would get written with only 4-byte alignment (requiring unaligned loads to read them).
I found that the only way to take advantage of the simplified layout rules to improve read performance would be to more or less eliminate the use of the layout information embedded in the fossil data, which would make it very difficult to validate that the data is correctly structured.

Possible Future Work: Further Type Specialization
=================================================

As it stands, the biggest overhead remaining on the critical path of `_readValPtr()` is the way the `_advanceCursor()` logic needs to take different approaches depending on the type of the surrounding context (advancing through elements of a container is very different than advancing through fields of a `struct`, for example). The interesting thing to note is that at the use site within a `serialize()` function, it is usually manifestly obvious which case applies. If the code uses `SLANG_SCOPED_SERIALIZER_ARRAY` it is in a container, while if it uses `SLANG_SCOPED_SERIALIZER_STRUCT` it is in a struct. This means that the contextual information is statically available, but just isn't exposed in a way that lets the core reading logic take advantage of it.

A logical extension of the work here would be to expand on the `Scope` idea added in this change such that most of the serialization operations (`handleInt32`, `handleString`, etc.) are actually dispatched through the scope, and then have each of the `SLANG_SCOPED_SERIALIZER_...` macros instantiate a *different* scope type (still dependent on the serializer).
To get concrete, in my own testing (on my dev machine, for a release x64 Windows build) the changes in this PR bring the total time for …
/format

🌈 Formatted, please merge the changes from this PR
```diff
 private:
-    Serializer _serializer;
+    S _serializer;
```
Should we `static_assert` that the size is less than 64 bytes?
At the moment the code is all statically specialized via templates, so there is no need to enforce that constraint. I would expect that if/when we introduce an adapter type like:

```cpp
template<typename T>
class PolymorphicSerializerImpl : ISerializerImpl { ... }
```

that adapts a concrete serializer implementation like `Fossil::SerialReader` to form a type that implements the virtual functions of `ISerializerImpl`, that is the place where we'd inject the `static_assert`s that the `T::Scope` type is small enough, because the `PolymorphicSerializerImpl` would be the thing that has to implement storing a `T::Scope` in the one-size-fits-all structure.
Overview
--------

This change basically just flips a `#define` switch to enable the changes that were already checked in with PR shader-slang#7482. That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place. PR shader-slang#8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature.

Impact
------

Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build. The key performance stats from a build before this change are:

```
[*] loadBuiltinModule         1   254.29ms
[*] checkAllTranslationUnits  1     6.14ms
```

After this change, we see:

```
[*] loadBuiltinModule         1    91.75ms
[*] checkAllTranslationUnits  1    11.40ms
```

This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected). Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain.

For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized. Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used).

Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded.
After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded. Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further. The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further.

Relationship to PR shader-slang#7935
------------------------------------

PR shader-slang#7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance. On a branch with those optimizations in place (but not this change), the corresponding timings are:

```
[*] loadBuiltinModule         1   176.62ms
[*] checkAllTranslationUnits  1     6.04ms
```

It remains to be seen how performance fares when this change and the optimizations in PR shader-slang#7935 are combined. In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem. We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
re-approval