Description
Potentially excessive memory allocations made by Serilog under high load.
Reproduction
I can provide samples as necessary. Serilog has been in use in a large environment for over eight years, and I am not seeing anything that would obviously contribute to this issue other than possibly volume of use, which is why I want to start the discussion.
Relevant package, tooling and runtime versions
The investigation started after a few OOM kills in Kubernetes, but the behavior can be reproduced outside of Kubernetes (a Windows dev box under the right conditions).
Serilog 8.0.0
Serilog.Sinks.Seq 8.0.0
Serilog.Enrichers.Environment 3.0.1
Serilog.AspNetCore 8.0.1
Serilog.Extensions.Hosting 8.0.0
Serilog.Extensions.Logging 8.0.0
Additional context
I have attached some interesting dotMemory runs from a local development environment (reproducible) that resulted in 233,000 messages reaching Seq. Interestingly, the expected number was closer to 1M; I verified that no messages were dropped by the WAF and found no identifiable throttling from Seq, though that is not the reason for this ticket.
As mentioned above, we make heavy use of Serilog throughout our architecture. Additionally, we use Seq and a few other sinks, though no other sinks are in scope for this discussion.
What we witnessed during an OOM kill event of a production pod was a substantial number of allocations from Serilog and the Seq sink, largely tied to serialization, the collections for properties attached to those messages, and the StringBuilders used to create the payloads being emitted.
I have included the visuals below. In the reports I identified two enrichers we have implemented: one involving the HttpRequestContext to obtain the display URL (we needed behavior that differed from the provided enricher for web requests), and another containing environmental details (pod name, etc.). Since all of that environmental data is static, there was a missed opportunity to pre-create the properties (the CreateProperty calls) just once; we will be making that change, and I expect that enricher to drop off the list.
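For reference, the fix I am planning looks roughly like the following sketch: because the enricher's values are static for the lifetime of the process, the LogEventProperty can be created once and reused on every event instead of being re-allocated per message. The PodNameEnricher class name and POD_NAME variable are illustrative, not our actual implementation.

```csharp
using System;
using Serilog.Core;
using Serilog.Events;

// Sketch: cache the LogEventProperty since the underlying value never changes.
class PodNameEnricher : ILogEventEnricher
{
    LogEventProperty _cached; // created lazily on first use, then reused

    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        _cached ??= propertyFactory.CreateProperty(
            "PodName", Environment.GetEnvironmentVariable("POD_NAME"));
        logEvent.AddPropertyIfAbsent(_cached);
    }
}
```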
All of the memory allocations, which put significant pressure on the GC and in more than one case led to an OOM kill event, are driven by simple log messages in a foreach loop: one is a simple message (no template parameters, just a string) and the other has a property within the template (using the @ destructuring operator) that is serialized. The DTO itself is relatively small but contains about 20 properties (strings, ints, GUIDs), none of which contain much data.
Design decisions aside, this came to light due to a file that was uploaded externally and had quite a few rows in it (500K to be exact), all but one of which were empty lines, which meant the DTOs in question likewise held empty/default values. All told, there were roughly 1M log messages (a lot, yes) during this period of time (~90s). It boils down to a foreach loop that logs one string message (no template parameters) per item in the file (i.e. "Add a row to the list") and another that has the DTO in its template; that DTO is the smallish one mentioned above, with empty/default values for all but one of the log messages.
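The shape of the loop is approximately this (names are hypothetical; the messages stand in for our actual ones):

```csharp
using Serilog;

// Approximation of the pattern described above: two log calls per row,
// one plain string and one DTO destructured with the @ operator.
static void ProcessRows(System.Collections.Generic.IEnumerable<RowDto> rows)
{
    foreach (var row in rows)
    {
        Log.Information("Add a row to the list");   // constant message, no template parameters
        Log.Information("Parsed row {@Row}", row);  // @ forces full serialization of the DTO
    }
}

// Stand-in for the ~20-property DTO of strings, ints, and GUIDs.
record RowDto(System.Guid Id, string Name, int Count /* , ... */);
```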
That resulted in 29.77GB of memory allocations over 90 seconds, of which only 1.76GB is not attributable to Serilog or the Seq sink.
Do not get me wrong: these log messages should not have been where they were, and they are abusive under the right conditions, but this scenario gave me something to pull on, and I think I have found other cases of memory-usage surges, which will often be GC'd over time. I am also starting to look into message template caching, as I seem to have found a correlation between invalid message templates (interpolated strings that someone reused in an exception as well as in a log message) and what appears to be a slow "memory leak" over time, whereby the cache seems to grow beyond its expected limit, though I am still working on verifying this.
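To illustrate the suspected anti-pattern: an interpolated string produces a different template string for every distinct value, so each call may be parsed and cached as a new template, whereas a proper message template is parsed and cached once. (The message text and OrderId name here are made up for illustration.)

```csharp
using Serilog;

static void LogOrder(System.Guid orderId)
{
    // Suspect: the value is baked into the template string, so every distinct
    // orderId yields a distinct template for Serilog to parse and cache.
    Log.Information($"Processing order {orderId}");

    // Intended usage: one fixed template, cached once; the value is a property.
    Log.Information("Processing order {OrderId}", orderId);
}
```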
Is this level of memory usage (allocations) for what is described (1M log messages, one with just a string, the other with a structured variable) plus the enrichers wildly unexpected, or within the realm of reason? I have included the output of one of the messages (redacted) so that you have an idea of the contents of the messages themselves, with the enrichers applied.