
Conversation

Contributor

@szegedi szegedi commented Jul 14, 2025

What does this PR do?

Establishes an upper limit on the number of non-CPU timeline events that can be gathered within a single upload period.

Motivation

It's a bad idea for it to be unbounded.

Additional Notes

The upper limit is rather conservative and is heuristically derived from the CPU sampling rate, the upload period, and the number of libuv worker threads (and available parallelism). For example, if the CPU sampling period is 10ms and we upload every 60s, that means at most 6000 CPU samples. We'd like to be able to observe oversubscription of the libuv thread pool. If there are 4 libuv threads, we'd like to be able to observe at least 6000*5 events (events executing on at least 4 threads, plus 1 waiting, so oversubscription can be detected). Finally, another 6000 events are allowed for GC events, which gives us uploadPeriod / samplingPeriod * (threads + 2) as the maximum number of events we wish to sample. With our defaults of a 65s upload period, a 99Hz sampling rate, and 4 libuv threads we'll collect at most 38613 events (when more are gathered, events get randomly discarded). Because we already employ sampling, I believe this upper bound is not very limiting in ordinary usage – its purpose is to prevent pathological memory usage runoff.
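As a rough illustration of the arithmetic above (a minimal sketch; the function and parameter names are made up for this example and are not taken from the implementation):

// Sketch of the cap heuristic described above; names are hypothetical.
function maxTimelineEvents ({ uploadPeriodMillis, samplingPeriodMillis, libuvThreads }) {
  // CPU samples that fit into one upload period
  const samplesPerPeriod = uploadPeriodMillis / samplingPeriodMillis
  // one "lane" per libuv thread, +1 waiting (to detect oversubscription), +1 for GC events
  return Math.floor(samplesPerPeriod * (libuvThreads + 2))
}

// Defaults quoted above: 65s upload period, 99Hz sampling, 4 libuv threads.
// Yields roughly the ~38.6k cap mentioned in the description (the exact number
// depends on how the 99Hz sampling period is rounded).
maxTimelineEvents({ uploadPeriodMillis: 65000, samplingPeriodMillis: 1000 / 99, libuvThreads: 4 })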

Jira: PROF-12127


codecov bot commented Jul 14, 2025

Codecov Report

❌ Patch coverage is 98.03922% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 83.14%. Comparing base (0852a41) to head (422ea7e).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|------|------|------|
| ...ackages/dd-trace/src/profiling/profilers/events.js | 96.77% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6089      +/-   ##
==========================================
+ Coverage   82.80%   83.14%   +0.33%     
==========================================
  Files         476      477       +1     
  Lines       19661    19692      +31     
==========================================
+ Hits        16281    16372      +91     
+ Misses       3380     3320      -60     



github-actions bot commented Jul 14, 2025

Overall package size

Self size: 11.12 MB
Deduped: 110.7 MB
No deduping: 111.09 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| @datadog/libdatadog | 0.7.0 | 35.02 MB | 35.02 MB |
| @datadog/native-appsec | 10.0.1 | 20.3 MB | 20.3 MB |
| @datadog/native-iast-taint-tracking | 4.0.0 | 11.72 MB | 11.73 MB |
| @datadog/pprof | 5.9.0 | 9.77 MB | 10.14 MB |
| @opentelemetry/core | 1.30.1 | 908.66 kB | 7.16 MB |
| protobufjs | 7.5.3 | 2.95 MB | 5.6 MB |
| @datadog/wasm-js-rewriter | 4.0.1 | 2.85 MB | 3.58 MB |
| @datadog/native-metrics | 3.1.1 | 1.02 MB | 1.43 MB |
| @opentelemetry/api | 1.8.0 | 1.21 MB | 1.21 MB |
| jsonpath-plus | 10.3.0 | 617.18 kB | 1.08 MB |
| import-in-the-middle | 1.14.2 | 122.36 kB | 850.93 kB |
| lru-cache | 10.4.3 | 804.3 kB | 804.3 kB |
| source-map | 0.7.4 | 226 kB | 226 kB |
| opentracing | 0.14.7 | 194.81 kB | 194.81 kB |
| pprof-format | 2.1.0 | 111.69 kB | 111.69 kB |
| @datadog/sketches-js | 2.1.1 | 109.9 kB | 109.9 kB |
| lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB |
| ignore | 7.0.5 | 63.38 kB | 63.38 kB |
| istanbul-lib-coverage | 3.2.2 | 34.37 kB | 34.37 kB |
| rfdc | 1.4.1 | 27.15 kB | 27.15 kB |
| dc-polyfill | 0.1.10 | 26.73 kB | 26.73 kB |
| @isaacs/ttlcache | 1.4.1 | 25.2 kB | 25.2 kB |
| tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB |
| shell-quote | 1.8.3 | 23.74 kB | 23.74 kB |
| limiter | 1.1.5 | 23.17 kB | 23.17 kB |
| retry | 0.13.1 | 18.85 kB | 18.85 kB |
| semifies | 1.0.0 | 15.84 kB | 15.84 kB |
| jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB |
| crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB |
| ttl-set | 1.0.0 | 4.61 kB | 9.69 kB |
| mutexify | 1.4.0 | 5.71 kB | 8.74 kB |
| path-to-regexp | 0.1.12 | 6.6 kB | 6.6 kB |
| koalas | 1.0.2 | 6.47 kB | 6.47 kB |
| module-details-from-path | 1.0.4 | 3.96 kB | 3.96 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

@szegedi szegedi force-pushed the szegedi/timeline-event-max branch 2 times, most recently from e18818a to 16c0275 on July 14, 2025 14:59

pr-commenter bot commented Jul 14, 2025

Benchmarks

Benchmark execution time: 2025-07-29 11:39:38

Comparing candidate commit 422ea7e in PR branch szegedi/timeline-event-max with baseline commit 0852a41 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1272 metrics, 51 unstable metrics.

@szegedi szegedi force-pushed the szegedi/timeline-event-max branch 2 times, most recently from 6c0b312 to 672e321 on July 14, 2025 15:21
@szegedi szegedi force-pushed the szegedi/timeline-event-max branch from 672e321 to 23f4c49 on July 14, 2025 15:38
@szegedi szegedi marked this pull request as ready for review July 14, 2025 15:49
@szegedi szegedi requested a review from a team as a code owner July 14, 2025 15:49
@szegedi szegedi enabled auto-merge (squash) July 14, 2025 15:49
@szegedi szegedi disabled auto-merge July 16, 2025 13:57
Collaborator

@watson watson left a comment


I'm not a fan of the test event code added to packages/dd-trace/src/profiling/profilers/events.js. Is there really no good way to make this test without littering the source code with code only meant for testing?

Contributor Author

szegedi commented Jul 21, 2025

I'm not a fan of the test event code added to packages/dd-trace/src/profiling/profilers/events.js. Is there really no good way to make this test without littering the source code with code only meant for testing?

@watson I did actually think about this quite a lot before committing to this implementation (and discussed it with folks on my team, but I'm happy to discuss it with you as well). I considered writing a test that generates events through legitimate, currently captured operations: either filesystem operations on a temporary file, or a bunch of DNS lookups. My main issue with that approach is that they involve IO operations (specifically, IO operations executed on the libuv thread pool) and I was worried that the test could become flaky, especially when run on the traditionally under-resourced macOS platform. Having a way to directly inject events, gated behind an env var, felt like the lesser evil.
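For illustration, a minimal sketch of what such env-var-gated event injection could look like (the class name, env var name, and hook are hypothetical, not the code actually added in this PR):

// Hypothetical sketch only; none of these names come from dd-trace-js.
class TimelineEventCollector {
  constructor () {
    this.entries = []
    if (process.env.DD_PROFILING_TIMELINE_TEST_EVENTS === 'true') {
      // Test-only hook: lets a test push synthetic events directly instead of
      // having to trigger real filesystem or DNS work on the libuv thread pool.
      this.injectTestEvent = (event) => this.entries.push(event)
    }
  }
}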

Some other alternatives I considered were:

  • make a diagnostic channel the main way of adding events even internally and have everything work off of it, so it wouldn't be a test-only feature; but it felt a bit convoluted to add a pub-sub channel into the control flow where it isn't needed.
  • give the test events their own type instead of allowing any kind of event to be injected; but that would have been even more code, and since the functionality is gated behind an env var anyway, it wouldn't increase safety.

szegedi and others added 2 commits July 21, 2025 16:32
Co-authored-by: Thomas Watson <thomas.watson@datadoghq.com>
@szegedi szegedi force-pushed the szegedi/timeline-event-max branch from 899a38f to 62d7f75 on July 22, 2025 12:31
@szegedi szegedi requested a review from watson July 22, 2025 14:14
@szegedi
Copy link
Contributor Author

szegedi commented Jul 22, 2025

@watson I managed to replace the integration test with a unit test that needs no extra testing support in shipped code.

// current sample too to be the dropped one, with equal likelihood as
// any other sample.
const replacementIndex = Math.floor(Math.random() * (this.maxSamples + 1)) - 1
if (replacementIndex !== -1) {
Collaborator


Do I understand the algorithm correctly, that replacementIndex will only ever be -1 if maxSamples is 0? If so, wouldn't it be simpler to turn the wrapping else block into an else if (this.maxSamples !== 0) block and then get rid of this if block completely?

Then I think you could also simplify the random algorithm to rnd(max), right?

Contributor Author

@szegedi szegedi Jul 23, 2025


No, that's not the intent. The idea is as follows: I already have m samples in the array, which is the maximum (maxSamples in the code). I get one more, so now I have m+1 samples and I need to discard one to get back to m samples. I could just throw away a sample from the array, but that treats the new sample favorably – it is never considered for discarding. It is fairer if the new sample is also a discard candidate, so the algorithm doesn't have a bias towards newer samples (I want a random distribution over the entire sampled period). So I need a random number in [0, m+1) instead of [0, m). If the generated random number falls in [0, m), I throw away the sample already in the array at that index; if the number is m, I throw away the newly arrived sample.

I found it more elegant to implement this logic by subtracting 1 from the generated random value: that gives a number in [-1, m), so I can compare against -1 instead of m. An alternative but equivalent solution would be:

const replacementIndex = Math.floor(Math.random() * (this.maxSamples + 1))
if (replacementIndex !== this.maxSamples) {

Collaborator


nit: hmm, that does not look right:
the last added event would have a probability of being kept of 1 - 1/(maxSamples + 1), whereas for true random sampling it should be maxSamples/n (where n is the total number of events).
What you need here is reservoir sampling (where n is the total number of processed events, including the current one):

const replacementIndex = Math.floor(Math.random() * n)
if (replacementIndex < this.maxSamples) {
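For reference, a self-contained sketch of the reservoir sampling approach suggested here ("Algorithm R"); the class and field names are illustrative, not the actual events.js implementation:

class EventReservoir {
  constructor (maxSamples) {
    this.maxSamples = maxSamples
    this.samples = []
    this.seen = 0 // total events observed so far, including dropped ones
  }

  add (event) {
    this.seen++
    if (this.samples.length < this.maxSamples) {
      this.samples.push(event)
      return
    }
    // Replace a stored sample with probability maxSamples / seen, otherwise drop
    // the new event; every event seen so far stays in the reservoir with the
    // same probability maxSamples / seen.
    const replacementIndex = Math.floor(Math.random() * this.seen)
    if (replacementIndex < this.maxSamples) {
      this.samples[replacementIndex] = event
    }
  }
}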

Contributor Author


you're right, of course; not sure what I was thinking with my bespoke probability calculation. Fixed.


datadog-datadog-prod-us1 bot commented Jul 24, 2025

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🔗 Commit SHA: 422ea7e

@szegedi szegedi requested a review from nsavoire July 24, 2025 15:50
@szegedi szegedi merged commit 33e1c64 into master Jul 29, 2025
688 checks passed
@szegedi szegedi deleted the szegedi/timeline-event-max branch July 29, 2025 13:09
dd-octo-sts bot pushed a commit that referenced this pull request Jul 30, 2025
Co-authored-by: Thomas Watson <thomas.watson@datadoghq.com>
@dd-octo-sts dd-octo-sts bot mentioned this pull request Jul 30, 2025
BridgeAR pushed a commit that referenced this pull request Aug 6, 2025
Co-authored-by: Thomas Watson <thomas.watson@datadoghq.com>
tlhunter pushed a commit that referenced this pull request Aug 22, 2025
Co-authored-by: Thomas Watson <thomas.watson@datadoghq.com>