Skip to content

fix(agent): Correctly truncate the disk buffer #16697

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 24, 2025

Conversation

nagaflokhu
Copy link
Contributor

@nagaflokhu nagaflokhu commented Mar 26, 2025

Summary

The disk buffer grows and never shrinks, as documented here.

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #16314
resolves #16500
resolves #16615
resolves #16696

@telegraf-tiger
Copy link
Contributor

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@nagaflokhu
Copy link
Contributor Author

This fix appears to work in the tests I have run, but I am not confident that it actually always works.

@nagaflokhu
Copy link
Contributor Author

it might be a good idea to consider different approaches to dealing with the fact that the wal lib does not allow truncating all of its elements. four that come to mind (the first two being very similar to each other):

  1. every time there is a write to the WAL, add a "sentinel" element in addition to the metrics being added. the sentinel is a well-known value that is impossible to obtain from serializing a metric. when reading from the WAL, ignore the sentinels. when truncating all metrics from the WAL, make sure to leave the latest write's sentinel in the WAL.
  2. when trunctaing all metrics from the WAL, add a sentinel first, and then truncate everything except the sentinel. (why shouldn't this be possible?)
  3. change the wal library such as to allow truncating everything from the WAL.
  4. don't use the WAL library. it seems to me that, at an abstract level, what we want for this buffer is an on-disk deque, i.e. the possibility to add to/read from/remove from the front/back.

Comment on lines 149 to 150
offset := 0
for batchSize > 0 && readIndex < endIndex {
for ; batchSize > 0 && readIndex < endIndex; readIndex, offset = readIndex+1, offset+1 {
Copy link
Member

@srebhan srebhan Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see because we are adding this before the rest of the code and you do it at the end... In this case it would be good to keep the readIndex as is and use the following

	for offset := 0; batchSize > 0 && readIndex < endIndex; offset++ {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we were going to attempt to complete this PR? it seems like a pretty major issue with the disk buffer

@nagaflokhu
Copy link
Contributor Author

!signed-cla

@srebhan
Copy link
Member

srebhan commented Apr 14, 2025

@nagaflokhu sorry for this being so slow! We do appreciate your effort on this! For moving forward with the PR, we need to check why the tests are failing now. If the tests were wrong initially, we need an explanation on why they were wrong and a fix to correct the test. Otherwise, if the tests are correct, we need to fix the code to get the tests to pass.
We also need to make sure that the present error is covered by a unit-test, i.e. add a test if there is none yet.

@nagaflokhu nagaflokhu changed the title Attempt to fix disk buffer Fix: disk buffer Apr 15, 2025
@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Apr 15, 2025
@nagaflokhu nagaflokhu changed the title Fix: disk buffer fix(buffer): disk buffer Apr 15, 2025
@nagaflokhu
Copy link
Contributor Author

@srebhan I've now fixed the tests that were failing and added an assertion for regression purposes. I just ran the additional assertion on the code from the latest master branch commit, and it shows that the previous implementation was not working: accepting a transaction and ending it did not change the amount of entries in the buffer. the other assertion in the test I modified was passing because the mask had a length of 2.

@srebhan srebhan assigned srebhan and unassigned mstrandboge Apr 16, 2025
@srebhan srebhan force-pushed the disk-buffer-never-shrinks branch from e20a5d5 to 6e8b3b7 Compare April 16, 2025 14:01
@srebhan srebhan changed the title fix(buffer): disk buffer fix(agent): Correctly truncate the disk buffer Apr 16, 2025
@srebhan
Copy link
Member

srebhan commented Apr 16, 2025

@nagaflokhu I took the liberty to push some commits to your branch including a unit-test and some missing pieces for fixing the issue. What do you think?

@telegraf-tiger
Copy link
Contributor

@nagaflokhu
Copy link
Contributor Author

@nagaflokhu I took the liberty to push some commits to your branch including a unit-test and some missing pieces for fixing the issue. What do you think?

very nice!

fwiw, I wonder if an alternative disk buffer design might not be simpler than having to manage a mask, but it looks to me like we're not going down that route, so let's move forward I guess.

@srebhan
Copy link
Member

srebhan commented Apr 16, 2025

An alternative WAL implementation will not happen anytime soon so let's fix this!

Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nagaflokhu for pushing this forward!

@srebhan
Copy link
Member

srebhan commented Apr 16, 2025

@mstrandboge and @skartikey approved this but as I'm involved in the code, please review both! Furthermore, review but do not merge yet until we get some confirmations it fixes the issue(s).

@skartikey
Copy link
Contributor

The proposed fix looks solid and appears to resolve the issue with the buffer not shrinking properly. It correctly maintains the mask, handles the empty buffer case more consistently, and adjusts offsets after truncation as expected.

While reading through the code, I considered checking whether the condition b.isEmpty = b.entries()-removeIdx <= 0 reliably detects when the buffer becomes empty. This would be a key area to investigate if shrinking were to fail. However, I later noticed that the TestDiskBufferTruncate test case already covers this scenario effectively. It verifies the buffer’s behavior across multiple batches and confirms that, when logically empty, no metrics are returned and the remaining masked entry is handled appropriately due to the WAL design.

The fix aligns well with what this test validates. Great work on this!

@srebhan
Copy link
Member

srebhan commented Apr 17, 2025

Thanks @skartikey for the thorough review!

@srebhan srebhan added area/aws AWS plugins including cloudwatch, ecs, kinesis area/agent ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. labels Apr 23, 2025
@srebhan srebhan assigned skartikey and mstrandboge and unassigned srebhan Apr 23, 2025
@skartikey skartikey removed their assignment Apr 23, 2025
Copy link
Member

@mstrandboge mstrandboge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this great work!

@mstrandboge mstrandboge merged commit 5329835 into influxdata:master Apr 24, 2025
39 checks passed
@github-actions github-actions bot added this to the v1.34.3 milestone Apr 24, 2025
@nagaflokhu nagaflokhu deleted the disk-buffer-never-shrinks branch May 1, 2025 18:25
srebhan added a commit that referenced this pull request May 5, 2025
Co-authored-by: Sven Rebhan <srebhan@influxdata.com>
(cherry picked from commit 5329835)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/agent area/aws AWS plugins including cloudwatch, ecs, kinesis fix pr to fix corresponding bug ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
4 participants