Skip to content

Conversation

WillemKauf
Copy link
Contributor

@WillemKauf WillemKauf commented Jul 16, 2025

Manual backport of #26830. cherry-pick conflict due to altered timequery_test.cc from previous version of redpanda (Boost.Test instead of gtest in this version)

Fixes #26838.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Bug Fixes

  • Fixes a bug in timequeries performed over local storage which could lead to inconsistent or undefined results.

…uery_config& cfg)`

This was buggy for non-monotonic `log`s.

From KIP-33 [1]:

"When searching by timestamp, broker will start from the earliest log segment
and check the last time index entry. If the timestamp of the last time index
entry is greater than the target timestamp, the broker will do binary search
on that time index to find the closest index entry and scan the log from there.
Otherwise it will move on to the next log segment."

Also,

"The time index is not monotonically increasing for the segments of a
partition. Instead, it is only monotonically increasing within each
individual time index file. i.e. It is possible that the time index file for
a later log segment contains smaller timestamp than some timestamp in the time
index file of an earlier segment."

Clearly, it is anticipated that `segment`s may not have monotonically increasing
reference timestamps in their time index (analogous to `redpanda`'s `segment_index`).
Ergo, the use of a binary search via `_seg.slower_bound(cfg.time)` is clearly
incorrect and can lead to some buggy behavior with timequeries.

This can be clearly demonstrated by a simple reproducer in which a Kafka client
produces to a `partition`, issues a timequery, waits for a `segment` roll, and
then issues another timequery without producing after the roll. Because
the `max_timestamp` of the `segment_index` defaults to `0`, the invalidation
of monotonicity leads to a different result for the timequery due to the use
of binary search.

This is particularly dangerous for a `DeleteRecords` request issued through
`rpk topic trim-prefix foo -o "@{timestamp}"`; using the above scenario, the
following behavior was observed:

```
$ rpk topic trim-prefix a -o "@1744708378681"
TOPIC  PARTITION  PRIOR-START-OFFSET  NEW-START-OFFSET
a      0          0                   3
a      1          0                   0
a      2          0                   0

? Confirm deletion of all data before the new start offsets? (Y/n)
```

(Wait for `segment` roll and repeat the same command)

```
$ rpk topic trim-prefix a -o "@1744708378681"
TOPIC  PARTITION  PRIOR-START-OFFSET  NEW-START-OFFSET
a      1          0                   9
a      2          0                   9
a      0          0                   9

? Confirm deletion of all data before the new start offsets? (Y/n)
```

Fix the issue by using a linear search in place of a binary search
in the `lock_manager::range_lock()` during a timequery.

[1]: https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index

(cherry picked from commit 171c2a9)
@WillemKauf WillemKauf force-pushed the backport-pr-26830-v25.1.x-630 branch from f074043 to d8af993 Compare July 16, 2025 04:24
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#69071
test_class test_method test_arguments test_kind job_url test_status passed reason
distributed_kv_stm_tests_rpunit distributed_kv_stm_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/69071#0198117b-741f-4911-b399-ddcd36d62ead FLAKY 1/2
gtest_cloud_storage_rpfixture gtest_cloud_storage_rpfixture unit https://buildkite.com/redpanda/redpanda/builds/69071#0198117b-741f-4a9a-a202-4d3770cdb4dc FLAKY 1/2
PartitionReassignmentsTest test_add_partitions_with_inprogress_reassignments ducktape https://buildkite.com/redpanda/redpanda/builds/69071#019811b8-a425-4100-b8d9-a529613e8f71 FLAKY 20/21 upstream reliability is '89.1566265060241'. current run reliability is '95.23809523809523'. drift is -6.08147 and the allowed drift is set to 50. The test should PASS
TopicDeleteCloudStorageTest drop_lifecycle_marker_test {"cloud_storage_type": 2} ducktape https://buildkite.com/redpanda/redpanda/builds/69071#019811b8-a41f-4f2c-ae4a-345f13e24905 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS

@WillemKauf WillemKauf requested a review from rockwotj July 16, 2025 12:56
@WillemKauf WillemKauf merged commit f6f045c into redpanda-data:v25.1.x Jul 16, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants