-
Notifications
You must be signed in to change notification settings - Fork 37.7k
util: improve FindByte() performance #19690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR was suggested by @hebasto #16981 (comment) (thank you!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance gain looks substantial - didn't verified.
src/streams.h
Outdated
nReadPos++; | ||
size_t n = vchBuf.size() - start; | ||
if (n > nSrcPos - nReadPos) | ||
n = nSrcPos - nReadPos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, join with above line.
How was it measured? |
Any idea why this is so much faster? As far as I know, there is no faster algorithm to look for the first occurrence of a single byte in a memory array than a linear iteration over it. I'd expect |
I ran:
with master (1m20s) and with master + PR (3s).
I think you're correct that there's no faster way than a loop with runtime proportional to the number of bytes to scan, but I assume I just noticed that the repetition count on the test is set to a large number (50000000) and I meant to reduce it for the commit (3 seconds is too long to add to the unit test suite). I'll reduce that number in force push in a minute. This test doesn't really need to be in this PR ( |
ab412ec
to
a31aa32
Compare
Force-push a small fix to the test, so it doesn't take 3 seconds to run. |
That's true, it's possible to optimize that with assembly language (definitely with specific instruction sets). it still surprises me because you'd expect the I/O, to read the data from disk, to dominate greatly in the block importing. Not looking for a character already in memory! It just seems out of proportion. |
src/test/streams_tests.cpp
Outdated
fwrite(&b, 1, 1, file); | ||
rewind(file); | ||
CBufferedFile bf(file, fileSize * 2, fileSize, 0, 0); | ||
for (int rep = 0; rep < 100; ++rep) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand how the performance increase is so significant, but why not a bench if you're worried about burdening the unit tests? I tried to do this but I must be doing something wrong because I can't seem to reproduce the speedup. 😞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance impact may be very compiler/architecture/stdlib dependent. I'm kind of surprised std::find has optimizations beyond the naive loop implementation in the first place on some platforms, so I certainly wouldn't be surprised if others don't have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gzhao408, thank you, I wasn't aware of bench
. I suspect the iteration count, 100, is far too low and the difference is swamped by the noise. I increased the iteration count to 10m (1e7
) and it showed the expected difference, master: 1,659.30 ns/op, PR: 52.66 ns/op (ratio is about 31).
I just force-pushed (diff) to remove the unit test (which was only for benchmarking, not really testing anything) and cherry-pick Gloria's benchmark. I added one more commit to increase the iteration count.
|
a31aa32
to
8b07e17
Compare
@gmaxwell pointed out to me why this is so much faster: it's not that std::find is amazing, but that the original code (which I wrote in 2012, it seems!) is doing a modulus operation for every character (which is often orders of magnitude slower than the byte comparison or addition/subtraction). Thinking about this a bit more high level: the end goal is just to scan quickly for the 4-byte network magic in a file. If this is relevant for performance (and it seems it may be), it may be better to have a function that does exactly that in CBufferedFile, rather than a search for one byte + memcmp. |
Here's a version that's very close in implementation to master that eliminates the
and that does make a significant difference; I do like @sipa's suggestion to generalize |
I just added a new commit (25413ab, can squash later) to implement @sipa's suggestion to add a method to It didn't work out to use |
Looks like one of the sanitizers is finding an implicit-integer-sign-change problem in the latest update:
|
25413ab
to
8f2e8c2
Compare
Thanks, @adamjonas, force-pushed a fix for that signed-unsigned CI failure, added another unit test, some improvements to |
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsNo conflicts as of last run. |
static void FindByte(benchmark::Bench& bench) | ||
{ | ||
// Setup | ||
FILE* file = fsbridge::fopen("streams_tmp", "w+b"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we maybe use something like memfd_create(2)
for the benchmark, to decrease I/O noise?
Anyone knows if there is a windows equivalent or a higher abstraction in boost::filesystem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmemopen(3)
also looks interesting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this can give us that? https://www.boost.org/doc/libs/1_72_0/libs/iostreams/doc/classes/mapped_file.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestions, but I don't think this matters, because the file IO (read) occurs only on the very first benchmark loop iteration; after that, the data is in memory and there is no IO at all. Each iteration repositions the stream pointer to zero (bf.SetPos(0)
) and then searches forward for the value 1 (bf.FindByte(1)
), which is 200 bytes away. But after the first iteration, all of these bytes are in memory, and we're just moving the position between 0 and 200.
The reason for the 200, by the way, is that with random data, searching for a random byte value the average distance would be half of 256. But the data in blk.dat
files (which is the only use of this code currently) isn't quite random; zero and 0xff
occur more often (and we never search for those values). So I guessed that 200 is close to a typical distance that this function would move through before finding the requested byte. Maybe I should explain these points in a comment in bench/streams_findbyte.cpp
.
Code review ACK 8f2e8c2, this looks like a better abstraction, and it's an improvement not to do a modulus for every byte. |
8f2e8c2
to
566c2e2
Compare
Force-pushed to rationalize the commits, so I think it's in a good state for merging. I reordered the commit that adds the benchmark test, b576994, to be first, so it's easy for reviewers to checkout that commit and run the new benchmark test without the improvements:
I re-ran those tests just now, and on my laptop, the ns/op without the PR is 1,840, and with the PR it's 55 (an improvement of more than a factor of 33). |
Code review ACK 566c2e2. I think it'd be good to move the refactoring to FindByte in the last commit to a separate commit. |
566c2e2
to
134de90
Compare
Force-pushed to implement latest review suggestion, changes are more cleanly separated among the commits (no code changes overall), thanks @sipa. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK.
08cf6a9
to
dacd331
Compare
@john-moffett - Good idea to bring back the earlier commit (the condition branch instruction theory makes sense); I just restored (force-pushed) it as you suggested. On my x86 (ns/op, lower is better): master: 307 |
ACK dacd331 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach ACK dacd331
I'm now seeing bench performance improvement on my M1 again, from ~150ns/op -> ~85ns/op.
src/streams.h
Outdated
@@ -744,13 +744,22 @@ class CBufferedFile | |||
//! search for a given byte in the stream, and remain positioned on it | |||
void FindByte(uint8_t ch) | |||
{ | |||
const std::byte byte{static_cast<std::byte>(ch)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that there is only a single non-test callsite of FindByte()
, would it make sense to just update the fn signature to take std::byte
directly?
src/streams.h
Outdated
@@ -744,13 +744,22 @@ class CBufferedFile | |||
//! search for a given byte in the stream, and remain positioned on it | |||
void FindByte(uint8_t ch) | |||
{ | |||
const std::byte byte{static_cast<std::byte>(ch)}; | |||
size_t buf_offset = m_read_pos % vchBuf.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: and a bunch more of those
size_t buf_offset = m_read_pos % vchBuf.size(); | |
size_t buf_offset{m_read_pos % vchBuf.size()}; |
@@ -744,13 +744,22 @@ class CBufferedFile | |||
//! search for a given byte in the stream, and remain positioned on it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some of the rationale of this implementation should be in the docs so future contributors don't simplify the code again to unintentionally undo the performance gains, e.g. why the modulo operator is kept outside of the while loop seems quite important and non-trivial?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but I think it would also be helpful to improve the docs to specify that if ch
is not found, std::ios_base::failure
is thrown (from Fill()
). It's an essential part of the interface imo.
Perhaps worth improving on this behaviour, and have FindByte()
throw its own error, by wrapping the Fill()
command in a try/catch? Orthogonal to this PR, though. (And I also don't like that we're catching a general std::exception
for a FindByte()
failure, but again, orthogonal.)
dacd331
to
52804a5
Compare
Force-pushed for review comments (thanks, @stickies-v), verified benchmark performance is unchanged (ns/op with PR: 34.76, without PR: 302). Summary of force-push changes:
|
acef167
to
0fe832c
Compare
Force-pushed again to fix CI failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 0fe832c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of changing the callsites of FindByte()
, how about adding a uint8_t
overload? I think it keeps the implementation clean, but since it can easily be argued that uint8_t
also is a byte, this keeps the callsites straightforward and reduces the diff.
git diff
diff --git a/src/bench/streams_findbyte.cpp b/src/bench/streams_findbyte.cpp
index 175564fe9..7b2e65da2 100644
--- a/src/bench/streams_findbyte.cpp
+++ b/src/bench/streams_findbyte.cpp
@@ -20,7 +20,7 @@ static void FindByte(benchmark::Bench& bench)
bench.run([&] {
bf.SetPos(0);
- bf.FindByte(std::byte(1));
+ bf.FindByte(1);
});
// Cleanup
diff --git a/src/streams.h b/src/streams.h
index 2558bd830..9280fa013 100644
--- a/src/streams.h
+++ b/src/streams.h
@@ -763,6 +763,8 @@ public:
if (buf_offset >= vchBuf.size()) buf_offset = 0;
}
}
+
+ void FindByte(uint8_t byte) { return FindByte(static_cast<std::byte>(byte)); }
};
#endif // BITCOIN_STREAMS_H
diff --git a/src/test/fuzz/buffered_file.cpp b/src/test/fuzz/buffered_file.cpp
index 2f7ce60c7..67cac8fa4 100644
--- a/src/test/fuzz/buffered_file.cpp
+++ b/src/test/fuzz/buffered_file.cpp
@@ -53,7 +53,7 @@ FUZZ_TARGET(buffered_file)
return;
}
try {
- opt_buffered_file->FindByte(std::byte(fuzzed_data_provider.ConsumeIntegral<uint8_t>()));
+ opt_buffered_file->FindByte(fuzzed_data_provider.ConsumeIntegral<uint8_t>());
} catch (const std::ios_base::failure&) {
}
},
diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp
index 79bc7b7c0..1db5b61f1 100644
--- a/src/test/streams_tests.cpp
+++ b/src/test/streams_tests.cpp
@@ -462,7 +462,7 @@ BOOST_AUTO_TEST_CASE(streams_buffered_file_rand)
size_t find = currentPos + InsecureRandRange(8);
if (find >= fileSize)
find = fileSize - 1;
- bf.FindByte(std::byte(find));
+ bf.FindByte(find);
// The value at each offset is the offset.
BOOST_CHECK_EQUAL(bf.GetPos(), find);
currentPos = find;
diff --git a/src/validation.cpp b/src/validation.cpp
index a79b81add..b42b39861 100644
--- a/src/validation.cpp
+++ b/src/validation.cpp
@@ -4438,7 +4438,7 @@ void Chainstate::LoadExternalBlockFile(
try {
// locate a header
unsigned char buf[CMessageHeader::MESSAGE_START_SIZE];
- blkdat.FindByte(std::byte(params.MessageStart()[0]));
+ blkdat.FindByte(params.MessageStart()[0]);
nRewind = blkdat.GetPos() + 1;
blkdat >> buf;
if (memcmp(buf, params.MessageStart(), CMessageHeader::MESSAGE_START_SIZE)) {
if (m_read_pos == nSrcPos) { | ||
// No more bytes available; read from the file into the buffer, | ||
// setting nSrcPos to one beyond the end of the new data. | ||
// Throws exception if end-of-file reached. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that it's part of the interface, I think this needs to be documented on the function level so devs wanting to use FindByte know how it behaves when the byte isn't found - they shouldn't need to dive into the implementation.
{ | ||
// For best performance, avoid mod operation within the loop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
// For best performance, avoid mod operation within the loop. | |
// The modulus operation is much more expensive than byte | |
// comparison and addition, so we keep it out of the loop to | |
// improve performance (see #19690 for discussion). |
ACK 0fe832c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 0fe832c
Silent merge conflict with master:
|
Avoid use of the expensive mod operator (%) when calculating the buffer offset. No functional difference. Co-authored-by: Hennadii Stepanov <32963518+hebasto@users.noreply.github.com>
0fe832c
to
72efc26
Compare
Force pushed rebase to fix hidden merge conflict, thanks @achow101 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-ACK 72efc26
Verified that the only difference is to include <util/fs.h>
instead of <fs.h>
(introduced by 00e9b97)
% git range-diff HEAD~2 0fe832c4a4b2049fdf967bca375468d5ac285563 HEAD
1: 5842d92c8 ! 1: 604df63f6 [bench] add streams findbyte
@@ src/bench/streams_findbyte.cpp (new)
+
+#include <bench/bench.h>
+
-+#include <fs.h>
++#include <util/fs.h>
+#include <streams.h>
+
+static void FindByte(benchmark::Bench& bench)
2: 0fe832c4a = 2: 72efc2643 util: improve streams.h:FindByte() performance
re-ACK 72efc26 |
72efc26 util: improve streams.h:FindByte() performance (Larry Ruane) 604df63 [bench] add streams findbyte (gzhao408) Pull request description: This PR is strictly a performance improvement; there is no functional change. The `CBufferedFile::FindByte()` method searches for the next occurrence of the given byte in the file. Currently, this is done by explicitly inspecting each byte in turn. This PR takes advantage of `std::find()` to do the same more efficiently, improving its CPU runtime by a factor of about 25 in typical use. ACKs for top commit: achow101: re-ACK 72efc26 stickies-v: re-ACK 72efc26 Tree-SHA512: ddf0bff335cc8aa34f911aa4e0558fa77ce35d963d602e4ab1c63090b4a386faf074548daf06ee829c7f2c760d06eed0125cf4c34e981c6129cea1804eb3b719
This PR is strictly a performance improvement; there is no functional change. The
CBufferedFile::FindByte()
method searches for the next occurrence of the given byte in the file. Currently, this is done by explicitly inspecting each byte in turn. This PR takes advantage ofstd::find()
to do the same more efficiently, improving its CPU runtime by a factor of about 25 in typical use.