-
Notifications
You must be signed in to change notification settings - Fork 37.7k
Add logging and error handling for file syncing #13039
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
54299bf
to
5657cc5
Compare
Strong concept ACK! Nice! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
concept ACK
src/util.cpp
Outdated
if (fflush(file) != 0) { // harmless if redundantly called | ||
LogPrintf("%s: fflush failed: %d\n", __func__, errno); | ||
return false; | ||
} | ||
#ifdef WIN32 | ||
HANDLE hFile = (HANDLE)_get_osfhandle(_fileno(file)); | ||
FlushFileBuffers(hFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value should probably be checked as well: https://msdn.microsoft.com/en-us/library/windows/desktop/aa364439(v=vs.85).aspx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
#ifdef WIN32 | ||
HANDLE hFile = (HANDLE)_get_osfhandle(_fileno(file)); | ||
FlushFileBuffers(hFile); | ||
#else | ||
#if defined(__linux__) || defined(__NetBSD__) | ||
fdatasync(fileno(file)); | ||
if (fdatasync(fileno(file)) != 0 && errno != EINVAL) { // Ignore EINVAL for filesystems that don't support sync | ||
LogPrintf("%s: fdatasync failed: %d\n", __func__, errno); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't come up with a specific reason other than "gut feeling", but potentially logging to the same disk that just failed to sync feels icky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not necessarily the case, e.g. they might be logging to stdout, or piping the log to something else over the network. Logging the error is extremely important for troubleshooting in any case, even if it might fail.
src/validation.cpp
Outdated
fclose(fileOld); | ||
} | ||
|
||
fileOld = OpenUndoFile(posOld); | ||
if (fileOld) { | ||
if (fFinalize) | ||
TruncateFile(fileOld, vinfoBlockFile[nLastBlockFile].nUndoSize); | ||
FileCommit(fileOld); | ||
flush_failed = FileCommit(fileOld); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|=
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, and the logic for these is backwards. They should be:
flush_failed |= !FileCommit(fileOld);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes this should be flush_ok
src/validation.cpp
Outdated
|
||
FILE *fileOld = OpenBlockFile(posOld); | ||
if (fileOld) { | ||
if (fFinalize) | ||
TruncateFile(fileOld, vinfoBlockFile[nLastBlockFile].nSize); | ||
FileCommit(fileOld); | ||
flush_failed = FileCommit(fileOld); | ||
fclose(fileOld); | ||
} | ||
|
||
fileOld = OpenUndoFile(posOld); | ||
if (fileOld) { | ||
if (fFinalize) | ||
TruncateFile(fileOld, vinfoBlockFile[nLastBlockFile].nUndoSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't hurt to check the return value here too, but I guess it's not crucial since any error would just propagate to the sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
@@ -49,7 +49,8 @@ bool SerializeFileDB(const std::string& prefix, const fs::path& path, const Data | |||
|
|||
// Serialize | |||
if (!SerializeDB(fileout, data)) return false; | |||
FileCommit(fileout.Get()); | |||
if (!FileCommit(fileout.Get())) | |||
return error("%s: Failed to flush file %s", __func__, pathTmp.string()); | |||
fileout.fclose(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this attempt to close before returning? I assume we're in bad shape already if we get here, but I'm afraid that some one-off sync failure could lead to multiple copies open and weird fd behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't CAutofile destructor take care of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, nevermind missed that this was a CAutofile.
src/validation.cpp
Outdated
|
||
FILE *fileOld = OpenBlockFile(posOld); | ||
if (fileOld) { | ||
if (fFinalize) | ||
TruncateFile(fileOld, vinfoBlockFile[nLastBlockFile].nSize); | ||
FileCommit(fileOld); | ||
flush_failed = FileCommit(fileOld); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See fclose comment above.
@@ -4760,7 +4765,8 @@ bool DumpMempool(void) | |||
} | |||
|
|||
file << mapDeltas; | |||
FileCommit(file.Get()); | |||
if (!FileCommit(file.Get())) | |||
throw std::runtime_error("FileCommit failed"); | |||
file.fclose(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aaand here :)
src/util.cpp
Outdated
#ifdef WIN32 | ||
HANDLE hFile = (HANDLE)_get_osfhandle(_fileno(file)); | ||
FlushFileBuffers(hFile); | ||
#else | ||
#if defined(__linux__) || defined(__NetBSD__) | ||
fdatasync(fileno(file)); | ||
if (fdatasync(fileno(file)) != 0 && errno != EINVAL) { // Ignore EINVAL for filesystems that don't support sync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, leveldb does not compare the result to EINVAL. Which makes sense for a database, as if writes can't be synchronized at any point, the db can't really ever be trusted.
I think we might want to operate the same way. If a FlushBlockFile() fails due to a disk/filesystem that can't sync, do we really want to continue with updating the index db?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leveldb does compare against EINVAL, we had this issue a long time ago where leveldb didn't work with certain filesystems (e.g. NTFS mounted from Linux). This was eventually fixed by adding the compare against EINVAL.
We certainly don't want to repeat this issue with the block files!
$ git grep fsync
util/env_posix.cc: if (fsync(fd) < 0 && errno != EINVAL) {
``
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. leveldb checks the fsync() in SyncDirIfManifest() against EINVAL, but not the fdatasync() in Sync().
I have no knowledge of the past issue though, so I'll defer to you on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as if writes can't be synchronized at any point, the db can't really ever be trusted.
I don't agree. When shutting down bitcoind and the OS cleanly, f(data)sync
is a no-op. It's only an issue if there are sudden crashes or power outages. In which case the user will just have to accept potential corruption on crashes when choosing to use a filesystem that doesn't support these things. Certainly for the block files, which are huge and more commonly hosted externally.
My intent here is not to break any current usecases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK after squash.
utACK 36424a4f09de38e1ab54381aa7844781688f5c75 (squash recommended). |
Add logging and error handling inside, and outside of FileCommit. Functions such as fsync, fdatasync will return error in case of hardware I/O errors, and ignoring this means it can silently continue through data corruption. (c.f. https://lwn.net/SubscriberLink/752063/12b232ab5039efbe/)
squashed 36424a4 → cf02779 |
36424a4
to
cf02779
Compare
cf02779 Add logging and error handling for file syncing (Wladimir J. van der Laan) Pull request description: Add logging and error handling inside, and outside of FileCommit. Functions such as fsync, fdatasync will return error in case of hardware I/O errors, and ignoring this means it can silently continue through data corruption. (c.f. https://lwn.net/SubscriberLink/752063/12b232ab5039efbe/) EINVAL is handled specially to avoid crashing out on (network, fuse) filesystems that don't handle `f[data]sync`. I checked that the syncing inside leveldb is already generating an I/O error as appropriate. Tree-SHA512: 64cc9bbedca3ecc97ff4bac0a7b7ac6526a7ed763c66f6786d03ca4f2e9e366e42b152cb908299c060448d98ca39ff03395280bffaca51d592e728aa2516f5dd
cf02779 Add logging and error handling for file syncing (Wladimir J. van der Laan) Pull request description: Add logging and error handling inside, and outside of FileCommit. Functions such as fsync, fdatasync will return error in case of hardware I/O errors, and ignoring this means it can silently continue through data corruption. (c.f. https://lwn.net/SubscriberLink/752063/12b232ab5039efbe/) EINVAL is handled specially to avoid crashing out on (network, fuse) filesystems that don't handle `f[data]sync`. I checked that the syncing inside leveldb is already generating an I/O error as appropriate. Tree-SHA512: 64cc9bbedca3ecc97ff4bac0a7b7ac6526a7ed763c66f6786d03ca4f2e9e366e42b152cb908299c060448d98ca39ff03395280bffaca51d592e728aa2516f5dd
cf02779 Add logging and error handling for file syncing (Wladimir J. van der Laan) Pull request description: Add logging and error handling inside, and outside of FileCommit. Functions such as fsync, fdatasync will return error in case of hardware I/O errors, and ignoring this means it can silently continue through data corruption. (c.f. https://lwn.net/SubscriberLink/752063/12b232ab5039efbe/) EINVAL is handled specially to avoid crashing out on (network, fuse) filesystems that don't handle `f[data]sync`. I checked that the syncing inside leveldb is already generating an I/O error as appropriate. Tree-SHA512: 64cc9bbedca3ecc97ff4bac0a7b7ac6526a7ed763c66f6786d03ca4f2e9e366e42b152cb908299c060448d98ca39ff03395280bffaca51d592e728aa2516f5dd
cf02779 Add logging and error handling for file syncing (Wladimir J. van der Laan) Pull request description: Add logging and error handling inside, and outside of FileCommit. Functions such as fsync, fdatasync will return error in case of hardware I/O errors, and ignoring this means it can silently continue through data corruption. (c.f. https://lwn.net/SubscriberLink/752063/12b232ab5039efbe/) EINVAL is handled specially to avoid crashing out on (network, fuse) filesystems that don't handle `f[data]sync`. I checked that the syncing inside leveldb is already generating an I/O error as appropriate. Tree-SHA512: 64cc9bbedca3ecc97ff4bac0a7b7ac6526a7ed763c66f6786d03ca4f2e9e366e42b152cb908299c060448d98ca39ff03395280bffaca51d592e728aa2516f5dd
…+ Add file syncing logging and error handling d593e7e Migrate remaining FLATDATA serialization, we are natively supporting it now. (furszy) 051970d addrdb: Remove temporary files created in SerializeFileDB. Fixes non-determinism in unit tests. (practicalswift) 4be426f Add logging and error handling for file syncing (Wladimir J. van der Laan) 8662fb3 Remove unused double_safe_addition & double_safe_multiplication functions. (furszy) a7c3885 Add native support for serializing char arrays without FLATDATA (Pieter Wuille) 459ecb9 Deduplicate addrdb.cpp and use CHashWriter/Verifier (Pieter Wuille) a5d2f8a Fix copypasted comment. (Pavel Janík) Pull request description: Digging down the peers and ban databases commits path in upstream found some good stuff, back ported the following PRs: * bitcoin#9216. * bitcoin#10248. * bitcoin#12740. * bitcoin#13039. * bitcoin#16212. ACKs for top commit: random-zebra: ACK d593e7e Fuzzbawls: ACK d593e7e Tree-SHA512: 8d567c68a2c36f43c749d5faa4b7f8a299dbfdda133495df434ac642c1a2ac96dc2259123ad85decc9039c62c46602665f6be4aadf6d5bf729344c51692ec60e
cf02779 Add logging and error handling for file syncing (Wladimir J. van der Laan) Pull request description: Add logging and error handling inside, and outside of FileCommit. Functions such as fsync, fdatasync will return error in case of hardware I/O errors, and ignoring this means it can silently continue through data corruption. (c.f. https://lwn.net/SubscriberLink/752063/12b232ab5039efbe/) EINVAL is handled specially to avoid crashing out on (network, fuse) filesystems that don't handle `f[data]sync`. I checked that the syncing inside leveldb is already generating an I/O error as appropriate. Tree-SHA512: 64cc9bbedca3ecc97ff4bac0a7b7ac6526a7ed763c66f6786d03ca4f2e9e366e42b152cb908299c060448d98ca39ff03395280bffaca51d592e728aa2516f5dd
Add logging and error handling inside, and outside of FileCommit.
Functions such as fsync, fdatasync will return error in case of hardware I/O errors, and ignoring this means it can silently continue through data corruption.
(c.f. https://lwn.net/SubscriberLink/752063/12b232ab5039efbe/)
EINVAL is handled specially to avoid crashing out on (network, fuse) filesystems that don't handle
f[data]sync
.I checked that the syncing inside leveldb is already generating an I/O error as appropriate.