Conversation

@laanwj (Member) commented Oct 30, 2015

Use a simple Win32WritableFile, equivalent to PosixWritableFile in posix_env.cc

The goal is to solve leveldb corruption issues on Windows; this issue has been reported many times, see #5865, #6727, #5610, ...

In my testing with this change, I was unable to cause the database to be corrupted while crashing a Windows laptop in various ways using NotMyFault. Without the change, it happened the first time I tried.

Use a simple Win32WritableFile, equivalent to PosixWritableFile in
posix_env.cc
@sipa (Member) commented Oct 30, 2015

Please submit to https://github.com/bitcoin/leveldb first?

@laanwj (Member, Author) commented Oct 30, 2015

Thought about that, but I think it gets more exposure and testing here.
(obviously it should still go there when accepted...)

@sipa (Member) commented Oct 30, 2015

@laanwj Sure!

@jgarzik (Contributor) commented Oct 30, 2015

utACK

@dcousens (Contributor)

utACK

@sipa (Member) commented Oct 31, 2015 via email

@gmaxwell (Contributor)

@sipa Yes, and if it's much slower, that's reason to figure out what's strictly required for the syncs. And if it's not, we can call it done. Reindex time right now is dominated by signature validation; anyone benchmarking must be sure to bypass it.

@laanwj (Member, Author) commented Oct 31, 2015

This is the same as how it already works on POSIX, directly converted to the Win32 API. By all means benchmark, but IMO we should apply this regardless (provided it introduces no new bugs). We should first get rid of stability issues and only then worry about optimization. This will likely save many people from having to reindex at all.

@maflcko (Member) commented Oct 31, 2015

Concept ACK

Flush() is supposed to clear application-side buffers (like fflush). As
this writable file implementation uses OS functions directly, there are
no buffers to flush.
@laanwj (Member, Author) commented Oct 31, 2015

The last commit is optional and potentially risky. From what I understand, leveldb's WritableFiles are supposed to do their own caching, and Flush() is supposed to flush this cache to the OS (its POSIX implementation is fflush(FILE*)). As this implementation uses OS calls directly, there is no need for that; with that reasoning, Flush() can be empty.

It is of course also possible to use libc FILE* fwrite/fflush on Windows instead of using the Win32 API directly, and then use this hack to sync. That would provide local buffering, but I was afraid the extra level of buffering might introduce syncing issues (see my post below for an implementation of this).

Then again, these are all performance concerns. As said above, it may be wrong to worry about that here; I'm mostly concerned with this being correct.

@Diapolo commented Oct 31, 2015

Isn't LevelDB effectively dead? Their GitHub repo didn't seem to be actively developed the last time I checked.

@laanwj (Member, Author) commented Oct 31, 2015

@Diapolo That is a good observation; that's why I'm making an attempt at maintaining it.
Can you help with testing?

@Diapolo commented Oct 31, 2015

@laanwj Yes, I will first do a -reindex on testnet to be able to compare before and after; then I'm going to kill the testnet instance and see whether it corrupts.

@laanwj (Member, Author) commented Oct 31, 2015

Thanks!

@laanwj (Member, Author) commented Oct 31, 2015

Here is an alternative implementation of Win32WritableFile that uses application-buffered libc primitives (fopen, fwrite, ...): https://github.com/laanwj/bitcoin/tree/2015_10_leveldb_win_nomap2. It is barely tested, and I would not necessarily suggest using it instead of this one, but if someone is going to benchmark, it should be included: it reduces the number of system calls in the case of many small writes, which we don't do, but leveldb may internally.

@laanwj (Member, Author) commented Nov 1, 2015

@jonasschnelli it would be useful to have executables here, mind pointing your nightly builder at this?

@jonasschnelli (Contributor)

@NicolasDorier (Contributor)

I am so glad about this; you can't imagine the time I have lost because of this bug. It is the only reason I don't have a full node on my laptop. Trying it right now.

```cpp
return ((x + y - 1) / y) * y;
std::wstring path;
ToWidePath(fname, path);
DWORD Flag = PathFileExistsW(path.c_str()) ? OPEN_EXISTING : CREATE_ALWAYS;
```
Review comment (Member):

Any reason not to use OPEN_ALWAYS (which should open the file if it exists and create it if not, according to https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858%28v=vs.85%29.aspx)?

Review comment (Member):

Oh, this code is just moved.

Review comment (Member, Author):

The goal is to have f=fopen(filename, "wb") semantics. It's possible that this could be improved, but indeed I just copied the code from the previous WritableFile implementation.

@sipa (Member) commented Nov 1, 2015

@laanwj We don't do small writes anyway, so I doubt an application-level cache would help.

@sipa (Member) commented Nov 1, 2015

Concept ACK in any case.

@laanwj (Member, Author) commented Nov 2, 2015

> @laanwj We don't do small writes anyway, so I doubt an application-level cache would help.

We don't, but maybe leveldb does internally? WritableFiles are used for all of leveldb's file creation and writing, including when building/rebuilding tables, not just when writing out database batches. I'm not sure how careful it is to avoid, say, calling Append multiple times to write different parts of data structures.

@jonasschnelli (Contributor)

I have just started a fresh node (mainnet / this PR) on a Win 8.1 Pro machine with McAfee antivirus enabled (SSD / 2 GB RAM / 1.4 GHz Core i5)... will report soon.

@jonasschnelli (Contributor)

I have synced up to block 318301 and restarted bitcoin-qt twice (just to see whether stop/start/-checkblocks works). No problems with the database so far.

My node is still syncing (very slowly, several blocks per minute). But a strange problem appeared: the Qt console window says I'm on block 318301, while in debug.log I can see that I'm at 318613. The peers window is updating, but with missing data.

[screenshot: main window]

Peers: [screenshot]

Task Manager: [screenshot]

@laanwj (Member, Author) commented Nov 3, 2015

@jonasschnelli that's a known issue and unrelated to these changes, see #5664

@NicolasDorier (Contributor)

Pro tip for the corruption problem: attach a disk, index the blockchain data onto it, and unplug it abruptly.

I had corruption every time my cat played with the ethernet cable between my computer and my network disk.

I have not managed to reproduce the problem so far by just killing the process; I'll try again once I find a spare disk somewhere.

@sipa (Member) commented Nov 4, 2015 via email

@NicolasDorier (Contributor)

A mapped drive, just pointing to an SMB share. I'm not at my office, so as soon as I can I'll try abruptly unplugging an external drive instead of a real network disk.

@sipa (Member) commented Nov 4, 2015

Well, it's good to know that it seems to work even with a network filesystem. But network filesystems usually offer much weaker synchronization guarantees, so they are often inappropriate for databases with strong consistency requirements.

@NicolasDorier (Contributor)

Yes, I agree; this is why I later switched to a normal VM on Microsoft Azure. But since I moved to a VM, the machines are sometimes rebooted automatically for maintenance, and the database has been corrupted several times.

The good thing about a mapped disk is that, thanks to the weaker synchronization guarantees, it makes data-corruption problems easier to reproduce! :)

@laanwj (Member, Author) commented Nov 4, 2015

The issue that this change tries to improve is extremely easy to reproduce already: it happens nearly every time bitcoind or bitcoin-qt running on Windows crashes, or Windows itself crashes, or the power is pulled. This happens with local filesystems on drives connected through SATA.

The PR does not purport to fix issues with external hard disks, network filesystems, corrupting cables, felines, and other additional sources of complications. Being robust to those would be great, if realistically possible, but it is not the immediate goal here.

@jonasschnelli (Contributor)

I have powered down (unclean sudden shutdown) my Win 8.1 VM and restarted Windows and Bitcoin-Qt. Bitcoin-Qt started without issues and is at roughly the same height (353900). I'll now wait a couple of minutes, then power down (unplug) the VM host system (a Mac mini) and see what happens.

Would it be worth doing the same procedure with current master WITHOUT this PR, to compare?

@laanwj (Member, Author) commented Nov 4, 2015

> Would it be worth to do the same procedure with the current master WITHOUT this PR to compare it?

Sure. If you have a VM environment where you can take a snapshot, that's helpful: you can return to it after messing up the database.

@jonasschnelli (Contributor)

I made a snapshot before the first sudden shutdown, then powered down the host by unplugging the power cable... the best chain could be activated and the node continued syncing. I will do some more shutdowns and see whether anything breaks, then switch to current master and try again.

@jonasschnelli (Contributor)

Now, after a "power off" (sudden shutdown), the db is corrupted (still using this PR).

[screenshot: corruption error dialog]

"Rebuild the block databases" ends with "error opening database":
[screenshot]

Log:
[screenshot]

@laanwj (Member, Author) commented Nov 4, 2015

Re: "bad undo data", see #6923; it is a potential issue, but not something that could be solved by this PR (as it isn't a leveldb problem).

@jonasschnelli (Contributor)

I have now restored my snapshot several times and done some sudden power-downs. About 50% of the time it ends up with a corrupt database (100% of those cases with "bad undo data").

Will now try to run my snapshot with current master, without this PR.

@laanwj (Member, Author) commented Nov 4, 2015

That's strange: I only managed to get the undo-data error once in all my tries, whereas without this PR it produced consistent leveldb errors.

@gmaxwell (Contributor) commented Nov 4, 2015

Difference in VM flushing behavior, I guess: it may just be that the VM buffers and reorders writes and ignores fsync.

Although... we appear to be missing a FileCommit in UndoWriteToDisk. I think we need to have synced the blocks and undo data before calling the insert.

It would probably be better for performance if it wrote the block, then the undo data, and then did the syncs on both, however.

Edit: oh, never mind, FlushBlockFile hits the undo file too. :-/

@jonasschnelli (Contributor)

With the current master (without this PR), after every sudden shutdown I get "Corruption: error in middle of record".

[screenshot]

This PR looks like an improvement, but I get #6923 for roughly 50% of power-off shutdowns.

@laanwj (Member, Author) commented Nov 4, 2015

If it improves things, that is good.

Filed this for the leveldb repository: bitcoin-core/leveldb-old#9

@laanwj (Member, Author) commented Nov 4, 2015

Closing the pull request here, as it has been moved to the leveldb repository. It should come back to the bitcoin repository in a subtree update.
