Split storage ranges to parallelize execution #7733
Conversation
Please wait for @asdacap review too
src/Nethermind/Nethermind.Synchronization/SnapSync/ProgressTracker.cs
@@ -29,6 +29,7 @@ public class ProgressTracker : IDisposable
    public const int HIGH_STORAGE_QUEUE_SIZE = STORAGE_BATCH_SIZE * 100;
    private const int CODES_BATCH_SIZE = 1_000;
    public const int HIGH_CODES_QUEUE_SIZE = CODES_BATCH_SIZE * 5;
+   private const uint StorageRangeSplitFactor = 2;
have you experimented with more radical splitting factors? 4? 8? 16? 32?
src/Nethermind/Nethermind.Synchronization/SnapSync/ProgressTracker.cs
Can you add a unit test?
Force-pushed 525c8aa to f734316.
Added.
Provided some comments. This seems similar to Paprika's ref counting.
I don't know how often the concurrent dictionary of locks is used, but if it is not contended we could replace the ConcurrentDictionary with a lock (potentially a ReaderWriterLockSlim) and some global atomicity. It might be the case that, due to the construction of the snap sync, we know that once an account is processed it will never appear again. If so, that can be leveraged to know when to remove an entry from the dictionary (the current condition).
In Paprika's case, read-only transactions use just a lock and a simple dictionary. Nothing fancy, but because each tx is ref-counted and cannot be resurrected, once its count hits 0 it takes the lock on the dictionary and removes itself.
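A minimal sketch of the pattern described above, with hypothetical names (this is not Paprika's actual API): entries are ref-counted, never resurrected, and remove themselves from a plain dictionary under a single lock once their count drops to zero.

using System.Collections.Generic;

public sealed class RefCountedLocks<TKey> where TKey : notnull
{
    private sealed class Entry { public int Count; }

    private readonly object _sync = new();
    private readonly Dictionary<TKey, Entry> _entries = new();

    public void Acquire(TKey key)
    {
        lock (_sync)
        {
            if (!_entries.TryGetValue(key, out Entry entry))
            {
                entry = new Entry();
                _entries[key] = entry;
            }

            entry.Count++;
        }
    }

    public void Release(TKey key)
    {
        lock (_sync)
        {
            if (_entries.TryGetValue(key, out Entry entry) && --entry.Count == 0)
            {
                // Count hit zero: the entry removes itself and is never resurrected
                // (a later Acquire creates a fresh entry).
                _entries.Remove(key);
            }
        }
    }
}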
{
    lock (_lock)
    {
        Counter += value;
The counter seems not to be correlated at all with the ExecuteSafe method. I'd replace lock (_lock) with Interlocked.Add for both the Increment and Decrement methods, unless in Decrement you want to be sure that no ExecuteSafe is happening at the moment.
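A sketch of the suggestion above, assuming the counter can be backed by a plain int field (class and member names are illustrative): Increment and Decrement use Interlocked instead of lock (_lock).

using System.Threading;

public class InterlockedCounterSketch
{
    private int _counter;

    public int Counter => Volatile.Read(ref _counter);

    public void Increment(uint value = 1) => Interlocked.Add(ref _counter, (int)value);

    public void Decrement() => Interlocked.Decrement(ref _counter);
}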
@@ -57,6 +59,8 @@ public class ProgressTracker : IDisposable
    private ConcurrentQueue<ValueHash256> CodesToRetrieve { get; set; } = new();
    private ConcurrentQueue<AccountWithStorageStartingHash> AccountsToRefresh { get; set; } = new();

+   private readonly ConcurrentDictionary<ValueHash256, IStorageRangeLock> _storageRangeLocks = new();
+   private readonly StorageRangeLockPassThrough _emptyStorageLock = new();
So this is the one that passes through all the actions and is treated as the default. Some comment would be nice, I believe.
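A hypothetical sketch of the kind of doc comment that could be added; the method bodies are a guess based on the interface quoted above and may differ from the PR's actual implementation. ValueHash256 and IStorageRangeLock are as shown in the diff.

using System;

/// <summary>
/// No-op lock used as the default for accounts whose storage range has not been split.
/// Every action is executed directly, without synchronization, so the common (non-split)
/// path pays no locking cost.
/// </summary>
public class StorageRangeLockPassThrough : IStorageRangeLock
{
    public IStorageRangeLock Increment(uint value = 1) => this;

    public void Decrement(ValueHash256 key, bool removeFromOwner) { }

    public void ExecuteSafe(Action action) => action();
}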
if (!_storageRangeLocks.TryGetValue(accountPath, out IStorageRangeLock lockInfo)) return;
lockInfo.Increment(number);
What @LukaszRozmej said above. What if you call IncrementStorageRangeLockIfExists, get no lockInfo, and it is immediately followed by IncrementStorageRangeLock that does the update? Does it mean that something is broken here?
public interface IStorageRangeLock
{
    IStorageRangeLock Increment(uint value = 1);
    void Decrement(ValueHash256 key, bool removeFromOwner);
Why is key needed here? Should it not be captured in the ctor of the lock, so that we just have Decrement(bool removeFromOwner)? You get the lock by key, so the key should not be needed here, I think.
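A sketch of the alternative shape suggested above (hypothetical type names, not the PR's code): the key and the owning dictionary are captured in the constructor, so callers only pass removeFromOwner. Increment is made void here, as also suggested further down.

using System;
using System.Collections.Concurrent;
using System.Threading;

public interface IStorageRangeLockAlt
{
    void Increment(uint value = 1);
    void Decrement(bool removeFromOwner);
    void ExecuteSafe(Action action);
}

public class StorageRangeLockAlt(
    ValueHash256 key,
    uint counter,
    ConcurrentDictionary<ValueHash256, IStorageRangeLockAlt> owner) : IStorageRangeLockAlt
{
    private readonly object _sync = new();
    private int _counter = (int)counter;

    public void Increment(uint value = 1) => Interlocked.Add(ref _counter, (int)value);

    public void Decrement(bool removeFromOwner)
    {
        if (Interlocked.Decrement(ref _counter) == 0 && removeFromOwner)
        {
            // The key is known from construction, so no argument is needed here.
            owner.TryRemove(key, out _);
        }
    }

    public void ExecuteSafe(Action action)
    {
        lock (_sync)
        {
            action();
        }
    }
}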
public interface IStorageRangeLock
{
    IStorageRangeLock Increment(uint value = 1);
Why does Increment return the lock object? My feeling is that it's there to satisfy the ConcurrentDictionary. I'd keep it simple and make it void.
        return result;
    }
    finally
    {
        rangeLock?.Decrement(pathWithAccount.Path, canRemoveFurtherLock);
Why do we need to take care of whether a lock should be removed or not? Should it not be the case that when the counter hits 0, we always remove?
    void ExecuteSafe(Action action);
}

public class StorageRangeLock(uint counter, ConcurrentDictionary<ValueHash256, IStorageRangeLock> owner)
I'd just use a sharded lock to avoid such complications.
Also, you can just add _pivot.UpdatedStorages.Add(pathWithAccount.Path); for the split storage so that healing will fix it. But that limits the number of storages that can be split, though.
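A minimal sketch of a sharded (striped) lock, as suggested above; this is a hypothetical helper, not existing Nethermind code. A fixed array of lock objects is indexed by the key's hash, so no per-account entries ever have to be added, counted, or removed.

public sealed class ShardedLock
{
    private readonly object[] _shards;

    public ShardedLock(int shardCount = 64)
    {
        _shards = new object[shardCount];
        for (int i = 0; i < shardCount; i++)
        {
            _shards[i] = new object();
        }
    }

    // Map any key to one of the fixed shards; collisions only mean occasional extra contention.
    public object For<TKey>(TKey key) where TKey : notnull
        => _shards[(key.GetHashCode() & int.MaxValue) % _shards.Length];
}

// Usage (illustrative): lock (shardedLocks.For(pathWithAccount.Path)) { /* stitch boundaries */ }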
// This will work if all StorageRange requests share the same AccountWithPath object which seems to be the case.
// If this is not true, StorageRange request should be extended with a lock object.
// That lock object should be shared between all other StorageRange requests for same account.
lock (account.Account)
Main change here!
The main point of having this complexity around locking was to lock only when necessary - that is, when a split is done and there is a chance of actually having parallel processing of storage ranges for the same account. If the overhead is negligible, I'm all for it - it's the simplest.
After a conversation with @Scooletz yesterday: there could be a way to optimize it easily if there is significant overhead. Stitching only happens when there are boundary proof nodes - so we can lock only if that happens.
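A hypothetical sketch of the optimization described above; proofs, processRange, and the lock object are illustrative names, not the PR's actual API. The lock is taken only when boundary proof nodes are present, because stitching is the only step that needs per-account mutual exclusion.

using System;

static class StorageRangeProcessingSketch
{
    public static void Process(object accountLock, byte[][] proofs, Action processRange)
    {
        bool needsStitching = proofs is { Length: > 0 };

        if (needsStitching)
        {
            lock (accountLock)
            {
                processRange(); // writes leaves and stitches boundary nodes
            }
        }
        else
        {
            processRange(); // no boundary proofs, no stitching, no lock needed
        }
    }
}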
A lock is super cheap compared to I/O, so locking even when not necessary shouldn't degrade performance. And you still paid for the ConcurrentDictionary in your solution, so it might be comparable.
Should be fine. That said, this code path is run quite a lot; might want to double check whether the lock is significant.
If there is no contention, a lock is very fast; if there is contention, well, we still have to lock. Keep in mind that in the previous solution we locked on potential contention while also accessing the ConcurrentDictionary at least once even with no contention. https://stackoverflow.com/a/72479230
// This will work if all StorageRange requests share the same AccountWithPath object which seems to be the case.
// If this is not true, StorageRange request should be extended with a lock object.
// That lock object should be shared between all other StorageRange requests for same account.
lock (account.Account)
I don't like the account-based locking and using an object that might be locked elsewhere or used for different purposes. Locking should have specific scoping and should be reasoned about locally.
Another take is that this lock is needed, correct me @damian-orzechowski if I'm wrong, only if we know that there are no pending requests. We could extract the check from StitchBoundaries and assert it explicitly before the lock.
We could create an artificial lock object and pass it around, but I don't really see the point other than increasing memory pressure and complicating the code. We added comments! :)
Done is probably better than perfect here.
If we want to refactor, it would be to have the parent request directly available when the child request is being created, so that the connection is obvious and not hidden in multiple method calls.
Done some refactoring here: #7930. The problem is that passing around the whole StorageRange request also means passing an index of the actual account being processed (accounts are processed one at a time), so it doesn't look very nice.
Co-authored-by: lukasz.rozmej <lukasz.rozmej@gmail.com>
More issues found - do not merge.
@damian-orzechowski Converted this PR to a draft for now.
Force-pushed c31f55b to f44de49.
Not sure if ready for a review. If yes, please re-request.
Changes
At the end of the state ranges sync, when only large storage tries are left to be synced, snap batches are processed sequentially. This change splits the range left to be requested for a given storage trie if the processed data covers less than 50% of the requested range (maybe this shouldn't be 50%). The aim is to speed up the processing of large storage tries, especially towards the end of the snap ranges phase, when progress gets "stuck" at 100%. This addresses the same problem as #7688, but with a slightly different approach.
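A simplified illustration of the idea described above, not the PR's actual code: if the response covered less than 50% of the requested range, the remaining part is split into splitFactor sub-ranges so they can be requested and processed in parallel. The 256-bit storage path keyspace is modelled with BigInteger for brevity; all names are illustrative.

using System.Collections.Generic;
using System.Numerics;

static class StorageRangeSplitSketch
{
    public static List<(BigInteger Start, BigInteger End)> SplitRemaining(
        BigInteger requestedStart,
        BigInteger requestedEnd,
        BigInteger lastProcessedPath,
        int splitFactor = 2)
    {
        var result = new List<(BigInteger Start, BigInteger End)>();

        BigInteger requested = requestedEnd - requestedStart;
        BigInteger processed = lastProcessedPath - requestedStart;

        // At least half of the requested range was covered: keep a single follow-up range.
        if (processed * 2 >= requested)
        {
            result.Add((lastProcessedPath, requestedEnd));
            return result;
        }

        // Less than half was covered: split the remaining range into equal sub-ranges.
        BigInteger remaining = requestedEnd - lastProcessedPath;
        BigInteger step = remaining / splitFactor;
        BigInteger start = lastProcessedPath;

        for (int i = 0; i < splitFactor; i++)
        {
            BigInteger end = i == splitFactor - 1 ? requestedEnd : start + step;
            result.Add((start, end));
            start = end;
        }

        return result;
    }
}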
Tested on OP-Mainnet - snap sync time:
Types of changes
What types of changes does your code introduce?
Testing
Requires testing
If yes, did you write tests?
Notes on testing
Optional. Remove if not applicable.
Documentation
Requires documentation update
If yes, link the PR to the docs update or the issue with the details labeled docs. Remove if not applicable.
Requires explanation in Release Notes