lib/model: Clean up index handler life cycle (fixes #9021) #9038
Conversation
@@ -1792,7 +1793,7 @@ func (m *model) Closed(conn protocol.Connection, err error) {
	delete(m.remoteFolderStates, device)
	closed := m.closed[device]
	delete(m.closed, device)
-	delete(m.indexHandlers, device)
+	m.indexHandlers.RemoveAndWait(device, 0)
... and I'm 99% sure that "...AndWait" is safe here, but not 100%
Also my understanding :)
One potentially small downside of the map-and-service construct: we can't move the waiting part outside of the lock anymore - worst case the handler is still writing an index update, received just before the conn was closed, to the db, and thus we block pmut for a while longer than necessary. Doesn't seem very relevant though; if the db is so slow that it matters, operation won't be smooth anyway.
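For illustration, a self-contained toy of that worst case (nothing here is code from this PR; the handler's final db write is simulated with a sleep):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var pmut sync.Mutex
	handlerDone := make(chan struct{})

	// Simulates an index handler finishing one last db write for an
	// update that arrived just before the connection was closed.
	go func() {
		time.Sleep(200 * time.Millisecond)
		close(handlerDone)
	}()

	start := time.Now()
	pmut.Lock()
	<-handlerDone // stands in for indexHandlers.RemoveAndWait(device, 0)
	pmut.Unlock()

	// pmut was held for roughly the whole 200 ms, not just the map cleanup.
	fmt.Println("pmut held for", time.Since(start))
}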
// already a service at the given key, it is removed first.
func (s *serviceMap[K, S]) Add(k K, v S) {
	if tok, ok := s.tokens[k]; ok {
		// There is already a service at this key, remove it first.
Alternatively, we could call this incorrect usage and panic, since it's likely to be caused by a missing Remove at some point...
Or a middle ground: Panic if it's an RC, send a failure event (and thus report) otherwise.
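A sketch of that middle ground, reusing the services and tokens fields visible in this diff; the supervisor field, the isCandidate flag and the reportFailure helper are hypothetical stand-ins, not identifiers from this PR:

func (s *serviceMap[K, S]) Add(k K, v S) {
	if tok, ok := s.tokens[k]; ok {
		// Adding over an existing key almost certainly means a Remove is
		// missing somewhere, so make it loud.
		if isCandidate {
			panic(fmt.Sprintf("bug: adding service over existing key %v", k))
		}
		reportFailure(fmt.Sprintf("adding service over existing key %v", k))
		s.supervisor.RemoveAndWait(tok, 0)
	}
	s.services[k] = v
	s.tokens[k] = s.supervisor.Add(v)
}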
At first glance the service map has more [...]
}

// Each calls the given function for each service in the map.
func (s *serviceMap[K, S]) Each(fn func(K, S)) {
We do this instead of exposing the map and just looping over it at the caller, to prevent the caller from messing with the map?
(Also I want range over functions: golang/go#61405)
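As an aside, a sketch of what that could look like if range over functions lands (iter.Seq2 is the signature from that proposal / the Go iter package; the services field is taken from this diff; this is not code from this PR):

func (s *serviceMap[K, S]) All() iter.Seq2[K, S] {
	return func(yield func(K, S) bool) {
		for k, v := range s.services {
			if !yield(k, v) {
				return
			}
		}
	}
}

// which callers could then consume as:
//	for k, svc := range s.All() { ... }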
Indeed. I considered returning a copy of the map, but this seemed fine as well...
(lib/svcutil) Agreed, if and when it's used outside of the one package at least...
just some nits, lgtm otherwise
func (s *serviceMap[K, S]) Get(k K) (v S, ok bool) {
	v, ok = s.services[k]
	return
Just curiosity: Is there a specific reason for the named return values?
I think I thought of it as documentation of what the bool is (though that's hardly necessary, as it's a super common pattern), and the method body becomes a few characters shorter. Possibly it evolved this way from an earlier iteration without the OK bool...
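For comparison, the equivalent without named returns (just a style sketch, not a suggested change):

func (s *serviceMap[K, S]) Get(k K) (S, bool) {
	v, ok := s.services[k]
	return v, ok
}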
lib/model/indexhandler.go
Outdated
r.sup.RemoveAndWait(is.token, 0)
delete(r.indexHandlers, folder.ID)
}
r.indexHandlers.Remove(folder.ID)
I think this is also one of the "should never happen" cases, but if it did, wouldn't we want to RemoveAndWait to ensure there aren't two index handlers at the same time?
My bug, we did ...AndWait before and should do so now as well 👍
lib/model/indexhandler.go
Outdated
@@ -389,48 +385,35 @@ type indexHandlerFolderState struct {
	runner service
}

-func newIndexHandlerRegistry(conn protocol.Connection, downloads *deviceDownloadState, closed chan struct{}, parentSup *suture.Supervisor, evLogger events.Logger) *indexHandlerRegistry {
+func newIndexHandlerRegistry(conn protocol.Connection, downloads *deviceDownloadState, closed chan struct{}, evLogger events.Logger) *indexHandlerRegistry {
The closed arg isn't used anymore.
Co-authored-by: Simon Frei <freisim93@gmail.com>
This doesn't fix #9021, are you sure your attempt does?
(latest build run on Windows)
Frankly the test seems kinda odd and even has a comment related to this panic, which it tries to work around by adding a new connection for the same device ID for every test iteration it does... I think this change (and yours) is sound, but I think the test does something else unrealistic/wrong for different reasons.
I am sure it makes it less likely at least - I get that panic within a few runs on main and not at all on my PR with [...].

Also regarding the test being weird/potentially bogus: I don't disagree - it still shouldn't be able to trigger that panic. I don't see the difference. I switched my PR to also make the index handler registry a service and control it from model (in an ugly quick and dirty way just for debugging), and the test still doesn't fail.
Can't say I understand why, but waiting for the index handler to stop outside of the [...]
That's bizarre. I would like to understand why that has any effect...
FWIW, this also "fixes" it for me;

diff --git a/lib/model/model.go b/lib/model/model.go
index 0fafd0d03..aad3a45da 100644
--- a/lib/model/model.go
+++ b/lib/model/model.go
@@ -1795,6 +1795,7 @@ func (m *model) Closed(conn protocol.Connection, err error) {
 	delete(m.closed, device)
 	m.indexHandlers.RemoveAndWait(device, 0)
 	m.pmut.Unlock()
+	runtime.Gosched()
 	m.progressEmitter.temporaryIndexUnsubscribe(conn)
 	m.deviceDidClose(device, time.Since(conn.EstablishedAt()))

I'll see if I can figure this out later on, something seems to be async that shouldn't...
I tried to figure out the sequence of events leading to the panic, but I can't say I really have a clear picture. However, it definitely involves late effects of the previous test case (config changes). And I am now more or less convinced that this test is sufficiently messy to show that while the condition that leads to the panic holds most of the time, there's actually no guarantee it does: a connection might be closed right after a cluster config is received, so the model methods handling the two events can race on the pmut lock such that the connection gets closed/removed before the cluster config is handled.

I think we should just return an error in that case, same as with the next condition that checks the configuration. That will still fail this test, but that's clearly a case of a deficient test (it probably should be a bunch of separate sub-tests that either get their own state or properly clean up after themselves).

(Also, none of this explains the interesting question of why [...]
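A rough sketch of "just return an error in that case", with approximate names (the Get accessor matches this PR's serviceMap; errConnectionClosed and the shape of the surrounding cluster-config handler are assumptions, not actual code):

indexHandlerRegistry, ok := m.indexHandlers.Get(deviceID)
if !ok {
	// The connection was already closed and cleaned up before we got
	// around to its cluster config; treat that as stale input and bail
	// out instead of panicking.
	return errConnectionClosed
}
// ... continue normal cluster-config handling via indexHandlerRegistry ...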
There, I fixed it, I believe. My interpretation is that this was a test-only failure. The thing is, we may under some circumstances try to close a connection more than once, from CommitConfiguration (device is now paused) and AddConnection (we're adding a new connection for the same device) or something. This happens because starting a close is just calling [...]. Adding the Gosched or channel stuff or a wait gave things time to close before adding the next connection.
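A generic shape for "only start the close once, and let anyone who needs the connection to be fully gone wait for that" (a sketch of the idea using the standard library only, not the code that landed in this PR):

type closeOnceConn struct {
	once   sync.Once
	closed chan struct{} // closed when teardown has actually finished
}

func newCloseOnceConn() *closeOnceConn {
	return &closeOnceConn{closed: make(chan struct{})}
}

// Close may be called any number of times; teardown starts only once.
func (c *closeOnceConn) Close() {
	c.once.Do(func() {
		go func() {
			// ... real teardown work goes here ...
			close(c.closed)
		}()
	})
}

// Wait blocks until teardown has completed, e.g. before registering a
// replacement connection for the same device.
func (c *closeOnceConn) Wait() {
	<-c.closed
}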
Ah yes and my statement above that [...]
* main: (121 commits)
  build: Update dependencies
  gui: Remove footer and move links to header (fixes syncthing#5607) (syncthing#9067)
  gui: Fix lastSeenDays error due to undefined deviceStats when adding new devices (ref syncthing#8730) (syncthing#9066)
  gui: Automatically select device ID on click (ref syncthing#8544) (syncthing#9065)
  gui: Prevent modifications when saving changes (fixes syncthing#9019) (syncthing#9063)
  gui: Show in GUI if limitBandwidthInLan is enabled (syncthing#9062)
  lib/upgrade: Enable HTTP/2 for upgrade checks (syncthing#9060)
  lib/discover: Enable HTTP/2 for global discovery requests (syncthing#9059)
  cmd/stdiscosrv: Streamline context handling
  cmd/stdiscosrv: Explicitly enable HTTP/2
  gui, man, authors: Update docs, translations, and contributors
  cmd/stdiscosrv: Separate HTTPS and replication certificates
  cmd/stdiscosrv: Use larger database settings
  cmd/stdiscosrv: Modernise TLS settings, remove excessive HTTP logging
  cmd/stdiscosrv: Serve compressed responses
  lib/connections: Allow IPv6 ULA in discovery announcements (fixes syncthing#7456) (syncthing#9048)
  lib/beacon: Check FlagRunning (syncthing#9051)
  all: Remove lib/util package (syncthing#9049)
  lib/model: Clean up index handler life cycle (fixes syncthing#9021) (syncthing#9038)
  lib/osutil, lib/upnp: Check FlagRunning (fixes syncthing#8767) (syncthing#9047)
  ...
This is an alternative to #9035. There's a certain amount of overengineering here, possibly, but hear me out...
serviceMap
This is essentially a map of key -> service plus key -> Suture token, and some management to make sure that adding and removing things from the map does the right things for the services in question. This reduces boilerplate, since we have this pattern in a couple of places. The serviceMap is itself a Service that runs the children.

We can also use this serviceMap for the folder runners, but that's a large refactor-only diff that we can save for later...
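For readers who don't want to dig through the diff, roughly the shape being described, based on this description and the suture/v4 API (field names, the constructor and the missing error handling are simplifications; the actual serviceMap in the PR may differ):

import (
	"context"
	"time"

	"github.com/thejerf/suture/v4"
)

type serviceMap[K comparable, S suture.Service] struct {
	services   map[K]S
	tokens     map[K]suture.ServiceToken
	supervisor *suture.Supervisor
}

func newServiceMap[K comparable, S suture.Service](name string) *serviceMap[K, S] {
	return &serviceMap[K, S]{
		services:   make(map[K]S),
		tokens:     make(map[K]suture.ServiceToken),
		supervisor: suture.NewSimple(name),
	}
}

// Add starts the service under the supervisor and records both the
// service and its Suture token under the key.
func (s *serviceMap[K, S]) Add(k K, v S) {
	s.services[k] = v
	s.tokens[k] = s.supervisor.Add(v)
}

// RemoveAndWait stops the service at the key, blocking until it has
// returned, then forgets it.
func (s *serviceMap[K, S]) RemoveAndWait(k K, timeout time.Duration) {
	if tok, ok := s.tokens[k]; ok {
		s.supervisor.RemoveAndWait(tok, timeout)
	}
	delete(s.services, k)
	delete(s.tokens, k)
}

// Serve makes the serviceMap itself a suture.Service: running it runs
// all the children via the embedded supervisor.
func (s *serviceMap[K, S]) Serve(ctx context.Context) error {
	return s.supervisor.Serve(ctx)
}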