lib/model: Clean up index handler life cycle (fixes #9021) #9038
Conversation
@@ -1792,7 +1793,7 @@ func (m *model) Closed(conn protocol.Connection, err error) {
	delete(m.remoteFolderStates, device)
	closed := m.closed[device]
	delete(m.closed, device)
-	delete(m.indexHandlers, device)
+	m.indexHandlers.RemoveAndWait(device, 0)
... and I'm 99% sure that "...AndWait" is safe here, but not 100%
Also my understanding :)
One potentially small downside of the map-and-service construct: we can't move the waiting part outside of the lock anymore - worst case the handler is still writing an index update, received just before the conn was closed, to the db, and thus we block pmut for a while longer than necessary. Doesn't seem very relevant though; if the db is so slow that it matters, operation won't be smooth anyway.
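For illustration, a self-contained toy of that worst case (nothing here is code from this PR; the handler's final db write is simulated with a sleep):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var pmut sync.Mutex
	handlerDone := make(chan struct{})

	// Simulates an index handler finishing one last db write for an
	// update that arrived just before the connection was closed.
	go func() {
		time.Sleep(200 * time.Millisecond)
		close(handlerDone)
	}()

	start := time.Now()
	pmut.Lock()
	<-handlerDone // stands in for indexHandlers.RemoveAndWait(device, 0)
	pmut.Unlock()

	// pmut was held for roughly the whole 200 ms, not just the map cleanup.
	fmt.Println("pmut held for", time.Since(start))
}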
// already a service at the given key, it is removed first.
func (s *serviceMap[K, S]) Add(k K, v S) {
	if tok, ok := s.tokens[k]; ok {
		// There is already a service at this key, remove it first.
Alternatively, we could call this incorrect usage and panic, since it's likely to be caused by a missing Remove at some point...
Or a middle ground: Panic if it's an RC, send a failure event (and thus report) otherwise.
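A sketch of that middle ground, reusing the services and tokens fields visible in this diff; the supervisor field, the isCandidate flag and the reportFailure helper are hypothetical stand-ins, not identifiers from this PR:

func (s *serviceMap[K, S]) Add(k K, v S) {
	if tok, ok := s.tokens[k]; ok {
		// Adding over an existing key almost certainly means a Remove is
		// missing somewhere, so make it loud.
		if isCandidate {
			panic(fmt.Sprintf("bug: adding service over existing key %v", k))
		}
		reportFailure(fmt.Sprintf("adding service over existing key %v", k))
		s.supervisor.RemoveAndWait(tok, 0)
	}
	s.services[k] = v
	s.tokens[k] = s.supervisor.Add(v)
}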
At first glance the service map has more [...]
}

// Each calls the given function for each service in the map.
func (s *serviceMap[K, S]) Each(fn func(K, S)) {
We do this instead of exposing the map and just looping over it at the caller, to prevent the caller from messing with the map?
(Also I want range over functions: golang/go#61405)
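As an aside, a sketch of what that could look like if range over functions lands (iter.Seq2 is the signature from that proposal / the Go iter package; the services field is taken from this diff; this is not code from this PR):

func (s *serviceMap[K, S]) All() iter.Seq2[K, S] {
	return func(yield func(K, S) bool) {
		for k, v := range s.services {
			if !yield(k, v) {
				return
			}
		}
	}
}

// which callers could then consume as:
//	for k, svc := range s.All() { ... }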
Indeed. I considered returning a copy of the map, but this seemed fine as well...
(lib/svcutil) Agreed, if and when it's used outside of the one package at least...
just some nits, lgtm otherwise
func (s *serviceMap[K, S]) Get(k K) (v S, ok bool) {
	v, ok = s.services[k]
	return
Just curiosity: Is there a specific reason for the named return values?
I think I thought of it as documentation of what the bool is (though that's hardly necessary, as it's a super common pattern), and the method body becomes a few characters shorter. Possibly it evolved this way from an earlier iteration without the OK bool...
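For comparison, the equivalent without named returns (just a style sketch, not a suggested change):

func (s *serviceMap[K, S]) Get(k K) (S, bool) {
	v, ok := s.services[k]
	return v, ok
}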
lib/model/indexhandler.go
Outdated
r.sup.RemoveAndWait(is.token, 0)
delete(r.indexHandlers, folder.ID)
}
r.indexHandlers.Remove(folder.ID)
I think this is also one of the "should never happen" cases, but if it did, wouldn't we want to RemoveAndWait to ensure there aren't two index handlers at the same time?
My bug, we did ...AndWait before and should do so now as well 👍
lib/model/indexhandler.go
Outdated
@@ -389,48 +385,35 @@ type indexHandlerFolderState struct {
	runner service
}

-func newIndexHandlerRegistry(conn protocol.Connection, downloads *deviceDownloadState, closed chan struct{}, parentSup *suture.Supervisor, evLogger events.Logger) *indexHandlerRegistry {
+func newIndexHandlerRegistry(conn protocol.Connection, downloads *deviceDownloadState, closed chan struct{}, evLogger events.Logger) *indexHandlerRegistry {
The closed arg isn't used anymore.
Co-authored-by: Simon Frei <freisim93@gmail.com>
This doesn't fix #9021, are you sure your attempt does?
(latest build run on Windows)
Frankly the test seems kinda odd and even has a comment related to this panic, which it tries to work around by adding a new connection for the same device ID for every test iteration it does... I think this change (and yours) is sound, but I think the test does something else unrealistic/wrong for different reasons.
I am sure it makes it less likely at least - I get that panic within a few runs on main and not at all on my PR with [...].

Also regarding the test being weird/potentially bogus: I don't disagree - it still shouldn't be able to trigger that panic. I don't see the difference. I switched my PR to also make the index handler registry a service and control it from model (in an ugly quick and dirty way just for debugging), and the test still doesn't fail.
Can't say I understand why, but waiting for the index handler to stop outside of the [...]
That's bizarre. I would like to understand why that has any effect...
FWIW, this also "fixes" it for me;

diff --git a/lib/model/model.go b/lib/model/model.go
index 0fafd0d03..aad3a45da 100644
--- a/lib/model/model.go
+++ b/lib/model/model.go
@@ -1795,6 +1795,7 @@ func (m *model) Closed(conn protocol.Connection, err error) {
 	delete(m.closed, device)
 	m.indexHandlers.RemoveAndWait(device, 0)
 	m.pmut.Unlock()
+	runtime.Gosched()
 	m.progressEmitter.temporaryIndexUnsubscribe(conn)
 	m.deviceDidClose(device, time.Since(conn.EstablishedAt()))

I'll see if I can figure this out later on, something seems to be async that shouldn't...
I tried to figure out the sequence of events leading to the panic, but I can't say I really have a clear picture. However, it definitely involves late effects of the previous test case (config changes). And I am now more or less convinced that this test is sufficiently messy to show that while the condition that leads to the panic holds most of the time, there's actually no guarantee it does: a connection might be closed right after a cluster config is received, so the model methods handling the two events can race on the pmut lock such that the connection gets closed/removed before the cluster config is handled.

I think we should just return an error in that case, same as with the next condition that checks the configuration. That will still fail this test, but that's clearly a case of a deficient test (it probably should be a bunch of separate sub-tests that either get their own state or properly clean up after themselves).

(Also, none of this explains the interesting question of why [...]
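A rough sketch of "just return an error in that case", with approximate names (the Get accessor matches this PR's serviceMap; errConnectionClosed and the shape of the surrounding cluster-config handler are assumptions, not actual code):

indexHandlerRegistry, ok := m.indexHandlers.Get(deviceID)
if !ok {
	// The connection was already closed and cleaned up before we got
	// around to its cluster config; treat that as stale input and bail
	// out instead of panicking.
	return errConnectionClosed
}
// ... continue normal cluster-config handling via indexHandlerRegistry ...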
There, I fixed it, I believe. My interpretation is that this was a test-only failure. The thing is, we may under some circumstances try to close a connection more than once, from CommitConfiguration (device is now paused) and AddConnection (we're adding a new connection for the same device) or something. This happens because starting a close is just calling [...]. Adding the Gosched or channel stuff or a wait gave things time to close before adding the next connection.
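A generic shape for "only start the close once, and let anyone who needs the connection to be fully gone wait for that" (a sketch of the idea using the standard library only, not the code that landed in this PR):

type closeOnceConn struct {
	once   sync.Once
	closed chan struct{} // closed when teardown has actually finished
}

func newCloseOnceConn() *closeOnceConn {
	return &closeOnceConn{closed: make(chan struct{})}
}

// Close may be called any number of times; teardown starts only once.
func (c *closeOnceConn) Close() {
	c.once.Do(func() {
		go func() {
			// ... real teardown work goes here ...
			close(c.closed)
		}()
	})
}

// Wait blocks until teardown has completed, e.g. before registering a
// replacement connection for the same device.
func (c *closeOnceConn) Wait() {
	<-c.closed
}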
Ah yes and my statement above that [...]
* main: (121 commits)
  build: Update dependencies
  gui: Remove footer and move links to header (fixes syncthing#5607) (syncthing#9067)
  gui: Fix lastSeenDays error due to undefined deviceStats when adding new devices (ref syncthing#8730) (syncthing#9066)
  gui: Automatically select device ID on click (ref syncthing#8544) (syncthing#9065)
  gui: Prevent modifications when saving changes (fixes syncthing#9019) (syncthing#9063)
  gui: Show in GUI if limitBandwidthInLan is enabled (syncthing#9062)
  lib/upgrade: Enable HTTP/2 for upgrade checks (syncthing#9060)
  lib/discover: Enable HTTP/2 for global discovery requests (syncthing#9059)
  cmd/stdiscosrv: Streamline context handling
  cmd/stdiscosrv: Explicitly enable HTTP/2
  gui, man, authors: Update docs, translations, and contributors
  cmd/stdiscosrv: Separate HTTPS and replication certificates
  cmd/stdiscosrv: Use larger database settings
  cmd/stdiscosrv: Modernise TLS settings, remove excessive HTTP logging
  cmd/stdiscosrv: Serve compressed responses
  lib/connections: Allow IPv6 ULA in discovery announcements (fixes syncthing#7456) (syncthing#9048)
  lib/beacon: Check FlagRunning (syncthing#9051)
  all: Remove lib/util package (syncthing#9049)
  lib/model: Clean up index handler life cycle (fixes syncthing#9021) (syncthing#9038)
  lib/osutil, lib/upnp: Check FlagRunning (fixes syncthing#8767) (syncthing#9047)
  ...
This is an alternative to #9035. There's a certain amount of overengineering here, possibly, but hear me out...
serviceMap
This is essentially a map of key -> service plus key -> Suture token, and some management to make sure that adding and removing things from the map does the right things for the services in question. This reduces boilerplate, since we have this pattern in a couple of places. The serviceMap is itself a Service that runs the children.

We can also use this serviceMap for the folder runners, but that's a large refactor-only diff that we can save for later...
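For readers who don't want to dig through the diff, roughly the shape being described, based on this description and the suture/v4 API (field names, the constructor and the missing error handling are simplifications; the actual serviceMap in the PR may differ):

import (
	"context"
	"time"

	"github.com/thejerf/suture/v4"
)

type serviceMap[K comparable, S suture.Service] struct {
	services   map[K]S
	tokens     map[K]suture.ServiceToken
	supervisor *suture.Supervisor
}

func newServiceMap[K comparable, S suture.Service](name string) *serviceMap[K, S] {
	return &serviceMap[K, S]{
		services:   make(map[K]S),
		tokens:     make(map[K]suture.ServiceToken),
		supervisor: suture.NewSimple(name),
	}
}

// Add starts the service under the supervisor and records both the
// service and its Suture token under the key.
func (s *serviceMap[K, S]) Add(k K, v S) {
	s.services[k] = v
	s.tokens[k] = s.supervisor.Add(v)
}

// RemoveAndWait stops the service at the key, blocking until it has
// returned, then forgets it.
func (s *serviceMap[K, S]) RemoveAndWait(k K, timeout time.Duration) {
	if tok, ok := s.tokens[k]; ok {
		s.supervisor.RemoveAndWait(tok, timeout)
	}
	delete(s.services, k)
	delete(s.tokens, k)
}

// Serve makes the serviceMap itself a suture.Service: running it runs
// all the children via the embedded supervisor.
func (s *serviceMap[K, S]) Serve(ctx context.Context) error {
	return s.supervisor.Serve(ctx)
}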