-
Notifications
You must be signed in to change notification settings - Fork 387
Rework graceful shutdown using IPROTO_EVENT #6813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e152cc6
to
b374d46
Compare
|
@Gerold103 This patch completely reworks the implementation of the graceful shutdown (the procedure stays the same, but the code is moved around and |
b374d46
to
c99ecbf
Compare
@@ -1411,10 +1411,16 @@ iproto_connection_on_output(ev_loop *loop, struct ev_io *watcher, | |||
ev_io_start(loop, &con->output); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, the test in the first commit passes on my machine even without the fix. Does it fail on your machine? Except when I run it too many times like this python test-run.py $(yes 6818 | head -100)
, but then it hangs even with the patch. There was a problem with luatest and too many runs in parallel, at least on Mac. Instances are not all killed in the end and it randomly hangs. So the hang probably isn't related to the patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my machine it didn't fail either. I slightly tuned the test - now it fails all the time without the patch and passes with the patch. I couldn't reproduce the hang running the test in parallel - looks like a test-run issue after all.
diff
diff --git a/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua b/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua
index 8bf44e18eb5e..d5675061d1bd 100644
--- a/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua
+++ b/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua
@@ -8,7 +8,7 @@ g.before_all = function()
g.server = server:new({
alias = 'master',
box_cfg = {
- net_msg_max = 16,
+ net_msg_max = 4,
}
})
g.server:start()
@@ -31,13 +31,13 @@ g.test_iproto_resumes_input_if_no_output = function()
local conn = net.connect(g.server.net_box_uri)
conn:watch('foo', function() end)
for _ = 1, loop_count do
- conn:eval('return true')
+ conn:eval([[require('fiber').sleep(0.001)]])
end
conn:close()
on_done:put(true)
end)
end
for _ = 1, fiber_count do
- t.assert(on_done:get())
+ t.assert(on_done:get(10))
end
end
IPROTO triggers input processing after sending output, provided the output channel isn't clogged, which totally makes sense. There's one nuance here: currently, input is triggered only if something was written to the output channel. Before the IPROTO_WATCH request type was added by commit 4be5de4 ("iproto: add watchers support"), every request had a response so it worked just fine. The new request type is special - the server doesn't reply to it. As a result, if input was blocked upon reaching the box.cfg.net_msg_max limit, it might never get resumed, because input processing might never be triggered. Fix this issue by triggering input processing even if there was no output to flush. No user is affected, because IPROTO_WATCH hasn't been officially released yet so neither changelog nor doc is required. Closes tarantool#6818 NO_CHANGELOG=bug not released NO_DOC=NO_CHANGELOG
This commit fixes the following assertion failure that occurrs on an attempt to register a watcher for a connection to a server that doesn't support watchers (e.g. 2.8): src/box/lua/net_box.c:2283: netbox_transport_dispatch_response: Assertion `transport->inprogress_request_count > 0' failed. The problem is that an IPROTO_WATCH request isn't accounted in inprogress_request_count, because the server isn't supposed to reply to it. However, if the server doesn't support the request type, it will reply to it with an error, breaking the assumption made by the client. We fix the assertion itself by explicitly excluding requests with sync=0 from accounting on the client side (IPROTO_WATCH has sync=0), because such requests aren't supposed to have a matching reply. Also, to avoid errors on the server side, we make the client code not send IPROTO_WATCH and IPROTO_WATCH if the server doesn't set the 'watchers' feature bit in reply to the IPROTO_ID request from the client. No user is affected, because IPROTO_WATCH hasn't been officially released yet so neither changelog nor doc is required. Closes tarantool#6819 NO_CHANGELOG=bug not released NO_DOC=NO_CHANGELOG
c99ecbf
to
90db35d
Compare
Found another bug in the watcher infrastructure, because of which vshard and cartridge tests fail: #6824. |
90db35d
to
1d2fef7
Compare
It's a system fiber. Killing it might result in crashing Tarantool. Integration tests would fail without this patch after tarantool/tarantool#6813 is merged.
Currently, to resubscribe registered watchers net.box installs an on_connect trigger that writes IPROTO_WATCH requests to the send buffer. The requests will be sent by the net.box state machine once all on_connect triggers return. The problem is this trigger may run before a user-defined trigger that closes the connection, in which case close() will hang forever, because since tarantool#6338, the close method blocks until the send buffer is emptied, while it can't be emptied in case close is called by the state machine itself (see tarantool#5358). Let's fix this issue by making close() omit blocking if called from the state machine fiber. No user is affected, because IPROTO_WATCH hasn't been officially released yet so neither changelog nor doc is required. Closes tarantool#6824 NO_CHANGELOG=bug not released NO_DOC=NO_CHANGELOG
This patch drops IPROTO_SHUTDOWN and IPROTO_FEATURE_GRACEFUL_SHUTDOWN (the feature hasn't been officially released yet so it's okay). Instead, the server now generates IPROTO_EVENT{key='box.shutdown', value=true} before running on_shutdown triggers. The on_shutdown trigger installed by IPROTO has been greatly simplified - now it just stops listening for new connections. A new on_shutdown trigger is installed by the session infrastructure code - the trigger waits for all sessions that subscribed to 'box.shutdown' event to close. Net.box uses conn:watch to subscribe internally to the new event. As before, it runs user-defined on_shutdown triggers and then waits for active requests to finish before closing the connection. Notes about the tests: 1. We don't need graceful_shutdown_errinj test, because a. The feature can be disabled in net.box without errinj by adding internal _disable_graceful_shutdown connection option. b. Since there's no new packet type, we don't need to check that IPROTO_ID/AUTH handle it correctly in net.box. 2. The graceful_shutdown.test_shutdown_aborted_on_reconnect_2 changed: since triggers use the watcher infrstructure, the same trigger can't run in parallel with itself. The main benefit of reusing IPROTO_EVENT instead of introducing a new request type is that connections will have to do less work to support the feature (to some extent, it'd be supported even if the connector doesn't do any work and just lets the user subscribe to events). Suggested-by @unera Follow-up a33f3cc ("net.box: support graceful shutdown protocol") Follow-up 6f29f9d ("iproto: introduce graceful shutdown protocol") Follow-up tarantool#5924 NO_CHANGELOG=already added NO_DOC=already added, will be updated manually
1d2fef7
to
588b840
Compare
Fixed the bug with a new commit ("net.box: don't block close if called from state machine fiber"). Re cartridge test failures - turns out cartridge kills all fibers except those listed explicitly in the blacklist. The fiber used by the watcher infrastructure ("box.watchable") isn't on the list, which results in a crash. I opened a PR to add it there: tarantool/cartridge#1741 |
It's a system fiber. Killing it might result in crashing Tarantool. Integration tests would fail without this patch after tarantool/tarantool#6813 is merged.
Thanks! All looks good to me. |
This patch drops
IPROTO_SHUTDOWN
andIPROTO_FEATURE_GRACEFUL_SHUTDOWN
(the feature hasn't been officially released yet so it's okay). Instead, the server now generatesIPROTO_EVENT{key='box.shutdown', value=true}
before runningon_shutdown
triggers. Theon_shutdown
trigger installed by IPROTO has been greatly simplified - now it just stops listening for new connections. A newon_shutdown
trigger is installed by the session infrastructure code - the trigger waits for all sessions that subscribed to 'box.shutdown' event to close. Net.box usesconn:watch
to subscribeinternally to the new event. As before, it runs user-defined
on_shutdown
triggers and then waits for active requests to finish before closing the connection.Notes about the tests:
graceful_shutdown_errinj
test, becausea. The feature can be disabled in net.box without errinj by adding internal
_disable_graceful_shutdown
connection option.b. Since there's no new packet type, we don't need to check that
IPROTO_ID
andIPROTO_AUTH
handle it correctly in net.box.graceful_shutdown.test_shutdown_aborted_on_reconnect_2
changed: since triggers use the watcher infrstructure, the same trigger can't run in parallel with itself.The main benefit of reusing
IPROTO_EVENT
instead of introducing a new request type is that connections will have to do less work to support the feature (to some extent, it'd be supported even if the connector doesn't do any work and just lets the user subscribe to events).Suggested-by @unera
Follow-up #6783
Really closes #5924
Also found and fixed a couple of watcher-related issues in the scope of this PR:
Closes #6818
Closes #6819
Closes #6824
If we agree to merge this change, I'll manually update documentation tickets: tarantool/doc#2632, tarantool/doc#2633.