Rework graceful shutdown using IPROTO_EVENT #6813

locker · 2022-01-28T15:34:18Z

This patch drops IPROTO_SHUTDOWN and IPROTO_FEATURE_GRACEFUL_SHUTDOWN (the feature hasn't been officially released yet so it's okay). Instead, the server now generates IPROTO_EVENT{key='box.shutdown', value=true} before running on_shutdown triggers. The on_shutdown trigger installed by IPROTO has been greatly simplified - now it just stops listening for new connections. A new on_shutdown trigger is installed by the session infrastructure code - the trigger waits for all sessions that subscribed to 'box.shutdown' event to close. Net.box uses conn:watch to subscribe
internally to the new event. As before, it runs user-defined on_shutdown triggers and then waits for active requests to finish before closing the connection.

Notes about the tests:

We don't need graceful_shutdown_errinj test, because
a. The feature can be disabled in net.box without errinj by adding internal _disable_graceful_shutdown connection option.
b. Since there's no new packet type, we don't need to check that IPROTO_ID and IPROTO_AUTH handle it correctly in net.box.
The graceful_shutdown.test_shutdown_aborted_on_reconnect_2 changed: since triggers use the watcher infrstructure, the same trigger can't run in parallel with itself.

The main benefit of reusing IPROTO_EVENT instead of introducing a new request type is that connections will have to do less work to support the feature (to some extent, it'd be supported even if the connector doesn't do any work and just lets the user subscribe to events).

Suggested-by @unera
Follow-up #6783
Really closes #5924

Also found and fixed a couple of watcher-related issues in the scope of this PR:

Closes #6818
Closes #6819
Closes #6824

If we agree to merge this change, I'll manually update documentation tickets: tarantool/doc#2632, tarantool/doc#2633.

locker · 2022-01-28T16:04:36Z

box/iproto_stress.test.lua fails with this patch. I'm currently investigating why. It looks an issue of conn.watch.

locker · 2022-01-28T16:05:34Z

@Gerold103 This patch completely reworks the implementation of the graceful shutdown (the procedure stays the same, but the code is moved around and IPROTO_SHUTDOWN is dropped in favor of 'box.shutdown' event). Please take a look. Do you think it's a step in the right direction?

src/box/iproto.cc

src/box/lua/net_box.lua

src/box/watcher.c

locker · 2022-01-31T16:00:51Z

box/iproto_stress.test.lua fails with this patch. I'm currently investigating why. It looks an issue of conn.watch.

Fixed this issue (#6818) in a separate commit. Also found and fixed another watcher-related issue (#6819).

Gerold103 · 2022-01-31T22:10:04Z

src/box/iproto.cc

@@ -1411,10 +1411,16 @@ iproto_connection_on_output(ev_loop *loop, struct ev_io *watcher,
 			ev_io_start(loop, &con->output);


Hm, the test in the first commit passes on my machine even without the fix. Does it fail on your machine? Except when I run it too many times like this python test-run.py $(yes 6818 | head -100), but then it hangs even with the patch. There was a problem with luatest and too many runs in parallel, at least on Mac. Instances are not all killed in the end and it randomly hangs. So the hang probably isn't related to the patch.

On my machine it didn't fail either. I slightly tuned the test - now it fails all the time without the patch and passes with the patch. I couldn't reproduce the hang running the test in parallel - looks like a test-run issue after all.

diff

diff --git a/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua b/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua index 8bf44e18eb5e..d5675061d1bd 100644 --- a/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua +++ b/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua @@ -8,7 +8,7 @@ g.before_all = function() g.server = server:new({ alias = 'master', box_cfg = { - net_msg_max = 16, + net_msg_max = 4, } }) g.server:start() @@ -31,13 +31,13 @@ g.test_iproto_resumes_input_if_no_output = function() local conn = net.connect(g.server.net_box_uri) conn:watch('foo', function() end) for _ = 1, loop_count do - conn:eval('return true') + conn:eval([[require('fiber').sleep(0.001)]]) end conn:close() on_done:put(true) end) end for _ = 1, fiber_count do - t.assert(on_done:get()) + t.assert(on_done:get(10)) end end

IPROTO triggers input processing after sending output, provided the output channel isn't clogged, which totally makes sense. There's one nuance here: currently, input is triggered only if something was written to the output channel. Before the IPROTO_WATCH request type was added by commit 4be5de4 ("iproto: add watchers support"), every request had a response so it worked just fine. The new request type is special - the server doesn't reply to it. As a result, if input was blocked upon reaching the box.cfg.net_msg_max limit, it might never get resumed, because input processing might never be triggered. Fix this issue by triggering input processing even if there was no output to flush. No user is affected, because IPROTO_WATCH hasn't been officially released yet so neither changelog nor doc is required. Closes tarantool#6818 NO_CHANGELOG=bug not released NO_DOC=NO_CHANGELOG

This commit fixes the following assertion failure that occurrs on an attempt to register a watcher for a connection to a server that doesn't support watchers (e.g. 2.8): src/box/lua/net_box.c:2283: netbox_transport_dispatch_response: Assertion `transport->inprogress_request_count > 0' failed. The problem is that an IPROTO_WATCH request isn't accounted in inprogress_request_count, because the server isn't supposed to reply to it. However, if the server doesn't support the request type, it will reply to it with an error, breaking the assumption made by the client. We fix the assertion itself by explicitly excluding requests with sync=0 from accounting on the client side (IPROTO_WATCH has sync=0), because such requests aren't supposed to have a matching reply. Also, to avoid errors on the server side, we make the client code not send IPROTO_WATCH and IPROTO_WATCH if the server doesn't set the 'watchers' feature bit in reply to the IPROTO_ID request from the client. No user is affected, because IPROTO_WATCH hasn't been officially released yet so neither changelog nor doc is required. Closes tarantool#6819 NO_CHANGELOG=bug not released NO_DOC=NO_CHANGELOG

coveralls · 2022-02-01T10:13:47Z

Coverage increased (+0.01%) to 84.602% when pulling 588b840 on locker:graceful-shutdown-rework into 5fdc1ea on tarantool:master.

locker · 2022-02-01T12:13:03Z

Found another bug in the watcher infrastructure, because of which vshard and cartridge tests fail: #6824.

It's a system fiber. Killing it might result in crashing Tarantool. Integration tests would fail without this patch after tarantool/tarantool#6813 is merged.

Currently, to resubscribe registered watchers net.box installs an on_connect trigger that writes IPROTO_WATCH requests to the send buffer. The requests will be sent by the net.box state machine once all on_connect triggers return. The problem is this trigger may run before a user-defined trigger that closes the connection, in which case close() will hang forever, because since tarantool#6338, the close method blocks until the send buffer is emptied, while it can't be emptied in case close is called by the state machine itself (see tarantool#5358). Let's fix this issue by making close() omit blocking if called from the state machine fiber. No user is affected, because IPROTO_WATCH hasn't been officially released yet so neither changelog nor doc is required. Closes tarantool#6824 NO_CHANGELOG=bug not released NO_DOC=NO_CHANGELOG

@unera

This patch drops IPROTO_SHUTDOWN and IPROTO_FEATURE_GRACEFUL_SHUTDOWN (the feature hasn't been officially released yet so it's okay). Instead, the server now generates IPROTO_EVENT{key='box.shutdown', value=true} before running on_shutdown triggers. The on_shutdown trigger installed by IPROTO has been greatly simplified - now it just stops listening for new connections. A new on_shutdown trigger is installed by the session infrastructure code - the trigger waits for all sessions that subscribed to 'box.shutdown' event to close. Net.box uses conn:watch to subscribe internally to the new event. As before, it runs user-defined on_shutdown triggers and then waits for active requests to finish before closing the connection. Notes about the tests: 1. We don't need graceful_shutdown_errinj test, because a. The feature can be disabled in net.box without errinj by adding internal _disable_graceful_shutdown connection option. b. Since there's no new packet type, we don't need to check that IPROTO_ID/AUTH handle it correctly in net.box. 2. The graceful_shutdown.test_shutdown_aborted_on_reconnect_2 changed: since triggers use the watcher infrstructure, the same trigger can't run in parallel with itself. The main benefit of reusing IPROTO_EVENT instead of introducing a new request type is that connections will have to do less work to support the feature (to some extent, it'd be supported even if the connector doesn't do any work and just lets the user subscribe to events). Suggested-by @unera Follow-up a33f3cc ("net.box: support graceful shutdown protocol") Follow-up 6f29f9d ("iproto: introduce graceful shutdown protocol") Follow-up tarantool#5924 NO_CHANGELOG=already added NO_DOC=already added, will be updated manually

locker · 2022-02-01T16:31:39Z

Found another bug in the watcher infrastructure, because of which vshard and cartridge tests fail: #6824.

Fixed the bug with a new commit ("net.box: don't block close if called from state machine fiber").

Re cartridge test failures - turns out cartridge kills all fibers except those listed explicitly in the blacklist. The fiber used by the watcher infrastructure ("box.watchable") isn't on the list, which results in a crash. I opened a PR to add it there: tarantool/cartridge#1741

It's a system fiber. Killing it might result in crashing Tarantool. Integration tests would fail without this patch after tarantool/tarantool#6813 is merged.

Gerold103 · 2022-02-01T21:51:05Z

Thanks! All looks good to me.

This was referenced Jan 28, 2022

Document graceful shutdown of net.box connections tarantool/doc#2633

Closed

Document IPROTO graceful shutdown tarantool/doc#2632

Closed

locker force-pushed the graceful-shutdown-rework branch from e152cc6 to b374d46 Compare January 28, 2022 16:00

locker requested a review from Gerold103 January 28, 2022 16:00

locker assigned Gerold103 Jan 28, 2022

Gerold103 reviewed Jan 28, 2022

View reviewed changes

src/box/iproto.cc Show resolved Hide resolved

src/box/lua/net_box.lua Show resolved Hide resolved

src/box/watcher.c Outdated Show resolved Hide resolved

This was referenced Jan 31, 2022

net.box connection that has watcher hangs when net_msg_max limit is reached #6818

Closed

Make sure tests don't linger because of shutdown timeout #6820

Closed

locker force-pushed the graceful-shutdown-rework branch from b374d46 to c99ecbf Compare January 31, 2022 15:59

locker requested a review from Gerold103 January 31, 2022 16:03

locker added the full-ci Enables all tests for a pull request label Jan 31, 2022

Gerold103 approved these changes Jan 31, 2022

View reviewed changes

locker added do not merge and removed full-ci Enables all tests for a pull request labels Feb 1, 2022

locker assigned locker and unassigned Gerold103 Feb 1, 2022

locker added 2 commits February 1, 2022 11:41

locker force-pushed the graceful-shutdown-rework branch from c99ecbf to 90db35d Compare February 1, 2022 08:57

locker added full-ci Enables all tests for a pull request and removed do not merge labels Feb 1, 2022

locker added do not merge and removed full-ci Enables all tests for a pull request labels Feb 1, 2022

locker force-pushed the graceful-shutdown-rework branch from 90db35d to 1d2fef7 Compare February 1, 2022 12:53

locker added the full-ci Enables all tests for a pull request label Feb 1, 2022

locker mentioned this pull request Feb 1, 2022

Don't kill box.watchable system fiber on reload tarantool/cartridge#1741

Merged

locker added 2 commits February 1, 2022 19:11

locker force-pushed the graceful-shutdown-rework branch from 1d2fef7 to 588b840 Compare February 1, 2022 16:28

locker removed the do not merge label Feb 1, 2022

kyukhin merged commit 2e9cbec into tarantool:master Feb 2, 2022

locker deleted the graceful-shutdown-rework branch February 2, 2022 07:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Rework graceful shutdown using IPROTO_EVENT #6813

Rework graceful shutdown using IPROTO_EVENT #6813

Uh oh!

locker commented Jan 28, 2022 •

edited

Loading

Uh oh!

locker commented Jan 28, 2022

Uh oh!

locker commented Jan 28, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

locker commented Jan 31, 2022

Uh oh!

Gerold103 Jan 31, 2022

Uh oh!

locker Feb 1, 2022

Uh oh!

coveralls commented Feb 1, 2022 •

edited

Loading

Uh oh!

locker commented Feb 1, 2022

Uh oh!

locker commented Feb 1, 2022

Uh oh!

Gerold103 commented Feb 1, 2022

Uh oh!

Uh oh!

		@@ -1411,10 +1411,16 @@ iproto_connection_on_output(ev_loop loop, struct ev_io watcher,
		ev_io_start(loop, &con->output);

Rework graceful shutdown using IPROTO_EVENT #6813

Rework graceful shutdown using IPROTO_EVENT #6813

Uh oh!

Conversation

locker commented Jan 28, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

locker commented Jan 28, 2022

Uh oh!

locker commented Jan 28, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

locker commented Jan 31, 2022

Uh oh!

Gerold103 Jan 31, 2022

Choose a reason for hiding this comment

Uh oh!

locker Feb 1, 2022

Choose a reason for hiding this comment

Uh oh!

coveralls commented Feb 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

locker commented Feb 1, 2022

Uh oh!

locker commented Feb 1, 2022

Uh oh!

Gerold103 commented Feb 1, 2022

Uh oh!

Uh oh!

locker commented Jan 28, 2022 •

edited

Loading

coveralls commented Feb 1, 2022 •

edited

Loading