Skip to content

Conversation

locker
Copy link
Member

@locker locker commented Jan 28, 2022

This patch drops IPROTO_SHUTDOWN and IPROTO_FEATURE_GRACEFUL_SHUTDOWN (the feature hasn't been officially released yet so it's okay). Instead, the server now generates IPROTO_EVENT{key='box.shutdown', value=true} before running on_shutdown triggers. The on_shutdown trigger installed by IPROTO has been greatly simplified - now it just stops listening for new connections. A new on_shutdown trigger is installed by the session infrastructure code - the trigger waits for all sessions that subscribed to 'box.shutdown' event to close. Net.box uses conn:watch to subscribe
internally to the new event. As before, it runs user-defined on_shutdown triggers and then waits for active requests to finish before closing the connection.

Notes about the tests:

  1. We don't need graceful_shutdown_errinj test, because
    a. The feature can be disabled in net.box without errinj by adding internal _disable_graceful_shutdown connection option.
    b. Since there's no new packet type, we don't need to check that IPROTO_ID and IPROTO_AUTH handle it correctly in net.box.
  2. The graceful_shutdown.test_shutdown_aborted_on_reconnect_2 changed: since triggers use the watcher infrstructure, the same trigger can't run in parallel with itself.

The main benefit of reusing IPROTO_EVENT instead of introducing a new request type is that connections will have to do less work to support the feature (to some extent, it'd be supported even if the connector doesn't do any work and just lets the user subscribe to events).

Suggested-by @unera
Follow-up #6783
Really closes #5924

Also found and fixed a couple of watcher-related issues in the scope of this PR:

Closes #6818
Closes #6819
Closes #6824

If we agree to merge this change, I'll manually update documentation tickets: tarantool/doc#2632, tarantool/doc#2633.

@locker
Copy link
Member Author

locker commented Jan 28, 2022

box/iproto_stress.test.lua fails with this patch. I'm currently investigating why. It looks an issue of conn.watch.

@locker
Copy link
Member Author

locker commented Jan 28, 2022

@Gerold103 This patch completely reworks the implementation of the graceful shutdown (the procedure stays the same, but the code is moved around and IPROTO_SHUTDOWN is dropped in favor of 'box.shutdown' event). Please take a look. Do you think it's a step in the right direction?

@locker
Copy link
Member Author

locker commented Jan 31, 2022

box/iproto_stress.test.lua fails with this patch. I'm currently investigating why. It looks an issue of conn.watch.

Fixed this issue (#6818) in a separate commit. Also found and fixed another watcher-related issue (#6819).

@locker locker requested a review from Gerold103 January 31, 2022 16:03
@locker locker added the full-ci Enables all tests for a pull request label Jan 31, 2022
@@ -1411,10 +1411,16 @@ iproto_connection_on_output(ev_loop *loop, struct ev_io *watcher,
ev_io_start(loop, &con->output);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, the test in the first commit passes on my machine even without the fix. Does it fail on your machine? Except when I run it too many times like this python test-run.py $(yes 6818 | head -100), but then it hangs even with the patch. There was a problem with luatest and too many runs in parallel, at least on Mac. Instances are not all killed in the end and it randomly hangs. So the hang probably isn't related to the patch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my machine it didn't fail either. I slightly tuned the test - now it fails all the time without the patch and passes with the patch. I couldn't reproduce the hang running the test in parallel - looks like a test-run issue after all.

diff
diff --git a/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua b/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua
index 8bf44e18eb5e..d5675061d1bd 100644
--- a/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua
+++ b/test/box-luatest/gh_6818_iproto_resumes_input_if_no_output_test.lua
@@ -8,7 +8,7 @@ g.before_all = function()
     g.server = server:new({
         alias = 'master',
         box_cfg = {
-            net_msg_max = 16,
+            net_msg_max = 4,
         }
     })
     g.server:start()
@@ -31,13 +31,13 @@ g.test_iproto_resumes_input_if_no_output = function()
             local conn = net.connect(g.server.net_box_uri)
             conn:watch('foo', function() end)
             for _ = 1, loop_count do
-                conn:eval('return true')
+                conn:eval([[require('fiber').sleep(0.001)]])
             end
             conn:close()
             on_done:put(true)
         end)
     end
     for _ = 1, fiber_count do
-        t.assert(on_done:get())
+        t.assert(on_done:get(10))
     end
 end

@locker locker added do not merge and removed full-ci Enables all tests for a pull request labels Feb 1, 2022
@locker locker assigned locker and unassigned Gerold103 Feb 1, 2022
IPROTO triggers input processing after sending output, provided the
output channel isn't clogged, which totally makes sense. There's one
nuance here: currently, input is triggered only if something was written
to the output channel. Before the IPROTO_WATCH request type was added by
commit 4be5de4 ("iproto: add watchers
support"), every request had a response so it worked just fine. The new
request type is special - the server doesn't reply to it. As a result,
if input was blocked upon reaching the box.cfg.net_msg_max limit, it
might never get resumed, because input processing might never be
triggered. Fix this issue by triggering input processing even if there
was no output to flush.

No user is affected, because IPROTO_WATCH hasn't been officially
released yet so neither changelog nor doc is required.

Closes tarantool#6818

NO_CHANGELOG=bug not released
NO_DOC=NO_CHANGELOG
This commit fixes the following assertion failure that occurrs on an
attempt to register a watcher for a connection to a server that doesn't
support watchers (e.g. 2.8):

  src/box/lua/net_box.c:2283: netbox_transport_dispatch_response:
    Assertion `transport->inprogress_request_count > 0' failed.

The problem is that an IPROTO_WATCH request isn't accounted in
inprogress_request_count, because the server isn't supposed to reply
to it. However, if the server doesn't support the request type, it
will reply to it with an error, breaking the assumption made by the
client.

We fix the assertion itself by explicitly excluding requests with sync=0
from accounting on the client side (IPROTO_WATCH has sync=0), because
such requests aren't supposed to have a matching reply. Also, to avoid
errors on the server side, we make the client code not send IPROTO_WATCH
and IPROTO_WATCH if the server doesn't set the 'watchers' feature bit in
reply to the IPROTO_ID request from the client.

No user is affected, because IPROTO_WATCH hasn't been officially
released yet so neither changelog nor doc is required.

Closes tarantool#6819

NO_CHANGELOG=bug not released
NO_DOC=NO_CHANGELOG
@locker locker force-pushed the graceful-shutdown-rework branch from c99ecbf to 90db35d Compare February 1, 2022 08:57
@locker locker added full-ci Enables all tests for a pull request and removed do not merge labels Feb 1, 2022
@coveralls
Copy link

coveralls commented Feb 1, 2022

Coverage Status

Coverage increased (+0.01%) to 84.602% when pulling 588b840 on locker:graceful-shutdown-rework into 5fdc1ea on tarantool:master.

@locker locker added do not merge and removed full-ci Enables all tests for a pull request labels Feb 1, 2022
@locker
Copy link
Member Author

locker commented Feb 1, 2022

Found another bug in the watcher infrastructure, because of which vshard and cartridge tests fail: #6824.

@locker locker force-pushed the graceful-shutdown-rework branch from 90db35d to 1d2fef7 Compare February 1, 2022 12:53
@locker locker added the full-ci Enables all tests for a pull request label Feb 1, 2022
locker added a commit to locker/tarantool-cartridge that referenced this pull request Feb 1, 2022
It's a system fiber. Killing it might result in crashing Tarantool.

Integration tests would fail without this patch after
tarantool/tarantool#6813 is merged.
Currently, to resubscribe registered watchers net.box installs an
on_connect trigger that writes IPROTO_WATCH requests to the send buffer.
The requests will be sent by the net.box state machine once all
on_connect triggers return. The problem is this trigger may run before a
user-defined trigger that closes the connection, in which case close()
will hang forever, because since tarantool#6338, the close method blocks until
the send buffer is emptied, while it can't be emptied in case close is
called by the state machine itself (see tarantool#5358).

Let's fix this issue by making close() omit blocking if called from
the state machine fiber.

No user is affected, because IPROTO_WATCH hasn't been officially
released yet so neither changelog nor doc is required.

Closes tarantool#6824

NO_CHANGELOG=bug not released
NO_DOC=NO_CHANGELOG
This patch drops IPROTO_SHUTDOWN and IPROTO_FEATURE_GRACEFUL_SHUTDOWN
(the feature hasn't been officially released yet so it's okay). Instead,
the server now generates IPROTO_EVENT{key='box.shutdown', value=true}
before running on_shutdown triggers. The on_shutdown trigger installed
by IPROTO has been greatly simplified - now it just stops listening for
new connections. A new on_shutdown trigger is installed by the session
infrastructure code - the trigger waits for all sessions that subscribed
to 'box.shutdown' event to close. Net.box uses conn:watch to subscribe
internally to the new event. As before, it runs user-defined on_shutdown
triggers and then waits for active requests to finish before closing the
connection.

Notes about the tests:
 1. We don't need graceful_shutdown_errinj test, because
    a. The feature can be disabled in net.box without errinj by adding
       internal _disable_graceful_shutdown connection option.
    b. Since there's no new packet type, we don't need to check that
       IPROTO_ID/AUTH handle it correctly in net.box.
 2. The graceful_shutdown.test_shutdown_aborted_on_reconnect_2 changed:
    since triggers use the watcher infrstructure, the same trigger can't
    run in parallel with itself.

The main benefit of reusing IPROTO_EVENT instead of introducing a new
request type is that connections will have to do less work to support
the feature (to some extent, it'd be supported even if the connector
doesn't do any work and just lets the user subscribe to events).

Suggested-by @unera
Follow-up a33f3cc
("net.box: support graceful shutdown protocol")
Follow-up 6f29f9d
("iproto: introduce graceful shutdown protocol")

Follow-up tarantool#5924

NO_CHANGELOG=already added
NO_DOC=already added, will be updated manually
@locker locker force-pushed the graceful-shutdown-rework branch from 1d2fef7 to 588b840 Compare February 1, 2022 16:28
@locker
Copy link
Member Author

locker commented Feb 1, 2022

Found another bug in the watcher infrastructure, because of which vshard and cartridge tests fail: #6824.

Fixed the bug with a new commit ("net.box: don't block close if called from state machine fiber").

Re cartridge test failures - turns out cartridge kills all fibers except those listed explicitly in the blacklist. The fiber used by the watcher infrastructure ("box.watchable") isn't on the list, which results in a crash. I opened a PR to add it there: tarantool/cartridge#1741

filonenko-mikhail pushed a commit to tarantool/cartridge that referenced this pull request Feb 1, 2022
It's a system fiber. Killing it might result in crashing Tarantool.

Integration tests would fail without this patch after
tarantool/tarantool#6813 is merged.
@Gerold103
Copy link
Collaborator

Thanks! All looks good to me.

@kyukhin kyukhin merged commit 2e9cbec into tarantool:master Feb 2, 2022
@locker locker deleted the graceful-shutdown-rework branch February 2, 2022 07:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
full-ci Enables all tests for a pull request
Projects
None yet
4 participants