Skip to content

Conversation

fjahr
Copy link
Contributor

@fjahr fjahr commented Apr 20, 2024

This addresses multiprocess CI failures in p2p_tx_download.py, likely introduced by #29827.

Example failure: https://cirrus-ci.com/task/5622109341220864

I was having a hard time reproducing or rationalizing the root cause of the issue but it seemed very likely the mock time wasn't working as expected without another reset and I got a successful run with it when I temporarily introduced it to another PR I am working on: https://cirrus-ci.com/task/5109555795853312

@DrahtBot
Copy link
Contributor

DrahtBot commented Apr 20, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

@fjahr
Copy link
Contributor Author

fjahr commented Apr 21, 2024

Hm, still seems to be flakey at least, giving it one more shot with longer bump

@fjahr
Copy link
Contributor Author

fjahr commented Apr 21, 2024

#28313 / #28321 may be related

Copy link
Member

@glozow glozow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(cc @theStack, indeed looks like it's from #29827)

Comment on lines -268 to +269
node.bumpmocktime(MAX_GETDATA_INBOUND_WAIT)
node.setmocktime(int(time.time()))
node.bumpmocktime(MAX_GETDATA_INBOUND_WAIT + 300)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't been able to reproduce this issue and am mostly guessing here, but if it's a setmocktime problem, perhaps just do a node.setmocktime(0) to reset it a few lines up (shouldn't impact the test except maybe make it take 2sec longer)?

Copy link
Member

@glozow glozow Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that the +300 would make a difference since the timing for getdata isn't variable like it is for announcements (which was the case in #28321)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had has only used the node.setmocktime(0) in that place and that succeeded when I temporarily pushed it in an unrelated PR but then here it failed when I opened this one. After adding the +300 it succeeded, but yeah, this may have been just luck again. I will push a version without +300 and and the node.setmocktime(0) moved higher up.

I just saw that the Re-Run button also appears for CI jobs that succeeded, I will try to re-rerun this version a few times manually so we can have bit more confidence before merging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had another failure, so I will add the +300 back and see if that is a permanent fix.

This addresses a timeout error in the multiprocess CI job.
@fjahr fjahr force-pushed the 2024-04-p2p-tx-multiprocess-fail branch from edbfead to fb154e3 Compare April 22, 2024 14:04
@maflcko
Copy link
Member

maflcko commented Apr 22, 2024

Example failure: https://cirrus-ci.com/task/5622109341220864

If I download https://api.cirrus-ci.com/v1/task/5622109341220864/logs/ci.log, I get

2024-04-21T00:10:09.801000Z TestFramework.utils (ERROR): wait_until() failed. Predicate: ''''
        wait_until_helper_internal(lambda: not self.network_event_loop.is_running(), timeout=timeout)
'''
[node 1] Cleaning up leftover process
[node 0] Cleaning up leftover process


�[1mstderr:
�[0mTraceback (most recent call last):
  File "/ci_container_base/ci/scratch/build/bitcoin-i686-pc-linux-gnu/test/functional/p2p_tx_download.py", line 302, in <module>
    TxDownloadTest().main()
  File "/ci_container_base/ci/scratch/build/bitcoin-i686-pc-linux-gnu/test/functional/test_framework/test_framework.py", line 155, in main
    exit_code = self.shutdown()
                ^^^^^^^^^^^^^^^
  File "/ci_container_base/ci/scratch/build/bitcoin-i686-pc-linux-gnu/test/functional/test_framework/test_framework.py", line 314, in shutdown
    self.network_thread.close()
  File "/ci_container_base/ci/scratch/build/bitcoin-i686-pc-linux-gnu/test/functional/test_framework/p2p.py", line 732, in close
    wait_until_helper_internal(lambda: not self.network_event_loop.is_running(), timeout=timeout)
  File "/ci_container_base/ci/scratch/build/bitcoin-i686-pc-linux-gnu/test/functional/test_framework/util.py", line 293, in wait_until_helper_internal
    raise AssertionError("Predicate {} not true after {} seconds".format(predicate_source, timeout))
AssertionError: Predicate ''''
        wait_until_helper_internal(lambda: not self.network_event_loop.is_running(), timeout=timeout)
''' not true after 10.0 seconds

Which indicates that the problem is during shutdown, after the test, not in the test.

So I don't think adding mocktime will fix it. Maybe you were running into a different issue, or are trying to fix a different bug?

@maflcko
Copy link
Member

maflcko commented Apr 22, 2024

See also the current test failure, which remains: https://github.com/bitcoin/bitcoin/pull/29926/checks?check_run_id=24105987877

@fjahr
Copy link
Contributor Author

fjahr commented Apr 22, 2024

Which indicates that the problem is during shutdown, after the test, not in the test.

So I don't think adding mocktime will fix it. Maybe you were running into a different issue, or are trying to fix a different bug?

Hm, I didn't see Stopping nodes being printed so I am not convinced it's during shutdown but maybe I am just not familiar enough with the test framework. I am closing since it seems even with the +30 it's still failing and it seems duplicate #29933 might have a better approach.

@fjahr fjahr closed this Apr 22, 2024
@maflcko
Copy link
Member

maflcko commented Apr 22, 2024

Hm, I didn't see Stopping nodes being printed so I am not convinced it's during shutdown

This will be printed after the network thread is closed. However, the network thread not closing is the underlying bug.

You can see this in the traceback I posted above. (self.network_thread.close() is line 314, and "Stopping nodes" is line 316)

@bitcoin bitcoin locked and limited conversation to collaborators Apr 22, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants