Skip to content

Conversation

sipa
Copy link
Member

@sipa sipa commented Dec 16, 2021

test_runner.py currently only checks every 0.5s whether any job has finished, and if so, starts at most one new job. At higher parallellism it becomes increasingly likely that multiple jobs have finished at the same time. Fix this by always noticing all finished jobs every timeslot, and starting as many new ones.

@sipa
Copy link
Member Author

sipa commented Dec 16, 2021

At -j32 it speeds things up by a few seconds wall clock time for me. Though the runtime is mostly dominated by a few very long-running jobs. Splitting those up could significantly increase the gain.

@maflcko
Copy link
Member

maflcko commented Dec 16, 2021

"Duplicate" of #13384?

@sipa
Copy link
Member Author

sipa commented Dec 16, 2021

Eh, not really - this still polls every 0.5s. It's just not limited to detecting at most one completed task per 0.5s.

@DrahtBot DrahtBot added the Tests label Dec 16, 2021
@laanwj
Copy link
Member

laanwj commented Dec 17, 2021

The approach in #13384 would be the most efficient, to start a new test as soon as another stops. But this is an improvement. Concept ACK.

@sipa
Copy link
Member Author

sipa commented Dec 17, 2021

I'm perfectly fine with the approach in #13384 as well. This PR is just an obvious incremental improvement to the current code with no impact apart from making things a bit faster, so I'd expect it to be perhaps less controversial.

Comment on lines +532 to +536
i = 0
while i < test_count:
for test_result, testdir, stdout, stderr in job_queue.get_next():
test_results.append(test_result)
i += 1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked at this code in which for i in range(test_count) is replaced by while i< test_count):

Read the answers here https://stackoverflow.com/questions/869229/why-is-looping-over-range-in-python-faster-than-using-a-while-loop and I am not sure why this would be faster.

Ignore the review if it does not make sense or I am missing something important.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not what makes it faster.

The speedup is due to get_next() now returning all finished jobs, instead of one finished job (and then sleeping 0.5s between calls). The change here is just to deal with the fact that the returned value is now a list of jobs instead of a single one.

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 975097f

@laanwj
Copy link
Member

laanwj commented Dec 17, 2021

perhaps less controversial.

Sure. Though I don't think it was controversial. I definitely didn't mean my comment like that at the time, it was far from a NACK. I just had a question about the dot-printing, and not being able to exit with Ctrl-C was lightly annoying. I'm sure those could be solved.

@maflcko
Copy link
Member

maflcko commented Dec 18, 2021

Yeah, I am wondering if the CTRL+C needs to be captured and translated into a kill?

@laanwj
Copy link
Member

laanwj commented Jan 5, 2022

Code review and lightly tested ACK 975097f

Yeah, I am wondering if the CTRL+C needs to be captured and translated into a kill?

Maybe. Usually it's a result of accidentally ignoring KeyboardInterrupt exceptions (for example in threads). But haven't checked it.
I'm going ahead and merge this for now. I guess someone interested in the other PR can pick it up.

@laanwj laanwj changed the title Let test_runner.py start multiple jobs per timeslot test: Let test_runner.py start multiple jobs per timeslot Jan 5, 2022
@laanwj laanwj merged commit 121d47a into bitcoin:master Jan 5, 2022
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jan 6, 2022
… timeslot

975097f Let test_runner.py start multiple jobs per timeslot (Pieter Wuille)

Pull request description:

  test_runner.py currently only checks every 0.5s whether any job has finished, and if so, starts at most one new job. At higher parallellism it becomes increasingly likely that multiple jobs have finished at the same time. Fix this by always noticing *all* finished jobs every timeslot, and starting as many new ones.

ACKs for top commit:
  laanwj:
    Code review and lightly tested ACK 975097f
  prayank23:
    ACK bitcoin@975097f

Tree-SHA512: b70c51f05efcde9bc25475c192b86e86b4c399495b42dee20576af3e6b99e8298be8b9e82146abdabbaedb24a13ee158a7c8947518b16fc4f33a3b434935b550
break
i = 0
while i < test_count:
for test_result, testdir, stdout, stderr in job_queue.get_next():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...............
               
Remaining jobs: [wallet_import_rescan.py --legacy-wallet, p2p_node_network_limited.py]
..............................................................................................................................................................................................................................
                                                                                                                                                                                                                              
Remaining jobs: [wallet_import_rescan.py --legacy-wallet]
......................................................................................................................................................................................
                                                                                                                                                                                      
----------------------------------------------------------------------
Ran 10 tests in 0.753s
OK
Traceback (most recent call last):
  File "test/functional/test_runner.py", line 816, in <module>
    main()
  File "test/functional/test_runner.py", line 460, in main
    run_tests(
  File "test/functional/test_runner.py", line 535, in run_tests
    for test_result, testdir, stdout, stderr in job_queue.get_next():
  File "test/functional/test_runner.py", line 647, in get_next
    raise IndexError('pop from empty list')
IndexError: pop from empty list

https://cirrus-ci.com/task/6633293054476288?logs=ci#L6037

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't understand how this is possible. It requires get_next() to be called when self.jobs and self.test_list are empty, while i < test_count in run_tests(). The latter should imply that self.test_list is not empty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #23995

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #23995 (comment) for a possible explanation.

Copy link

@0967622354toon 0967622354toon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate of #

@0967622354toon
Copy link

H

maflcko pushed a commit that referenced this pull request Feb 7, 2022
a036358 test: Repair failfast option for test runner (Martin Zumsande)

Pull request description:

  Fixes #23990

  After #23799, the `--failfast` option in the test runner for the functional tests stopped working, because a second outer loop was introduced, which would have needed a `break` too for the test runner to fail immediately. This also led to the errors reported in #23990.

  This provides a straightforward fix for that.
  There is also #23995 which is a larger refactor, but that hasn't been updated in a while to fix the failfast issue.

ACKs for top commit:
  pg156:
    Tested ACK a036358. I agree adding the `all_passed` flag to break out of the outer loop when needed makes sense. The "failfast" option works after this change.

Tree-SHA512: 3e2f775e36c13d180d32a05cd1cfe0883274e8615cdbbd4e069a9899e9b9ea1091066cf085e93f1c5326bd8ecc6ff524e0dad7c638f60dfdb169fefcdb26ee52
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Feb 7, 2022
…nner

a036358 test: Repair failfast option for test runner (Martin Zumsande)

Pull request description:

  Fixes bitcoin#23990

  After bitcoin#23799, the `--failfast` option in the test runner for the functional tests stopped working, because a second outer loop was introduced, which would have needed a `break` too for the test runner to fail immediately. This also led to the errors reported in bitcoin#23990.

  This provides a straightforward fix for that.
  There is also bitcoin#23995 which is a larger refactor, but that hasn't been updated in a while to fix the failfast issue.

ACKs for top commit:
  pg156:
    Tested ACK a036358. I agree adding the `all_passed` flag to break out of the outer loop when needed makes sense. The "failfast" option works after this change.

Tree-SHA512: 3e2f775e36c13d180d32a05cd1cfe0883274e8615cdbbd4e069a9899e9b9ea1091066cf085e93f1c5326bd8ecc6ff524e0dad7c638f60dfdb169fefcdb26ee52
@bitcoin bitcoin locked and limited conversation to collaborators Jan 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants