@taylorsilva taylorsilva commented Mar 6, 2025

Changes proposed by this PR

closes #8172

We found that when a worker is restarted, all of its containers are
still present, but any tasks running in them disappear. This results in
TaskNotFound errors, mostly in resource checks, because check containers
are long-running and therefore more likely to have had their task
killed.

Resource checks are more prone to this type of error because we re-use
their containers over a long period of time. The other types of
containers we make (get, put, task) usually don't hang around that long.

We initially create a Task during container creation and assume that
the task will still be there when we go to exec the actual executable
that we want to run (e.g. /opt/resource/{in,out,check}). For long-running
containers this may not be true, but we can gracefully recover in this
scenario.
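
As a rough illustration of the recovery described above, here is a minimal sketch. It does not use the real runtime API: the `Container` and `Task` interfaces, the `ErrTaskNotFound` sentinel, and the fakes are all hypothetical stand-ins, and recreating the task is one assumed way to recover, not necessarily what this PR's diff does.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTaskNotFound is a hypothetical sentinel standing in for the
// runtime's "no running task found" error.
var ErrTaskNotFound = errors.New("no running task found")

// Task is a hypothetical handle to the long-lived task in a container.
type Task interface {
	Exec(path string) error
}

// Container is a hypothetical view of a container that may have lost its task.
type Container interface {
	Task() (Task, error)    // look up the task created with the container
	NewTask() (Task, error) // create a fresh task in the same container
}

// taskOrRecreate returns the container's task, creating a new one only
// when the lookup fails because the task is gone (e.g. after a worker restart).
func taskOrRecreate(c Container) (Task, error) {
	task, err := c.Task()
	if err == nil {
		return task, nil
	}
	if !errors.Is(err, ErrTaskNotFound) {
		// Any other failure is not something we know how to recover from.
		return nil, fmt.Errorf("task retrieval: %w", err)
	}
	task, err = c.NewTask()
	if err != nil {
		return nil, fmt.Errorf("recreate task: %w", err)
	}
	return task, nil
}

// execInContainer runs an executable such as /opt/resource/check,
// recovering first if the container's task has disappeared.
func execInContainer(c Container, path string) error {
	task, err := taskOrRecreate(c)
	if err != nil {
		return err
	}
	return task.Exec(path)
}

// fakeTask records the paths it was asked to exec.
type fakeTask struct{ ran []string }

func (t *fakeTask) Exec(path string) error {
	t.ran = append(t.ran, path)
	return nil
}

// fakeContainer simulates a container whose task is missing until recreated.
type fakeContainer struct {
	task    *fakeTask
	created int
}

func (c *fakeContainer) Task() (Task, error) {
	if c.task == nil {
		return nil, ErrTaskNotFound
	}
	return c.task, nil
}

func (c *fakeContainer) NewTask() (Task, error) {
	c.created++
	c.task = &fakeTask{}
	return c.task, nil
}
```

Calling `execInContainer(&fakeContainer{}, "/opt/resource/check")` on a container with no task recreates the task once and then runs the executable; a second call reuses the recreated task. Only the not-found case is retried, so unrelated runtime errors still surface to the caller.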

Notes to reviewer

Can reproduce the error following these steps: #8172 (comment)

More details here: #8172 (comment)

Release Note

  • Gracefully recover from "task retrieval: no running task found" errors

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva added the bug label Mar 6, 2025
@taylorsilva taylorsilva requested a review from a team as a code owner March 6, 2025 23:24
@taylorsilva taylorsilva merged commit f9174a2 into master Mar 7, 2025
11 checks passed
@taylorsilva taylorsilva deleted the issue/8172 branch March 7, 2025 01:02

Successfully merging this pull request may close these issues.

"no running task found" issue after upgrade to 7.7.0