Gracefully recover from containerd TaskNotFound errors #9100

taylorsilva · 2025-03-06T23:24:34Z

Changes proposed by this PR

We found that when a worker is restarted, all of its containers are still
present, but any tasks running in them are gone. This results in
TaskNotFound errors, mostly in resource checks, because check containers
are long running and therefore have a higher chance of having their task
killed.

Resource checks are more prone to this type of error because we re-use
their containers over a long period of time. The other types of
containers we make (get, put, task) usually don't hang around that long.

We initially create a Task during container creation and assume that
Task will still be there when we go to exec the executable we actually
want to run (e.g. /opt/resource/{in,out,check}). For long-running
containers this may not be true, but we can recover gracefully in this
scenario.
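
To make the recovery idea concrete, here is a minimal, self-contained sketch. It is not the code from this PR: the Task and Container interfaces, the ErrTaskNotFound sentinel, and the taskOrRecreate helper are hypothetical stand-ins, and the sketch assumes recovery means creating a fresh task when the original one is missing.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTaskNotFound is a hypothetical sentinel standing in for the
// "no running task found" error reported by the container runtime.
var ErrTaskNotFound = errors.New("task retrieval: no running task found")

// Task and Container are hypothetical, minimal interfaces used only
// for this illustration; they are not the real containerd client API.
type Task interface {
	Exec(path string) error
}

type Container interface {
	Task() (Task, error)    // look up the container's existing task
	NewTask() (Task, error) // create a fresh task in the container
}

// taskOrRecreate returns the container's existing task. If the task is
// missing (for example after a worker restart), it creates a new one
// instead of failing. Any other lookup error is returned to the caller.
func taskOrRecreate(c Container) (Task, error) {
	task, err := c.Task()
	if err == nil {
		return task, nil
	}
	if !errors.Is(err, ErrTaskNotFound) {
		return nil, fmt.Errorf("task lookup: %w", err)
	}
	task, err = c.NewTask()
	if err != nil {
		return nil, fmt.Errorf("recreate task: %w", err)
	}
	return task, nil
}

// fakeTask records the paths it was asked to exec.
type fakeTask struct{ execs []string }

func (t *fakeTask) Exec(path string) error {
	t.execs = append(t.execs, path)
	return nil
}

// fakeContainer simulates a container whose task may have disappeared.
type fakeContainer struct {
	task    *fakeTask
	created int
}

func (c *fakeContainer) Task() (Task, error) {
	if c.task == nil {
		return nil, ErrTaskNotFound
	}
	return c.task, nil
}

func (c *fakeContainer) NewTask() (Task, error) {
	c.created++
	c.task = &fakeTask{}
	return c.task, nil
}

func main() {
	// A container with no task, as seen after a worker restart.
	c := &fakeContainer{}
	task, err := taskOrRecreate(c)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	_ = task.Exec("/opt/resource/check")
	fmt.Println("tasks created:", c.created)
}
```

Only the not-found case is treated as recoverable in this sketch, so other lookup failures still surface to the caller rather than being masked by a new task.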

Notes to reviewer

The error can be reproduced by following these steps: #8172 (comment)

More details here: #8172 (comment)

Release Note

Gracefully recover from "task retrieval: no running task found" errors


Signed-off-by: Taylor Silva <dev@taydev.net>

taylorsilva added 2 commits March 6, 2025 18:19

use newer slices.ContainsFunc

9a29e29

Signed-off-by: Taylor Silva <dev@taydev.net>
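
The diff for this commit is not shown here, so the following is only a generic illustration of slices.ContainsFunc (standard library, Go 1.21+), which replaces a hand-written "does any element match" loop. The helper name hasMessageContaining and the sample messages are hypothetical.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// hasMessageContaining reports whether any message in msgs contains
// substr. slices.ContainsFunc stops at the first matching element.
func hasMessageContaining(msgs []string, substr string) bool {
	return slices.ContainsFunc(msgs, func(m string) bool {
		return strings.Contains(m, substr)
	})
}

func main() {
	// Hypothetical sample messages, for illustration only.
	msgs := []string{
		"container exists",
		"task retrieval: no running task found",
	}
	fmt.Println(hasMessageContaining(msgs, "no running task found"))
}
```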

taylorsilva added the bug label Mar 6, 2025

taylorsilva requested a review from a team as a code owner March 6, 2025 23:24

taylorsilva merged commit f9174a2 into master Mar 7, 2025
11 checks passed

taylorsilva deleted the issue/8172 branch March 7, 2025 01:02
