Gracefully recover from containerd TaskNotFound errors #9100
Changes proposed by this PR
closes #8172
We found that when a worker is restarted, all of its containers are still
present, but any tasks running in them disappear. This results in
TaskNotFound errors, mostly in resource checks, because these containers
are long-running and therefore have a higher chance of having their task
killed.
Resource checks are more prone to this type of error because we re-use
their containers over a long period of time. The other types of
containers we make (get, put, task) usually don't hang around that long.
We initially create a Task during container creation and assume that the
task will still be there when we go to exec the actual executable that
we want to run (e.g. /opt/resource/{in,out,check}). For long-running
containers this may not be true, but we can gracefully recover in this
scenario by creating a new task.
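
The recovery idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the Container and Task interfaces, the ErrTaskNotFound sentinel, the taskOrRecreate helper, and the fake types are all hypothetical stand-ins and do not mirror containerd's actual client API.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTaskNotFound is a hypothetical sentinel standing in for whatever
// "task not found" error the runtime returns.
var ErrTaskNotFound = errors.New("no running task found")

// Task and Container are simplified, hypothetical interfaces used only
// for this sketch.
type Task interface {
	Exec(path string) error
}

type Container interface {
	Task() (Task, error)    // look up the container's existing task
	NewTask() (Task, error) // create a fresh task in the container
}

// taskOrRecreate returns the container's task, creating a new one if the
// original task has disappeared (for example after a worker restart).
func taskOrRecreate(c Container) (Task, error) {
	task, err := c.Task()
	if err == nil {
		return task, nil
	}
	if !errors.Is(err, ErrTaskNotFound) {
		// Any other failure is not recoverable here.
		return nil, fmt.Errorf("task retrieval: %w", err)
	}
	task, err = c.NewTask()
	if err != nil {
		return nil, fmt.Errorf("task recreation: %w", err)
	}
	return task, nil
}

// fakeTask and fakeContainer simulate a container whose task was lost.
type fakeTask struct{}

func (fakeTask) Exec(path string) error { return nil }

type fakeContainer struct {
	hasTask bool
	created int // number of tasks created through NewTask
}

func (c *fakeContainer) Task() (Task, error) {
	if !c.hasTask {
		return nil, ErrTaskNotFound
	}
	return fakeTask{}, nil
}

func (c *fakeContainer) NewTask() (Task, error) {
	c.hasTask = true
	c.created++
	return fakeTask{}, nil
}

func main() {
	c := &fakeContainer{} // container survived the restart, its task did not
	task, err := taskOrRecreate(c)
	if err != nil {
		fmt.Println("unrecoverable:", err)
		return
	}
	// The recovered task can now be used to exec the resource script.
	fmt.Println("exec error:", task.Exec("/opt/resource/check"))
}
```

Only the not-found case triggers recreation in this sketch; any other lookup error is returned to the caller so that unrelated failures are not masked.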
Notes to reviewer
The error can be reproduced by following these steps: #8172 (comment)
More details here: #8172 (comment)
Release Note
Gracefully recover from "task retrieval: no running task found" errors in the containerd runtime.