Skip to content

Conversation

jedevc
Copy link
Contributor

@jedevc jedevc commented Jan 23, 2023

🐛 Fixes #7972

In the refactor from #6995, the error handling was substantially reworked, and changed the types of errors returned - this caused issues in retry logic downstream in BuildKit (see docker/build-push-action#761). BuildKit uses this error result from this function to determine whether to retry the push or not.

Notably, in the case of a network error, instead of propagating the error through to return from pushWriter.Write (as previously), it would be propagated through to pushWriter.Commit - however, this is too late, since we've already closed the io.Pipe by the time we would have reached this function. Therefore, we get the generic error message "io: read/write on closed pipe" for every network error - this seems to be the issue in #7972, likely there is some other underlying network error, but it is not shown.

This patch corrects this behavior to ensure that the correct error object is always returned as early as possible:

  1. Track the corresponding io.PipeReader for the pushWriter -- on any error, we CloseWithError the Reader, to ensure that the next Write or Commit to the Writer returns the desired error. This applies for both network errors from request.doWithRetries, as well as for the ErrResets.
  2. Remove the error channel - we don't need this any more, errors are instead clearly tracked through the CloseWithError function.
  3. Always reset the content, even if the offset is already 0. The previous patch could fallthrough out of the switch statement, and attempt to access a nil response object.

I've also slightly refactored the test to work with the new symbols, and to explicitly test with the commit function, since it took me a while to manually verify that I hadn't broken the logic of the test.


Finally, I've verified that this fix works with BuildKit:


I'm not 100% confident in this code, so hopefully some experienced maintainers can weigh in ❤️. I think while the changes are complex and difficult to verify, I think they do need to be cherry-picked across since the previous changes break heavily on the v1.6 branch

@k8s-ci-robot
Copy link

Hi @jedevc. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jedevc jedevc force-pushed the fix-push-error-propagate branch from 9e3e6ad to 197a317 Compare January 23, 2023 15:14
@akhilerm
Copy link
Member

/cc @dmcgowan

Also, can we get an ok-to-test on this PR

@k8s-ci-robot k8s-ci-robot requested a review from dmcgowan January 23, 2023 15:54
@estesp
Copy link
Member

estesp commented Jan 23, 2023

/ok-to-test

@dmcgowan
Copy link
Member

Could this change be simplified by checking the error channel on Write? The errors don't propagate synchronously either way since the request is done in a go routine. Using the error channel, then you don't need to change the side which closes with error since the Write operation can fail before that point. The current code does a setError followed by a close, so it is possible to check for an error on Write when a closed pipe error is hit.

@jedevc
Copy link
Contributor Author

jedevc commented Jan 23, 2023

Could this change be simplified by checking the error channel on Write?

Oh good call 🎉 The logic that prevented this working properly is the behavior that called .Close after .setError - if we don't call .Close on a network error, then we can still propagate correctly.

I think that means we don't even need to call CloseWithError? Will take a closer look.

@dmcgowan
Copy link
Member

if we don't call .Close on a network error, then we can still propagate correctly

I think .Close still needs to be called to prevent the pipe write from hanging after a request error. Since we are wrapping the reader in NopCloser, the writer should close explicitly after request returns error. Think it is more about using the right error, closed pipe is almost always the case where another error more meaningful occurred.

@jedevc
Copy link
Contributor Author

jedevc commented Jan 23, 2023

Wouldn't we then still have a race where the pipe close error could be returned then?

  1. Write checks error, no error found
  2. Network error occurs, sent to error channel (since error channel is non-blocking), pipe closed
  3. Write calls pipe.Write, gets io.PipeClosedError

The only way to avoid this is if we can avoid the channel entirely, and directly CloseWithError the pipe - and I'm not sure what the error channel is doing then.


Edit: sorry I think I misread. After getting a PipeClosedError from writing to the pipe, we should pull from the error channel - then if we have a "better" error from there, return that, otherwise, return the PipeClosedError.

@AkihiroSuda AkihiroSuda added the cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch label Jan 24, 2023
In the refactor from 926b9c7, the error
handling was substantially reworked, and changed the types of errors
returned.

Notably, in the case of a network error, instead of propogating the
error through to return from pushWriter.Write (as previously), it would
be propagated through to pushWriter.Commit - however, this is too late,
since we've already closed the io.Pipe by the time we would have reached
this function. Therefore, we get the generic error message  "io:
read/write on closed pipe" for *every network error*.

This patch corrects this behavior to ensure that the correct error
object is always returned as early as possible, by checking the error
result after writing and detecting a closed pipe.

Additionally, we do some additional hardening - specifically we prevent
falling through when resetting the content or detecting errors, and
update the tests to explicitly check for the ErrReset message.

Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc jedevc force-pushed the fix-push-error-propagate branch from 197a317 to 9f6058d Compare January 24, 2023 11:38
@jedevc
Copy link
Contributor Author

jedevc commented Jan 24, 2023

@dmcgowan have reworked the patch as you suggested, it's a lot simpler.

I also tested this with BuildKit again, and yup, this does appear to fix our downstream issue still.

Copy link
Member

@estesp estesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dmcgowan dmcgowan merged commit c873647 into containerd:main Jan 24, 2023
@jedevc
Copy link
Contributor Author

jedevc commented Jan 24, 2023

Awesome 🎉 should I open a separate cherry-pick PR?

@shawaj
Copy link

shawaj commented Jan 24, 2023

Thanks to you all for such quick resolution 🥳

@thaJeztah thaJeztah added cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch and removed cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch labels Jan 24, 2023
@thaJeztah thaJeztah added the cherry-picked/1.5.x PR commits are cherry-picked into release/1.5 branch label Jan 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/1.5.x PR commits are cherry-picked into release/1.5 branch cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch ok-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"failed to copy: io: read/write on closed pipe" on ctr images push while pushing large images
8 participants