
Conversation

jedevc
Contributor

@jedevc jedevc commented Apr 13, 2023

While investigating docker/buildx#1726, I found a couple of timing issues when using the docker pusher.

I discovered these issues in a test environment; they should hopefully be reproducible using: https://gist.github.com/jedevc/ffb16d1319618d116df89e327d3bd589.

Copying the information from the commit messages (see them individually for more details):

I'm not quite sure about adding tests for these: they're very dependent on exact timings, since the patches are about correcting racy behavior. Using my test environment shared above (repeating commands hundreds of times in a row), I have confirmed that we no longer see the deadlocks, panics, and incorrect error messages.

Also happy to split these up into separate PRs, if some of them are easier to review than others.

@k8s-ci-robot

Hi @jedevc. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@akhilerm
Member

/cc

@k8s-ci-robot

@akhilerm: GitHub didn't allow me to request PR reviews from the following users: akhilerm.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jedevc jedevc force-pushed the docker-pusher-concurrency branch from b52596f to d9df7c3 on April 13, 2023 12:32
@jedevc jedevc marked this pull request as draft April 13, 2023 15:02
@jedevc jedevc force-pushed the docker-pusher-concurrency branch from d9df7c3 to e491f1f on April 13, 2023 15:03
@jedevc jedevc marked this pull request as ready for review April 13, 2023 16:21
case err2 := <-pw.errC:
err = err2
default:
case <-pw.done:
Member

Let me make sure I understand the 3 cases correctly...

  • The request has failed, we need to attempt a reset, so we can expect a
    new pipe incoming on pipeC.

A redirect or something that retries where the http library requests a new body

  • The request has failed, we don't need to attempt a reset, so we can
    expect an incoming error on errC.

Either the request fails or an unexpected status code is returned

  • Something else externally has called Close, so we can expect the done
    channel to be closed.

Straightforward, but I don't see any case where this happens.

There is a condition where the registry (unexpectedly, not to spec) returns 201 or 204 on the put before the body is fully written. I would expect that the http library would issue close and could fall into a deadlock here. We could just read respC and call setResponse. In that case ErrClosedPipe would get returned and Commit shouldn't be called anyway.

Contributor Author

Sorry for the delay in getting back to this! ❤️

Let me make sure I understand the 3 cases correctly...

Yup, that's how I understand it as well.

There is a condition where the registry (unexpectedly, not to spec) returns 201 or 204 on the put before the body is fully written. I would expect that the http library would issue close and could fall into a deadlock here.

Yeah, I think this is also possible. I've updated to do this as well.

Contributor Author

@dmcgowan could you take another look? 🙏

Sorry, I know this one is fairly hellish and concurrent.

@jedevc
Contributor Author

jedevc commented Aug 4, 2023

@dmcgowan sorry to bother, could you take another look?

If there are concerns about changing the complexity of the code here, I'm happy to try to put together some tests to run with the race checker, if that helps move things forward.

@dmcgowan
Member

dmcgowan commented Aug 7, 2023

Put on my list to review. I'm just coming back from a break, and this one is not a quick review due to the complexity. Thanks for updating!

@dmcgowan
Member

dmcgowan commented Feb 2, 2024

@jedevc added to 2.0 milestone, current change LGTM if you want to rebase

@dmcgowan dmcgowan added this to the 2.0 milestone Feb 2, 2024
@dmcgowan dmcgowan added the cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch label Feb 2, 2024
io.Pipe produces a PipeReader and a PipeWriter - a close on the write
side causes an error on both the read and write sides, while a close on
the read side causes an error on only the read side. Previously, we
explicitly prohibited closing from the read side.

However, http.Request.Body requires that "calling Close should unblock a
Read waiting for input". Our reader will not do this - calling close
becomes a no-op. This can cause a deadlock because client.Do may never
terminate in some circumstances.

We need the Reader side to close its side of the pipe as well, which it
already does using the go standard library - otherwise, we can hang
forever, writing to a pipe that will never be closed.

Allowing the requester to close the body should be safe - we never reuse
the same reader between requests, as the result of body() will never be
reused by the guarantees of the standard library.

Signed-off-by: Justin Chadwell <me@jedevc.com>
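
A minimal sketch (standard library only, no containerd types) of the
io.Pipe behaviour this commit relies on: closing the read side, which
is what the http library does with the request body, must unblock a
Write that is still waiting on the pipe.

	package main

	import (
		"fmt"
		"io"
		"time"
	)

	func main() {
		pr, pw := io.Pipe()

		// A writer blocked on the pipe, as the pusher is while the http
		// client consumes the request body.
		done := make(chan error, 1)
		go func() {
			_, err := pw.Write([]byte("layer data"))
			done <- err
		}()

		// Closing the read side (what http.Request.Body.Close does) must
		// unblock the pending Write; previously the pusher's wrapper made
		// this Close a no-op, so the Write could hang forever.
		time.Sleep(10 * time.Millisecond)
		pr.Close()

		fmt.Println(<-done) // io: read/write on closed pipe
	}
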
If Close is called externally before a request is attempted, then we
will accidentally attempt to send to a closed channel, causing a panic.

To avoid this, we can check to see if Close has been called, using a
done channel. If this channel is ever done, we drop any incoming errors,
requests or pipes - we don't need them, since we're done.

Signed-off-by: Justin Chadwell <me@jedevc.com>
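
A hypothetical sketch of that done-channel guard (field names mirror
the commit message, not necessarily the real pushWriter): once done has
been closed, an incoming error is simply dropped instead of being
delivered to a writer that no longer wants it.

	package main

	import (
		"errors"
		"fmt"
	)

	// guardedSketch keeps only the two channels needed to show the guard.
	type guardedSketch struct {
		errC chan error    // unbuffered: delivery requires a live receiver
		done chan struct{} // closed by Close
	}

	func (g *guardedSketch) sendError(err error) {
		select {
		case <-g.done:
			// Close was already called externally; drop the error rather
			// than panicking or blocking on a send nobody will receive.
		case g.errC <- err:
		}
	}

	func main() {
		g := &guardedSketch{errC: make(chan error), done: make(chan struct{})}
		close(g.done) // simulate Close being called before any request ran
		g.sendError(errors.New("request failed"))
		fmt.Println("error dropped safely after Close")
	}
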
If we get io.ErrClosedPipe in pushWriter.Write, there are three possible
scenarios:

- The request has failed, we need to attempt a reset, so we can expect a
  new pipe incoming on pipeC.
- The request has failed, we don't need to attempt a reset, so we can
  expect an incoming error on errC.
- Something else externally has called Close, so we can expect the done
  channel to be closed.

This patch ensures that we block for as long as possible (while still
handling each of the above cases, so we avoid hanging), to make sure
that we properly return an appropriate error message each time.

Signed-off-by: Justin Chadwell <me@jedevc.com>
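
A hypothetical, self-contained sketch of that blocking select
(illustrative names only; the merged code additionally reads an early
response, per the review discussion above, and still surfaces
ErrClosedPipe in that case):

	package main

	import (
		"errors"
		"fmt"
		"io"
	)

	// pushWriterSketch stands in for the real pushWriter, keeping only the
	// channels the commit message describes.
	type pushWriterSketch struct {
		pipeC chan *io.PipeWriter // a retried request delivers a fresh pipe here
		errC  chan error          // a terminal failure is delivered here
		done  chan struct{}       // closed when Close is called externally
	}

	var errResetSketch = errors.New("writer has been reset; retry the write")

	// onClosedPipe blocks until one of the three signals arrives, so the
	// caller always gets the most specific outcome instead of a bare
	// io.ErrClosedPipe.
	func (pw *pushWriterSketch) onClosedPipe() error {
		select {
		case _, ok := <-pw.pipeC:
			if !ok {
				return io.ErrClosedPipe
			}
			return errResetSketch // a new pipe is available: reset and retry
		case err := <-pw.errC:
			return err // the request failed and no reset is coming
		case <-pw.done:
			return io.ErrClosedPipe // closed externally; nothing left to wait for
		}
	}

	func main() {
		pw := &pushWriterSketch{
			pipeC: make(chan *io.PipeWriter, 1),
			errC:  make(chan error, 1),
			done:  make(chan struct{}),
		}
		pw.errC <- errors.New("unexpected status: 500 Internal Server Error")
		fmt.Println(pw.onClosedPipe())
	}
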
If a writer continually asks to be reset then it should always succeed -
it should be the responsibility of the underlying content.Writer to
stop producing ErrReset after some amount of time and to instead return
the underlying issue - which pushWriter already does today, using the
doWithRetries function.

doWithRetries already has a separate cap for retries of 6 requests (5
retries after the original failure), and it seems like this was
previously overridden by content.Copy's maximum of 5 attempts, hiding
the original error.

Signed-off-by: Justin Chadwell <me@jedevc.com>
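
A small, hypothetical sketch of where the cap belongs under this
reasoning: the request helper (doWithRetries in the real code) owns the
retry limit and, once it is exhausted, returns the underlying error
rather than another reset, so the outer copy loop never hides the
original failure.

	package main

	import (
		"errors"
		"fmt"
	)

	// doWithCap is an illustrative stand-in for doWithRetries: try up to
	// `attempts` times, then give up and return the real error.
	func doWithCap(attempts int, do func() error) error {
		var err error
		for i := 0; i < attempts; i++ {
			if err = do(); err == nil {
				return nil
			}
		}
		return err // after the cap, surface the underlying failure
	}

	func main() {
		calls := 0
		err := doWithCap(6, func() error {
			calls++
			return errors.New("unexpected status: 401 Unauthorized")
		})
		fmt.Println(calls, err) // 6 unexpected status: 401 Unauthorized
	}
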
If sending two messages from goroutine X:

	a <- 1
	b <- 2

And receiving them in goroutine Y:

	select {
	case <- a:
	case <- b:
	}

Either branch of the select can trigger first - so when we call
.setError and .Close next to each other, we don't know whether the done
channel will close first or the error channel will receive first - so
sometimes, we get an incorrect error message.

We resolve this by not sending both signals - instead, we can have
.setError *imply* .Close, by having the pushWriter call .Close on
itself, after receiving an error.

Signed-off-by: Justin Chadwell <me@jedevc.com>
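
A hypothetical sketch of that single-signal ordering (illustrative
names, not the exact containerd code): the failing side only sends the
error, and the writer closes itself once the error has been received,
so a racing select can never observe done before the real error is
available.

	package main

	import (
		"errors"
		"fmt"
		"sync"
	)

	type writerSketch struct {
		errC chan error
		done chan struct{}
		once sync.Once
		err  error
	}

	// setError sends exactly one signal; it no longer also closes the writer.
	func (w *writerSketch) setError(err error) {
		w.errC <- err
	}

	func (w *writerSketch) Close() error {
		w.once.Do(func() { close(w.done) })
		return nil
	}

	// wait receives the error first and only then closes, so the done branch
	// can never win the race against a pending error.
	func (w *writerSketch) wait() error {
		select {
		case err := <-w.errC:
			w.err = err
			w.Close()
			return err
		case <-w.done:
			return errors.New("writer closed without error")
		}
	}

	func main() {
		w := &writerSketch{errC: make(chan error, 1), done: make(chan struct{})}
		go w.setError(errors.New("unexpected status: 500"))
		fmt.Println(w.wait()) // always the real error, never "writer closed"
	}
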
We also need an additional check to avoid setting both the error and
the response, which can create a race where they can arrive in the
receiving thread in either order.

If we hit an error, we don't need to send the response.

> There is a condition where the registry (unexpectedly, not to spec)
> returns 201 or 204 on the put before the body is fully written. I would
> expect that the http library would issue close and could fall into a
> deadlock here. We could just read respC and call setResponse. In that
> case ErrClosedPipe would get returned and Commit shouldn't be called
> anyway.

Signed-off-by: Justin Chadwell <me@jedevc.com>
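
A small, hypothetical sketch of the one-signal-per-request rule this
commit describes (illustrative types, not the containerd API): on
failure only the error is delivered and the response is dropped, so the
receiving side can never see both arrive in an arbitrary order.

	package main

	import (
		"errors"
		"fmt"
		"net/http"
	)

	// resultSketch carries one channel per possible outcome.
	type resultSketch struct {
		errC  chan error
		respC chan *http.Response
	}

	// finish delivers exactly one signal per request: on error, the response
	// (if any) is dropped rather than racing the error to the receiver.
	func (r *resultSketch) finish(resp *http.Response, err error) {
		if err != nil {
			r.errC <- err
			return // do not also send the response
		}
		r.respC <- resp
	}

	func main() {
		r := &resultSketch{errC: make(chan error, 1), respC: make(chan *http.Response, 1)}
		r.finish(nil, errors.New("unexpected status: 401 Unauthorized"))

		select {
		case err := <-r.errC:
			fmt.Println("error:", err)
		case resp := <-r.respC:
			fmt.Println("response:", resp.Status)
		}
	}
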
@jedevc jedevc force-pushed the docker-pusher-concurrency branch from 7a1b8c8 to a9152eb on February 3, 2024 09:27
@jedevc
Contributor Author

jedevc commented Feb 3, 2024

Rebased 🎉 Thanks @dmcgowan! Sorry this one was such a pain.

@dmcgowan dmcgowan added this pull request to the merge queue Feb 21, 2024
Merged via the queue into containerd:main with commit b8654e3 Feb 21, 2024
@jedevc jedevc deleted the docker-pusher-concurrency branch February 22, 2024 11:22
@austinvazquez austinvazquez added cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch and removed cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Nov 20, 2024
Successfully merging this pull request may close these issues.

imagetools create should propagate the original error