
Conversation

taylorsilva
Member

@taylorsilva taylorsilva commented Jul 11, 2025

Changes proposed by this PR

closes #9189

If a web node had to re-attach to an already running container, either because the web node restarted or had network connection issues with the worker, after re-attaching, the web node would miss a majority of the logs that the container emits. Random log fragments may be picked up, but most logs would be lost.

This PR fixes this bug by ensuring we clean up the previous IO readers that were attached to the containerd Task's FIFO files.
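A minimal sketch of the idea: track the live IO per task so that a re-attach can cancel the stale reader before a new one takes over. The names and types here are invented for illustration and are not the PR's actual code.

```go
package main

import "sync"

// taskIO mimics the subset of containerd's cio.IO used here: a reader
// that can be cancelled so a replacement can take over the FIFO files.
type taskIO interface {
	Cancel()
}

// ioManager tracks the live IO attached to each task so that a
// re-attach can cancel the stale reader before opening a new one.
type ioManager struct {
	mu  sync.Mutex
	ios map[string]taskIO // keyed by task ID
}

func newIOManager() *ioManager {
	return &ioManager{ios: map[string]taskIO{}}
}

// Swap cancels any previously tracked IO for the task and records the
// new one, so only one reader drains the FIFOs at a time.
func (m *ioManager) Swap(taskID string, io taskIO) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if prev := m.ios[taskID]; prev != nil {
		prev.Cancel() // stop the stale reader so the new one sees all logs
	}
	m.ios[taskID] = io
}
```

Without the cancel, the old (dead) reader keeps draining the FIFOs and the freshly attached reader only sees random fragments, which matches the symptom described above.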

Notes to reviewer

To reproduce/verify it works:

  1. `docker compose up -d --build`
  2. Log in and set the following pipeline:

```yaml
jobs:
- name: hello-counter-job
  plan:
  - config:
      image_resource:
        name: ""
        source:
          repository: cgr.dev/chainguard/bash
          tag: latest
        type: registry-image
      platform: linux
      run:
        args:
        - -ec
        - |-
          for i in {1..600}; do
            echo "Hello $i"
            sleep 1
          done | tee /tmp/mylog.txt
        path: /bin/bash
    privileged: true
    task: hello-counter-task
```
  3. Unpause the pipeline and trigger the job. Wait for a few of the "Hello..." messages to appear.
  4. `docker compose restart web`
  5. Check the job logs. Web should re-connect and continue to stream logs, with no logs dropped once re-connected. The only logs missing now are the ones emitted while the web node was not connected to the container.

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva requested a review from a team as a code owner July 11, 2025 19:54
@taylorsilva taylorsilva added this to the v7.14.0 milestone Jul 11, 2025
@taylorsilva
Member Author

@Kump3r @IvanChalukov I'll wait for you guys to review and test before merging.

@taylorsilva taylorsilva moved this to In Progress in Pull Requests Jul 11, 2025
@taylorsilva
Member Author

taylorsilva commented Jul 12, 2025

Got thinking a bit more last night, wondering if there's a way to do this without having to track the IOs ourselves. Tried doing this:

```diff
diff --git a/worker/runtime/container.go b/worker/runtime/container.go
index 0c533811a..6e090f183 100644
--- a/worker/runtime/container.go
+++ b/worker/runtime/container.go
@@ -186,9 +186,18 @@ func (c *Container) Attach(pid string, processIO garden.ProcessIO) (process gard
                return nil, fmt.Errorf("task attach: %w", err)
        }

+       prevIO := task.IO()
        cioOpts := containerdCIO(processIO, false)
+       var attach cio.Attach = func(f *cio.FIFOSet) (cio.IO, error) {
+               newIO, err := cio.NewAttach(cioOpts...)(f)
+               if prevIO != nil {
+                       prevIO.Cancel()
+                       prevIO.Close()
+               }
+               return newIO, err
+       }

-       proc, err := task.LoadProcess(ctx, pid, cio.NewAttach(cioOpts...))
+       proc, err := task.LoadProcess(ctx, pid, attach)
        if err != nil {
                return nil, fmt.Errorf("load proc: %w", err)
        }
```

Sadly, the cio.IO that the task returns at this point in the code is a new cio.IO that just points to the FIFO files. It doesn't contain the previous cio.IO that has the files open. Really seems like containerd doesn't give us a way to lookup the previous/existing cio.IO at all.

Unless someone figures out a way to look up existing cio.IOs for a given containerd Task, what we're doing is probably the easiest approach for us.

Had some initial confusion with what ID was being used between all the
different layers and function calls.

Figured out we need to track the container AND individual process/task
IDs. For example, if one did `fly intercept`, that would (prior to this
commit) overwrite the previous cio.IO for the container. The same issue
would happen with resource containers: each new exec would overwrite
the previous cio.IO.

Another issue here is that there are some cases where the task's ID is
literally "task" or "resource". Didn't bother trying to figure out
under what conditions this happens, but it was another sign that using
just the task ID to track CIOs would not be sufficient to make the
IOManager robust enough to survive production.

Now all cio.IOs are tracked by container ID and task ID.

The second bug was during cleanup. With the initial tracking only using
the task ID, we couldn't clean up our map[] of cio.IOs, so the map
would have grown forever: a memory leak!

Now we can easily delete all cio.IOs for a container by deleting by the
container ID, so the map no longer grows forever.

Signed-off-by: Taylor Silva <dev@taydev.net>
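The tracking scheme this commit describes could look roughly like the following. This is a hedged sketch with invented names, not the PR's actual `worker/runtime` code:

```go
package main

import "sync"

// taskIO stands in for containerd's cio.IO; only Cancel is needed here.
type taskIO interface {
	Cancel()
}

// ioManager keys IOs by container ID first, then task ID, so every
// exec in a container gets its own slot (fly intercept and resource
// execs no longer overwrite each other's cio.IO), and all IOs for a
// container can be dropped with one delete, so the map cannot grow
// forever.
type ioManager struct {
	mu  sync.Mutex
	ios map[string]map[string]taskIO // containerID -> taskID -> IO
}

func newIOManager() *ioManager {
	return &ioManager{ios: map[string]map[string]taskIO{}}
}

func (m *ioManager) Set(containerID, taskID string, io taskIO) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.ios[containerID] == nil {
		m.ios[containerID] = map[string]taskIO{}
	}
	m.ios[containerID][taskID] = io
}

// Get is nil-safe: indexing a missing container or task returns nil
// instead of panicking, which matters when Attach() runs before any
// Create() (e.g. right after a worker restart).
func (m *ioManager) Get(containerID, taskID string) taskIO {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.ios[containerID][taskID]
}

// DeleteContainer drops every tracked IO for a container at once.
func (m *ioManager) DeleteContainer(containerID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.ios, containerID)
}
```

Reading from a nil inner map is legal in Go (it yields the zero value), which is what makes the `Get` above safe even when the container was never registered.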
An integration test failed where we restart the worker and expect to
keep streaming a build's logs, but it failed with a panic. Our Attach()
expected that Create() would always be called before Attach(), which
would guarantee that the container's tasks were already being tracked
in the map[].

This is not the case if the worker is restarted though. The map[] is
empty in that case and we call Attach() since the container already
exists and is running. Added checks when reading the map[] so our
Attach() doesn't panic reading a nil map.

Signed-off-by: Taylor Silva <dev@taydev.net>
@Kump3r
Contributor

Kump3r commented Jul 14, 2025

Got thinking a bit more last night, wondering if there's a way to do this without having to track the IO's ourselves.

We thought about and tested something similar, though to no effect. Sadly it seems containerd does not keep any metadata about the previous logs anywhere. Should they implement such a feature in the future, we could revisit. For now I think we all reached the same conclusion/direction as what you did. Although yours is 10x better, since I didn't even see the backend for the garden server 🤣
We will invest some time in reviewing, although I think we might lack some knowledge to provide adequate suggestions on improvements. In any case I will test your solution in our environments to ensure it solves the underlying issue and does not bring weird regressions/bugs.

@IvanChalukov
Contributor

Hi @taylorsilva, I've run the following tests:

Local tests

Setup with 1 worker and 1 web instance:

  1. Restarting the worker
  • No logs were lost after the worker restarted.
  • fly watch did not lose logs after the worker restarted.
  • fly intercept maintained its connection after the worker restarted.
  2. Restarting the web
  • Logs were missing only during the web restart (as expected), but resumed streaming once the web came back up.
  • fly watch behaved the same as the UI: logs were missing only during the web restart and resumed afterward.
  • fly intercept dropped its connection during the web restart, as expected.

Bosh environment

It looks like there's a regression when running with 2 web nodes and 2 workers (the number of workers doesn't seem to matter). Both errors appear after sequential web restarts (trying to simulate a rolling update of the web nodes).

  1. Fails in task.Exec(..) in the container Run() method and kills the container.
  • After the first web restart, the worker switched over to the second web and streaming continued as expected.
  • But after restarting the second web, the worker switched back to the first and then failed with:

```
{"timestamp":"2025-07-15T12:34:36.575461374Z","level":"error","source":"worker","message":"worker.garden.garden-server.run.failed","data":{"error":"task exec: id task: already exists","handle":"163013a7-1a21-4920-9449-5f2f9e956fea","session":"1.2.127"}}
```

  2. Fails in task.LoadProcess(..) in the container Attach() method and kills the container.
    Couldn't find a clear pattern, but during the restart of the first web, sometimes this was thrown:

```
wait for process completion: backend error: Exit status: 500, message: {"Type":"","Message":"load proc: failed to open stdin fifo: error creating fifo /run/containerd/fifo/4277305638/task-stdin: no such file or directory","Handle":"","ProcessID":"","Binary":""}
```

Summary

  • Single web + single worker behaves correctly under both web and worker restarts.
  • With multiple web nodes, there are regressions related to task ID reuse and missing FIFOs after sequential web restarts.

Did you try running with 2 web instances? I’ll dig into this further to see if I can pinpoint the issue.

@Kump3r
Contributor

Kump3r commented Jul 15, 2025

I do not have time to test, but shooting a suspicion; feel free to disregard if it's BS.
Is it possible that during a BOSH web process restart, the in-memory data for the original task is not flushed? Then the second web attaches and closes the streams, but as soon as the second web shuts down, the original web tries to attach with the old FIFOs/streams?

@taylorsilva
Member Author

taylorsilva commented Jul 15, 2025

I haven't tried with two web instances. I'll see how I can reproduce that locally.

Not sure why that first error is happening. The second error, no such file or directory, I saw happen when we detach from the FIFOs before attaching the new ones. Hopefully some more debugging will explain why this is happening.

I'll investigate and share what I find here.

…occurred once already

The bug was discovered when doing rolling updates on a cluster with two
web nodes. The first re-attachment would succeed, but the second would
fail. This appears to happen because we fully closed the previous CIO
(prevIO.Close()), which signaled to containerd to delete the task's
FIFO files. It seems containerd eventually gets a chance to delete the
files during the second re-attachment, which results in this error:

```
load proc: failed to open stdin fifo: error creating fifo /run/containerd/fifo/1731945221/task-stdin: no such file or directory
```

We avoid this by never calling `Close()` on any of the previous CIOs.
Calling `Cancel()` is enough to stop the previous CIOs from reading the
FIFO files, allowing the new CIO to read all the logs from the
containerd task.

Signed-off-by: Taylor Silva <dev@taydev.net>
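The Cancel-vs-Close distinction can be illustrated with a toy model of the FIFO lifecycle. This is a simulation of the behaviour described in the commit message above, not containerd's actual implementation:

```go
package main

import "errors"

// fifoSet stands in for a containerd task's FIFO files. In this toy
// model, Close()ing an attached IO deletes the files, while Cancel()
// merely stops the reader and leaves the files in place.
type fifoSet struct{ deleted bool }

type attachedIO struct {
	fifos   *fifoSet
	reading bool
}

func (io *attachedIO) Cancel() { io.reading = false }

func (io *attachedIO) Close() {
	io.reading = false
	io.fifos.deleted = true
}

// attach fails once the FIFOs are gone, mirroring the
// "failed to open stdin fifo: no such file or directory" error.
func attach(f *fifoSet) (*attachedIO, error) {
	if f.deleted {
		return nil, errors.New("failed to open stdin fifo: no such file or directory")
	}
	return &attachedIO{fifos: f, reading: true}, nil
}
```

In this model, a first re-attach that calls `Close()` on the previous IO makes the second re-attach fail, while a re-attach that only calls `Cancel()` leaves the FIFOs usable.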
@taylorsilva
Member Author

I was able to reproduce the issue locally and fix it. I'm going to try adding an integration test for this as well since a lot of users have two web nodes in their Concourse clusters.

The underlying error for both scenarios Ivan reported was:

```
load proc: failed to open stdin fifo: error creating fifo /run/containerd/fifo/1731945221/task-stdin: no such file or directory
```

You only saw this in the worker's logs and not in the web UI. This happens because the web node always tries attaching to a container first and then falls back to initializing a new task if it gets an error trying to attach. The reason for this has to do with the Garden API and keeping compatibility with the Guardian backend.

Anyways, figured out we just need to not fully close the previous IOs. It seems doing that signals to containerd that it can delete the task's FIFO files, at which point things can get racy. I'm guessing this is why Ivan would sometimes see the failed to open stdin fifo error.

The only concern I had was whether we're leaking anything by not calling Close() on the previous CIOs. I've confirmed that the FIFO files are deleted once the task is done running. I also watched lsof's output while doing multiple rolling restarts of the web nodes and saw no growth; it stayed stable. So I don't think we're leaking anything by not calling Close() on the previous CIO.

Ivan, if you could do your tests one more time that would be great. You caught a good bug here!

@IvanChalukov
Contributor

That all makes sense, thanks for digging into it!

You only saw this in the worker logs and not in the web UI.

Actually, I observed those errors in both the worker logs and the UI logs.

That said, I ran the same tests again, both locally and in a BOSH environment, and I can confirm that after the web restarts, there are no errors in either the worker or UI logs.

I also kept a close eye on the FIFO files, and they’re being cleaned up exactly as they should after reattaching to the task. So Containerd isn’t leaking anything from my perspective - it’s correctly cleaning up the FIFOs even without explicitly calling Close() on the previous CIOs.

Really nice catch on the need to avoid fully closing the previous CIOs. Looks like that solved the issue cleanly.

@taylorsilva
Member Author

I've added an integration test that brings up two web nodes and tests that build logs are not lost after multiple rolling restarts of the web nodes. It checks for the first few log lines "Hello {1...3}" and some of the last few "Hello {110...112}" as verification that logs continue to stream with a fly watch command.

Signed-off-by: Taylor Silva <dev@taydev.net>
…restarts

Signed-off-by: Taylor Silva <dev@taydev.net>
With the new integration test that puts nginx in front of the web
nodes, nginx comes up faster than the web nodes and returns HTTP 502.
When downloading fly, we weren't checking the HTTP status code (the
err returned from Get() is nil for non-2xx status codes) and ended up
downloading the nginx 502 error page.

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva merged commit 8419676 into master Jul 17, 2025
12 checks passed

Successfully merging this pull request may close these issues.

Missing build logs - containerd