
Conversation

taylorsilva
Member

@taylorsilva taylorsilva commented Jul 11, 2025

Changes proposed by this PR

closes #9189

If a web node had to re-attach to an already running container, either because the web node restarted or had network connection issues with the worker, after re-attaching, the web node would miss a majority of the logs that the container emits. Random log fragments may be picked up, but most logs would be lost.

This PR fixes this bug by ensuring we clean up the previous IO readers that were attached to the containerd Task's FIFO files.
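A minimal sketch of the idea: track the live IO per task so that a re-attach can cancel the stale reader before a new one takes over. The names and types here are invented for illustration and are not the PR's actual code.

```go
package main

import "sync"

// taskIO mimics the subset of containerd's cio.IO used here: a reader
// that can be cancelled so a replacement can take over the FIFO files.
type taskIO interface {
	Cancel()
}

// ioManager tracks the live IO attached to each task so that a
// re-attach can cancel the stale reader before opening a new one.
type ioManager struct {
	mu  sync.Mutex
	ios map[string]taskIO // keyed by task ID
}

func newIOManager() *ioManager {
	return &ioManager{ios: map[string]taskIO{}}
}

// Swap cancels any previously tracked IO for the task and records the
// new one, so only one reader drains the FIFOs at a time.
func (m *ioManager) Swap(taskID string, io taskIO) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if prev := m.ios[taskID]; prev != nil {
		prev.Cancel() // stop the stale reader so the new one sees all logs
	}
	m.ios[taskID] = io
}
```

Without the cancel, the old (dead) reader keeps draining the FIFOs and the freshly attached reader only sees random fragments, which matches the symptom described above.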

Notes to reviewer

To reproduce/verify it works:

  1. `docker compose up -d --build`
  2. Log in and set the following pipeline:

```yaml
jobs:
- name: hello-counter-job
  plan:
  - config:
      image_resource:
        name: ""
        source:
          repository: cgr.dev/chainguard/bash
          tag: latest
        type: registry-image
      platform: linux
      run:
        args:
        - -ec
        - |-
          for i in {1..600}; do
            echo "Hello $i"
            sleep 1
          done | tee /tmp/mylog.txt
        path: /bin/bash
    privileged: true
    task: hello-counter-task
```
  3. Unpause the pipeline and trigger the job. Wait for a few of the "Hello..." messages to appear.
  4. `docker compose restart web`
  5. Check the job logs. Web should re-connect and continue to stream logs, with no logs dropped once re-connected. The only logs missing now are the ones emitted while the web node was not connected to the container.

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva requested a review from a team as a code owner July 11, 2025 19:54
@taylorsilva taylorsilva added this to the v7.14.0 milestone Jul 11, 2025
@taylorsilva
Member Author

@Kump3r @IvanChalukov I'll wait for you guys to review and test before merging.

@taylorsilva taylorsilva moved this to In Progress in Pull Requests Jul 11, 2025
@taylorsilva
Member Author

taylorsilva commented Jul 12, 2025

Got thinking a bit more last night, wondering if there's a way to do this without having to track the IOs ourselves. Tried doing this:

```diff
diff --git a/worker/runtime/container.go b/worker/runtime/container.go
index 0c533811a..6e090f183 100644
--- a/worker/runtime/container.go
+++ b/worker/runtime/container.go
@@ -186,9 +186,18 @@ func (c *Container) Attach(pid string, processIO garden.ProcessIO) (process gard
                return nil, fmt.Errorf("task attach: %w", err)
        }

+       prevIO := task.IO()
        cioOpts := containerdCIO(processIO, false)
+       var attach cio.Attach = func(f *cio.FIFOSet) (cio.IO, error) {
+               newIO, err := cio.NewAttach(cioOpts...)(f)
+               if prevIO != nil {
+                       prevIO.Cancel()
+                       prevIO.Close()
+               }
+               return newIO, err
+       }

-       proc, err := task.LoadProcess(ctx, pid, cio.NewAttach(cioOpts...))
+       proc, err := task.LoadProcess(ctx, pid, attach)
        if err != nil {
                return nil, fmt.Errorf("load proc: %w", err)
        }
```

Sadly, the cio.IO that the task returns at this point in the code is a new cio.IO that just points to the FIFO files. It doesn't contain the previous cio.IO that has the files open. Really seems like containerd doesn't give us a way to lookup the previous/existing cio.IO at all.

Unless someone figures out a way to look up existing cio.IOs for a given containerd Task, what we're doing is probably the easiest approach for us.

Had some initial confusion with what ID was being used between all the
different layers and function calls.

Figured out we need to track the container AND individual process/task
IDs. For example, if one did `fly intercept`, that would (prior to this
commit) overwrite the previous cio.IO for the container. The same issue
would happen with resource containers: each new exec would overwrite
the previous cio.IO.

Another issue here is that there are some cases where the task's ID is
literally "task" or "resource". Didn't bother trying to figure out
under what conditions this happens, but it was another sign that using
just the task ID to track CIOs would not be sufficient to make the
IOManager robust enough to survive production.

Now all cio.IOs are tracked by container ID and task ID.

The second bug was during cleanup. With the initial tracking only using
the task ID, we couldn't clean up our map[] of cio.IOs, so the map
would have grown forever: a memory leak!

Now we can easily delete all cio.IOs for a container by deleting by the
container ID, so the map no longer grows forever.

Signed-off-by: Taylor Silva <dev@taydev.net>
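The tracking scheme this commit describes could look roughly like the following. This is a hedged sketch with invented names, not the PR's actual `worker/runtime` code:

```go
package main

import "sync"

// taskIO stands in for containerd's cio.IO; only Cancel is needed here.
type taskIO interface {
	Cancel()
}

// ioManager keys IOs by container ID first, then task ID, so every
// exec in a container gets its own slot (fly intercept and resource
// execs no longer overwrite each other's cio.IO), and all IOs for a
// container can be dropped with one delete, so the map cannot grow
// forever.
type ioManager struct {
	mu  sync.Mutex
	ios map[string]map[string]taskIO // containerID -> taskID -> IO
}

func newIOManager() *ioManager {
	return &ioManager{ios: map[string]map[string]taskIO{}}
}

func (m *ioManager) Set(containerID, taskID string, io taskIO) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.ios[containerID] == nil {
		m.ios[containerID] = map[string]taskIO{}
	}
	m.ios[containerID][taskID] = io
}

// Get is nil-safe: indexing a missing container or task returns nil
// instead of panicking, which matters when Attach() runs before any
// Create() (e.g. right after a worker restart).
func (m *ioManager) Get(containerID, taskID string) taskIO {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.ios[containerID][taskID]
}

// DeleteContainer drops every tracked IO for a container at once.
func (m *ioManager) DeleteContainer(containerID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.ios, containerID)
}
```

Reading from a nil inner map is legal in Go (it yields the zero value), which is what makes the `Get` above safe even when the container was never registered.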
An integration test failed where we restart the worker and expect to
keep streaming a build's logs, but it failed with a panic. Our Attach()
expected that Create() would always be called before Attach(), which
would guarantee that the container's tasks were already being tracked
in the map[].

This is not the case if the worker is restarted though. The map[] is
empty in that case and we call Attach() since the container already
exists and is running. Added checks when reading the map[] so our
Attach() doesn't panic reading a nil map.

Signed-off-by: Taylor Silva <dev@taydev.net>
@Kump3r
Contributor

Kump3r commented Jul 14, 2025

Got thinking a bit more last night, wondering if there's a way to do this without having to track the IO's ourselves.

We thought about and tested something similar, though to no effect. Sadly it seems containerd does not keep any metadata about the previous logs anywhere. Should they implement such a feature in the future, we could revisit. For now I think we all reached the same conclusion/direction as what you did. Although yours is 10x better, since I didn't even see the backend for the garden server 🤣
We will invest some time in reviewing, although I think we might lack some knowledge to provide adequate suggestions on improvements. In any case I will test your solution in our environments to ensure it solves the underlying issue and does not bring weird regressions/bugs.

@IvanChalukov
Contributor

Hi @taylorsilva, I've run the following tests:

Local tests

Setup with 1 worker and 1 web instance:

  1. Restarting the worker
  • No logs were lost after the worker restarted.
  • fly watch did not lose logs after the worker restarted.
  • fly intercept maintained its connection after the worker restarted.
  2. Restarting the web
  • Logs were missing only during the web restart (as expected), but resumed streaming once the web came back up.
  • fly watch behaved the same as the UI: logs were missing only during the web restart and resumed afterward.
  • fly intercept dropped its connection during the web restart, as expected.

Bosh environment

It looks like there's a regression when running with 2 web nodes and 2 workers (the number of workers doesn't seem to matter). Both errors appear after sequential web restarts (trying to simulate a rolling update of the web nodes).

  1. Fails in task.Exec(..) in the container Run() method and kills the container.
  • After the first web restart, the worker switched over to the second web and streaming continued as expected.
  • But after restarting the second web, the worker switched back to the first and then failed with:

```
{"timestamp":"2025-07-15T12:34:36.575461374Z","level":"error","source":"worker","message":"worker.garden.garden-server.run.failed","data":{"error":"task exec: id task: already exists","handle":"163013a7-1a21-4920-9449-5f2f9e956fea","session":"1.2.127"}}
```

  2. Fails in task.LoadProcess(..) in the container Attach() method and kills the container.
    Couldn't find a clear pattern, but during the restart of the first web, sometimes this was thrown:

```
wait for process completion: backend error: Exit status: 500, message: {"Type":"","Message":"load proc: failed to open stdin fifo: error creating fifo /run/containerd/fifo/4277305638/task-stdin: no such file or directory","Handle":"","ProcessID":"","Binary":""}
```

Summary

  • Single web + single worker behaves correctly under both web and worker restarts.
  • With multiple web nodes, there are regressions related to task ID reuse and missing FIFOs after sequential web restarts.

Did you try running with 2 web instances? I’ll dig into this further to see if I can pinpoint the issue.

@Kump3r
Contributor

Kump3r commented Jul 15, 2025

I do not have time to test, but shooting a suspicion; feel free to disregard if it's BS.
Is it possible that during a BOSH web process restart, the in-memory data for the original task is not flushed? Then the second web attaches and closes the streams, but as soon as the second web shuts down, the original web tries to attach with the old FIFOs/streams?

@taylorsilva
Member Author

taylorsilva commented Jul 15, 2025

I haven't tried with two web instances. I'll see how I can reproduce that locally.

Not sure why that first error is happening. The second error, no such file or directory, I saw happen when we detach from the FIFOs before attaching the new ones. Hopefully some more debugging will explain why this is happening.

I'll investigate and share what I find here.

…occurred once already

The bug was discovered when doing rolling updates on a cluster with two
web nodes. The first re-attachment would succeed, but the second would
fail. This appears to happen because we fully closed the previous CIO
(prevIO.Close()), which signaled to containerd to delete the task's
FIFO files. It seems containerd eventually gets a chance to delete the
files during the second re-attachment, which results in this error:

```
load proc: failed to open stdin fifo: error creating fifo /run/containerd/fifo/1731945221/task-stdin: no such file or directory
```

We avoid this by never calling `Close()` on any of the previous CIOs.
Calling `Cancel()` is enough to stop the previous CIOs from reading the
FIFO files, allowing the new CIO to read all the logs from the
containerd task.

Signed-off-by: Taylor Silva <dev@taydev.net>
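The Cancel-vs-Close distinction can be illustrated with a toy model of the FIFO lifecycle. This is a simulation of the behaviour described in the commit message above, not containerd's actual implementation:

```go
package main

import "errors"

// fifoSet stands in for a containerd task's FIFO files. In this toy
// model, Close()ing an attached IO deletes the files, while Cancel()
// merely stops the reader and leaves the files in place.
type fifoSet struct{ deleted bool }

type attachedIO struct {
	fifos   *fifoSet
	reading bool
}

func (io *attachedIO) Cancel() { io.reading = false }

func (io *attachedIO) Close() {
	io.reading = false
	io.fifos.deleted = true
}

// attach fails once the FIFOs are gone, mirroring the
// "failed to open stdin fifo: no such file or directory" error.
func attach(f *fifoSet) (*attachedIO, error) {
	if f.deleted {
		return nil, errors.New("failed to open stdin fifo: no such file or directory")
	}
	return &attachedIO{fifos: f, reading: true}, nil
}
```

In this model, a first re-attach that calls `Close()` on the previous IO makes the second re-attach fail, while a re-attach that only calls `Cancel()` leaves the FIFOs usable.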
@taylorsilva
Member Author

I was able to reproduce the issue locally and fix it. I'm going to try adding an integration test for this as well since a lot of users have two web nodes in their Concourse clusters.

The underlying error for both scenarios Ivan reported was:

```
load proc: failed to open stdin fifo: error creating fifo /run/containerd/fifo/1731945221/task-stdin: no such file or directory
```

You only saw this in the worker's logs and not in the web UI. This happens because the web node always tries attaching to a container first and then falls back to initializing a new task if it gets an error trying to attach. The reason for this has to do with the Garden API and keeping compatibility with the Guardian backend.

Anyways, figured out we just need to not fully close the previous IOs. It seems doing that signals to containerd that it can delete the task's FIFO files, at which point things can get racy. I'm guessing this is why Ivan would sometimes see the failed to open stdin fifo error.

The only concern I had was whether we're leaking anything by not calling Close() on the previous CIOs. I've confirmed that the FIFO files are deleted once the task is done running. I also watched lsof's output while doing multiple rolling restarts of the web nodes and saw no growth; it stayed stable. So I don't think we're leaking anything by not calling Close() on the previous CIO.

Ivan, if you could do your tests one more time that would be great. You caught a good bug here!

@IvanChalukov
Contributor

That all makes sense, thanks for digging into it!

You only saw this in the worker logs and not in the web UI.

Actually, I observed those errors in both the worker logs and the UI logs.

That said, I ran the same tests again, both locally and in a BOSH environment, and I can confirm that after the web restarts, there are no errors in either the worker or UI logs.

I also kept a close eye on the FIFO files, and they’re being cleaned up exactly as they should after reattaching to the task. So Containerd isn’t leaking anything from my perspective - it’s correctly cleaning up the FIFOs even without explicitly calling Close() on the previous CIOs.

Really nice catch on the need to avoid fully closing the previous CIOs. Looks like that solved the issue cleanly.

@taylorsilva
Member Author

I've added an integration test that brings up two web nodes and tests that build logs are not lost after multiple rolling restarts of the web nodes. It checks for the first few log lines "Hello {1...3}" and some of the last few "Hello {110...112}" as verification that logs continue to stream with a fly watch command.

Signed-off-by: Taylor Silva <dev@taydev.net>
…restarts

Signed-off-by: Taylor Silva <dev@taydev.net>
With the new integration test that puts nginx in front of the web
nodes, nginx comes up faster than the web nodes and returns HTTP 502.
When downloading fly, we weren't checking the HTTP status code (the
err returned from Get() is nil for non-2xx status codes) and ended up
downloading the nginx 502 error page.

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva merged commit 8419676 into master Jul 17, 2025
12 checks passed

Successfully merging this pull request may close these issues.

Missing build logs - containerd