runtime/containerd: ensure logs are not dropped when web node re-attaches to running containers #9234
Conversation
Signed-off-by: Taylor Silva <dev@taydev.net>
@Kump3r @IvanChalukov I'll wait for you guys to review and test before merging.
Got thinking a bit more last night, wondering if there's a way to do this without having to track the IOs ourselves. Tried doing this:

```diff
diff --git a/worker/runtime/container.go b/worker/runtime/container.go
index 0c533811a..6e090f183 100644
--- a/worker/runtime/container.go
+++ b/worker/runtime/container.go
@@ -186,9 +186,18 @@ func (c *Container) Attach(pid string, processIO garden.ProcessIO) (process gard
 		return nil, fmt.Errorf("task attach: %w", err)
 	}

+	prevIO := task.IO()
 	cioOpts := containerdCIO(processIO, false)
+	var attach cio.Attach = func(f *cio.FIFOSet) (cio.IO, error) {
+		newIO, err := cio.NewAttach(cioOpts...)(f)
+		if prevIO != nil {
+			prevIO.Cancel()
+			prevIO.Close()
+		}
+		return newIO, err
+	}
-	proc, err := task.LoadProcess(ctx, pid, cio.NewAttach(cioOpts...))
+	proc, err := task.LoadProcess(ctx, pid, attach)
 	if err != nil {
 		return nil, fmt.Errorf("load proc: %w", err)
 	}
```

Sadly, the […]. Unless someone figures out a way to look up existing […]
Had some initial confusion about which ID was being used between all the different layers and function calls. Figured out we need to track the container AND the individual process/task IDs. For example, if one did `fly intercept`, that would (previously to this commit) overwrite the previous cio.IO for the container. The same issue would happen with resource containers: each new exec would overwrite the previous cio.IO.

Another issue here is that there are some cases where the task's ID is literally "task" or "resource". Didn't bother trying to figure out under what conditions this happens, but it was another sign that using just the task ID to track CIOs would not be sufficient to make the IOManager robust enough to survive production. Now all cio.IOs are tracked by container ID and task ID.

The second bug was during cleanup. With the initial tracking only using the task ID, we couldn't clean up our map of cio.IOs. The map would therefore have grown forever; a memory leak! Now we can easily delete all cio.IOs for a container by deleting with the container ID, and the map no longer grows forever.

Signed-off-by: Taylor Silva <dev@taydev.net>
Integration test failed where we restart the worker and expect to keep streaming a build's logs, but it failed with a panic. Our Attach() expected that Creator would always be called before Attach(), which would guarantee that the container's tasks were already being tracked in the map. That is not the case if the worker is restarted, though: the map is empty, and we call Attach() since the container already exists and is running. Added checks when reading the map so our Attach() doesn't panic reading a nil map.

Signed-off-by: Taylor Silva <dev@taydev.net>
We thought of and tested something similar, although to no effect. Sadly it seems like […]
Hi @taylorsilva, I've run the following tests:

Local tests

Setup with 1 worker and 1 web instance:
Bosh environment

It looks like there's a regression when running with 2 web nodes and 2 workers (the number of workers doesn't seem to matter). Both errors appear after sequential web restarts (trying to simulate a rolling update of the web nodes):
```
{"timestamp":"2025-07-15T12:34:36.575461374Z","level":"error","source":"worker","message":"worker.garden.garden-server.run.failed","data":{"error":"task exec: id task: already exists","handle":"163013a7-1a21-4920-9449-5f2f9e956fea","session":"1.2.127"}}
```

```
wait for process completion: backend error: Exit status: 500, message: {"Type":"","Message":"load proc: failed to open stdin fifo: error creating fifo /run/containerd/fifo/4277305638/task-stdin: no such file or directory","Handle":"","ProcessID":"","Binary":""}
```

Summary
Did you try running with 2 web instances? I'll dig into this further to see if I can pinpoint the issue.
I do not have time to test, but shooting a suspicion; feel free to disregard if it's BS.
I haven't tried with two web instances. I'll see how I can reproduce that locally. Not sure why that first error is happening. The second error, I'll investigate and share what I find here. |
…occurred once already

This bug was discovered when doing rolling updates on a cluster with two web nodes. The first re-attachment would succeed, but the second would fail. This appears to happen because we fully closed the previous CIO (prevIO.Close()), which signaled to containerd that it could delete the task's FIFO files. containerd eventually gets a chance to delete the files during the second re-attachment, which results in this error:

```
load proc: failed to open stdin fifo: error creating fifo /run/containerd/fifo/1731945221/task-stdin: no such file or directory
```

We avoid this by never calling `Close()` on any of the previous CIOs. Calling `Cancel()` is enough to stop the previous CIOs from reading the FIFO files, allowing the new CIO to read all the logs from the containerd task.

Signed-off-by: Taylor Silva <dev@taydev.net>
I was able to reproduce the issue locally and fix it. I'm going to try adding an integration test for this as well, since a lot of users have two web nodes in their Concourse clusters. The underlying error for both scenarios Ivan reported was: […]
You only saw this in the worker's logs and not in the web UI. This happens because the web node always tries attaching to a container first and then falls back to initializing a new task if it gets an error trying to attach. The reason for this has to do with the Garden API and keeping compatibility with the Guardian backend.

Anyways, figured out we just need to not fully close out the previous IOs. It seems doing that would signal to containerd that it can delete the task's FIFO files. At that point things could get race-y; I'm guessing this is why Ivan would sometimes see the […] error.

The only concern I have now is that I'm not sure if we're leaking anything by not calling `Close()` on the previous IOs.

Ivan, if you could do your tests one more time that would be great. You caught a good bug here!
I've added an integration test that brings up two web nodes and tests that build logs are not lost after multiple rolling restarts of the web nodes. It checks for the first few log lines ("Hello {1...3}") and some of the last few ("Hello {110...112}") as verification that logs continue to stream with a […]
Signed-off-by: Taylor Silva <dev@taydev.net>
…restarts Signed-off-by: Taylor Silva <dev@taydev.net>
With the new integration test that has nginx in front of the web nodes, nginx comes up faster than the web nodes and returns HTTP 502. When downloading fly, we weren't checking the HTTP status code (the err returned from Get() is nil for non-2xx status codes) and ended up downloading nginx's 502 error page.

Signed-off-by: Taylor Silva <dev@taydev.net>
Changes proposed by this PR
closes #9189
If a web node had to re-attach to an already running container (either because the web node restarted or because it had network connection issues with the worker), the web node would miss the majority of the logs the container emitted after re-attaching. Random log fragments might be picked up, but most logs were lost.
This PR fixes this bug by ensuring we clean up the previous IO readers that were attached to the containerd task's FIFO files.
Notes to reviewer
To reproduce/verify it works:

```
docker compose up -d --build
docker compose restart web
```