Skip to content

Conversation

burgerdev
Copy link
Contributor

Container logs are forwarded to the agent through a unix pipe. These pipes have limited capacity and block the writer when full. If reading logs is blocked by policy, a common setup for confidential containers, the pipes fill up and eventually block the container.

This commit changes the implementation of ReadStream such that it returns empty log messages instead of a policy failure (in case reading log messages is forbidden by policy). As long as the runtime does not encounter a failure, it keeps pulling logs periodically. In turn, this triggers the agent to flush the pipes.

Fixes: #10680


Alternatives considered

  • The agent could unconditionally pull from the pipe but still return a policy error, and the runtime could keep pulling in case of policy errors. This would arguably make the agent service correct, but would force the runtime to care about agent implementation details. I feel more strongly about the latter.
  • The agent could spawn a pipe cleaning routine when it first sees a policy failure for ReadStreamRequest. I don't like the added agent complexity of this solution.

cc @sprt @gkurz @mythi

@katacontainersbot katacontainersbot added the size/small Small and simple task label Jan 30, 2025
Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clever fix @burgerdev 😄

Please add you Sob to the commit message and this is good to go.

@burgerdev
Copy link
Contributor Author

Oops - done. Thanks, @gkurz!

@sprt sprt requested a review from danmihai1 January 30, 2025 21:48
@danmihai1
Copy link
Member

@burgerdev , I think this is a good change - thank you!

In addition to this change: did you also consider setting up the Write side of the pipe for non-blocking Writes? If that works, it should help a CoCo Guest work a little better if its untrusted Host decides to stop calling ReadStream.

@burgerdev
Copy link
Contributor Author

You mean setting O_NONBLOCK on the write end and passing that to the container process as stdout (stderr)? This will return EAGAIN for writes that would be blocking otherwise, and I fear that most processes will not handle this situation gracefully. For example, this bash program run with non-blocking pipe that is not cleared fails:

set -e

report() {
    ret=$?
    printf "shell exit code: %d\n" "$ret" >&2
    exit "$ret"
}

trap report EXIT

while true; do
  echo foo
done
/bin/sh: line 12: echo: write error: Resource temporarily unavailable
shell exit code: 1

@gkurz
Copy link
Member

gkurz commented Jan 31, 2025

In addition to this change: did you also consider setting up the Write side of the pipe for non-blocking Writes? If that works, it should help a CoCo Guest work a little better if its untrusted Host decides to stop calling ReadStream.

As well explained by @burgerdev , O_NONBLOCK won't help. If DoS is a concern, the agent should monitor the read end of the pipe, consume the data and discard it. ReadStream should just always return an empty response in this case.

@burgerdev
Copy link
Contributor Author

With respect to confidential computing, DoS is usually out of scope because the infrastructure provider can always decide not to run the workload (cf. https://www.redhat.com/en/blog/confidential-computing-primer).

@gkurz
Copy link
Member

gkurz commented Jan 31, 2025

With respect to confidential computing, DoS is usually out of scope because the infrastructure provider can always decide not to run the workload (cf. https://www.redhat.com/en/blog/confidential-computing-primer).

This is my understanding as well. I was just commenting on @danmihai1's suggestion.

@sprt
Copy link
Contributor

sprt commented Jan 31, 2025

@burgerdev Do you think you could also add a test case for this?

@danmihai1
Copy link
Member

@burgerdev , I guess rebasing on the latest main code might help with this test failure: https://github.com/kata-containers/kata-containers/actions/runs/13076857575/job/36496222189?pr=10818 .

(That's how we got that test to pass in #10811)

Container logs are forwarded to the agent through a unix pipe. These
pipes have limited capacity and block the writer when full. If reading
logs is blocked by policy, a common setup for confidential containers,
the pipes fill up and eventually block the container.

This commit changes the implementation of ReadStream such that it
returns empty log messages instead of a policy failure (in case reading
log messages is forbidden by policy). As long as the runtime does not
encounter a failure, it keeps pulling logs periodically. In turn, this
triggers the agent to flush the pipes.

Fixes: kata-containers#10680

Co-Authored-By: Aurélien Bombo <abombo@microsoft.com>
Signed-off-by: Markus Rudy <mr@edgeless.systems>
@burgerdev
Copy link
Contributor Author

I rebased and fixed the clippy warning.

@burgerdev Do you think you could also add a test case for this?

A simple test would be to make the existing k8s-policy-job.yaml write more output than the pipe buffer can handle. Alternatively, we could set up a full test that also tests that logs are allowed if configured, etc. What do you think?

@gkurz
Copy link
Member

gkurz commented Feb 4, 2025

I rebased and fixed the clippy warning.

@burgerdev Do you think you could also add a test case for this?

A simple test would be to make the existing k8s-policy-job.yaml write more output than the pipe buffer can handle. Alternatively, we could set up a full test that also tests that logs are allowed if configured, etc. What do you think?

You could also have unit testing of read_stdout and read_stderr actionable by cargo test. I'd favor that if possible (and I suspect it is).

@burgerdev
Copy link
Contributor Author

I rebased and fixed the clippy warning.

@burgerdev Do you think you could also add a test case for this?

A simple test would be to make the existing k8s-policy-job.yaml write more output than the pipe buffer can handle. Alternatively, we could set up a full test that also tests that logs are allowed if configured, etc. What do you think?

You could also have unit testing of read_stdout and read_stderr actionable by cargo test. I'd favor that if possible (and I suspect it is).

I like the idea of unit testing this, but struggle to implement it. The closest example I see is test_do_write_stream, but it looks intimidatingly complicated for the simple API surface it covers. Also, it does not look like there's an existing unit test that uses policy, and I fear it's going to be even more complicated because of the global state. :/

@sprt
Copy link
Contributor

sprt commented Feb 5, 2025

@burgerdev You can take inspiration here on how to use the policy in an e2e test - other test cases also use a very similar approach

bd6eedc#diff-8aefc3b3747a8e1e0fbc75e5ee84b20b7ccb4853a144c9102ea1999b4d2c3041

@gkurz
Copy link
Member

gkurz commented Feb 5, 2025

I like the idea of unit testing this, but struggle to implement it. The closest example I see is test_do_write_stream, but it looks intimidatingly complicated for the simple API surface it covers.

I'm not really convinced by the intimidatingly complicated argument 😉

Also, it does not look like there's an existing unit test that uses policy, and I fear it's going to be even more complicated because of the global state. :/

Another suggestion is to add an is_allowed bool argument to do_read_stream and do the resp.clear_data() there. This would consolidate the logic in one place and you should be able to unit test that with not that many lines of code IMO.

@katacontainersbot katacontainersbot added size/medium Average sized task and removed size/small Small and simple task labels Feb 19, 2025
This test verifies that, when ReadStreamRequest is blocked by the
policy, the logs are empty and the container does not deadlock.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
@sprt sprt merged commit 601c403 into kata-containers:main Feb 19, 2025
303 of 313 checks passed
@burgerdev
Copy link
Contributor Author

Thanks for adding the test @sprt!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test size/medium Average sized task
Projects
None yet
Development

Successfully merging this pull request may close these issues.

agent: policy: blocking logs can lead to container deadlocks
6 participants