agent: clear log pipes if denied by policy #10818

burgerdev · 2025-01-30T16:36:44Z

Container logs are forwarded to the agent through a unix pipe. These pipes have limited capacity and block the writer when full. If reading logs is blocked by policy, a common setup for confidential containers, the pipes fill up and eventually block the container.

This commit changes the implementation of ReadStream such that it returns empty log messages instead of a policy failure (in case reading log messages is forbidden by policy). As long as the runtime does not encounter a failure, it keeps pulling logs periodically. In turn, this triggers the agent to flush the pipes.
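A minimal sketch of why an empty response keeps the pipes draining while a policy error does not. It assumes the runtime stops polling at the first failed call; `poll_logs` and `PollResult` are hypothetical names for illustration and are not part of the kata-containers code base.

```rust
/// Outcome of one hypothetical ReadStream call, as seen by the runtime.
pub type PollResult = Result<Vec<u8>, String>;

/// Calls `read_stream` up to `max_polls` times and stops at the first error.
/// Returns the number of calls that were made.
pub fn poll_logs<F>(mut read_stream: F, max_polls: usize) -> usize
where
    F: FnMut() -> PollResult,
{
    let mut calls = 0;
    while calls < max_polls {
        calls += 1;
        if read_stream().is_err() {
            // Assumption: after a failure nobody asks the agent for logs
            // again, so nothing drains the container's pipe anymore.
            break;
        }
    }
    calls
}
```

With a closure that always returns `Err`, the loop makes a single call. With a closure that returns `Ok(Vec::new())`, it keeps calling until `max_polls` is reached, and each call gives the agent a chance to read from the pipe.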

Fixes: #10680

Alternatives considered

The agent could unconditionally pull from the pipe but still return a policy error, and the runtime could keep pulling in case of policy errors. This would arguably make the agent service correct, but would force the runtime to care about agent implementation details. I feel more strongly about the latter.
The agent could spawn a pipe cleaning routine when it first sees a policy failure for ReadStreamRequest. I don't like the added agent complexity of this solution.

cc @sprt @gkurz @mythi

gkurz

This is a clever fix @burgerdev 😄

Please add your SoB (Signed-off-by) to the commit message and this is good to go.

burgerdev · 2025-01-30T21:24:28Z

Oops - done. Thanks, @gkurz!

danmihai1 · 2025-01-30T21:51:59Z

@burgerdev , I think this is a good change - thank you!

In addition to this change: did you also consider setting up the Write side of the pipe for non-blocking Writes? If that works, it should help a CoCo Guest work a little better if its untrusted Host decides to stop calling ReadStream.

burgerdev · 2025-01-31T08:27:47Z

You mean setting O_NONBLOCK on the write end and passing that to the container process as stdout (stderr)? This will return EAGAIN for writes that would otherwise block, and I fear that most processes will not handle this situation gracefully. For example, this bash program fails when run with a non-blocking pipe that is not cleared:

set -e

report() {
    ret=$?
    printf "shell exit code: %d\n" "$ret" >&2
    exit "$ret"
}

trap report EXIT

while true; do
  echo foo
done

/bin/sh: line 12: echo: write error: Resource temporarily unavailable
shell exit code: 1
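The same effect can be reproduced in Rust as a rough sketch. It uses a `UnixStream` pair as a stand-in for the container's log pipe (an assumption: both have a bounded kernel buffer, and a non-blocking write to a full buffer fails with EAGAIN, which Rust reports as `ErrorKind::WouldBlock`). `fill_until_would_block` is a hypothetical helper, not agent code.

```rust
use std::io::{self, Write};
use std::os::unix::net::UnixStream;

/// Switches `writer` to non-blocking mode and writes `chunk` repeatedly until
/// the kernel buffer is full. Returns the number of bytes accepted before the
/// first `WouldBlock` (EAGAIN) error.
pub fn fill_until_would_block(writer: &mut UnixStream, chunk: &[u8]) -> io::Result<usize> {
    writer.set_nonblocking(true)?;
    let mut total = 0;
    loop {
        match writer.write(chunk) {
            // A zero-length write means no progress is possible (e.g. empty chunk).
            Ok(0) => return Ok(total),
            Ok(n) => total += n,
            // This is the error that the shell script above does not handle.
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => return Ok(total),
            Err(e) => return Err(e),
        }
    }
}
```

A process that treats any write error as fatal, like the shell loop above, exits at the point where this helper returns.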

gkurz · 2025-01-31T09:27:27Z

> In addition to this change: did you also consider setting up the Write side of the pipe for non-blocking Writes? If that works, it should help a CoCo Guest work a little better if its untrusted Host decides to stop calling ReadStream.

As well explained by @burgerdev , O_NONBLOCK won't help. If DoS is a concern, the agent should monitor the read end of the pipe, consume the data and discard it. ReadStream should just always return an empty response in this case.
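A minimal sketch of the "consume and discard" idea, assuming a dedicated thread per stream. `spawn_drain` is a hypothetical helper and not part of the agent, which may structure such a task differently (for example as an async task).

```rust
use std::io::{self, Read};
use std::thread::{self, JoinHandle};

/// Spawns a thread that reads `source` until EOF and throws the bytes away.
/// The join handle yields the number of bytes that were discarded.
pub fn spawn_drain<R>(mut source: R) -> JoinHandle<io::Result<u64>>
where
    R: Read + Send + 'static,
{
    // io::sink() accepts and drops every byte, so the writer on the other
    // end never finds the buffer full for long.
    thread::spawn(move || io::copy(&mut source, &mut io::sink()))
}
```

With such a drain attached to the read end, a writer can push far more data than the buffer holds without blocking indefinitely, regardless of whether anyone calls ReadStream.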

burgerdev · 2025-01-31T10:26:23Z

With respect to confidential computing, DoS is usually out of scope because the infrastructure provider can always decide not to run the workload (cf. https://www.redhat.com/en/blog/confidential-computing-primer).

gkurz · 2025-01-31T10:50:44Z

> With respect to confidential computing, DoS is usually out of scope because the infrastructure provider can always decide not to run the workload (cf. https://www.redhat.com/en/blog/confidential-computing-primer).

This is my understanding as well. I was just commenting on @danmihai1's suggestion.

sprt · 2025-01-31T11:36:32Z

@burgerdev Do you think you could also add a test case for this?

danmihai1 · 2025-01-31T21:03:31Z

@burgerdev , I guess rebasing on the latest main code might help with this test failure: https://github.com/kata-containers/kata-containers/actions/runs/13076857575/job/36496222189?pr=10818 .

(That's how we got that test to pass in #10811)

Container logs are forwarded to the agent through a unix pipe. These pipes have limited capacity and block the writer when full. If reading logs is blocked by policy, a common setup for confidential containers, the pipes fill up and eventually block the container.

This commit changes the implementation of ReadStream such that it returns empty log messages instead of a policy failure (in case reading log messages is forbidden by policy). As long as the runtime does not encounter a failure, it keeps pulling logs periodically. In turn, this triggers the agent to flush the pipes.

Fixes: kata-containers#10680

Co-Authored-By: Aurélien Bombo <abombo@microsoft.com>
Signed-off-by: Markus Rudy <mr@edgeless.systems>

burgerdev · 2025-02-04T12:28:20Z

I rebased and fixed the clippy warning.

> @burgerdev Do you think you could also add a test case for this?

A simple test would be to make the existing k8s-policy-job.yaml write more output than the pipe buffer can handle. Alternatively, we could set up a full test that also tests that logs are allowed if configured, etc. What do you think?

gkurz · 2025-02-04T15:06:54Z

> I rebased and fixed the clippy warning.
>
> > @burgerdev Do you think you could also add a test case for this?
>
> A simple test would be to make the existing k8s-policy-job.yaml write more output than the pipe buffer can handle. Alternatively, we could set up a full test that also tests that logs are allowed if configured, etc. What do you think?

You could also have unit testing of read_stdout and read_stderr actionable by cargo test. I'd favor that if possible (and I suspect it is).

burgerdev · 2025-02-05T10:40:12Z

> > I rebased and fixed the clippy warning.
> >
> > > @burgerdev Do you think you could also add a test case for this?
> >
> > A simple test would be to make the existing k8s-policy-job.yaml write more output than the pipe buffer can handle. Alternatively, we could set up a full test that also tests that logs are allowed if configured, etc. What do you think?
>
> You could also have unit testing of read_stdout and read_stderr actionable by cargo test. I'd favor that if possible (and I suspect it is).

I like the idea of unit testing this, but struggle to implement it. The closest example I see is test_do_write_stream, but it looks intimidatingly complicated for the simple API surface it covers. Also, it does not look like there's an existing unit test that uses policy, and I fear it's going to be even more complicated because of the global state. :/

sprt · 2025-02-05T15:49:11Z

@burgerdev You can take inspiration here on how to use the policy in an e2e test - other test cases also use a very similar approach

bd6eedc#diff-8aefc3b3747a8e1e0fbc75e5ee84b20b7ccb4853a144c9102ea1999b4d2c3041

gkurz · 2025-02-05T16:35:58Z

> I like the idea of unit testing this, but struggle to implement it. The closest example I see is test_do_write_stream, but it looks intimidatingly complicated for the simple API surface it covers.

I'm not really convinced by the intimidatingly complicated argument 😉

> Also, it does not look like there's an existing unit test that uses policy, and I fear it's going to be even more complicated because of the global state. :/

Another suggestion is to add an is_allowed bool argument to do_read_stream and do the resp.clear_data() there. This would consolidate the logic in one place and you should be able to unit test that with not that many lines of code IMO.
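A rough sketch of what this suggestion could look like. The real signature of do_read_stream is not shown in this thread, so `do_read_stream_sketch`, `StreamResponse` and its `clear_data` method are hypothetical stand-ins that only mirror the names mentioned above.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for the agent's ReadStream response type.
#[derive(Debug, Default)]
pub struct StreamResponse {
    pub data: Vec<u8>,
}

impl StreamResponse {
    /// Mirrors the `clear_data()` call mentioned above.
    pub fn clear_data(&mut self) {
        self.data.clear();
    }
}

/// Always reads up to `len` bytes from `output`, so the underlying pipe is
/// drained, but hands the bytes to the caller only if `is_allowed` is true.
pub fn do_read_stream_sketch<R: Read>(
    output: &mut R,
    len: usize,
    is_allowed: bool,
) -> io::Result<StreamResponse> {
    let mut data = vec![0u8; len];
    let n = output.read(&mut data)?;
    data.truncate(n);

    let mut resp = StreamResponse { data };
    if !is_allowed {
        // Denied by policy: the bytes were consumed but are not returned.
        resp.clear_data();
    }
    Ok(resp)
}
```

Because the policy decision arrives as a plain bool, a unit test can feed the function an in-memory reader such as `std::io::Cursor` and check both branches without touching global policy state.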

This test verifies that, when ReadStreamRequest is blocked by the policy, the logs are empty and the container does not deadlock.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>

burgerdev · 2025-02-26T07:59:13Z

Thanks for adding the test @sprt!

katacontainersbot added the size/small Small and simple task label Jan 30, 2025

gkurz reviewed Jan 30, 2025

burgerdev force-pushed the plumbing branch from 00a14c3 to 61a7c59 Compare January 30, 2025 21:23

sprt requested a review from danmihai1 January 30, 2025 21:48

danmihai1 added the ok-to-test label Jan 31, 2025

danmihai1 approved these changes Jan 31, 2025

burgerdev force-pushed the plumbing branch from 61a7c59 to 98a4a62 Compare January 31, 2025 21:06

burgerdev force-pushed the plumbing branch from 98a4a62 to 937fd90 Compare February 4, 2025 12:17

sprt force-pushed the plumbing branch from 4f0bc99 to 564a449 Compare February 19, 2025 18:18

katacontainersbot added size/medium Average sized task and removed size/small Small and simple task labels Feb 19, 2025

sprt force-pushed the plumbing branch from 564a449 to 293f29c Compare February 19, 2025 18:19

tests: Add policy test for ReadStreamRequest

cb34675

sprt force-pushed the plumbing branch from 293f29c to cb34675 Compare February 19, 2025 20:03

sprt mentioned this pull request Feb 19, 2025

agent: clear log pipes if denied by policy microsoft/kata-containers#315

Merged

4 tasks

Redent0r approved these changes Feb 19, 2025

sprt merged commit 601c403 into kata-containers:main Feb 19, 2025
303 of 313 checks passed
