Conversation

charludo (Contributor)

We have been experiencing sporadic corruption of the layer cache file. Applying genpolicy --use-cached-files to the following deployment.yaml, with the layer cache layers-cache.json starting in the initial state shown below it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: whatever
spec:
  template:
    spec:
      runtimeClassName: contrast-cc
      containers:
        - name: secret-service
          image: 'ghcr.io/edgelesssys/privatemode/secret-service:v1.17.0@sha256:e36c9f13932f46e2fd460c5469dd4ee22bcdbaefd4b576d985f9e9a435413d3b'
[
  {
    "diff_id": "sha256:fa6a52ec54152fea01d1e53fa5b2447fcc2004d56414f23830fe4536b8b3f823",
    "verity_hash": "e25c3d85b5ef4fb029b338b0eef3c4d5f29ffabb7d0aa27a2a8130e06fdd8190",
    "passwd": "root:x:0:0:root:/root:/bin/ash\nbin:x:1:1:bin:/bin:/sbin/nologin\ndaemon:x:2:2:daemon:/sbin:/sbin/nologin\nadm:x:3:4:adm:/var/adm:/sbin/nologin\nlp:x:4:7:lp:/var/spool/lpd:/sbin/nologin\nsync:x:5:0:sync:/sbin:/bin/sync\nshutdown:x:6:0:shutdown:/sbin:/sbin/shutdown\nhalt:x:7:0:halt:/sbin:/sbin/halt\nmail:x:8:12:mail:/var/mail:/sbin/nologin\nnews:x:9:13:news:/usr/lib/news:/sbin/nologin\nuucp:x:10:14:uucp:/var/spool/uucppublic:/sbin/nologin\noperator:x:11:0:operator:/root:/sbin/nologin\nman:x:13:15:man:/usr/man:/sbin/nologin\npostmaster:x:14:12:postmaster:/var/mail:/sbin/nologin\ncron:x:16:16:cron:/var/spool/cron:/sbin/nologin\nftp:x:21:21::/var/lib/ftp:/sbin/nologin\nsshd:x:22:22:sshd:/dev/null:/sbin/nologin\nnobody:x:65534:65534:nobody:/:/sbin/nologin\nnonroot:x:65532:65532:Account created by apko:/home/nonroot:/bin/sh\nnotebook-user:x:1000:1000::/home/notebook-user:/bin/ash\n",
    "group": "root:x:0:root\nbin:x:1:root,bin,daemon\ndaemon:x:2:root,bin,daemon\nsys:x:3:root,bin,adm\nadm:x:4:root,adm,daemon\ntty:x:5:\ndisk:x:6:root,adm\nlp:x:7:lp\nmem:x:8:\nkmem:x:9:\nwheel:x:10:root\nfloppy:x:11:root\nmail:x:12:mail\nnews:x:13:news\nuucp:x:14:uucp\nman:x:15:man\ncron:x:16:cron\nconsole:x:17:\naudio:x:18:\ncdrom:x:19:\ndialout:x:20:root\nftp:x:21:\nsshd:x:22:\ninput:x:23:\nat:x:25:at\ntape:x:26:root\nvideo:x:27:root\nnetdev:x:28:\nreadproc:x:30:\nkvm:x:34:kvm\ngames:x:35:\nshadow:x:42:\ncdrw:x:80:\nusb:x:85:\nntp:x:123:\nnofiles:x:200:\nsmmsp:x:209:smmsp\nlocate:x:245:\nutmp:x:406:\nnogroup:x:65533:\nnobody:x:65534:\nnonroot:x:65532:\nnotebook-user:x:1000:notebook-user\n"
}
]

reproduces the error by generating this corrupted layers-cache.json with kata-containers version 3.16.0:

[
  {
    "diff_id": "sha256:8f061a335e1d6c822399a78168ba847edc75ec8703a064418b21cbb2a8fbf867",
    "verity_hash": "18a84da98e284198dea7b85dd48d5622cd276bdcf4471cf5d4e37f582bfb2c17",
    "passwd": ""
  }
]
:0:root:/root:/bin/ash\nbin:x:1:1:bin:/bin:/sbin/nologin\ndaemon:x:2:2:daemon:/sbin:/sbin/nologin\nadm:x:3:4:adm:/var/adm:/sbin/nologin\nlp:x:4:7:lp:/var/spool/lpd:/sbin/nologin\nsync:x:5:0:sync:/sbin:/bin/sync\nshutdown:x:6:0:shutdown:/sbin:/sbin/shutdown\nhalt:x:7:0:halt:/sbin:/sbin/halt\nmail:x:8:12:mail:/var/mail:/sbin/nologin\nnews:x:9:13:news:/usr/lib/news:/sbin/nologin\nuucp:x:10:14:uucp:/var/spool/uucppublic:/sbin/nologin\noperator:x:11:0:operator:/root:/sbin/nologin\nman:x:13:15:man:/usr/man:/sbin/nologin\npostmaster:x:14:12:postmaster:/var/mail:/sbin/nologin\ncron:x:16:16:cron:/var/spool/cron:/sbin/nologin\nftp:x:21:21::/var/lib/ftp:/sbin/nologin\nsshd:x:22:22:sshd:/dev/null:/sbin/nologin\nnobody:x:65534:65534:nobody:/:/sbin/nologin\nnonroot:x:65532:65532:Account created by apko:/home/nonroot:/bin/sh\nnotebook-user:x:1000:1000::/home/notebook-user:/bin/ash\n"
  },
  {
    "diff_id": "sha256:e67c36428ee0de777e8ec5f7d73483f58854bd361026979b6937b631e2725f7a",
    "verity_hash": "e1fa97319b42f0f53fc07cbc2308e7a3d790aab2d35a2aa3820ec00b706e5efb",
    "passwd": ""
  }
]
loppy:x:11:root\nmail:x:12:mail\nnews:x:13:news\nuucp:x:14:uucp\nman:x:15:man\ncron:x:16:cron\nconsole:x:17:\naudio:x:18:\ncdrom:x:19:\ndialout:x:20:root\nftp:x:21:\nsshd:x:22:\ninput:x:23:\nat:x:25:at\ntape:x:26:root\nvideo:x:27:root\nnetdev:x:28:\nreadproc:x:30:\nkvm:x:34:kvm\ngames:x:35:\nshadow:x:42:\ncdrw:x:80:\nusb:x:85:\nntp:x:123:\nnofiles:x:200:\nsmmsp:x:209:smmsp\nlocate:x:245:\nutmp:x:406:\nnogroup:x:65533:\nnobody:x:65534:\nnonroot:x:65532:\nnotebook-user:x:1000:notebook-user\n"
}

Following 349ce8c, this example no longer exhibits the behavior: in the linked commit, the user and group parsing was refactored and improved.

However, for very long strings (in otherwise valid cache files), the same behavior can still be reproduced on the tip of main.

An (ugly) script to generate such a file:

USER_COUNT=7000  # 7k worked for me - depending on your machine, fewer might suffice or more might be necessary
TMP_USERS=""

for i in $(seq 1 "$USER_COUNT"); do
    username="user-$i"
    uid=$((1000 + i))
    gid=$((1000 + i))

    passwd_entry="$username:x:$uid:$gid::/home/$username:/bin/ash\\n"
    group_entry="$username:x:$gid:$username\\n"

    user_json=$(jq -n \
        --arg user "$username" \
        --arg passwd "$passwd_entry" \
        --arg group "$group_entry" \
        '{username: $user, passwd: $passwd, group: $group}')

    TMP_USERS="${TMP_USERS}${user_json},"
done

final_json="[${TMP_USERS%,}]"
echo "$final_json" > test_users.json

The changes in this PR fix the issue. The root cause is that fs2 only obtains a file lock during writes. Since it's an exclusive lock, it should prevent concurrent writes; however, at other points in the code, no lock is acquired before reading (or deleting) the file. Because fs2 implements advisory locks (via flock on UNIX systems), other threads and processes are still able to read or delete the file while the lock is supposedly held elsewhere.
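
The advisory nature of these locks is easy to demonstrate outside of genpolicy. In this sketch (the file name is illustrative), a background process holds an exclusive flock on a file, yet a process that simply never calls flock can still read and overwrite it:

```shell
# Demonstrate that flock(2) locks are advisory: they only constrain
# other processes that also call flock on the same file.
echo '[]' > /tmp/cache-demo.json
flock /tmp/cache-demo.json -c 'sleep 2' &   # hold an exclusive lock for 2s
sleep 0.5
# Neither reading nor overwriting is prevented while the lock is held:
cat /tmp/cache-demo.json                    # prints [] despite the lock
echo 'overwritten while locked' > /tmp/cache-demo.json
wait
```

This mirrors the situation in the cache code: the writer's exclusive lock is meaningless to readers that never attempt to take a shared lock.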

A large number of users and groups seems to take sufficient time to parse to reproduce this error reliably.

To be honest, I am not entirely certain how exactly the corruption of the file occurs, since only one concurrent write should be possible in the current implementation. Maybe someone here has deeper insight into this.

@katexochen (Contributor)

cc @danmihai1

@danmihai1 (Member) left a comment

@charludo , thanks for pointing out these issues!

To be honest, I am not entirely certain how exactly the corruption of the file occurs, since only one concurrent write should be possible in the current implementation. Maybe someone here has deeper insight into this.

My understanding is that our OpenOptions::new().write(true).open + writer.flush sequence was not truncating the output file when the new JSON output was smaller than the file as previously grown by a concurrently running genpolicy instance.
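
That failure mode can be reproduced with any tool that rewrites a file in place without truncating it. A sketch using dd (the file name and contents are made up, not genpolicy's actual data):

```shell
# Cache file as grown by another instance (37 bytes):
printf '[{"diff_id":"aaa"},{"diff_id":"bbb"}]' > /tmp/layers-demo.json
# Overwrite from offset 0 WITHOUT truncating, like write(true) without
# truncate(true) in Rust; the shorter JSON leaves the old tail behind:
printf '[{"diff_id":"ccc"}]' | dd of=/tmp/layers-demo.json conv=notrunc status=none
cat /tmp/layers-demo.json
# prints: [{"diff_id":"ccc"}]{"diff_id":"bbb"}]
```

The result is a valid JSON document followed by stale trailing bytes, which is exactly the shape of the corrupted layers-cache.json shown at the top of this thread.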

Given that:

  • Synchronizing multiple readers and writers is tricky, as discussed above, and depends on the behavior of buffering at various levels, etc.
  • Reading and writing this file over and over again is relatively expensive.
  • Executing multiple copies of genpolicy concurrently is mostly a test use case.

I think we should simplify the design:

  1. try_lock_shared() at the beginning of genpolicy execution. If locking fails: disable layer caching for the current genpolicy instance and cancel all the steps below.
  2. Read the file a single time into memory. If reading fails, start with an empty set of cached layers.
  3. unlock().
  4. Add any layers into the memory of the current genpolicy instance - but don't write these changes into the file yet.
  5. Before successful exit: if there were any changes to the in-memory cached layers:
    a. try_lock_exclusive(). If locking failed: discard any layer changes from current instance's memory, and cancel the steps below.
    b. Truncate file to size 0.
    c. Flush the cached contents from current instance's memory into the file.
    d. unlock().
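
The proposed flow can be sketched with flock(1); the file name and the simulated in-memory update are illustrative, not genpolicy's actual (Rust) code:

```shell
f=/tmp/layers-cache-demo.json
exec 9<>"$f"                          # open (and create) the cache file on fd 9
if flock -s -n 9; then                # 1. shared lock, non-blocking
    cache=$(cat "$f")                 # 2. read the file once into memory
    flock -u 9                        # 3. unlock
else
    cache='[]'                        # lock failed: run without the cache
fi
cache='[{"diff_id":"new"}]'           # 4. (simulated) layers added in memory only
if flock -x -n 9; then                # 5a. exclusive lock, non-blocking
    printf '%s' "$cache" > "$f"       # 5b+c. truncate and write new contents
    flock -u 9                        # 5d. unlock
fi
```

The key point is that the file is touched exactly twice, and the final `>` redirection truncates before writing, so a shrinking cache can never leave stale bytes behind.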

Thoughts?

I would rather avoid the complexity of NamedTempFile - hopefully the simplified algorithm I mentioned above is good enough.
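
For contrast, the tempfile approach being set aside here works roughly like the following shell sketch (paths are illustrative): the full cache is written to a temporary file and then atomically renamed over the target, so readers never observe a partially written file.

```shell
# Write-then-rename pattern: rename(2) atomically replaces the target,
# at the cost of extra temp-file handling and cleanup on error paths.
tmp=$(mktemp /tmp/layers-cache.XXXXXX)
printf '[{"diff_id":"ccc"}]' > "$tmp"     # write the complete cache first
mv "$tmp" /tmp/layers-cache-final.json    # atomic replace on POSIX filesystems
```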

You are more than welcome to make these changes if you want to. Otherwise, either I or one of my colleagues will make the changes.

@danmihai1 danmihai1 requested a review from Redent0r June 17, 2025 20:54
@burgerdev burgerdev self-requested a review June 18, 2025 06:13
charludo (Contributor, Author) commented Jun 18, 2025

My understanding is that our OpenOptions::new().write(true).open + writer.flush were not truncating the output file, when the json output was smaller than the file previously grown by a concurrently running genpolicy instance.

That does make a lot of sense!

Edit: took a shot at implementing the suggestions.

@charludo charludo force-pushed the fix/genpolicy-corruption-of-layer-cache-file branch from c2a3fa0 to 432ff1d Compare June 18, 2025 13:41
@danmihai1 (Member) left a comment

This looks great to me - thanks @charludo! I will start the CI tests.

@danmihai1 (Member)

@charludo when you get a chance, please try to address the test failures marked "Required" in the list below. Let me know if you need help with these failures.

@charludo charludo force-pushed the fix/genpolicy-corruption-of-layer-cache-file branch from 432ff1d to fb41824 Compare June 20, 2025 06:23
The locking mechanism around the layers cache file was insufficient to
prevent corruption of the file. This commit moves the layers cache's
management in-memory, only reading the cache file once at the beginning
of `genpolicy`, and only writing to it once, at the end of `genpolicy`.

In the case that obtaining a lock on the cache file fails,
reading/writing to it is skipped, and the cache is not used/persisted.

Signed-off-by: charludo <git@charlotteharludo.com>
@charludo charludo force-pushed the fix/genpolicy-corruption-of-layer-cache-file branch from 6af26b1 to 4e57cc0 Compare June 23, 2025 14:25
@burgerdev (Contributor) left a comment

lgtm, thanks!

@danmihai1 danmihai1 merged commit 0a57e09 into kata-containers:main Jun 23, 2025
503 of 529 checks passed
7 participants