Introduce a temporary directory for deleting layer files outside the lock #2325


Merged
merged 2 commits into containers:main from prototype-temp on Jul 10, 2025

Conversation

Honny1
Member

@Honny1 Honny1 commented May 5, 2025

This PR implements a temporary directory and integrates it into layer deletion. The temporary directory is used to delete files outside of the global storage lock, reducing how long the lock is held. Instead of being deleted in place, files are moved into the temporary directory by renaming them; renaming a file is much faster than deleting it from disk. Once all files have been moved into the temporary directory and their references removed, the global storage lock is released and the temporary directory is deleted.

Fixes: https://issues.redhat.com/browse/RHEL-36087
Fixes: https://issues.redhat.com/browse/RUN-3104

@Honny1 Honny1 force-pushed the prototype-temp branch 5 times, most recently from ca5a283 to cb73550 Compare May 5, 2025 14:39
@cgwalters
Contributor

Design doc: https://docs.google.com/document/d/196S3MWJkuqoKDhkvtnbT6z9LFZzwibIFmcEM9EyJu8Q/edit?usp=sharing

I think we should avoid using private/company-specific Google docs to design public FOSS in general especially as part of the movement of projects in this space to the CNCF.

@Honny1
Member Author

Honny1 commented May 5, 2025

@cgwalters ok

@Honny1 Honny1 force-pushed the prototype-temp branch 2 times, most recently from a1810a4 to d4352cd Compare May 5, 2025 15:55
Collaborator

@mtrmac mtrmac left a comment

A very preliminary first look — feel free to ignore if feedback is not yet wanted.

At this point I’m mostly focused on the TempDir mechanism’s design/API; callers are interesting to the extent that they represent use cases that need to be handled, but

// Required global storage lock must be held before calling this function.
func recover(rootDir string) error {
	tempDirPattern := filepath.Join(rootDir, "temp-dir-*")
	tempDirs, err := filepath.Glob(tempDirPattern)
Collaborator

As is, this would add a .Glob to every RW lock acquisition. It would be interesting to try to optimize that.

Member

let's just use a parent directory for the temp directories so we don't have to look at their name.

instead of: /foo/run/vfs-layers/temp-dir-28400 use something like /foo/run/vfs-layers/temp-dirs/28400 so we avoid the glob

Collaborator

(I’m wondering about even more optimization — we could have an “might have a staging directory” bool in layers.json, and use the layerStore.lockfile.RecordWrite notification mechanism to make all of this ~zero-cost to users of layerStore.startReading. But, one step at a time.)

@Honny1 Honny1 force-pushed the prototype-temp branch 7 times, most recently from cc142b0 to ce89bc4 Compare May 13, 2025 15:02
@Honny1
Member Author

Honny1 commented May 13, 2025

@mtrmac, @giuseppe PTAL, I've updated the prototype.

@Honny1 Honny1 requested review from giuseppe and mtrmac May 13, 2025 15:05
Collaborator

@mtrmac mtrmac left a comment

Thanks! Good progress, TempDir now looks safe, and the documentation is great.

For Podman service, and CRI-O, we do need layer deletion to happen very soon, not just at store shutdown. And, for those, we might also want the recovery to happen much more frequently (e.g. if someone runs podman rmi as a one-time operation on an OpenShift node, and that crashes, and the only other c/storage user is a long-running CRI-O process).

That might, OTOH, make the serialization implied by tempDirGlobalLock more frequent = more costly and less desirable, again. I don’t think we absolutely need that lock?

  • Split instance state and TempDir state. Then there will be no need to do any locking in TempDir itself, for in-process purposes, it would just contain read-only fields. And we can make the instance single-goroutine-owned = again, no need for in-process locks.
  • Move instanceLock out of the instance directory, and change lifetimes so that the lock is created before, and deleted after, the instance directory. Then I think we don’t need tempDirGlobalLock for cross-process purposes, either (? apart from coordinating recovery).

@@ -2027,7 +2042,7 @@ func (d *Driver) ListLayers() ([]string, error) {
 	for _, entry := range entries {
 		id := entry.Name()
 		switch id {
-		case linkDir, stagingDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
+		case linkDir, stagingDir, tempDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
Collaborator

Uh… do we need to worry about forward compatibility, i.e. an older version of this code running and unaware of tempDir?

Member Author

I'm not sure, but I moved it to stagingDir. But I'm not sure how it is with recreateSymlinks(). I think it should ignore stagingDir.

Collaborator

I'm not sure, but I moved it to stagingDir.

That might make things worse: it would interfere with pruneStagingDirectories.


But I'm not sure how it is with recreateSymlinks(). I think it should ignore stagingDir.

Yes. It probably works well enough now, because $stagingDir/link doesn’t exist, so the directory is ignored - but it seems attractive to clean this up, going forward.

The very last resort which always works would be to leave d.home for layer directories, and have the caller of the driver provide a separate d.metadataHome where we have no prior commitments.

But, maybe, the answer is that we can leave tempDir in d.home, as long as $tempDir/link is never created?? And should we now at least reserve a new path prefix, e.g. defining that ${d.home}/meta* is never a layer directory?

Or, alternatively, we could define a storage version number, and refuse to work with future versions. That would be drastic but clearly correct, OTOH we would be signing up for very precisely managing this to keep CRI-O and Podman in sync (and for not forgetting to update the version number…)

All of this forward-compatibility design seems valuable, but also somewhat out of scope of this “unlocked deletes” effort — or at least clearly out of scope of this prototype.

Cc: @giuseppe

Collaborator

Now that this is no longer just a prototype, we’ll need to find a solution.

Collaborator

@giuseppe With this, old versions’ GarbageCollect == podman system prune --external can delete all of tempDir.

I think that’s good enough; it’s supposed to be run at boot, in a quiet state, but I’m highlighting this in case I’m missing something.

@cgwalters
Contributor

I only tangentially skimmed this PR but just in case of interest, from the very start ostree has implemented a vaguely similar scheme in https://github.com/ostreedev/ostree/blob/7a32b86ee4c0232bfde97a3cd87c4a82b6e73218/src/libostree/ostree-repo.c#L6292
An important concern for ostree is keeping track of which files have been fsync()'d and to allow resuming.

Comment on lines 62 to 76
Example structure:

rootDir/
    .global-tmp-lock    (tempDirGlobalLock)
    temp-dir-ABC/
        .lock           (instanceLock for temp-dir-ABC)
        file1
        file3
    temp-dir-XYZ/
        .lock           (instanceLock for temp-dir-XYZ)
        file2
Member

#2325 (comment) still holds.

This structure is subject to a race condition when the .lock file is deleted before the other files in the directory.

So if process A holds the lock while it runs rm -rf temp-dir-XYZ, what should a new process B do when it attempts to lock the directory but doesn't find the lock file? 1: Assume the directory is being deleted (process A could have crashed), or 2: delete it as well?

I think it is safer to have:

temp-dir-XYZ/data/
temp-dir-XYZ/.lock

First remove temp-dir-XYZ/data/, and if that succeeds, proceed to remove temp-dir-XYZ.

When process B tries to lock the directory, it will find the lock file for as long as data/ exists; if it doesn't find the lock, it can do rmdir("temp-dir-XYZ"), because that state can only arise if process A crashed after removing the lock file but before removing the directory (or after creating the directory but before the lock file was created).

Collaborator

I don’t think we need that directory at all — $root/${instanceID}.lock could mark ownership of $root/${instanceID}.data.

On creation, allocate the lock first. On deletion, unlock and delete the lock last.

(This could even, hypothetically, allow ….data to be a regular file, without the overhead of any extra directories.)

Member

what if the content we are trying to move to the temp-dir has a file/directory named .lock?

Collaborator

@mtrmac mtrmac May 15, 2025

This structure is subject to a race condition when the .lock file is deleted before the other files in the directory.

I think that part is good enough — things like os.RemoveAll really shouldn’t crash if a parent is removed in the meantime.

The problem with moving the lock inside the instance directory is at creation time, where we have the instance directory without any lock, so we need to protect against cleanup/recovery immediately deleting it. The current implementation does solve that, by holding the global lock — but that also means that the cleanup/recovery needs to obtain the global lock every time just to check whether anything needs to be cleaned up.

Together with #2325 (review) arguing that we should do the cleanups frequently, not just at process shutdown, that implies obtaining an extra lock file on both layerStore.startReading and layerStore.startWriting. I’d rather prefer to avoid that cost. (I’d even like to avoid the Readdir on every start…ing, but that will require more cooperation with layerStore.)

Collaborator

@mtrmac mtrmac May 15, 2025

what if the content we are trying to move to the temp-dir has a file/directory named .lock?

I think that’s fine?

  • File names under $root are owned by the tempdir package. Callers have no claim to decide names, we would simply not give them the API (and the current code in .Add that bases the file name on the base name of the input can be changed).
  • So, a regular file would be named $root/${instanceID}.data. No ambiguity.
  • If the instance were a larger directory structure (our primary use case now: a layer deletion in progress), $root/${instanceID}.data would be a directory containing arbitrary contents. We would not be looking for locks inside the ….data subdirectory at all.

Comment on lines 100 to 101
	td.tempDirGlobalLock.Lock()
	defer td.tempDirGlobalLock.Unlock()
Member

we don't need the lock here since it is released when the function exits, so for the caller it doesn't matter whether the code below runs with the lock or not.

As @mtrmac pointed out, we need to use os.ReadDir, not walk it recursively

@Honny1 Honny1 force-pushed the prototype-temp branch 3 times, most recently from dbd4cad to 76529ec Compare May 15, 2025 18:34
Collaborator

@mtrmac mtrmac left a comment

Ratholes inside ratholes …

If I’m reading things right, using pkg/lockfile.LockFile for staging directories would leak memory? Cc: @giuseppe to keep me honest.

If so, we need a new idea, or a new primitive. I was thinking a new internal/staginglock that supports

  • CreateAndTryLock
  • TryLockExistingPath
  • UnlockAndDelete

without an ever-growing lock map. If so, can we do that in yet another separate PR, fix overlay’s stagingDirBase, and then return to this one?

But maybe there’s some much simpler approach I’m now missing.

@Honny1
Member Author

Honny1 commented May 16, 2025

I've edited RecoverStaleDirs, initInstanceTempDir, and Cleanup so they're less racy, and I've marked the racy parts. I think those would be solved with internal/staginglock.

@Honny1 Honny1 changed the title [PROTOTYPE] Temporary directory Introduce a temporary directory for deleting layer files outside the lock Jun 26, 2025
@Honny1 Honny1 marked this pull request as ready for review June 26, 2025 16:19
Member

@giuseppe giuseppe left a comment

do you have a PR for Podman to test the new functionality and see how it would work?

In particular, I'd like to see how the deferred removal would work in a cleanup process

@Honny1
Member Author

Honny1 commented Jul 1, 2025

Here is a PR with vendoring of c/storage with the new approach for deleting layers with a temporary directory: containers/podman#26546

Collaborator

@mtrmac mtrmac left a comment

Nice, I think this proves the concept beyond doubt.

Now, for merging / using in production, the one larger correctness issue I see is correctly placing the staging directories in graph drivers.

(I do worry a little about performance — we now trigger recovery on process start (clearly correct and necessary), and after every time we observe some other process making a modification to the store. That might be frequent in high-load multi-Podman systems… OTOH it should not affect the podman-remote server nor CRI-O, who are ~the only writers, so, on balance, I think this is probably fine, even with the ~2 extra ReadDirs.)

	if err != nil {
		return nil, err
	}
	if err := t.Add(d.dir(id)); err != nil {
Collaborator

d.dir can be filepath.Join(d.home, "dir", …) or filepath.Join(d.imageStore, d.String(), "dir", …) (or something with d.additionalHomes[], although those are presumably in R/O image stores and should not be possible to delete).

d.home and d.imageStore can be on different filesystems (and, I think, that would frequently would be the case: the “image” layers would be in durable storage and the “container” layers in something faster but less durable).

So, I think the GetTempDirRootDir driver API needs to be generalized to return a list of directories. (And that would remove the current “"" means N/A” special case.)

})
}

func (d *Driver) removeCommon(id string, cleanup func(string) error) error {
	dir := d.dir(id)
Collaborator

Similarly to VFS, this can point to d.imageStore, not to d.home.

@@ -2027,7 +2042,7 @@ func (d *Driver) ListLayers() ([]string, error) {
 	for _, entry := range entries {
 		id := entry.Name()
 		switch id {
-		case linkDir, stagingDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
+		case linkDir, stagingDir, tempDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
Collaborator

Now that this is no longer just a prototype, we’ll need to find a solution.

Fixes: https://issues.redhat.com/browse/RUN-3104

Signed-off-by: Jan Rodák <hony.com@seznam.cz>
Collaborator

@mtrmac mtrmac left a comment

Just a brief look, I didn’t review the driver code changes.

// deferredDelete deletes a layer with the specified name or ID.
// This removal happen immediately (the layer is no longer usable),
// but physically deleting the files may be deferred.
// Caller MUST call all returned cleanup functions outside of the locks.
Collaborator

… EVEN IF the function returns an error.

Collaborator

@mtrmac mtrmac left a comment

I’m afraid the additional image store implementation is confusing, the way it is mixed across Store and the drivers. (Exhibit N+1 of why I refuse to work on adding more options to graph drivers.)

@Honny1
Member Author

Honny1 commented Jul 9, 2025

@mtrmac PTAL

@Honny1 Honny1 requested review from mtrmac and giuseppe July 9, 2025 11:48
Member

@giuseppe giuseppe left a comment

LGTM

@openshift-ci openshift-ci bot added the approved label Jul 9, 2025
}

func (d *Driver) DeferredRemove(id string) (tempdir.CleanupTempDirFunc, error) {
	_, homeDir, _ := d.dir2(id, false)
Collaborator

I think this homeDir is [undocumented :( and] “where to create layers”; it does not, AFAICS, consistently match the directory where id is found — so this does not return the right directory.


Really I think having dir2, in both graph drivers, return the parent(?) directory would be cleaner (centralizing knowledge/policy, and allowing to clearly track where data comes from) than hard-coding any knowledge in the DeferredRemove callers.

Admittedly, yes, that would imply documenting what dir2 does, and that’s a bit of a puzzle. But, as this PR shows, that would be valuable.

Member Author

The rework and documentation of dir2 should be done in another PR. Should I create an issue for this?

Collaborator

I think that’s ultimately up to you.

I think after this merges you are likely to be pinged if this behaves in a surprising way, and right now you understand the code better than most people do most of the time. So, I think it would help your future self to document it very soon. But that’s a guess about the future, I don’t think I’m 100% justified in requiring you to do that investment now.

This enables deferred deletion of physical files outside of global locks, improving performance and reducing lock contention.

Signed-off-by: Jan Rodák <hony.com@seznam.cz>
@Honny1 Honny1 requested a review from mtrmac July 10, 2025 11:11
@Honny1
Member Author

Honny1 commented Jul 10, 2025

@mtrmac PTAL

Collaborator

@mtrmac mtrmac left a comment

LGTM, thanks! A culmination of a lot of great work.

@giuseppe PTAL.

Member

@giuseppe giuseppe left a comment

thanks for the great work!

/lgtm

Contributor

openshift-ci bot commented Jul 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Honny1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 092254c into containers:main Jul 10, 2025
20 checks passed