Introduce a temporary directory for deleting layer files outside the lock #2325
Conversation
Force-pushed from ca5a283 to cb73550
I think we should avoid using private/company-specific Google docs to design public FOSS in general, especially as part of the movement of projects in this space to the CNCF.
@cgwalters ok
Force-pushed from a1810a4 to d4352cd
A very preliminary first look — feel free to ignore if feedback is not yet wanted.

At this point I’m mostly focused on the `TempDir` mechanism’s design/API; callers are interesting to the extent that they represent use cases that need to be handled, but …
pkg/tempdir/tempdir.go (Outdated)
```go
// Required global storage lock must be held before calling this function.
func recover(rootDir string) error {
	tempDirPattern := filepath.Join(rootDir, "temp-dir-*")
	tempDirs, err := filepath.Glob(tempDirPattern)
```
As is, this would add a `.Glob` to every RW lock acquisition. It would be interesting to try to optimize that.
Let's just use a parent directory for the temp directories, so we don't have to look at their names.

Instead of `/foo/run/vfs-layers/temp-dir-28400`, use something like `/foo/run/vfs-layers/temp-dirs/28400`, so we avoid the glob.
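A minimal sketch of that layout, assuming a dedicated `temp-dirs` parent (all names here are illustrative, not the PR's final API): recovery becomes a single `os.ReadDir` of the parent instead of a `filepath.Glob` over the whole root.

```go
package tempdir

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
)

// recover removes leftover temp directories under rootDir/temp-dirs.
// The global storage lock must be held by the caller.
func recover(rootDir string) error {
	parent := filepath.Join(rootDir, "temp-dirs")
	entries, err := os.ReadDir(parent)
	if err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			return nil // nothing was ever created, nothing to clean up
		}
		return err
	}
	for _, e := range entries {
		// Everything under the dedicated parent is ours by construction,
		// so no name pattern needs to be matched.
		if err := os.RemoveAll(filepath.Join(parent, e.Name())); err != nil {
			return err
		}
	}
	return nil
}
```

The cheap common case matters here: with no leftovers, this is one `ReadDir` returning `fs.ErrNotExist` or an empty slice, with no per-entry name matching.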
(I’m wondering about even more optimization — we could have a “might have a staging directory” bool in `layers.json`, and use the `layerStore.lockfile.RecordWrite` notification mechanism to make all of this ~zero-cost to users of `layerStore.startReading`. But, one step at a time.)
Force-pushed from cc142b0 to ce89bc4
Thanks! Good progress, `TempDir` now looks safe, and the documentation is great.

For Podman service, and CRI-O, we do need layer deletion to happen very soon, not just at store shutdown. And, for those, we might also want the recovery to happen much more frequently (e.g. if someone runs `podman rmi` as a one-time operation on an OpenShift node, that crashes, and the only other c/storage user is a long-running CRI-O process).

That might, OTOH, make the serialization implied by `tempDirGlobalLock` more frequent = more costly and less desirable, again. I don’t think we absolutely need that lock?
- Split instance state and `TempDir` state. Then there will be no need to do any locking in `TempDir` itself; for in-process purposes, it would just contain read-only fields. And we can make the instance single-goroutine-owned = again, no need for in-process locks.
- Move `instanceLock` out of the instance directory, and change lifetimes so that the lock is created before, and deleted after, the instance directory. Then I think we don’t need `tempDirGlobalLock` for cross-process purposes, either (? apart from coordinating recovery).
drivers/overlay/overlay.go (Outdated)
```diff
@@ -2027,7 +2042,7 @@ func (d *Driver) ListLayers() ([]string, error) {
 	for _, entry := range entries {
 		id := entry.Name()
 		switch id {
-		case linkDir, stagingDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
+		case linkDir, stagingDir, tempDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
```
Uh… do we need to worry about forward compatibility, i.e. an older version of this code running and unaware of `tempDir`?
I'm not sure, but I moved it to `stagingDir`. But I'm not sure how it interacts with `recreateSymlinks()`. I think it should ignore `stagingDir`.
> I'm not sure, but I moved it to `stagingDir`.

That might make things worse; it would interfere with `pruneStagingDirectories`.

> But I'm not sure how it is with `recreateSymlinks()`. I think it should ignore `stagingDir`.

Yes. It probably works well enough now, because `$stagingDir/link` doesn’t exist, so the directory is ignored - but it seems attractive to clean this up, going forward.
The very last resort which always works would be to leave `d.home` for layer directories, and have the caller of the driver provide a separate `d.metadataHome` where we have no prior commitments.

But, maybe, the answer is that we can leave `tempDir` in `d.home`, as long as `$tempDir/link` is never created?? And should we now at least reserve a new path prefix, e.g. defining that `${d.home}/meta*` is never a layer directory?
Or, alternatively, we could define a storage version number, and refuse to work with future versions. That would be drastic but clearly correct, OTOH we would be signing up for very precisely managing this to keep CRI-O and Podman in sync (and for not forgetting to update the version number…)
All of this forward-compatibility design seems valuable, but also somewhat out of scope of this “unlocked deletes” effort — or at least clearly out of scope of this prototype.
Cc: @giuseppe
Now that this is no longer just a prototype, we’ll need to find a solution.
@giuseppe With this, old versions’ `GarbageCollect` == `podman system prune --external` can delete all of `tempDir`.

I think that’s good enough; it’s supposed to be run at boot, in a quiet state, but I’m highlighting this in case I’m missing something.
I only tangentially skimmed this PR, but just in case of interest: from the very start ostree has implemented a vaguely similar scheme in https://github.com/ostreedev/ostree/blob/7a32b86ee4c0232bfde97a3cd87c4a82b6e73218/src/libostree/ostree-repo.c#L6292
pkg/tempdir/tempdir.go (Outdated)
```
Example structure:

rootDir/
    .global-tmp-lock (tempDirGlobalLock)
    temp-dir-ABC/
        .lock (instanceLock for temp-dir-ABC)
        file1
        file3
    temp-dir-XYZ/
        .lock (instanceLock for temp-dir-XYZ)
        file2
*/
```
#2325 (comment) still holds.

This structure is subject to a race condition when the `.lock` file is deleted before the other files in the directory.

So you have process A that holds the lock while it runs `rm -rf temp-dir-XYZ`; what should a new process B do when it attempts to lock the directory but doesn't find the lock file? 1: Assume the directory is being deleted (process A could have crashed), or 2: delete it as well?
I think it is safer to have:

```
temp-dir-XYZ/data/
temp-dir-XYZ/.lock
```

On deletion, remove `temp-dir-XYZ/data/` first, and if that succeeded, proceed to remove `temp-dir-XYZ`.

When process B tries to lock the directory, it will find the lock file as long as `data/` exists; and if it doesn't find the lock, then it can do `rmdir("temp-dir-XYZ")`, because that can only happen if process A crashed after removing the lock file but before removing the directory (or after creating the directory but before the lock file was created).
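A minimal sketch of that recovery decision for process B, under the proposed `data/` plus `.lock` layout (the helper name and error handling are assumptions; actual lock acquisition is elided):

```go
package tempdir

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
)

// recoverInstance handles one temp-dir instance found during recovery.
// Invariant from the proposal: on deletion, data/ is removed before .lock,
// so a directory without .lock can only be a crash leftover.
func recoverInstance(instanceDir string) error {
	lockPath := filepath.Join(instanceDir, ".lock")
	if _, err := os.Lstat(lockPath); errors.Is(err, fs.ErrNotExist) {
		// The owner crashed either before creating .lock, or after
		// removing data/ and .lock; only an (empty) directory can
		// remain, and os.Remove (rmdir) fails safely otherwise.
		return os.Remove(instanceDir)
	} else if err != nil {
		return err
	}
	// .lock exists: try to acquire it (elided); if the owner still holds
	// it, leave the instance alone. Once acquired, delete in order:
	if err := os.RemoveAll(filepath.Join(instanceDir, "data")); err != nil {
		return err
	}
	if err := os.Remove(lockPath); err != nil {
		return err
	}
	return os.Remove(instanceDir)
}
```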
I don’t think we need that directory at all — `$root/${instanceID}.lock` could mark ownership of `$root/${instanceID}.data`.

On creation, allocate the lock first. On deletion, unlock and delete the lock last.

(This could even, hypothetically, allow `….data` to be a regular file, without the overhead of any extra directories.)
What if the content we are trying to move to the temp-dir has a file/directory named `.lock`?
> This structure is subject to a race condition when the .lock file is deleted before the other files in the directory.

I think that part is good enough — things like `os.RemoveAll` really shouldn’t crash if a parent is removed in the meantime.

The problem with moving the lock inside the instance directory is at creation time, where we have the instance directory without any lock, so we need to protect against cleanup/recovery immediately deleting it. The current implementation does solve that, by holding the global lock — but that also means that the cleanup/recovery needs to obtain the global lock every time just to check whether anything needs to be cleaned up.

Together with #2325 (review) arguing that we should do the cleanups frequently, not just at process shutdown, that implies obtaining an extra lock file on both `layerStore.startReading` and `layerStore.startWriting`. I’d rather prefer to avoid that cost. (I’d even like to avoid the `Readdir` on every `start…ing`, but that will require more cooperation with `layerStore`.)
> What if the content we are trying to move to the temp-dir has a file/directory named `.lock`?

I think that’s fine?

- File names under `$root` are owned by the `tempdir` package. Callers have no claim to decide names; we would simply not give them the API (and the current code in `.Add` that bases the file name on the base name of the input can be changed).
- So, a regular file would be named `$root/${instanceID}.data`. No ambiguity.
- If the instance were a larger directory structure (our primary use case now: a layer deletion in progress), `$root/${instanceID}.data` would be a directory containing arbitrary contents. We would not be looking for locks inside the `….data` subdirectory at all.
pkg/tempdir/tempdir.go (Outdated)
```go
td.tempDirGlobalLock.Lock()
defer td.tempDirGlobalLock.Unlock()
```
We don't need the lock here, since it is released when the function exits, so for the caller it doesn't matter whether the code below runs with the lock or not.

As @mtrmac pointed out, we need to use `os.ReadDir`, not walk it recursively.
Force-pushed from dbd4cad to 76529ec
Ratholes inside ratholes …

If I’m reading things right, using `pkg/lockfile.LockFile` for staging directories would leak memory? Cc: @giuseppe to keep me honest.

If so, we need a new idea, or a new primitive. I was thinking a new `internal/staginglock` that supports

- `CreateAndTryLock`
- `TryLockExistingPath`
- `UnlockAndDelete`

without an ever-growing lock map. If so, can we do that in yet another separate PR, fix overlay’s `stagingDirBase`, and then return to this one?

But maybe there’s some much simpler approach I’m now missing.
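A hedged sketch of such an `internal/staginglock` primitive: the three function names come from the comment above, while the `flock(2)`-based implementation and the struct are assumptions. The key property is that each lock owns exactly one open descriptor and no entry in any process-wide map.

```go
package staginglock

import (
	"os"
	"syscall"
)

// StagingLock guards one staging directory. It is owned by a single
// goroutine and is not registered anywhere globally, so dropping it
// leaks nothing.
type StagingLock struct {
	path string
	file *os.File // the flock(2) is held for as long as this stays open
}

// CreateAndTryLock creates a fresh lock file and acquires it.
func CreateAndTryLock(path string) (*StagingLock, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return &StagingLock{path: path, file: f}, nil
}

// TryLockExistingPath acquires an existing lock file, e.g. during recovery;
// failure to lock means some other process still owns the staging directory.
func TryLockExistingPath(path string) (*StagingLock, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return &StagingLock{path: path, file: f}, nil
}

// UnlockAndDelete removes the lock file and releases the lock by closing
// the descriptor; a real implementation would have to think through the
// unlink-vs-concurrent-open race here.
func (l *StagingLock) UnlockAndDelete() error {
	defer l.file.Close()
	return os.Remove(l.path)
}
```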
I've edited the …
Do you have a PR for Podman to test the new functionality and see how it would work?

In particular, I'd like to see how the deferred removal would work in a cleanup process.

Here is a PR with vendoring of c/storage with a new approach for deleting layers with a temporary directory: containers/podman#26546
Nice, I think this proves the concept beyond doubt.

Now, for merging / using in production, the one larger correctness issue I see is correctly placing the staging directories in graph drivers.

(I do worry a little about performance — we now trigger recovery on process start (clearly correct and necessary), and after every time we observe some other process making a modification to the store. That might be frequent in high-load multi-Podman systems… OTOH it should not affect the `podman-remote` server nor CRI-O, who are ~the only writers, so, on balance, I think this is probably fine, even with the ~2 extra `ReadDir`s.)
drivers/vfs/driver.go (Outdated)
```go
if err != nil {
	return nil, err
}
if err := t.Add(d.dir(id)); err != nil {
```
`d.dir` can be `filepath.Join(d.home, "dir", …)` or `filepath.Join(d.imageStore, d.String(), "dir", …)` (or something with `d.additionalHomes[]`, although those are presumably in R/O image stores and should not be possible to delete).

`d.home` and `d.imageStore` can be on different filesystems (and, I think, that would frequently be the case: the “image” layers would be in durable storage and the “container” layers in something faster but less durable).

So, I think the `GetTempDirRootDir` driver API needs to be generalized to return a list of directories. (And that would remove the current “`""` means N/A” special case.)
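A sketch of what that generalization could look like for the VFS driver; the pluralized method name `GetTempDirRootDirs`, the `tmp` subdirectory, and the trimmed-down struct are all illustrative assumptions, not this PR's code:

```go
package vfs

import "path/filepath"

// Driver is reduced here to the two fields relevant to the discussion.
type Driver struct {
	home       string
	imageStore string
}

// GetTempDirRootDirs generalizes GetTempDirRootDir: one temp-dir root per
// location that may hold deletable layer data, because deletion-by-rename
// only works within a single filesystem. An empty slice would replace the
// previous `""`-means-N/A special case.
func (d *Driver) GetTempDirRootDirs() []string {
	dirs := []string{filepath.Join(d.home, "tmp")}
	if d.imageStore != "" {
		// The R/W image store lives under a different, possibly
		// separately mounted, tree; give it its own temp-dir root.
		dirs = append(dirs, filepath.Join(d.imageStore, "vfs", "tmp"))
	}
	return dirs
}
```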
```go
})
}

func (d *Driver) removeCommon(id string, cleanup func(string) error) error {
	dir := d.dir(id)
```
Similarly to VFS, this can point to `d.imageStore`, not to `d.home`.
Fixes: https://issues.redhat.com/browse/RUN-3104
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
Just a brief look, I didn’t review the driver code changes.
```go
// deferredDelete deletes a layer with the specified name or ID.
// This removal happens immediately (the layer is no longer usable),
// but physically deleting the files may be deferred.
// Caller MUST call all returned cleanup functions outside of the locks.
```
… EVEN IF the function returns an error.
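A minimal sketch of that calling convention, assuming `tempdir.CleanupTempDirFunc` is a `func() error` and with a bare mutex standing in for the store's locking; the cleanup defer is registered before the unlock defer, so by LIFO ordering it runs after the lock is released, on every path:

```go
package main

import (
	"log"
	"sync"

	"github.com/containers/storage/pkg/tempdir"
)

var storeLock sync.Mutex

// deferredDelete stands in for the layer store's method: under the lock it
// renames the layer's files away and returns cleanup functions, possibly
// even when it also returns an error.
func deferredDelete(id string) ([]tempdir.CleanupTempDirFunc, error) {
	return nil, nil // the real implementation lives in the layer store
}

func deleteLayer(id string) error {
	var cleanups []tempdir.CleanupTempDirFunc
	defer func() {
		// Runs after the Unlock below (defers are LIFO), i.e. the
		// physical deletion happens outside the lock.
		for _, c := range cleanups {
			if c == nil {
				continue
			}
			if err := c(); err != nil {
				log.Printf("cleaning up temporary directory: %v", err)
			}
		}
	}()

	storeLock.Lock()
	defer storeLock.Unlock()

	cs, err := deferredDelete(id)
	cleanups = append(cleanups, cs...) // kept EVEN IF err != nil
	return err
}

func main() {
	if err := deleteLayer("abc"); err != nil {
		log.Fatal(err)
	}
}
```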
I’m afraid the additional image store implementation is confusing, the way it is mixed over `Store` and drivers. (Exhibit N+1 of why I refuse to work on adding more options to graph drivers.)
@mtrmac PTAL
LGTM
drivers/overlay/overlay.go (Outdated)
```go
}

func (d *Driver) DeferredRemove(id string) (tempdir.CleanupTempDirFunc, error) {
	_, homeDir, _ := d.dir2(id, false)
```
I think this `homeDir` is [undocumented :( and] “where to create layers”; it does not, AFAICS, consistently match the directory where `id` is found — so this does not return the right directory.

Really I think having `dir2`, in both graph drivers, return the parent(?) directory would be cleaner (centralizing knowledge/policy, and allowing to clearly track where data comes from) than hard-coding any knowledge in the `DeferredRemove` callers.

Admittedly, yes, that would imply documenting what `dir2` does, and that’s a bit of a puzzle. But, as this PR shows, that would be valuable.
The rework and documentation of `dir2` should be done in another PR. Should I create an issue for this?
I think that’s ultimately up to you.
I think after this merges you are likely to be pinged if this behaves in a surprising way, and right now you understand the code better than most people do most of the time. So, I think it would help your future self to document it very soon. But that’s a guess about the future, I don’t think I’m 100% justified in requiring you to do that investment now.
This enables deferred deletion of physical files outside of global locks, improving performance and reducing lock contention.
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
@mtrmac PTAL
LGTM, thanks! A culmination of a lot of great work.
@giuseppe PTAL.
thanks for the great work!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: giuseppe, Honny1.
This PR contains the implementation of the temporary directory and its integration. The temporary directory is used to delete files outside of the global storage lock, to reduce the time the lock is held. Instead of being deleted in place, files are moved into the temporary directory by renaming them; renaming a file is much faster than physically deleting it from disk. After all files have been moved into the temporary directory and their references removed, the global storage lock is released, and only then is the temporary directory deleted.
Fixes: https://issues.redhat.com/browse/RHEL-36087
Fixes: https://issues.redhat.com/browse/RUN-3104
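A minimal sketch of that overall flow, with hypothetical paths and a bare mutex standing in for the global storage lock; only the rename-under-lock, delete-outside-lock sequence is taken from this PR:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"sync"
)

var storageLock sync.Mutex

// removeLayer renames the layer directory into the temp dir while holding
// the global lock (a cheap, same-filesystem rename(2)) and returns a cleanup
// that does the expensive physical deletion after the lock is released.
func removeLayer(layerDir, tempRoot string) (cleanup func() error, err error) {
	storageLock.Lock()
	defer storageLock.Unlock()

	// … the layer's references are removed from the store's metadata here …
	dest := filepath.Join(tempRoot, filepath.Base(layerDir))
	if err := os.Rename(layerDir, dest); err != nil {
		return nil, err
	}
	// From the store's point of view, the layer is already gone.
	return func() error { return os.RemoveAll(dest) }, nil
}

func main() {
	cleanup, err := removeLayer(
		"/var/lib/containers/storage/vfs/dir/abc", // hypothetical paths
		"/var/lib/containers/storage/vfs/tmp",
	)
	if err != nil {
		log.Fatal(err)
	}
	// The global lock is no longer held; deletion proceeds outside it.
	if err := cleanup(); err != nil {
		log.Fatal(err)
	}
}
```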