-
Notifications
You must be signed in to change notification settings - Fork 262
Write tar fixes #2261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write tar fixes #2261
Conversation
there are some tests failures that must be taken care of:
|
The failure itself is not the problem, but rather what the right behavior should be. It seems that some errors should be ignored, but other errors maybe not. Do you have guidance? A cop-out would be to add an |
@giuseppe - let me propose the error handling commit in a separate PR, because it evidently needs more discussion. |
87b1551
to
89d00b2
Compare
89d00b2
to
f7ff038
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
thanks; this needs another LGTM, I think. Do you know someone who could provide this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Do we need to make
TarWithOptionsTo
public? c/storage attempts to have a stable API, every new function is a commitment. See alsoNewPatternMatcher
. - This does not actually substantially improve error handling, because errors in the goroutine are only logged and don’t cause a failure in the consumer.
pkg/archive/archive.go
Outdated
// TarWithOptions creates an archive from the directory at `srcPath`, only including files whose relative | ||
// paths are included in `options.IncludeFiles` (if non-nil) or not in `options.ExcludePatterns`. | ||
func TarWithOptions(srcPath string, options *TarOptions) (io.ReadCloser, error) { | ||
if _, err := fileutils.NewPatternMatcher(options.ExcludePatterns); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NewPatternMatcher
is not quite free; doing that twice is awkward.
If TarWithOptionsTo
were private, we could easily only do it once, and pass the value into the goroutine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, if the pipe writer were modified to use CloseWithError
(which, I, think, should definitely happen), we don’t need to specifically check here: we can just start the pipe writer goroutine, and receive the pattern matcher error through the pipe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's somewhat incompatible, though: the error will only be propagated once reading on the stream starts. If the caller does something like
tarStream, err := Tar(...)
if err != nil { return err }
dest := os.Create("file.tar")
io.Copy(dest, tarStream)
then your proposed behavior would now leave around "file.tar" in case of a PatternMAtcher failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented this. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the error will only be propagated once reading on the stream starts
That’s a fair point. There are several ways to structure that in a private function; for the eventual public API, I don’t know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as an aside, the pattern matcher could be O(1) in the number of patterns by rewriting
NewPatternMatcher("a", "b")
to generate the regexp '(a|b)'
pkg/archive/archive.go
Outdated
logrus.Errorf("%s", err) | ||
return | ||
} | ||
compressWriter, err := CompressStream(pipeWriter, options.Compression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICS compressWriter
is ~entirely local to the pipe-writing goroutine; the reading goroutine doesn’t need to know it exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean: the writing routine (which becomes TarWithOptionsTo in this PR) should also do compression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is already doing the compression: compressWriter
is a layer of indirection that does the compression within its Write
implementation, and compressWriter.Write
is ultimately called within the producer goroutine.
(After all, compressWriter
is writing to pipeWriter
; the caller of TarWithOptions
is, most of the time, blocked on reading from pipeReader
, and it has no opportunity to write to pipeWriter
in the meantime.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what your original point was. Can you rephrase it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean: the writing routine (which becomes TarWithOptionsTo in this PR) should also do compression?
Yes. TarWithOptionsTo( options: { Compression: … } )
can be set, and it seems simpler to implement it than to specifically refuse the option (or, worse, to silently ignore it while TarWithOptions
implements it).
OTOH that is similar to the PatternMatcher
situation, where moving the initialization into the writing goroutine would delay error reporting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
… and would the compression interfere with your plans to use reflinks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and would the compression interfere with your plans to use reflinks?
Compression, either in the file system, or on the tar stream obviously breaks reflink copying.
Unfortunately, the CompressStream
function does more than compression; it also introduces buffering for the Uncompressed
case by means of the NewWriteCloserWrapper
. AFAICT, the wrapper will hide the ReadFrom
method of the underlying writer.
I'm not sure why the buffering is there. The tarWriter
has its own buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moby/moby#7542 claims a fairly significant performance improvement, although that’s not directly attributed to exactly these lines.
I don’t see archive.Tar
doing all that much buffering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the moby PR has a bunch of anecdotal data, along with a TarUntar benchmark. The latter looks rather unrepresentative, creating 100 4 byte files in /tmp, and then repeatedly tarring them.
I made a quick test that tars a dir on a real file system. The buffering in CompressStream makes a ~10% impact when going from ext4 to ext4,
$ TARDEST=$HOME/tardest.tar TARDIR=$HOME/vc/linux go test -run TarLinux -bench TarLinux
2025/03/11 13:41:38 tarring "/home/hanwen/vc/linux"
2025/03/11 13:41:45 7.221876687s
2025/03/11 13:41:55 9.783580435s
2025/03/11 13:42:03 8.324693106s
2025/03/11 13:42:13 9.535414505s
2025/03/11 13:42:21 8.34720174s
2025/03/11 13:42:31 9.622281549s
2025/03/11 13:42:39 8.726733474s
2025/03/11 13:42:48 9.115060653s
2025/03/11 13:42:57 8.690927914s
2025/03/11 13:43:06 9.37279054s
2025/03/11 13:43:06 With 8.34720174s, without 9.535414505s
PASS
ok github.com/containers/storage/pkg/archive 88.748s
suprprisingly, with my local hack for zero-transport copy, the buffering (which disables zero-transport) actually makes things go faster:
$ TARDEST=$HOME/.cache/tardest.tar TARDIR=$HOME/.cache/linux go test -run TarLinux -bench TarLinux
2025/03/11 13:47:06 tarring "/home/hanwen/.cache/linux"
2025/03/11 13:47:15 9.059872899s
2025/03/11 13:47:30 14.882930582s
2025/03/11 13:47:37 6.968169459s
2025/03/11 13:47:53 16.237515011s
2025/03/11 13:48:00 7.07192259s
2025/03/11 13:48:15 15.151443769s
2025/03/11 13:48:23 7.171133439s
2025/03/11 13:48:38 15.060790794s
2025/03/11 13:48:46 8.301557066s
2025/03/11 13:49:01 14.920596331s
2025/03/11 13:49:01 With 7.171133439s, without 15.060790794s
PASS
ok github.com/containers/storage/pkg/archive 114.837s
for tarring larger files, zero copy does bring benefits,
$ TARDEST=$HOME/.cache/tardest.tar TARDIR=$HOME/.cache/lib64 go test -run TarLinux -bench TarLinux
2025/03/11 13:54:09 tarring "/home/hanwen/.cache/lib64"
2025/03/11 13:54:12 2.802090087s
2025/03/11 13:54:14 1.640192869s
2025/03/11 13:54:16 2.12680278s
2025/03/11 13:54:18 1.61872927s
2025/03/11 13:54:20 2.151480276s
2025/03/11 13:54:21 1.619020427s
2025/03/11 13:54:24 2.168449473s
2025/03/11 13:54:25 1.584594405s
2025/03/11 13:54:27 2.132707088s
2025/03/11 13:54:29 1.617399592s
2025/03/11 13:54:29 With 2.151480276s, without 1.61872927s
it seems I'll need a more sophisticated strategy here (using zero-copy only for large files), and right now, there is no good argument for dropping the buffering from CompressStream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/archive/archive.go
Outdated
// `dest`, only including files whose relative paths are included in | ||
// `options.IncludeFiles` (if non-nil) or not in | ||
// `options.ExcludePatterns`. | ||
func TarWithOptionsTo(dest io.WriteCloser, srcPath string, options *TarOptions) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how this function silently ignores many errors (and, it seems, the plan is to change that), I’m very uncomfortable with making it a public API at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point.
I am primarily interested in creating tar files (in the Podman Checkpoint command) without the intermediate buffering introduced by io.Pipe
. How should we proceed then? IIUC, there are couple of uncontested parts:
- passing errors in the writer through to the reader using CloseWithError
- avoiding writes to TarOptions
and a more hairy change, which is to propagate all or some errors from tarWriter.addFile(). IIUC, you'd rather see us decide what this looks like before making a new API.
FWIW, I think it would be better to export something like archive.tarWriter
. Then, some of the functionality of TarOptions
could be expressed in code by callers, rather than having 16 options that interact with each other in unpredictable ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the “uncontested parts” can be merged ~immediately.
I didn’t look at the failures triggered by adding error handling ~at all yet. I understand that they matter to you (don’t they?), so they block the overall effort anyway. It think that’s the next step.
And making the API public would follow, or be a part of the “add error handling” PR.
I think it would be better to export something like
archive.tarWriter
.
I have ~no opinion on that in principle, I’d defer to actual c/storage maintainers.
At a high level, I’m a bit unclear on what is the point of a granular API for external consumers. archive.Tar
, sure, a very simple one-shot call. archive.TarWithOptions
, fine, for various tunables (although many would be pretty specific to needs of c/storage callers).
For the individual tarWriter.addFile
… I’m tempted to question whether there are callers specialized enough that using just archive.TarWithOptions
is insufficient, but not so much that they would want to use the standard-library archive/tar.Writer
, and whether the intersection of these conditions is enough to worry about.
Then, some of the functionality of TarOptions could be expressed in code by callers
I don’t know what this means. Do you mean some other parameter-passing convention (e.g. “functional options”?) Using ordinary function parameters instead of a struct? I’d be rather opposed to the latter.
I am primarily interested in creating tar files (in the Podman Checkpoint command) without the intermediate buffering introduced by io.Pipe
. I’m curious, does that have a measurable performance impact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that they matter to you (don’t they?),
The failures in the test suite (caused by ignoring errors) don't matter to me. What matters to me is that in case of any error, the tar creation fails.
they would want to use the standard-library archive/tar.Writer, and whether the intersection of these conditions is enough to worry about.
The archive/tar
package falls short because it doesn't connect the file system to the Writer/Reader, handling among others
- extended attributes
- file types (mkdir vs mknod vs creat)
- hard-links
Note that the std library archive/tar.Writer
is structured as a type with methods, offering greater flexibility to callers, rather than a function with a zillion options.
I’m curious, does that have a measurable performance impact?
See containers/podman@40e49cc and containers/podman#24826 (comment)
for writing large files to BTRFS, there is an effective speedup of up to 1000x.
f7ff038
to
ff06e02
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I think we are now at
- some refactoring towards a better design — good
- some, but not all, error paths now returning errors
That’s a good bit of progress.
@giuseppe PTanotherL.
pkg/archive/archive.go
Outdated
logrus.Errorf("%s", err) | ||
return | ||
} | ||
compressWriter, err := CompressStream(pipeWriter, options.Compression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moby/moby#7542 claims a fairly significant performance improvement, although that’s not directly attributed to exactly these lines.
I don’t see archive.Tar
doing all that much buffering.
ff06e02
to
6e03dd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
ping. @giuseppe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
A nil and empty slice are equivalent, except resulting JSON serialization. There is JSON serialization in applyLayerHandler(), but it's internal to the storage package and therefore not subject to compatibility promises. Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
Before, this was breaking the convention that options passed by reference are not mutated. In particular, TestOverlayTarUntar passes the same object to both TarWithOptions and Untar, resulting in a race detector failure. Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
Gives a name to the anonymous goroutine populating the archive, and makes parameter dependencies explicit, in preparation to moving it out. No functional changes. Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
6e03dd5
to
d39dba2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: giuseppe, hanwen-flow The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
No description provided.