Skip to content

Improve error handling in archive creation #2289

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 23, 2025

Conversation

hanwen-flow
Copy link
Contributor

No description provided.

@hanwen-flow
Copy link
Contributor Author

@mtrmac @giuseppe

FYI - I have found a different way to deal with tar file overhead in checkpoint/restore , so in the end I don't need tarWithOptionsTo as a public function. It might still be a good addition though. Most calls to TarWithOptions copy the the stream to a destination tar file.

@hanwen-flow hanwen-flow changed the title Improve handling in archive creation Improve error handling in archive creation Mar 19, 2025
Comment on lines 644 to 672
// a whiteout header
if wo != nil {
if hdr.Typeflag != tar.TypeReg || hdr.Size == 0 {
if err := ta.TarWriter.WriteHeader(hdr); err != nil {
return err
}
if hdr.Typeflag == tar.TypeReg && hdr.Size > 0 {
return fmt.Errorf("tar: cannot use whiteout for non-empty file")
}
hdr = wo
hdr = headers.whiteout
} else {
logrus.Infof("tar: cannot use whiteout for non-empty file %s", hdr.Name)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don‘t understand the “pkg/archive: remove "whiteout for non-empty file" error handling” commit, I think the original code was correct.

If the converter returns a new whiteout header, the quoted code calls WriteHeader on the original hdr, and then immediately follows WriteHeader(theNewWhiteout). In that case, the Size > 0 situation (as you say) would cause an invalid archive to be created — so why does that commit turns that problematic situation from an error into a warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://github.com/containers/storage/blob/main/pkg/archive/archive.go#L634,

			if err := ta.TarWriter.WriteHeader(hdr); err != nil {
				return err
			}
			if hdr.Typeflag == tar.TypeReg && hdr.Size > 0 {
				return fmt.Errorf("tar: cannot use whiteout for non-empty file")
			}
			hdr = wo

the check flags a problematic situation, but it should have been made before writing the header with Size > 0, because the current approach breaks the entire archive without flagging an error to the caller. I don't know why it was written this way; given the state of this code, I suspect carelessness. Removing the check doesn't make a test fail, and moby/moby has this erroneous bit until today.

It was introduced in moby/moby@daa7019 . It already ignore the error return from addFile.

for us, we have to decide between:

  • write the header, write file, then write whiteout
  • ignore the whiteout if file.size >0
  • ignore the file + whiteout if file.size > 0
  • fail the archive creation

I can't tell which one is the right solution , but clearly the current code is broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking a bit further, the only WhiteoutConverter is the one in archive_linux.go. Its ConvertWrite only takes action on directories and char devices, so this condition can never trigger.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are in agreement that writing a whiteout header with .Size > 0 would cause a corrupt file to be created. And that detecting that only after the WriteHeader is not ideal (but I think it doesn’t really matter as long as the overall work reports a failure.)

Where the disconnect seems to be is that previously the code reported a failure, and now it fails and produces incorrect output. I think that’s a move in the wrong direction?!


the only WhiteoutConverter is the one in archive_linux.go. Its ConvertWrite only takes action on directories and char devices, so this condition can never trigger.

Yes. But, sadly, it’s a public API, so we can’t change WhiteoutConverter itself.

Anyway, I think I’d be fine with just code comments (on the caller, noting that the only possible implementation never sets Size > 0, and on the whiteout implementation, reminding that the implementation must preserve that property) — but keeping the check with the hard failure would provide a stronger guarantee, and the code ~already exists…

So I think just moving the error check before the WriteHeader might be best here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But, sadly, it’s a public API, so we can’t change WhiteoutConverter itself.

the converter is a public API, but it is only selected through TarOptions.WhiteoutFormat, whose interpretation in GetWhiteoutConverter we control.

I did some more reading. The AUFS and overlayfs file systems have different ways of indicating deletions. This functionality is converts between them (where overlayfs -> aufs may introduce 2 entries for a single directory). The assumption therefore is that deletions are never marked by a non-empty files, and that a single whiteout in one format may only expand to 2 entries (without content) in the other format.

I'll add a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where the disconnect seems to be is that previously the code reported a failure,

it returned an error, but then that error only resulted in a log message.

Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Reading this line by line, this is fair enough, but I’m afraid that with this as is, future readers/maintainers would be more confused than with the current code.

Can you add documentation to addFileData? Primarily answering why is the “add file” functionality split into two functions (and, if adding new features in the future, in which of the two functions those features should go), and maybe anything else that maintainers should know.

Looking at #2261 (comment) , I guess the answer is “we have existing tests that require some failures to be ignored, so this is a way to structure the code to allow that, without ignoring others”. That could be a fair answer as far as it goes, but now it’s not obvious from the code without such a documentation…

… but, also, I’m inclined to push for details on the “require write failures to be ignored” part. I’m not at all saying this is incorrect, but someone needs to understand that this is the case, and that we should preserve that behavior — that it is not a bug in the tests where we are now doing a lot of work to reinforce that bug. And I’ll fully admit that I didn’t look into this: I assume you did, so please write down what you found and where we are. (On this part, I’m completely ignorant and I’d happily defer to maintainers more familiar with the code.)

@mtrmac
Copy link
Collaborator

mtrmac commented Mar 19, 2025

Oh and the lint task is failing, with some inscrutable failure… is that an infrastructure problem, or are we somehow throwing away the logs in CI?

@hanwen-flow
Copy link
Contributor Author

… but, also, I’m inclined to push for details on the “require write failures to be ignored” part. I’m not at all saying this is incorrect, but someone needs to understand that this is the case, and that we should preserve that behavior

these functions run on live containers, so another process may be mutating the file we're trying to copy out, so there can always be errors, eg. the path we're trying to archive was just removed by another process, so os.Lstat returns ENOENT.

This code could be much more resistant to races (eg. using openat(2) and friends to fix a file entry to export in just one place.), but it can't be fundamentally fixed: the container may truncate a file while we're writing it to the tar, causing a short read, and there is no way to recover from this.

With this change, we can be sure that we notice before handing back a corrupted file to the caller.

@hanwen-flow
Copy link
Contributor Author

these functions run on live containers, so another process may be mutating the file we're trying to copy out, so there can always be erro

I added this as a comment to TarWithOptions.

@hanwen-flow
Copy link
Contributor Author

Oh and the lint task is failing, with some inscrutable failure… is that an infrastructure problem, or are we somehow throwing away the logs in CI?

with some local tweaks:

$ make local-validate
make -C tests/tools GOLANGCI_LINT_VERSION=1.64.7
make[1]: Entering directory '/home/hanwen/vc/containers/storage/tests/tools'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/hanwen/vc/containers/storage/tests/tools'
./hack/git-validation.sh
+ export GIT_VALIDATION=tests/tools/build/git-validation
+ GIT_VALIDATION=tests/tools/build/git-validation
+ '[' '!' -x tests/tools/build/git-validation ']'
+ EPOCH_TEST_COMMIT=
+ '[' -z '' ']'
++ git merge-base main HEAD
+ EPOCH_TEST_COMMIT=b87bac8fb133a3570c5b11542bf791ef550f308c
+ exec time tests/tools/build/git-validation -q -run DCO,short-subject -range b87bac8fb133a3570c5b11542bf791ef550f308c..HEAD
1.53user 4.91system 0:06.21elapsed 103%CPU (0avgtext+0avgdata 12544maxresident)k
0inputs+5616outputs (0major+643812minor)pagefaults 0swaps

Looks like it's taking 6 seconds on my 16-core machine to check if I have a signed-off line in my commits. 🤷‍♂️

note that my 'main' branch is not upstream's main branch.

@mtrmac
Copy link
Collaborator

mtrmac commented Mar 20, 2025

I added this as a comment to TarWithOptions.

I like this, as well as the comments on the two functions around addFileData. That’s clear thinking and a clear rationale. Thanks!

@hanwen-flow hanwen-flow force-pushed the tar-fixes-2 branch 2 times, most recently from 66f477f to b55f0dd Compare March 24, 2025 11:31
@hanwen-flow
Copy link
Contributor Author

PTAL

@hanwen-flow
Copy link
Contributor Author

@mtrmac , @giuseppe - friendly ping.

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hanwen-flow
Copy link
Contributor Author

@mtrmac ping?

@hanwen-flow
Copy link
Contributor Author

I like this, as well as the comments on the two functions around addFileData. That’s clear thinking and a clear rationale. Thanks!

@mtrmac - would you be able to approve this? Or is there someone else who can give the 2nd LGTM?

@hanwen-flow
Copy link
Contributor Author

Rebased. Needs LGTM again. @giuseppe , @mtrmac

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some minor nits; SGTM overall

@@ -850,13 +850,19 @@ func extractTarFileEntry(path, extractDir string, hdr *tar.Header, reader io.Rea
}

// Tar creates an archive from the directory at `path`, and returns it as a
// stream of bytes.
// stream of bytes. This is a convenience wrapper for TarWithOptions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: may use markup while at it, i.e.

Suggested change
// stream of bytes. This is a convenience wrapper for TarWithOptions.
// stream of bytes. This is a convenience wrapper for [TarWithOptions].

so that go doc will link to this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 858 to 861
// TarWithOptions creates an archive from the directory at `path`,
// only including files whose relative paths are included in
// `options.IncludeFiles` (if non-nil) or not in
// `options.ExcludePatterns`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking, it makes no sense to use backticks in docstrings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed. (I was following conventions from rest of file.)

// addFile adds a file from `path` as `name` to the tar archive.
func (ta *tarWriter) addFile(path, name string) error {
type addFileData struct {
// path from which to read contents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing period

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

// path from which to read contents
path string

// os.Stat for the above
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing period

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

func TestTarErrorHandling(t *testing.T) {
dir := t.TempDir()

for i := 0; i < 10; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can use for i := range 10

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

This routine does no I/O, so it cannot fail.

Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
    
Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
Otherwise, if addFile errors out before writing the entry for a file
with hardlink, subsequent files with hardlinks would reference a file
that is not in the tar file.

Use hdr.Name (which is passed through CanonicalTarNameForPath) instead
of the 'name' parameter.

Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
This code was imported in commit f39066f (Sep 7, 2017, "Update
packages to match latest code in moby/pkg") without any further comments.

The error check cannot trigger, because there is only whiteout
converter (overlayfs => AUFS), and that only acts on char devices and
directories.

Document in detail what is going on.

Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
This allows disinguishing between inherent fatal error (writing to the
tar archive failed, eg. disk full) and non-fatal error
(tarWithOptionsTo is racing with another process mutating the file
system.)
    
Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
@hanwen-flow
Copy link
Contributor Author

@kolyshkin ping?

@hanwen-flow
Copy link
Contributor Author

@giuseppe - seems to need an LGTM (again?)

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Contributor

openshift-ci bot commented Apr 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, hanwen-flow, kolyshkin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit eda48a7 into containers:main Apr 23, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants