-
Notifications
You must be signed in to change notification settings - Fork 22
Implement memory considerate zip walker #87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Generate changelog in
|
pkg/archive/zip/reader.go
Outdated
return err | ||
} | ||
} | ||
if numFiles != end.directoryRecords { // only compare 16 bits here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the original code and the comment, I believe end.directoryRecords is truncated and so this will fail if there are more than the max value of a uint16 files. Restoring the uint16() conversions here should fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the test TestOver65kFiles
covers this case.
Max uint16 is 65535
, number of files used in that test case is 65578
.
I've added a counter to check that the 65578
is the number of time the handler is called in this test.
Fancy double checking to see if there's something I'm missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried going through the file history, and can't seem to find much information to support why this was being done - the actual end.directoryRecords is a uint64 so could have more than 16 bits in. I would assume this was done to support zip files produced by certain software which presumably used only 16 bits despite the file format allowing for more. Given this is checking for an error case I think it's better to go with the original code, given the uncertainty around what might break if we change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, here's my best attempt at explaining why the standard library is truncating the values:
- There are two zip file formats, zip and zip64. The former only supports a limited number of files (2^16 worth) while the latter is unlimited.
- However the file count in the older zip format is just a counter. You can write as many records as you like in to the zip and central directory as you want, exceeding the 2^16 maximum count. In this case the count cannot be correct for the number of files actually included in the archive.
- You could declare this to be an error, and cap the older zip format at 2^16 files. There is presumably zip software out there that is happy to produce archives in violation of this.
- Therefore, in order to be as accepting as different zip files as possible, the comparison here is truncated to just 16 bits in order to support archives where more than 2^16 files have been stuffed in to them without using the zip64 format.
398b785
to
e628f15
Compare
1b93a95
to
9e824a4
Compare
👍 |
Released 1.6.0 |
The standard
go
zip.Reader
implementation creates a large slice of*zip.File
when initialising theReader
. This is problematic for extremely large zip files as it can use a lot of memory.This PR implements two zip walking functions that use a version of the standard
go
Reader
to walk through all zip file contents without creating this large slice, using an iteration/stream approach instead.To see the changes from the standard go
zip
package, compare the first commit against the following commits: 9d735f4...gh/memory-considerate-zip-walkerDescription of changes relative to the standard
zip
package codebaseIntent
The following modifications have been made to the original standard zip package to allow for iterating through zipped content in a memory considerate manner. The standard package has been optimised for reasons other than ours and the upstream implementation collects a slice of
*File
with a size equal to the number of files contained within a zip. This implementation works in a iterative/streaming manner where upon encountering a*File
in the zip directory tree, the file is immediately passed to a function provided by the user for processing.Package API
Two top-level functions have been introduced
zip.WalkZipFile
andWalkZipReaderAt
that are akin to the standard packagezip.OpenReader
andzip.NewReader
functions, respectively, but accepting another parameter, aWalkFn
, and immediately walking through the contents of the zip file provided by the file or reader, passing each entry to the givenWalkFn
.The
WalkFn
, when passed a*File
should return a bool and an error. A false bool or a non-nil error will cause the walk to stop and return the error to the user.Changes made relative to the codebase in the standard
zip
packageReader/ReadCloser
have been removed from the package API, allowing only a single entrypoint to the Walk functions that we have implemented.Reader.init
method has been renamed toReader.walk
to reperesent the functionality that it now holds. More details on the cahnges for this can be found below.Reader.Files
field, a[]*File
has been removed to make it impossible to incur the memory penalty of accumalating a slice of*File
details.Walk
methods. An extra test filereader_local_test.go
has been introduced, containing tests specific for our implementation.internal
package have been removed as it is impossible for us to reference themExamples have been removed.
Changes to Reader.init/Reader.walk
Reader.init
is where the standard package reads through the directory tree of the zip file to accumulate a slice of*File
for each file in the zip. This method has been changed to accept aWalkFn
which each*File
is passed to instead of appending it to a slice. TheWalkFn
is passed into the two top-levelWalk
functions by the caller.Reader.Files
has been removed.*File
should return a bool and an error. A false bool or a non-nil error will cause the walk to stop and return the error to the user.Reader.Files
was previously checked. If the numbers do not match, an error is returned from the top-levelWalk
functions.File.readDataDescriptor
is no longer called on each file as it is iterated, this change has been made for performance reasons as it is an expensive call and the information populated byFile.readDataDescriptor
is not required unless the File is being open. This is now called within theFile.Open
method so that it is only called if the file is going to be opened. A user can now determine whether the file should be opened from the filename or size, for example, and then only havef.readDataDescriptor
called for those that are opened.Changes to File
File.readDataDescriptor
now accepts abodyOffset
rather than callingfindBodyOffset
to determine it. This change was made because inFile.Open
, we now find the body offset whenOpen
is called and can inject it intoreadDataDescriptor
to avoidfindBodyOffset
being called multiple times perOpen
.