
Conversation

glynternet
Contributor

@glynternet glynternet commented Jan 21, 2022

The standard Go zip.Reader implementation creates a large slice of *zip.File when initialising the Reader. This is problematic for extremely large zip files, as the slice alone can consume a significant amount of memory.

This PR implements two zip walking functions that use a version of the standard Go Reader to walk through all zip file contents without creating this large slice, using an iterative/streaming approach instead.

To see the changes from the standard go zip package, compare the first commit against the following commits: 9d735f4...gh/memory-considerate-zip-walker

Description of changes relative to the standard zip package codebase

Intent

The following modifications have been made to the original standard zip package to allow for iterating through zipped content in a memory-considerate manner. The standard package is optimised for different goals: the upstream implementation collects a slice of *File with one entry per file contained within the zip. This implementation works in an iterative/streaming manner: upon encountering a *File in the zip directory tree, the file is immediately passed to a user-provided function for processing.
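For contrast, a minimal sketch of the standard-library pattern this package avoids, where every entry in the central directory is materialised up front:

```go
package main

import (
	"archive/zip"
	"fmt"
	"log"
)

func main() {
	// Standard-library approach: OpenReader eagerly builds r.File, a
	// []*zip.File with one entry per file in the archive, before any
	// entry can be processed.
	r, err := zip.OpenReader("archive.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()
	for _, f := range r.File {
		fmt.Println(f.Name)
	}
}
```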

Package API

Two top-level functions have been introduced, zip.WalkZipFile and zip.WalkZipReaderAt, akin to the standard package's zip.OpenReader and zip.NewReader functions respectively, but accepting an additional parameter, a WalkFn. They immediately walk through the contents of the zip provided by the file or reader, passing each entry to the given WalkFn.

The WalkFn, when passed a *File, should return a bool and an error. A false bool or a non-nil error will cause the walk to stop and return the error to the user.
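A minimal usage sketch, assuming the signatures described above (WalkZipFile(name string, fn WalkFn) error with WalkFn being func(*File) (bool, error)); the import path is a placeholder for wherever this package lives in the repository:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	zip "example.com/project/internal/zip" // placeholder import path for this package
)

func main() {
	// Walk every entry as it is encountered; no []*zip.File is accumulated.
	err := zip.WalkZipFile("archive.zip", func(f *zip.File) (bool, error) {
		fmt.Println(f.Name)
		if strings.HasPrefix(f.Name, "stop/") {
			return false, nil // returning false stops the walk with no error
		}
		return true, nil // keep walking
	})
	if err != nil {
		log.Fatal(err)
	}
}
```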

Changes made relative to the codebase in the standard zip package

  • All Write-related code has been removed where possible. We are only interested in reading.
  • All ways to create a Reader/ReadCloser have been removed from the package API, leaving the Walk functions that we have implemented as the single entrypoint.
  • After making the above changes, all unused code has been removed apart from some constants that were part of constant groups.
  • The Reader.init method has been renamed to Reader.walk to represent the functionality that it now holds. More details on these changes can be found below.
  • The Reader.Files field, a []*File, has been removed to make it impossible to incur the memory penalty of accumulating a slice of *File details.
  • All tests that are appropriate to keep have been migrated to represent equivalents of their original but using the new Walk methods. An extra test file reader_local_test.go has been introduced, containing tests specific for our implementation.
  • Some tests that use standard library internal packages have been removed, as it is impossible for us to reference them.
  • Examples have been removed.
  • Some general changes have been made so that the codebase meets our linter requirements, e.g. matching names of method receivers across method definitions, some extra error checking, and removal of unnecessary conversions between value types.

Changes to Reader.init/Reader.walk

  • Reader.init is where the standard package reads through the directory tree of the zip file, accumulating a *File for each file in the zip into a slice. This method has been changed to accept a WalkFn; each *File is passed to it instead of being appended to a slice. The WalkFn is passed into the two top-level Walk functions by the caller.
  • All logic related to Reader.Files has been removed.
  • The given WalkFn, when passed a *File, should return a bool and an error. A false bool or a non-nil error will cause the walk to stop and return the error to the user.
  • If the WalkFn never returns a false or a non-nil error, the expected number of directory records is compared against the number of files iterated through, in a similar manner to how the length of Reader.Files was previously checked. If the numbers do not match, an error is returned from the top-level Walk functions.
  • File.readDataDescriptor is no longer called on each file as it is iterated. This change has been made for performance reasons: it is an expensive call and the information it populates is not required unless the File is being opened. It is now called within the File.Open method so that it only happens if the file is actually opened. A user can, for example, decide from the filename or size whether a file should be opened, and f.readDataDescriptor will only be called for the files that are opened (see the sketch below).
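A sketch of the selective pattern this enables, again assuming the signatures described above and the same placeholder import path as the earlier example; only entries that are actually opened pay the data-descriptor cost:

```go
import (
	"io"
	"strings"

	zip "example.com/project/internal/zip" // placeholder import path, as above
)

// processJSONEntries is a hypothetical WalkFn: only .json entries are opened,
// so the data descriptor (now read lazily inside Open) is only read for them.
func processJSONEntries(f *zip.File) (bool, error) {
	if !strings.HasSuffix(f.Name, ".json") {
		return true, nil // skipped entries never trigger a data-descriptor read
	}
	rc, err := f.Open() // readDataDescriptor happens here, on demand
	if err != nil {
		return false, err // a non-nil error stops the walk
	}
	defer rc.Close()
	if _, err := io.Copy(io.Discard, rc); err != nil { // stand-in for real processing
		return false, err
	}
	return true, nil
}
```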

Changes to File

  • File.readDataDescriptor now accepts a bodyOffset rather than calling findBodyOffset to determine it. This change was made because File.Open now finds the body offset itself and can inject it into readDataDescriptor, avoiding findBodyOffset being called multiple times per Open.

@changelog-app

changelog-app bot commented Jan 21, 2022

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

The zip walking logic that previously used the standard package has been replaced with a memory-optimised version implemented specifically for the goals of this product.

Previously, the memory used for indexing the entries within a zip was proportional to the number of files contained within it. As of this release, no index is created, so this memory overhead has been removed.

Check the box to generate changelog(s)

  • Generate changelog entry

return err
}
}
if numFiles != end.directoryRecords { // only compare 16 bits here
Contributor

As per the original code and the comment, I believe end.directoryRecords is truncated and so this will fail if there are more than the max value of a uint16 files. Restoring the uint16() conversions here should fix this.

Contributor Author

I believe the test TestOver65kFiles covers this case.
Max uint16 is 65535, number of files used in that test case is 65578.
I've added a counter to check that 65578 is the number of times the handler is called in this test.
Fancy double checking to see if there's something I'm missing here?

Contributor

I've tried going through the file history and can't seem to find much information to support why this was being done - the actual end.directoryRecords is a uint64, so it could hold more than 16 bits. I would assume this was done to support zip files produced by certain software which presumably used only 16 bits despite the file format allowing for more. Given that this is checking for an error case, and given the uncertainty around what might break if we change this, I think it's better to go with the original code.

Contributor

Okay, here's my best attempt at explaining why the standard library is truncating the values:

  • There are two zip file formats, zip and zip64. The former only supports a limited number of files (2^16 worth) while the latter is unlimited.
  • However the file count in the older zip format is just a counter. You can write as many records into the zip and central directory as you want, exceeding the 2^16 maximum count. In that case the count cannot be correct for the number of files actually included in the archive.
  • You could declare this to be an error, and cap the older zip format at 2^16 files. There is presumably zip software out there that is happy to produce archives in violation of this.
  • Therefore, in order to be as accepting of different zip files as possible, the comparison here is truncated to just 16 bits, so that archives where more than 2^16 files have been stuffed into them without using the zip64 format are still supported.
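For illustration, a minimal sketch of the truncated check being described (the function name and parameters here are made up for the example, not taken from the PR):

```go
// directoryCountMatches compares only the low 16 bits of both counts, mirroring
// the upstream behaviour so legacy archives with a wrapped counter still pass.
func directoryCountMatches(walkedFiles, directoryRecords uint64) bool {
	return uint16(walkedFiles) == uint16(directoryRecords)
}
```

For example, a non-zip64 archive holding 65578 entries stores a wrapped record count of 42, and uint16(65578) is also 42, so the truncated comparison still succeeds where the full-width one would not.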

@glynternet glynternet force-pushed the gh/memory-considerate-zip-walker branch from 398b785 to e628f15 Compare January 21, 2022 11:39
@glynternet glynternet force-pushed the gh/memory-considerate-zip-walker branch from 1b93a95 to 9e824a4 Compare January 21, 2022 12:35
@hpryce
Contributor

hpryce commented Jan 21, 2022

👍

@bulldozer-bot bulldozer-bot bot merged commit 6b3db03 into develop Jan 21, 2022
@bulldozer-bot bulldozer-bot bot deleted the gh/memory-considerate-zip-walker branch January 21, 2022 12:48
@svc-autorelease
Collaborator

Released 1.6.0
