
Conversation

glynternet
Contributor

@glynternet glynternet commented Jan 21, 2022

The standard Go zip.Reader implementation creates a large slice of *zip.File when initialising the Reader. This is problematic for extremely large zip files, as the slice alone can consume a significant amount of memory.

This PR implements two zip walking functions that use a version of the standard Go Reader to walk through all zip file contents without creating this large slice, using an iterative/streaming approach instead.

To see the changes from the standard go zip package, compare the first commit against the following commits: 9d735f4...gh/memory-considerate-zip-walker

Description of changes relative to the standard zip package codebase

Intent

The following modifications have been made to the original standard zip package to allow for iterating through zipped content in a memory-considerate manner. The standard package is optimised for different goals: the upstream implementation collects a slice of *File with one entry per file contained within the zip. This implementation works in an iterative/streaming manner: upon encountering a *File in the zip directory tree, the file is immediately passed to a user-provided function for processing.
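For contrast, a minimal sketch of the standard-library pattern this package avoids, where every entry in the central directory is materialised up front:

```go
package main

import (
	"archive/zip"
	"fmt"
	"log"
)

func main() {
	// Standard-library approach: OpenReader eagerly builds r.File, a
	// []*zip.File with one entry per file in the archive, before any
	// entry can be processed.
	r, err := zip.OpenReader("archive.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()
	for _, f := range r.File {
		fmt.Println(f.Name)
	}
}
```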

Package API

Two top-level functions have been introduced, zip.WalkZipFile and zip.WalkZipReaderAt, akin to the standard package's zip.OpenReader and zip.NewReader functions respectively, but accepting an additional parameter, a WalkFn. They immediately walk through the contents of the zip provided by the file or reader, passing each entry to the given WalkFn.

The WalkFn, when passed a *File, should return a bool and an error. A false bool or a non-nil error will cause the walk to stop and return the error to the user.
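A minimal usage sketch, assuming the signatures described above (WalkZipFile(name string, fn WalkFn) error with WalkFn being func(*File) (bool, error)); the import path is a placeholder for wherever this package lives in the repository:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	zip "example.com/project/internal/zip" // placeholder import path for this package
)

func main() {
	// Walk every entry as it is encountered; no []*zip.File is accumulated.
	err := zip.WalkZipFile("archive.zip", func(f *zip.File) (bool, error) {
		fmt.Println(f.Name)
		if strings.HasPrefix(f.Name, "stop/") {
			return false, nil // returning false stops the walk with no error
		}
		return true, nil // keep walking
	})
	if err != nil {
		log.Fatal(err)
	}
}
```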

Changes made relative to the codebase in the standard zip package

  • All Write-related code has been removed where possible. We are only interested in reading.
  • All ways to create a Reader/ReadCloser have been removed from the package API, leaving the Walk functions that we have implemented as the single entrypoint.
  • After making the above changes, all unused code has been removed apart from some constants that were part of constant groups.
  • The Reader.init method has been renamed to Reader.walk to represent the functionality that it now holds. More details on these changes can be found below.
  • The Reader.Files field, a []*File, has been removed to make it impossible to incur the memory penalty of accumulating a slice of *File details.
  • All tests that are appropriate to keep have been migrated to represent equivalents of their original but using the new Walk methods. An extra test file reader_local_test.go has been introduced, containing tests specific for our implementation.
  • Some tests that use standard library internal packages have been removed, as it is impossible for us to reference them.
  • Examples have been removed.
  • Some general changes have been made so that the codebase meets our linter requirements, e.g. matching names of method receivers across method definitions, some extra error checking, and removal of unnecessary conversions between value types.

Changes to Reader.init/Reader.walk

  • Reader.init is where the standard package reads through the directory tree of the zip file, accumulating a *File for each file in the zip into a slice. This method has been changed to accept a WalkFn; each *File is passed to it instead of being appended to a slice. The WalkFn is passed into the two top-level Walk functions by the caller.
  • All logic related to Reader.Files has been removed.
  • The given WalkFn, when passed a *File, should return a bool and an error. A false bool or a non-nil error will cause the walk to stop and return the error to the user.
  • If the WalkFn never returns a false or a non-nil error, the expected number of directory records is compared against the number of files iterated through, in a similar manner to how the length of Reader.Files was previously checked. If the numbers do not match, an error is returned from the top-level Walk functions.
  • File.readDataDescriptor is no longer called on each file as it is iterated. This change has been made for performance reasons: it is an expensive call and the information it populates is not required unless the File is being opened. It is now called within the File.Open method so that it only happens if the file is actually opened. A user can, for example, decide from the filename or size whether a file should be opened, and f.readDataDescriptor will only be called for the files that are opened (see the sketch below).
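A sketch of the selective pattern this enables, again assuming the signatures described above and the same placeholder import path as the earlier example; only entries that are actually opened pay the data-descriptor cost:

```go
import (
	"io"
	"strings"

	zip "example.com/project/internal/zip" // placeholder import path, as above
)

// processJSONEntries is a hypothetical WalkFn: only .json entries are opened,
// so the data descriptor (now read lazily inside Open) is only read for them.
func processJSONEntries(f *zip.File) (bool, error) {
	if !strings.HasSuffix(f.Name, ".json") {
		return true, nil // skipped entries never trigger a data-descriptor read
	}
	rc, err := f.Open() // readDataDescriptor happens here, on demand
	if err != nil {
		return false, err // a non-nil error stops the walk
	}
	defer rc.Close()
	if _, err := io.Copy(io.Discard, rc); err != nil { // stand-in for real processing
		return false, err
	}
	return true, nil
}
```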

Changes to File

  • File.readDataDescriptor now accepts a bodyOffset rather than calling findBodyOffset to determine it. This change was made because File.Open now finds the body offset itself and can inject it into readDataDescriptor, avoiding findBodyOffset being called multiple times per Open.

@changelog-app

changelog-app bot commented Jan 21, 2022

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

The zip walking logic that previously used the standard package has been replaced with a memory-optimised version implemented specifically for the goals of this product.

Previously, the memory used for indexing the entries within a zip was proportional to the number of files contained within it. As of this release, no index is created, so this memory overhead has been removed.

Check the box to generate changelog(s)

  • Generate changelog entry

return err
}
}
if numFiles != end.directoryRecords { // only compare 16 bits here
Contributor

As per the original code and the comment, I believe end.directoryRecords is truncated and so this will fail if there are more than the max value of a uint16 files. Restoring the uint16() conversions here should fix this.

Contributor Author

I believe the test TestOver65kFiles covers this case.
Max uint16 is 65535, number of files used in that test case is 65578.
I've added a counter to check that 65578 is the number of times the handler is called in this test.
Fancy double checking to see if there's something I'm missing here?

Contributor

I've tried going through the file history and can't seem to find much information to support why this was being done - the actual end.directoryRecords is a uint64, so it could hold more than 16 bits. I would assume this was done to support zip files produced by certain software which presumably used only 16 bits despite the file format allowing for more. Given that this is checking for an error case, and given the uncertainty around what might break if we change this, I think it's better to go with the original code.

Contributor

Okay, here's my best attempt at explaining why the standard library is truncating the values:

  • There are two zip file formats, zip and zip64. The former only supports a limited number of files (2^16 worth) while the latter is unlimited.
  • However the file count in the older zip format is just a counter. You can write as many records into the zip and central directory as you want, exceeding the 2^16 maximum count. In that case the count cannot be correct for the number of files actually included in the archive.
  • You could declare this to be an error, and cap the older zip format at 2^16 files. There is presumably zip software out there that is happy to produce archives in violation of this.
  • Therefore, in order to be as accepting of different zip files as possible, the comparison here is truncated to just 16 bits, so that archives where more than 2^16 files have been stuffed into them without using the zip64 format are still supported.
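For illustration, a minimal sketch of the truncated check being described (the function name and parameters here are made up for the example, not taken from the PR):

```go
// directoryCountMatches compares only the low 16 bits of both counts, mirroring
// the upstream behaviour so legacy archives with a wrapped counter still pass.
func directoryCountMatches(walkedFiles, directoryRecords uint64) bool {
	return uint16(walkedFiles) == uint16(directoryRecords)
}
```

For example, a non-zip64 archive holding 65578 entries stores a wrapped record count of 42, and uint16(65578) is also 42, so the truncated comparison still succeeds where the full-width one would not.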

@glynternet glynternet force-pushed the gh/memory-considerate-zip-walker branch from 398b785 to e628f15 Compare January 21, 2022 11:39
@glynternet glynternet force-pushed the gh/memory-considerate-zip-walker branch from 1b93a95 to 9e824a4 Compare January 21, 2022 12:35
@hpryce
Contributor

hpryce commented Jan 21, 2022

👍

@bulldozer-bot bulldozer-bot bot merged commit 6b3db03 into develop Jan 21, 2022
@bulldozer-bot bulldozer-bot bot deleted the gh/memory-considerate-zip-walker branch January 21, 2022 12:48
@svc-autorelease
Collaborator

Released 1.6.0
