Skip to content

Deduplicate packages across multiple container image layers #930

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 31, 2022

Conversation

wagoodman
Copy link
Contributor

@wagoodman wagoodman commented Mar 30, 2022

This PR adds the ability to deduplicate packages that have the same information found on the same path within multiple layers.

Additionally, []source.Location has been replaced with source.LocationSet on the definition of a package. Why? For two reasons:

  1. The ordering of otherwise the same locations for two package should not cause a distinction to be made when deriving the package ID. In this case, the package IDs should be the same (assuming all other fields on the package also match).
  2. Since pkg.Package.Locations factors into the package ID value, it is convenient to change the response of pkg.Package.Locations.Hash() to control when the ID value should change.

Specific changes:

  • Add source.LocationSet, which represents a unique set of location objects. Hash() on this object is derived from the st of real paths from all locations (ignores virtual paths and filesystems). This allows for IDs for locations on packages to match when they have the same locations.
  • Augment source.CoordinateSet.Hash() to respond based off of the ordered slice that the set if made up of. Note: this change is in response to a bug found when adding tests, not to support pkg.Package.Locations.Hash() functionality.
  • Add the java metadata virtual path to the set of encoded cyclonedx properties on a package so that encode-decode-encode cycles are stable.
  • Includes pURL within the bytes that are input in the package ID hash. This is because pkg.Package.merge() cannot merge string fields, one would need to be dropped. Since the decision to make pURL had not considered the relatively new Format.Decode() paths from other SBOM tools, the assumption that "all pURL fields are derived from package fields" is not necessarily true. If a pURL does not match between the two packages then a) pkg.Package.merge() will return an error, and b) pkg.Catalog.Add() will log a warning. Since this is synthetic data (added after cataloging) there will be a encode-decode cycle mismatch for ID values, since the first ID derivation will not have a pURL associated yet (and subsequent ID derivations after an encode-decode cycle will).
  • Add pkg.Package.merge() operation, which merges CPE fields from another package onto the current package (only if the IDs for both packages matches).
  • Add pkg.Catalog.PackagesByName() helper, which was very convenient in testing and seems generally useful.
  • Index values in pkg.Catalog are also now deduplicated with a new local pkg.artifactIDSet that is only meant for the package catalog.

Closes #32

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman requested a review from a team March 30, 2022 21:57
@wagoodman wagoodman self-assigned this Mar 30, 2022
@github-actions
Copy link

github-actions bot commented Mar 30, 2022

Benchmark Test Results

Benchmark results from the latest changes vs base branch
name                                                       old time/op    new time/op    delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2              1.46ms ± 1%    1.34ms ± 6%   -7.98%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2            3.61ms ± 2%    3.37ms ± 9%     ~     (p=0.151 n=5+5)
ImagePackageCatalogers/php-composer-installed-cataloger-2    1.17ms ± 1%    1.08ms ± 3%   -7.90%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2         796µs ± 1%     733µs ± 1%   -7.88%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                     955µs ± 4%     862µs ± 3%   -9.75%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                      823µs ± 1%     759µs ± 2%   -7.71%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                      14.8ms ± 1%    13.8ms ± 2%   -6.92%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                     1.36ms ± 1%    1.30ms ± 4%   -4.07%  (p=0.016 n=5+5)
ImagePackageCatalogers/go-module-binary-cataloger-2          2.41µs ± 1%    2.32µs ± 1%   -3.54%  (p=0.008 n=5+5)

name                                                       old alloc/op   new alloc/op   delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2               182kB ± 0%     183kB ± 0%   +0.56%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2             893kB ± 0%     893kB ± 0%     ~     (p=0.841 n=5+5)
ImagePackageCatalogers/php-composer-installed-cataloger-2     194kB ± 0%     195kB ± 0%   +0.63%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2         138kB ± 0%     139kB ± 0%   +0.70%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                     173kB ± 0%     174kB ± 0%   +0.87%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                      162kB ± 0%     162kB ± 0%   +0.48%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                      3.18MB ± 0%    3.18MB ± 0%     ~     (p=0.095 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                     1.23MB ± 0%    1.24MB ± 0%   +0.13%  (p=0.008 n=5+5)
ImagePackageCatalogers/go-module-binary-cataloger-2            608B ± 0%      672B ± 0%  +10.53%  (p=0.008 n=5+5)

name                                                       old allocs/op  new allocs/op  delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2               3.67k ± 0%     3.62k ± 0%   -1.42%  (p=0.029 n=4+4)
ImagePackageCatalogers/python-package-cataloger-2             14.9k ± 0%     14.6k ± 0%   -1.94%  (p=0.008 n=5+5)
ImagePackageCatalogers/php-composer-installed-cataloger-2     4.98k ± 0%     4.90k ± 0%   -1.59%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2         2.70k ± 0%     2.68k ± 0%   -0.93%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                     3.97k ± 0%     3.89k ± 0%   -1.99%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                      3.99k ± 0%     3.97k ± 0%   -0.63%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                       51.9k ± 0%     51.8k ± 0%   -0.19%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                      4.83k ± 0%     4.77k ± 0%   -1.08%  (p=0.016 n=4+5)
ImagePackageCatalogers/go-module-binary-cataloger-2            14.0 ± 0%      15.0 ± 0%   +7.14%  (p=0.008 n=5+5)

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman force-pushed the deduplicate-packages branch from 08630b9 to 1f9aed8 Compare March 30, 2022 22:04
Copy link
Contributor

@spiffcs spiffcs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just small clarifying comments from my part. The rest LGTM. I did notice a test was failing while doing the review so I think that needs to be fixed before merge 👍

packages_cmd_test.go:222: expected package count of 22, but found 20 <-- The pacakges cmd is coming up with 22

}
return
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to update any of the maps on the catalog after the package has been merged or is Add the first time we see p?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me think on that. I think you might be right about needing to update indexes for paths.

@@ -1,20 +1,20 @@
package cpe
package pkg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this moved because of the merge function that was added for packages?

Copy link
Contributor Author

@wagoodman wagoodman Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, otherwise we run into a package cycle. This change didn't feel that great to me. I think if we split out cpe definitions from capabilities and keep those in separate packages that would solve this issue. (this would be a good topic for the top-level API refactor issue)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the catalog to include path indexes as well -- good call . Commit 414acc6

PURL string `hash:"ignore"` // the Package URL (see https://github.com/package-url/purl-spec) (note: this is NOT included in the definition of the ID since all fields on a pURL are derived from other fields)
MetadataType MetadataType `cyclonedx:"metadataType"` // the shape of the additional data in the "metadata" field
Metadata interface{} // additional data found while parsing the package source
id artifact.ID `hash:"ignore"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird that the linter added the white space before each tag here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this difference when using goland or vs code

@@ -46,3 +46,21 @@ func (p Package) ID() artifact.ID {
func (p Package) String() string {
return fmt.Sprintf("Pkg(name=%q version=%q type=%q id=%q)", p.Name, p.Version, p.Type, p.id)
}

func (p *Package) merge(other Package) error {
if p.id != other.id {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be sure - these are the only two conditions we would not want to merge on?

There is no use in comparing CPE either here even if they are ignored by the hash?

Copy link
Contributor Author

@wagoodman wagoodman Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are the only two conditions we would not want to merge on

correct. The ID check covers all fields that would change an ID (which is most of them), which leaves CPEs and pURLs. CPEs are a slice, so they can be merged, but the pURL field is not a collection, thus cannot be merged.

return coordinates
}

func (s CoordinateSet) Hash() (uint64, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this because when generating the ID CoordinateSet is considered?

Are there any times where we are updating/merging a coordinate set AFTER an ID has been generated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since LocationSet hash is based on the paths of the coordinate set, the hash of the coordinate set is unrelated to packages. This is to fix when adding files to the SBOM (and relationships to files) such that there is a consistent ID for coordinate sets (I was seeing variation when adding explicit tests for this, so provided an explicit hash function based off of a stable slice)

@wagoodman wagoodman marked this pull request as draft March 31, 2022 13:38
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman marked this pull request as ready for review March 31, 2022 15:14
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@spiffcs
Copy link
Contributor

spiffcs commented Mar 31, 2022

@wagoodman I went through all the files again and nothing stood out after our call. New changes also LGTM

@wagoodman wagoodman changed the title Deduplicate identical packages across multiple container image layers Deduplicate packages across multiple container image layers Mar 31, 2022
@wagoodman wagoodman merged commit f24bbc1 into main Mar 31, 2022
@wagoodman wagoodman deleted the deduplicate-packages branch March 31, 2022 19:45
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SBOM from all-layers scope showing duplicate packages
3 participants