-
Notifications
You must be signed in to change notification settings - Fork 693
Deduplicate packages across multiple container image layers #930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Benchmark Test ResultsBenchmark results from the latest changes vs base branch
|
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
08630b9
to
1f9aed8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just small clarifying comments from my part. The rest LGTM. I did notice a test was failing while doing the review so I think that needs to be fixed before merge 👍
packages_cmd_test.go:222: expected package count of 22, but found 20
<-- The pacakges cmd is coming up with 22
} | ||
return | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to update any of the maps on the catalog after the package has been merged or is Add
the first time we see p
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me think on that. I think you might be right about needing to update indexes for paths.
@@ -1,20 +1,20 @@ | |||
package cpe | |||
package pkg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this moved because of the merge
function that was added for packages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed, otherwise we run into a package cycle. This change didn't feel that great to me. I think if we split out cpe definitions from capabilities and keep those in separate packages that would solve this issue. (this would be a good topic for the top-level API refactor issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the catalog to include path indexes as well -- good call . Commit 414acc6
PURL string `hash:"ignore"` // the Package URL (see https://github.com/package-url/purl-spec) (note: this is NOT included in the definition of the ID since all fields on a pURL are derived from other fields) | ||
MetadataType MetadataType `cyclonedx:"metadataType"` // the shape of the additional data in the "metadata" field | ||
Metadata interface{} // additional data found while parsing the package source | ||
id artifact.ID `hash:"ignore"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird that the linter added the white space before each tag here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed this difference when using goland or vs code
@@ -46,3 +46,21 @@ func (p Package) ID() artifact.ID { | |||
func (p Package) String() string { | |||
return fmt.Sprintf("Pkg(name=%q version=%q type=%q id=%q)", p.Name, p.Version, p.Type, p.id) | |||
} | |||
|
|||
func (p *Package) merge(other Package) error { | |||
if p.id != other.id { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be sure - these are the only two conditions we would not want to merge on?
There is no use in comparing CPE either here even if they are ignored by the hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are the only two conditions we would not want to merge on
correct. The ID check covers all fields that would change an ID (which is most of them), which leaves CPEs and pURLs. CPEs are a slice, so they can be merged, but the pURL field is not a collection, thus cannot be merged.
return coordinates | ||
} | ||
|
||
func (s CoordinateSet) Hash() (uint64, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this because when generating the ID CoordinateSet
is considered?
Are there any times where we are updating/merging a coordinate set AFTER an ID has been generated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since LocationSet hash is based on the paths of the coordinate set, the hash of the coordinate set is unrelated to packages. This is to fix when adding files to the SBOM (and relationships to files) such that there is a consistent ID for coordinate sets (I was seeing variation when adding explicit tests for this, so provided an explicit hash function based off of a stable slice)
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman I went through all the files again and nothing stood out after our call. New changes also LGTM |
This PR adds the ability to deduplicate packages that have the same information found on the same path within multiple layers.
Additionally,
[]source.Location
has been replaced withsource.LocationSet
on the definition of a package. Why? For two reasons:pkg.Package.Locations
factors into the package ID value, it is convenient to change the response ofpkg.Package.Locations.Hash()
to control when the ID value should change.Specific changes:
source.LocationSet
, which represents a unique set of location objects.Hash()
on this object is derived from the st of real paths from all locations (ignores virtual paths and filesystems). This allows for IDs for locations on packages to match when they have the same locations.source.CoordinateSet.Hash()
to respond based off of the ordered slice that the set if made up of. Note: this change is in response to a bug found when adding tests, not to supportpkg.Package.Locations.Hash()
functionality.IncludesIf a pURL does not match between the two packages then a)pURL
within the bytes that are input in the package ID hash. This is becausepkg.Package.merge()
cannot merge string fields, one would need to be dropped. Since the decision to makepURL
had not considered the relatively newFormat.Decode()
paths from other SBOM tools, the assumption that "all pURL fields are derived from package fields" is not necessarily true.pkg.Package.merge()
will return an error, and b)pkg.Catalog.Add()
will log a warning. Since this is synthetic data (added after cataloging) there will be a encode-decode cycle mismatch for ID values, since the first ID derivation will not have a pURL associated yet (and subsequent ID derivations after an encode-decode cycle will).pkg.Package.merge()
operation, which merges CPE fields from another package onto the current package (only if the IDs for both packages matches).pkg.Catalog.PackagesByName()
helper, which was very convenient in testing and seems generally useful.pkg.Catalog
are also now deduplicated with a new localpkg.artifactIDSet
that is only meant for the package catalog.Closes #32