-
Notifications
You must be signed in to change notification settings - Fork 2.7k
reference: use named capturing groups #3803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
// TODO(thaJeztah): disambiguate: docker requires domain-name to be either; | ||
// - localhost (special case) | ||
// - at least one "." | ||
// - or a ":port" | ||
// | ||
// Any other domain is considered a path-element. | ||
domainName = domainNameComponent + oneOrMore(`\.`+domainNameComponent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still to be looked at; I think the new matches are already better than the old ones, but needs some eyes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^^ to outline; previously the Regex would handle these as "namespace", and then the code processing the output would mark them as "domain". In this implementation, the Regex already splits them to be a domain (after which the code can do further handling).
(this is why some of the tests were updated, because they tested the Regex "incorrect" behavior)
Codecov ReportPatch coverage:
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. Additional details and impacted files@@ Coverage Diff @@
## main #3803 +/- ##
==========================================
+ Coverage 56.88% 56.96% +0.07%
==========================================
Files 106 106
Lines 10703 10711 +8
==========================================
+ Hits 6088 6101 +13
+ Misses 3942 3937 -5
Partials 673 673
☔ View full report in Codecov by Sentry. |
5435a29
to
233f7ad
Compare
This comment was marked as resolved.
This comment was marked as resolved.
233f7ad
to
ae426ce
Compare
ae426ce
to
22f4e75
Compare
719d5ac
to
6d7b3e9
Compare
6d7b3e9
to
9573e05
Compare
Rewrite the regular expressions to use named capturing groups. This simplifies handling the resulting matches, but does require some additional handling to associate matches with their names. Also making some changes to the matching to match how domains are _actually_ matched; some of that was already handled in code parsing the results of the regex, but now this is handled by the regex itself. Before: BenchmarkParse BenchmarkParse-10 12696 93805 ns/op 9311 B/op 185 allocs/op PASS After: BenchmarkParse BenchmarkParse-10 12486 94774 ns/op 18617 B/op 178 allocs/op PASS Benchstat: go test -run='^$' -bench=. -count=10 ./reference/ > old.txt go test -run='^$' -bench=. -count=10 ./reference/ > new.txt benchstat old.txt new.txt name old time/op new time/op delta Parse-10 91.7µs ± 0% 97.0µs ±11% +5.82% (p=0.000 n=9+10) name old alloc/op new alloc/op delta Parse-10 9.32kB ± 0% 18.63kB ± 0% +99.93% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Parse-10 185 ± 0% 178 ± 0% -3.78% (p=0.000 n=10+10) Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
9573e05
to
365c733
Compare
@corhere @squizzi @milosgajdos interested to hear your take on this; I think this makes the code/handling easier to reason with, and the Regexes more "complete", and potentially more usable. But (as commented) does come with an increase memory use. Wondering if we think the pros outweigh the cons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The names of capture groups don't have to be unique, which could make things weird when composing our expressions with others. And the capture group names become part of the exported API surface, which I'm squeamish about. I would be okay with using named capture groups IF none of the regexes are exported.
I'm really skeptical about the getNamedMatches
function. If we want to have symbolic names for referencing the capture groups, assigning indices returned from (*Regexp).SubexpIndex
to variables would suffice. That would cut down on allocations by not building a map copy of the match slice.
func mustSubexpIndex(re *regexp.Regexp, name string) int {
i := re.SubexpIndex(name)
if i < 0 {
panic("no subexpression named "+name)
}
return i
}
var (
referenceName = mustSubexpIndex(referenceRegexp, "name")
)
if len(r.SubexpNames()) == 0 { | ||
panic("regex does not have named subexpressions: " + r.String()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition will never be true as len(r.SubexpNames()) > 0
for all regexps. r.SubexpNames()
will also have indices for every unnamed subexpression, with the empty string as the value.
if ok = len(matches) > 0; !ok { | ||
return nil, false | ||
} | ||
namedMatches = make(map[string]string, len(matches)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
namedMatches = make(map[string]string, len(matches)) | |
namedMatches = make(map[string]string, len(matches)-1) |
return nil, false | ||
} | ||
namedMatches = make(map[string]string, len(matches)) | ||
// We loop through matches here, in case there's optional named match-groups. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing optional about optional capture groups is that they don't have to match anything. The indices of the matches
slice don't change around. If they did, the indices in r.SubexpNames()
would not line up. When matches != nil
, len(matches)
is constant for a given expression, irrespective of the input.
The previous way of detecting the name and version to supply to Dependency-Track was very brittle, it already failed for image references including a hash, resulting in names like hello-world@sha256, because it would only split on ':' and then select the first and last component. This version uses a regex to as accurately as possible match the individual components of a docker image reference. The regex comes from the [official implementation] on GitHub, but is actually taken from a [pull request], which adds named capture groups and fixes an issue with domain recognition being too eager. Yes the regex looks pretty wild, yes there are tests. I don't think it makes sense to build the regex from the individual components like the docker library does it. Unfortunately this does not solve the problem of actually getting the reference from somewhere, for images it works with getting it from the name of the main component, but this part stays brittle. Annotations to the scans might be a possible solution for that. [official implementation]: https://github.com/distribution/reference [pull request]: distribution/distribution#3803 Signed-off-by: Lukas Fischer <lukas.fischer@iteratec.com>
The previous way of detecting the name and version to supply to Dependency-Track was very brittle, it already failed for image references including a hash, resulting in names like hello-world@sha256, because it would only split on ':' and then select the first and last component. This version uses a regex to as accurately as possible match the individual components of a docker image reference. The regex comes from the [official implementation] on GitHub, but is actually taken from a [pull request], which adds named capture groups and fixes an issue with domain recognition being too eager. Yes the regex looks pretty wild, yes there are tests. I don't think it makes sense to build the regex from the individual components like the docker library does it. Unfortunately this does not solve the problem of actually getting the reference from somewhere, for images it works with getting it from the name of the main component, but this part stays brittle. Annotations to the scans might be a possible solution for that. [official implementation]: https://github.com/distribution/reference [pull request]: distribution/distribution#3803 Signed-off-by: Lukas Fischer <lukas.fischer@iteratec.com>
@thaJeztah do you mind moving this PR to the https://github.com/distribution/reference repo? |
ping @thaJeztah we need to move this to the new repo. I don't think you can move PRs between repos (I know you can move issues), so we might wanna close this and open it in the new repo. |
@thaJeztah do you want to close this now that |
There are a bunch of conflicts now due to the reference package being moved to a dedicated repo. Mind closing this @thaJeztah or moving it to |
Closing as We need to re-open the PR there. |
Rewrite the regular expressions to use named capturing groups.
This simplifies handling the resulting matches, but does require
some additional handling to associate matches with their names.
Also making some changes to the matching to match how domains
are actually matched; some of that was already handled in
code parsing the results of the regex, but now this is handled
by the regex itself.
Before:
After:
Benchstat: