reference: use named capturing groups #3803

thaJeztah · 2022-11-25T14:34:28Z

depends on reference: rewrite test to use sub-tests, add benchmark #3883

Rewrite the regular expressions to use named capturing groups.
This simplifies handling the resulting matches, but does require
some additional handling to associate matches with their names.

This also changes the matching to reflect how domains are
actually matched; some of that was previously handled in the
code that parses the regex results, but is now handled by
the regex itself.

Before:

BenchmarkParse
BenchmarkParse-10              12696             93805 ns/op            9311 B/op        185 allocs/op
PASS

After:

BenchmarkParse
BenchmarkParse-10              12486             94774 ns/op           18617 B/op        178 allocs/op
PASS

Benchstat:

go test -run='^$' -bench=. -count=10 ./reference/ > old.txt
go test -run='^$' -bench=. -count=10 ./reference/ > new.txt

benchstat old.txt new.txt
name       old time/op    new time/op    delta
Parse-10   91.7µs ± 0%    97.0µs ±11%   +5.82%  (p=0.000 n=9+10)

name       old alloc/op   new alloc/op   delta
Parse-10    9.32kB ± 0%   18.63kB ± 0%  +99.93%  (p=0.000 n=10+10)

name       old allocs/op  new allocs/op  delta
Parse-10        185 ± 0%       178 ± 0%   -3.78%  (p=0.000 n=10+10)

thaJeztah · 2022-11-25T14:35:41Z

reference/regexp.go

+	// TODO(thaJeztah): disambiguate: docker requires domain-name to be either;
+	// - localhost (special case)
+	// - at least one "."
+	// - or a ":port"
+	//
+	// Any other domain is considered a path-element.
+	domainName = domainNameComponent + oneOrMore(`\.`+domainNameComponent)


This still needs to be looked at; I think the new matches are already better than the old ones, but it needs some eyes.

^^ to outline; previously the Regex would handle these as "namespace", and then the code processing the output would mark them as "domain". In this implementation, the Regex already splits them to be a domain (after which the code can do further handling).

(this is why some of the tests were updated, because they tested the Regex "incorrect" behavior)
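The disambiguation rule from the TODO above can be sketched in plain Go, independent of the regex. This is only an illustration of the rule as stated in the comment (localhost, at least one ".", or a ":port"), not code from this PR; `looksLikeDomain` and `splitDomain` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeDomain reports whether the first path component should be
// treated as a domain: "localhost", or anything containing "." or ":".
// Hypothetical helper, named here for illustration only.
func looksLikeDomain(first string) bool {
	return first == "localhost" || strings.ContainsAny(first, ".:")
}

// splitDomain splits ref at the first "/" and returns the domain (if the
// first component qualifies as one) and the remainder. Otherwise the
// whole reference is returned as the remainder (a path element).
func splitDomain(ref string) (domain, remainder string) {
	first, rest, found := strings.Cut(ref, "/")
	if !found || !looksLikeDomain(first) {
		return "", ref
	}
	return first, rest
}

func main() {
	refs := []string{
		"localhost/foo",
		"example.com/foo/bar",
		"registry:5000/foo",
		"library/ubuntu",
		"ubuntu",
	}
	for _, ref := range refs {
		d, r := splitDomain(ref)
		fmt.Printf("%-22s domain=%q remainder=%q\n", ref, d, r)
	}
}
```

With this rule, `library/ubuntu` has no domain and is kept whole as a path, while `registry:5000/foo` is split into a domain and a path.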

codecov-commenter · 2022-11-25T14:38:08Z

Codecov Report

Patch coverage: 91.11% and project coverage change: +0.07 🎉

Comparison is base (8e29e87) 56.88% compared to head (365c733) 56.96%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3803      +/-   ##
==========================================
+ Coverage   56.88%   56.96%   +0.07%     
==========================================
  Files         106      106              
  Lines       10703    10711       +8     
==========================================
+ Hits         6088     6101      +13     
+ Misses       3942     3937       -5     
  Partials      673      673

Impacted Files	Coverage Δ
reference/reference.go	`80.85% <83.33%> (+2.56%)`	⬆️
reference/regexp.go	`92.00% <89.47%> (-8.00%)`	⬇️
reference/normalize.go	`82.07% <100.00%> (+0.17%)`	⬆️


Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

thaJeztah · 2023-05-11T19:57:39Z

@corhere @squizzi @milosgajdos interested to hear your take on this; I think this makes the code/handling easier to reason about, and the Regexes more "complete", and potentially more usable. But (as commented) it does come with increased memory use. Wondering if we think the pros outweigh the cons.

corhere

The names of capture groups don't have to be unique, which could make things weird when composing our expressions with others. And the capture group names become part of the exported API surface, which I'm squeamish about. I would be okay with using named capture groups IF none of the regexes are exported.

I'm really skeptical about the getNamedMatches function. If we want to have symbolic names for referencing the capture groups, assigning indices returned from (*Regexp).SubexpIndex to variables would suffice. That would cut down on allocations by not building a map copy of the match slice.

func mustSubexpIndex(re *regexp.Regexp, name string) int {
    i := re.SubexpIndex(name)
    if i < 0 {
        panic("no subexpression named "+name)
    }
    return i
}

var (
    referenceName = mustSubexpIndex(referenceRegexp, "name")
)
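For context, here is a self-contained sketch of how indices resolved once at package initialization could be used when reading a match. The pattern `exampleRegexp` is a deliberately simplified stand-in, not the reference grammar, and `parse`, `nameIndex` and `tagIndex` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// exampleRegexp is a simplified stand-in for the real reference
// expression; it only knows about a lowercase name and an optional tag.
var exampleRegexp = regexp.MustCompile(
	`^(?P<name>[a-z0-9]+(?:[._/-][a-z0-9]+)*)(?::(?P<tag>\w[\w.-]*))?$`)

// mustSubexpIndex panics if the expression has no group with that name.
func mustSubexpIndex(re *regexp.Regexp, name string) int {
	i := re.SubexpIndex(name)
	if i < 0 {
		panic("no subexpression named " + name)
	}
	return i
}

// The indices are looked up once; no per-match map is allocated.
var (
	nameIndex = mustSubexpIndex(exampleRegexp, "name")
	tagIndex  = mustSubexpIndex(exampleRegexp, "tag")
)

// parse returns the name and tag by indexing the match slice directly.
func parse(s string) (name, tag string, ok bool) {
	m := exampleRegexp.FindStringSubmatch(s)
	if m == nil {
		return "", "", false
	}
	return m[nameIndex], m[tagIndex], true
}

func main() {
	for _, s := range []string{"library/ubuntu:22.04", "ubuntu", "Not Valid"} {
		name, tag, ok := parse(s)
		fmt.Printf("%q -> name=%q tag=%q ok=%v\n", s, name, tag, ok)
	}
}
```

The only allocation per call is the match slice returned by `FindStringSubmatch`; the symbolic names cost nothing at match time.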

corhere · 2023-05-11T20:15:40Z

reference/regexp.go

+	if len(r.SubexpNames()) == 0 {
+		panic("regex does not have named subexpressions: " + r.String())
+	}


The condition will never be true as len(r.SubexpNames()) > 0 for all regexps. r.SubexpNames() will also have indices for every unnamed subexpression, with the empty string as the value.
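A small sketch that shows this behaviour of `SubexpNames` and a check that would actually detect the absence of named groups. The three patterns and the helper `hasNamedGroup` are made up for illustration and are not part of the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	noGroups = regexp.MustCompile(`ab`)            // no capture groups at all
	unnamed  = regexp.MustCompile(`(a)(b)?`)       // two unnamed groups
	mixed    = regexp.MustCompile(`(a)(?P<x>b)?`)  // one unnamed, one named
)

// hasNamedGroup reports whether at least one capture group has a name.
// Checking len(re.SubexpNames()) == 0 cannot work, because index 0
// (the whole match) is always present with an empty name.
func hasNamedGroup(re *regexp.Regexp) bool {
	for _, n := range re.SubexpNames() {
		if n != "" {
			return true
		}
	}
	return false
}

func main() {
	for _, re := range []*regexp.Regexp{noGroups, unnamed, mixed} {
		fmt.Printf("%-16s names=%q named=%v\n",
			re.String(), re.SubexpNames(), hasNamedGroup(re))
	}
}
```

Even the pattern without any groups reports a single empty name, so the length of `SubexpNames()` is always `NumSubexp()+1`.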

corhere · 2023-05-11T20:15:57Z

reference/regexp.go

+	if ok = len(matches) > 0; !ok {
+		return nil, false
+	}
+	namedMatches = make(map[string]string, len(matches))


Suggested change

-	namedMatches = make(map[string]string, len(matches))
+	namedMatches = make(map[string]string, len(matches)-1)

corhere · 2023-05-11T20:59:35Z

reference/regexp.go

+		return nil, false
+	}
+	namedMatches = make(map[string]string, len(matches))
+	// We loop through matches here, in case there's optional named match-groups.


The only thing optional about optional capture groups is that they don't have to match anything. The indices of the matches slice don't change around. If they did, the indices in r.SubexpNames() would not line up. When matches != nil, len(matches) is constant for a given expression, irrespective of the input.
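To make this concrete, here is a minimal sketch with an optional named group. The pattern `optionalTag` is invented for the example and is far simpler than the reference grammar.

```go
package main

import (
	"fmt"
	"regexp"
)

// optionalTag has one required group (name) and one optional group (tag).
var optionalTag = regexp.MustCompile(`^(?P<name>[a-z]+)(?::(?P<tag>[a-z0-9.]+))?$`)

func main() {
	names := optionalTag.SubexpNames()
	for _, in := range []string{"ubuntu", "ubuntu:22.04"} {
		m := optionalTag.FindStringSubmatch(in)
		// len(m) is the same for both inputs; an optional group that
		// did not participate is reported as an empty string.
		fmt.Printf("%-14s len(m)=%d len(names)=%d tag=%q\n",
			in, len(m), len(names), m[optionalTag.SubexpIndex("tag")])
	}
}
```

Both inputs yield a match slice of length 3: index 0 is the whole match, followed by one entry per capture group. Index 0 has no name, which is why a name-to-value map needs at most `len(matches)-1` entries.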

The previous way of detecting the name and version to supply to Dependency-Track was very brittle. It already failed for image references that include a hash, producing names like hello-world@sha256, because it only split on ':' and then selected the first and last components.

This version uses a regex to match the individual components of a Docker image reference as accurately as possible. The regex comes from the [official implementation] on GitHub, but is actually taken from a [pull request] that adds named capture groups and fixes an issue with domain recognition being too eager. Yes, the regex looks pretty wild; yes, there are tests. I don't think it makes sense to build the regex from the individual components the way the Docker library does.

Unfortunately, this does not solve the problem of actually getting the reference from somewhere. For images, taking it from the name of the main component works, but this part stays brittle. Annotations on the scans might be a possible solution for that.

[official implementation]: https://github.com/distribution/reference
[pull request]: distribution/distribution#3803

Signed-off-by: Lukas Fischer <lukas.fischer@iteratec.com>
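The failure mode described in that commit message is easy to reproduce. The sketch below contrasts a naive split on ':' with a regex that uses named groups. The pattern `simplifiedRef` is an assumption made for this example; it is much looser than the expression from this PR and should not be used for validation. `naiveSplit` and `regexSplit` are likewise hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// simplifiedRef is an illustrative pattern only: a lazily matched name,
// an optional ":tag", and an optional "@algorithm:hex" digest.
var simplifiedRef = regexp.MustCompile(
	`^(?P<name>[^@]+?)(?::(?P<tag>\w[\w.-]*))?(?:@(?P<digest>[a-z0-9]+:[a-f0-9]+))?$`)

// naiveSplit mimics the brittle approach: split on ':' and take the
// first and last components.
func naiveSplit(ref string) (name, version string) {
	parts := strings.Split(ref, ":")
	return parts[0], parts[len(parts)-1]
}

// regexSplit extracts the components through the named groups.
func regexSplit(ref string) (name, tag, digest string, ok bool) {
	m := simplifiedRef.FindStringSubmatch(ref)
	if m == nil {
		return "", "", "", false
	}
	return m[simplifiedRef.SubexpIndex("name")],
		m[simplifiedRef.SubexpIndex("tag")],
		m[simplifiedRef.SubexpIndex("digest")],
		true
}

func main() {
	for _, ref := range []string{"hello-world@sha256:abc123", "localhost:5000/foo:1.0"} {
		n, v := naiveSplit(ref)
		fmt.Printf("naive: %-28s name=%q version=%q\n", ref, n, v)
		name, tag, digest, _ := regexSplit(ref)
		fmt.Printf("regex: %-28s name=%q tag=%q digest=%q\n", ref, name, tag, digest)
	}
}
```

The naive split reports `hello-world@sha256` as the name and, for a registry with a port, cuts the name off at `localhost`; the regex keeps the name, tag and digest apart in both cases.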

milosgajdos · 2023-12-16T23:10:53Z

@thaJeztah do you mind moving this PR to the https://github.com/distribution/reference repo?

milosgajdos · 2023-12-20T10:08:09Z

ping @thaJeztah we need to move this to the new repo. I don't think you can move PRs between repos (I know you can move issues), so we might wanna close this and open it in the new repo.

milosgajdos · 2024-04-08T11:19:39Z

@thaJeztah do you want to close this now that reference is in its own repo?

milosgajdos · 2024-05-27T21:05:01Z

There are a bunch of conflicts now due to the reference package being moved to a dedicated repo. Mind closing this @thaJeztah or moving it to distribution/reference?

milosgajdos · 2024-07-04T16:04:05Z

Closing as reference has moved to https://github.com/distribution/reference

We need to re-open the PR there.

thaJeztah force-pushed the reference_named_capture branch from 5435a29 to 233f7ad Compare November 25, 2022 14:38

thaJeztah force-pushed the reference_named_capture branch from 233f7ad to ae426ce Compare April 17, 2023 13:35

thaJeztah force-pushed the reference_named_capture branch from ae426ce to 22f4e75 Compare April 29, 2023 17:33

thaJeztah mentioned this pull request Apr 29, 2023

reference: rewrite test to use sub-tests, add benchmark #3883

Merged

thaJeztah changed the title ~~[draft] reference: use named capturing groups~~ reference: use named capturing groups Apr 29, 2023

thaJeztah force-pushed the reference_named_capture branch 5 times, most recently from 719d5ac to 6d7b3e9 Compare May 2, 2023 20:57

thaJeztah added refactor status/2-code-review labels May 5, 2023

thaJeztah force-pushed the reference_named_capture branch from 6d7b3e9 to 9573e05 Compare May 9, 2023 14:24

thaJeztah force-pushed the reference_named_capture branch from 9573e05 to 365c733 Compare May 10, 2023 00:24

corhere reviewed May 11, 2023

thaJeztah mentioned this pull request Aug 17, 2023

There is a surprising increase in transitive dependencies when we switch to distribution v3 #3587

Closed

milosgajdos closed this Jul 4, 2024
