Skip to content

Conversation

thaJeztah
Copy link
Member

@thaJeztah thaJeztah commented Nov 25, 2022

Rewrite the regular expressions to use named capturing groups.
This simplifies handling the resulting matches, but does require
some additional handling to associate matches with their names.

Also making some changes to the matching to match how domains
are actually matched; some of that was already handled in
code parsing the results of the regex, but now this is handled
by the regex itself.

Before:

BenchmarkParse
BenchmarkParse-10              12696             93805 ns/op            9311 B/op        185 allocs/op
PASS

After:

BenchmarkParse
BenchmarkParse-10              12486             94774 ns/op           18617 B/op        178 allocs/op
PASS

Benchstat:

go test -run='^$' -bench=. -count=10 ./reference/ > old.txt
go test -run='^$' -bench=. -count=10 ./reference/ > new.txt

benchstat old.txt new.txt
name       old time/op    new time/op    delta
Parse-10   91.7µs ± 0%    97.0µs ±11%   +5.82%  (p=0.000 n=9+10)

name       old alloc/op   new alloc/op   delta
Parse-10    9.32kB ± 0%   18.63kB ± 0%  +99.93%  (p=0.000 n=10+10)

name       old allocs/op  new allocs/op  delta
Parse-10        185 ± 0%       178 ± 0%   -3.78%  (p=0.000 n=10+10)

Comment on lines +103 to +109
// TODO(thaJeztah): disambiguate: docker requires domain-name to be either;
// - localhost (special case)
// - at least one "."
// - or a ":port"
//
// Any other domain is considered a path-element.
domainName = domainNameComponent + oneOrMore(`\.`+domainNameComponent)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still to be looked at; I think the new matches are already better than the old ones, but needs some eyes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^ to outline; previously the Regex would handle these as "namespace", and then the code processing the output would mark them as "domain". In this implementation, the Regex already splits them to be a domain (after which the code can do further handling).

(this is why some of the tests were updated, because they tested the Regex "incorrect" behavior)

@codecov-commenter
Copy link

codecov-commenter commented Nov 25, 2022

Codecov Report

Patch coverage: 91.11% and project coverage change: +0.07 🎉

Comparison is base (8e29e87) 56.88% compared to head (365c733) 56.96%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3803      +/-   ##
==========================================
+ Coverage   56.88%   56.96%   +0.07%     
==========================================
  Files         106      106              
  Lines       10703    10711       +8     
==========================================
+ Hits         6088     6101      +13     
+ Misses       3942     3937       -5     
  Partials      673      673              
Impacted Files Coverage Δ
reference/reference.go 80.85% <83.33%> (+2.56%) ⬆️
reference/regexp.go 92.00% <89.47%> (-8.00%) ⬇️
reference/normalize.go 82.07% <100.00%> (+0.17%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@thaJeztah thaJeztah force-pushed the reference_named_capture branch from 5435a29 to 233f7ad Compare November 25, 2022 14:38
@thaJeztah

This comment was marked as resolved.

@thaJeztah thaJeztah force-pushed the reference_named_capture branch from 233f7ad to ae426ce Compare April 17, 2023 13:35
@thaJeztah thaJeztah force-pushed the reference_named_capture branch from ae426ce to 22f4e75 Compare April 29, 2023 17:33
@thaJeztah thaJeztah changed the title [draft] reference: use named capturing groups reference: use named capturing groups Apr 29, 2023
@thaJeztah thaJeztah force-pushed the reference_named_capture branch 5 times, most recently from 719d5ac to 6d7b3e9 Compare May 2, 2023 20:57
@thaJeztah thaJeztah force-pushed the reference_named_capture branch from 6d7b3e9 to 9573e05 Compare May 9, 2023 14:24
Rewrite the regular expressions to use named capturing groups.
This simplifies handling the resulting matches, but does require
some additional handling to associate matches with their names.

Also making some changes to the matching to match how domains
are _actually_ matched; some of that was already handled in
code parsing the results of the regex, but now this is handled
by the regex itself.

Before:

    BenchmarkParse
    BenchmarkParse-10    	   12696	     93805 ns/op	    9311 B/op	     185 allocs/op
    PASS

After:

    BenchmarkParse
    BenchmarkParse-10    	   12486	     94774 ns/op	   18617 B/op	     178 allocs/op
    PASS

Benchstat:

    go test -run='^$' -bench=. -count=10 ./reference/ > old.txt
    go test -run='^$' -bench=. -count=10 ./reference/ > new.txt

    benchstat old.txt new.txt
    name       old time/op    new time/op    delta
    Parse-10   91.7µs ± 0%    97.0µs ±11%   +5.82%  (p=0.000 n=9+10)

    name       old alloc/op   new alloc/op   delta
    Parse-10    9.32kB ± 0%   18.63kB ± 0%  +99.93%  (p=0.000 n=10+10)

    name       old allocs/op  new allocs/op  delta
    Parse-10        185 ± 0%       178 ± 0%   -3.78%  (p=0.000 n=10+10)

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah thaJeztah force-pushed the reference_named_capture branch from 9573e05 to 365c733 Compare May 10, 2023 00:24
@thaJeztah
Copy link
Member Author

@corhere @squizzi @milosgajdos interested to hear your take on this; I think this makes the code/handling easier to reason with, and the Regexes more "complete", and potentially more usable. But (as commented) does come with an increase memory use. Wondering if we think the pros outweigh the cons.

Copy link
Collaborator

@corhere corhere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names of capture groups don't have to be unique, which could make things weird when composing our expressions with others. And the capture group names become part of the exported API surface, which I'm squeamish about. I would be okay with using named capture groups IF none of the regexes are exported.

I'm really skeptical about the getNamedMatches function. If we want to have symbolic names for referencing the capture groups, assigning indices returned from (*Regexp).SubexpIndex to variables would suffice. That would cut down on allocations by not building a map copy of the match slice.

func mustSubexpIndex(re *regexp.Regexp, name string) int {
    i := re.SubexpIndex(name)
    if i < 0 {
        panic("no subexpression named "+name)
    }
    return i
}

var (
    referenceName = mustSubexpIndex(referenceRegexp, "name")
)

Comment on lines +188 to +190
if len(r.SubexpNames()) == 0 {
panic("regex does not have named subexpressions: " + r.String())
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition will never be true as len(r.SubexpNames()) > 0 for all regexps. r.SubexpNames() will also have indices for every unnamed subexpression, with the empty string as the value.

if ok = len(matches) > 0; !ok {
return nil, false
}
namedMatches = make(map[string]string, len(matches))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
namedMatches = make(map[string]string, len(matches))
namedMatches = make(map[string]string, len(matches)-1)

return nil, false
}
namedMatches = make(map[string]string, len(matches))
// We loop through matches here, in case there's optional named match-groups.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing optional about optional capture groups is that they don't have to match anything. The indices of the matches slice don't change around. If they did, the indices in r.SubexpNames() would not line up. When matches != nil, len(matches) is constant for a given expression, irrespective of the input.

o1oo11oo pushed a commit to o1oo11oo/secureCodeBox that referenced this pull request Sep 18, 2023
The previous way of detecting the name and version to supply to
Dependency-Track was very brittle, it already failed for image
references including a hash, resulting in names like hello-world@sha256,
because it would only split on ':' and then select the first and last
component.

This version uses a regex to as accurately as possible match the
individual components of a docker image reference. The regex comes from
the [official implementation] on GitHub, but is actually taken from a
[pull request], which adds named capture groups and fixes an issue with
domain recognition being too eager.

Yes the regex looks pretty wild, yes there are tests. I don't think it
makes sense to build the regex from the individual components like the
docker library does it.

Unfortunately this does not solve the problem of actually getting the
reference from somewhere, for images it works with getting it from the
name of the main component, but this part stays brittle. Annotations to
the scans might be a possible solution for that.

[official implementation]: https://github.com/distribution/reference
[pull request]: distribution/distribution#3803

Signed-off-by: Lukas Fischer <lukas.fischer@iteratec.com>
o1oo11oo pushed a commit to secureCodeBox/secureCodeBox that referenced this pull request Oct 5, 2023
The previous way of detecting the name and version to supply to
Dependency-Track was very brittle, it already failed for image
references including a hash, resulting in names like hello-world@sha256,
because it would only split on ':' and then select the first and last
component.

This version uses a regex to as accurately as possible match the
individual components of a docker image reference. The regex comes from
the [official implementation] on GitHub, but is actually taken from a
[pull request], which adds named capture groups and fixes an issue with
domain recognition being too eager.

Yes the regex looks pretty wild, yes there are tests. I don't think it
makes sense to build the regex from the individual components like the
docker library does it.

Unfortunately this does not solve the problem of actually getting the
reference from somewhere, for images it works with getting it from the
name of the main component, but this part stays brittle. Annotations to
the scans might be a possible solution for that.

[official implementation]: https://github.com/distribution/reference
[pull request]: distribution/distribution#3803

Signed-off-by: Lukas Fischer <lukas.fischer@iteratec.com>
@milosgajdos
Copy link
Member

@thaJeztah do you mind moving this PR to the https://github.com/distribution/reference repo?

@milosgajdos
Copy link
Member

ping @thaJeztah we need to move this to the new repo. I don't think you can move PRs between repos (I know you can move issues), so we might wanna close this and open it in the new repo.

@milosgajdos
Copy link
Member

@thaJeztah do you want to close this now that reference is in its own repo?

@milosgajdos
Copy link
Member

There are a bunch of conflicts now due to the reference package being moved to a dedicated repo. Mind closing this @thaJeztah or moving it to distribution/reference?

@milosgajdos
Copy link
Member

Closing as reference has moved to https://github.com/distribution/reference

We need to re-open the PR there.

@milosgajdos milosgajdos closed this Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants