-
-
Notifications
You must be signed in to change notification settings - Fork 212
Goodbye validate-iri #2182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Goodbye validate-iri #2182
Conversation
@@ -3812,8 +3812,8 @@ it("parsePnpmLock", async () => { | |||
3, | |||
); | |||
parsedList = await parsePnpmLock("./pnpm-lock.yaml"); | |||
assert.deepStrictEqual(parsedList.pkgList.length, 378); | |||
assert.deepStrictEqual(parsedList.dependenciesList.length, 378); | |||
assert.deepStrictEqual(parsedList.pkgList.length, 377); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validate-iri is gone, so there is one less dependency!
["http://example.com/a%" + "ab%".repeat(500), false], // Many %ab patterns (invalid end) | ||
["http://example.com/a%" + "a".repeat(100), false], // One % followed by many 'a's | ||
["http://example.com/" + "%".repeat(100), false], // Very long sequence of just % | ||
["http://example.com/a%" + "a%".repeat(50000), false], // Many %a patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even big repetitions are no problem with our parse-then-validate approach!
["http://user:p@ssw0rd@example.com/path", true], // Valid, but contains @ | ||
["http://user@example.com/path", true], // Valid with user only | ||
// IRI with IPv6 literal (tests authority parsing) | ||
["http://[2001:db8::1]:8080/path", false], // Valid IPv6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm cheating here. Solving certain test cases will increase the complexity of the parser. Our implementation might incorrectly filter out certain valid IRIs. However, having such IPv6 addresses in SBOMs might look strange anyways.
["\\\\サーバー\\共有\\ファイル.txt", false], // Raw Unicode UNC - invalid IRI ref | ||
// Correct IRI for UNC-like path would need a scheme, e.g., file: | ||
[ | ||
"file:///%E3%82%B5%E3%83%BC%E3%83%90%E3%83%BC/%E5%85%B1%E6%9C%89/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB.txt", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea why this is valid. There could be some amount of invalid IRIs that might escape the new implementation. Please file those as bugs, so that we can improve the parser.
["https://example.com/café/résumé.html", false], | ||
["https://example.com/path/%C3%A9%C3%A1%C3%BC", true], // Same path, pre-encoded | ||
// Path with Chinese characters | ||
["https://example.com/路径/文件.html", false], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually valid. Sorry, Chinese users. I want to keep the parser implementation simple.
Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>
915dd63
to
530f182
Compare
Fixes #2180
It is no secret that cdx1-pro (qwen3-coder) and I get along very well. The model is fully trained on the cdxgen codebase and CycloneDX in general, so I am seeing those 10x gains that people are talking about.
To rewrite validate-iri, I first tested cdx1-pro's knowledge about RFC 3987 by asking questions based on the test cases here.
Then I asked it to implement a parser by slowly increasing the complexity based on valid test cases, till we got decent coverage for valid tests.
Then come the invalid tests. I kept asking for simple regexes to solve the invalid tests, including the complex ReDoS, Unicode, and other tests. I think we have a decent
parse-then-validate
approach to IRI.