Skip to content

Conversation

prabhu
Copy link
Collaborator

@prabhu prabhu commented Aug 17, 2025

Fixes #2180

It is no secret that cdx1-pro (qwen3-coder) and I get along very well. The model is fully trained on the cdxgen codebase and CycloneDX in general, so I am seeing those 10x gains that people are talking about.

To rewrite validate-iri, I first tested cdx1-pro's knowledge about RFC 3987 by asking questions based on the test cases here.

1 2 3

Then I asked it to implement a parser by slowly increasing the complexity based on valid test cases, till we got decent coverage for valid tests.

4

Then come the invalid tests. I kept asking for simple regexes to solve the invalid tests, including the complex ReDoS, Unicode, and other tests. I think we have a decent parse-then-validate approach to IRI.

@@ -3812,8 +3812,8 @@ it("parsePnpmLock", async () => {
3,
);
parsedList = await parsePnpmLock("./pnpm-lock.yaml");
assert.deepStrictEqual(parsedList.pkgList.length, 378);
assert.deepStrictEqual(parsedList.dependenciesList.length, 378);
assert.deepStrictEqual(parsedList.pkgList.length, 377);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate-iri is gone, so there is one less dependency!

["http://example.com/a%" + "ab%".repeat(500), false], // Many %ab patterns (invalid end)
["http://example.com/a%" + "a".repeat(100), false], // One % followed by many 'a's
["http://example.com/" + "%".repeat(100), false], // Very long sequence of just %
["http://example.com/a%" + "a%".repeat(50000), false], // Many %a patterns
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even big repetitions are no problem with our parse-then-validate approach!

["http://user:p@ssw0rd@example.com/path", true], // Valid, but contains @
["http://user@example.com/path", true], // Valid with user only
// IRI with IPv6 literal (tests authority parsing)
["http://[2001:db8::1]:8080/path", false], // Valid IPv6
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm cheating here. Solving certain test cases will increase the complexity of the parser. Our implementation might incorrectly filter out certain valid IRIs. However, having such IPv6 addresses in SBOMs might look strange anyways.

["\\\\サーバー\\共有\\ファイル.txt", false], // Raw Unicode UNC - invalid IRI ref
// Correct IRI for UNC-like path would need a scheme, e.g., file:
[
"file:///%E3%82%B5%E3%83%BC%E3%83%90%E3%83%BC/%E5%85%B1%E6%9C%89/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB.txt",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea why this is valid. There could be some amount of invalid IRIs that might escape the new implementation. Please file those as bugs, so that we can improve the parser.

["https://example.com/café/résumé.html", false],
["https://example.com/path/%C3%A9%C3%A1%C3%BC", true], // Same path, pre-encoded
// Path with Chinese characters
["https://example.com/路径/文件.html", false],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually valid. Sorry, Chinese users. I want to keep the parser implementation simple.

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>
@prabhu prabhu force-pushed the feature/parse-dont-validate-iri branch from 915dd63 to 530f182 Compare August 18, 2025 00:06
Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>
@prabhu prabhu merged commit 027c715 into master Aug 18, 2025
80 checks passed
@prabhu prabhu deleted the feature/parse-dont-validate-iri branch August 18, 2025 05:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[security] ReDoS vulnerabilities with validate-iri package
2 participants