streichsbaer
Contributor

Changelog:

  • Add check for malformed percent-encoding sequences before calling validate-iri
  • Replace %prettyVersion% placeholders with actual versions in composer.lock
  • Prevent cdxgen from hanging when processing BookStack and similar PHP projects

The validate-iri library hangs indefinitely on URLs with invalid percent-encoding like %prettyVersion% where %pr is not a valid hex sequence. This fix detects such malformed sequences early and also replaces known placeholders with actual version strings.
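
The early check described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names `hasMalformedPercentEncoding` and `isSafeToValidate` are hypothetical, and `validate` stands in for whatever slower IRI validator is called afterwards.

```javascript
// Hypothetical helper: returns true when the string contains a "%" that is
// not followed by two hexadecimal digits (e.g. "%pr", "%2", or a trailing "%").
function hasMalformedPercentEncoding(value) {
  // The negative lookahead inspects at most two characters after each "%",
  // so the scan stays linear in the length of the input.
  return /%(?![0-9A-Fa-f]{2})/.test(value);
}

// Hypothetical wrapper: `validate` is an assumed stand-in for the slower
// IRI validator; it is only reached when the cheap check passes.
function isSafeToValidate(value, validate) {
  if (typeof value !== "string" || hasMalformedPercentEncoding(value)) {
    return false; // reject early and never reach the slow path
  }
  return validate(value);
}
```

With this shape, a URL such as `https://example.com/%prettyVersion%` is rejected before the validator runs, because `%pr` is not a valid escape.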

Fixes issue with packages from Codeberg that use URL placeholders.
See issue #2174
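
For the placeholder replacement, a sketch along these lines may help. It is an assumption-laden illustration rather than the merged implementation: `resolveVersionPlaceholder` is a hypothetical name, and the Codeberg URL and version below are made up for the example.

```javascript
// Hypothetical helper: substitutes the literal "%prettyVersion%" placeholder
// with the package version taken from the lock file entry.
function resolveVersionPlaceholder(url, version) {
  if (typeof url !== "string" || !version) {
    return url; // nothing to substitute
  }
  // split/join avoids any special handling of "$" patterns in the replacement.
  return url.split("%prettyVersion%").join(version);
}

// Illustrative values only; not taken from a real composer.lock.
const exampleUrl = "https://codeberg.org/vendor/example/archive/%prettyVersion%.zip";
console.log(resolveVersionPlaceholder(exampleUrl, "v1.2.3"));
```

Substituting first means the resulting URL no longer contains the `%pr` sequence that triggers the hang, so it can be validated normally.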

Signed-off-by: Stefan Streichsbier <stefan@streichsbier.at>
@streichsbaer streichsbaer requested a review from prabhu as a code owner August 16, 2025 18:09
Collaborator

@prabhu prabhu left a comment

Thank you. This is the second ReDoS reported against that library. Might require some fuzzing to identify all problematic payloads in the future.

@prabhu
Collaborator

prabhu commented Aug 16, 2025

Could you kindly run the pnpm lint command?

Signed-off-by: Stefan Streichsbier <stefan@streichsbier.at>
@streichsbaer
Contributor Author

Done, @prabhu.

The library may be problematic as it hasn't been updated in three years.
It could make sense to replace it with something else/simpler.

At least for now, the early percent-encoding check makes cdxgen more robust against these edge cases.

@prabhu
Collaborator

prabhu commented Aug 17, 2025

Agreed. We may have to fork or contribute to this library and enhance its tests including adding fuzzing. Happy to recommend this project to the next round of GitHub SOSF as well.
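
As a rough idea of what such fuzzing could look like, the sketch below generates reproducible percent-heavy payloads and records the ones that exceed a time budget. Everything here is hypothetical (`createRandom`, `makePayload`, `findSlowPayloads`, the fragment list, and the budget values); `validate` is an assumed stand-in for the function under test. Note that a genuine hang never returns, so a real harness would likely run the validator in a worker or child process with a timeout instead of timing it in-process.

```javascript
// Small deterministic linear congruential generator so runs are reproducible.
function createRandom(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // value in [0, 1)
  };
}

// Fragments chosen to exercise valid, incomplete, and invalid escapes.
const FRAGMENTS = ["%", "%2", "%20", "%zz", "a", "/", "é"];

function makePayload(rand, maxParts = 2000) {
  const parts = 1 + Math.floor(rand() * maxParts);
  let out = "http://example.com/";
  for (let i = 0; i < parts; i++) {
    out += FRAGMENTS[Math.floor(rand() * FRAGMENTS.length)];
  }
  return out;
}

// Returns the payloads for which `validate` took longer than `budgetMs`.
function findSlowPayloads(validate, { runs = 200, budgetMs = 50, seed = 1 } = {}) {
  const rand = createRandom(seed);
  const slow = [];
  for (let i = 0; i < runs; i++) {
    const payload = makePayload(rand);
    const start = Date.now();
    try {
      validate(payload);
    } catch {
      // Thrown errors are fine here; only the elapsed time matters.
    }
    const elapsed = Date.now() - start;
    if (elapsed > budgetMs) {
      slow.push({ payload, elapsed });
    }
  }
  return slow;
}
```

Keeping the seed fixed makes any slow payload reproducible, which helps when turning a finding into a regression test.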

@streichsbaer
Contributor Author

Sounds good!
Do you need anything else from me before merging this PR?

@prabhu prabhu merged commit 02f31c3 into CycloneDX:master Aug 17, 2025
79 of 80 checks passed
@prabhu
Collaborator

prabhu commented Aug 17, 2025

Thank you so much!

@prabhu
Collaborator

prabhu commented Aug 17, 2025

I found numerous other payloads that cause the library to hang. I am working on a separate PR.

const testCases = [
  // --- Existing Test Cases (for context) ---
  ["", false],
  ["git@gitlab.com:behat-chrome/chrome-mink-driver.git", false],
  ["     git@gitlab.com:behat-chrome/chrome-mink-driver.git      ", false],
  ["${repository.url}", false],
  // bomLink - https://cyclonedx.org/capabilities/bomlink/
  ["urn:cdx:f08a6ccd-4dce-4759-bd84-c626675d60a7/1#componentA", true],
  // http uri - https://www.ietf.org/rfc/rfc7230.txt
  ["https://gitlab.com/behat-chrome/chrome-mink-driver.git      ", false], // Fails due to trailing space
  [
    "     https://gitlab.com/behat-chrome/chrome-mink-driver.git           ",
    false, // Fails due to leading space
  ],
  ["http://gitlab.com/behat-chrome/chrome-mink-driver.git", true],
  ["git+https://github.com/Alex-D/check-disk-space.git      ", false], // Fails due to trailing space
  ["UNKNOWN", false],
  ["http://", false],
  ["http", false],
  ["https", false],
  ["https://", false],
  ["http://www", true],
  ["http://www.", true],
  [
    "https://github.com/apache/maven-resolver/tree/      ${project.scm.tag}",
    false, // Fails due to space and ${}
  ],
  ["git@github.com:prometheus/client_java.git", false],
  // --- New Stress Test Cases ---
  // Potential ReDoS for percent-encoding regex: Long sequences of % followed by non-hex or short hex
  ["http://example.com/a%" + "a%".repeat(50000), false], // Many %a patterns
  ["http://example.com/a%" + "ab%".repeat(50000), false], // Many %ab patterns (invalid end)
  ["http://example.com/a%" + "a".repeat(100000), false], // One % followed by many 'a's
  ["http://example.com/" + "%".repeat(100000), false], // Very long sequence of just %
  // Edge cases around valid percent-encoding boundaries (pushing regex engine)
  ["http://example.com/path%" + "20".repeat(30000) + "%2", false], // One %20, then a long run of literal "20" digits, ends with incomplete %2
  ["http://example.com/path%" + "20".repeat(30000) + "a", false], // One %20, then a long run of literal "20" digits, ends with "a"
  // Potentially complex IRI that might be slow for validateIri (if not already robust)
  // Using a plausible but complex structure with lots of valid non-ASCII chars (requires UTF-8 support)
  // Note: Actual performance depends on the `validateIri` implementation.
  [
    "http://example.com/path/to/resource/with/lots/of/segments/and/long/-names/including/üñíçødé/characters/ sprinkled/in/" +
      "segment".repeat(2000) +
      "?query=param&other=valué#frågmënt",
    true,
  ], // Assuming validateIri and URL can handle it
  // Very long valid IRI (tests overall handling, potentially URL constructor)
  [
    "http://very.long.domain.name.example.com/very/long/path/component/that/just/keeps/going/on/and/on/forever/it/seems/" +
      "segment/".repeat(3000) +
      "end",
    true,
  ], // Assuming it's technically valid
  // IRI with complex query and fragment (tests boundaries)
  [
    "https://example.com/path?query=with%20lots%20of%20percent%20encoding%20but%20valid%20%C3%A9%C3%B1#fragment-with-unicode-çhars-üñíçødé",
    true,
  ],
  // IRI that looks almost like a bomLink but isn't quite (tests scheme handling)
  ["urn:cdx:some-uuid/1#componentA/extra", true], // Might be valid IRI/URI, depends on urn:cdx spec, but structurally okay for IRI
  ["urn:cdx:some-uuid/1", true], // Valid urn without fragment
  // IRI with userinfo (less common, test robustness)
  ["http://user:p@ssw0rd@example.com/path", true], // Valid, but contains @
  ["http://user@example.com/path", true], // Valid with user only
  // IRI with IPv6 literal (tests authority parsing)
  ["http://[2001:db8::1]:8080/path", true], // Valid IPv6
  ["http://[2001:db8::1]/path", true], // Valid IPv6 without port
  // Potentially problematic characters in path/query/fragment (if not already covered)
  ["http://example.com/path with spaces", false], // Space not encoded
  ["http://example.com/path<with>brackets", false], // < > not typically allowed unencoded
  ['http://example.com/path"with"quotes', false], // " not typically allowed unencoded in URI/IRI ref
  // Test case sensitivity for scheme check (uses original `iri`)
  ["HTTP://example.com", true], // Scheme case (URL constructor should handle)
  ["HTTPS://EXAMPLE.COM/PATH", true],
  // Edge case: IRI that is just a scheme
  ["mailto:", false], // Scheme only, no path/query/fragment. Often invalid as a reference if no authority/data.
  // Re-test specific percent-encoding edge case mentioned in comments
  ["http://example.com/path%ab%cd%ef", true], // Valid percent encodings
  ["http://example.com/path%ab%cd%e", false], // Invalid: incomplete %e at end
  ["http://example.com/path%ab%cd%eg", false], // Invalid: %eg
  ["http://example.com/path%ab%cd%", false], // Invalid: trailing %
  ["http://example.com/path%ab%cd%0", false], // Invalid: %0
  ["http://example.com/path%ab%cd%0Z", false], // Invalid: %0Z (Z is not a hex digit)
  // Test with extremely long, but valid, percent-encoded sequence (pushes validateIri/URL)
  // The string below is a valid percent-encoded 'ABCD' repeated many times.
  // Note: encodeURIComponent leaves unreserved characters such as "A" unencoded,
  // so the long valid percent-encoded part is built manually.
  ["http://example.com/data/" + "%41%42%43%44".repeat(10000), true], // Repeats 'ABCD' encoded
  // UNC Paths (IRI references)
  // Standard UNC path (often treated as URIs like \\server\share\path -> file://server/share/path or \\server\share -> smb://server/share)
  // However, as IRI *references* starting with \\, they are generally invalid unless specifically scheme-less references
  // The IRI spec defines scheme-less references as relative. \\server is not a valid relative path segment.
  ["\\\\server\\share\\path\\file.txt", false], // Looks like UNC, invalid as IRI ref
  ["file://server/share/path/file.txt", true], // Correct URI form if that's the intent
  // UNC path with spaces (invalid as IRI ref, valid file URI)
  ["\\\\server name\\share name\\file name.txt", false],
  ["file://server%20name/share%20name/file%20name.txt", true],
  // UNC path with Unicode (invalid as IRI ref, valid file URI if percent-encoded)
  ["\\\\サーバー\\共有\\ファイル.txt", false], // Raw Unicode UNC - invalid IRI ref
  // Correct IRI for UNC-like path would need a scheme, e.g., file:
  [
    "file:///%E3%82%B5%E3%83%BC%E3%83%90%E3%83%BC/%E5%85%B1%E6%9C%89/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB.txt",
    true,
  ], // Japanese characters, percent-encoded

  // Unicode Characters in various components (IRI references)
  // Path with Latin-1 Supplement characters (e.g., accented letters)
  ["https://example.com/café/résumé.html", true], // IRI with non-ASCII chars
  ["https://example.com/path/%C3%A9%C3%A1%C3%BC", true], // Same path, pre-encoded
  // Path with Cyrillic characters
  ["https://example.com/путь/документ.html", true],
  // Path with Chinese characters
  ["https://example.com/路径/文件.html", true],
  // Query with Emoji (if supported by IRI spec and validator)
  ["https://example.com/search?q=cat&emoji=😺", true], // Emoji in query

  // Query and Fragment with Unicode
  ["https://example.com/search?q=café röst", false], // Unencoded space and unicode in query -> invalid URI/IRI ref
  ["https://example.com/search?q=café%20röst", true], // Correctly encoded
  ["https://example.com/page#seção-intro", true], // Unicode in fragment (IRI)
  ["https://example.com/page#se%C3%A7%C3%A3o-intro", true], // Encoded fragment

  // Bidirectional Text (Bidi) in IRI (from RFC 3987 Section 4.3)
  // Note: Actual bidi control characters (like U+200E, U+200F, U+202A..U+202E) should generally be avoided or percent-encoded.
  // Example Bidi IRI from RFC (Hebrew Alef, Lamed, Yod, Vav) - presented logically LTR as Alef-Lamed-Yod-Vav
  // Unicode code points: U+05D0 U+05DC U+05D9 U+05D5
  // UTF-8 Encoding: D7 90 D7 9C D7 99 D7 95
  // Percent Encoding: %D7%90%D7%9C%D7%99%D7%95
  // Assuming the logical string "http://example.com/אליו" represents the Hebrew characters.
  // However, constructing the *exact* bidi IRI string is complex in plain text.
  // Let's test with the percent-encoded version which is clearer.
  // This tests handling of valid UTF-8 sequences representing RTL characters.
  ["http://example.com/%D7%90%D7%9C%D7%99%D7%95", true], // Alef Lamed Yod Vav (Hebrew) encoded

  // Look-alike Characters (from RFC 3987 Section 7.5)
  // Full-width Latin characters (from RFC 3987 Section 7.5)
  // Full-width 'A' (U+FF21) vs. Latin 'A' (U+0041)
  // Full-width 'A' UTF-8: EF BC A1 -> Percent-encoded: %EF%BC%A1
  ["http://example.com/path/FULLWIDTH%EF%BC%A1", true], // Full-width 'A' in path
  // Testing if validator differentiates (it shouldn't inherently, both are valid IRI chars if allowed by scheme)
  ["http://example.com/path/LATIN_A", true], // Standard 'A'

  // Characters specifically excluded in older RFCs mentioned (RFC 3987 Section 7.2)
  // "<", ">", '"', space, "{", "}", "|", "\", "^", and "`"
  // These should generally be invalid *unless* percent-encoded within a valid IRI component context.
  ["https://example.com/path with space", false], // Invalid: unencoded space
  ["https://example.com/path%20with%20space", true], // Valid: encoded space
  ["https://example.com/path<invalid>", false], // Invalid: unencoded <
  ["https://example.com/path%3Cinvalid%3E", true], // Valid: encoded <>
  ['https://example.com/path"quoted"', false], // Invalid: unencoded "
  ["https://example.com/path%22quoted%22", true], // Valid: encoded "
  ["https://example.com/path{invalid}", false], // Invalid: unencoded {
  ["https://example.com/path%7Binvalid%7D", true], // Valid: encoded {}
  // Note: #, %, [, ] are NOT in the excluded list RFC 3987 mentions for conversion; % is crucial for encoding, # delimits the fragment, and [ ] are for IPv6 literals.

  // Complex UTF-8 sequences (4-byte UTF-8 for supplementary planes)
  // Character: G clef (U+1D11E)
  // UTF-8 Encoding: F0 9D 84 9E -> Percent-encoded: %F0%9D%84%9E
  ["https://example.com/music/notation/%F0%9D%84%9E", true], // G clef in path

  // Extremely long UTF-8 sequence (valid but large)
  // Representing a string like "𝄞".repeat(5000) encoded
  // U+1D11E (G clef) -> UTF-8: F0 9D 84 9E -> Percent-encoded: %F0%9D%84%9E
  // Let's create a long valid percent-encoded string representing repeated 4-byte chars
  ["https://example.com/data/" + "%F0%9D%84%9E".repeat(5000), true], // Many G clefs encoded

  // Mixed valid/invalid percent-encoding with Unicode
  ["https://example.com/path/%F0%9D%84%9E%XY", false], // Invalid: %XY
  ["https://example.com/path/%F0%9D%84", false], // Invalid: incomplete 4-byte sequence
  ["https://example.com/path/%F0%9D%84%9", false], // Invalid: incomplete 4-byte sequence (missing 1 hex digit)
  ["https://example.com/path/%F0%9D%84%9Evalid%C3%A9", true], // Mix: G clef + valid 2-byte char (eacute)

  // Edge case: IRI that is mostly valid UTF-8 percent encoding but ends abruptly
  ["https://example.com/path/data/" + "%E2%82%AC".repeat(10000) + "%E2", false], // Valid Euro signs, ends with incomplete %

  // bomLink with Unicode (if applicable, though UUIDs are typically ASCII)
  // The spec doesn't mandate Unicode in the fragment, but let's test a valid IRI-like fragment
  ["urn:cdx:f08a6ccd-4dce-4759-bd84-c626675d60a7/1#compönent-Üñíçødé", true], // Unicode in fragment (assuming IRI rules apply)
];
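
A table like the one above could be driven by a small runner such as the following sketch. The names `runCases` and `demoValidate` are hypothetical, and `demoValidate` is only a simplified stand-in for demonstration; it is not the cdxgen validator and would not satisfy every expectation in the full table.

```javascript
// Hypothetical runner: `validate` is whatever function is under test.
// Returns the cases whose actual result differs from the expected one.
function runCases(cases, validate) {
  const mismatches = [];
  for (const [input, expected] of cases) {
    let actual;
    try {
      actual = Boolean(validate(input));
    } catch {
      actual = false; // treat a thrown error as "invalid"
    }
    if (actual !== expected) {
      // Truncate long payloads so that the report stays readable.
      mismatches.push({ input: input.slice(0, 60), expected, actual });
    }
  }
  return mismatches;
}

// Simplified stand-in validator (assumption): rejects whitespace and
// malformed percent-escapes, and requires a scheme followed by something.
function demoValidate(value) {
  return (
    !/\s/.test(value) &&
    !/%(?![0-9A-Fa-f]{2})/.test(value) &&
    /^[A-Za-z][A-Za-z0-9+.-]*:./.test(value)
  );
}

const sample = [
  ["http://example.com/path%ab%cd%ef", true],
  ["http://example.com/path%ab%cd%", false],
  ["http://example.com/" + "%".repeat(100000), false],
  ["https://example.com/path with space", false],
];
console.log(runCases(sample, demoValidate)); // an empty array means every expectation held
```

Returning the mismatches instead of asserting inside the loop makes it easy to see all failing payloads in one run.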

@prabhu
Collaborator

prabhu commented Aug 17, 2025

Raised #2180
