Skip to content

Conversation

jhy
Copy link
Owner

@jhy jhy commented Jul 21, 2025

The primary optimization is to invalidate sibling indexes on modification, vs re-calculating them immediately. When called in a loop of element.child(0).remove(), the previous implementation was effectively O(n²), and is now just O(n).

Also includes optimizations for Document.createElement(), setting attributes, and when parsing a body fragment with many direct children.

The primary optimization is to invalidate sibling indexes on modification, vs re-calculating them immediately. When called in a loop of `element.child(0).remove()`, the previous implementation was effectively O(n²), and is now just O(n).

Also includes optimizations for `Document.createElement()`, setting attributes, and when parsing a body fragment with many direct children.
@jhy
Copy link
Owner Author

jhy commented Jul 21, 2025

@carlosame FYI, this change corrects the busted performance of jsoup in your DOM modification benchmark (code).

New results:

Benchmark                       Mode  Cnt     Score    Error  Units
DOMChangeMark.markChangeDOM    thrpt    3  1539.337 ± 63.919  ops/s
DOMChangeMark.markChangeDOM4J  thrpt    3   363.892 ± 55.999  ops/s
DOMChangeMark.markChangeJdk    thrpt    3  2736.017 ± 83.992  ops/s
DOMChangeMark.markChangeJsoup  thrpt    3   366.781 ± 10.708  ops/s

Where previously jsoup was a pathetic 6 ops/s on this machine. As mentioned in the commit, the flow of the benchmark's removal was running as O(n²); now it's O(n).

From a quick look at the JDK's DOM impl, the primary difference that gives the extra speed for this specific benchmark is that they are using a linked-list vs an array, and so the construction/removal is just a simple field change. And as it's XML they're not needing to lowercase normalize the tag names or attributes etc. There are pros and cons to using a linked list in a DOM; for jsoup's general use case I've found the list is better, but I'll noodle on that some more.

@jhy jhy added the improvement An improvement / new feature idea label Jul 21, 2025
@jhy jhy added this to the 1.21.2 milestone Jul 21, 2025
@jhy jhy merged commit 31ad068 into master Jul 21, 2025
21 checks passed
@jhy jhy deleted the perf-optimizations branch July 21, 2025 03:22
@carlosame
Copy link

From a quick look at the JDK's DOM impl, the primary difference that gives the extra speed for this specific benchmark is that they are using a linked-list vs an array, and so the construction/removal is just a simple field change. And as it's XML they're not needing to lowercase normalize the tag names or attributes etc. There are pros and cons to using a linked list in a DOM; for jsoup's general use case I've found the list is better, but I'll noodle on that some more.

Thanks for the heads-up, I'll update the benchmarks once 1.21.2 is out.

Css4j's native DOM uses a linked list but has all the HTML overhead, it scores 1539 in the benchmark that you ran. The JDK DOM used to be about 55% faster than css4j DOM but has improved in recent JDKs. The JDK is generally faster because it does some weird internal optimizations; those sometimes trigger errors (e.g. ArrayIndexOutOfBoundsException) during thread-heavy benchmarks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement An improvement / new feature idea
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants