Description
I have a use case where I want to split a long HTML document into sections at the header tags (h1, h2, etc.).
I'm trying to use position tracking to find where each header begins and ends, but it seems only the opening tag is tracked. The closing tag is not tracked, except for non-official header tags (h1 to h6 are the official HTML header tags, but my document also has h7 and h8, which are not part of the HTML specification).
As a concrete example, take the following snippet:
<h1>title</h1>
<h2 id="abc">abc</h2>
<p>hello</p>
<h5 id="bcd">bcd</h5>
<p>thanks</p>
<h3 id="cde">cde</h3>
<p>hello</p>
<h7 id="def">def</h7>
<p>thanks</p>
<h3 id="efg">efg</h3>
<p>hello</p>
<h8 id="fgh">fgh</h8>
<p>hello</p>
I then enable position tracking on the parser and select the headers:
var doc = Parser.htmlParser().setTrackPosition(true).parseInput(htmlDoc, uri);
var headers = doc.select("h1, h2, h3, h4, h5, h6, h7, h8, h9");
When I print each element's .sourceRange().start() / end() and .endSourceRange().start() / end(), I get the following output:
Start: 1,1:0 End: 1,5:4 <-> Start: -1,-1:-1 End: -1,-1:-1 — h1 — title
Start: 2,1:15 End: 2,14:28 <-> Start: -1,-1:-1 End: -1,-1:-1 — h2 — abc
Start: 4,1:50 End: 4,14:63 <-> Start: -1,-1:-1 End: -1,-1:-1 — h5 — bcd thanks
Start: 6,1:86 End: 6,14:99 <-> Start: -1,-1:-1 End: -1,-1:-1 — h3 — cde
Start: 8,1:121 End: 8,14:134 <-> Start: 8,17:137 End: 8,22:142 — h7 — def
Start: 10,1:157 End: 10,14:170 <-> Start: -1,-1:-1 End: -1,-1:-1 — h3 — efg
Start: 12,1:192 End: 12,14:205 <-> Start: 12,17:208 End: 12,22:213 — h8 — fgh
The opening header tags have correct start/end positions.
But all the closing header tags (except the non-standard ones, h7 and h8) report -1 values, as if they were not tracked or there were no closing tags at all.
Shouldn't endSourceRange() return tracked (non -1) positions for these closing tags too?
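In the meantime, a possible workaround is to locate the closing tag in the raw source myself, starting from the end offset of the opening tag (which is tracked correctly). The sketch below is plain Java and not part of jsoup; `ClosingTagFinder` and `findClosingTag` are hypothetical names. It assumes the offset comes from something like `sourceRange().end().pos()` and indexes into the same string that was parsed, that headers are not nested inside each other, and that the closing tag text does not also appear inside a comment or script.

```java
class ClosingTagFinder {
    /**
     * Returns {start, end} offsets (end exclusive) of the first closing tag
     * for tagName found at or after fromOffset, or null if there is none.
     * Hypothetical helper: a plain text scan, not an HTML parser.
     */
    static int[] findClosingTag(String html, String tagName, int fromOffset) {
        String needle = "</" + tagName;
        for (int i = Math.max(0, fromOffset); i + needle.length() <= html.length(); i++) {
            // Case-insensitive match, so </H2> is found when searching for h2.
            if (!html.regionMatches(true, i, needle, 0, needle.length())) {
                continue;
            }
            int after = i + needle.length();
            if (after >= html.length()) {
                return null; // input ends before the tag is closed
            }
            char next = html.charAt(after);
            // Reject longer names, e.g. </h10> when searching for h1.
            if (next == '>' || Character.isWhitespace(next)) {
                int close = html.indexOf('>', after);
                return close < 0 ? null : new int[] {i, close + 1};
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<h1>title</h1>\n<h2 id=\"abc\">abc</h2>\n<p>hello</p>\n";
        // 28 is the end offset of the opening <h2 id="abc"> tag in the output above.
        int[] range = findClosingTag(html, "h2", 28);
        System.out.println(html.substring(range[0], range[1]));
    }
}
```

For the h2 in my example this gives the range 31 to 36, which is consistent with the offsets that endSourceRange() does report for h7 and h8 (three characters of text after the opening tag, then a five-character closing tag).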