No tracking of closing header tags #1987

@glaforge

I have a use case where I want to cut a long HTML document into sections, following the header tags (h1, h2, etc.).
I'm trying to use position tracking to find the beginning and the end of each header, but it seems only the opening tag is tracked and the closing tag is not, except for non-official header tags (i.e. h1 to h6 are official HTML header tags, but my document also has h7 and h8, which are not part of the HTML specification).

Let's take a concrete example, let's say you have the following snippet:

<h1>title</h1>
<h2 id="abc">abc</h2>
<p>hello</p>
<h5 id="bcd">bcd</h5>
<p>thanks</p>
<h3 id="cde">cde</h3>
<p>hello</p>
<h7 id="def">def</h7>
<p>thanks</p>
<h3 id="efg">efg</h3>
<p>hello</p>
<h8 id="fgh">fgh</h8>
<p>hello</p>

I then enable position tracking on the parser and select the headers:

var doc = Parser.htmlParser().setTrackPosition(true).parseInput(htmlDoc, uri);
var headers = doc.select("h1, h2, h3, h4, h5, h6, h7, h8, h9");

When I print each element's .sourceRange().start() / end() and .endSourceRange().start() / end(), I get the following output:

Start: 1,1:0 End: 1,5:4 <-> Start: -1,-1:-1 End: -1,-1:-1 — h1 — title
Start: 2,1:15 End: 2,14:28 <-> Start: -1,-1:-1 End: -1,-1:-1 — h2 — abc
Start: 4,1:50 End: 4,14:63 <-> Start: -1,-1:-1 End: -1,-1:-1 — h5 — bcd thanks
Start: 6,1:86 End: 6,14:99 <-> Start: -1,-1:-1 End: -1,-1:-1 — h3 — cde
Start: 8,1:121 End: 8,14:134 <-> Start: 8,17:137 End: 8,22:142 — h7 — def
Start: 10,1:157 End: 10,14:170 <-> Start: -1,-1:-1 End: -1,-1:-1 — h3 — efg
Start: 12,1:192 End: 12,14:205 <-> Start: 12,17:208 End: 12,22:213 — h8 — fgh

The opening header tags have correct start/end positions.
But all the closing header tags (except the non-standard ones like h7 and h8) have -1 values, as if they weren't tracked, or as if there were no closing tag at all.

Shouldn't endSourceRange() return real positions (not -1) for these closing tags as well?
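
For the sectioning use case itself, one possible workaround is to rely only on the opening-tag start offsets, which are reported correctly in the output above. The sketch below is not part of jsoup: `HeaderSections` and `split` are hypothetical names. It assumes the number after the colon in each position is a 0-based character offset into the input string, and that the offsets are passed in document order. Each section then runs from one header's start offset to the next header's start offset, or to the end of the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a jsoup class): slices the raw HTML using only
// the start offsets of the opening header tags.
class HeaderSections {

    /**
     * Cuts html into sections. Section i runs from headerStarts[i] up to
     * headerStarts[i + 1], and the last section runs to the end of the input.
     * Assumes the offsets are sorted ascending; text before the first header
     * is ignored.
     */
    static List<String> split(String html, List<Integer> headerStarts) {
        List<String> sections = new ArrayList<>();
        for (int i = 0; i < headerStarts.size(); i++) {
            int from = headerStarts.get(i);
            int to = (i + 1 < headerStarts.size())
                    ? headerStarts.get(i + 1)
                    : html.length();
            sections.add(html.substring(from, to));
        }
        return sections;
    }

    public static void main(String[] args) {
        // First five lines of the snippet from this issue.
        String html = "<h1>title</h1>\n"
                + "<h2 id=\"abc\">abc</h2>\n"
                + "<p>hello</p>\n"
                + "<h5 id=\"bcd\">bcd</h5>\n"
                + "<p>thanks</p>\n";
        // 0, 15 and 50 are the opening-tag start offsets printed for h1, h2 and h5.
        for (String section : split(html, Arrays.asList(0, 15, 50))) {
            System.out.println("--- section ---");
            System.out.print(section);
        }
    }
}
```

With jsoup, the offsets would presumably be collected from each selected header via `sourceRange().start().pos()`. This sidesteps the missing closing-tag positions, but it does not explain them.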

Labels: bug, fixed