HTML API: Roadmap

[HTML issues](https://github.com/WordPress/gutenberg/discussions/57560) | [Refactors](https://github.com/WordPress/gutenberg/discussions/54750) | [Interesting Patches](https://github.com/WordPress/gutenberg/discussions/58755) | [Broader Roadmap](https://github.com/WordPress/gutenberg/issues/60397) | [Plans for 6.7](https://github.com/WordPress/gutenberg/issues/60396) | [Plans for 6.8](https://github.com/WordPress/gutenberg/issues/63037)

| &emsp; :warning: **Note:** This issue was created from the [HTML API: Roadmap discussion](https://github.com/WordPress/gutenberg/discussions/54583).&emsp; &nbsp; |
| --- |

See where this work fits in with [Dennis' broad list of interesting things in #62437](https://github.com/WordPress/gutenberg/issues/62437).

## Proposed HTML Specification Changes

 - https://github.com/whatwg/html/pull/10557
 - https://github.com/whatwg/html/issues/2272

## Related

 - https://github.com/WordPress/gutenberg/issues/64808

## Untriaged plans.
 - [x] Figure out and expose whitespace collapsing rules so that it's possible to make proper HTML-to-text functions without recreating that logic in consuming code.
    - Unfortunately it looks like this is mostly handled through the styling engine. We can pre-process a document to collapse `\r\n` into `\n` and then swap `\r` for `\n`, but beyond that, there aren't specific rules for inter-element whitespace on render. Rendering is governed by a complicated interaction between elements.
    - For instance, we should create a newline for paragraphs, but not if they are the first text content inside of an LI. We could have `<li><p>Stuff</p></li>` and this should only have a single newline.
 - [ ] Parsing rules change in SVG and MathML content. This needs to be understood by both the Tag Processor and the HTML Processor.
 - [ ] Add DOM-like methods: `set_inner_html()`, `get_inner_html()`, `wrap_with()`, `unwrap()`, etc…
    - WordPress/gutenberg/issues/59623
    - WordPress/gutenberg/issues/60046
 - [x] Provide a listener interface for when popping elements off of the stack of open elements.
    - This is necessary to enable a number of behaviors related to identifying an open tag and where it ends.
 - [ ] Provide ability to extend the current document with new chunks of HTML.
    - A retention mode preserves the existing HTML so that `get_updated_html()` returns the full document with all chunks. `extend( string $next_chunk ): ?` makes the internal HTML document bigger and need not return anything.
    - A forgetful mode releases the existing HTML so that `get_updated_html()` will only be able to return the contents of the next chunk. `chunk_slide( string $next_chunk ): string` will extend the document, but will also release as much of the previous document as it can, returning the fully-updated portion of the total HTML document that is no longer reachable by a bookmark.
 - [x] Replace `html_decode_entities()` with a version that follows HTML's rules, particularly surrounding the `;` use and the ambiguous ampersand rule.
    - https://github.com/WordPress/wordpress-develop/pull/6387
 - [ ] Add memory and runtime limits for arbitrarily constrained environments.
    - A memory limit essentially sets a maximum chunk size and forces forgetful streaming mode, but will operate in less memory than is required to load the entire HTML string into memory. Useful for streaming pipelines where it's not required to contain the entire document at once.
    - A runtime limit can halt processing after a given timeout and disable further operations other than `get_updated_html()`.
 - [ ] Create a read-only copy of the Tag/HTML Processor that's safe to pass into filters and functions so that they can read from the document without changing it.
 - [ ] Provide the ability to limit edits to a specified region of a document; allowing _some_ operations, e.g. let a filter modify the attributes on the current tag only.
 - [x] Communicate in the HTML Processor that the semantic rules for the current token imply the creation of DOM nodes at a prior location in the document (active format reconstruction, adoption, fostering, and duplicate HTML and BODY tags).
    - An unexpected `</p>` implicitly creates a `<p>` to form an empty P element, but the HTML Processor will not find the opening P tag because it doesn't exist. A CSS selector for `P` might miss this or any node it's targeting that depends on `nth-child` semantics.
    - Active format reconstruction creates new formatting elements, sometimes at a reasonable distance before the current tag. The HTML Processor will not find these tags because they don't exist.
    - A DIV element as an apparent direct child of a TABLE will be shifted to become a previous sibling of the TABLE element. The HTML Processor will not have seen the DIV before the TABLE, even though a DOM would.
    - A duplicate BODY tag sets attributes on the BODY element which don't already exist. The HTML Processor would mistakenly report the BODY attributes because it didn't know more were to come.
    - These situations may or may not present obstacles for processing code, depending on the context. E.g. if it's not relevant that the BODY inherits new attributes, the processor need not halt. In some situations it will be essential: e.g. "replace all empty paragraphs with some content" needs to find that implicit empty P element; "add an attribute to every B element" needs to find the implicit B tag that is created later during reconstruction.
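
The newline pre-processing mentioned in the whitespace item at the top of this list can be sketched in a few lines. This is an illustration in Python rather than HTML API code, and `normalize_newlines` is a hypothetical name. It only covers the `\r\n` and `\r` handling, not render-time whitespace collapsing.

```python
def normalize_newlines(html: str) -> str:
    """Collapse CRLF pairs into LF, then turn any remaining lone CR into LF."""
    # Order matters: replacing lone "\r" first would turn "\r\n" into "\n\n".
    return html.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_newlines("a\r\nb\rc\nd")))  # 'a\nb\nc\nd'
```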

## Tasks

### Bug fixes and quality

 - We need to defer applying enqueued edits as much as we can. When we removed that optimization in order to make the code simpler, we overlooked that on documents with many edits, this could lead to cataclysmic runtime overhead both in processing and memory. The workaround is to track edits externally and apply them all at once. The mechanism is that when applying an edit we copy the entire document, so if we have 50 edits we copy the entire document 50 times. With deferred updates, we only copy the entire document once when applying all of them in one go. [WordPress/wordpress-develop#6120]
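
The cost difference described above can be made concrete with a small model. This is a sketch in Python, not the Tag Processor's actual code: `Edit`, `apply_eagerly`, and `apply_deferred` are hypothetical names, edits are assumed not to overlap, and "copied" simply counts the characters written each time the document is rebuilt.

```python
from typing import NamedTuple


class Edit(NamedTuple):
    start: int   # offset into the ORIGINAL document
    length: int  # number of original characters to replace
    text: str    # replacement text


def apply_eagerly(html: str, edits: list[Edit]) -> tuple[str, int]:
    """Apply edits one at a time; every edit rebuilds the whole document."""
    copied = 0
    shift = 0  # how far later offsets have moved because of earlier edits
    for edit in sorted(edits):
        at = edit.start + shift
        html = html[:at] + edit.text + html[at + edit.length:]
        copied += len(html)  # one full copy per edit
        shift += len(edit.text) - edit.length
    return html, copied


def apply_deferred(html: str, edits: list[Edit]) -> tuple[str, int]:
    """Apply all edits in a single pass; the document is copied once."""
    parts = []
    cursor = 0
    for edit in sorted(edits):
        parts.append(html[cursor:edit.start])  # untouched span before the edit
        parts.append(edit.text)
        cursor = edit.start + edit.length
    parts.append(html[cursor:])
    result = "".join(parts)
    return result, len(result)


# Fifty insertions into a document of fifty <p> tags.
document = "<p>" * 50
edits = [Edit(i * 3 + 2, 0, ' class="x"') for i in range(50)]
eager_html, eager_copied = apply_eagerly(document, edits)
deferred_html, deferred_copied = apply_deferred(document, edits)
print(eager_html == deferred_html, eager_copied, deferred_copied)
```

Both functions produce the same output; only the amount of copying differs, and the gap widens as the document and the number of edits grow.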

## Waiting review and merge

<details open><summary>Future releases</summary>

 - Provide new filtering pipelines for final rendered HTML. Core-43258

</details>

<details open><summary>WordPress 6.6</summary>

 - Safe get/set inner HTML
 - HTML Templating for safe HTML generation.
 - New render-pipeline filter replacing Core functionality:
    - smilies/emojify
    - `capital_P_dangit`
 - Core refactors for HTML processing
    - `wp_strip_all_tags()`
    - `force_balance_tags()` rewritten as "serialize this HTML"

</details>

<details open><summary>WordPress 6.5</summary>

 - Support in the HTML Processor for most common tags IN BODY.
 - _Scan all tokens_ in the Tag Processor to enable modifying HTML structure.
 - <s>HTML Templating for safe HTML generation.</s>
 - Establish test suite of real posts and websites against which to run the HTML Processor and report progress.
 - Run the html5lib tests against the HTML Processor in WordPress CI suite.

</details>

<details><summary>WordPress 6.4</summary>

## Merged and bound for 6.4

 - WordPress/wordpress-develop#5048
 - WordPress/wordpress-develop#5126
 - WordPress/wordpress-develop#5127
 - WordPress/wordpress-develop#5145
 - WordPress/wordpress-develop#5252
 - WordPress/wordpress-develop#4317
 - WordPress/wordpress-develop#5096
 - WordPress/wordpress-develop#5243

## In progress for WordPress 6.4

 - Support for elements in `IN BODY` mode.
 - WordPress/wordpress-develop#5325
 - This probably won't merge, but parts of it might go out. We want to remove code that passes several values where a single value is expected. The issue might seem pedantic, but this can cause defects in `class` handling, and we need to find the right way to resolve it.

## Plans for post-6.4 merges

 - Need to refactor the Tag Processor to think more about "tokens" than "tags" internally so that we can stop on comments and other non-tag tokens. This will not only support "funky comments" but is also necessary for work like the `wp_strip_tags()` and `truncate_html()` functionality, which needs to read plaintext content of markup (which needs to ignore comment and other meta content).
 - Focus on adding HTML templating so that the HTML API becomes useful for safe HTML generation in all the places we're currently forgetting to escape attributes and the like.

</details>

## PRs to revisit

 - https://github.com/WordPress/gutenberg/pull/51273 - see if we can unwind the unsafe concatenation of HTML inside the constructor
 - behaviors/lightbox - see if we can remove using multiple instances

## Areas of active exploration

 - Expose "original raw tag" for backwards compatibility with filters that expect spans of the HTML document from functions such as `wp_kses_hair()`. [WordPress/wordpress-develop#5143]
    - This is probably best left for an internal Core class, which is also needed for several of Core's cleanup tasks, tasks that need to examine raw markup and avoid making needless changes.
 - Add `set_raw_inner_markup()` and `get_raw_inner_markup()` (or not, if it's not the right interface). [WordPress/wordpress-develop#4956]
 - A new `wp_strip_tags()` function/approach that only parses as much HTML as is necessary. [WordPress/wordpress-develop#5208]
 - Allow extending the input document for more strict _streaming_ work. [WordPress/wordpress-develop#5050]

## HTML Templating

Provide a means to generate HTML conveniently with placeholders. The placeholders should be "funky comments" that mirror array values passed in to the rendering function. This will/should form the basis for raw HTML templating, replacing inner contents, powering Bits so that we can apply heuristics to the replacement markup, and more.
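
As a rough illustration of the idea, the sketch below fills placeholders from a dictionary and escapes every value on the way in. It is written in Python for brevity and is not the HTML API: the `</%name>` placeholder shape and the `render` function are assumptions made for this example, and a regular expression stands in for real HTML parsing, so it cannot apply context-aware rules the way a parser-based implementation could.

```python
import html
import re

# Assumed placeholder shape: a "funky comment" such as </%name>.
PLACEHOLDER = re.compile(r"</%([A-Za-z_][A-Za-z0-9_]*)>")


def render(template: str, values: dict[str, str]) -> str:
    """Replace each placeholder with the escaped value of the same name."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            return match.group(0)  # leave unknown placeholders untouched
        # Escape &, <, >, and both quote characters before inserting.
        return html.escape(values[name], quote=True)

    return PLACEHOLDER.sub(substitute, template)


print(render('<a href="</%url>"></%label></a>', {"url": "/?a=1&b=2", "label": "<Docs>"}))
# <a href="/?a=1&amp;b=2">&lt;Docs&gt;</a>
```

Because escaping happens inside `render`, callers cannot forget it, which is the safety property the templating work is after.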
