Improve the parsing of remote url details in the URL Details endpoint to REST API #28791

getdave · 2021-02-05T16:36:22Z

Builds on the original implementation from #18042 to add a better mechanism for parsing the information from the remote URL's response body HTML.

This helps to address feedback such as #27762 and also sets things up nicely for parsing additional detail from the remote URL (e.g. favicon, Open Graph meta, etc.).

This PR focuses solely on utilising DOMDocument and friends to improve the parsing mechanic.

Automated Testing Instructions

npm run test-php should run the unit tests which reside in phpunit/class-wp-rest-url-details-controller-test.php.

Manual Testing Instructions

Check out this PR.
Run npm run wp-env start to boot the local testing env.
Open the Gutenberg local testing env at http://localhost:8889/wp-admin/ and log in.
Create a new Post.
Choose one of the following options:

Option 1

Open devtools and in the console enter:

wp.apiFetch( { path: '/__experimental/url-details/?url=https://wordpress.org' } ).then( data => {
    console.log( data );
} );

...also try some non-English websites:

wp.apiFetch( { path: '/__experimental/url-details/?url=http://www.baidu.com' } ).then( data => {
    console.log( data );
} );

You'll see the response returned there.

Option 2

Open devtools and in the console enter wpApiSettings.nonce. You should see a valid nonce returned as a string.
Copy the nonce.
In your browser go to: http://localhost:8889/wp-json/__experimental/url-details/?url=https://wordpress.org&_wpnonce=%YOUR_NONCE_HERE% - be sure to replace the nonce value with the one copied from the previous step.
You should see a valid REST API response containing the contents of the title tag from wordpress.org (ie: "Blog Tool, Publishing Platform, and CMS \u2014 WordPress.org").
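
For anyone who would rather not hand-edit the Option 2 URL, here is a minimal sketch of how it could be assembled. The helper name buildUrlDetailsRequest is hypothetical and not part of this PR; the endpoint path, the url parameter and the _wpnonce parameter are taken from the steps above. It relies only on the standard URL API, which also takes care of encoding the remote URL.

```javascript
// Hypothetical helper (not part of the PR): assembles the Option 2 request URL.
function buildUrlDetailsRequest( origin, targetUrl, nonce ) {
	// Endpoint path as given in the manual testing steps.
	const endpoint = new URL( '/wp-json/__experimental/url-details/', origin );

	// searchParams.set() percent-encodes the values, so the remote URL
	// cannot be confused with the endpoint's own query string.
	endpoint.searchParams.set( 'url', targetUrl );
	endpoint.searchParams.set( '_wpnonce', nonce );

	return endpoint.toString();
}
```

In the browser console this could be called as buildUrlDetailsRequest( window.location.origin, 'https://wordpress.org', wpApiSettings.nonce ) and the result pasted into the address bar.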

Types of changes

New feature (non-breaking change which adds functionality)

Checklist:

My code is tested.
My code follows the WordPress code style.
My code follows the accessibility standards.
My code has proper inline documentation.
I've included developer documentation if appropriate.
I've updated all React Native files affected by any refactorings/renamings in this PR.

github-actions · 2021-02-05T16:46:48Z

Size Change: 0 B

Total Size: 1.47 MB

ℹ️ View Unchanged

Filename	Size	Change
`build/a11y/index.js`	1.14 kB	0 B
`build/annotations/index.js`	3.78 kB	0 B
`build/api-fetch/index.js`	3.41 kB	0 B
`build/autop/index.js`	2.83 kB	0 B
`build/blob/index.js`	664 B	0 B
`build/block-directory/index.js`	8.62 kB	0 B
`build/block-directory/style-rtl.css`	1 kB	0 B
`build/block-directory/style.css`	1.01 kB	0 B
`build/block-editor/index.js`	131 kB	0 B
`build/block-editor/style-rtl.css`	13 kB	0 B
`build/block-editor/style.css`	13 kB	0 B
`build/block-library/blocks/archives/editor-rtl.css`	61 B	0 B
`build/block-library/blocks/archives/editor.css`	60 B	0 B
`build/block-library/blocks/audio/editor-rtl.css`	58 B	0 B
`build/block-library/blocks/audio/editor.css`	58 B	0 B
`build/block-library/blocks/audio/style-rtl.css`	112 B	0 B
`build/block-library/blocks/audio/style.css`	112 B	0 B
`build/block-library/blocks/block/editor-rtl.css`	161 B	0 B
`build/block-library/blocks/block/editor.css`	161 B	0 B
`build/block-library/blocks/button/editor-rtl.css`	475 B	0 B
`build/block-library/blocks/button/editor.css`	474 B	0 B
`build/block-library/blocks/button/style-rtl.css`	503 B	0 B
`build/block-library/blocks/button/style.css`	503 B	0 B
`build/block-library/blocks/buttons/editor-rtl.css`	315 B	0 B
`build/block-library/blocks/buttons/editor.css`	315 B	0 B
`build/block-library/blocks/buttons/style-rtl.css`	368 B	0 B
`build/block-library/blocks/buttons/style.css`	368 B	0 B
`build/block-library/blocks/calendar/style-rtl.css`	208 B	0 B
`build/block-library/blocks/calendar/style.css`	208 B	0 B
`build/block-library/blocks/categories/editor-rtl.css`	84 B	0 B
`build/block-library/blocks/categories/editor.css`	83 B	0 B
`build/block-library/blocks/categories/style-rtl.css`	79 B	0 B
`build/block-library/blocks/categories/style.css`	79 B	0 B
`build/block-library/blocks/code/style-rtl.css`	90 B	0 B
`build/block-library/blocks/code/style.css`	90 B	0 B
`build/block-library/blocks/columns/editor-rtl.css`	190 B	0 B
`build/block-library/blocks/columns/editor.css`	190 B	0 B
`build/block-library/blocks/columns/style-rtl.css`	436 B	0 B
`build/block-library/blocks/columns/style.css`	435 B	0 B
`build/block-library/blocks/cover/editor-rtl.css`	605 B	0 B
`build/block-library/blocks/cover/editor.css`	605 B	0 B
`build/block-library/blocks/cover/style-rtl.css`	1.23 kB	0 B
`build/block-library/blocks/cover/style.css`	1.23 kB	0 B
`build/block-library/blocks/embed/editor-rtl.css`	486 B	0 B
`build/block-library/blocks/embed/editor.css`	486 B	0 B
`build/block-library/blocks/embed/style-rtl.css`	401 B	0 B
`build/block-library/blocks/embed/style.css`	400 B	0 B
`build/block-library/blocks/file/editor-rtl.css`	301 B	0 B
`build/block-library/blocks/file/editor.css`	300 B	0 B
`build/block-library/blocks/file/frontend.js`	765 B	0 B
`build/block-library/blocks/file/style-rtl.css`	255 B	0 B
`build/block-library/blocks/file/style.css`	255 B	0 B
`build/block-library/blocks/freeform/editor-rtl.css`	2.44 kB	0 B
`build/block-library/blocks/freeform/editor.css`	2.44 kB	0 B
`build/block-library/blocks/gallery/editor-rtl.css`	704 B	0 B
`build/block-library/blocks/gallery/editor.css`	705 B	0 B
`build/block-library/blocks/gallery/style-rtl.css`	1.09 kB	0 B
`build/block-library/blocks/gallery/style.css`	1.09 kB	0 B
`build/block-library/blocks/group/editor-rtl.css`	160 B	0 B
`build/block-library/blocks/group/editor.css`	160 B	0 B
`build/block-library/blocks/group/style-rtl.css`	57 B	0 B
`build/block-library/blocks/group/style.css`	57 B	0 B
`build/block-library/blocks/heading/editor-rtl.css`	129 B	0 B
`build/block-library/blocks/heading/editor.css`	129 B	0 B
`build/block-library/blocks/heading/style-rtl.css`	76 B	0 B
`build/block-library/blocks/heading/style.css`	76 B	0 B
`build/block-library/blocks/html/editor-rtl.css`	281 B	0 B
`build/block-library/blocks/html/editor.css`	281 B	0 B
`build/block-library/blocks/image/editor-rtl.css`	717 B	0 B
`build/block-library/blocks/image/editor.css`	716 B	0 B
`build/block-library/blocks/image/style-rtl.css`	476 B	0 B
`build/block-library/blocks/image/style.css`	478 B	0 B
`build/block-library/blocks/latest-comments/style-rtl.css`	281 B	0 B
`build/block-library/blocks/latest-comments/style.css`	282 B	0 B
`build/block-library/blocks/latest-posts/editor-rtl.css`	137 B	0 B
`build/block-library/blocks/latest-posts/editor.css`	137 B	0 B
`build/block-library/blocks/latest-posts/style-rtl.css`	523 B	0 B
`build/block-library/blocks/latest-posts/style.css`	522 B	0 B
`build/block-library/blocks/legacy-widget/editor-rtl.css`	398 B	0 B
`build/block-library/blocks/legacy-widget/editor.css`	399 B	0 B
`build/block-library/blocks/list/style-rtl.css`	63 B	0 B
`build/block-library/blocks/list/style.css`	63 B	0 B
`build/block-library/blocks/media-text/editor-rtl.css`	191 B	0 B
`build/block-library/blocks/media-text/editor.css`	191 B	0 B
`build/block-library/blocks/media-text/style-rtl.css`	535 B	0 B
`build/block-library/blocks/media-text/style.css`	532 B	0 B
`build/block-library/blocks/more/editor-rtl.css`	434 B	0 B
`build/block-library/blocks/more/editor.css`	434 B	0 B
`build/block-library/blocks/navigation-link/editor-rtl.css`	597 B	0 B
`build/block-library/blocks/navigation-link/editor.css`	597 B	0 B
`build/block-library/blocks/navigation-link/style-rtl.css`	1.07 kB	0 B
`build/block-library/blocks/navigation-link/style.css`	1.07 kB	0 B
`build/block-library/blocks/navigation/editor-rtl.css`	1.24 kB	0 B
`build/block-library/blocks/navigation/editor.css`	1.24 kB	0 B
`build/block-library/blocks/navigation/style-rtl.css`	272 B	0 B
`build/block-library/blocks/navigation/style.css`	271 B	0 B
`build/block-library/blocks/nextpage/editor-rtl.css`	395 B	0 B
`build/block-library/blocks/nextpage/editor.css`	395 B	0 B
`build/block-library/blocks/page-list/editor-rtl.css`	239 B	0 B
`build/block-library/blocks/page-list/editor.css`	240 B	0 B
`build/block-library/blocks/page-list/style-rtl.css`	167 B	0 B
`build/block-library/blocks/page-list/style.css`	167 B	0 B
`build/block-library/blocks/paragraph/editor-rtl.css`	157 B	0 B
`build/block-library/blocks/paragraph/editor.css`	157 B	0 B
`build/block-library/blocks/paragraph/style-rtl.css`	247 B	0 B
`build/block-library/blocks/paragraph/style.css`	248 B	0 B
`build/block-library/blocks/post-author/editor-rtl.css`	209 B	0 B
`build/block-library/blocks/post-author/editor.css`	209 B	0 B
`build/block-library/blocks/post-author/style-rtl.css`	183 B	0 B
`build/block-library/blocks/post-author/style.css`	184 B	0 B
`build/block-library/blocks/post-comments-form/style-rtl.css`	250 B	0 B
`build/block-library/blocks/post-comments-form/style.css`	250 B	0 B
`build/block-library/blocks/post-content/editor-rtl.css`	139 B	0 B
`build/block-library/blocks/post-content/editor.css`	139 B	0 B
`build/block-library/blocks/post-excerpt/editor-rtl.css`	73 B	0 B
`build/block-library/blocks/post-excerpt/editor.css`	73 B	0 B
`build/block-library/blocks/post-excerpt/style-rtl.css`	69 B	0 B
`build/block-library/blocks/post-excerpt/style.css`	69 B	0 B
`build/block-library/blocks/post-featured-image/editor-rtl.css`	338 B	0 B
`build/block-library/blocks/post-featured-image/editor.css`	338 B	0 B
`build/block-library/blocks/post-featured-image/style-rtl.css`	100 B	0 B
`build/block-library/blocks/post-featured-image/style.css`	100 B	0 B
`build/block-library/blocks/post-title/style-rtl.css`	60 B	0 B
`build/block-library/blocks/post-title/style.css`	60 B	0 B
`build/block-library/blocks/preformatted/style-rtl.css`	103 B	0 B
`build/block-library/blocks/preformatted/style.css`	103 B	0 B
`build/block-library/blocks/pullquote/editor-rtl.css`	183 B	0 B
`build/block-library/blocks/pullquote/editor.css`	183 B	0 B
`build/block-library/blocks/pullquote/style-rtl.css`	318 B	0 B
`build/block-library/blocks/pullquote/style.css`	318 B	0 B
`build/block-library/blocks/query-loop/editor-rtl.css`	83 B	0 B
`build/block-library/blocks/query-loop/editor.css`	82 B	0 B
`build/block-library/blocks/query-loop/style-rtl.css`	315 B	0 B
`build/block-library/blocks/query-loop/style.css`	317 B	0 B
`build/block-library/blocks/query-pagination-numbers/editor-rtl.css`	122 B	0 B
`build/block-library/blocks/query-pagination-numbers/editor.css`	121 B	0 B
`build/block-library/blocks/query-pagination/editor-rtl.css`	270 B	0 B
`build/block-library/blocks/query-pagination/editor.css`	262 B	0 B
`build/block-library/blocks/query-pagination/style-rtl.css`	168 B	0 B
`build/block-library/blocks/query-pagination/style.css`	168 B	0 B
`build/block-library/blocks/query-title/editor-rtl.css`	86 B	0 B
`build/block-library/blocks/query-title/editor.css`	86 B	0 B
`build/block-library/blocks/query/editor-rtl.css`	131 B	0 B
`build/block-library/blocks/query/editor.css`	132 B	0 B
`build/block-library/blocks/quote/style-rtl.css`	169 B	0 B
`build/block-library/blocks/quote/style.css`	169 B	0 B
`build/block-library/blocks/rss/editor-rtl.css`	201 B	0 B
`build/block-library/blocks/rss/editor.css`	202 B	0 B
`build/block-library/blocks/rss/style-rtl.css`	290 B	0 B
`build/block-library/blocks/rss/style.css`	290 B	0 B
`build/block-library/blocks/search/editor-rtl.css`	189 B	0 B
`build/block-library/blocks/search/editor.css`	189 B	0 B
`build/block-library/blocks/search/style-rtl.css`	359 B	0 B
`build/block-library/blocks/search/style.css`	362 B	0 B
`build/block-library/blocks/separator/editor-rtl.css`	99 B	0 B
`build/block-library/blocks/separator/editor.css`	99 B	0 B
`build/block-library/blocks/separator/style-rtl.css`	251 B	0 B
`build/block-library/blocks/separator/style.css`	251 B	0 B
`build/block-library/blocks/shortcode/editor-rtl.css`	512 B	0 B
`build/block-library/blocks/shortcode/editor.css`	512 B	0 B
`build/block-library/blocks/site-logo/editor-rtl.css`	440 B	0 B
`build/block-library/blocks/site-logo/editor.css`	441 B	0 B
`build/block-library/blocks/site-logo/style-rtl.css`	154 B	0 B
`build/block-library/blocks/site-logo/style.css`	154 B	0 B
`build/block-library/blocks/social-link/editor-rtl.css`	164 B	0 B
`build/block-library/blocks/social-link/editor.css`	165 B	0 B
`build/block-library/blocks/social-links/editor-rtl.css`	796 B	0 B
`build/block-library/blocks/social-links/editor.css`	795 B	0 B
`build/block-library/blocks/social-links/style-rtl.css`	1.32 kB	0 B
`build/block-library/blocks/social-links/style.css`	1.33 kB	0 B
`build/block-library/blocks/spacer/editor-rtl.css`	308 B	0 B
`build/block-library/blocks/spacer/editor.css`	308 B	0 B
`build/block-library/blocks/spacer/style-rtl.css`	48 B	0 B
`build/block-library/blocks/spacer/style.css`	48 B	0 B
`build/block-library/blocks/table/editor-rtl.css`	478 B	0 B
`build/block-library/blocks/table/editor.css`	478 B	0 B
`build/block-library/blocks/table/style-rtl.css`	402 B	0 B
`build/block-library/blocks/table/style.css`	402 B	0 B
`build/block-library/blocks/tag-cloud/editor-rtl.css`	118 B	0 B
`build/block-library/blocks/tag-cloud/editor.css`	118 B	0 B
`build/block-library/blocks/tag-cloud/style-rtl.css`	94 B	0 B
`build/block-library/blocks/tag-cloud/style.css`	94 B	0 B
`build/block-library/blocks/template-part/editor-rtl.css`	552 B	0 B
`build/block-library/blocks/template-part/editor.css`	551 B	0 B
`build/block-library/blocks/term-description/editor-rtl.css`	90 B	0 B
`build/block-library/blocks/term-description/editor.css`	90 B	0 B
`build/block-library/blocks/text-columns/editor-rtl.css`	95 B	0 B
`build/block-library/blocks/text-columns/editor.css`	95 B	0 B
`build/block-library/blocks/text-columns/style-rtl.css`	166 B	0 B
`build/block-library/blocks/text-columns/style.css`	166 B	0 B
`build/block-library/blocks/verse/style-rtl.css`	87 B	0 B
`build/block-library/blocks/verse/style.css`	87 B	0 B
`build/block-library/blocks/video/editor-rtl.css`	568 B	0 B
`build/block-library/blocks/video/editor.css`	569 B	0 B
`build/block-library/blocks/video/style-rtl.css`	173 B	0 B
`build/block-library/blocks/video/style.css`	173 B	0 B
`build/block-library/common-rtl.css`	1.31 kB	0 B
`build/block-library/common.css`	1.31 kB	0 B
`build/block-library/editor-rtl.css`	9.47 kB	0 B
`build/block-library/editor.css`	9.46 kB	0 B
`build/block-library/index.js`	153 kB	0 B
`build/block-library/reset-rtl.css`	502 B	0 B
`build/block-library/reset.css`	503 B	0 B
`build/block-library/style-rtl.css`	9.44 kB	0 B
`build/block-library/style.css`	9.44 kB	0 B
`build/block-library/theme-rtl.css`	692 B	0 B
`build/block-library/theme.css`	693 B	0 B
`build/block-serialization-default-parser/index.js`	1.87 kB	0 B
`build/block-serialization-spec-parser/index.js`	3.06 kB	0 B
`build/blocks/index.js`	48.7 kB	0 B
`build/components/index.js`	285 kB	0 B
`build/components/style-rtl.css`	16.2 kB	0 B
`build/components/style.css`	16.2 kB	0 B
`build/compose/index.js`	11.6 kB	0 B
`build/core-data/index.js`	17 kB	0 B
`build/customize-widgets/index.js`	8.27 kB	0 B
`build/customize-widgets/style-rtl.css`	666 B	0 B
`build/customize-widgets/style.css`	667 B	0 B
`build/data-controls/index.js`	836 B	0 B
`build/data/index.js`	9.17 kB	0 B
`build/date/index.js`	31.9 kB	0 B
`build/deprecated/index.js`	787 B	0 B
`build/dom-ready/index.js`	576 B	0 B
`build/dom/index.js`	5.12 kB	0 B
`build/edit-navigation/index.js`	17.1 kB	0 B
`build/edit-navigation/style-rtl.css`	2.86 kB	0 B
`build/edit-navigation/style.css`	2.86 kB	0 B
`build/edit-post/classic-rtl.css`	454 B	0 B
`build/edit-post/classic.css`	454 B	0 B
`build/edit-post/index.js`	339 kB	0 B
`build/edit-post/style-rtl.css`	6.96 kB	0 B
`build/edit-post/style.css`	6.95 kB	0 B
`build/edit-site/index.js`	28.9 kB	0 B
`build/edit-site/style-rtl.css`	4.9 kB	0 B
`build/edit-site/style.css`	4.89 kB	0 B
`build/edit-widgets/index.js`	16.7 kB	0 B
`build/edit-widgets/style-rtl.css`	2.97 kB	0 B
`build/edit-widgets/style.css`	2.98 kB	0 B
`build/editor/index.js`	42.6 kB	0 B
`build/editor/style-rtl.css`	3.9 kB	0 B
`build/editor/style.css`	3.9 kB	0 B
`build/element/index.js`	4.62 kB	0 B
`build/escape-html/index.js`	735 B	0 B
`build/format-library/index.js`	6.77 kB	0 B
`build/format-library/style-rtl.css`	637 B	0 B
`build/format-library/style.css`	639 B	0 B
`build/hooks/index.js`	2.28 kB	0 B
`build/html-entities/index.js`	622 B	0 B
`build/i18n/index.js`	4.04 kB	0 B
`build/is-shallow-equal/index.js`	699 B	0 B
`build/keyboard-shortcuts/index.js`	2.53 kB	0 B
`build/keycodes/index.js`	1.95 kB	0 B
`build/list-reusable-blocks/index.js`	3.19 kB	0 B
`build/list-reusable-blocks/style-rtl.css`	629 B	0 B
`build/list-reusable-blocks/style.css`	628 B	0 B
`build/media-utils/index.js`	5.39 kB	0 B
`build/notices/index.js`	1.85 kB	0 B
`build/nux/index.js`	3.42 kB	0 B
`build/nux/style-rtl.css`	731 B	0 B
`build/nux/style.css`	727 B	0 B
`build/plugins/index.js`	2.95 kB	0 B
`build/primitives/index.js`	1.42 kB	0 B
`build/priority-queue/index.js`	791 B	0 B
`build/react-i18n/index.js`	1.45 kB	0 B
`build/redux-routine/index.js`	2.84 kB	0 B
`build/reusable-blocks/index.js`	3.8 kB	0 B
`build/reusable-blocks/style-rtl.css`	225 B	0 B
`build/reusable-blocks/style.css`	225 B	0 B
`build/rich-text/index.js`	13.5 kB	0 B
`build/server-side-render/index.js`	2.6 kB	0 B
`build/shortcode/index.js`	1.7 kB	0 B
`build/token-list/index.js`	1.27 kB	0 B
`build/url/index.js`	3.01 kB	0 B
`build/viewport/index.js`	1.85 kB	0 B
`build/warning/index.js`	1.14 kB	0 B
`build/wordcount/index.js`	1.22 kB	0 B

compressed-size-action

getdave · 2021-02-05T16:55:03Z

cc @beaulebens You might be interested in this as a follow up to #18042 (comment)

getdave · 2021-02-12T22:09:10Z

I'd like some feedback on how we're suppressing errors that can be generated by DOMDocument::loadHTML. For example if you include <section> tags in the response data then it will throw an error. We are suppressing the error which doesn't seem to cause any problems and we can still parse the data.

getdave · 2021-02-13T10:25:46Z

lib/class-wp-rest-url-details-controller.php

@@ -137,6 +143,39 @@ public function parse_url_details( $request ) {
 		return apply_filters( 'rest_prepare_url_details', $response, $url, $request, $remote_url_response );


Note to self. Pass the $xpath value to the filter so folks can use it to query additional parts of response.

TimothyBJacobs · 2021-02-16T02:57:26Z

I'm not sure I have enough experience with DOMDocument to give specific feedback here.

lib/class-wp-rest-url-details-controller.php

getdave · 2021-04-22T12:10:05Z

Picking this up again.


hellofromtonya · 2021-04-23T17:50:04Z

DOMDocument is a powerful HTML parser but presents significant problems for WordPress sites:

Hosting: the following extensions are required: DOM, libxml, and iconv
Guarding could be added to protect against missing extensions. However, an alternative parsing mechanism would be needed to parse these webpages.
Libxml versions before 2.8.0 have a known bug: HTML parser error with <noscript> in the <head>
A transformation can be included to handle these instances.
While DOMDocument autofixes for malformed HTML, it is not accurate with closing nested divs, when an inner child div is missing its closing tag. Rather, it places the missing closing tag in the wrong place, changing the element relationships and structure.
This last problem is the most serious. Why? A missing inner closing div causes changes to the document's structure. See the problem in action here https://3v4l.org/ijrfW

Story time:
During my time at my last company, we rolled out a solution using DOMDocument. We discovered over 30% of the webpages parsed through it were badly malformed and caused inaccurate DOM building and, worst yet, broken webpages after processing.

Not a problem if:

Accuracy of inner div structures is not required and the Document is not converted back to HTML (i.e. via DOMDocument::saveHTML or other methods)
Only well-formed webpages are allowed

However, if the point is to ensure proper HTML parsing for processing the elements, the accuracy of autofixing the HTML needs to be improved, especially for the missing inner closing div.

getdave · 2021-04-23T18:53:15Z

@hellofromtonya Thanks for the detailed explanation. This isn't something I had considered and so it's extremely helpful to have this context. Much appreciated.

However, if the point is to ensure proper HTML parsing for processing the elements, the accuracy of autofixing the HTML needs to be improved, especially for the missing inner closing div.

What I will say is that this is definitely being used for progressive enhancement of the UI. It is not critical functionality. All we are doing is attempting to parse the remote website to gain some metadata to display to the user when they enter a valid link. If the parsing fails for any reason then it is absolutely fine and the worst the user will see will be the fallback UI for the link (which is what they see currently in the block editor).

I suppose there is an argument that if this endpoint exists then users might try to use it for more detailed parsing of a remote URL, but that is not its intent. Perhaps we could document as such or limit the response payload size to avoid folks using it as a scraper? This endpoint only runs in the admin if you have suitable permissions.

Libxml versions before 2.8.0 have a known bug: HTML parser error with <noscript> in the <head>
A transformation can be included to handle these instances.

I can look into this.

Hosting: the following extensions are required: DOM, libxml, and iconv
Guarding could be added to protect against missing extensions. However, an alternative parsing mechanism would be needed to parse these webpages.

I assume if we test for these extensions and bail if they don't cut the mustard then that's ok? Again, as this is progressive enhancement we can just not return any data and the block editor will provide fallback. The user won't really notice.

With the above context would you still say this approach is a no go?

hellofromtonya · 2021-04-23T19:44:32Z

Hey @getdave, thanks for providing more context for its use. Not a "no go" yet. Extracting metadata such as the <title> is doable with DOMDocument.

What type of metadata will be extracted?

The PR is using xpath to find the document's <title></title> element. This particular element could be fetched using regex instead, which would be less code, faster code (parsing the HTML takes time especially for larger web pages), and without the server setup and encoding issues with DOMDocument.

Will more metadata be extracted from the HTML?

getdave · 2021-04-26T08:11:41Z

Will more metadata be extracted from the HTML?

Yes. Moreover, it is possible to use a filter hook on the endpoint and parse any data you want from the remote URL response.

What type of metadata will be extracted?

The ultimate goal would be for the default data set to include:

<title> contents
site icon - eg: favicon...etc
meta description

The PR is using xpath to find the document's <title></title> element. This particular element could be fetched using regex instead, which would be less code, faster code (parsing the HTML takes time especially for larger web pages), and without the server setup and encoding issues with DOMDocument.

Ironically using regex is what the endpoint currently does to get the <title>:

gutenberg/lib/class-wp-rest-url-details-controller.php

Lines 210 to 216 in 164bef8

    
    private function get_title( $html ) {
        preg_match( '|<title>([^<]*?)</title>|is', $html, $match_title );

        $title = isset( $match_title[1] ) ? trim( $match_title[1] ) : '';

        return $title;
    }

I wrote it and used regex and folks suggested using a more reliable mechanism if I wanted to extract more complex data which is why I raised this follow-up.

Ultimately would you advise that we ditch DOMDocument and just use regex for the parsing?

cc @swissspidy who has been using DOMDocument and may have suggested that approach.

getdave · 2021-05-04T11:40:38Z

Ok we're going to take this in a new direction and use regex as the simplest way to parse out the markup we need. As the endpoint is extensible, folks can still choose to use more advanced utilities for their own purposes but we shouldn't ship this as part of core.

Let's also only grab the <head> portion of the DOM to avoid having to parse a potentially massive string of HTML in regex.
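
To make the idea concrete, here is a rough sketch of the two-step approach: first cut the response body down to the <head> section, then run the title regex against that smaller string. It is written in JavaScript purely so it can be tried in a console; the function names getDocumentHead and getTitle are hypothetical and this is not the code that shipped in the follow-up PR.

```javascript
// Sketch only: hypothetical names, not taken from the Gutenberg codebase.

// Returns the markup between <head> and </head>, or '' if there is none.
function getDocumentHead( html ) {
	// (?:\s[^>]*)? allows attributes on <head> without also matching <header>.
	const match = html.match( /<head(?:\s[^>]*)?>([\s\S]*?)<\/head>/i );
	return match ? match[ 1 ] : '';
}

// Returns the trimmed contents of the first <title> element, or ''.
function getTitle( head ) {
	const match = head.match( /<title(?:\s[^>]*)?>([^<]*?)<\/title>/i );
	return match ? match[ 1 ].trim() : '';
}
```

Calling getTitle( getDocumentHead( html ) ) means a stray <title> inside the body (an inline SVG, for example) is never considered, and the regex never has to scan the full page.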

swissspidy · 2021-05-12T14:51:46Z

No strong opinion here as long as there is a reasonable (& documented) way for plugins to replace current parsing with DOMDocument if they want to.

getdave · 2021-05-12T15:10:51Z

Just putting this here so I don't forget it:

^(?=.*href="|\'(.*\.ico.*?)"|\')(?=.*rel="|\'(shortcut|icon)"|\').*$
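
A possible alternative to a single lookahead-based pattern, sketched here in JavaScript so it can be tried in a console: match each <link> tag first, then read its rel and href attributes separately. This handles either quote style and values such as rel="shortcut icon". The function name getIconHref is hypothetical, and the sketch assumes the input is the <head> markup only.

```javascript
// Sketch only: hypothetical helper, not the implementation from this PR.
// Returns the href of the first <link> whose rel contains the "icon" token.
function getIconHref( head ) {
	const linkTags = head.match( /<link\b[^>]*>/gi ) || [];

	for ( const tag of linkTags ) {
		// (["']) captures the opening quote and \1 requires the same closing quote.
		const rel = tag.match( /\srel\s*=\s*(["'])(.*?)\1/i );
		const href = tag.match( /\shref\s*=\s*(["'])(.*?)\1/i );
		if ( ! rel || ! href ) {
			continue;
		}

		// rel is a space-separated token list, e.g. "shortcut icon".
		const relTokens = rel[ 2 ].toLowerCase().split( /\s+/ );
		if ( relTokens.includes( 'icon' ) ) {
			return href[ 2 ];
		}
	}

	return '';
}
```

Checking the rel tokens rather than looking for ".ico" in the href also picks up PNG or SVG icons, which may or may not be wanted here.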

getdave · 2021-05-12T15:34:03Z

Closing in favour of #31763

getdave requested review from swissspidy, TimothyBJacobs and obenland February 5, 2021 16:36

getdave self-assigned this Feb 5, 2021

getdave added the Core REST API Task Task for Core REST API efforts label Feb 5, 2021

TimothyBJacobs added REST API Interaction Related to REST API and removed Core REST API Task Task for Core REST API efforts labels Feb 5, 2021


getdave marked this pull request as ready for review February 12, 2021 22:06

getdave requested a review from spacedmonkey as a code owner February 12, 2021 22:06

getdave commented Feb 13, 2021

swissspidy reviewed Feb 16, 2021

lib/class-wp-rest-url-details-controller.php

swissspidy reviewed Feb 16, 2021

lib/class-wp-rest-url-details-controller.php Outdated

Base automatically changed from master to trunk March 1, 2021 15:45

getdave mentioned this pull request Apr 22, 2021

Add URL Details endpoint to REST API to allow retrieval of info about a remote URL #18042

Merged

7 tasks

getdave mentioned this pull request Apr 22, 2021

Add experimental util to allow fetch remote url data from REST API #31085

Merged

7 tasks

getdave requested a review from swissspidy April 22, 2021 14:38

getdave added 7 commits April 22, 2021 15:45

Use DOMDocument to parse title from remote website response

3835066

Update tests to expect non-encoded data

01d1726

Extract method for building DOMXpath object from remote url response …

2e3ea49

…body

Handle UTF-8 safely

77df6d3

Strip HTML from title tag.

a30fb58

Check and handle and test for parsing errors

ef40bcc

Ellaborate on how we’re testing for responses that cause errors in DO…

67958ab

…MDocument::loadHTML

getdave added 2 commits April 22, 2021 15:45

Clear error buffer

e16ada9

Addresses #28791 (comment)

Separate DOMDoc and XPath into distinct function calls

3a1334d

getdave force-pushed the try/improve-parsing-of-remote-url-details branch from 2cbf914 to 3a1334d Compare April 22, 2021 14:46

getdave mentioned this pull request May 12, 2021

Improve parsing and retrieve additional data in REST url-details endpoint #31763

Merged

5 tasks

getdave closed this May 12, 2021

Mamaduka mentioned this pull request Jun 16, 2022

Fixed issue of background min height #41693

Merged

johnbillion deleted the try/improve-parsing-of-remote-url-details branch February 10, 2025 16:40
