getdave
Contributor

@getdave getdave commented Feb 5, 2021

Builds on the original implementation from #18042 to add a better mechanism for parsing the information from the remote URL's response body HTML.

This helps to address feedback such as #27762 and also sets things up nicely for parsing additional detail from the remote URL (e.g. favicon, Open Graph meta, etc.).

This PR focuses solely on utilising DOMDocument and friends to improve the parsing mechanic.

Automated Testing Instructions

npm run test-php should run the unit tests which reside in phpunit/class-wp-rest-url-details-controller-test.php.

Manual Testing Instructions

  • Check out this PR.
  • Run npm run wp-env start to boot the local testing env.
  • Open the Gutenberg local testing env at http://localhost:8889/wp-admin/ and log in.
  • Create a new Post.
  • Choose one of the following options:

Option 1

  • Open devtools and in the console enter:
wp.apiFetch( { path: '/__experimental/url-details/?url=https://wordpress.org' } ).then( data => {
    console.log( data );
} );

...also try some non-English websites:

wp.apiFetch( { path: '/__experimental/url-details/?url=http://www.baidu.com' } ).then( data => {
    console.log( data );
} );

You'll see the response returned there.

Option 2

  • Open devtools and in the console enter wpApiSettings.nonce. You should see a valid nonce returned as a string.
  • Copy the nonce.
  • In your browser go to: http://localhost:8889/wp-json/__experimental/url-details/?url=https://wordpress.org&_wpnonce=%YOUR_NONCE_HERE% - be sure to replace the nonce value with the one copied in the previous step.
  • You should see a valid REST API response containing the contents of the title tag from wordpress.org (ie: "Blog Tool, Publishing Platform, and CMS \u2014 WordPress.org").
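
For anyone scripting Option 2, the request URL can be assembled rather than typed by hand. This is a minimal JavaScript sketch, not part of the PR: `buildUrlDetailsRequest` is a hypothetical helper name, and it assumes the local environment from the steps above. It also percent-encodes the `url` parameter, which the hand-written URL does not.

```javascript
// Hypothetical helper (not part of the PR): builds the Option 2 request URL.
// siteRoot  - root of the local test site, e.g. 'http://localhost:8889'
// targetUrl - the remote URL whose details should be fetched
// nonce     - the value read from wpApiSettings.nonce
function buildUrlDetailsRequest( siteRoot, targetUrl, nonce ) {
	const endpoint = new URL( '/wp-json/__experimental/url-details/', siteRoot );
	// searchParams.set() percent-encodes the values, so characters such as
	// ':' and '/' in targetUrl cannot be confused with the outer URL.
	endpoint.searchParams.set( 'url', targetUrl );
	endpoint.searchParams.set( '_wpnonce', nonce );
	return endpoint.toString();
}
```

Calling `buildUrlDetailsRequest( 'http://localhost:8889', 'https://wordpress.org', nonce )` returns a string that can be pasted into the browser address bar.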

Types of changes

New feature (non-breaking change which adds functionality)

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code follows the accessibility standards.
  • My code has proper inline documentation.
  • I've included developer documentation if appropriate.
  • I've updated all React Native files affected by any refactorings/renamings in this PR.

@getdave getdave self-assigned this Feb 5, 2021
@getdave getdave added the Core REST API Task Task for Core REST API efforts label Feb 5, 2021
@github-actions

github-actions bot commented Feb 5, 2021

Size Change: 0 B

Total Size: 1.47 MB

ℹ️ View Unchanged
Filename Size Change
build/a11y/index.js 1.14 kB 0 B
build/annotations/index.js 3.78 kB 0 B
build/api-fetch/index.js 3.41 kB 0 B
build/autop/index.js 2.83 kB 0 B
build/blob/index.js 664 B 0 B
build/block-directory/index.js 8.62 kB 0 B
build/block-directory/style-rtl.css 1 kB 0 B
build/block-directory/style.css 1.01 kB 0 B
build/block-editor/index.js 131 kB 0 B
build/block-editor/style-rtl.css 13 kB 0 B
build/block-editor/style.css 13 kB 0 B
build/block-library/blocks/archives/editor-rtl.css 61 B 0 B
build/block-library/blocks/archives/editor.css 60 B 0 B
build/block-library/blocks/audio/editor-rtl.css 58 B 0 B
build/block-library/blocks/audio/editor.css 58 B 0 B
build/block-library/blocks/audio/style-rtl.css 112 B 0 B
build/block-library/blocks/audio/style.css 112 B 0 B
build/block-library/blocks/block/editor-rtl.css 161 B 0 B
build/block-library/blocks/block/editor.css 161 B 0 B
build/block-library/blocks/button/editor-rtl.css 475 B 0 B
build/block-library/blocks/button/editor.css 474 B 0 B
build/block-library/blocks/button/style-rtl.css 503 B 0 B
build/block-library/blocks/button/style.css 503 B 0 B
build/block-library/blocks/buttons/editor-rtl.css 315 B 0 B
build/block-library/blocks/buttons/editor.css 315 B 0 B
build/block-library/blocks/buttons/style-rtl.css 368 B 0 B
build/block-library/blocks/buttons/style.css 368 B 0 B
build/block-library/blocks/calendar/style-rtl.css 208 B 0 B
build/block-library/blocks/calendar/style.css 208 B 0 B
build/block-library/blocks/categories/editor-rtl.css 84 B 0 B
build/block-library/blocks/categories/editor.css 83 B 0 B
build/block-library/blocks/categories/style-rtl.css 79 B 0 B
build/block-library/blocks/categories/style.css 79 B 0 B
build/block-library/blocks/code/style-rtl.css 90 B 0 B
build/block-library/blocks/code/style.css 90 B 0 B
build/block-library/blocks/columns/editor-rtl.css 190 B 0 B
build/block-library/blocks/columns/editor.css 190 B 0 B
build/block-library/blocks/columns/style-rtl.css 436 B 0 B
build/block-library/blocks/columns/style.css 435 B 0 B
build/block-library/blocks/cover/editor-rtl.css 605 B 0 B
build/block-library/blocks/cover/editor.css 605 B 0 B
build/block-library/blocks/cover/style-rtl.css 1.23 kB 0 B
build/block-library/blocks/cover/style.css 1.23 kB 0 B
build/block-library/blocks/embed/editor-rtl.css 486 B 0 B
build/block-library/blocks/embed/editor.css 486 B 0 B
build/block-library/blocks/embed/style-rtl.css 401 B 0 B
build/block-library/blocks/embed/style.css 400 B 0 B
build/block-library/blocks/file/editor-rtl.css 301 B 0 B
build/block-library/blocks/file/editor.css 300 B 0 B
build/block-library/blocks/file/frontend.js 765 B 0 B
build/block-library/blocks/file/style-rtl.css 255 B 0 B
build/block-library/blocks/file/style.css 255 B 0 B
build/block-library/blocks/freeform/editor-rtl.css 2.44 kB 0 B
build/block-library/blocks/freeform/editor.css 2.44 kB 0 B
build/block-library/blocks/gallery/editor-rtl.css 704 B 0 B
build/block-library/blocks/gallery/editor.css 705 B 0 B
build/block-library/blocks/gallery/style-rtl.css 1.09 kB 0 B
build/block-library/blocks/gallery/style.css 1.09 kB 0 B
build/block-library/blocks/group/editor-rtl.css 160 B 0 B
build/block-library/blocks/group/editor.css 160 B 0 B
build/block-library/blocks/group/style-rtl.css 57 B 0 B
build/block-library/blocks/group/style.css 57 B 0 B
build/block-library/blocks/heading/editor-rtl.css 129 B 0 B
build/block-library/blocks/heading/editor.css 129 B 0 B
build/block-library/blocks/heading/style-rtl.css 76 B 0 B
build/block-library/blocks/heading/style.css 76 B 0 B
build/block-library/blocks/html/editor-rtl.css 281 B 0 B
build/block-library/blocks/html/editor.css 281 B 0 B
build/block-library/blocks/image/editor-rtl.css 717 B 0 B
build/block-library/blocks/image/editor.css 716 B 0 B
build/block-library/blocks/image/style-rtl.css 476 B 0 B
build/block-library/blocks/image/style.css 478 B 0 B
build/block-library/blocks/latest-comments/style-rtl.css 281 B 0 B
build/block-library/blocks/latest-comments/style.css 282 B 0 B
build/block-library/blocks/latest-posts/editor-rtl.css 137 B 0 B
build/block-library/blocks/latest-posts/editor.css 137 B 0 B
build/block-library/blocks/latest-posts/style-rtl.css 523 B 0 B
build/block-library/blocks/latest-posts/style.css 522 B 0 B
build/block-library/blocks/legacy-widget/editor-rtl.css 398 B 0 B
build/block-library/blocks/legacy-widget/editor.css 399 B 0 B
build/block-library/blocks/list/style-rtl.css 63 B 0 B
build/block-library/blocks/list/style.css 63 B 0 B
build/block-library/blocks/media-text/editor-rtl.css 191 B 0 B
build/block-library/blocks/media-text/editor.css 191 B 0 B
build/block-library/blocks/media-text/style-rtl.css 535 B 0 B
build/block-library/blocks/media-text/style.css 532 B 0 B
build/block-library/blocks/more/editor-rtl.css 434 B 0 B
build/block-library/blocks/more/editor.css 434 B 0 B
build/block-library/blocks/navigation-link/editor-rtl.css 597 B 0 B
build/block-library/blocks/navigation-link/editor.css 597 B 0 B
build/block-library/blocks/navigation-link/style-rtl.css 1.07 kB 0 B
build/block-library/blocks/navigation-link/style.css 1.07 kB 0 B
build/block-library/blocks/navigation/editor-rtl.css 1.24 kB 0 B
build/block-library/blocks/navigation/editor.css 1.24 kB 0 B
build/block-library/blocks/navigation/style-rtl.css 272 B 0 B
build/block-library/blocks/navigation/style.css 271 B 0 B
build/block-library/blocks/nextpage/editor-rtl.css 395 B 0 B
build/block-library/blocks/nextpage/editor.css 395 B 0 B
build/block-library/blocks/page-list/editor-rtl.css 239 B 0 B
build/block-library/blocks/page-list/editor.css 240 B 0 B
build/block-library/blocks/page-list/style-rtl.css 167 B 0 B
build/block-library/blocks/page-list/style.css 167 B 0 B
build/block-library/blocks/paragraph/editor-rtl.css 157 B 0 B
build/block-library/blocks/paragraph/editor.css 157 B 0 B
build/block-library/blocks/paragraph/style-rtl.css 247 B 0 B
build/block-library/blocks/paragraph/style.css 248 B 0 B
build/block-library/blocks/post-author/editor-rtl.css 209 B 0 B
build/block-library/blocks/post-author/editor.css 209 B 0 B
build/block-library/blocks/post-author/style-rtl.css 183 B 0 B
build/block-library/blocks/post-author/style.css 184 B 0 B
build/block-library/blocks/post-comments-form/style-rtl.css 250 B 0 B
build/block-library/blocks/post-comments-form/style.css 250 B 0 B
build/block-library/blocks/post-content/editor-rtl.css 139 B 0 B
build/block-library/blocks/post-content/editor.css 139 B 0 B
build/block-library/blocks/post-excerpt/editor-rtl.css 73 B 0 B
build/block-library/blocks/post-excerpt/editor.css 73 B 0 B
build/block-library/blocks/post-excerpt/style-rtl.css 69 B 0 B
build/block-library/blocks/post-excerpt/style.css 69 B 0 B
build/block-library/blocks/post-featured-image/editor-rtl.css 338 B 0 B
build/block-library/blocks/post-featured-image/editor.css 338 B 0 B
build/block-library/blocks/post-featured-image/style-rtl.css 100 B 0 B
build/block-library/blocks/post-featured-image/style.css 100 B 0 B
build/block-library/blocks/post-title/style-rtl.css 60 B 0 B
build/block-library/blocks/post-title/style.css 60 B 0 B
build/block-library/blocks/preformatted/style-rtl.css 103 B 0 B
build/block-library/blocks/preformatted/style.css 103 B 0 B
build/block-library/blocks/pullquote/editor-rtl.css 183 B 0 B
build/block-library/blocks/pullquote/editor.css 183 B 0 B
build/block-library/blocks/pullquote/style-rtl.css 318 B 0 B
build/block-library/blocks/pullquote/style.css 318 B 0 B
build/block-library/blocks/query-loop/editor-rtl.css 83 B 0 B
build/block-library/blocks/query-loop/editor.css 82 B 0 B
build/block-library/blocks/query-loop/style-rtl.css 315 B 0 B
build/block-library/blocks/query-loop/style.css 317 B 0 B
build/block-library/blocks/query-pagination-numbers/editor-rtl.css 122 B 0 B
build/block-library/blocks/query-pagination-numbers/editor.css 121 B 0 B
build/block-library/blocks/query-pagination/editor-rtl.css 270 B 0 B
build/block-library/blocks/query-pagination/editor.css 262 B 0 B
build/block-library/blocks/query-pagination/style-rtl.css 168 B 0 B
build/block-library/blocks/query-pagination/style.css 168 B 0 B
build/block-library/blocks/query-title/editor-rtl.css 86 B 0 B
build/block-library/blocks/query-title/editor.css 86 B 0 B
build/block-library/blocks/query/editor-rtl.css 131 B 0 B
build/block-library/blocks/query/editor.css 132 B 0 B
build/block-library/blocks/quote/style-rtl.css 169 B 0 B
build/block-library/blocks/quote/style.css 169 B 0 B
build/block-library/blocks/rss/editor-rtl.css 201 B 0 B
build/block-library/blocks/rss/editor.css 202 B 0 B
build/block-library/blocks/rss/style-rtl.css 290 B 0 B
build/block-library/blocks/rss/style.css 290 B 0 B
build/block-library/blocks/search/editor-rtl.css 189 B 0 B
build/block-library/blocks/search/editor.css 189 B 0 B
build/block-library/blocks/search/style-rtl.css 359 B 0 B
build/block-library/blocks/search/style.css 362 B 0 B
build/block-library/blocks/separator/editor-rtl.css 99 B 0 B
build/block-library/blocks/separator/editor.css 99 B 0 B
build/block-library/blocks/separator/style-rtl.css 251 B 0 B
build/block-library/blocks/separator/style.css 251 B 0 B
build/block-library/blocks/shortcode/editor-rtl.css 512 B 0 B
build/block-library/blocks/shortcode/editor.css 512 B 0 B
build/block-library/blocks/site-logo/editor-rtl.css 440 B 0 B
build/block-library/blocks/site-logo/editor.css 441 B 0 B
build/block-library/blocks/site-logo/style-rtl.css 154 B 0 B
build/block-library/blocks/site-logo/style.css 154 B 0 B
build/block-library/blocks/social-link/editor-rtl.css 164 B 0 B
build/block-library/blocks/social-link/editor.css 165 B 0 B
build/block-library/blocks/social-links/editor-rtl.css 796 B 0 B
build/block-library/blocks/social-links/editor.css 795 B 0 B
build/block-library/blocks/social-links/style-rtl.css 1.32 kB 0 B
build/block-library/blocks/social-links/style.css 1.33 kB 0 B
build/block-library/blocks/spacer/editor-rtl.css 308 B 0 B
build/block-library/blocks/spacer/editor.css 308 B 0 B
build/block-library/blocks/spacer/style-rtl.css 48 B 0 B
build/block-library/blocks/spacer/style.css 48 B 0 B
build/block-library/blocks/table/editor-rtl.css 478 B 0 B
build/block-library/blocks/table/editor.css 478 B 0 B
build/block-library/blocks/table/style-rtl.css 402 B 0 B
build/block-library/blocks/table/style.css 402 B 0 B
build/block-library/blocks/tag-cloud/editor-rtl.css 118 B 0 B
build/block-library/blocks/tag-cloud/editor.css 118 B 0 B
build/block-library/blocks/tag-cloud/style-rtl.css 94 B 0 B
build/block-library/blocks/tag-cloud/style.css 94 B 0 B
build/block-library/blocks/template-part/editor-rtl.css 552 B 0 B
build/block-library/blocks/template-part/editor.css 551 B 0 B
build/block-library/blocks/term-description/editor-rtl.css 90 B 0 B
build/block-library/blocks/term-description/editor.css 90 B 0 B
build/block-library/blocks/text-columns/editor-rtl.css 95 B 0 B
build/block-library/blocks/text-columns/editor.css 95 B 0 B
build/block-library/blocks/text-columns/style-rtl.css 166 B 0 B
build/block-library/blocks/text-columns/style.css 166 B 0 B
build/block-library/blocks/verse/style-rtl.css 87 B 0 B
build/block-library/blocks/verse/style.css 87 B 0 B
build/block-library/blocks/video/editor-rtl.css 568 B 0 B
build/block-library/blocks/video/editor.css 569 B 0 B
build/block-library/blocks/video/style-rtl.css 173 B 0 B
build/block-library/blocks/video/style.css 173 B 0 B
build/block-library/common-rtl.css 1.31 kB 0 B
build/block-library/common.css 1.31 kB 0 B
build/block-library/editor-rtl.css 9.47 kB 0 B
build/block-library/editor.css 9.46 kB 0 B
build/block-library/index.js 153 kB 0 B
build/block-library/reset-rtl.css 502 B 0 B
build/block-library/reset.css 503 B 0 B
build/block-library/style-rtl.css 9.44 kB 0 B
build/block-library/style.css 9.44 kB 0 B
build/block-library/theme-rtl.css 692 B 0 B
build/block-library/theme.css 693 B 0 B
build/block-serialization-default-parser/index.js 1.87 kB 0 B
build/block-serialization-spec-parser/index.js 3.06 kB 0 B
build/blocks/index.js 48.7 kB 0 B
build/components/index.js 285 kB 0 B
build/components/style-rtl.css 16.2 kB 0 B
build/components/style.css 16.2 kB 0 B
build/compose/index.js 11.6 kB 0 B
build/core-data/index.js 17 kB 0 B
build/customize-widgets/index.js 8.27 kB 0 B
build/customize-widgets/style-rtl.css 666 B 0 B
build/customize-widgets/style.css 667 B 0 B
build/data-controls/index.js 836 B 0 B
build/data/index.js 9.17 kB 0 B
build/date/index.js 31.9 kB 0 B
build/deprecated/index.js 787 B 0 B
build/dom-ready/index.js 576 B 0 B
build/dom/index.js 5.12 kB 0 B
build/edit-navigation/index.js 17.1 kB 0 B
build/edit-navigation/style-rtl.css 2.86 kB 0 B
build/edit-navigation/style.css 2.86 kB 0 B
build/edit-post/classic-rtl.css 454 B 0 B
build/edit-post/classic.css 454 B 0 B
build/edit-post/index.js 339 kB 0 B
build/edit-post/style-rtl.css 6.96 kB 0 B
build/edit-post/style.css 6.95 kB 0 B
build/edit-site/index.js 28.9 kB 0 B
build/edit-site/style-rtl.css 4.9 kB 0 B
build/edit-site/style.css 4.89 kB 0 B
build/edit-widgets/index.js 16.7 kB 0 B
build/edit-widgets/style-rtl.css 2.97 kB 0 B
build/edit-widgets/style.css 2.98 kB 0 B
build/editor/index.js 42.6 kB 0 B
build/editor/style-rtl.css 3.9 kB 0 B
build/editor/style.css 3.9 kB 0 B
build/element/index.js 4.62 kB 0 B
build/escape-html/index.js 735 B 0 B
build/format-library/index.js 6.77 kB 0 B
build/format-library/style-rtl.css 637 B 0 B
build/format-library/style.css 639 B 0 B
build/hooks/index.js 2.28 kB 0 B
build/html-entities/index.js 622 B 0 B
build/i18n/index.js 4.04 kB 0 B
build/is-shallow-equal/index.js 699 B 0 B
build/keyboard-shortcuts/index.js 2.53 kB 0 B
build/keycodes/index.js 1.95 kB 0 B
build/list-reusable-blocks/index.js 3.19 kB 0 B
build/list-reusable-blocks/style-rtl.css 629 B 0 B
build/list-reusable-blocks/style.css 628 B 0 B
build/media-utils/index.js 5.39 kB 0 B
build/notices/index.js 1.85 kB 0 B
build/nux/index.js 3.42 kB 0 B
build/nux/style-rtl.css 731 B 0 B
build/nux/style.css 727 B 0 B
build/plugins/index.js 2.95 kB 0 B
build/primitives/index.js 1.42 kB 0 B
build/priority-queue/index.js 791 B 0 B
build/react-i18n/index.js 1.45 kB 0 B
build/redux-routine/index.js 2.84 kB 0 B
build/reusable-blocks/index.js 3.8 kB 0 B
build/reusable-blocks/style-rtl.css 225 B 0 B
build/reusable-blocks/style.css 225 B 0 B
build/rich-text/index.js 13.5 kB 0 B
build/server-side-render/index.js 2.6 kB 0 B
build/shortcode/index.js 1.7 kB 0 B
build/token-list/index.js 1.27 kB 0 B
build/url/index.js 3.01 kB 0 B
build/viewport/index.js 1.85 kB 0 B
build/warning/index.js 1.14 kB 0 B
build/wordcount/index.js 1.22 kB 0 B

compressed-size-action

@getdave
Contributor Author

getdave commented Feb 5, 2021

cc @beaulebens You might be interested in this as a follow up to #18042 (comment)

@TimothyBJacobs TimothyBJacobs added REST API Interaction Related to REST API and removed Core REST API Task Task for Core REST API efforts labels Feb 5, 2021
@TimothyBJacobs

This comment has been minimized.

@getdave getdave marked this pull request as ready for review February 12, 2021 22:06
@getdave
Contributor Author

getdave commented Feb 12, 2021

I'd like some feedback on how we're suppressing the errors that DOMDocument::loadHTML can generate. For example, if the response data includes <section> tags then it will raise an error. We are suppressing the error, which doesn't seem to cause any problems, and we can still parse the data.

@@ -137,6 +143,39 @@ public function parse_url_details( $request ) {
return apply_filters( 'rest_prepare_url_details', $response, $url, $request, $remote_url_response );
Contributor Author

Note to self. Pass the $xpath value to the filter so folks can use it to query additional parts of response.

@TimothyBJacobs
Member

I'm not sure I have enough experience with DOMDocument to give specific feedback here.

Base automatically changed from master to trunk March 1, 2021 15:45
@getdave
Contributor Author

getdave commented Apr 22, 2021

Picking this up again.

@getdave getdave force-pushed the try/improve-parsing-of-remote-url-details branch from 2cbf914 to 3a1334d Compare April 22, 2021 14:46
@hellofromtonya
Contributor

hellofromtonya commented Apr 23, 2021

DOMDocument is a powerful HTML parser but presents significant problems for WordPress sites:

  • Hosting: the following extensions are required: DOM, libxml, and iconv
    Guarding could be added to protect against missing extensions. However, an alternative parsing mechanism would be needed to parse these webpages.

  • Libxml versions before 2.8.0 have a known bug: HTML parser error with <noscript> in the <head>
    A transformation can be included to handle these instances.

  • While DOMDocument auto-fixes malformed HTML, it does not accurately close nested divs when an inner child div is missing its closing tag. Instead, it places the missing closing tag in the wrong place, changing the element relationships and structure.
    This problem is the most serious. Why? A missing inner closing div changes the document's structure. See the problem in action here https://3v4l.org/ijrfW

Story time:
During my time at my last company, we rolled out a solution using DOMDocument. We discovered over 30% of the webpages parsed through it were badly malformed and caused inaccurate DOM building and, worse yet, broken webpages after processing.

Not a problem if:

  • Accuracy of inner div structures is not required and the Document is not converted back to HTML (i.e. via DOMDocument::saveHTML or other methods)
  • Only allowing well-formed webpages

However, if the point is to ensure proper HTML parsing for processing the elements, the accuracy of autofixing the HTML needs to be improved, especially for the missing inner closing div.

@getdave
Contributor Author

getdave commented Apr 23, 2021

@hellofromtonya Thanks for the detailed explanation. This isn't something I had considered and so it's extremely helpful to have this context. Much appreciated.

However, if the point is to ensure proper HTML parsing for processing the elements, the accuracy of autofixing the HTML needs to be improved, especially for the missing inner closing div.

What I will say is that this is definitely being used for progressive enhancement of the UI. It is not critical functionality. All we are doing is attempting to parse the remote website to gain some metadata to display to the user when they enter a valid link. If the parsing fails for any reason then it is absolutely fine and the worst the user will see will be the fallback UI for the link (which is what they see currently in the block editor).
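
The progressive-enhancement behaviour described here can be sketched on the client side as follows. This is an illustrative JavaScript sketch, not code from the PR: `fetchUrlDetails` is a hypothetical name, and the fetch function is passed in as a parameter (for example `wp.apiFetch`) so the snippet carries no global dependencies.

```javascript
// Sketch: treat any failure as "no details" so the caller can fall back to
// the plain link UI. `apiFetch` is injected (for example wp.apiFetch).
async function fetchUrlDetails( url, apiFetch ) {
	try {
		const path = '/__experimental/url-details/?url=' + encodeURIComponent( url );
		return await apiFetch( { path } );
	} catch ( error ) {
		// Not critical functionality: swallow the error and signal that
		// there is nothing extra to display.
		return null;
	}
}
```

A caller would render the richer preview only when the result is not `null`, and otherwise keep showing what the block editor shows today.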

I suppose there is an argument that if this endpoint exists then users might try to use it for more detailed parsing of a remote URL, but that is not its intent. Perhaps we could document as such or limit the response payload size to avoid folks using it as a scraper? This endpoint only runs in the admin if you have suitable permissions.

Libxml versions before 2.8.0 have a known bug: HTML parser error with <noscript> in the <head>
A transformation can be included to handle these instances.

I can look into this.

Hosting: the following extensions are required: DOM, libxml, and iconv
Guarding could be added to protect against missing extensions. However, an alternative parsing mechanism would be needed to parse these webpages.

I assume if we test for these extensions and bail if they don't cut the mustard then that's ok? Again, as this is progressive enhancement we can just not return any data and the block editor will provide fallback. The user won't really notice.

With the above context would you still say this approach is a no go?

@hellofromtonya
Contributor

Hey @getdave, thanks for providing more context for its use. Not a "no go" yet. Extracting metadata such as the <title> is doable with DOMDocument.

What type of metadata will be extracted?

The PR is using xpath to find the document's <title></title> element. This particular element could be fetched using regex instead, which would be less code, faster code (parsing the HTML takes time especially for larger web pages), and without the server setup and encoding issues with DOMDocument.

Will more metadata be extracted from the HTML?

@getdave
Contributor Author

getdave commented Apr 26, 2021

Will more metadata be extracted from the HTML?

Yes. Moreover, it is possible to use a filter hook on the endpoint and parse any data you want from the remote URL response.

What type of metadata will be extracted?

The ultimate goal would be for the default data set to include:

  • <title> contents
  • site icon - eg: favicon...etc
  • meta description

The PR is using xpath to find the document's <title></title> element. This particular element could be fetched using regex instead, which would be less code, faster code (parsing the HTML takes time especially for larger web pages), and without the server setup and encoding issues with DOMDocument.

Ironically using regex is what the endpoint currently does to get the <title>:

private function get_title( $html ) {
	preg_match( '|<title>([^<]*?)</title>|is', $html, $match_title );
	$title = isset( $match_title[1] ) ? trim( $match_title[1] ) : '';
	return $title;
}

I wrote it and used regex and folks suggested using a more reliable mechanism if I wanted to extract more complex data which is why I raised this follow-up.
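
To experiment with that pattern from the same devtools console used in the testing instructions, here is a JavaScript port. It is only a sketch for exploring the behaviour, not the code the endpoint runs, and JavaScript's `trim()` strips a slightly different set of whitespace characters than PHP's `trim()`.

```javascript
// JavaScript port of the PHP pattern '|<title>([^<]*?)</title>|is'.
// The `i` flag carries over; PHP's `s` flag (dot matches newline) has no
// effect on this pattern because it contains no dot.
function getTitle( html ) {
	const match = html.match( /<title>([^<]*?)<\/title>/i );
	return match ? match[ 1 ].trim() : '';
}
```

One limitation this makes easy to see: the pattern requires a bare `<title>` tag, so a title element carrying attributes (for example `<title lang="en">`) is not matched and the function returns an empty string.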

Ultimately would you advise that we ditch DOMDocument and just use regex for the parsing?

cc @swissspidy who has been using DOMDocument and may have suggested that approach.

@getdave
Contributor Author

getdave commented May 4, 2021

Ok we're going to take this in a new direction and use regex as the simplest way to parse out the markup we need. As the endpoint is extensible, folks can still choose to use more advanced utilities for their own purposes but we shouldn't ship this as part of core.

Let's also only grab the <head> portion of the DOM to avoid having to parse a potentially massive string of HTML in regex.
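
As a rough illustration of the head-only idea, here is a JavaScript sketch. It is an assumption about one way to do it, not the implementation that eventually shipped, and `getDocumentHead` is a hypothetical name.

```javascript
// Sketch: return only the contents of the first <head> element so later
// regex passes run on a small string instead of the whole document.
function getDocumentHead( html ) {
	// `(?:\s[^>]*)?` allows attributes on <head> without also matching
	// tags that merely start with "head", such as <header>.
	const match = html.match( /<head(?:\s[^>]*)?>([\s\S]*?)<\/head>/i );
	return match ? match[ 1 ] : '';
}
```

Any title, icon or description lookup can then be run against the returned string, which also stops stray matches from the page body.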

@swissspidy
Member

No strong opinion here as long as there is a reasonable (& documented) way for plugins to replace current parsing with DOMDocument if they want to.

@getdave
Contributor Author

getdave commented May 12, 2021

Just putting this here so I don't forget it:

^(?=.*href="|\'(.*\.ico.*?)"|\')(?=.*rel="|\'(shortcut|icon)"|\').*$
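
The pattern above is a work-in-progress note. One alternative that does not depend on attribute order is to find each `<link>` tag first and then read its attributes separately. The JavaScript below is a sketch under that assumption, with hypothetical helper names. It only handles quoted attribute values and would be confused by a `>` inside an attribute value.

```javascript
// Sketch: read a quoted attribute value from a single tag string.
// `name` is assumed to be a plain attribute name such as 'rel' or 'href'.
function getAttribute( tag, name ) {
	const pattern = new RegExp( '\\s' + name + '\\s*=\\s*(["\'])(.*?)\\1', 'i' );
	const match = tag.match( pattern );
	return match ? match[ 2 ] : null;
}

// Sketch: return the href of the first <link> whose rel contains the
// token "icon" (so "shortcut icon" matches, "apple-touch-icon" does not).
function getIconHref( html ) {
	const linkTags = html.match( /<link\b[^>]*>/gi ) || [];
	for ( const tag of linkTags ) {
		const rel = getAttribute( tag, 'rel' );
		if ( rel && /(^|\s)icon(\s|$)/i.test( rel ) ) {
			return getAttribute( tag, 'href' );
		}
	}
	return null;
}
```

Unlike the single-expression note, this does not require the href to end in `.ico`, so PNG or SVG icons are picked up as well.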

@getdave
Contributor Author

getdave commented May 12, 2021

Closing in favour of #31763

@getdave getdave closed this May 12, 2021
@johnbillion johnbillion deleted the try/improve-parsing-of-remote-url-details branch February 10, 2025 16:40
Labels
REST API Interaction Related to REST API

4 participants