-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Improve the parsing of remote url details in the URL Details endpoint to REST API #28791
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Size Change: 0 B Total Size: 1.47 MB ℹ️ View Unchanged
|
cc @beaulebens You might be interested in this as a follow up to #18042 (comment) |
This comment has been minimized.
This comment has been minimized.
I'd like some feedback on how we're suppressing errors that can be generated by |
@@ -137,6 +143,39 @@ public function parse_url_details( $request ) { | |||
return apply_filters( 'rest_prepare_url_details', $response, $url, $request, $remote_url_response ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self. Pass the $xpath
value to the filter so folks can use it to query additional parts of response.
I'm not sure I have enough experience with |
Picking this up again. |
…MDocument::loadHTML
2cbf914
to
3a1334d
Compare
Story time: Not a problem if:
However, if the point is to ensure proper HTML parsing for processing the elements, the accuracy of autofixing the HTML needs to be improved, especially for the missing inner closing div. |
@hellofromtonya Thanks for the detailed explanation. This isn't something I had considered and so it's extremely helpful to have this context. Much appreciated.
What I will say is that this is definitely being used for progressive enhancement of the UI. It is not critical functionality. All we are doing is attempting to parse the remote website to gain some metadata to display to the user when they enter a valid link. If the parsing fails for any reason then it is absolutely fine and the worst the user will see will be the fallback UI for the link (which is what they see currently in the block editor). I suppose there is an argument that if this endpoint exists then users might try to use it for more detailed parsing of a remote URL, but that is not its intent. Perhaps we could document as such or limit the response payload size to avoid folks using it as a scraper? This endpoint only runs in the admin if you have suitable permissions.
I can look into this.
I assume if we test for these extensions and bail if they don't cut the mustard then that's ok? Again, as this is progressive enhancement we can just not return any data and the block editor will provide fallback. The user won't really notice. With the above context would you still say this approach is a no go? |
Hey @getdave, thanks for providing more context for its use. Not a "no go" yet. Extracting metadata such as the What type of metadata will be extracted? The PR is using xpath to find the document's Will more metadata be extracted from the HTML? |
Yes. Moreover, it is possible to use a filter hook on the endpoint and parse any data you want from the remote URL response.
The ultimate goal would be for the default data set to include:
Ironically using regex is what the endpoint currently does to get the gutenberg/lib/class-wp-rest-url-details-controller.php Lines 210 to 216 in 164bef8
I wrote it and used regex and folks suggested using a more reliable mechanism if I wanted to extract more complex data which is why I raised this follow-up. Ultimately would you advise that we ditch cc @swissspidy who has been using |
Ok we're going to take this in a new direction and use regex as the simplest way to parse out the markup we need. As the endpoint is extensible, folks can still choose to use more advanced utilities for their own purposes but we shouldn't ship this as part of core. Let's also only grab the |
No strong opinion here as long as there is a reasonable (& documented) way for plugins to replace current parsing with DOMDocument if they want to. |
Just putting this here so I don't forget it:
|
Closing in favour of #31763 |
Builds on the original implementation from #18042 to add a better mechanism for parsing the information from the remote URL's response body HTML.
This helps to address feedback such as #27762 and also sets things up nicely for parsing additional detail from the remote URL (eg: favicon, Open Graph meta...etc).
This PR focuses solely on utilising
DOMDocument
and friends to improve the parsing mechanic.Automated Testing Instructions
npm run test-php
should run the unit tests which reside inphpunit/class-wp-rest-url-details-controller-test.php
.Manual Testing Instructions
npm run wp-env start
to boot the local testing env.http://localhost:8889/wp-admin/
and login.Option 1
console
enter:...also try some non-English websites:
You'll see the response returned there.
Option 2
console
enterwpApiSettings.nonce
. You should see a valid nonce returned as a string.http://localhost:8889/wp-json/__experimental/url-details/?url=https://wordpress.org&_wpnonce=%YOUR_NONCE_HERE%
- be sure to replace the nonce value with that copied from the previous step.wordpress.org
(ie:"Blog Tool, Publishing Platform, and CMS \u2014 WordPress.org"
).Types of changes
New feature (non-breaking change which adds functionality)
Checklist: