-
Notifications
You must be signed in to change notification settings - Fork 578
Add ability to detect and mark ill-typed literals #1773
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4c0853c
to
693f9bf
Compare
…ot match the datatype given if known
693f9bf
to
243995e
Compare
Not wishing to add to workload but ... would this be a good opportunity to also address #1326 “Literal should always have a datatype”? RDFLib‘s current stance on Literal datatypes (according to the error message text) dates from the long-superseded W3C Recommendation 10 February 2004 if lang is not None and datatype is not None:
raise TypeError(
"A Literal can only have one of lang or datatype, "
"per http://www.w3.org/TR/rdf-concepts/#section-Graph-Literal"
) In #1326, @white-gecko summarises the entailments of the current spec as:
AIUI, both of the above could be flagged via the “ill-formed” property. Ach, I just realised the implications of kouralex’ comment regarding serializer behaviour, conforming to the RDF 1.1 spec is going to require additional work, perhaps this is better left as it is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Patch looks good, just the concern that technically Literal.ill_formed == False
implies it is either not lexically well-typed/formed, or that we just don't have a checker registered for this.
Also may be better to call it ill_typed rather than ill_formed, though I'm not entirely sure about this.
Co-authored-by: Iwan Aucamp <aucampia@gmail.com>
When I started this patch, my assumption was that if we cannot prove the Literal is ill-formed, then it must be well-formed. Thus ill_formed is False until proven to the true by the checks. But thinking about it since then, you're right, maybe it should be a three valued property, with Regarding naming.. I'm using "ill-formed" because the term "well-formed" is used in the PySHACL spec document, in the context of describing a "well-formed literal". A Literal that is not "well-formed" is one where the the lexical does not match the datatype. Therefore, I concluded a Literal that is not "well-formed" could be called "ill-formed". But maybe "ill-typed" is a better description. EDIT: I just saw that indeed the language in the W3C SHACL test suite refers to these literals as "ill-formed". |
Okay, happy with name then.
This is actually even a bit more hairy now that I think about it, we have the following scenarios as I interpret RDF 1.1 Concepts and Abstract Syntax / 3.3 Literals
I made a PR to change The logic it follows is in a docstring:
Happy to make any further changes to it if you want, just let me know. |
@ashleysommer let me know if you need help on this, happy to work to get this in. |
I will rebase this next week and merge ashleysommer#7 so we can merge this into main. |
`ill_formed` will only be `True` or `False` if the datatype IRI is recognized. Also added some more tests.
Merged ashleysommer#7 and merged master. I also did another review. To me it looks like a fairly good addition, one concern is that this will make parsing slower, maybe not that much, but somewhat. We should maybe consider setting up something to measure performance, not that familiar with the various solutions for it. Just a heads up to @rchateauneu as you are very focused on performance, probably good to get some input from you here. |
@ajnelson-nist as this addresses or relates to an issue you raised (#1757) please check to see if this makes sense, at least on the interface side, as we can fix the implementation easier than the interface. |
Planning to merge this on or after 2022-04-24. |
@@ -1485,6 +1513,94 @@ def _parseBoolean(value: str) -> bool: | |||
return False | |||
|
|||
|
|||
def _well_formed_by_value(lexical: Union[str, bytes], value: Any) -> bool: | |||
return value is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looked like an error on first glance, like it was intended to review self._value
somehow. After seeing how it's called, I see this is intended to be a fallback method. Can it get a short documentation string summarizing this behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Summarizing with my latecomer's perspective to this thread (I didn't get a notice about this issue until @aucampia pinged me): I appreciate @gjhiggins 's last remark on #1757 , with external data sources being potentially significant problems. I'm personally a fan of programs that give a So, I agree with this implementation of a The next question, which we could agree to leave out of scope of this PR, is how to review the |
I'm open to having a flag on Graph to make it strict, in which case we would check all literals for |
I would argue for a state-reporting method rather than a state-asserting property because (unless it's a property of the store as @aucampia suggests), it could get quite messy in the context of operators, e.g. |
@ajnelson-nist thank you very much for the review, I addressed all the issues you raised I think. Will merge this tomorrow morning (CET) if there is no further feedback. |
I'm getting the feeling the matter of a first broad-consumption interface should be raised as a separate PR. I'm partial to a strategy that lets me most quickly find where the problem is. From that use case, I'd be happiest with the I don't think this discussion should hold up making the ill-typedness property available. |
I agree with this, was mainly a suggestion, PR scope should be clamped to the |
Add ability to mark ill-typed literals where the lexical value does not match the datatype given if known
Parameterized test is included to test feature
Fixes #1757
Fixes #848
Related to, but doesn't quite relate to #737