Add ability to detect and mark ill-typed literals #1773

ashleysommer · 2022-03-24T02:05:25Z

Add ability to mark ill-typed literals where the lexical value does not match the datatype given if known

Parameterized test is included to test feature

Related to, but doesn't quite relate to #737

…ot match the datatype given if known

ghost · 2022-03-24T15:17:58Z

Not wishing to add to workload but ... would this be a good opportunity to also address #1326 “Literal should always have a datatype”?

RDFLib‘s current stance on Literal datatypes (according to the error message text) dates from the long-superseded W3C Recommendation 10 February 2004

if lang is not None and datatype is not None:
    raise TypeError(
        "A Literal can only have one of lang or datatype, "
        "per http://www.w3.org/TR/rdf-concepts/#section-Graph-Literal"
    )

In #1326, @white-gecko summarises the entailments of the current spec as:

The datatype of a Literal should always be set
If a language tag is specified, the datatype must be http://www.w3.org/1999/02/22-rdf-syntax-ns#langString

AIUI, both of the above could be flagged via the “ill-formed” property.

Ach, I just realised the implications of kouralex’ comment regarding serializer behaviour, conforming to the RDF 1.1 spec is going to require additional work, perhaps this is better left as it is.

rdflib/term.py

aucampia

Patch looks good, just the concern that technically Literal.ill_formed == False implies it is either not lexically well-typed/formed, or that we just don't have a checker registered for this.

Also may be better to call it ill_typed rather than ill_formed, though I'm not entirely sure about this.

Co-authored-by: Iwan Aucamp <aucampia@gmail.com>

ashleysommer · 2022-03-25T05:29:36Z

When I started this patch, my assumption was that if we cannot prove the Literal is ill-formed, then it must be well-formed. Thus ill_formed is False until proven to the true by the checks. But thinking about it since then, you're right, maybe it should be a three valued property, with None being the default. If there are built-in tests, then it can be marked as true/false. otherwise it is None, which is unknown.

Regarding naming.. I'm using "ill-formed" because the term "well-formed" is used in the PySHACL spec document, in the context of describing a "well-formed literal". A Literal that is not "well-formed" is one where the the lexical does not match the datatype. Therefore, I concluded a Literal that is not "well-formed" could be called "ill-formed". But maybe "ill-typed" is a better description.

EDIT: I just saw that indeed the language in the W3C SHACL test suite refers to these literals as "ill-formed".

aucampia · 2022-03-26T10:27:47Z

Regarding naming.. I'm using "ill-formed" because the term "well-formed" is used in the PySHACL spec document, in the context of describing a "well-formed literal". A Literal that is not "well-formed" is one where the the lexical does not match the datatype. Therefore, I concluded a Literal that is not "well-formed" could be called "ill-formed". But maybe "ill-typed" is a better description.

EDIT: I just saw that indeed the language in the W3C SHACL test suite refers to these literals as "ill-formed".

Okay, happy with name then.

When I started this patch, my assumption was that if we cannot prove the Literal is ill-formed, then it must be well-formed. Thus ill_formed is False until proven to the true by the checks. But thinking about it since then, you're right, maybe it should be a three valued property, with None being the default. If there are built-in tests, then it can be marked as true/false. otherwise it is None, which is unknown.

This is actually even a bit more hairy now that I think about it, we have the following scenarios as I interpret RDF 1.1 Concepts and Abstract Syntax / 3.3 Literals

The Literal's datatype is not a recognized datatype IRIs: In this case the concept is not well defined, as it is only defined for recognized datatype IRIs

In this case I would say the reasonable value for Literal.ill_formed is None.
The Literal's datatype is a recognized datatype IRIs: In this case the value should ideally only ever be true or false, and I think in this case it is maybe reasonable to conclude that it is NOT ill_typed unless we can show it is ill_typed, however maybe there are cases here which we are not yet sure for which Literal.ill_formed could also be None.

I made a PR to change ill_typed to be three valued: ashleysommer#7

The logic it follows is in a docstring:

property ill_formed: Optional[bool ]

For recognized datatype IRIs, this value will be True if the literal is ill formed, otherwise it will be False. Literal.value (i.e. the literal value) should always be defined if this value is True, but should not be considered reliable if this property is not True.

If the literal’s datatype is not in the set of recognized datatype IRIs this value will be None.

Happy to make any further changes to it if you want, just let me know.

aucampia · 2022-04-04T17:04:25Z

@ashleysommer let me know if you need help on this, happy to work to get this in.

aucampia · 2022-04-10T10:13:10Z

I will rebase this next week and merge ashleysommer#7 so we can merge this into main.

`ill_formed` will only be `True` or `False` if the datatype IRI is recognized. Also added some more tests.

…terals

aucampia · 2022-04-12T18:54:32Z

Merged ashleysommer#7 and merged master. I also did another review. To me it looks like a fairly good addition, one concern is that this will make parsing slower, maybe not that much, but somewhat. We should maybe consider setting up something to measure performance, not that familiar with the various solutions for it.

Just a heads up to @rchateauneu as you are very focused on performance, probably good to get some input from you here.

aucampia · 2022-04-12T22:19:26Z

@ajnelson-nist as this addresses or relates to an issue you raised (#1757) please check to see if this makes sense, at least on the interface side, as we can fix the implementation easier than the interface.

aucampia · 2022-04-14T09:11:59Z

Planning to merge this on or after 2022-04-24.

ajnelson-nist · 2022-04-14T13:35:39Z

rdflib/term.py

@@ -1485,6 +1513,94 @@ def _parseBoolean(value: str) -> bool:
    return False


+def _well_formed_by_value(lexical: Union[str, bytes], value: Any) -> bool:
+    return value is not None


This looked like an error on first glance, like it was intended to review self._value somehow. After seeing how it's called, I see this is intended to be a fallback method. Can it get a short documentation string summarizing this behavior?

Added a docstring.

test/test_literal/test_literal.py

rdflib/term.py

ajnelson-nist · 2022-04-14T14:20:04Z

Summarizing with my latecomer's perspective to this thread (I didn't get a notice about this issue until @aucampia pinged me):

I appreciate @gjhiggins 's last remark on #1757 , with external data sources being potentially significant problems. I'm personally a fan of programs that give a --strict flag, but it's unacceptable for data someone else hands me to halt my program if it includes a bad restricted-integer.

So, I agree with this implementation of a .ill_formed property, deferring enforcement decisions but enabling a surveyable property.

The next question, which we could agree to leave out of scope of this PR, is how to review the Literals for where this property is True. Should rdflib.Graph provide a convenience method/property for that?

aucampia · 2022-04-14T14:23:56Z

Should rdflib.Graph provide a convenience method/property for that?

I'm open to having a flag on Graph to make it strict, in which case we would check all literals for ill_typed is not True, or even possibly ill_typed is False, a relatively easy interim measure is possibly to just extend the store you are using to only accept literals if they have ill_typed is not True

ghost · 2022-04-14T15:20:57Z

Should rdflib.Graph provide a convenience method/property for that?

I'm open to having a flag on Graph to make it strict, in which case we would check all literals for ill_typed is not True, or even possibly ill_typed is False, a relatively easy interim measure is possibly to just extend the store you are using to only accept literals if they have ill_typed is not True

I would argue for a state-reporting method rather than a state-asserting property because (unless it's a property of the store as @aucampia suggests), it could get quite messy in the context of operators, e.g. flaggedgraph += unflaggedgraph.

…terals

aucampia · 2022-04-14T17:22:24Z

@ajnelson-nist thank you very much for the review, I addressed all the issues you raised I think. Will merge this tomorrow morning (CET) if there is no further feedback.

ajnelson-nist · 2022-04-14T17:29:53Z

Should rdflib.Graph provide a convenience method/property for that?

I'm open to having a flag on Graph to make it strict, in which case we would check all literals for ill_typed is not True, or even possibly ill_typed is False, a relatively easy interim measure is possibly to just extend the store you are using to only accept literals if they have ill_typed is not True

I would argue for a state-reporting method rather than a state-asserting property because (unless it's a property of the store as @aucampia suggests), it could get quite messy in the context of operators, e.g. flaggedgraph += unflaggedgraph.

I'm getting the feeling the matter of a first broad-consumption interface should be raised as a separate PR. I'm partial to a strategy that lets me most quickly find where the problem is. From that use case, I'd be happiest with the Graph keeping a dictionary of all problematic literals, with the dictionary's value being a set of all triples that have the literal. That implementation helps me fix data problems, but introduces a cache design pattern that has to reconcile with graph-graph operations (at least adding and subtracting), and might also have performance implications.

I don't think this discussion should hold up making the ill-typedness property available.

aucampia · 2022-04-14T17:53:25Z

I'm getting the feeling the matter of a first broad-consumption interface should be raised as a separate PR.

I don't think this discussion should hold up making the ill-typedness property available.

I agree with this, was mainly a suggestion, PR scope should be clamped to the ill_typed property. We can make a separate issue for further matters.

ashleysommer requested review from nicholascar and aucampia and removed request for nicholascar March 24, 2022 02:07

ashleysommer force-pushed the mark_ill_typed_literals branch from 4c0853c to 693f9bf Compare March 24, 2022 02:12

Add ability to mark ill-typed literals where the lexical value does n…

243995e

…ot match the datatype given if known

ashleysommer force-pushed the mark_ill_typed_literals branch from 693f9bf to 243995e Compare March 24, 2022 02:52

add more type hinting, to try to pass validator

8769af6

aucampia reviewed Mar 24, 2022

View reviewed changes

rdflib/term.py Outdated Show resolved Hide resolved

aucampia reviewed Mar 24, 2022

View reviewed changes

rdflib/term.py Outdated Show resolved Hide resolved

aucampia reviewed Mar 24, 2022

View reviewed changes

rdflib/term.py Outdated Show resolved Hide resolved

aucampia approved these changes Mar 24, 2022

View reviewed changes

Update rdflib/term.py

05bd71d

Co-authored-by: Iwan Aucamp <aucampia@gmail.com>

aucampia mentioned this pull request Mar 28, 2022

Change ill_formed to be three valued. ashleysommer/rdflib#7

Closed

aucampia added 2 commits April 12, 2022 20:35

Change ill_formed to be three valued.

7d9ee8a

`ill_formed` will only be `True` or `False` if the datatype IRI is recognized. Also added some more tests.

Merge remote-tracking branch 'upstream/master' into mark_ill_typed_li…

c05a29f

…terals

aucampia requested review from a user and white-gecko April 12, 2022 18:55

ghost approved these changes Apr 12, 2022

View reviewed changes

ajnelson-nist reviewed Apr 14, 2022

View reviewed changes

test/test_literal/test_literal.py Show resolved Hide resolved

ajnelson-nist reviewed Apr 14, 2022

View reviewed changes

test/test_literal/test_literal.py Outdated Show resolved Hide resolved

ajnelson-nist reviewed Apr 14, 2022

View reviewed changes

rdflib/term.py Outdated Show resolved Hide resolved

aucampia added 2 commits April 14, 2022 18:59

Merge remote-tracking branch 'upstream/master' into mark_ill_typed_li…

8a604c6

…terals

Fixes from @ajnelson-nist's review

c92de9f

aucampia merged commit 8b27fe2 into RDFLib:master Apr 15, 2022

ashleysommer mentioned this pull request May 25, 2022

Consider renaming Literal.ill_formed to Literal.ill_typed #1960

Closed

Add ability to detect and mark ill-typed literals #1773

Add ability to detect and mark ill-typed literals #1773

Uh oh!

Conversation

ashleysommer commented Mar 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Mar 24, 2022 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aucampia left a comment

Choose a reason for hiding this comment

Uh oh!

ashleysommer commented Mar 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aucampia commented Mar 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aucampia commented Apr 4, 2022

Uh oh!

aucampia commented Apr 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aucampia commented Apr 12, 2022

Uh oh!

aucampia commented Apr 12, 2022

Uh oh!

aucampia commented Apr 14, 2022

Uh oh!

ajnelson-nist Apr 14, 2022

Choose a reason for hiding this comment

Uh oh!

aucampia Apr 14, 2022

Choose a reason for hiding this comment

Uh oh!

ajnelson-nist Apr 14, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ajnelson-nist commented Apr 14, 2022

Uh oh!

aucampia commented Apr 14, 2022

Uh oh!

ghost commented Apr 14, 2022

Uh oh!

aucampia commented Apr 14, 2022

Uh oh!

ajnelson-nist commented Apr 14, 2022

Uh oh!

aucampia commented Apr 14, 2022

Uh oh!

Uh oh!

ashleysommer commented Mar 24, 2022 •

edited

Loading

ghost commented Mar 24, 2022 •

edited by ghost

Loading

ashleysommer commented Mar 25, 2022 •

edited

Loading

aucampia commented Mar 26, 2022 •

edited

Loading

aucampia commented Apr 10, 2022 •

edited

Loading