Skip to content

Add ability to detect and mark ill-typed literals #1773

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 15, 2022

Conversation

ashleysommer
Copy link
Contributor

@ashleysommer ashleysommer commented Mar 24, 2022

Add ability to mark ill-typed literals where the lexical value does not match the datatype given if known

Parameterized test is included to test feature

Fixes #1757
Fixes #848

Related to, but doesn't quite relate to #737

@ashleysommer ashleysommer requested review from nicholascar and aucampia and removed request for nicholascar March 24, 2022 02:07
@ashleysommer ashleysommer force-pushed the mark_ill_typed_literals branch from 4c0853c to 693f9bf Compare March 24, 2022 02:12
@ashleysommer ashleysommer force-pushed the mark_ill_typed_literals branch from 693f9bf to 243995e Compare March 24, 2022 02:52
@ghost
Copy link

ghost commented Mar 24, 2022

Not wishing to add to workload but ... would this be a good opportunity to also address #1326 “Literal should always have a datatype”?

RDFLib‘s current stance on Literal datatypes (according to the error message text) dates from the long-superseded W3C Recommendation 10 February 2004

if lang is not None and datatype is not None:
    raise TypeError(
        "A Literal can only have one of lang or datatype, "
        "per http://www.w3.org/TR/rdf-concepts/#section-Graph-Literal"
    )

In #1326, @white-gecko summarises the entailments of the current spec as:

  1. The datatype of a Literal should always be set
  2. If a language tag is specified, the datatype must be http://www.w3.org/1999/02/22-rdf-syntax-ns#langString

AIUI, both of the above could be flagged via the “ill-formed” property.

Ach, I just realised the implications of kouralex’ comment regarding serializer behaviour, conforming to the RDF 1.1 spec is going to require additional work, perhaps this is better left as it is.

Copy link
Member

@aucampia aucampia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patch looks good, just the concern that technically Literal.ill_formed == False implies it is either not lexically well-typed/formed, or that we just don't have a checker registered for this.

Also may be better to call it ill_typed rather than ill_formed, though I'm not entirely sure about this.

Co-authored-by: Iwan Aucamp <aucampia@gmail.com>
@ashleysommer
Copy link
Contributor Author

ashleysommer commented Mar 25, 2022

When I started this patch, my assumption was that if we cannot prove the Literal is ill-formed, then it must be well-formed. Thus ill_formed is False until proven to the true by the checks. But thinking about it since then, you're right, maybe it should be a three valued property, with None being the default. If there are built-in tests, then it can be marked as true/false. otherwise it is None, which is unknown.

Regarding naming.. I'm using "ill-formed" because the term "well-formed" is used in the PySHACL spec document, in the context of describing a "well-formed literal". A Literal that is not "well-formed" is one where the the lexical does not match the datatype. Therefore, I concluded a Literal that is not "well-formed" could be called "ill-formed". But maybe "ill-typed" is a better description.

EDIT: I just saw that indeed the language in the W3C SHACL test suite refers to these literals as "ill-formed".

@aucampia
Copy link
Member

aucampia commented Mar 26, 2022

Regarding naming.. I'm using "ill-formed" because the term "well-formed" is used in the PySHACL spec document, in the context of describing a "well-formed literal". A Literal that is not "well-formed" is one where the the lexical does not match the datatype. Therefore, I concluded a Literal that is not "well-formed" could be called "ill-formed". But maybe "ill-typed" is a better description.

EDIT: I just saw that indeed the language in the W3C SHACL test suite refers to these literals as "ill-formed".

Okay, happy with name then.

When I started this patch, my assumption was that if we cannot prove the Literal is ill-formed, then it must be well-formed. Thus ill_formed is False until proven to the true by the checks. But thinking about it since then, you're right, maybe it should be a three valued property, with None being the default. If there are built-in tests, then it can be marked as true/false. otherwise it is None, which is unknown.

This is actually even a bit more hairy now that I think about it, we have the following scenarios as I interpret RDF 1.1 Concepts and Abstract Syntax / 3.3 Literals

  1. The Literal's datatype is not a recognized datatype IRIs: In this case the concept is not well defined, as it is only defined for recognized datatype IRIs

    In this case I would say the reasonable value for Literal.ill_formed is None.

  2. The Literal's datatype is a recognized datatype IRIs: In this case the value should ideally only ever be true or false, and I think in this case it is maybe reasonable to conclude that it is NOT ill_typed unless we can show it is ill_typed, however maybe there are cases here which we are not yet sure for which Literal.ill_formed could also be None.

I made a PR to change ill_typed to be three valued: ashleysommer#7

The logic it follows is in a docstring:

property ill_formed: Optional[bool]

For recognized datatype IRIs, this value will be True if the literal is ill formed, otherwise it will be False. Literal.value (i.e. the literal value) should always be defined if this value is True, but should not be considered reliable if this property is not True.

If the literal’s datatype is not in the set of recognized datatype IRIs this value will be None.

Happy to make any further changes to it if you want, just let me know.

@aucampia
Copy link
Member

aucampia commented Apr 4, 2022

@ashleysommer let me know if you need help on this, happy to work to get this in.

@aucampia
Copy link
Member

aucampia commented Apr 10, 2022

I will rebase this next week and merge ashleysommer#7 so we can merge this into main.

`ill_formed` will only be `True` or `False` if the datatype IRI is
recognized.

Also added some more tests.
@aucampia
Copy link
Member

Merged ashleysommer#7 and merged master. I also did another review. To me it looks like a fairly good addition, one concern is that this will make parsing slower, maybe not that much, but somewhat. We should maybe consider setting up something to measure performance, not that familiar with the various solutions for it.

Just a heads up to @rchateauneu as you are very focused on performance, probably good to get some input from you here.

@aucampia aucampia requested review from a user and white-gecko April 12, 2022 18:55
@aucampia
Copy link
Member

@ajnelson-nist as this addresses or relates to an issue you raised (#1757) please check to see if this makes sense, at least on the interface side, as we can fix the implementation easier than the interface.

@aucampia
Copy link
Member

Planning to merge this on or after 2022-04-24.

@@ -1485,6 +1513,94 @@ def _parseBoolean(value: str) -> bool:
return False


def _well_formed_by_value(lexical: Union[str, bytes], value: Any) -> bool:
return value is not None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looked like an error on first glance, like it was intended to review self._value somehow. After seeing how it's called, I see this is intended to be a fallback method. Can it get a short documentation string summarizing this behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a docstring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ajnelson-nist
Copy link
Contributor

Summarizing with my latecomer's perspective to this thread (I didn't get a notice about this issue until @aucampia pinged me):

I appreciate @gjhiggins 's last remark on #1757 , with external data sources being potentially significant problems. I'm personally a fan of programs that give a --strict flag, but it's unacceptable for data someone else hands me to halt my program if it includes a bad restricted-integer.

So, I agree with this implementation of a .ill_formed property, deferring enforcement decisions but enabling a surveyable property.

The next question, which we could agree to leave out of scope of this PR, is how to review the Literals for where this property is True. Should rdflib.Graph provide a convenience method/property for that?

@aucampia
Copy link
Member

Should rdflib.Graph provide a convenience method/property for that?

I'm open to having a flag on Graph to make it strict, in which case we would check all literals for ill_typed is not True, or even possibly ill_typed is False, a relatively easy interim measure is possibly to just extend the store you are using to only accept literals if they have ill_typed is not True

@ghost
Copy link

ghost commented Apr 14, 2022

Should rdflib.Graph provide a convenience method/property for that?

I'm open to having a flag on Graph to make it strict, in which case we would check all literals for ill_typed is not True, or even possibly ill_typed is False, a relatively easy interim measure is possibly to just extend the store you are using to only accept literals if they have ill_typed is not True

I would argue for a state-reporting method rather than a state-asserting property because (unless it's a property of the store as @aucampia suggests), it could get quite messy in the context of operators, e.g. flaggedgraph += unflaggedgraph.

@aucampia
Copy link
Member

@ajnelson-nist thank you very much for the review, I addressed all the issues you raised I think. Will merge this tomorrow morning (CET) if there is no further feedback.

@ajnelson-nist
Copy link
Contributor

Should rdflib.Graph provide a convenience method/property for that?

I'm open to having a flag on Graph to make it strict, in which case we would check all literals for ill_typed is not True, or even possibly ill_typed is False, a relatively easy interim measure is possibly to just extend the store you are using to only accept literals if they have ill_typed is not True

I would argue for a state-reporting method rather than a state-asserting property because (unless it's a property of the store as @aucampia suggests), it could get quite messy in the context of operators, e.g. flaggedgraph += unflaggedgraph.

I'm getting the feeling the matter of a first broad-consumption interface should be raised as a separate PR. I'm partial to a strategy that lets me most quickly find where the problem is. From that use case, I'd be happiest with the Graph keeping a dictionary of all problematic literals, with the dictionary's value being a set of all triples that have the literal. That implementation helps me fix data problems, but introduces a cache design pattern that has to reconcile with graph-graph operations (at least adding and subtracting), and might also have performance implications.

I don't think this discussion should hold up making the ill-typedness property available.

@aucampia
Copy link
Member

I'm getting the feeling the matter of a first broad-consumption interface should be raised as a separate PR.

I don't think this discussion should hold up making the ill-typedness property available.

I agree with this, was mainly a suggestion, PR scope should be clamped to the ill_typed property. We can make a separate issue for further matters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Polar-integer datatypes not verifying polarity of lexical value rdflib.Literals need a well-formed / ill-formed status flag
3 participants