-
Notifications
You must be signed in to change notification settings - Fork 578
Move Store API to work with identifiers, not graphs #1646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
update to pytest, add failure annotations, change `urn:` to `urn:x-rdflib:`
The default assignment of a BNode identifier to Graph is causing some blocking issues: >>> from rdflib import Graph, URIRef
>>> tarek = URIRef("urn:x-rdflib:tarek")
>>> likes = URIRef("urn:x-rdflib:likes")
>>> pizza = URIRef("urn:x-rdflib:pizza")
>>> g = Graph(store="SPARQLUpdateStore")
>>> g.open(
... configuration=(
... "http://localhost:3030/db/query",
... "http://localhost:3030/db/update",
... )
... )
>>> g.add((tarek, likes, pizza))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../rdflib/graph.py", line 431, in add
self.__store.add((s, p, o), self.identifier, quoted=False)
File ".../rdflib/plugins/stores/sparqlstore.py", line 628, in add
q = "INSERT DATA { GRAPH %s { %s } }" % (nts(context), triple)
File ".../rdflib/plugins/stores/sparqlstore.py", line 33, in _node_to_sparql
raise Exception(
Exception: SPARQLStore does not support BNodes! See http://www.w3.org/TR/sparql11-query/#BGPsparqlBNodes Although it wasn't included in the original PR, this change: diff --git a/rdflib/graph.py b/rdflib/graph.py
index cab825b1..69e31c32 100644
--- a/rdflib/graph.py
+++ b/rdflib/graph.py
@@ -1940,7 +1940,7 @@ class Dataset(ConjunctiveGraph):
"""
def __init__(self, store="default", default_union=False, default_graph_base=None):
- super(Dataset, self).__init__(store=store, identifier=None)
+ super(Dataset, self).__init__(store=store, identifier=DATASET_DEFAULT_GRAPH_ID)
if not self.store.graph_aware:
raise Exception("DataSet must be backed by a graph-aware store!") allows the the otherwise failing tests in More generally, niklasl 's observations in #436 remain true “When ConjunctiveGraph.parse is called, it wraps its underlying store in a regular Graph instance. This causes problems for parsers of datasets, e.g. NQuads, TriG and JSON-LD. Specifically, the triples in the default graph of a dataset haphazardly end up in bnode-named contexts.” so while programmatic operations on Datasets are coherent, after this PR is applied, parsing is not. |
…RAPH_ID instead of None. Revert test failure annotations in `test_dataset_generators` and `test_named_graphs`
Re: ea1c2eb, I'd be happier if I had a better model of why this works but being able to resolve the jsonld test_named_graphs failure seems worth it. There are other reasons, too.
+++ test_dataset_graph_operators.py 2022-01-06 12:07:32.119924381 +0000
@@ -37,8 +37,7 @@
with pytest.raises(ValueError): # too many values to unpack (expected 3)
assert len(ds - g) == 1 # removes pizza
- with pytest.raises(AssertionError): # 0 != 1
- assert len(ds * g) == 1 # only pizza
+ assert len(ds * g) == 1 # only pizza
with pytest.raises(ValueError): # too many values to unpack (expected 3)
assert len(ds ^ g) == 2 # removes pizza, adds cheese
@@ -64,10 +63,7 @@
): # Trying to add to a context that isn't an identifier: None
assert len(ds - cg) == 1 # removes pizza
- with pytest.raises(
- Exception
- ): # Trying to add to a context that isn't an identifier: None
- assert len(ds * cg) == 1 # only pizza
+ assert len(ds * cg) == 1 # only pizza
def test_operators_with_dataset_and_namedgraph():
@@ -90,8 +86,7 @@
): # Trying to add to a context that isn't an identifier: None
assert len(ds - ng) == 1 # removes pizza
- with pytest.raises(AssertionError): # 0 != 1
- assert len(ds * ng) == 1 # only pizza
+ assert len(ds * ng) == 1 # only pizza
def test_reversed_operators_with_dataset_and_graph():
@@ -107,8 +102,7 @@
with pytest.raises(ValueError): # too many values to unpack (expected 3)
assert len(g + ds) == 3 # adds cheese as liking
- with pytest.raises(AssertionError): # 2 != 1
- assert len(g - ds) == 1 # removes pizza
+ assert len(g - ds) == 1 # removes pizza
with pytest.raises(ValueError): # too many values to unpack (expected 3)
assert len(g * ds) == 1 # only pizza
“To clarify, if I were to run import rdflib
from rdflib.namespace import FOAF
foaf = rdflib.Graph(identifier = FOAF)
foaf.parse("http://xmlns.com/foaf/spec/index.rdf", format="xml")
ds = rdflib.Dataset()
ds.add_graph(foaf) am I correct in assuming that expected behavior is that len(foaf) should equal len(ds) (when in fact len(ds) is 0)?” ... the reality is that a Dataset is a collection of Graphs, one of which is a default --- in this light, perhaps |
I just want to check we don't break any current multi-graph functionality. I sometimes want to get a single graph out of a multi-graph and a function like this will currently do it: def get_graph(
self,
identifier: Union[URIRef, BNode]
) -> Union[Graph, None]:
"""Returns the graph identified by given identifier"""
return [x for x in self.contexts() if x.identifier == identifier][0] How would a this need to be implemented after this change, because the whole graph object couldn't be returned, could it? Perhaps I would have to copy content out of the multi-graph, triple-by-triple like this: def get_graph(
self,
identifier: Union[URIRef, BNode]
) -> Union[Graph, None]:
"""Returns the graph identified by given identifier"""
g2 = Graph()
for s, p, o in self.triples((None, None, None, identifier)):
g2.add((s, p, o))
return g2 |
No, you're good, it returns the Graph (this, from the extended tests I'm about to commit): def test_dataset_add_graph():
data = """
<http://data.yyx.me/jack> <http://onto.yyx.me/work_for> <http://data.yyx.me/company_a> .
<http://data.yyx.me/david> <http://onto.yyx.me/work_for> <http://data.yyx.me/company_b> .
"""
ds = Dataset()
subgraph_identifier = URIRef("urn:x-rdflib:subgraph1")
g = ds.graph(subgraph_identifier) # intended use pattern [1]
g.parse(data=data, format="n3")
assert len(g) == 2
subgraph = ds.get_context(subgraph_identifier) # Your use case
assert type(subgraph) is Graph
assert len(subgraph) == 2 [1] see Dataset doc |
Just a note, en passant. Because all RDFLib Graphs are named graphs, the Dataset implementation is slightly adrift from the RDF1.1 spec ...
|
Yeah, after this PR, my and Ashley's plan is to dedup |
Gory details of the discussion are collated in this gist, enjoy 😁 |
Documentation note. Oracle how-to: “RDF Datasets, Named and Default Graphs Concepts and How to Query Them. |
A bit brutish maybe but effective in that the issues are fixed without causing any additional tests to fail.
Perhaps this PR should be moved to Draft if there are still a series of expected commits @gjhiggins? Just so that I and others don't approve too early or something! |
I meant to set it to Draft status but forgot and there doesn't seem to be any way to change it. The only remaining task (apart from documentation) is to implement the missing Dataset operators/in-place operators and corresponding tests. Fixes quite a few issues 😄 https://github.com/gjhiggins/rdflib/tree/identifier-as-context/test/test_identifier_as_context/test_issues_fixed |
OK, I can live! I'm watching this closely so won't be a problem really. Ready to test myself and update documentation and examples like examples/datasets.py on completion. |
Most of the work is now complete so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looked at store.py and graph.py.
(Glad to see this finally being ironed out :))
yield s, p, o, None | ||
else: | ||
elif len(quad) == 3: | ||
yield s, p, o, c.identifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think quads
should return a Graph for the context in order to be consistent with other implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand your point. All three implementations of quads
are iterators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm saying that ConjunctiveGraph.quads
returns a Graph
for the fourth member of each quad, but Dataset.quads
returns an Identifier
. Admittedly, that isn't really the main point of this PR, so that may be a thing to consider later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you did explicitly write "return a Graph for the context", sorry for being obtuse. You're correct that the current Dataset quads implementation returns an identifier but that behaviour is documented which is sort of the de facto definition of the public API.
I guess if this is to be changed, it needs to have a period of deprecation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy has just reminded me that if quads are going to have a Graph as a context, then the type signatures of add
and remove
need to reflect that - hence my original typing of Optional[Union["Graph", "Identifier"]]
else None | ||
) | ||
ctxId = context.identifier if context is not None else None | ||
ctxId = context if context is not None else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equivalent to ctxId = context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to make this change to the original PR if it's generally agreed, otherwise - seeing as it was a rebase of a rebase, I left the original untouched.
@@ -968,6 +967,12 @@ def qname(self, uri): | |||
def compute_qname(self, uri, generate=True): | |||
return self.namespace_manager.compute_qname(uri, generate) | |||
|
|||
def qname_strict(self, uri): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add more wrappers to namespace_manager? I don't see any additional usages in this PR.
In my experience, Graph
is a poor place access a namespace manager from for anything except serializers: Any given namespace will span multiple graphs and stores, so I've found it's better to manage the namespace manager mostly separately vs trying to add and remove from this graph or that to get the prefixes I want..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely the move here is just to reach consistency of API? If you, @mwatts15, want a change in usage patterns, perhaps after this PR you would have to address the existing qname()
& ``compute_qname()as well as
qname_strict`?
So are you saying you use a stand-alone Namespace Manager object do do all your namespaceing, perhaps like:
from rdflib.namespace import NamespaceManager
nm = NamespaceManager()
g = Graph()
g.namespace_manager(nm)
other_gr = Graph()
other_g.namespace_manager(nm)
# etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add more wrappers to namespace_manager? I don't see any additional usages in this PR.
The other part of this commit details the issue resolved, when serialized as xml an Exception is raised by the XML serialiser: # 'Dataset' object has no attribute 'compute_qname_strict'
. I'm not minded to prohibit serialisation of a Dataset as triples in XML (albeit, only the triples in the default graph get output) and, given that nicholascar and ashleysommer are considering a ConjunctiveGraph/Dataset dedupe, I thought this approach was the least impactful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gjhiggins, from where is 'Dataset' object has no attribute 'compute_qname_strict'
reported? I would think the solution is just to pass dataset.namespace_manager
instead of dataset
in such a case. (I vaguely recall seeing a Graph
passed off as a NamespaceManager
in some RDFLib code, but I think that's really just a misuse of the API)
I don't mean to prohibit serialization of anything?? I'm just saying there's already a public API through the Graph
's namespace_manager
attribute: why duplicate it in Graph
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely the move here is just to reach consistency of API? If you, @mwatts15, want a change in usage patterns, perhaps after this PR you would have to address the existing
qname()
&compute_qname()
as well asqname_strict
?
My argument is for less redundancy in the API, so I'd say these methods should be deprecated. OTOH, I can see the argument for leaving them as part of the "legacy" API. Either way, I wouldn't encourage and deepen reliance on this redundancy by adding more methods that just pass through to namespace_manager
.
As far as I can tell, the "Graph as namespace manager" thing traces back to a decision @eikeon made back in 2005 (6fc8ce9), probably to maintain backwards compatibility with earlier versions of Graph
. I'd say remaining compatible with something before version 2.3.1 hasn't been possible for a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gjhiggins, from where is
'Dataset' object has no attribute 'compute_qname_strict'
reported? I would think the solution is just to passdataset.namespace_manager
instead ofdataset
in such a case.
In this test of serializations of the Dataset default graph.
dataset_default_graph = ds.get_context(DATASET_DEFAULT_GRAPH_ID)
dataset_default_graph.serialize(format="xml") # <-- results in the Exception
Apologies for mistakenly stating earlier that it was a Dataset issue, not a Dataset default graph issue.
@@ -1608,13 +1613,12 @@ def _spoc( | |||
either triples or quads | |||
""" | |||
if triple_or_quad is None: | |||
return (None, None, None, self.default_context if default else None) | |||
return (None, None, None, self.identifier if default else None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong, isn't it? You want _spoc
to return a Graph
for the 4th member.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong, isn't it? You want
_spoc
to return aGraph
for the 4th member.
This specific change is inherited verbatim from the rebased PR see LoC#1313. Changing it back temporarily doesn't cause any test failures which suggests that if if is wrong, we need a test for it. I added a warning and found the only code that triggers the warning is test_conjunctive_graph.py::test_quad_contexts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fourth member of the return value of _spoc
is a Graph
per the annotation, and it's treated as a Graph
within this module except in a few cases where I've noted we check whether it is but could more intentionally compare against None
(and maybe throw a TypeError
for the case that we don't receive a Graph
since that is expected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. The collective model of quads seems to have changed somewhat from the original, as documented in the “Working with multi-graphs” section of the "Introduction to parsing", where the type of the fourth member is explicitly described as an identifier, not a Graph object,
“To read and query multi-graphs, that is RDF data that is context-aware, you need to use rdflib's :class:rdflib.ConjunctiveGraph
or :class:rdflib.Dataset
class. These are extensions to :class:rdflib.Graph
that know all about quads (triples + graph IDs).” (my emphasis)
The example is equally explicit:
You could parse the file and query it like this:
from rdflib import Dataset
from rdflib.namespace import RDF
g = Dataset()
g.parse("demo.trig")
for s, p, o, g in g.quads((None, RDF.type, None, None)):
print(s, g)
This will print out:
http://example.com/person/drewp http://example.com/person/graph-1
http://example.com/person/nick http://example.com/person/graph-2
@@ -1686,14 +1690,14 @@ def triples(self, triple_or_quad, context=None): | |||
""" | |||
|
|||
s, p, o, c = self._spoc(triple_or_quad) | |||
context = self._graph(context or c) | |||
context = context or c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should explicitly check for context is None
since an empty Graph
is treated as a False
value since __len__
is defined, but __bool__
isn't defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, an explicit check would be required, except that issue appears to be addressed in the code that immediately follows the changed LoC:
if self.default_union:
if context == self.identifier:
context = None
else:
if context is None:
context = self.identifier
if len(triple_or_quad) == 4: | ||
super(Dataset, self).add(triple_or_quad) | ||
else: | ||
super(Dataset, self).add(triple_or_quad + (DATASET_DEFAULT_GRAPH_ID,)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this rather pass in the self.default_context
?
|
||
if not self.store.graph_aware: | ||
raise Exception("DataSet must be backed by a graph-aware store!") | ||
self.default_context = Graph( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why delete this??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's in the original PR from gromgull - LoC #1605
it returns the root context. | ||
""" | ||
|
||
quadformats = ["trig", "trix", "nquads", "hext", "json-ld"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is meant to solve the problem of dataset serialization formats that send the default graph to the graph matching the "publicID" (whatever a "public ID" is), but keeping a list of formats RDFLib happens to know have a concept of "quads" or "graphs" seems a very poor way to handle that. More importantly, I think handling that problem can be done separately from just changing the store API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, what does publicID
actually mean? Is it just alternate (archaic?) terminology for the graph identifier, i which case we should probably just purge it, or does it have some other meaning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, what does
publicID
actually mean? Is it just alternate (archaic?) terminology for the graph identifier, i which case we should probably just purge it, or does it have some other meaning?
From the docstring: “The publicID argument is for specifying the logical URI for the case that it's different from the physical source URI”.
It's necessary for correct serialization, e.g. reading the file examples/foaf.rdf
and parsing it as data
without a publicID results in URLs being rewritten as the file source, e.g.
rdflib.term.URIRef('file:///home/gjh/dev/rdflib/images/timbl-image-by-Coz-cropped.jpg')
whereas using the publicID of http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf
results in the correct serialisation:
rdflib.term.URIRef('http://dig.csail.mit.edu/2008/webdav/timbl/images/timbl-image-by-Coz-cropped.jpg')
Just using the CSAIL URL as identifier doesn't result in correct serialization so it does appear that publicID
has a valid and continued role.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, ok, I think I knew that or should have worked that out! So publicID
stays as is.
@@ -2169,7 +2257,7 @@ class ReadOnlyGraphAggregate(ConjunctiveGraph): | |||
"""Utility class for treating a set of graphs as a single graph | |||
|
|||
Only read operations are supported (hence the name). Essentially a | |||
ConjunctiveGraph over an explicit subset of the entire store. | |||
ConjunctiveGraph over an explicit subset of an entire single store. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't "single" redundant here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so...
rdflib/store.py
Outdated
def remove( | ||
self, | ||
triple: Tuple["Node", "Node", "Node"], | ||
context: Optional[Union["Graph", "Identifier"]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is Graph in this annotation? Isn't the point of this PR to make it always an Identifier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a good point...
From: FWIW Dept.Digging back into the commit history reveals that ConjunctiveGraph was introduced back in Oct 2005 in 750e8b8. Whilst this is contemporaneous with discussions of reification on the www-rdf-interest mailing list (see below), this significantly predates the official WC3 description of Dataset - which wasn't in the RDF1.1 draft of Aug 2011, only appearing in rough form in June 2012. ConjunctiveGraph was eikeon's implementation of context-as-reification, a concept later refined and formalized by W3C as RDF Datasets. Hence the apparent duplication arises from the common requirements of representing reified triples. Ivan Herman, the author of the RDFLib Dataset implementation was an active member of W3C at the time (and who, tbf, isn't exactly complimentary about his implementation) The first actual release of RDFLib - 2.0.6 was a couple of months prior to the publication of the Bizer, Carroll, Hayes and Stickler paper "Named Graphs, Provenance and Trust" at WWW May 2005, Chiba, Japan. I suspect the fact that all RDFLib Graphs are named graphs is an early implementation of the principles defined therein. In Dec 2005, Chimezie committed a change to the Graph docstring referencing a 2004 paper on Named Graphs and TriX. The notion had been discussed for a couple of years previously, e.g. in this Feb 2004 paper after its airing in Dec 2003 by Chris Bizer on the www-rdf-interest mailing list - not that it was the first time reification had been discussed, as Jeremy Carroll observes:
Given that Datasets are at least described in the W3C specs (albeit they declined to present a formal specification of the semantics) then following a clean, robust implementation of Datasets, the use of ConjunctiveGraphs could be deprecated but I imagine that would have profound implications for legacy apps/code. |
Ah, found the "convert PR to draft" link. |
That's what a 7.0.0 would allow for! The |
Update. I went the whole hog and implemented the de-dupe. |
Great tidying @gjhiggins! Standing by to review/help/release. |
This work unavoidably breaks the API (in particular, |
@gjhiggins are you saying you are expecting to re-propose this PR for a 7.0.0 release? |
Yes. Sorry if I wasn't clear. I've been busy working up a more complete implementation which includes changes to the stores, parsers, serializers and the SPARQL implementation in order to accommodate the changes in the public API. All bar a couple of the extant suite of tests pass and apart from a small number of (some admittedly non-trivial) issues (e.g. Dataset operators/isomorphism [1] and the round-tripping of QuotedGraphs [2]) it's looking pretty good overall. Now that a large chunk of the test files reorg has been done, I'm in a position to create a branch to carry the work which I am in the process of effecting right now and expect to commit later today. [1] “RESOLVED: close issue-17 -- there is no general purpose way to merge datasets; it can only be done with external knowledge.” [2] In the context of test_graph_formula.py def test_n3_roundtrip():
ds = Dataset()
ds.parse(data=testN3, format="n3")
assert len(list(ds.contexts())) == 2
assert sorted(list(ds.contexts())) == [
rdflib.term.BNode('_:Formula2'),
rdflib.term.BNode('_:Formula3'),
]
assert sorted([c.n3() for c in list(ds.contexts())]) == ['_:_:Formula2', '_:_:Formula3'] |
This is a rebase of PR #958, PR #409 rebased against master to fix #167
Proposed Changes
As per original PR, all instances of Graph objects as contexts changed to Graph identifiers.