Skip to content

Move Store API to work with identifiers, not graphs #1646

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 42 commits into from
Closed

Move Store API to work with identifiers, not graphs #1646

wants to merge 42 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Jan 6, 2022

This is a rebase of PR #958, PR #409 rebased against master to fix #167

Proposed Changes

As per original PR, all instances of Graph objects as contexts changed to Graph identifiers.

  • graph.py
  • auditable.py
  • notation3.py
  • sparqlstore.py
  • store.py
  • update.py
  • corresponding changes to tests and three annotations of failure.

Graham Higgins added 5 commits January 6, 2022 01:24
This is a rebase of PR #958, PR #409 rebased against master to fix #167
update to pytest, add failure annotations, change `urn:` to `urn:x-rdflib:`
@ghost
Copy link
Author

ghost commented Jan 6, 2022

The default assignment of a BNode identifier to Graph is causing some blocking issues:

>>> from rdflib import Graph, URIRef
>>> tarek = URIRef("urn:x-rdflib:tarek")
>>> likes = URIRef("urn:x-rdflib:likes")
>>> pizza = URIRef("urn:x-rdflib:pizza")
>>> g = Graph(store="SPARQLUpdateStore")
>>> g.open(
...     configuration=(
...         "http://localhost:3030/db/query",
...         "http://localhost:3030/db/update",
...     )
... )
>>> g.add((tarek, likes, pizza))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../rdflib/graph.py", line 431, in add
    self.__store.add((s, p, o), self.identifier, quoted=False)
  File ".../rdflib/plugins/stores/sparqlstore.py", line 628, in add
    q = "INSERT DATA { GRAPH %s { %s } }" % (nts(context), triple)
  File ".../rdflib/plugins/stores/sparqlstore.py", line 33, in _node_to_sparql
    raise Exception(
Exception: SPARQLStore does not support BNodes! See http://www.w3.org/TR/sparql11-query/#BGPsparqlBNodes

Although it wasn't included in the original PR, this change:

diff --git a/rdflib/graph.py b/rdflib/graph.py
index cab825b1..69e31c32 100644
--- a/rdflib/graph.py
+++ b/rdflib/graph.py
@@ -1940,7 +1940,7 @@ class Dataset(ConjunctiveGraph):
     """
 
     def __init__(self, store="default", default_union=False, default_graph_base=None):
-        super(Dataset, self).__init__(store=store, identifier=None)
+        super(Dataset, self).__init__(store=store, identifier=DATASET_DEFAULT_GRAPH_ID)
 
         if not self.store.graph_aware:
             raise Exception("DataSet must be backed by a graph-aware store!")

allows the the otherwise failing tests in test_dataset_generators and json-ld test_named_graphs to succeed.

More generally, niklasl 's observations in #436 remain true “When ConjunctiveGraph.parse is called, it wraps its underlying store in a regular Graph instance. This causes problems for parsers of datasets, e.g. NQuads, TriG and JSON-LD. Specifically, the triples in the default graph of a dataset haphazardly end up in bnode-named contexts.” so while programmatic operations on Datasets are coherent, after this PR is applied, parsing is not.

…RAPH_ID instead of None. Revert test failure annotations in `test_dataset_generators` and `test_named_graphs`
@ghost
Copy link
Author

ghost commented Jan 6, 2022

Re: ea1c2eb, I'd be happier if I had a better model of why this works but being able to resolve the jsonld test_named_graphs failure seems worth it.

There are other reasons, too.

  1. I followed aucampia's suggestion of creating a test_dataset_operators correlate to test_dataset_generators, which I meant to introduce before commit in order to illustrate what doesn't yet work with just the rebased PR as-is. Now that I've mucked it up, I'll include the diff as illiustration (the ValueErrors are solved in a subsequent change):
+++ test_dataset_graph_operators.py   2022-01-06 12:07:32.119924381 +0000
@@ -37,8 +37,7 @@
     with pytest.raises(ValueError):  # too many values to unpack (expected 3)
         assert len(ds - g) == 1  # removes pizza
 
-    with pytest.raises(AssertionError):  # 0 != 1
-        assert len(ds * g) == 1  # only pizza
+    assert len(ds * g) == 1  # only pizza
 
     with pytest.raises(ValueError):  # too many values to unpack (expected 3)
         assert len(ds ^ g) == 2  # removes pizza, adds cheese
@@ -64,10 +63,7 @@
     ):  # Trying to add to a context that isn't an identifier: None
         assert len(ds - cg) == 1  # removes pizza
 
-    with pytest.raises(
-        Exception
-    ):  # Trying to add to a context that isn't an identifier: None
-        assert len(ds * cg) == 1  # only pizza
+    assert len(ds * cg) == 1  # only pizza
 
 
 def test_operators_with_dataset_and_namedgraph():
@@ -90,8 +86,7 @@
     ):  # Trying to add to a context that isn't an identifier: None
         assert len(ds - ng) == 1  # removes pizza
 
-    with pytest.raises(AssertionError):  # 0 != 1
-        assert len(ds * ng) == 1  # only pizza
+    assert len(ds * ng) == 1  # only pizza
 
 
 def test_reversed_operators_with_dataset_and_graph():
@@ -107,8 +102,7 @@
     with pytest.raises(ValueError):  # too many values to unpack (expected 3)
         assert len(g + ds) == 3  # adds cheese as liking
 
-    with pytest.raises(AssertionError):  # 2 != 1
-        assert len(g - ds) == 1  # removes pizza
+    assert len(g - ds) == 1  # removes pizza
 
     with pytest.raises(ValueError):  # too many values to unpack (expected 3)
         assert len(g * ds) == 1  # only pizza
  1. Assigning DATASET_DEFAULT_GRAPH_ID as the identifier of the Dataset default graph has an impact on Dataset.__len__() which then becomes the length of the default graph. Although this can accord with users' expectations (c.f. Dataset.graph should not allow adding random graphs to the store #698 (comment)) ...

“To clarify, if I were to run

import rdflib
from rdflib.namespace import FOAF
foaf = rdflib.Graph(identifier = FOAF)
foaf.parse("http://xmlns.com/foaf/spec/index.rdf", format="xml")
ds = rdflib.Dataset()
ds.add_graph(foaf)

am I correct in assuming that expected behavior is that len(foaf) should equal len(ds) (when in fact len(ds) is 0)?”


... the reality is that a Dataset is a collection of Graphs, one of which is a default --- in this light, perhaps len(ds) is perhaps more properly 0 and should be documented as such. fwiw, I favour len(ds) to be the length of the default graph, I doubt that users with deeper models of RDF1.1 would be thrown by this.

@nicholascar
Copy link
Member

nicholascar commented Jan 6, 2022

I just want to check we don't break any current multi-graph functionality.

I sometimes want to get a single graph out of a multi-graph and a function like this will currently do it:

def get_graph(
    self,
    identifier: Union[URIRef, BNode]
) -> Union[Graph, None]:
    """Returns the graph identified by given identifier"""
    return [x for x in self.contexts() if x.identifier == identifier][0]

How would a this need to be implemented after this change, because the whole graph object couldn't be returned, could it? Perhaps I would have to copy content out of the multi-graph, triple-by-triple like this:

def get_graph(
    self,
    identifier: Union[URIRef, BNode]
) -> Union[Graph, None]:
    """Returns the graph identified by given identifier"""
    g2 = Graph()
    for s, p, o in self.triples((None, None, None, identifier)):
        g2.add((s, p, o))
    return g2

@ghost
Copy link
Author

ghost commented Jan 6, 2022

I just want to check we don't break any current multi-graph functionality.

I sometimes want to get a single graph out of a multi-graph and a function like this will currently do it:
...
How would a this need to be implemented after this change, because the whole graph object couldn't be returned, could it? Perhaps I would have to copy content out of the multi-graph, triple-by-triple like this:

No, you're good, it returns the Graph (this, from the extended tests I'm about to commit):

def test_dataset_add_graph():
    data = """
    <http://data.yyx.me/jack> <http://onto.yyx.me/work_for> <http://data.yyx.me/company_a> .
    <http://data.yyx.me/david> <http://onto.yyx.me/work_for> <http://data.yyx.me/company_b> .
    """
    ds = Dataset()
    subgraph_identifier = URIRef("urn:x-rdflib:subgraph1")
    g = ds.graph(subgraph_identifier)  # intended use pattern [1] 
    g.parse(data=data, format="n3")
    assert len(g) == 2
    subgraph = ds.get_context(subgraph_identifier)  # Your use case
    assert type(subgraph) is Graph
    assert len(subgraph) == 2

[1] see Dataset doc

@ghost
Copy link
Author

ghost commented Jan 6, 2022

Just a note, en passant. Because all RDFLib Graphs are named graphs, the Dataset implementation is slightly adrift from the RDF1.1 spec ...
“An RDF dataset is a collection of RDF graphs, and comprises:

  • Exactly one default graph, being an RDF graph. The default graph does not have a name and MAY be empty.
  • Zero or more named graphs. Each named graph is a pair consisting of an IRI or a blank node (the graph name), and an RDF graph. Graph names are unique within an RDF dataset.”

@nicholascar
Copy link
Member

the Dataset implementation is slightly adrift from the RDF1.1 spec

Yeah, after this PR, my and Ashley's plan is to dedup ConjunctiveGraph & Dataset with a more spec implementation for an rdflib 7.0.0. I can't think of any benefits for having both classes that are similar but slightly different. It's confusing. I suppose someone added Dataset long after ConjunctiveGraph and had some specific reasons but I've not found work reasons to be able to differentiate them.

@ghost
Copy link
Author

ghost commented Jan 6, 2022

the Dataset implementation is slightly adrift from the RDF1.1 spec

Yeah, after this PR, my and Ashley's plan is to dedup ConjunctiveGraph & Dataset with a more spec implementation for an rdflib 7.0.0. I can't think of any benefits for having both classes that are similar but slightly different. It's confusing. I suppose someone added Dataset long after ConjunctiveGraph and had some specific reasons but I've not found work reasons to be able to differentiate them.

Gory details of the discussion are collated in this gist, enjoy 😁

@ghost
Copy link
Author

ghost commented Jan 6, 2022

Documentation note. Oracle how-to: “RDF Datasets, Named and Default Graphs Concepts and How to Query Them.
Emma Thomas. May 5, 2020”
- purely SPARQL-oriented, a Pythonic API will need to have a different perspective.

@nicholascar
Copy link
Member

Perhaps this PR should be moved to Draft if there are still a series of expected commits @gjhiggins? Just so that I and others don't approve too early or something!

@ghost
Copy link
Author

ghost commented Jan 7, 2022

Perhaps this PR should be moved to Draft if there are still a series of expected commits @gjhiggins? Just so that I and others don't approve too early or something!

I meant to set it to Draft status but forgot and there doesn't seem to be any way to change it.

The only remaining task (apart from documentation) is to implement the missing Dataset operators/in-place operators and corresponding tests.

Fixes quite a few issues 😄 https://github.com/gjhiggins/rdflib/tree/identifier-as-context/test/test_identifier_as_context/test_issues_fixed

@nicholascar
Copy link
Member

OK, I can live!

I'm watching this closely so won't be a problem really. Ready to test myself and update documentation and examples like examples/datasets.py on completion.

@ghost
Copy link
Author

ghost commented Jan 8, 2022

OK, I can live!

I'm watching this closely so won't be a problem really. Ready to test myself and update documentation and examples like examples/datasets.py on completion.

Most of the work is now complete so examples/datasets.py executes without issue. You might be interested in the contents of test_issue436 which (along with the parsed input trig) is the proof-of-concept response to your comment.

Copy link
Contributor

@mwatts15 mwatts15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looked at store.py and graph.py.

(Glad to see this finally being ironed out :))

yield s, p, o, None
else:
elif len(quad) == 3:
yield s, p, o, c.identifier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think quads should return a Graph for the context in order to be consistent with other implementations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand your point. All three implementations of quads are iterators.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying that ConjunctiveGraph.quads returns a Graph for the fourth member of each quad, but Dataset.quads returns an Identifier. Admittedly, that isn't really the main point of this PR, so that may be a thing to consider later.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you did explicitly write "return a Graph for the context", sorry for being obtuse. You're correct that the current Dataset quads implementation returns an identifier but that behaviour is documented which is sort of the de facto definition of the public API.

I guess if this is to be changed, it needs to have a period of deprecation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy has just reminded me that if quads are going to have a Graph as a context, then the type signatures of add and remove need to reflect that - hence my original typing of Optional[Union["Graph", "Identifier"]]

else None
)
ctxId = context.identifier if context is not None else None
ctxId = context if context is not None else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to ctxId = context

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to make this change to the original PR if it's generally agreed, otherwise - seeing as it was a rebase of a rebase, I left the original untouched.

@@ -968,6 +967,12 @@ def qname(self, uri):
def compute_qname(self, uri, generate=True):
return self.namespace_manager.compute_qname(uri, generate)

def qname_strict(self, uri):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add more wrappers to namespace_manager? I don't see any additional usages in this PR.

In my experience, Graph is a poor place access a namespace manager from for anything except serializers: Any given namespace will span multiple graphs and stores, so I've found it's better to manage the namespace manager mostly separately vs trying to add and remove from this graph or that to get the prefixes I want..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely the move here is just to reach consistency of API? If you, @mwatts15, want a change in usage patterns, perhaps after this PR you would have to address the existing qname() & ``compute_qname()as well as qname_strict`?

So are you saying you use a stand-alone Namespace Manager object do do all your namespaceing, perhaps like:

from rdflib.namespace import NamespaceManager

nm = NamespaceManager()

g = Graph()
g.namespace_manager(nm)

other_gr = Graph()
other_g.namespace_manager(nm)

# etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add more wrappers to namespace_manager? I don't see any additional usages in this PR.

The other part of this commit details the issue resolved, when serialized as xml an Exception is raised by the XML serialiser: # 'Dataset' object has no attribute 'compute_qname_strict'. I'm not minded to prohibit serialisation of a Dataset as triples in XML (albeit, only the triples in the default graph get output) and, given that nicholascar and ashleysommer are considering a ConjunctiveGraph/Dataset dedupe, I thought this approach was the least impactful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gjhiggins, from where is 'Dataset' object has no attribute 'compute_qname_strict' reported? I would think the solution is just to pass dataset.namespace_manager instead of dataset in such a case. (I vaguely recall seeing a Graph passed off as a NamespaceManager in some RDFLib code, but I think that's really just a misuse of the API)

I don't mean to prohibit serialization of anything?? I'm just saying there's already a public API through the Graph's namespace_manager attribute: why duplicate it in Graph?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely the move here is just to reach consistency of API? If you, @mwatts15, want a change in usage patterns, perhaps after this PR you would have to address the existing qname() & compute_qname()as well as qname_strict?

My argument is for less redundancy in the API, so I'd say these methods should be deprecated. OTOH, I can see the argument for leaving them as part of the "legacy" API. Either way, I wouldn't encourage and deepen reliance on this redundancy by adding more methods that just pass through to namespace_manager.

As far as I can tell, the "Graph as namespace manager" thing traces back to a decision @eikeon made back in 2005 (6fc8ce9), probably to maintain backwards compatibility with earlier versions of Graph. I'd say remaining compatible with something before version 2.3.1 hasn't been possible for a while.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gjhiggins, from where is 'Dataset' object has no attribute 'compute_qname_strict' reported? I would think the solution is just to pass dataset.namespace_manager instead of dataset in such a case.

In this test of serializations of the Dataset default graph.

dataset_default_graph = ds.get_context(DATASET_DEFAULT_GRAPH_ID)
dataset_default_graph.serialize(format="xml")  # <-- results in the Exception

Apologies for mistakenly stating earlier that it was a Dataset issue, not a Dataset default graph issue.

@@ -1608,13 +1613,12 @@ def _spoc(
either triples or quads
"""
if triple_or_quad is None:
return (None, None, None, self.default_context if default else None)
return (None, None, None, self.identifier if default else None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong, isn't it? You want _spoc to return a Graph for the 4th member.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong, isn't it? You want _spoc to return a Graph for the 4th member.

This specific change is inherited verbatim from the rebased PR see LoC#1313. Changing it back temporarily doesn't cause any test failures which suggests that if if is wrong, we need a test for it. I added a warning and found the only code that triggers the warning is test_conjunctive_graph.py::test_quad_contexts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fourth member of the return value of _spoc is a Graph per the annotation, and it's treated as a Graph within this module except in a few cases where I've noted we check whether it is but could more intentionally compare against None (and maybe throw a TypeError for the case that we don't receive a Graph since that is expected)

Copy link
Author

@ghost ghost Jan 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. The collective model of quads seems to have changed somewhat from the original, as documented in the “Working with multi-graphs” section of the "Introduction to parsing", where the type of the fourth member is explicitly described as an identifier, not a Graph object,

“To read and query multi-graphs, that is RDF data that is context-aware, you need to use rdflib's :class:rdflib.ConjunctiveGraph or :class:rdflib.Dataset class. These are extensions to :class:rdflib.Graph that know all about quads (triples + graph IDs).” (my emphasis)

The example is equally explicit:

You could parse the file and query it like this:

from rdflib import Dataset
from rdflib.namespace import RDF

g = Dataset()
g.parse("demo.trig")

for s, p, o, g in g.quads((None, RDF.type, None, None)):
    print(s, g)

This will print out:

http://example.com/person/drewp http://example.com/person/graph-1
http://example.com/person/nick http://example.com/person/graph-2

@@ -1686,14 +1690,14 @@ def triples(self, triple_or_quad, context=None):
"""

s, p, o, c = self._spoc(triple_or_quad)
context = self._graph(context or c)
context = context or c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should explicitly check for context is None since an empty Graph is treated as a False value since __len__ is defined, but __bool__ isn't defined.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, an explicit check would be required, except that issue appears to be addressed in the code that immediately follows the changed LoC:

        if self.default_union:
            if context == self.identifier:
                context = None
        else:
            if context is None:
                context = self.identifier

if len(triple_or_quad) == 4:
super(Dataset, self).add(triple_or_quad)
else:
super(Dataset, self).add(triple_or_quad + (DATASET_DEFAULT_GRAPH_ID,))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this rather pass in the self.default_context?


if not self.store.graph_aware:
raise Exception("DataSet must be backed by a graph-aware store!")
self.default_context = Graph(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why delete this??

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in the original PR from gromgull - LoC #1605

it returns the root context.
"""

quadformats = ["trig", "trix", "nquads", "hext", "json-ld"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is meant to solve the problem of dataset serialization formats that send the default graph to the graph matching the "publicID" (whatever a "public ID" is), but keeping a list of formats RDFLib happens to know have a concept of "quads" or "graphs" seems a very poor way to handle that. More importantly, I think handling that problem can be done separately from just changing the store API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, what does publicID actually mean? Is it just alternate (archaic?) terminology for the graph identifier, i which case we should probably just purge it, or does it have some other meaning?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, what does publicID actually mean? Is it just alternate (archaic?) terminology for the graph identifier, i which case we should probably just purge it, or does it have some other meaning?

From the docstring: “The publicID argument is for specifying the logical URI for the case that it's different from the physical source URI”.

It's necessary for correct serialization, e.g. reading the file examples/foaf.rdf and parsing it as data without a publicID results in URLs being rewritten as the file source, e.g.

rdflib.term.URIRef('file:///home/gjh/dev/rdflib/images/timbl-image-by-Coz-cropped.jpg')

whereas using the publicID of http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf results in the correct serialisation:

rdflib.term.URIRef('http://dig.csail.mit.edu/2008/webdav/timbl/images/timbl-image-by-Coz-cropped.jpg')

Just using the CSAIL URL as identifier doesn't result in correct serialization so it does appear that publicID has a valid and continued role.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, ok, I think I knew that or should have worked that out! So publicID stays as is.

@@ -2169,7 +2257,7 @@ class ReadOnlyGraphAggregate(ConjunctiveGraph):
"""Utility class for treating a set of graphs as a single graph

Only read operations are supported (hence the name). Essentially a
ConjunctiveGraph over an explicit subset of the entire store.
ConjunctiveGraph over an explicit subset of an entire single store.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't "single" redundant here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so...

rdflib/store.py Outdated
def remove(
self,
triple: Tuple["Node", "Node", "Node"],
context: Optional[Union["Graph", "Identifier"]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Graph in this annotation? Isn't the point of this PR to make it always an Identifier?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good point...

@ghost
Copy link
Author

ghost commented Jan 10, 2022

Yeah, after this PR, my and Ashley's plan is to dedup ConjunctiveGraph & Dataset with a more spec implementation for an rdflib 7.0.0. I can't think of any benefits for having both classes that are similar but slightly different. It's confusing. I suppose someone added Dataset long after ConjunctiveGraph and had some specific reasons but I've not found work reasons to be able to differentiate them.

From: FWIW Dept.

Digging back into the commit history reveals that ConjunctiveGraph was introduced back in Oct 2005 in 750e8b8. Whilst this is contemporaneous with discussions of reification on the www-rdf-interest mailing list (see below), this significantly predates the official WC3 description of Dataset - which wasn't in the RDF1.1 draft of Aug 2011, only appearing in rough form in June 2012.

ConjunctiveGraph was eikeon's implementation of context-as-reification, a concept later refined and formalized by W3C as RDF Datasets. Hence the apparent duplication arises from the common requirements of representing reified triples. Ivan Herman, the author of the RDFLib Dataset implementation was an active member of W3C at the time (and who, tbf, isn't exactly complimentary about his implementation)

The first actual release of RDFLib - 2.0.6 was a couple of months prior to the publication of the Bizer, Carroll, Hayes and Stickler paper "Named Graphs, Provenance and Trust" at WWW May 2005, Chiba, Japan.

I suspect the fact that all RDFLib Graphs are named graphs is an early implementation of the principles defined therein. In Dec 2005, Chimezie committed a change to the Graph docstring referencing a 2004 paper on Named Graphs and TriX. The notion had been discussed for a couple of years previously, e.g. in this Feb 2004 paper after its airing in Dec 2003 by Chris Bizer on the www-rdf-interest mailing list - not that it was the first time reification had been discussed, as Jeremy Carroll observes:

... the interpretation of a reified statement is clearly
defined by RDF Semantics - but this differs from the applications that
reification is often used for. That was the best the WG could do and it is a
known limitation with the current round of RDF. That's why things like
contexts ended up in the postponed pile.

Given that Datasets are at least described in the W3C specs (albeit they declined to present a formal specification of the semantics) then following a clean, robust implementation of Datasets, the use of ConjunctiveGraphs could be deprecated but I imagine that would have profound implications for legacy apps/code.

@ghost ghost marked this pull request as draft January 10, 2022 18:52
@ghost
Copy link
Author

ghost commented Jan 10, 2022

I meant to set it to Draft status but forgot and there doesn't seem to be any way to change it.

Ah, found the "convert PR to draft" link.

@nicholascar
Copy link
Member

...following a clean, robust implementation of Datasets, the use of ConjunctiveGraphs could be deprecated but I imagine that would have profound implications for legacy apps/code.

That's what a 7.0.0 would allow for!

The ConjunctiveGraph is dead, all hail the Dataset!

@ghost
Copy link
Author

ghost commented Feb 5, 2022

...following a clean, robust implementation of Datasets, the use of ConjunctiveGraphs could be deprecated but I imagine that would have profound implications for legacy apps/code.

That's what a 7.0.0 would allow for!

The ConjunctiveGraph is dead, all hail the Dataset!

Update. I went the whole hog and implemented the de-dupe. Dataset now inherits from Graph and the implementation is much cleaner. There are no longer any calls to ConjunctiveGraph in the RDFLib code, its presence in graph.py is now merely for legacy support. It's not quite ready for a PR just yet, even as a Draft, but soon enough (I hope) to be in 6.2 and enable a generous deprecation period for ConjunctiveGraph.

@nicholascar
Copy link
Member

Great tidying @gjhiggins! Standing by to review/help/release.

@ghost ghost closed this Mar 22, 2022
@ghost ghost deleted the identifier-as-context branch March 22, 2022 10:58
@ghost
Copy link
Author

ghost commented Mar 22, 2022

This work unavoidably breaks the API (in particular, Dataset.get_context() which now returns a URIRef, not a Graph) and so can only be included as part of a version 0.7 release, so I have closed this PR and removed the branch from my fork.

@nicholascar
Copy link
Member

@gjhiggins are you saying you are expecting to re-propose this PR for a 7.0.0 release?

@ghost
Copy link
Author

ghost commented Mar 22, 2022

Yes. Sorry if I wasn't clear. I've been busy working up a more complete implementation which includes changes to the stores, parsers, serializers and the SPARQL implementation in order to accommodate the changes in the public API.

All bar a couple of the extant suite of tests pass and apart from a small number of (some admittedly non-trivial) issues (e.g. Dataset operators/isomorphism [1] and the round-tripping of QuotedGraphs [2]) it's looking pretty good overall.

Now that a large chunk of the test files reorg has been done, I'm in a position to create a branch to carry the work which I am in the process of effecting right now and expect to commit later today.

[1] “RESOLVED: close issue-17 -- there is no general purpose way to merge datasets; it can only be done with external knowledge.”

[2] In the context of test_graph_formula.py

def test_n3_roundtrip():
    ds = Dataset()
    ds.parse(data=testN3, format="n3")
    assert len(list(ds.contexts())) == 2
    assert sorted(list(ds.contexts())) == [
        rdflib.term.BNode('_:Formula2'),
        rdflib.term.BNode('_:Formula3'),
    ]
    assert sorted([c.n3() for c in list(ds.contexts())]) == ['_:_:Formula2', '_:_:Formula3']

@ghost ghost mentioned this pull request Jul 27, 2022
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify confusion around type of context element in ConjunctiveGraphs and context aware stores
4 participants