30 November 2006

SPARQL Basic Graph Pattern Matching

I wrote about the SPARQL algebra last time, not with the idea of a regular update about DAWG work but noting a significant issue being considered by the working. I hope that the wider community will comment and contribute; much of the ground work for the algebra is drawn from outside the working group anyway.

This week DAWG made a significant decision in a different, but nearby area, that of basic graph pattern matching (BGP).

Basic Graph Patterns

Basic graph patterns (BGPs) are the building block in SPARQL. They are a sequence of adjacent triple patterns in the query string. A BGP is matched against whatever is being queried and the results of matching feed into the algebra.

The decision is that no change, or rather the "implementation hint" for BGP matching, made more formal, becomes the basis of SPARQL. No queries will be changed by the decision.

The Issue

Having reveal the punch line, what's the joke? The issue is whether blank nodes in BGP matching behave like hidden, named variables or do they behave as a different kind of variable all together. This matters for counting - it does not matter for the logical meaning of a solution.

Example data:

  :a :p 1 .
  :a :p 2 .

Example query pattern:

 { ?x :p [] }

How many answers? Blank nodes are existential variables in RDF; named variables (the regular ?x ones) are universal variables. Queries don't return the binding of an existential; queries can return the binding of a named variable and the bound value of a named variables can be passed to other parts of the query (FILTERs, OPTIONALs etc) via the algebra.

In the absence of a DISTINCT in the query, are the solutions to this pattern:

  • 1 : ?x = :a
  • 2 : ?x = :a , ?x = :a
  • Either 1 or 2

In OWL-DL, existential variables give more expressivity: if there is a disjunction in the ontology, the reasoner can know that something exists, but not need to find an actual value - and it may not be able to find the value anyway - but the reasoner does know there is "something there".

Users and Implementers

For application writers, the main use of blank nodes in queries that I've seen is the convenience of using []

 {
   ?x :something [ :p ?v ; :q ?w ]
 } 

[] saves having to have a named variable and splitting the expression. It would be one more thing to learn if that causes different answers (duplicates) to the pattern using query variables:

 {
   ?x :something ?z .
   ?z :p ?v ; 
      :q ?w .
 } 

especially if the first pattern had been written:

 {
   ?x  :something _:bnode .
   _:bnode :p ?v ; 
           :q ?w . 
 } 

which is the [] example written out with a labelled blank node. This is also why it's BGPs that are the building block, not triple patterns - blank nodes are scoped to BGPs.

For implementers, having two kinds of variable that behave different in matching can make life a harder, especially if they using some existing technology, like an SQL database which has one preferred type (universal-like). Work on SPARQL semantics suggests that there can be one (complicated) SQL statement for one SPARQL query. Nice if it's true because all the work that has gone into SQL database optimizers is reusable. Having two types of variables would need extra SQL to be generated like nested SELECTs, or some post-processing done.

More work for implementers is not good; many open source RDF toolkits are built by people in their own time or by people with other things to do at work. So, it might be a noticeable dampener on toolkit development and hence deployment. Not a good idea to do this without a reasonable need.

DAWG Resolution

Logically, duplicates don't make any difference. DAWG decided there are two solutions for simple entailment matching example above. It's what implementations currently do.

The DAWG process is often test case driven. We use test cases as both a concrete way to define what results we expect and also to make concrete decisions. So once consensus was emerging to have duplicates, the working group decided to approve a test case that captured that. In this case, the test is rdfsemantics-bnode-type-var (the example above is the same idea, just shorter).

Defining BGP Matching for SPARQL

In Last Call 2, the first CR publication and the intermediate publication versions of SPARQL, BGP matching is a complicated definition that allows for different entailment regimes for extensibility within a single entailment framework. SPARQL is defined for simple entailment, and so for anything where there a virtual graph, but it allows for extension to more complicated entailment regimes, like OWL-DL.

The trouble is the matching definition is that it has a bug in it; it allows too many solutions when the graph is redundant. The bug has to be fixed.

SPARQL is defined for simple entailment. With this decision on how blank nodes behave, the working group can now rework the BGP matching.  The options are outlined in the message 2006OctDec/0140 and either of the variable mapping approaches can be made to work. It's more akin to a formalised version of the last call 1 style and the implementation hint of the complex version.

... and Extensibility

Extensibility comes by replacing the whole of BGP matching, with just a few natural conditions such as all named variables must be bound in a BGP match and there are no accidental clashes of blank nodes.

 

(Vested interest declaration)

OK - ARQ isn't built on SQL; it's a self contained query engine with various extension points - by default, it queries any Jena graph and exploits Jena's direct connection to the storage layer but only for basic graph patterns, not more complex patterns. It could sort out the two types of variables quite easily.

But we also have a Jena SPARQL database engine (called SDB) that is built on SQL and is specifically designed for SPARQL, for large datasets and for named graphs.

SDB fits in as an ARQ query engine and the more it can turn into a single SQL statement the better. All of the query being the ideal case (and hopefully the normal one).  So making the translation to a relational quad store simpler is helpful for SDB.

No comments: