23 November 2006

An Algebra for SPARQL

SPARQL, like Gaul, is dividied into three parts:

  • Basic graph pattern matching
  • Graph expressions
  • Solution modifiers

Basic graph patterns (BGPs) are a block of adjacent triple patterns that match, or don't match, all together. In BGP matching every named variable must have some value for the BGP to have been matched. Filters add restrictions on the values variables can take.

They are also an extension point in SPARQL because you can plug in a different BGP matcher (e.g. a DL reasoner or a wrapper for legacy SQL data), then reuse the other two layers to form complex queries out of the exact matches from BGP matching.

Graph expressions (OPTIONAL, UNION, GRAPH, Groups (things between {}) combine BGPs in various ways to give more complicated patterns. Graph expressions are recursive so you can have patterns within patterns. A SPARQL query has one graph expression in the WHERE clause.

Solution modifiers (Project, DISTINCT, ORDER BY, LIMIT/OFFSET) take the output of matching the query graph pattern and process it in various ways to yield the result set.

Current Situation

The semantics for SPARQL have, so far, been declarative and top-down. A solution to a query passes if it meets all the conditions following from the definitions of each of the SPARQL graph pattern forms and filters. That means, in theory, the whole of the solution is available.

In fact, ARQ builds solutions up by adding the minimum necessary at each subpattern to make the solution match. This is what gives ARQ streaming evaluation which keeps the total memory use down. But the effect is the same, whole solutions are available everywhere in the pattern.

Another way is to have a bottom-up algebra. The relational algebra works this way. Subexpressions are evaluated to give tables (relations) and then these tables combined together to form new tables.

It turns out that these are nearly the same as shown by Richard Cyganiak and Jorge Perez et al (see references below). The class of graph patterns that differ are a certain kind of nested OPTIONAL where a variable is used on the outside, fixed part, and in the inner OPTIONAL, but not in between.

SELECT *
{ :x1 :p ?v .
  OPTIONAL
  { :x3 :q ?w .
    OPTIONAL { :x2 :p ?v }
  }
}

Now that is a rather unusual query. The tricky bit is the use of ?v at the outermost and innermost levels but not in between.The query can be rewritten so as not to nest (note the repeat of :x3 :q ?w).

SELECT *
{ :x1 :p ?v .
  OPTIONAL
    { :x3 :q ?w }
  OPTIONAL 
    { :x3 :q ?w  . :x2 :p ?v }
}

A few references for work in this area:

A SPARQL Algebra

In DAWG, we're considering an algebra for SPARQL, based on the paper Semantics of SPARQL. This will mean a change of results in the case above. I've never seen such a query in the wild, only in artificial test cases.

The change from declarative semantics to a constructional algebra is motivated primarily by the implementers in the working group wanting to apply all the good stuff on relational algebra optimization to SPARQL engines. The complexity of the exact mapping to SQL in the semantics preserving SPARQL-to-SQL is daunting.

Not that simple!

But other things differ as well if a purely relational style is applied (and that's not being proposed currently) because "Semantics of SPARQL" does not consider the difference in treatment of filters (it's focus is on the combination of graph patterns).

Two cases arise: one where filters are in OPTIONALs, and one where filters are nested inside groups.

SELECT * 
{ ?x :p ?v .
  OPTIONAL
  { 
    ?y :q ?w .
    FILTER(?v = 2)
  }
}

The FILTER uses a variable from the fixed part of the OPTIONAL. Under Jorge's semantics the variable isn't in-scope, so the filter evaluation is an error (unbound variable), otherwise known as false; the optional part never matches. This form of query is realistic and does occur for real so the current proposal for DAWG makes this work. Like SQL and LEFT OUTER JOIN with the ON clause, the LeftJoin operation in the SPARQL algebra can take a condition which applies over the whole LeftJoin.

The fixed pattern could be repeated inside the optional part but such repetition for something application might well use is bad, very bad.

Application writers will have to take care about scope and groups: putting FILTERs inside {} changes their scope.

Currently,

{ :x :p ?v . FILTER(?v < 5) }

and

{ :x :p ?v . { FILTER(?v < 5) } }

have the same effect. But they aren't in the algebraic form because { FILTER(?v < 5) } evaluates on the empty table that does not have a ?v in it; evaluation is an error and hence false.

ARQ

There is an implementation of the algebra in ARQ in CVS (not in the current release ARQ 1.4). It's enabled in the command line tool arq.sparql with --engine=ref from the command line and

QueryEngine2.register() ;

in code.

The work-in-progress unfinished editor's working text is available. Check it is actually up to date because that's just work space and isn't a DAWG document. It does not reflect working group decisions - it more a proposal for consideration.

 

Disclaimer: Any views expressed here are mime and not the working group (which at the time of writing is thinking about it but hasn't decided anything yet).

1 comment:

Anonymous said...

Interesting that SPARQL semantics
considers the path of an algebra as
for TMQL (Topic Maps query language)
we are doing the same

The situation is for us much easier,
though, as TMQL is not entailment
neutral, has closed-world assumption,...