23 November 2006

An Algebra for SPARQL

SPARQL, like Gaul, is dividied into three parts:

  • Basic graph pattern matching
  • Graph expressions
  • Solution modifiers

Basic graph patterns (BGPs) are a block of adjacent triple patterns that match, or don't match, all together. In BGP matching every named variable must have some value for the BGP to have been matched. Filters add restrictions on the values variables can take.

They are also an extension point in SPARQL because you can plug in a different BGP matcher (e.g. a DL reasoner or a wrapper for legacy SQL data), then reuse the other two layers to form complex queries out of the exact matches from BGP matching.

Graph expressions (OPTIONAL, UNION, GRAPH, Groups (things between {}) combine BGPs in various ways to give more complicated patterns. Graph expressions are recursive so you can have patterns within patterns. A SPARQL query has one graph expression in the WHERE clause.

Solution modifiers (Project, DISTINCT, ORDER BY, LIMIT/OFFSET) take the output of matching the query graph pattern and process it in various ways to yield the result set.

Current Situation

The semantics for SPARQL have, so far, been declarative and top-down. A solution to a query passes if it meets all the conditions following from the definitions of each of the SPARQL graph pattern forms and filters. That means, in theory, the whole of the solution is available.

In fact, ARQ builds solutions up by adding the minimum necessary at each subpattern to make the solution match. This is what gives ARQ streaming evaluation which keeps the total memory use down. But the effect is the same, whole solutions are available everywhere in the pattern.

Another way is to have a bottom-up algebra. The relational algebra works this way. Subexpressions are evaluated to give tables (relations) and then these tables combined together to form new tables.

It turns out that these are nearly the same as shown by Richard Cyganiak and Jorge Perez et al (see references below). The class of graph patterns that differ are a certain kind of nested OPTIONAL where a variable is used on the outside, fixed part, and in the inner OPTIONAL, but not in between.

SELECT *
{ :x1 :p ?v .
  OPTIONAL
  { :x3 :q ?w .
    OPTIONAL { :x2 :p ?v }
  }
}

Now that is a rather unusual query. The tricky bit is the use of ?v at the outermost and innermost levels but not in between.The query can be rewritten so as not to nest (note the repeat of :x3 :q ?w).

SELECT *
{ :x1 :p ?v .
  OPTIONAL
    { :x3 :q ?w }
  OPTIONAL 
    { :x3 :q ?w  . :x2 :p ?v }
}

A few references for work in this area:

A SPARQL Algebra

In DAWG, we're considering an algebra for SPARQL, based on the paper Semantics of SPARQL. This will mean a change of results in the case above. I've never seen such a query in the wild, only in artificial test cases.

The change from declarative semantics to a constructional algebra is motivated primarily by the implementers in the working group wanting to apply all the good stuff on relational algebra optimization to SPARQL engines. The complexity of the exact mapping to SQL in the semantics preserving SPARQL-to-SQL is daunting.

Not that simple!

But other things differ as well if a purely relational style is applied (and that's not being proposed currently) because "Semantics of SPARQL" does not consider the difference in treatment of filters (it's focus is on the combination of graph patterns).

Two cases arise: one where filters are in OPTIONALs, and one where filters are nested inside groups.

SELECT * 
{ ?x :p ?v .
  OPTIONAL
  { 
    ?y :q ?w .
    FILTER(?v = 2)
  }
}

The FILTER uses a variable from the fixed part of the OPTIONAL. Under Jorge's semantics the variable isn't in-scope, so the filter evaluation is an error (unbound variable), otherwise known as false; the optional part never matches. This form of query is realistic and does occur for real so the current proposal for DAWG makes this work. Like SQL and LEFT OUTER JOIN with the ON clause, the LeftJoin operation in the SPARQL algebra can take a condition which applies over the whole LeftJoin.

The fixed pattern could be repeated inside the optional part but such repetition for something application might well use is bad, very bad.

Application writers will have to take care about scope and groups: putting FILTERs inside {} changes their scope.

Currently,

{ :x :p ?v . FILTER(?v < 5) }

and

{ :x :p ?v . { FILTER(?v < 5) } }

have the same effect. But they aren't in the algebraic form because { FILTER(?v < 5) } evaluates on the empty table that does not have a ?v in it; evaluation is an error and hence false.

ARQ

There is an implementation of the algebra in ARQ in CVS (not in the current release ARQ 1.4). It's enabled in the command line tool arq.sparql with --engine=ref from the command line and

QueryEngine2.register() ;

in code.

The work-in-progress unfinished editor's working text is available. Check it is actually up to date because that's just work space and isn't a DAWG document. It does not reflect working group decisions - it more a proposal for consideration.

 

Disclaimer: Any views expressed here are mime and not the working group (which at the time of writing is thinking about it but hasn't decided anything yet).

12 November 2006

LARQ = Lucene + ARQ

SPARQL is normally thought of as only querying fixed RDF data. At the core of SPARQL are the building blocks of basic graph patterns, and on top of these there is an algebra to create more complex patterns (OPTIONAL UNION, FILTER, GRAPH).

The key question a basic graph pattern asks is "does this pattern match the graph". The named variables record how the pattern matches.

Not all information needs to be in the raw data. ARQ property functions are a way to let the application add some relationships to be computed at query execution time.

LARQ adds free text search. The real work is done by Lucene. LARQ adds ways to create a Lucene index from RDF data and a property function to perform free text matching in a SPARQL query.

Example: find all the string literals that match '+keyword'

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>

SELECT *
  { ?lit pf:textMatch '+keyword' }

Any simple or complex Lucene query string can be used.

LARQ provides utilities to index string literals. As the literal can be stored as well, a query can find the subjects with some property value matching the free text search.

So to find all the document that have titles matching some free text search:

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
  
SELECT ?doc {
    ?lit pf:textMatch '+text' .
    ?doc ?p ?lit
  }

More details in the ARQ documentation for LARQ

This will be in ARQ 1.5 but is available from ARQ CVS now. Hopefully, this will be useful to users and application writers. Comments and feedback on the design are welcome, especially before the next ARQ release.