30 November 2006

SPARQL Basic Graph Pattern Matching

I wrote about the SPARQL algebra last time, not with the idea of a regular update about DAWG work but noting a significant issue being considered by the working. I hope that the wider community will comment and contribute; much of the ground work for the algebra is drawn from outside the working group anyway.

This week DAWG made a significant decision in a different, but nearby area, that of basic graph pattern matching (BGP).

Basic Graph Patterns

Basic graph patterns (BGPs) are the building block in SPARQL. They are a sequence of adjacent triple patterns in the query string. A BGP is matched against whatever is being queried and the results of matching feed into the algebra.

The decision is that no change, or rather the "implementation hint" for BGP matching, made more formal, becomes the basis of SPARQL. No queries will be changed by the decision.

The Issue

Having reveal the punch line, what's the joke? The issue is whether blank nodes in BGP matching behave like hidden, named variables or do they behave as a different kind of variable all together. This matters for counting - it does not matter for the logical meaning of a solution.

Example data:

  :a :p 1 .
  :a :p 2 .

Example query pattern:

 { ?x :p [] }

How many answers? Blank nodes are existential variables in RDF; named variables (the regular ?x ones) are universal variables. Queries don't return the binding of an existential; queries can return the binding of a named variable and the bound value of a named variables can be passed to other parts of the query (FILTERs, OPTIONALs etc) via the algebra.

In the absence of a DISTINCT in the query, are the solutions to this pattern:

  • 1 : ?x = :a
  • 2 : ?x = :a , ?x = :a
  • Either 1 or 2

In OWL-DL, existential variables give more expressivity: if there is a disjunction in the ontology, the reasoner can know that something exists, but not need to find an actual value - and it may not be able to find the value anyway - but the reasoner does know there is "something there".

Users and Implementers

For application writers, the main use of blank nodes in queries that I've seen is the convenience of using []

 {
   ?x :something [ :p ?v ; :q ?w ]
 } 

[] saves having to have a named variable and splitting the expression. It would be one more thing to learn if that causes different answers (duplicates) to the pattern using query variables:

 {
   ?x :something ?z .
   ?z :p ?v ; 
      :q ?w .
 } 

especially if the first pattern had been written:

 {
   ?x  :something _:bnode .
   _:bnode :p ?v ; 
           :q ?w . 
 } 

which is the [] example written out with a labelled blank node. This is also why it's BGPs that are the building block, not triple patterns - blank nodes are scoped to BGPs.

For implementers, having two kinds of variable that behave different in matching can make life a harder, especially if they using some existing technology, like an SQL database which has one preferred type (universal-like). Work on SPARQL semantics suggests that there can be one (complicated) SQL statement for one SPARQL query. Nice if it's true because all the work that has gone into SQL database optimizers is reusable. Having two types of variables would need extra SQL to be generated like nested SELECTs, or some post-processing done.

More work for implementers is not good; many open source RDF toolkits are built by people in their own time or by people with other things to do at work. So, it might be a noticeable dampener on toolkit development and hence deployment. Not a good idea to do this without a reasonable need.

DAWG Resolution

Logically, duplicates don't make any difference. DAWG decided there are two solutions for simple entailment matching example above. It's what implementations currently do.

The DAWG process is often test case driven. We use test cases as both a concrete way to define what results we expect and also to make concrete decisions. So once consensus was emerging to have duplicates, the working group decided to approve a test case that captured that. In this case, the test is rdfsemantics-bnode-type-var (the example above is the same idea, just shorter).

Defining BGP Matching for SPARQL

In Last Call 2, the first CR publication and the intermediate publication versions of SPARQL, BGP matching is a complicated definition that allows for different entailment regimes for extensibility within a single entailment framework. SPARQL is defined for simple entailment, and so for anything where there a virtual graph, but it allows for extension to more complicated entailment regimes, like OWL-DL.

The trouble is the matching definition is that it has a bug in it; it allows too many solutions when the graph is redundant. The bug has to be fixed.

SPARQL is defined for simple entailment. With this decision on how blank nodes behave, the working group can now rework the BGP matching.  The options are outlined in the message 2006OctDec/0140 and either of the variable mapping approaches can be made to work. It's more akin to a formalised version of the last call 1 style and the implementation hint of the complex version.

... and Extensibility

Extensibility comes by replacing the whole of BGP matching, with just a few natural conditions such as all named variables must be bound in a BGP match and there are no accidental clashes of blank nodes.

 

(Vested interest declaration)

OK - ARQ isn't built on SQL; it's a self contained query engine with various extension points - by default, it queries any Jena graph and exploits Jena's direct connection to the storage layer but only for basic graph patterns, not more complex patterns. It could sort out the two types of variables quite easily.

But we also have a Jena SPARQL database engine (called SDB) that is built on SQL and is specifically designed for SPARQL, for large datasets and for named graphs.

SDB fits in as an ARQ query engine and the more it can turn into a single SQL statement the better. All of the query being the ideal case (and hopefully the normal one).  So making the translation to a relational quad store simpler is helpful for SDB.

23 November 2006

An Algebra for SPARQL

SPARQL, like Gaul, is dividied into three parts:

  • Basic graph pattern matching
  • Graph expressions
  • Solution modifiers

Basic graph patterns (BGPs) are a block of adjacent triple patterns that match, or don't match, all together. In BGP matching every named variable must have some value for the BGP to have been matched. Filters add restrictions on the values variables can take.

They are also an extension point in SPARQL because you can plug in a different BGP matcher (e.g. a DL reasoner or a wrapper for legacy SQL data), then reuse the other two layers to form complex queries out of the exact matches from BGP matching.

Graph expressions (OPTIONAL, UNION, GRAPH, Groups (things between {}) combine BGPs in various ways to give more complicated patterns. Graph expressions are recursive so you can have patterns within patterns. A SPARQL query has one graph expression in the WHERE clause.

Solution modifiers (Project, DISTINCT, ORDER BY, LIMIT/OFFSET) take the output of matching the query graph pattern and process it in various ways to yield the result set.

Current Situation

The semantics for SPARQL have, so far, been declarative and top-down. A solution to a query passes if it meets all the conditions following from the definitions of each of the SPARQL graph pattern forms and filters. That means, in theory, the whole of the solution is available.

In fact, ARQ builds solutions up by adding the minimum necessary at each subpattern to make the solution match. This is what gives ARQ streaming evaluation which keeps the total memory use down. But the effect is the same, whole solutions are available everywhere in the pattern.

Another way is to have a bottom-up algebra. The relational algebra works this way. Subexpressions are evaluated to give tables (relations) and then these tables combined together to form new tables.

It turns out that these are nearly the same as shown by Richard Cyganiak and Jorge Perez et al (see references below). The class of graph patterns that differ are a certain kind of nested OPTIONAL where a variable is used on the outside, fixed part, and in the inner OPTIONAL, but not in between.

SELECT *
{ :x1 :p ?v .
  OPTIONAL
  { :x3 :q ?w .
    OPTIONAL { :x2 :p ?v }
  }
}

Now that is a rather unusual query. The tricky bit is the use of ?v at the outermost and innermost levels but not in between.The query can be rewritten so as not to nest (note the repeat of :x3 :q ?w).

SELECT *
{ :x1 :p ?v .
  OPTIONAL
    { :x3 :q ?w }
  OPTIONAL 
    { :x3 :q ?w  . :x2 :p ?v }
}

A few references for work in this area:

A SPARQL Algebra

In DAWG, we're considering an algebra for SPARQL, based on the paper Semantics of SPARQL. This will mean a change of results in the case above. I've never seen such a query in the wild, only in artificial test cases.

The change from declarative semantics to a constructional algebra is motivated primarily by the implementers in the working group wanting to apply all the good stuff on relational algebra optimization to SPARQL engines. The complexity of the exact mapping to SQL in the semantics preserving SPARQL-to-SQL is daunting.

Not that simple!

But other things differ as well if a purely relational style is applied (and that's not being proposed currently) because "Semantics of SPARQL" does not consider the difference in treatment of filters (it's focus is on the combination of graph patterns).

Two cases arise: one where filters are in OPTIONALs, and one where filters are nested inside groups.

SELECT * 
{ ?x :p ?v .
  OPTIONAL
  { 
    ?y :q ?w .
    FILTER(?v = 2)
  }
}

The FILTER uses a variable from the fixed part of the OPTIONAL. Under Jorge's semantics the variable isn't in-scope, so the filter evaluation is an error (unbound variable), otherwise known as false; the optional part never matches. This form of query is realistic and does occur for real so the current proposal for DAWG makes this work. Like SQL and LEFT OUTER JOIN with the ON clause, the LeftJoin operation in the SPARQL algebra can take a condition which applies over the whole LeftJoin.

The fixed pattern could be repeated inside the optional part but such repetition for something application might well use is bad, very bad.

Application writers will have to take care about scope and groups: putting FILTERs inside {} changes their scope.

Currently,

{ :x :p ?v . FILTER(?v < 5) }

and

{ :x :p ?v . { FILTER(?v < 5) } }

have the same effect. But they aren't in the algebraic form because { FILTER(?v < 5) } evaluates on the empty table that does not have a ?v in it; evaluation is an error and hence false.

ARQ

There is an implementation of the algebra in ARQ in CVS (not in the current release ARQ 1.4). It's enabled in the command line tool arq.sparql with --engine=ref from the command line and

QueryEngine2.register() ;

in code.

The work-in-progress unfinished editor's working text is available. Check it is actually up to date because that's just work space and isn't a DAWG document. It does not reflect working group decisions - it more a proposal for consideration.

 

Disclaimer: Any views expressed here are mime and not the working group (which at the time of writing is thinking about it but hasn't decided anything yet).

12 November 2006

LARQ = Lucene + ARQ

SPARQL is normally thought of as only querying fixed RDF data. At the core of SPARQL are the building blocks of basic graph patterns, and on top of these there is an algebra to create more complex patterns (OPTIONAL UNION, FILTER, GRAPH).

The key question a basic graph pattern asks is "does this pattern match the graph". The named variables record how the pattern matches.

Not all information needs to be in the raw data. ARQ property functions are a way to let the application add some relationships to be computed at query execution time.

LARQ adds free text search. The real work is done by Lucene. LARQ adds ways to create a Lucene index from RDF data and a property function to perform free text matching in a SPARQL query.

Example: find all the string literals that match '+keyword'

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>

SELECT *
  { ?lit pf:textMatch '+keyword' }

Any simple or complex Lucene query string can be used.

LARQ provides utilities to index string literals. As the literal can be stored as well, a query can find the subjects with some property value matching the free text search.

So to find all the document that have titles matching some free text search:

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
  
SELECT ?doc {
    ?lit pf:textMatch '+text' .
    ?doc ?p ?lit
  }

More details in the ARQ documentation for LARQ

This will be in ARQ 1.5 but is available from ARQ CVS now. Hopefully, this will be useful to users and application writers. Comments and feedback on the design are welcome, especially before the next ARQ release.