30 November 2006

SPARQL Basic Graph Pattern Matching

I wrote about the SPARQL algebra last time, not with the idea of a regular update about DAWG work but noting a significant issue being considered by the working. I hope that the wider community will comment and contribute; much of the ground work for the algebra is drawn from outside the working group anyway.

This week DAWG made a significant decision in a different, but nearby area, that of basic graph pattern matching (BGP).

Basic Graph Patterns

Basic graph patterns (BGPs) are the building block in SPARQL. They are a sequence of adjacent triple patterns in the query string. A BGP is matched against whatever is being queried and the results of matching feed into the algebra.

The decision is that no change, or rather the "implementation hint" for BGP matching, made more formal, becomes the basis of SPARQL. No queries will be changed by the decision.

The Issue

Having reveal the punch line, what's the joke? The issue is whether blank nodes in BGP matching behave like hidden, named variables or do they behave as a different kind of variable all together. This matters for counting - it does not matter for the logical meaning of a solution.

Example data:

  :a :p 1 .
  :a :p 2 .

Example query pattern:

 { ?x :p [] }

How many answers? Blank nodes are existential variables in RDF; named variables (the regular ?x ones) are universal variables. Queries don't return the binding of an existential; queries can return the binding of a named variable and the bound value of a named variables can be passed to other parts of the query (FILTERs, OPTIONALs etc) via the algebra.

In the absence of a DISTINCT in the query, are the solutions to this pattern:

  • 1 : ?x = :a
  • 2 : ?x = :a , ?x = :a
  • Either 1 or 2

In OWL-DL, existential variables give more expressivity: if there is a disjunction in the ontology, the reasoner can know that something exists, but not need to find an actual value - and it may not be able to find the value anyway - but the reasoner does know there is "something there".

Users and Implementers

For application writers, the main use of blank nodes in queries that I've seen is the convenience of using []

 {
   ?x :something [ :p ?v ; :q ?w ]
 } 

[] saves having to have a named variable and splitting the expression. It would be one more thing to learn if that causes different answers (duplicates) to the pattern using query variables:

 {
   ?x :something ?z .
   ?z :p ?v ; 
      :q ?w .
 } 

especially if the first pattern had been written:

 {
   ?x  :something _:bnode .
   _:bnode :p ?v ; 
           :q ?w . 
 } 

which is the [] example written out with a labelled blank node. This is also why it's BGPs that are the building block, not triple patterns - blank nodes are scoped to BGPs.

For implementers, having two kinds of variable that behave different in matching can make life a harder, especially if they using some existing technology, like an SQL database which has one preferred type (universal-like). Work on SPARQL semantics suggests that there can be one (complicated) SQL statement for one SPARQL query. Nice if it's true because all the work that has gone into SQL database optimizers is reusable. Having two types of variables would need extra SQL to be generated like nested SELECTs, or some post-processing done.

More work for implementers is not good; many open source RDF toolkits are built by people in their own time or by people with other things to do at work. So, it might be a noticeable dampener on toolkit development and hence deployment. Not a good idea to do this without a reasonable need.

DAWG Resolution

Logically, duplicates don't make any difference. DAWG decided there are two solutions for simple entailment matching example above. It's what implementations currently do.

The DAWG process is often test case driven. We use test cases as both a concrete way to define what results we expect and also to make concrete decisions. So once consensus was emerging to have duplicates, the working group decided to approve a test case that captured that. In this case, the test is rdfsemantics-bnode-type-var (the example above is the same idea, just shorter).

Defining BGP Matching for SPARQL

In Last Call 2, the first CR publication and the intermediate publication versions of SPARQL, BGP matching is a complicated definition that allows for different entailment regimes for extensibility within a single entailment framework. SPARQL is defined for simple entailment, and so for anything where there a virtual graph, but it allows for extension to more complicated entailment regimes, like OWL-DL.

The trouble is the matching definition is that it has a bug in it; it allows too many solutions when the graph is redundant. The bug has to be fixed.

SPARQL is defined for simple entailment. With this decision on how blank nodes behave, the working group can now rework the BGP matching.  The options are outlined in the message 2006OctDec/0140 and either of the variable mapping approaches can be made to work. It's more akin to a formalised version of the last call 1 style and the implementation hint of the complex version.

... and Extensibility

Extensibility comes by replacing the whole of BGP matching, with just a few natural conditions such as all named variables must be bound in a BGP match and there are no accidental clashes of blank nodes.

 

(Vested interest declaration)

OK - ARQ isn't built on SQL; it's a self contained query engine with various extension points - by default, it queries any Jena graph and exploits Jena's direct connection to the storage layer but only for basic graph patterns, not more complex patterns. It could sort out the two types of variables quite easily.

But we also have a Jena SPARQL database engine (called SDB) that is built on SQL and is specifically designed for SPARQL, for large datasets and for named graphs.

SDB fits in as an ARQ query engine and the more it can turn into a single SQL statement the better. All of the query being the ideal case (and hopefully the normal one).  So making the translation to a relational quad store simpler is helpful for SDB.

23 November 2006

An Algebra for SPARQL

SPARQL, like Gaul, is dividied into three parts:

  • Basic graph pattern matching
  • Graph expressions
  • Solution modifiers

Basic graph patterns (BGPs) are a block of adjacent triple patterns that match, or don't match, all together. In BGP matching every named variable must have some value for the BGP to have been matched. Filters add restrictions on the values variables can take.

They are also an extension point in SPARQL because you can plug in a different BGP matcher (e.g. a DL reasoner or a wrapper for legacy SQL data), then reuse the other two layers to form complex queries out of the exact matches from BGP matching.

Graph expressions (OPTIONAL, UNION, GRAPH, Groups (things between {}) combine BGPs in various ways to give more complicated patterns. Graph expressions are recursive so you can have patterns within patterns. A SPARQL query has one graph expression in the WHERE clause.

Solution modifiers (Project, DISTINCT, ORDER BY, LIMIT/OFFSET) take the output of matching the query graph pattern and process it in various ways to yield the result set.

Current Situation

The semantics for SPARQL have, so far, been declarative and top-down. A solution to a query passes if it meets all the conditions following from the definitions of each of the SPARQL graph pattern forms and filters. That means, in theory, the whole of the solution is available.

In fact, ARQ builds solutions up by adding the minimum necessary at each subpattern to make the solution match. This is what gives ARQ streaming evaluation which keeps the total memory use down. But the effect is the same, whole solutions are available everywhere in the pattern.

Another way is to have a bottom-up algebra. The relational algebra works this way. Subexpressions are evaluated to give tables (relations) and then these tables combined together to form new tables.

It turns out that these are nearly the same as shown by Richard Cyganiak and Jorge Perez et al (see references below). The class of graph patterns that differ are a certain kind of nested OPTIONAL where a variable is used on the outside, fixed part, and in the inner OPTIONAL, but not in between.

SELECT *
{ :x1 :p ?v .
  OPTIONAL
  { :x3 :q ?w .
    OPTIONAL { :x2 :p ?v }
  }
}

Now that is a rather unusual query. The tricky bit is the use of ?v at the outermost and innermost levels but not in between.The query can be rewritten so as not to nest (note the repeat of :x3 :q ?w).

SELECT *
{ :x1 :p ?v .
  OPTIONAL
    { :x3 :q ?w }
  OPTIONAL 
    { :x3 :q ?w  . :x2 :p ?v }
}

A few references for work in this area:

A SPARQL Algebra

In DAWG, we're considering an algebra for SPARQL, based on the paper Semantics of SPARQL. This will mean a change of results in the case above. I've never seen such a query in the wild, only in artificial test cases.

The change from declarative semantics to a constructional algebra is motivated primarily by the implementers in the working group wanting to apply all the good stuff on relational algebra optimization to SPARQL engines. The complexity of the exact mapping to SQL in the semantics preserving SPARQL-to-SQL is daunting.

Not that simple!

But other things differ as well if a purely relational style is applied (and that's not being proposed currently) because "Semantics of SPARQL" does not consider the difference in treatment of filters (it's focus is on the combination of graph patterns).

Two cases arise: one where filters are in OPTIONALs, and one where filters are nested inside groups.

SELECT * 
{ ?x :p ?v .
  OPTIONAL
  { 
    ?y :q ?w .
    FILTER(?v = 2)
  }
}

The FILTER uses a variable from the fixed part of the OPTIONAL. Under Jorge's semantics the variable isn't in-scope, so the filter evaluation is an error (unbound variable), otherwise known as false; the optional part never matches. This form of query is realistic and does occur for real so the current proposal for DAWG makes this work. Like SQL and LEFT OUTER JOIN with the ON clause, the LeftJoin operation in the SPARQL algebra can take a condition which applies over the whole LeftJoin.

The fixed pattern could be repeated inside the optional part but such repetition for something application might well use is bad, very bad.

Application writers will have to take care about scope and groups: putting FILTERs inside {} changes their scope.

Currently,

{ :x :p ?v . FILTER(?v < 5) }

and

{ :x :p ?v . { FILTER(?v < 5) } }

have the same effect. But they aren't in the algebraic form because { FILTER(?v < 5) } evaluates on the empty table that does not have a ?v in it; evaluation is an error and hence false.

ARQ

There is an implementation of the algebra in ARQ in CVS (not in the current release ARQ 1.4). It's enabled in the command line tool arq.sparql with --engine=ref from the command line and

QueryEngine2.register() ;

in code.

The work-in-progress unfinished editor's working text is available. Check it is actually up to date because that's just work space and isn't a DAWG document. It does not reflect working group decisions - it more a proposal for consideration.

 

Disclaimer: Any views expressed here are mime and not the working group (which at the time of writing is thinking about it but hasn't decided anything yet).

12 November 2006

LARQ = Lucene + ARQ

SPARQL is normally thought of as only querying fixed RDF data. At the core of SPARQL are the building blocks of basic graph patterns, and on top of these there is an algebra to create more complex patterns (OPTIONAL UNION, FILTER, GRAPH).

The key question a basic graph pattern asks is "does this pattern match the graph". The named variables record how the pattern matches.

Not all information needs to be in the raw data. ARQ property functions are a way to let the application add some relationships to be computed at query execution time.

LARQ adds free text search. The real work is done by Lucene. LARQ adds ways to create a Lucene index from RDF data and a property function to perform free text matching in a SPARQL query.

Example: find all the string literals that match '+keyword'

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>

SELECT *
  { ?lit pf:textMatch '+keyword' }

Any simple or complex Lucene query string can be used.

LARQ provides utilities to index string literals. As the literal can be stored as well, a query can find the subjects with some property value matching the free text search.

So to find all the document that have titles matching some free text search:

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
  
SELECT ?doc {
    ?lit pf:textMatch '+text' .
    ?doc ?p ?lit
  }

More details in the ARQ documentation for LARQ

This will be in ARQ 1.5 but is available from ARQ CVS now. Hopefully, this will be useful to users and application writers. Comments and feedback on the design are welcome, especially before the next ARQ release.

13 October 2006

Assignment Property Function

SPARQL is a graph pattern matching language, The matching can be semantic (basic graph patterns are entailed by the data in order to match) or through the algebra that works over the top of basic graph pattern matching.

Complex queries make use of UNION and OPTIONAL, including the idiom with OPTIONAL and !BOUND, in all sorts of creative ways. And sometimes they want the answers to indicate which way the query matched to get a particular solution.

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?label
{
   { ?concept skos:prefLabel ?label }
 UNION
   { ?concept skos:altLabel ?label }
}

Which matched? the prefLabel or altLabel? The application can assign the different to branches to different variables although that might be inconvenient in a larger query if ?label is used elsewhere in the query.

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?label_pref ?label_alt
{
  { ?concept skos:prefLabel ?label_pref }
UNION
  { ?concept skos:altLabel  ?label_alt }
}

ARQ's property functions provide a way of adding different ways to match a triple pattern. Uses include calling out to custom code to perform some calculation, or calling out to use a external index, such as Lucene.

Property functions can assign to variables in the solution - unlike FILTER functions which must not have side effects. Property functions can't change values but they can bind an unbound variable or check an existing binding is compatible.

PREFIX apf:   <java:com.hp.hpl.jena.query.pfunction.library.>
SELECT ?x
{
 ?x apf:assign "Hello World"
}
-----------------
| x             |
=================
| "Hello World" |
-----------------

It works both ways round: { "Hello World" apf:assign ?x } gives the same results.

If subject and object are constants or already bound, apf:assign checks to see if they are the same. If they are, the tripe pattern matches with no change to the solution, otherwise it does not match and that solution is rejected. The "sameness" is Jena's sameValueAs, just like the rest of graph matching. If both variables are unbound, apf:assign complains and the query is broken.

07 May 2006

Parameterized Queries

Sometimes, an application will be making a SPARQL query, using the results from a previous query or using some RDF term found through the other Jena APIs.

SQL has prepared statements - they allow an SQL statement to take a number of parameters. The application fills in the parameters and executes the statement.

One way is to resort to doing this in SPARQL by building a complete, new query string, parsing it and executing it. But it takes a little care to handle all cases like quoting special characters; you can at least use some of the many utilities in ARQ for producing strings such as FmtUtils.stringForResource (it's not in the application API but in the util package currently).

Queries in ARQ can be built programmatically but it is tedious, especially when the documentation hasn't been written yet.

Another way is to use query variables and bind them to initial values that apply to all query solutions. Consider the query:

PREFIX dc <http://purl.org/dc/elements/1.1/>
SELECT ?doc { ?doc dc:title ?title }

It gets documents and their titles.

Executing a query in program might look like:

import com.hp.hpl.jena.query.* ;

Model model = ... ;
String queryString = StringUtils.join("\n",
         new String[]{
     "PREFIX dc <http://purl.org/dc/elements/1.1/>",
     "SELECT ?doc { ?doc dc:title ?title }"
         }) ;
Query query = QueryFactory.create(queryString) ;
QueryExecution qexec =
    QueryExecutionFactory.create(query, model) ;
try {
    ResultSet results = qexec.execSelect() ;
    for ( ; results.hasNext() ; )
    {
       QuerySolution soln = results.nextSolution() ;
       Literal l = soln.getLiteral("doc") ;
    }
} finally { qexec.close() ; }

Suppose the application knows the title it's interesting in - can it use this to get the document?

The value of ?title made a parameter to the query and fixed by an initial binding. All query solutions will be restricted to patterns matches where ?title is that RDF term.

QuerySolutionMap initialSettings = new QuerySolutionMap() ;
initialSettings.add("title", node) ;

and this is passed to the factory that creates QueryExecution's:

QueryExecution qexec = 
    QueryExecutionFactory.create(query,
                                 model,
                                 initialSettings) ;

It doesn't matter if the node is a literal, a resource with URI or a blank node. It becomes a fixed value in the query, even a blank node, because it's not part of the SPARQL syntax, it's a fixed part of every solution.

This gives named parameters to queries enabling something like SQL prepared statements except with named parameters not positional ones.

This can make a complex application easier to structure and clearer to read. It's better than bashing strings together, which is error prone, inflexible, and does not lead to clear code.

13 April 2006

From RDQL to SPARQL

SPARQL is now (April 2006) a W3C Candidate Recommendation which means it is stable enough for wide spread implementation.  Actually, there are quite a few implementations already (SPARQL implementations page on ESW wiki).

SPARQL is defined by three documents:

and there are tutorials like this one.

RDQL predates SPARQL - in fact, RDQL design predates the current RDF specifications and some of the design decisions in RDQL are a reflection of that. The biggest of these is that RDF didn't have any datatyping so RDQL handles tests on, say, integers without checking the datatype (if it looks like an integer, it can be tested as integer).

SPARQL has all the features of RDQL and more:

  • ability to add optional information to query results
  • disjunction of graph patterns
  • more expression testing (date-time support, for example)
  • named graphs
  • sorting

but, above all, it is more tightly specified so queries in one implementation should behave the same in all other implementations.

ARQ - A Query Engine for Jena

In parallel with the developing the SPARQL specification, I have been developing a new query subsystem for Jena called ARQ. ARQ is now part of the Jena download.

ARQ builds on top of the existing Jena query support for matching of basic graph patterns (BGPs are the building block in SPARQL). ARQ can execute SPARQL and RDQL as well as an extended form of SPARQL. It has several extension points, such as Property functions. The ARQ query engine works with any Jena Graph or Model.

Converting RDQL code to SPARQL code

The functionality of RDQL is a subset of SPARQL so it's not hard to convert RDQL queries to SPARQL. What needs to be done is convert the triple syntax and convert any constraints.

Syntax

SPARQL syntax uses a Turtle-like syntax which is familiar to anyone knowing N3. Namespaces go at the start of the query, not after like USING. There are no () around triple patterns; instaead there is a "." (a single dot) between triple patterns. An RDQL only ever has one graph pattern, in SPARQL, blocks of triple patterns are delimited by {}

You can even use the command line arq.qparse to read in an RDQL query and write out the SPARQL query but it's a rough approximation you'll need to check and it may not be completely legal SPARQL.

Constraints

The constraints need the most care because SPARQL uses RDF datatyping and RDQL doesn't. Some common areas are:

  • regular expressions
  • string equality and numeric equality

Regular expressions

A SPARQL regular expression looks like:

regex(expr, "pattern")
regex(expr, "pattern", "i")

The catch is that the expr must be a literal; it can't be a URI. (Well - it can, but it will never match!). If you want to perform a regular expression match on a URI, use the str() built-in to get the string form of the URI.

regex(str(?uri), "^http://example/ns#")

Equality

RDQL has = for numeric equality and eq for string equality. A number in RDQL was anything that can be parsed an a number, whether it had a datatype or not (or even the wrong datatype). Likewise, anything could be treated as a string (like URIs in regular expressions).

SPARQL has = which is taken from XQuery/XPath Functions and Operators. It decides whether that is numeric equals, string equals or URI-equals based on the kind of arguments it is given.

API Changes

The ARQ API is in the package com.hp.hpl.jena.query. The RDQL API is deprecated, starting with Jena 2.4. The new API is similar in style to the old one for SELECT, with iteration over the rows of the results (javadoc). Differences include the widespread use of factories, naming consistent with the SPARQL specifications, and different exec operations for the different kinds of SPARQL query. QueryExecution objects should be properly closed.

One change is that to get the triples that matched a query, instead of asking the binding for the triples that were used in the matching, the application should now make a CONSTRUCT query.

Experimenting with SPARQL

There is a set of command line utilities to try out SPARQL queries from the command line.

A nice graphical interface is twinkle by Leigh Dodds.

There is also an implementation of the SPARQL protocol using ARQ, project Joseki, and a demo site at http://www.sparql.org where you can validate SPARQL queries and try them out.

Questions?

Send question and comments about ARQ to jena-dev.

Send general questions and comments about SPARQL to the W3C list sparql-dev (archive).

If you have experiences converting from RDQL to SPARQL, then let me know and I'll compile a list of common issues.

16 February 2006

Property Functions in ARQ

These are properties that are calculated by some custom code, and not done by the usual matching. There are two provided now: applications are free to provide application-specifc ones.

  • list:member - access the members of an RDF list.
  • rdfs:member - access the members of rdf:Bag, rdf:Seq and rdf:Alt structures

Full Extension documentation

Normally, the unit of matching in ARQ is the basic graph pattern (a sequence of triple patterns). These sets of triple patterns are dispatched to Jena for matching by Jena's graph-level query handler. Each kind of storage provides the appropriate query handler. For example, the database fastpath is a translation of a set of triple patterns into a single SQL query involving joins.

There is also a default implementation that works by using plain graph find (a triple with possible wildcards) so a new storage system does not need to provide it's own query handler until it wants to exploit some feature of the storage.

If a function property is encountered, then it is internally treated as a call to be an extension. There is a registry of function properties to implementing code.

  # Find all the members of a list (RDF collection)
  PREFIX  list:   <http://www.jena.hpl.hp.com/ARQ/list#>
  SELECT ?member
  { ?x :p ?list .
    ?list list:member ?member .
  }

The functionality of list:member is handled by a class in the extension library so this query is treated much like the ARQ extension:

  # Find all the members of a list (RDF collection)
  PREFIX  ext: <java:com.hp.hpl.jena.query.extension.library.>
  SELECT ?member
  { ?x :p ?list .
    EXT ext:list(?list, ?x)
  }

where ext:list is a function that bind its arguments (unlike a FILTER function). The property function form is legal SPARQL.

So, this mechanism shows that collection access can be done in SPARQL without resorting to handling told blank nodes.

cwm (which is a forward chaining rules engine) and Euler (which is a backward-chaining rules engine) already provide this style of access. Their property is - the subject and object meanings are the other way.

ARQ provides list:member to be like rdfs:member.

— Andy

11 February 2006

Progress with Jena.Net

[GNU] wrote about building Jena for Mono, using IKVM to compile the jar files into .Net IL.

This approach means that the same source code is used for both the Java world and the .Net world, making future improvements visible to both from a single source tree.

I tried doing it for .Net on Windows with C# Express and IKVM-0.24.0.1.

Summary

SPARQL queries work.

Using Jena from C# works for small scale cases - lots of checking to do but it should be a matter of verifying everything from the dependent libraries works properly.

Some things aren't working but there are a few hotspots of trouble that, when fixed, mean that the majority (may be all) of the Jena test suite will run. As it is at the moment, quite a lot can be done including using the ARQ command line programs.

The Conversion

The IKVM bytecode conversion route is my preferred choice because it means one source codebase, not two. When I tried this before, I got an early version of ARQ up and running. But it wasn't complete; the first big block was the lack of java.nio.charset support in GNU Classpath. Jena and ARQ have lots of tests of internationalization and charsets. That alone was enough to make it not worthwhile exploring further at the time.

Now (Feb 2006) GNU Classpath coverage is much better. See the coverage of GNU Classpath compared to Java 1.4.

The process is simple: run ikvmc on all the jars to get a library. Ignore all the warnings about missing stuff. It's surprising what various libraries actually reference - Log4j has references to a lot of log record transports. At the simplest:

ikvmc *.jar -out:XXX.dll -target:library

I've now broken this in two DLLs: jena-libs.dll (all the jars except the jena ones) and jena.dll (jena.jar, jenatest.jar, arq.,jar, iri.jar) but that is just because I keep building the DDLs while testing.

It takes a minute or so (less time than building jena.jar itself). The result is two DLLs of about totaling 16M - the whole assembly is about 23M including the three IKVM DLLs. Not small - but it works and it is simple to do.

What's been tried: in-memory graphs, reading and writing turtle files (but XML types literals broken) and SPARQL queries.

Jena bugs: (this is relative to CVS and so after Jena 2.3)

  • file:///c:/absolute was incorrectly turned into a windows filename. Worked OK with Sun's Java but not IKVM. Fixed.

GNU Classpath bugs:

  • InputStreamReader(InputStream, Charset) is broken although the other two constructors that allow the charset conversion to be explicitly controlled do seem to work. This can be worked around in Jena. Bugzilla Entry.
  • Zero-width lookbehind regexs aren't implemented. They are used by JJC's new IRI code. Bugzilla Entry.

ARQ Test Suite

As a rough comparision, I ran the ARQ test suite:

BEfore any fixes, with Java 5 JVM:
Tests run: 1119, Failures: 0, Errors: 0

Using ikvm as the JVM:
Tests run: 1119, Failures: 32, Errors: 17

Converting to .Net:
Tests run: 1119, Failures: 32, Errors: 59

[20 Feb: JJC recoded around the lack of lookbehind and now its down to 4 failures of which 3 are because GNUClasspath is just different to Sun's runtime]

Next

Now it's work through the broken tests in the ARQ test suite to determine what's the cause as time permits.

IronPython to Jena?

Updates

  • Calling Jena from VB.Net works
  • The GNU Classpath/InputStreamReader bug has been fixed
  • The GNU Classpath/lookbehind bug had already been fixed but very recently so IKVM hasn't picked it up yet.

Now 4 failures, 3 of which are corner case differences of URI resolution in unusual cases.