02 February 2007

ARQ : what next?

In January, I got round to releasing a new version of ARQ and also doing a first full release of SPARQL version of Joseki. ARQ version 1.5 had a problem with a performance hit for certain database usages (boo, hiss) but this is fixed in version 1.5.1.

A key feature of this version of ARQ is support for free text search. The core indexing is done by Lucene 2. Searches bind matching resources to a variable and are done using a property function. This means they can be fast because it directly exploits the indexing capabilities of Lucene.

Example:

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>
  SELECT ?doc {
    ?lit pf:textMatch '+text' .
    ?doc ?p ?lit
  }

What next? This release matches the published DAWG working draft but the working group is considering a SPARQL algebra the formalization of the semantic of SPARQL to be based on some like the relational algebra.

This version of ARQ has a number of query engines:

  • QueryEngine is the one used normally.
  • QueryEngineHTTP is a SPARQL protocol client for remote query.
  • QueryEngineRef is a direct, simple implementation of the SPARQL algebra.
  • QueryEngineQuad is a modified version of the reference query engine that compiles SPARQL to quads, not triples.

QueryEngineRef compiles a SPARQL query into the SPARQL algebra and it also provides a very simple evaluation of that algebra expression. The evaluation can be slow because it calculates all sub-expressions, then joins/left-joins/unions them but it is written to be correct, rather than fast. It is much better to extend the reference engine and convert the algebra expression into something cleverer.

If you use the --engine=ref argument to arq.sparql, it will use the reference engine. To see the algebra expression, use arq.qparse with arguments --engine=ref --print=op.

The original query engine, which is still the default one, uses a substitution algorithm to evaluate a query. This will exploit the indexes that Jena maintains for all graphs by replacing any variables already bound to some value with their current value and so giving more information to the basic graph pattern matcher. This is done in a steaming fashion which is why this query engine only takes a constant amount of memory regardless of the data size.

The next version of ARQ will combine the two approaches in a new query engine - where possible (and, for real world, queries that means nearly always), it will use the streaming, indexed approach to execution and only resort to a potentially more memory-intensive approach for parts of the query that really need it. The good news is that while it might sound like writing a whole new subsystem from scratch, it isn't. The SPARQL algebra compiler is already part of the reference engine and new query engine extends that; in addition, much of the code of the streaming engine is unchanged and it's just the conversion from the algebra to the iterator structure that needs to be written.

It's also a chance to go back and clean up some code that has been around for a long time and, with hindsight, can be structured better. My programming style has changed over the time ARQ has been in development as I try different patterns and designs then learn what works and what doesn't (makes me get frustrated at some of the design of Java as well). Some old-style code does the right thing; it just does it in a way I would not choose to do it now. Don't get that chance very often.

3 comments:

Martynas said...

I'm just wondering, what's the point having property functions, when there are filter functions already? Is that not enough? PFs look a bit confusing, I would.

Martynas said...

say.

AndyS said...

Property functions can bind variables and can calculate terms that are not in the base data, A filter can only accept or reject a solution - it can't set a variable in a solution.