12 February 2007

Jena SDB

SDB was released for the first time last week. While this is the first alpha release, it's actually at least the second internal query architecture and second loader architecture (Damian did the work to get the loader working fast).

SPARQL now has a formal algebra. ARQ is used to turn the SPARQL query syntax into a algebra expression; SDB takes over and compiles it first to a relational algebra(ish) structure then generates SQL syntax. Now there is a SPARQL algebra, this is all quite a bit simpler for SDB which is why this is the second design for query generation; much of the work now comes naturally from ARQ.

Patterns

At the moment, SDB only translates basic graph patterns, join and leftjoin expressions to SQL and leaves anything else to ARQ to stitch back together. For the current patterns, there are two tricky cases to consider:

  • SPARQL join relationships aren't exactly SQL's because they may involve unbound variables.
  • Multiple OPTIONALs may need to be COALESCEd.

For the first, sometimes the join needs to involve more than equality relationships like "if col1 = null or col1 = col2", which is a bit of scope tracking, and for the second, if a variable can be bound in two or more OPTIONALs , you have to take the first binding. The scope tracking is needed anyway.

Over time, more and more of the SPARQL algebra expression will be translated to SQL.

Layouts

SDB supports a number of databases layouts in two main classes:
  • Type 1 layouts have a single triple table, with all information encoded into the columns of the table.
  • Type 2 layouts use a triple table with a separate RDF terms table

Type 1 layouts are good at supporting fine-grained API calls where the need to join to get the actual RDF terms is removed because they are encoded into the triple tables columns. Jena's existing database layer, RDB, is an example of this. When the move was made from Jena1 to Jena2, the DB layout changed to a type 1 layout and it went faster. The two type 1 layouts supported are Jena's existing RDB layout and a debug version which encodes RDF terms in SPARQL-syntax directly into the triples table so you can simple read the table with an SQL query browser.

Type 2 layouts, where the triples table has pointers to a nodes table, are better as SPARQL queries gets larger. Databases prefer small, fixed width columns to variable string comparisons. SDB has two variations, one using 4 bytes integer indexes and one using 8 byte hashes. The hash form means that hash of query constants can be calculated and don't have to be looked up in the SQL.

It seemed that the hash form would be better all round.  But it isn't - loading was 2x slower (sometimes worse) despite the fact that RDF terms don't have to be inserted into the nodes table first to get their auto-allocated sequence id. Databases we have tried are significant slower indexing 8 byte quantities than 4 byte quantities and this dominates the load performance.

Next

There are three directions to go:

  1. Inference support
  2. Application-specific layout control
  3. Filters support

(1) and (2) are linked by the fact it is looking at a query and deciding, for certain known predicates and part-patterns, that different database tables should be used instead.  See Kevin's work on property tables which uses the approach to put some higher level understanding of the data back into the RDF store. (3) is "just" a matter of doing it.

And if you have some feature request, do suggest it.

02 February 2007

ARQ : what next?

In January, I got round to releasing a new version of ARQ and also doing a first full release of SPARQL version of Joseki. ARQ version 1.5 had a problem with a performance hit for certain database usages (boo, hiss) but this is fixed in version 1.5.1.

A key feature of this version of ARQ is support for free text search. The core indexing is done by Lucene 2. Searches bind matching resources to a variable and are done using a property function. This means they can be fast because it directly exploits the indexing capabilities of Lucene.

Example:

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>
  SELECT ?doc {
    ?lit pf:textMatch '+text' .
    ?doc ?p ?lit
  }

What next? This release matches the published DAWG working draft but the working group is considering a SPARQL algebra the formalization of the semantic of SPARQL to be based on some like the relational algebra.

This version of ARQ has a number of query engines:

  • QueryEngine is the one used normally.
  • QueryEngineHTTP is a SPARQL protocol client for remote query.
  • QueryEngineRef is a direct, simple implementation of the SPARQL algebra.
  • QueryEngineQuad is a modified version of the reference query engine that compiles SPARQL to quads, not triples.

QueryEngineRef compiles a SPARQL query into the SPARQL algebra and it also provides a very simple evaluation of that algebra expression. The evaluation can be slow because it calculates all sub-expressions, then joins/left-joins/unions them but it is written to be correct, rather than fast. It is much better to extend the reference engine and convert the algebra expression into something cleverer.

If you use the --engine=ref argument to arq.sparql, it will use the reference engine. To see the algebra expression, use arq.qparse with arguments --engine=ref --print=op.

The original query engine, which is still the default one, uses a substitution algorithm to evaluate a query. This will exploit the indexes that Jena maintains for all graphs by replacing any variables already bound to some value with their current value and so giving more information to the basic graph pattern matcher. This is done in a steaming fashion which is why this query engine only takes a constant amount of memory regardless of the data size.

The next version of ARQ will combine the two approaches in a new query engine - where possible (and, for real world, queries that means nearly always), it will use the streaming, indexed approach to execution and only resort to a potentially more memory-intensive approach for parts of the query that really need it. The good news is that while it might sound like writing a whole new subsystem from scratch, it isn't. The SPARQL algebra compiler is already part of the reference engine and new query engine extends that; in addition, much of the code of the streaming engine is unchanged and it's just the conversion from the algebra to the iterator structure that needs to be written.

It's also a chance to go back and clean up some code that has been around for a long time and, with hindsight, can be structured better. My programming style has changed over the time ARQ has been in development as I try different patterns and designs then learn what works and what doesn't (makes me get frustrated at some of the design of Java as well). Some old-style code does the right thing; it just does it in a way I would not choose to do it now. Don't get that chance very often.