ARQtick: 2007

08 September 2007

Counting and GROUP BY

One thing people miss from SPARQL is counting. It's a feature that working group didn't have time for.

There's an implementation, following the design in SQL, in ARQ SVN which will be in the next release (v2.1). v2.1 introductions the cost-based optimizer for in-memory basic graph patterns by Markus.

It's a syntactic extension, not strict SPARQL, so you have to tell the system to parse queries in the "ARQ" langauge by passing Syntax.syntaxARQ to the query factory.

The following queries will work:

SELECT count(*) { ... }

SELECT (count(*) AS ?count) { ... }

This is based on having SELECT expressions as well as grouping. Using AS to give a named variable is better style because the results can go into the SPARQLXML results format; otherwise, an internal variable is allocated and they have illegal SPARQL names.

Other examples:

SELECT (count(*) AS ?rows)
{ ... }
GROUP BY ?x

SELECT count(distinct *)
{ ... }
GROUP BY ?x

SELECT count(?y)
{ ... }
GROUP BY ?x

What is being counted is solutions, in the case of count(*) and names, in the case of count(?var).

The current list of ARQ extensions is:

SERVICE - call-out from a query to a SPARQL endpoint over HTTP
SELECT expressions
GROUP BY
count()

So what features should be next?

08 August 2007

Syntax Comparison of N3 and Turtle

Some notes on the syntax differences between N3 and Turtle, with a little SPARQL thrown in.

Some of the differences are low-level issues, which are enough to trip-up machine reading of RDF data in these formats across the web. It also makes writing reliable RDF output writers harder because of variability of parsers.

RDF/XML really does win for web exchange but it is also a barrier to people who find N3 or Turtle much easier to understand and produce.

This collection of differences is not complete and I'll add and amend the note. Please let know me of anything that should be added or corrected.

27 July 2007

Basic Federated SPARQL Query

There are already ways to access remote RDF data. The simplest is to read a document which is an RDF graph and query it. Another way is with the SPARQL protocol which allows a query to be sent to a remote service endpoint and the results sent back (in RDF, or an XML-based results format or even a JSON one).

Several people writing on jena-dev have been attempting to created federated query applications where part of a query needs to sent to one or more remote services.

Here's a basic building block for such federated query use cases. It adds the ability to make a SPARQL protocol call within a query, not just send the whole query to the remote service.

Syntax

A new keyword SERVICE is added to the extended SPARQL query language in ARQ. This keyword causes the sub-pattern to be sent to a named SPARQL service endpoint, and not matched against a local graph.

PREFIX : <http://example/>
PREFIX  dc:     <http://purl.org/dc/elements/1.1/>

SELECT ?a
FROM <mybooks.rdf>
{
  ?b dc:title ?title .
  SERVICE <http://sparql.org/books>
     { ?s dc:title ?title . ?s dc:creator ?a }
}

Algebra

There is a new operator in the algebra.

(prefix ((dc: <http://purl.org/dc/elements/1.1/>))
  (project (?a)
    (join
      (BGP [triple ?b dc:title ?title])
      (service <http://sparql.org/books>
          (BGP
            [triple ?s dc:title ?title]
            [triple ?s dc:creator ?a]
          ))
      )))

Performance Considerations

This feature is a basic building block to allow remote access in the middle of a query, not a general solution to the issues in distributed query evaluation. The algebra operation is executed without regard to how selective the pattern is. So the order of the query will affect the speed of execution. Because it involves HTTP operations, asking the query in the right order matters a lot. Don't ask for the whole of a bookstore just to find book whose title comes from a local RDF file - ask the bookshop a query with the title already bound from earlier in the query.

Proper SPARQL

On top of this access operation, it would be possible to build a query processor that does what DARQ (the DARQ project is not active) does which is to read SPARQL query, analyse it, and build a query on the extended algebra.The execution order is chosen based on the selectivity of the triple patterns so it minimises network traffic.

Hopefully, given the building block in ARQ, someone will add the necessary query execution analysis to give a query broker that accepts strict SPARQL and uses a number of SPARQL services to answer the query.

25 July 2007

SSE

Following on from SPARQL S-Expressions :: a description of SSE, a notation for RDF-related data structures (like the SPARQL algebra).

29 June 2007

Installing OracleXE

We have been adding support for Oracle into SDB. As part of that, I installed Oracle XE ("Oracle Database 10g Express Edition") on my WindowsXP machine at work, which is part of a Windows domain.

It didn't go smoothly. I hope these notes help someone else trying to do the same thing.

Summary

Install when logged in as administrator (NOT a domain user, even if in group Administrators).
In order to use the command line SQL interface, you need to change SQLNET.ORA from (NTS) to (none)

Account

The Oracle installation instructions say to log in with administrative privileges and be attached to the domain.

The installation proceeded with no errors, and the 5 Windows services start up OK. But I found that I could not connect to the database web-based admin interface.

An additional symptom was that I can't change my domain password, nor create Windows local user accounts. In both cases, the error was that the password does not meet the requirements on characters and length. HP has tighter guidelines than the default for passwords but why installation of OracleXE broke a part of WindowsXP, I don't understand. After uninstalling OracleXE, I could change my password and create local user accounts as usual.

There are lots of articles about not being about to contact the database home page but only a few were related to the situation I had: this one was most useful (that link is into the OracleXE forum which is not publicly readable - you have to register first).

A way to see if your installation has been affected is to see if there is a file server\dbs\SPFILEXE.ORA. If not, the installation is probably broken.

Logging in as the local administrator account, uninstalling, reinstalling got an installation where I could get to the database home page and administer the database. I guess any local user account in group Administarors would work.

But sqlplus.exe still didn't work. (The error is "ORA-12638: Credential retrieval failed"). Following that thread again, I changed SQLNET.ORA to SQLNET.AUTHENTICATION_SERVICES = (none) and could connect to the database from the SQL command prompt.

Afterwards ...

Installing on Windows XP/Home worked fine but that is not a domained machine.

And after all that, the SDB SPARQL test suite runs perfectly with OracleXE.

22 June 2007

Joseki 3.1

New version of Joseki released, primarily to package together all the updated jars files for Jena and ARQ that Joseki uses. At the same time, other jars have been upgraded so there are jar naming changes.

Joseki 3.1 download

10 June 2007

SPARQL Papers at ESWC 2007

There were 3 papers that particularly caught my SPARQL-driven attention at ESWC2007.

SPARQLeR: Extended Sparql for Semantic Association Discovery
Krys J. Kochut, Maciej Janik

This describes an extension to SPARQL for path variables. A path is a regular expression of properties but in addition the paper describes the need for reverse properties and constraints on paths (like length).

See also: PSPARQL: http://psparql.inrialpes.fr/

Minimal Deductive Systems for RDF
Sergio Muñoz, Jorge Pérez and Claudio Gutierrez.

This is a proposal for reduced RDFS with just rdfs:domain, rdfs:range, rdfs:subClassOf, rdfs:subPropertyOf and rdf:type.

This results in (on page 8) a small set of rules that have to be applied to the data but there is no core vocabulary. The rules can be applied to a streaming data stream, if the RDFS schema is known, because each rule only refers to at most one data triple.

There are no containers, which may be inconvenient, but that might more usefully be covered by not using typing, but having a different property just to match these syntactic constructs. That removes the container vocabulary from interacting with the application vocabulary.

A colleague here, Nipun Bhatia, has been working on streaming checking and rule application based on extending Eyeball. Nipun even adds cardinality validation by preprocessing the data to get the triples in subject order. Unix sort(1) is quite capable of sorting very large N-triples files in sensible amounts of time.

Semantic Process Retrieval with iSPARQL
Christoph Kiefer, Abraham Bernstein, Hong Joo Lee, Mark Klein and Markus Stocker.

((Non) interest declaration: Markus is now spending a few months working with us in Bristol - this work was done before that.)

The core of this paper is an example where statistical techniques beats logic. There is a strong message to us all here - don't think logic and perfect organization is necessarily the best solution to actual problems.

As part of this work, but not the main argument of the paper, they created iSPARQL (i=inprecise) which is an embedding of access to similarity metrics inside standard SPARQL without syntax changes. They use property functions (they are using ARQ but the principle is quite general) to access the similarity engine.

The idea of embedding some index or other functionality that can provide bindings of variables for some expression seems like a general extension technique for SPARQL. LARQ provides free-texting matching, using Lucene to do matching, and can include all the Lucene loose matching

09 April 2007

SPARQL S-expressions

I'm interested in exposing the SPARQL algebra support in ARQ for others (and me!) to experiment with SPARQL and SPARQL extensions.

Having a syntax to be able to write algebra expressions is useful and makes writing test cases of the algebra easier. ARQ already uses S-expressions to detail the syntax tree so it was natural to use S-expressions for algebra expressions.

I split the lowest levels of syntax out, to avoid having to write a many parsers. The result - SSE (SPARQL S-Expressions), a vaguely lisp-ish syntax. It consists of lists, RDF terms (IRIs, blank nodes, prefixed names and literals) in SPARQL syntax, and also words which are plain symbols without colon.

Given this universal syntax, it's a matter of building code libraries to build the Java data structures from SSE. This is mundane but being able to do this without rebuilding a parser each time is easier.

Example query:

 PREFIX : <http://example/> 

 SELECT ?x ?v
 { ?x :p ?v 
   OPTIONAL { ?v :q ?w }
 }

which is the algebra expression:

 (project (?x ?v)
   (leftjoin
     (bgp [triple ?x <http://example/p> ?v])
     (bgp [triple ?v <http://example/q> ?w])))

The use of either () or [] for lists, where beginning and end must match, aids readability but has no other significance.

Another example: 'prefix' defines namespaces for the enclosed body:

 (prefix ((: <http://example/>))
   (project (?c)
     (filter (= ?c "world")
       (bgp [triple ?s :p ?c]) )))

It doesn't just capture strict SPARQL: tables-as-constants mean an SSE file can contain data as well

(prefix ((x: <http://example/>))
  (join
    (table
      (row [?x 1] [?y x:g])
      (row [?x 2] ))
    (table 
      (row [?y x:g])
      (row [?x 2] ))
  ))

evaluating to:

 --------------------------
 | y                  | x |
 ==========================
 | <http://example/g> | 1 |
 | <http://example/g> | 2 |
 |                    | 2 |
 --------------------------

It's still "work in progress" and a bit rough - it can be inconsistent in layout, mainly due to slipping in a quick bit of hacking between doing other things; and also this leads to different coding styles in different places. But it's already proved to be an efficient way to write SPARQL algebra expressions and evaluate them for testing.

And doing an Emacs mode for SSE is trivial.

As an aside - I did a little web-trawling for the lisp information and gathered my links together.

Some Lisp Links

A partial, incomplete set of links to things about Lisp from a couple hours of web wandering. It's a bit of a change to be linking to web pages from the last millennium.

Lisp / General

The function/value namespace thing:: Scheme vs Common Lisp: http://www.nhplace.com/kent/Papers/Technical-Issues.html. Some of the arguments look a bit dated by modern standards.

Scheme

http://swiss.csail.mit.edu/projects/scheme/

Wikipedia - Scheme

http://www.r6rs.org/ - The latest Scheme definition. The nice thing, from a purely practical point of view, in this round of agreement is the definition of the library system.

Online books:

CMU Scheme repository: ftp://ftp.cs.cmu.edu/user/ai/lang/scheme/0.html

Community: http://community.schemewiki.org/

Object-oriented programming and Scheme

SLIB (A portable scheme library - "portable" seems to mean "it can be ported"). Included in SISC. http://swissnet.ai.mit.edu/~jaffer/SLIB

Websites:

http://www.schemers.org/
http://srfi.schemers.org/
SRFI - Scheme Requests for Implementation
http://readscheme.org/
http://library.readscheme.org/
Original lambda papers: http://library.readscheme.org/page1.html

Scheme / JVM Implementations

Access to access to ARQ for a SPARQL engine is important.

http://sisc-scheme.org/ (GPL 2 or MPL 1.1)
http://jscheme.sourceforge.net/ (Apache, zlib)
http://www.gnu.org/software/kawa/ (GPL 2)

Scheme / Eclipse

One of things that Java does have going for it is a free, sophisticated IDEs. Eclipse makes refactoring easy enough so as to encourage it as the project grows. For a project like ARQ, it's near essential to keep the naming and structure aligned to current terminology. Writing lisp in Emacs does not count as an IDE these days.

http://schemeway.sourceforge.net/ - not investigated yet.

http://schemeway.sourceforge.net/update-site/

Other Eclipse plug-ins? Other free, refactoring IDEs for Lisp?

Common Lisp / CLOS

Wikipedia - Common Lisp

Wikipedia - CLOS

Book: Common Lisp the Language, 2nd Edition

A Brief Guide to CLOS

http://jatha.sourceforge.net/ Common LISP library in Java (LGPL)

Common Lisp Wiki - http://www.cliki.net/

Lisp and .net

ARQ runs fine on .Net, so a CLR (.net and mono) lisp implementation is also interesting.

I didn't have time for much of a look around but did find Common Larceny: http://www.ccs.neu.edu/home/will/Larceny/CommonLarceny/download.html

Not scheme nor Common Lisp: http://dotlisp.sourceforge.net/dotlisp.htm (BSD) but last release: July 9, 2003. Patches from the authors home page.

Other

Bigloo: can call Java. Compiles scheme to the JVM (?? and CLR), can link in Java classes but I couldn't find a clear statement as to how to use in a mixed environment.

Download link to Bigloo 2.9a for the JVM broken (2007-04-08)

http://www-sop.inria.fr/mimosa/fp/Bigloo/ (GPL/LGPL)

09 March 2007

ARQ 2.0 beta

Just released the first ARQ 2.0 beta. This version is a complete implementation of the SPARQL query language, and also includes the ARQ featurs of custom property functions and free-text search. The plan is that this is the only beta. The big change is that it uses the SPARQL algebra for query execution, and there is a query optimizer to choose an execution strategy that makes this better than simple evaluation of the algebra operators. This makes ARQ up to date with where the workign group is going - that includes some changes of results to some queries, but, hopefully, these are restricted to queries that don't appear in the wild.

08 March 2007

SPARQL/Update

Modern applications are not single programs. Whether web or enterprise systems, they are a number of components running on different computers. What they do have in common is that they cooperate to deliver the application. SPARQL makes publishing RDF data on the web possible, but what about those applications that maintain and update that RDF Data? There needs to be a way for the application components to update their data. SPARQL/Update (also know as SPARUL, pronounced a bit like "spiral") is a language that takes the SPARQL style, and much of the grammar, and provides both graph update and graph management operations. It means those application components can talk a common language between themselves or to a database. Having a common update language then means the application developer can choose one RDF toolkit for one component, and another RDF system for another component, rather than having to choose one programming language and find one system that does everything required. This first draft of SPARQL/Update is published for comments, comment away!

12 February 2007

Jena SDB

SDB was released for the first time last week. While this is the first alpha release, it's actually at least the second internal query architecture and second loader architecture (Damian did the work to get the loader working fast).

SPARQL now has a formal algebra. ARQ is used to turn the SPARQL query syntax into a algebra expression; SDB takes over and compiles it first to a relational algebra(ish) structure then generates SQL syntax. Now there is a SPARQL algebra, this is all quite a bit simpler for SDB which is why this is the second design for query generation; much of the work now comes naturally from ARQ.

Patterns

At the moment, SDB only translates basic graph patterns, join and leftjoin expressions to SQL and leaves anything else to ARQ to stitch back together. For the current patterns, there are two tricky cases to consider:

SPARQL join relationships aren't exactly SQL's because they may involve unbound variables.
Multiple OPTIONALs may need to be COALESCEd.

For the first, sometimes the join needs to involve more than equality relationships like "if col1 = null or col1 = col2", which is a bit of scope tracking, and for the second, if a variable can be bound in two or more OPTIONALs , you have to take the first binding. The scope tracking is needed anyway.

Over time, more and more of the SPARQL algebra expression will be translated to SQL.

Layouts

SDB supports a number of databases layouts in two main classes:

Type 1 layouts have a single triple table, with all information encoded into the columns of the table.
Type 2 layouts use a triple table with a separate RDF terms table

Type 1 layouts are good at supporting fine-grained API calls where the need to join to get the actual RDF terms is removed because they are encoded into the triple tables columns. Jena's existing database layer, RDB, is an example of this. When the move was made from Jena1 to Jena2, the DB layout changed to a type 1 layout and it went faster. The two type 1 layouts supported are Jena's existing RDB layout and a debug version which encodes RDF terms in SPARQL-syntax directly into the triples table so you can simple read the table with an SQL query browser.

Type 2 layouts, where the triples table has pointers to a nodes table, are better as SPARQL queries gets larger. Databases prefer small, fixed width columns to variable string comparisons. SDB has two variations, one using 4 bytes integer indexes and one using 8 byte hashes. The hash form means that hash of query constants can be calculated and don't have to be looked up in the SQL.

It seemed that the hash form would be better all round. But it isn't - loading was 2x slower (sometimes worse) despite the fact that RDF terms don't have to be inserted into the nodes table first to get their auto-allocated sequence id. Databases we have tried are significant slower indexing 8 byte quantities than 4 byte quantities and this dominates the load performance.

There are three directions to go:

Inference support
Application-specific layout control
Filters support

(1) and (2) are linked by the fact it is looking at a query and deciding, for certain known predicates and part-patterns, that different database tables should be used instead. See Kevin's work on property tables which uses the approach to put some higher level understanding of the data back into the RDF store. (3) is "just" a matter of doing it.

And if you have some feature request, do suggest it.

02 February 2007

ARQ : what next?

In January, I got round to releasing a new version of ARQ and also doing a first full release of SPARQL version of Joseki. ARQ version 1.5 had a problem with a performance hit for certain database usages (boo, hiss) but this is fixed in version 1.5.1.

A key feature of this version of ARQ is support for free text search. The core indexing is done by Lucene 2. Searches bind matching resources to a variable and are done using a property function. This means they can be fast because it directly exploits the indexing capabilities of Lucene.

Example:

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>
  SELECT ?doc {
    ?lit pf:textMatch '+text' .
    ?doc ?p ?lit
  }

What next? This release matches the published DAWG working draft but the working group is considering a SPARQL algebra the formalization of the semantic of SPARQL to be based on some like the relational algebra.

This version of ARQ has a number of query engines:

QueryEngine is the one used normally.
QueryEngineHTTP is a SPARQL protocol client for remote query.
QueryEngineRef is a direct, simple implementation of the SPARQL algebra.
QueryEngineQuad is a modified version of the reference query engine that compiles SPARQL to quads, not triples.

QueryEngineRef compiles a SPARQL query into the SPARQL algebra and it also provides a very simple evaluation of that algebra expression. The evaluation can be slow because it calculates all sub-expressions, then joins/left-joins/unions them but it is written to be correct, rather than fast. It is much better to extend the reference engine and convert the algebra expression into something cleverer.

If you use the --engine=ref argument to arq.sparql, it will use the reference engine. To see the algebra expression, use arq.qparse with arguments --engine=ref --print=op.

The original query engine, which is still the default one, uses a substitution algorithm to evaluate a query. This will exploit the indexes that Jena maintains for all graphs by replacing any variables already bound to some value with their current value and so giving more information to the basic graph pattern matcher. This is done in a steaming fashion which is why this query engine only takes a constant amount of memory regardless of the data size.

The next version of ARQ will combine the two approaches in a new query engine - where possible (and, for real world, queries that means nearly always), it will use the streaming, indexed approach to execution and only resort to a potentially more memory-intensive approach for parts of the query that really need it. The good news is that while it might sound like writing a whole new subsystem from scratch, it isn't. The SPARQL algebra compiler is already part of the reference engine and new query engine extends that; in addition, much of the code of the streaming engine is unchanged and it's just the conversion from the algebra to the iterator structure that needs to be written.

It's also a chance to go back and clean up some code that has been around for a long time and, with hindsight, can be structured better. My programming style has changed over the time ARQ has been in development as I try different patterns and designs then learn what works and what doesn't (makes me get frustrated at some of the design of Java as well). Some old-style code does the right thing; it just does it in a way I would not choose to do it now. Don't get that chance very often.