27 July 2007

Basic Federated SPARQL Query

There are already ways to access remote RDF data. The simplest is to read a document which is an RDF graph and query it. Another way is with the SPARQL protocol which allows a query to be sent to a remote service endpoint and the results sent back (in RDF, or an XML-based results format or even a JSON one).
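As a rough illustration of the protocol route, the sketch below builds the URL for a query sent by HTTP GET, where the query text travels in the `query` parameter. The helper name `sparql_get_url` and the sample query are assumptions made for this example, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the URL for a SPARQL protocol query sent by HTTP GET.

    Hypothetical helper: assumes the endpoint URL has no query string yet.
    """
    # The protocol carries the query text in the 'query' parameter.
    return endpoint + "?" + urlencode({"query": query})

# Sample query, made up for illustration.
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"
url = sparql_get_url("http://sparql.org/books", query)
print(url)

# Decoding the URL again recovers the original query text.
print(parse_qs(urlsplit(url).query)["query"][0])
```

A real client would also set an Accept header to choose between the XML and JSON results formats.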

Several people writing on jena-dev have been attempting to create federated query applications where part of a query needs to be sent to one or more remote services.

Here's a basic building block for such federated query use cases. It adds the ability to make a SPARQL protocol call within a query, not just send the whole query to the remote service.

Syntax

A new keyword SERVICE is added to the extended SPARQL query language in ARQ. This keyword causes the sub-pattern to be sent to a named SPARQL service endpoint, and not matched against a local graph.

PREFIX : <http://example/>
PREFIX  dc:     <http://purl.org/dc/elements/1.1/>

SELECT ?a
FROM <mybooks.rdf>
{
  ?b dc:title ?title .
  SERVICE <http://sparql.org/books>
     { ?s dc:title ?title . ?s dc:creator ?a }
}

Algebra

There is a new operator in the algebra.

(prefix ((dc: <http://purl.org/dc/elements/1.1/>))
  (project (?a)
    (join
      (BGP [triple ?b dc:title ?title])
      (service <http://sparql.org/books>
          (BGP
            [triple ?s dc:title ?title]
            [triple ?s dc:creator ?a]
          ))
      )))

Performance Considerations

This feature is a basic building block to allow remote access in the middle of a query, not a general solution to the issues in distributed query evaluation. The algebra operation is executed without regard to how selective the pattern is, so the order of the patterns in the query will affect the speed of execution. Because it involves HTTP operations, asking the query in the right order matters a lot. Don't ask for the whole of a bookshop just to find a book whose title comes from a local RDF file - ask the bookshop a query with the title already bound from earlier in the query.
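The difference can be simulated with a toy model. This is only a sketch: the in-memory list stands in for the remote bookshop, `remote_call` stands in for one SPARQL protocol request, and all the names and data are made up for illustration. It is not a description of how ARQ executes SERVICE.

```python
# Hypothetical stand-in for the remote bookshop: (title, creator) rows.
REMOTE_BOOKS = [
    ("SPARQL Tutorial", "Alice"),
    ("Jena in Action", "Bob"),
    ("RDF Primer", "Carol"),
    ("Graph Stores", "Dave"),
]

def remote_call(title=None):
    """Pretend protocol call; returns the rows that would cross the network."""
    return [(t, c) for (t, c) in REMOTE_BOOKS if title is None or t == title]

def fetch_all_then_join(local_titles):
    """Ask for every book, then join with the local titles afterwards."""
    rows = remote_call()  # the whole bookshop is transferred
    creators = [c for (t, c) in rows if t in local_titles]
    return creators, len(rows)

def bound_join(local_titles):
    """Ask once per local title, with the title already bound."""
    creators, transferred = [], 0
    for title in local_titles:
        rows = remote_call(title)  # only matching rows are transferred
        transferred += len(rows)
        creators.extend(c for (_, c) in rows)
    return creators, transferred

local_titles = ["RDF Primer"]  # titles found in the local RDF file
print(fetch_all_then_join(local_titles))  # (['Carol'], 4)
print(bound_join(local_titles))           # (['Carol'], 1)
```

Both strategies return the same answer, but the second moves one row instead of four. The trade-off is that it makes one request per local solution, so it pays off when the local side is small and selective.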

Proper SPARQL

On top of this access operation, it would be possible to build a query processor that does what DARQ does (the DARQ project is not active), which is to read a SPARQL query, analyse it, and build a query on the extended algebra. The execution order is chosen based on the selectivity of the triple patterns so that it minimises network traffic.

Hopefully, given the building block in ARQ, someone will add the necessary query execution analysis to give a query broker that accepts strict SPARQL and uses a number of SPARQL services to answer the query.

6 comments:

areggiori said...

Nice! It would be good to start seeing various efforts to converge into a common syntax; Eric's recent work on FeDeRate is worth consideration. He has some neat ideas how performances of queries over distributed graphs could be improved, for example passing partial bindings to each service; and have some syntactic sugar to express that...

AndyS said...

FeDeRate does something a bit different, although related. It is more focused on (organizationally) "close" data.

Each database appears as a graph in the SPARQL query and the syntax can be legal SPARQL. Compare SquirrelRDF, the SQL access for Jena (LDAP as well).

If I understand correctly, the explicit bindings in FeDeRate are not so much for the application writer but to be able to dispatch a query and the solution set. If there is a protocol step (there isn't in FeDeRate), that would be a good place. Allows for preparing queries, for example.

DARQ shows that the control of query execution can be done by optimization techniques. I think it would be good to move the control burden away from the application writer.

This ARQ extension is to access SPARQL endpoints (like linked data). With a DARQ-style query broker, there would be no need for SERVICE. The SPARQL algebra needs some remote call operation - SERVICE allows access to it from the query until there is a rewriter/service directory.

In ARQ, the algebra supports in-line data. Great for testing :-)

There is a huge body of work on this, dating back at least to the '80s. The work of the Garlic project from a while back shows what can be done. One join type they identify is a local index-join; it takes results from the left and iteratively evaluates the right. You do have to be careful it does not change the semantics of the query.

Paula said...

Good to see!

I've only just re-introduced this feature into Mulgara (it used to be in TKS), but unfortunately this is all still in our TQL language. I was thinking it would have to be skipped for our SPARQL implementation (which we have yet to release), but now it looks like we have a template to work to. Thank you. :-)

I take it that you can CONSTRUCT new graphs from this too?

AndyS said...

Yes - you can use it with CONSTRUCT. It's in the query pattern and the pattern matching and result form are orthogonal.

dorgon said...

Dear Andy,

thank you for this simple extension. It is a good starting point for simple integration tasks and sufficient for simple unions of distributed data.

Because query modifiers like LIMIT sit outside the WHERE clause, it is however not possible to add offset/limit constraints to the remote sub-query. So it would not be a good idea for queries against large remote datasets.

For many people this is a very good extension (eg. integration of distributed library data with equal schemas)

Comment for others: note, you'll have to explicitly specify the ARQ syntax which extends SPARQL when parsing:

[snip]
QueryFactory.create(query, Syntax.syntaxARQ);
[/snip]

cheers
Andi L.

AndyS said...

dorgon,

Good point - it's true LIMIT etc do not get passed over. One point: the modifiers of a query can result in different answers to the same subquery.

A general solution would be to combine this with (local) nested SELECTs. Then a SPARQL query is a "table processing" expression and the sub-SELECT allows general tables. Combine with SERVICE taking a whole SELECT expression for the remote case.