12 July 2008

ARQ Property Paths

SPARQL basic graph patterns only allow fixed length routes through the graph being matched. Sometimes, the application wants a more general path so ARQ has acquired syntax and built-in evaluation for a path language as part of the ARQ's extensions to SPARQL. The path language is like string regular expressions, except it's over predicates, not string characters.

Property path documentation for ARQ.

Simple Paths

The first operator for simple paths is "/", which is path concatenation, or following property links between nodes, the other simple path operator is "^" which is like "/" except the graph connection is traversed (it's the inverse property).

# Find the names of people 2 "foaf:knows" links away.
PREFIX <http://xmlns.com/foaf/0.1/>
SELECT ?name
{ ?x foaf:mbox <mailto:alice@example> .
  ?x foaf:knows/foaf:knows/foaf:name ?name .
}

This is the same as the strict SPARQL query:

{
  ?x  foaf:mbox <mailto:alice@example> .
  ?x  foaf:knows [ foaf:knows [ foaf:name ?name ]]. 
}

or, with explicit variables:

{
  ?x  foaf:mbox <mailto:alice@example> .
  ?x  foaf:knows ?a1 .
  ?a1 foaf:knows ?a2 .
  ?a2 foaf:name ?name .
}

And these two are the same:

 ?x foaf:knows/foaf:knows/foaf:name ?name . 
 ?name ^foaf:name^foaf:knows^foaf:knows ?x .

Complex Paths

The simple paths don't change the expressivity; they are a shorthand for part of a basic graph pattern and ARQ compiles simple paths by generating the equivalent basic graph patterns then merging adjacent ones together.

Alternation, the "|" operator does not change the expressivity either - the same thing could be done with a SPARQL UNION.

# Use with Dublin core 1.0 or Dublin Core 1.1 "title"
 :book (dc10:title|dc11:title) ?title

Some complex paths do change the expressivity of language; the query can match things that can't be matched in a strictly fixed length paths because they allow arbitrary length paths through the use of "*" (zero or more), "+" (one or more), "?" (zero or one) as well as the form "{N,}" (N or more).

Two very useful cases are:

 # All the types, chasing the subclass hierarchy
 <http://example/> rdf:type/rdfs:subClassOf* ?type

and:

 # Members of a list
 ?someList rdf:rest*/rdf:first ?member .

because "*" includes the case of a zero length path - all nodes are "connected" to themselves by a zero-length path.

Strict SPARQL

The Property path documentation shows how to install paths and name them with a URI so you can use a path in strict SPARQL syntax.

Other

There have been some other path-related extensions to SPARQL:

  • GLEEN is a library that provides path-functionality in graph matching via property functions.  It also provides subgraph extraction based on pattern.
  • PSPARQL allows variables in paths
  • SPARQLeR which has path value type

8 comments:

Unknown said...

Andy - Looks great!

I'm curious, though, if this extension interferes with the standard SPARQL syntax. For example, how would the following triple pattern be parsed using the ARQ parser (from the DAWG test syntax-general-07.rq):

<a><b>+1.0

Is the plus parsed as a path modifier or part of a DECIMAL_POSITIVE? If longest match is preferred, then this query changes meaning under the ARQ parser... is that correct?

AndyS said...

It's a decimal. The grammar changed during SPARQL's standardization specifically to make this case work.

The lowest level of the language is tokenization and the longest token wins. Then the parser sees the tokens:

<a>
<b>
+1.0

It also means that "+ 1" or "- 1" (a between sign and digits ) is not a signed number in SPARQL.

Aldo Bucchi said...

Hi Andy,

Great! Non-fixed queries will be really useful for working with transitive properties in general.

Just curious, is the grammar compatible with bengee's SPARQL Script?

If so, I guess it would be interesting to merge both projects at some point. Scripting with paths is more natural.

Best,
A

AndyS said...

The path extension does not change any other part of the grammar and I don't think benjee script extensions change the BGP part of the grammar. One piece of tidying up in the SPARQL grammar (in the spec) was to make this sort of thing possible (the issue was the "+" for numbers). So it should be possible to add to other systems.

Pradeep said...

I would like to add one more such extension to the list - SPARQ2L. This too allows path variables at the predicate position. But the difference with others is that is not a in-memory model. See the details in - lsdis.cs.uga.edu/library/download/fp785-anyanwu1.pdf .
I am working to integrate this with Jena ARQ and analyze the benefits.

Pradeep

AndyS said...

Pradeep - thanks for pointing out that work. Is this a continuation of the SPARQLeR, also from U Georgia?

The implementation is in ARQ is not tied to in-memory use. Functionally, it works with all Jena graph implementations. In particularly, it works well with TDB which provides the right access granularity for the path evaluation algorithm.

I look forward to hearing of your integration with ARQ and hope you will publish the code.

Pradeep said...

Andy, I think this and SPARQLer are different works. I ma doing my thesis with the author of SPARQ2L in NCSU.
Coming back to the memory issue what I meant is Jena's graph model itself is in-memory. i.e it works if the graph model is in memory. Other works like SPARQLer and PSPARQL too use in-memory models. SPARQ2L query engine proposes an idea to store the graph model on disk and query it. If you want, we can discuss more on this through email.
Additionally, can you please provide some references to me on how to tweak/extend the ARQ? I want to replace the path finding algorithm in ARQ with that of ours to see how it works. I appreciate your time. Thanks you.

Pradeep

AndyS said...

Pradeep - Jena provides quite a few graph implementations. In-memory is not the only kind. ARQ works with all graph implementations and especially SDB and TDB which are disk-backed persistent stores (SDB is SQL-based and TDB is based on custom indexing).

ARQ has an explicit presentation of the SPARQL algebra - you may wish to look at that, find the path operators, and see where they are evaluated. The default evaluator can be replaced with a subcslass of the usual one and you then override the evaluation of the path. (This is what TDB does for basic graph patterns)

Alternatively, you can run a tranformation of the algebra and introduce your own operators. (this is what SDB does for SPARQL OpLefJoin etc)

Or you can do a mixture - one does not exclude the other because they happen at different phases.

You'll want to look at the source code - there are a few source code examples.