12 July 2008

ARQ Property Paths

SPARQL basic graph patterns only allow fixed-length routes through the graph being matched. Sometimes the application wants a more general path, so ARQ has acquired syntax and built-in evaluation for a path language as part of ARQ's extensions to SPARQL. The path language is like string regular expressions, except it's over predicates, not string characters.

Property path documentation for ARQ.

Simple Paths

The first operator for simple paths is "/", which is path concatenation, or following property links between nodes. The other simple path operator is "^", which is like "/" except the graph connection is traversed in the reverse direction (it's the inverse property).

# Find the names of people 2 "foaf:knows" links away.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
{ ?x foaf:mbox <mailto:alice@example> .
  ?x foaf:knows/foaf:knows/foaf:name ?name .
}

This is the same as the strict SPARQL query:

{
  ?x  foaf:mbox <mailto:alice@example> .
  ?x  foaf:knows [ foaf:knows [ foaf:name ?name ]]. 
}

or, with explicit variables:

{
  ?x  foaf:mbox <mailto:alice@example> .
  ?x  foaf:knows ?a1 .
  ?a1 foaf:knows ?a2 .
  ?a2 foaf:name ?name .
}

And these two are the same:

 ?x foaf:knows/foaf:knows/foaf:name ?name . 
 ?name ^foaf:name^foaf:knows^foaf:knows ?x .
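To make the forward/reverse equivalence concrete, here is a minimal Python sketch over a toy in-memory graph. It is not how ARQ evaluates paths; the node names, the data, and the helpers `step` and `follow` are all made up for illustration.

```python
# Toy graph as a set of (subject, predicate, object) triples; all data is invented.
TRIPLES = {
    ("alice", "foaf:knows", "bob"),
    ("bob", "foaf:knows", "carol"),
    ("bob", "foaf:knows", "dave"),
    ("carol", "foaf:name", "Carol"),
    ("dave", "foaf:name", "Dave"),
}

def step(nodes, predicate, inverse=False):
    """Follow one property link from every node in `nodes`.

    With inverse=True the triple is traversed from object to subject,
    which is the reverse direction that "^" stands for.
    """
    out = set()
    for s, p, o in TRIPLES:
        if p != predicate:
            continue
        if not inverse and s in nodes:
            out.add(o)
        elif inverse and o in nodes:
            out.add(s)
    return out

def follow(start, steps):
    """Apply a sequence of (predicate, inverse) steps: path concatenation."""
    nodes = {start}
    for predicate, inverse in steps:
        nodes = step(nodes, predicate, inverse)
    return nodes

# alice foaf:knows/foaf:knows/foaf:name ?name
names = follow("alice", [("foaf:knows", False),
                         ("foaf:knows", False),
                         ("foaf:name", False)])

# "Carol" ^foaf:name^foaf:knows^foaf:knows ?x
people = follow("Carol", [("foaf:name", True),
                          ("foaf:knows", True),
                          ("foaf:knows", True)])
```

Walking the forward path from alice reaches the names two "foaf:knows" links away, and walking the reversed path from one of those names leads back to alice.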

Complex Paths

The simple paths don't change the expressivity; they are a shorthand for part of a basic graph pattern and ARQ compiles simple paths by generating the equivalent basic graph patterns then merging adjacent ones together.

Alternation, the "|" operator, does not change the expressivity either - the same thing could be done with a SPARQL UNION.

# Use with Dublin Core 1.0 or Dublin Core 1.1 "title"
 :book (dc10:title|dc11:title) ?title
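The UNION reading of alternation can be sketched in a few lines of Python. This is only an illustration on an invented toy graph: the helpers `objects` and `alt` are hypothetical, and because it uses sets it ignores any question of duplicate results.

```python
# Toy graph; the books and titles are invented for the example.
TRIPLES = {
    ("book1", "dc10:title", "SPARQL Basics"),
    ("book2", "dc11:title", "Paths in Graphs"),
    ("book2", "dc11:creator", "A. Author"),
}

def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def alt(subject, *predicates):
    """Alternation: the union of the matches for each alternative predicate."""
    result = set()
    for predicate in predicates:
        result |= objects(subject, predicate)
    return result

# :book (dc10:title|dc11:title) ?title, for two different books
titles_book1 = alt("book1", "dc10:title", "dc11:title")
titles_book2 = alt("book2", "dc10:title", "dc11:title")
```

Either vocabulary version finds the title, which is the point of writing the alternation once instead of two UNION branches.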

Some complex paths do change the expressivity of the language; the query can match things that can't be matched with strictly fixed-length paths, because they allow arbitrary-length paths through the use of "*" (zero or more), "+" (one or more), "?" (zero or one) as well as the form "{N,}" (N or more).

Two very useful cases are:

 # All the types, chasing the subclass hierarchy
 <http://example/> rdf:type/rdfs:subClassOf* ?type

and:

 # Members of a list
 ?someList rdf:rest*/rdf:first ?member .

because "*" includes the case of a zero length path - all nodes are "connected" to themselves by a zero-length path.
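Why the zero-length case matters for the list example can be seen in a small Python sketch. It assumes a toy three-element RDF list with invented cell names, and `star` is a hypothetical helper doing a plain reachability search, not ARQ's evaluator.

```python
# A three-element RDF list; the cell names "list", "cell2", "cell3" are invented.
TRIPLES = {
    ("list", "rdf:first", "a"),
    ("list", "rdf:rest", "cell2"),
    ("cell2", "rdf:first", "b"),
    ("cell2", "rdf:rest", "cell3"),
    ("cell3", "rdf:first", "c"),
    ("cell3", "rdf:rest", "rdf:nil"),
}

def objects(node, predicate):
    """All objects of triples with the given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == node and p == predicate}

def star(start, predicate):
    """Nodes reachable from `start` by zero or more `predicate` links."""
    seen = {start}            # the zero-length path: the start node itself
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for nxt in objects(node, predicate):
            if nxt not in seen:   # also stops the search looping on cycles
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# ?someList rdf:rest*/rdf:first ?member
cells = star("list", "rdf:rest")
members = set()
for cell in cells:
    members |= objects(cell, "rdf:first")
```

Because `star` starts with the list head itself in `seen`, the first member "a" is found; with "+" instead of "*" the head cell would be skipped and "a" would be missed.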

Strict SPARQL

The Property path documentation shows how to install paths and name them with a URI so you can use a path in strict SPARQL syntax.

Other

There have been some other path-related extensions to SPARQL:

  • GLEEN is a library that provides path-functionality in graph matching via property functions.  It also provides subgraph extraction based on pattern.
  • PSPARQL allows variables in paths
  • SPARQLeR has a path value type

08 June 2008

TDB: Loading UniProt

TDB passed a milestone this week - a load of the complete UniProt V13.3 dataset.

UniProt V13.3 is 1,755,773,303 triples (1.7 billion) of which 1,516,036,125 are unique after duplicate suppression.

This dataset is interesting in a variety of ways. Firstly, it's quite large. Secondly, it is the composite of a small number of different, related databases and has some large literals (complete protein sequences - some over 70k characters in a single literal) as well as the full text of many abstracts. (LUBM doesn't have literals at all. Testing using both synthetic data and real-world data is necessary.)

UniProt comes as a number of RDF/XML files. These had already been checked before loading by parsing them to N-Triples, which serves as a sort of dump format. The Jena RDF/XML parser does extensive checking, and the data had some bad URIs. I find that most large datasets do throw up some warnings on URIs.

TDB also does value-based storage for the XSD datatypes decimal, integer, date, and dateTime. Except there aren't very many such values. For example, there are just 18 occurrences of the value "1", in any form, in the entire dataset. They are just xsd:ints in some cardinality constraints. I was a bit surprised by this. Given the size of the dataset, I expected either none or lots of uses of the value 1, so I grepped the input data to check. It's much, much quicker to use SPARQL than to run grep on 1.7 billion triples in gzip'ed files, but working with N-Triples makes it easy to produce small tools that you can be sure work. And indeed they are the only 1 values. Trust the SPARQL query next time.
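As an idea of what such a small tool might look like, here is a rough Python sketch that counts N-Triples lines whose object is a literal with the lexical form "1". It is not the check actually used for the load. The regular expression, the function names and the sample lines are assumptions, and it only catches the exact lexical form "1" (not "01" or "1.0"), so it is a coarse filter rather than a value-based comparison.

```python
import gzip
import re

# Matches an N-Triples line whose object is a literal with lexical form "1",
# with or without a datatype, e.g. "1" or "1"^^<...#int>.
ONE_LITERAL = re.compile(r'\s"1"(\^\^<[^>]*>)?\s*\.\s*$')

def count_ones(lines):
    """Count lines whose object literal has the lexical form "1"."""
    return sum(1 for line in lines if ONE_LITERAL.search(line))

def count_ones_in_file(path):
    """Same count over a gzip'ed N-Triples file; `path` is a hypothetical file name."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return count_ones(f)

# Invented sample lines: two match, two do not.
sample = [
    '<http://example/c> <http://example/card> "1"^^<http://www.w3.org/2001/XMLSchema#int> .\n',
    '<http://example/c> <http://example/label> "11" .\n',
    '<http://example/c> <http://example/label> "1" .\n',
    '<http://example/c> <http://example/other> <http://example/1> .\n',
]
ones = count_ones(sample)
```

The one-triple-per-line layout of N-Triples is what makes a line-at-a-time scan like this easy to get right.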