08 June 2008

TDB: Loading UniProt

TDB passed a milestone this week - a load of the complete UniProt V13.3 dataset.

UniProt V13.3 is 1,755,773,303 triples (1.7 billion) of which 1,516,036,125 are unique after duplicate suppression.

This dataset is interesting in a variety of ways. Firstly, it's quite large. Secondly, it is the composite of a small number of different, related databases and has some large literals (complete protein sequences - some over 70k characters in a single literal) as well as the full text of many abstracts. (LUBM doesn't have literals at all. Testing using both synthetic data and real-world data is necessary.)

UniProt comes as a number of RDF/XML files. These had already been checked before loading by parsing them to N-Triples, which then served as a sort of dump format. The Jena RDF/XML parser does extensive checking, and the data had some bad URIs. I find that most large datasets do throw up some warnings on URIs.

TDB also does value-based storage for the XSD datatypes decimal, integer, date, and dateTime. Except there aren't very many such values in UniProt. For example, there are just 18 occurrences of the value "1", in any form, in the entire dataset. They are just xsd:ints in some cardinality constraints. I was a bit surprised by this. Given the size of the dataset, I expected either none or lots of uses of the value 1, so I grepped the input data to check. It's much, much quicker to use SPARQL than to run grep on 1.7 billion triples in gzip'ed files, but working with N-Triples makes it easy to produce small tools that you can be sure work. And indeed they are the only 1 values. Trust the SPARQL query next time.
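As an illustration of the kind of small N-Triples tool meant here, the following is a minimal Python sketch, not the check that was actually run. The function name `count_numeric_value`, the regular expression, and the set of datatypes treated as numeric are all assumptions made for the example. It only looks at typed literals in the object position and compares by value, so "1", "01" and "1.0" all count as 1.

```python
import re
from decimal import Decimal, InvalidOperation

XSD = "http://www.w3.org/2001/XMLSchema#"

# Assumption: these are the datatypes we treat as numeric for the check.
NUMERIC_TYPES = {XSD + name for name in ("int", "integer", "long", "short", "decimal")}

# A typed literal at the end of an N-Triples line: "lexical"^^<datatype> .
TYPED_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"\^\^<([^>]+)>\s*\.\s*$')


def count_numeric_value(lines, target):
    """Count lines whose object is a numeric typed literal equal in value to target."""
    wanted = Decimal(target)
    count = 0
    for line in lines:
        match = TYPED_LITERAL.search(line)
        if match is None or match.group(2) not in NUMERIC_TYPES:
            continue  # not a typed literal, or not a numeric datatype
        try:
            if Decimal(match.group(1)) == wanted:
                count += 1
        except InvalidOperation:
            pass  # ill-formed lexical form: ignore rather than guess
    return count
```

Because N-Triples is one triple per line, the same function can be fed a gzip'ed file directly, for example `count_numeric_value(gzip.open("data.nt.gz", "rt", encoding="utf-8"), 1)`, where `data.nt.gz` is a placeholder file name.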

25 March 2008

Two more ARQ extensions

I've implemented two new extensions for ARQ:

  • Assignment
  • Sub-queries

Both these expose facilities that are already in the query algebra. Sub-queries are done by simply allowing query algebra operators to appear anywhere in the query, rather than requiring solution modifiers to be only at the outer level of the query. This allows extensions like counting to be inside the query and available to the rest of the pattern matching. An assignment operator already existed as an algebra extension for optimization and to support ARQ SELECT expressions.

Both are syntactic extensions and are available if the query is parsed with language Syntax.syntaxARQ.

Currently available in ARQ SVN.

Assignment

This assigns a computed value to a variable in the middle of a pattern.

LET (?x := ?y + 5 )

The assignment operator is ":=". A single "=" is already the test for equals in SPARQL.

This means that a computed value can be used in other pattern matching:

 SELECT ?y ?area
 {
    ?x rdf:type :Rectangle ;
       :height ?h ;
       :width ?w .
    LET (?area := ?h*?w )
    GRAPH <otherShapes>
    {
      ?y :area ?area . # Shapes with the same area
    }
 }

Application writers can provide their own functions, maybe to do a little data munging to map between different formats:

   ?x  foaf:name  ?name .          # "John Smith"
   # Convert to a different style: "Smith, John" for example.
   LET (?vcardName := my:convertName(?name) )
   ?y vCard:FN ?vcardName .

There are some rules for the assignment:

  • if the expression does not evaluate (e.g. unbound variable in the expression), no assignment occurs and the query continues.
  • if the variable is unbound, and the expression evaluates, the variable is bound to the value.
  • if the variable is already bound to the same value that the expression evaluates to, nothing happens and the query continues.
  • if the variable is already bound to a different value from the one the expression evaluates to, an error occurs and the current solution is excluded from the results.
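The four rules can be sketched in a few lines of Python. This is a toy model and not ARQ's implementation: a solution is a plain dict from variable name to value, an expression is a Python callable, and "does not evaluate" is modelled as the callable raising an exception. The names `assign` and `area` are made up for the example.

```python
def assign(solution, var, expr):
    """Apply LET (var := expr) to one solution.

    Returns the resulting solution, or None if the solution is excluded.
    """
    try:
        value = expr(solution)
    except (KeyError, TypeError, ArithmeticError):
        # Rule 1: the expression does not evaluate (e.g. unbound variable);
        # no assignment occurs and the solution continues unchanged.
        return solution
    if var not in solution:
        # Rule 2: the variable is unbound, so bind it to the value.
        return {**solution, var: value}
    if solution[var] == value:
        # Rule 3: already bound to the same value; nothing happens.
        return solution
    # Rule 4: bound to a different value; the solution is excluded.
    return None


def area(solution):
    """The expression ?h*?w from the rectangle example."""
    return solution["h"] * solution["w"]
```

With this model, `assign({"h": 2, "w": 3}, "area", area)` adds the binding for `area`, while a solution that already binds `area` to some other value is dropped.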

ARQ already allows expressions in the SELECT clause, so a combination of sub-query and SELECT expression can achieve the same effect, but it's unnatural and verbose and sometimes requires parts of the pattern matching to be written twice, inside and outside the sub-query.

One place where LET might be useful is in a CONSTRUCT query. In strict SPARQL, only terms found in the original data can be used for variables in the construct template, but with LET-assignment computed values can be used as well:

   CONSTRUCT { ?x :lengthInInches ?inch }
   WHERE
   { ?x :lengthInCM ?cm
     LET (?inch := ?cm/2.54 )
   }
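To make the effect concrete, here is a rough Python emulation of that CONSTRUCT over a list of (subject, predicate, object) tuples. It is only a sketch: the function name `construct_inches` is invented, and the values are plain Python numbers rather than RDF terms. The point is that the constructed triples contain values that appear nowhere in the input data.

```python
def construct_inches(triples):
    """Emit one :lengthInInches triple for each :lengthInCM triple."""
    result = []
    for subject, predicate, cm in triples:
        if predicate != ":lengthInCM":
            continue  # the WHERE pattern only matches :lengthInCM
        inch = cm / 2.54  # LET (?inch := ?cm/2.54)
        result.append((subject, ":lengthInInches", inch))
    return result
```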

This isn't a new idea - see for example: "A SPARQL Semantics based on Datalog" - although the syntax in ARQ is designed to group the terms better.

Sub-queries

A sub-query can be used to apply some solution modifier to a sub-pattern.  Useful examples include aggregation, especially grouping and counting, and LIMIT with ORDER BY to get only some of the results of a pattern match.

 { SELECT (COUNT(*) AS ?c) { ?s ?p ?o } }

A sub-query is enclosed by {} and must be the only thing inside those braces, the same style as Virtuoso Subqueries. The sub-query will be combined, with SPARQL join, with other patterns in the same group.

For example, for every person with two or more phones, find how many people they foaf:knows:

 PREFIX foaf: <http://xmlns.com/foaf/0.1/>

 SELECT ?person ?knowsCount
 {
   # ?person who have 2 or more phones
   { SELECT ?person
     WHERE { ?person foaf:phone ?phone } 
     GROUP BY ?person 
     HAVING (COUNT(?phone) >= 2) 
   }
   # Join on ?person with how many people they foaf:knows
   { SELECT ?person (COUNT(?x) AS ?knowsCount)
     WHERE { ?person foaf:knows ?x .}
     GROUP BY ?person
   }
 }

Queries with sub-queries can become complicated quite quickly, so I usually write each of the parts separately and then combine them.
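The phones example can be mimicked in Python in the same piecewise way: compute each grouped sub-query on its own, then join on ?person. This is a sketch of the semantics only, assuming the data is a list of distinct (subject, predicate, object) tuples with prefixed names as plain strings; the function name `phones_and_knows` is invented for the example.

```python
from collections import Counter


def phones_and_knows(triples, min_phones=2):
    """Return {person: knowsCount} for people with at least min_phones phones."""
    # First sub-query: GROUP BY ?person HAVING (COUNT(?phone) >= 2)
    phone_counts = Counter(s for s, p, o in triples if p == "foaf:phone")
    with_phones = {person for person, n in phone_counts.items() if n >= min_phones}

    # Second sub-query: SELECT ?person (COUNT(?x) AS ?knowsCount) GROUP BY ?person
    knows_counts = Counter(s for s, p, o in triples if p == "foaf:knows")

    # Join on ?person: a person must appear in both sub-query results.
    return {person: knows_counts[person]
            for person in with_phones if person in knows_counts}
```

Note that someone with two phones who foaf:knows nobody does not appear in the result, because the second sub-query produces no group for them and the join drops the row.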