27 December 2008

A small mystery about deletion in T-Trees

T-Trees are a generalisation of AVL trees. They are useful for in-memory databases because they have better packing densities than AVL trees and need fewer rotations. They provide a sorted index (so they are not good as the only index structure in an in-memory RDF store). This posting is only loosely triple store or SPARQL related.

The paper "A Study of Index Structures for Main Memory Database Management Systems" (Tobin J. Lehman and Michael J. Carey, VLDB 1986) has the details.

T-Trees keep an array of items per tree node (usually a short array) and store 3 pointers and 2 integers per tree node, as opposed to 3 pointers and 1 number per single item stored for AVL. (Both can be done with 2 pointers, with no parent pointer, but it's more complicated and the code has to run its own stack to record its path through the tree.) Make the array a few entries long, and a T-Tree is a bit more compact; rotations only happen when the tree structure changes, which is after a leaf array fills up.
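
To make the space accounting concrete, here is a rough Java sketch of what one T-Tree node might look like. The field names, and the choice of which two integers to store, are my reading of the paper's description rather than anything from an actual implementation.

class TTreeNode<T extends Comparable<T>> {
    TTreeNode<T> parent, left, right;   // the 3 pointers per node
    int height;                         // the 2 integers: balance/height information ...
    int count;                          // ... and the number of array slots in use
    final T[] elements;                 // a short, sorted array of entries

    @SuppressWarnings("unchecked")
    TTreeNode(int width) { elements = (T[]) new Comparable[width]; }

    // A node "bounds" a value if it falls within the node's min..max range -
    // the test used when searching for the node that holds (or would hold) a value.
    boolean bounds(T value) {
        return count > 0
            && elements[0].compareTo(value) <= 0
            && elements[count - 1].compareTo(value) >= 0;
    }
}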

I only really understand these algorithms in depth if I code them up. My implementation of T-Trees includes consistency checking, because I find it easier to write data structure algorithms this way: lots of internal checking, lots of test cases, then a move to large-scale randomized insertion and deletion patterns, because I don't trust myself to enumerate all the possibilities in a hand-written test suite. Run the randomized tests for a few million iterations, checking the structure for its internal consistency constraints on every insertion and deletion. Then disable (but don't remove!) the checking code, and rely on the fact that "if (false)" compiles to nothing in Java and that statics tend to get inlined by the JIT.
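
The shape of that test loop is roughly the following. The TTree interface here is hypothetical - it just stands in for whatever implementation is under test - and a java.util.TreeSet acts as the trusted reference.

import java.util.Random;
import java.util.TreeSet;

interface TTree<T> {
    void insert(T value);
    void delete(T value);
    int size();
    void checkInvariants();   // throws if ordering, balance or node-size constraints are broken
}

public class TTreeSoakTest {
    // Flip to false for production: "if (CHECKING)" blocks compile away when the
    // constant is false, and static final constants tend to get inlined.
    static final boolean CHECKING = true;

    public static void soak(TTree<Integer> tree, long iterations, long seed) {
        Random random = new Random(seed);                     // keep the seed so failures can be replayed
        TreeSet<Integer> reference = new TreeSet<Integer>();  // trusted model of the contents
        for (long i = 0; i < iterations; i++) {
            Integer value = random.nextInt(10000);
            if (random.nextBoolean()) {
                tree.insert(value);
                reference.add(value);
            } else {
                tree.delete(value);
                reference.remove(value);
            }
            if (CHECKING) {
                tree.checkInvariants();
                if (tree.size() != reference.size())
                    throw new AssertionError("Size mismatch at iteration " + i + " (seed " + seed + ")");
            }
        }
    }
}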

But one area is puzzling me. The mystery, to me, is in the delete algorithm. One feature of a T-Tree is that internal nodes (nodes with both a left and a right subtree) should always be at least some minimum size.

The delete algorithm, from the paper, is (section 3.2.1):

3) Delete Algorithm

The deletion algorithm is similar to the insertion algorithm in the sense that the element to be deleted is searched for, the operation is performed, and then re-balancing is done if necessary. The algorithm works as follows:

  1. Search for the node that bounds the delete value. Search for the delete value within this node, reporting an error and stopping if it is not found.
  2. If the delete will not cause an underflow (i.e. if the node has more than the minimum allowable number of entries prior to the delete), then simply delete the value and stop; else, if this is an internal node, then delete the value and borrow the greatest lower bound of this node from a leaf or half-leaf to bring this node’s element count back up to the minimum; else, this is a leaf or a half-leaf, so just delete the element. (Leaves are permitted to underflow, and half-leaves are handled in step 3.)
  3. If the node is a half-leaf and can be merged with a leaf, coalesce the two nodes into one node (a leaf) and discard the other node. Proceed to step 5.
  4. If the current node (a leaf) is not empty, then stop; else, free the node and proceed to step 5 to re-balance the tree.
  5. For every node along the path from the leaf up to the root, if the two subtrees of the node differ in height by more than one, perform a rotation operation. Since a rotation at one node may create an imbalance for a node higher up in the tree, balance-checking for deletion must examine all of the nodes on the search path until a node of even balance is discovered.

But what if a half-leaf, a node with just one sub-node, which is a leaf, becomes smaller than the minimum size of a node, yet cannot be merged with its leaf? The T-Tree is still valid - the size constraint is on internal nodes only. But a half-leaf can become an internal node by a rotation of the tree. So our less-than-minimum-sized half-leaf can become an invalid internal node, and the constraint on internal nodes is violated.

Several possible solutions occur to me (and probably more than just these):

  • Just don't worry (this works for me because I'm expecting insertion to be much more important in any usage I might make of the T-Tree implementation). The internal node that was too small may become larger due to a later insertion. But my internal consistency checking code is now weakened, because there is a condition on internal nodes that is no longer always true.
  • Treat a half-leaf more like an internal node with respect to deletion and pull up an entry from its leaf to keep the half-leaf at the minimum size. If the half-leaf has a left leaf, this is the same as the rule for deletion in an internal node. If the half-leaf has a right leaf, the lowest element of the leaf is required, which means shifting down all the other elements in the leaf, so it is less than ideal.
  • Have special code in the rotation operations to move elements around when rotating a half-leaf into an internal node position. This is like possibility 2, except that the fix is delayed until it is known that it would otherwise cause an invalid internal node. T-Trees already have a special-case rotation on insertion anyway.

I'm choosing the second way for now - fixing up during deletion, sketched below - because the checking code can then check half-leaves as well as internal nodes for size constraints, catching problems earlier in an insert/delete sequence.
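
A sketch of that fix-up, reusing the TTreeNode sketch from earlier. The minimum size and the exact moment the fix-up is called are my own assumptions; this illustrates the idea rather than reproducing the implementation discussed here.

class TTreeDeleteFixup {
    static final int MIN = 2;   // assumed minimum entry count for internal nodes (and now half-leaves)

    /** Called after removing an entry from a half-leaf: top it back up from its single
     *  leaf child, so a later rotation cannot turn an undersized half-leaf into an
     *  invalid internal node. */
    static <T extends Comparable<T>> void fixupHalfLeaf(TTreeNode<T> n) {
        boolean isHalfLeaf = (n.left == null) != (n.right == null);
        if (!isHalfLeaf || n.count >= MIN)
            return;
        TTreeNode<T> leaf = (n.left != null) ? n.left : n.right;
        while (n.count < MIN && leaf.count > 0) {
            if (leaf == n.left) {
                // Left leaf: borrow its greatest element - the greatest lower bound of n -
                // which becomes the new smallest element of n (the internal-node rule).
                T moved = leaf.elements[--leaf.count];
                leaf.elements[leaf.count] = null;
                System.arraycopy(n.elements, 0, n.elements, 1, n.count);
                n.elements[0] = moved;
                n.count++;
            } else {
                // Right leaf: borrow its smallest element, shifting the rest of the leaf
                // down (the less-than-ideal case noted above), and append it to n.
                T moved = leaf.elements[0];
                System.arraycopy(leaf.elements, 1, leaf.elements, 0, leaf.count - 1);
                leaf.elements[--leaf.count] = null;
                n.elements[n.count++] = moved;
            }
        }
        if (leaf.count == 0) {
            // The leaf has emptied out: unlink it and n becomes a leaf itself.
            if (leaf == n.left) n.left = null; else n.right = null;
        }
    }
}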

A search of the web does not find any mention of this - most web pages are either a copy of the Wikipedia page or a reference to the original paper.

If anyone can help me out with this, then please leave a comment or get in touch. I'd be surprised if it turns out to be anything other than me missing something obvious (T-Trees are not new), but the situation of an undersized half-leaf becoming an internal node does occur, as I found from the randomized testing.

29 October 2008

Walking the Web

It's nice to see Freebase providing an RDF interface:  http://rdf.freebase.com/. The example they give is <http://rdf.freebase.com/ns/en.blade_runner> so let's see what is actually there and how we might use the information.

Each graph describing something contains Freebase URLs to be explored.  What we want is the ability to load data into our local store while some query is running, enabling the dataset to be enlarged as the query makes choices about how to proceed.

This is similar to cwm's log:semantics: http://www.w3.org/2000/10/swap/doc/Reach

In SPARQL, the dataset is fixed. That's no good if you want to write a graph-walking process without some glue in your favourite programming language. In a way, it's scripting for the web, but of a special kind: it's not a sequence of queries and updates; it's changing the collection of graphs, expanding the RDF dataset known to the application.
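
For concreteness, the glue might look something like this with Jena and ARQ in Java: fetch a graph by hand, query it, pick the next URLs to fetch, and repeat. The query here is just a placeholder for the kind of queries used below.

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

public class ManualWalk {
    public static void main(String[] args) {
        // Fetch one graph by hand (FileManager does the HTTP work) ...
        Model model = FileManager.get().loadModel("http://rdf.freebase.com/ns/en.blade_runner");

        // ... then query it, decide which URLs look interesting, fetch those, and so on.
        QueryExecution qexec =
            QueryExecutionFactory.create("SELECT DISTINCT ?p { ?s ?p ?o }", model);
        try {
            ResultSetFormatter.out(qexec.execSelect());
        } finally {
            qexec.close();
        }
    }
}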

Query 1 : See what's in the graph

Let's first look at what's available at the example URL.  That does not require anything special: it's just a FROM clause (which in ARQ will content-negotiate for RDF; if you use a web browser you will see an HTML page):

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT *
FROM fb:en.blade_runner
{ ?s ?p ?o }

Hmm - 294 triples.

Query 2 : Look for interesting properties

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?p
FROM fb:en.blade_runner
{
  ?s ?p ?o
}

62 distinct properties used.  fb:film.film.starring looks interesting.

Query 3 : Follow the links

As an experimental feature, consider a new SPARQL keyword "FETCH" which takes a URL, or a variable bound to a URL by the time that part of the query is reached, and fetches the graph at that location.

Now we fetch the documents at each of the URLs that are objects of the Blade Runner film.film.starring triples.

FETCH loads the graph and places it in the dataset as a named graph, the name being the URL it was fetched from. We use GRAPH to access the loaded graph. Done this way, triples from different sources are kept separate, which might be important in deciding which sources to believe.

This also shows a critical limitation: placing the triples in a named graph meets a basic requirement for deciding what to believe, but really there ought to be a lot more metadata about the graph, including when it was read and possibly why it was read (how we got here in the query), and so on. But we are not an agent system, so we will note this and move on.

By poking around with GRAPH ?personUUID { ?s ?p ?o } (60 triples), the property film.performance.actor looks hopeful.

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?actor
FROM fb:en.blade_runner
{
  fb:en.blade_runner fb:film.film.starring ?personUUID
  FETCH ?personUUID
  GRAPH ?personUUID
    { ?personUUID fb:film.performance.actor ?actor }
}

12 results.

--------------------------------------------
| actor                                    |
============================================
| fb:en.james_hong                         |
| fb:en.brion_james                        |
| fb:en.edward_james_olmos                 |
| fb:en.joanna_cassidy                     |
| fb:en.william_sanderson                  |
| fb:en.rutger_hauer                       |
| fb:authority.netflix.role.20000077       |
| fb:guid.9202a8c04000641f80000000054cbccc |
| fb:en.sean_young                         |
| fb:en.joe_turkel                         |
| fb:en.harrison_ford                      |
| fb:en.daryl_hannah                       |
--------------------------------------------

and more URLs to follow.

Looking in the next graph, there is fb:type.object.name, so let's guess and use that. But each time we have chosen a property, we didn't actually have to guess: we could have followed the property URL itself:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT *
FROM fb:type.object.name
{
  ?s ?p ?o
}

but it's easier to read the description in HTML (and Freebase itself follows links internally to build the page).

Query 4 : The names of actors in Blade Runner

So a query to find the names of actors in "Blade Runner" is:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?actor ?name
FROM fb:en.blade_runner
{
  fb:en.blade_runner fb:film.film.starring ?personUUID
  FETCH ?personUUID
  GRAPH ?personUUID
    { ?personUUID fb:film.performance.actor ?actor }
  FETCH ?actor
  GRAPH ?actor
    { ?actor fb:type.object.name ?name }
}
ORDER BY ?actor

which gives:

-------------------------------------------------------------------
| actor                                    | name                 |
===================================================================
| fb:authority.netflix.role.20000077       | "M. Emmet Walsh"     |
| fb:authority.netflix.role.20000077       | "M・エメット・ウォルシュ"    |
| fb:en.brion_james                        | "Brion James"        |
| fb:en.daryl_hannah                       | "Daryl Hannah"       |
| fb:en.daryl_hannah                       | "Ханна, Дэрил"       |
| fb:en.daryl_hannah                       | "דריל האנה"          |
| fb:en.daryl_hannah                       | "ダリル・ハンナ"         |
| fb:en.edward_james_olmos                 | "Edward James Olmos" |
| fb:en.harrison_ford                      | "Harrison Ford"      |
| fb:en.harrison_ford                      | "Форд Гаррісон"      |
| fb:en.harrison_ford                      | "Форд, Харрисон"     |
| fb:en.harrison_ford                      | "Харисон Форд"       |
| fb:en.harrison_ford                      | "Харисън Форд"       |
| fb:en.harrison_ford                      | "האריסון פורד"       |
| fb:en.harrison_ford                      | "ハリソン・フォード"       |
| fb:en.harrison_ford                      | "哈里森·福特"         |
| fb:en.harrison_ford                      | "해리슨 포드"          |
| fb:en.james_hong                         | "James Hong"         |
| fb:en.joanna_cassidy                     | "Joanna Cassidy"     |
| fb:en.joe_turkel                         | "Joe Turkel"         |
| fb:en.rutger_hauer                       | "Rutger Hauer"       |
| fb:en.rutger_hauer                       | "Хауэр, Рутгер"      |
| fb:en.rutger_hauer                       | "ルトガー・ハウアー"      |
| fb:en.rutger_hauer                       | "魯格·豪爾"           |
| fb:en.sean_young                         | "Sean Young"         |
| fb:en.sean_young                         | "Шон Йънг"           |
| fb:en.sean_young                         | "Янг, Шон"           |
| fb:en.sean_young                         | "ショーン・ヤング"        |
| fb:en.william_sanderson                  | "William Sanderson"  |
| fb:guid.9202a8c04000641f80000000054cbccc | "Morgan Paull"       |
-------------------------------------------------------------------

 

We are left with a question: why use (extended) SPARQL? If you're doing it once, then a web browser is easier. After all, I used one to choose the properties to follow.

But with a query, you can send it to someone else so they can reuse your knowledge, you can rerun it to look for changes, and you can generalise it and let the computer do some brute-force searching to find things that would take you, the human, a long time.

12 July 2008

ARQ Property Paths

SPARQL basic graph patterns only allow fixed-length routes through the graph being matched. Sometimes the application wants a more general path, so ARQ has acquired syntax and built-in evaluation for a path language as part of ARQ's extensions to SPARQL. The path language is like string regular expressions, except that it's over predicates, not string characters.

Property path documentation for ARQ.

Simple Paths

The first operator for simple paths is "/", which is path concatenation: following property links between nodes. The other simple path operator is "^", which is like "/" except that the graph connection is traversed in the reverse direction (it's the inverse property).

# Find the names of people 2 "foaf:knows" links away.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
{ ?x foaf:mbox <mailto:alice@example> .
  ?x foaf:knows/foaf:knows/foaf:name ?name .
}

This is the same as the strict SPARQL query:

{
  ?x  foaf:mbox <mailto:alice@example> .
  ?x  foaf:knows [ foaf:knows [ foaf:name ?name ]]. 
}

or, with explicit variables:

{
  ?x  foaf:mbox <mailto:alice@example> .
  ?x  foaf:knows ?a1 .
  ?a1 foaf:knows ?a2 .
  ?a2 foaf:name ?name .
}

And these two are the same:

 ?x foaf:knows/foaf:knows/foaf:name ?name . 
 ?name ^foaf:name^foaf:knows^foaf:knows ?x .

Complex Paths

The simple paths don't change the expressivity; they are a shorthand for part of a basic graph pattern and ARQ compiles simple paths by generating the equivalent basic graph patterns then merging adjacent ones together.

Alternation, the "|" operator, does not change the expressivity either - the same thing could be done with a SPARQL UNION.

# Use with Dublin Core 1.0 or Dublin Core 1.1 "title"
 :book (dc10:title|dc11:title) ?title

Some complex paths do change the expressivity of the language; the query can match things that can't be matched with strictly fixed-length paths, because they allow arbitrary-length paths through the use of "*" (zero or more), "+" (one or more), "?" (zero or one), as well as the form "{N,}" (N or more).

Two very useful cases are:

 # All the types, chasing the subclass hierarchy
 <http://example/> rdf:type/rdfs:subClassOf* ?type

and:

 # Members of a list
 ?someList rdf:rest*/rdf:first ?member .

because "*" includes the case of a zero length path - all nodes are "connected" to themselves by a zero-length path.

Strict SPARQL

The Property path documentation shows how to install paths and name them with a URI so you can use a path in strict SPARQL syntax.

Other

There have been some other path-related extensions to SPARQL:

  • GLEEN is a library that provides path functionality in graph matching via property functions. It also provides subgraph extraction based on a pattern.
  • PSPARQL allows variables in paths.
  • SPARQLeR has a path value type.

08 June 2008

TDB: Loading UniProt

TDB passed a milestone this week - a load of the complete UniProt V13.3 dataset.

UniProt V13.3 is 1,755,773,303 triples (1.7 billion) of which 1,516,036,125 are unique after duplicate suppression.

This dataset is interesting in a variety of ways. Firstly, it's quite large. Secondly, it is the composite of a small number of different, related databases and has some large literals (complete protein sequences - some over 70k characters in a single literal) as well as the full text of many abstracts. (LUBM doesn't have literals at all. Testing using both synthetic data and real-world data is necessary.)

UniProt comes as a number of RDF/XML files. These had already been checked before the loading by parsing them to N-Triples, using that as a sort of dump format. The Jena RDF/XML parser does extensive checking, and the data had some bad URIs. I find that most large datasets throw up some warnings on URIs.

TDB also does value-based storage for the XSD datatypes decimal, integer, date, and dateTime. Except there aren't very many in this dataset. For example, there are just 18 occurrences of the value "1", in any form, in the entire dataset. They are just xsd:ints in some cardinality constraints. I was a bit surprised by this. Given the size of the dataset, I expected either none or lots of uses of the value 1, so I grepped the input data to check. It's much, much quicker to use SPARQL than to run grep over 1.7 billion triples in gzip'ed files, but working with N-Triples makes it easy to produce small tools you can be sure work. And indeed those are the only 1 values. Trust the SPARQL query next time.

25 March 2008

Two more ARQ extensions

I've implemented two new extensions for ARQ:

  • Assignment
  • Sub-queries

Both of these expose facilities that are already in the query algebra. Sub-queries are done by simply allowing query algebra operators to appear anywhere in the query, rather than requiring solution modifiers to be only at the outer level of the query; this allows extensions, like counting, to be inside the query and available to the rest of the pattern matching. An assignment operator already existed as an algebra extension for optimization and to support ARQ's SELECT expressions.

Both are syntactic extensions and are available if the query is parsed with language Syntax.syntaxARQ.

Currently available in ARQ SVN.
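
For reference, running a query that uses these extensions from Java looks like any other ARQ query execution, except that the query is parsed with Syntax.syntaxARQ. The query string and empty model below are only placeholders.

import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class RunArqExtendedQuery {
    public static void main(String[] args) {
        String queryString =
            "PREFIX : <http://example/> " +
            "SELECT ?x ?area { ?x :height ?h ; :width ?w . LET (?area := ?h * ?w) }";

        Model model = ModelFactory.createDefaultModel();   // placeholder data

        // Parse with the ARQ syntax rather than Syntax.syntaxSPARQL,
        // so LET and sub-queries are accepted.
        Query query = QueryFactory.create(queryString, Syntax.syntaxARQ);

        QueryExecution qexec = QueryExecutionFactory.create(query, model);
        try {
            ResultSetFormatter.out(qexec.execSelect());
        } finally {
            qexec.close();
        }
    }
}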

Assignment

This assigns a computed value to a variable in the middle of a pattern.

LET (?x := ?y + 5 )

The assignment operator is ":=". A single "=" is already the test for equals in SPARQL.

This means that a computed value can be used in other pattern matching:

 SELECT ?y ?area
 {
    ?x rdf:type :Rectangle ;
       :height ?h ;
       :width ?w .
    LET (?area := ?h*?w )
    GRAPH <otherShapes>
    {
      ?y :area ?area . # Shapes with the same area
    }
 }

Application writers can provide their own functions, maybe to do a little data munging to map between different formats:

   ?x  foaf:name  ?name .          # "John Smith"
   # Convert to a different style: "Smith, John" for example.
   LET (?vcardName := my:convertName(?name) )
   ?y vCard:FN ?vcardName .

There are some rules for the assignment:

  • if the expression does not evaluate (e.g. unbound variable in the expression), no assignment occurs and the query continues.
  • if the variable is unbound, and the expression evaluates, the variable is bound to the value.
  • if the variable is bound to the same value as the expression evaluates to, nothing happens and the query continues.
  • if the variable is bound to a different value from the one the expression evaluates to, an error occurs and the current solution will be excluded from the results.

ARQ already has SELECT expressions, so a combination of sub-query and expression can achieve the same effect, but it's unnatural and verbose and sometimes requires parts of the pattern matching to be written twice, inside and outside the sub-query.

One place where LET might be useful is in a CONSTRUCT query. In strict SPARQL, only terms found in the original data can be used for variables in the construct template but with LET-assignment:

   CONSTRUCT { ?x :lengthInInches ?inch }
   WHERE
   { ?x :lengthInCM ?cm
     LET (?inch := ?cm/2.54 )
   }

This isn't a new idea - see for example: "A SPARQL Semantics based on Datalog" - although the syntax in ARQ is designed to group the terms better.

Sub-queries

A sub-query can be used to apply some solution modifier to a sub-pattern.  Useful examples include aggregation, especially grouping and counting, and LIMIT with ORDER BY to get only some of the results of a pattern match.

 { SELECT (COUNT(*) AS ?c) { ?s ?p ?o } }

A sub-query is enclosed by {} and must be the only thing inside those braces, the same style as Virtuoso Subqueries. The sub-query is combined, with a SPARQL join, with the other patterns in the same group.

In the example below, for each person with two or more phones, we find how many people they foaf:knows:

 PREFIX foaf: <http://xmlns.com/foaf/0.1/>

 SELECT ?person ?knowsCount
 {
   # ?person who have 2 or more phones
   { SELECT ?person
     WHERE { ?person foaf:phone ?phone } 
     GROUP BY ?person 
     HAVING (COUNT(?phone) >= 2) 
   }
   # Join on ?person with how many people they foaf:knows
   { SELECT ?person (COUNT(?x) AS ?knowsCount)
     WHERE { ?person foaf:knows ?x .}
     GROUP BY ?person
   }
}

Queries with sub-queries can become complicated quite quickly, so I usually write each of the parts separately and then combine them.

19 February 2008

First time out for TDB (pt 2)

Follow-on from the previous testing: a larger load, of over 100 million triples from UniProt 7.0a, performed as follows on the sdb1 machine:
  • 115,517,840 triples
  • 3903s (which is 29594 triples/s) -- or about 65 minutes

13 February 2008

First time out for TDB

In other work, I've needed some storage and indexing components. To test them out, I've built a persistent Jena graph that behaves like an "RDF VM", whereby an application can handle more triples than fit in memory alone, and it flexes to use and release its cache space based on the other applications running on the machine. Working name: TDB. It's early days, having only just finished writing the core code, but the core is now working to the point where it can load and query reliably.

The RDF VM uses indexing code (currently, classical B-Trees) but in a way that matches the implementation model of RDF. There is no translation between what the indexes hold and the disk's idea of the data. To check that this made sense, I also tried it with the B-Trees replaced by Berkeley DB Java Edition. The BDB version behaves similarly, with a constant slowdown. Of course, BDB-JE is more sophisticated, with variable-sized data items and duplicates (and transactions, though I wasn't using them), so some overhead isn't surprising.

I have also tried some other indexing structures but B-Trees have proved to scale better, from situations where there isn't much free memory to 64-bit machines where there is.

Node Loads

The main area of difference between the custom and BDB-backed implementations is in loading speed. They handle RDF node representations differently. Storing the nodes in a BDB database, or a JDBM htable, was adequate, giving a load rate of around 12K triples/s, but it generates too many writes to disk in an asynchronous pattern. Changing to streaming writes in TDB fixed that. Because all the implementations fit the same framework, this technique can be rolled back into the BDB-implemented code. And BDB supports transactions. The node technique may also help with a SQL-database-backed system like SDB.
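
One way to picture the streaming-write idea for the node table - and this is my own illustration, not TDB's actual on-disk format or class names - is an append-only file where a node's id is simply the offset at which it was written, so loading only ever writes sequentially at the end of the file:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Illustrative append-only node table: streaming writes, offset used as node id. */
public class NodeFile {
    private final RandomAccessFile file;
    private final Map<String, Long> idByNode = new HashMap<String, Long>(); // a real system would persist this index too

    public NodeFile(File f) throws IOException { this.file = new RandomAccessFile(f, "rw"); }

    /** Return the id for a node's string form, appending it to the end of the file if it is new. */
    public synchronized long getOrAllocateId(String node) throws IOException {
        Long id = idByNode.get(node);
        if (id != null) return id;
        long offset = file.length();
        file.seek(offset);                                  // always append: sequential, streaming writes
        byte[] bytes = node.getBytes(StandardCharsets.UTF_8);
        file.writeInt(bytes.length);
        file.write(bytes);
        idByNode.put(node, offset);
        return offset;
    }

    /** Fetch the node's string form given its id (the offset it was written at). */
    public synchronized String getNode(long id) throws IOException {
        file.seek(id);
        byte[] bytes = new byte[file.readInt()];
        file.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}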

I did try Lucene - not a good idea.  Loading is too slow, but then that's not what Lucene is designed for.

Testing

For testing, I used the Jena test suite for functional tests and the RDF Store Benchmarks with the DBpedia dataset for performance.

TDB works and gives the right results for the queries. (It would be good to have the expected results published, as is done in the DAWG test suite, so correctness testing can be done.)

Query 2 benefits hugely from caching.  If run completely cold, after a reboot, it can take up to 30s. Running cold is also a lot more variable on machine sdb1 because other projects use the disk array.

Still room for improvement though. The new index code doesn't quite pack the leaf nodes optimally yet, and some more profiling may show up hotspots, but for a first pass just getting the benchmark to run is fine. Rewriting the queries, as an optimizer would, lowers the execution time for queries 3 and 5 to 0.48s and 1.46s respectively.

The results for query 4 show one possible hotspot. This query churns through nodes when executing the filters, but the node retrieval code does not benefit from co-locality of disk access. Fortunately, alternative code for the node table does make co-locality possible and still runs almost as fast. Time to get out the profiler.

To illustrate the "RDF VM" effect: when run with Eclipse, Firefox, etc. all consuming memory, my home PC is 5-10% slower than when run without them hogging bytes, even on a dataset as small as 16 million triples.

First Results for TDB

Machine              sdb1         Home PC
Date                 11/02/2008   11/02/2008

Load (seconds)       686.582      726.1
Load (triples/s)     23,478       22,961

Query 1 (seconds)    0.05         0.03
Query 2 (seconds)    1.30         0.73
Query 3 (seconds)    9.87         9.50
Query 4 (seconds)    30.99        35.32
Query 5 (seconds)    29.87        34.24

Breakdown of the sdb1 load:

Loading          Triples      Load time (s)   Load rate (triples/s)
Overall          16,120,177   686.582         23,478
infoboxes        15,472,624   651.543         23,748
geocoordinates   447,517      24.084          18,581
homepages        200,036      10.955          18,259

Setup

My home PC is a media centre - quad core, 3 Gbyte RAM, consumer-grade disks, running Vista and Norton Internet Security anti-virus. I guess it's quicker on the short queries because there is less latency in getting to the disk - even if the disks are slower - but it falls behind when the query requires some crunching or a lot of data drawn from the disk.

sdb1 is a machine in a blade rack in the data centre - details below.

(My work desktop machine, running Windows XP, has various Symantec antivirus and anti-intrusion software components and is generally slower for database work.)

Disk: Data centre disk array over fiber channel.

/proc/cpuinfo (abbrev):

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
TLB size : 1088 4K pages
address sizes : 40 bits physical, 48 bits virtual

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual

/proc/meminfo (abbrev):

MemTotal: 8005276 kB
MemFree: 435836 kB
Buffers: 40772 kB
Cached: 7099840 kB
SwapCached: 0 kB
Active: 1165348 kB
Inactive: 6141392 kB
SwapTotal: 2048276 kB
SwapFree: 2048116 kB
Mapped: 202868 kB

05 January 2008

Jena-Mulgara : example of implementing a Jena graph

In Jena, Graph is an interface. It abstracts anything that looks like RDF - storage options, inference, other legacy data sources.

The main operations are find(Triple), add(Triple) and remove(Triple). In addition, there are a number of getters to access handlers for various features (query, statistics, reification, bulk update, event manager). Having handlers, rather than directly including all the operations for each feature, reduces the size of the interface and makes it easier to provide default implementations of each feature.

Implementing a graph rarely requires implementing the interface directly. More usually, an implementation starts by inheriting from the class GraphBase. A minimal (read-only) implementation just needs to implement graphBaseFind. Wrapping legacy data often only makes sense as a read-only graph. To provide update operations, just implement the methods performAdd and performDelete, which are the methods called from the base implementations of add(Triple) and remove(Triple).
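
As a sketch of how little is needed, here is a toy graph backed by a Java set, written against the Jena 2.x Graph SPI; exact signatures (TripleMatch versus Triple, and the use of generics) vary between Jena versions, so treat this as illustrative:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.graph.TripleMatch;
import com.hp.hpl.jena.graph.impl.GraphBase;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.util.iterator.WrappedIterator;

/** A toy Graph: a set of triples, with pattern matching done by a linear scan. */
public class SetGraph extends GraphBase {
    private final Set<Triple> triples = new HashSet<Triple>();

    // All read access - the Model API, the ontology API, SPARQL via ARQ - comes
    // through find, which GraphBase routes to this method.
    @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch m) {
        Triple pattern = m.asTriple();
        List<Triple> matches = new ArrayList<Triple>();
        for (Triple t : triples) {
            if (matches(pattern.getSubject(), t.getSubject())
                    && matches(pattern.getPredicate(), t.getPredicate())
                    && matches(pattern.getObject(), t.getObject()))
                matches.add(t);
        }
        return WrappedIterator.create(matches.iterator());
    }

    private static boolean matches(Node patternNode, Node node) {
        // Node.ANY (and, defensively, null) acts as a wildcard in a find pattern.
        return patternNode == null || patternNode == Node.ANY || patternNode.equals(node);
    }

    // Update support: called from the base implementations of the add and delete operations.
    @Override public void performAdd(Triple t)    { triples.add(t); }
    @Override public void performDelete(Triple t) { triples.remove(t); }

    public static void main(String[] args) {
        Model model = ModelFactory.createModelForGraph(new SetGraph());
        model.createResource("http://example/book")
             .addProperty(model.createProperty("http://purl.org/dc/elements/1.1/title"), "A title");
        model.write(System.out, "N-TRIPLE");
    }
}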

Then for testing with JUnit, inherit from AbstractGraphTest (override tests that don't make sense in a particular circumstance) and provide the getGraph operation to generate a graph instance to test.

Application APIs

Graph/Triple/Node provide the low-level interface in Jena; Model/Statement/Resource/Literal provide the RDF API, and the ontology API provides an OWL-centric view of the RDF data.

Where the graph level is minimal and symmetric (e.g. literals as subjects, inclusion of named variables) for easy implementation, the RDF API enforces the RDF conditions and provides a wide variety of convenience operations so that programs can be succinct, not requiring the application writer to produce unnecessary boilerplate code sequences. The ontology API does the same for OWL. If you look at the javadoc, you'll see the APIs are large but the system-level interface is small.

A graph is turned into a Model by calling ModelFactory.createModelForGraph(Graph). All the key application APIs are interface-based, although it's rarely necessary to do anything other than use the standard Model-Graph bridge.

Data access to the graph all goes via find. All the read operations of the application APIs, directly or indirectly, come down to calling Graph.find or a graph query handler. And the default graph query handler works by calling Graph.find, so once find is implemented, everything (read-only) works - including ARQ's query API, which provides the SPARQL implementation. It may not be the most efficient way, but importantly all functionality is available, so the graph implementer can quickly get a first implementation up and running, then decide where and when to spend further development time - or whether that's needed at all.

Jena-Mulgara

An example of this is a prototype Jena-Mulgara bridge (work in progress as of Jan'08). This maps the Graph API to a Mulgara session object, which can be a local Mulgara database or a remote Mulgara server. The prototype is a single class together with a set of factory operations for more convenient creation of a bridge graph wrapped in all Jena's APIs.

Implementing graph nodes, for IRIs and for literals, is straightforward. Mulgara uses JRDF to represent these nodes and to represent triples. Mapping to and from the Jena versions of the same is just a change in naming.

Blank nodes are more interesting. A blank node in Jena has an internal label (which is not a URI in disguise). When working at the lowest level of Graph, the code is manipulating things at a concrete, syntactic level.

A blank node in Mulgara has an internal id, but it can change. It really is the internal node index, as I found out by creating a blank node with id=1 and finding it turned into rdf:type, which was what was really at node slot 1. Paul has been (patiently!) explaining this to me on a Mulgara mailing list. The session interface is an interface onto the RDF data, not an interface that exposes the graph details to the client. Both approaches are valid - they are just different levels of abstraction.

If the Jena application is careful about blank nodes (not assuming they are stable across transactions, and not deleting all triples involving some blank node and then creating triples involving that blank node), then it all works out. The most important case, reading data within a transaction, is safe. Bulk loading is better done via the native Mulgara interfaces anyway. The Jena-Mulgara bridge enables a Jena application to access a Mulgara server through the same interfaces as any other RDF data.