02 March 2011

Updating RDF Lists with SPARQL

Something the SPARQL Working Group has been thinking about recently is updates to RDF lists.

RDF lists are hard to deal with because they are not first-class objects in the RDF data model. Instead they are "encoded" in triples. The encoding uses a cons-cell-like structure whereby each element of the list is a blank node (it doesn't have to be a blank node, but it nearly always is).

RDF lists are correctly called "RDF collections" but as it's the list-nature (elements in order) that matters, I'll call them lists in this blog.

Turtle and SPARQL have syntax for lists, but it's only surface syntax; what is really stored in the RDF graph is triples:

@prefix : <http://example/> .
:x :p (1 2 3) .

is the RDF:

:x    :p         _:b0 .
_:b0  rdf:first  1 .
_:b0  rdf:rest   _:b1 .
_:b1  rdf:first  2 .
_:b1  rdf:rest   _:b2 .
_:b2  rdf:first  3 .
_:b2  rdf:rest   rdf:nil .

RDF toolkits help by presenting lists as programming-language lists. This also helps in keeping the lists well formed: in a well-formed list there is exactly one rdf:first and one rdf:rest per cons cell - but it's legal RDF to have several uses of those properties, or none, on one subject.

As an additional quirk, the empty list isn't represented by any triples at all, so looking for lists isn't just a matter of looking for rdf:rest properties.

@prefix : <http://example/> .
:x :p () .

is the RDF:

:x :p rdf:nil .

Lists, Property Paths and Update

SPARQL 1.1 Query adds property paths, which make working with lists a bit easier, but they're not perfect: list elements do not necessarily come out in order.

{ :list rdf:rest*/rdf:first ?element }

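If the order matters, one workaround (my sketch, not part of the original scripts; it assumes the :x :p (1 2 3) example data from above) is to count the cons cells between the head of the list and each element, and sort on that count:

PREFIX :    <http://example/> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

SELECT ?element (COUNT(?mid) AS ?position)
WHERE {
  :x :p ?list .
  # ?mid ranges over the cells from the head up to and including ?node,
  # so counting them gives the 1-based position of ?node in the list.
  ?list rdf:rest* ?mid .
  ?mid  rdf:rest* ?node .
  ?node rdf:first ?element .
}
GROUP BY ?node ?element
ORDER BY ?position
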
But what about SPARQL 1.1 Update? How can we work with RDF lists? Here are some scripts for list operations. By using property paths, they work on lists of arbitrary length.

All the scripts are self-contained - they include test data.

They are examples - they aren't necessarily fully general; for example, they may not cope if lists are badly formed or if the property :p is also used to relate the subject to things that aren't lists. The last example shows a way to address that by finding and marking the relevant points in the graph, doing some work, then going back and tidying up. The graph being updated is also used as a scratch pad.

Add an element to the start of a list

PREFIX :    <http://example/> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

INSERT DATA {
  :x0 :p () .
  :x1 :p (1) .
  :x2 :p (1 2) .
  :x3 :p (1 2 3) .
} ;

DELETE { ?x :p ?list }
INSERT { ?x :p [ rdf:first 0 ; 
                 rdf:rest ?list ]
       }
WHERE
{
  ?x :p ?list .
}

This one is relatively easy. Find the list start with ?x :p ?list (which works whether the list is zero-length or already has elements), delete the old triple that connected to the start of the list, put in a new cons cell (the [...]) at the start, and link to it.
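
For example, after this update runs over the test data above, every list should have gained a leading 0 - roughly, in Turtle list syntax:

:x0 :p (0) .
:x1 :p (0 1) .
:x2 :p (0 1 2) .
:x3 :p (0 1 2 3) .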

Add an element to the end of a list

PREFIX :    <http://example/> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

INSERT DATA {
  :x0 :p () .
  :x1 :p (1) .
  :x2 :p (1 2) .
  :x3 :p (1 2 3) .
} ;

# The order here is important.
# Must do list >= 1 first.

# List of length >= 1
DELETE { ?elt rdf:rest rdf:nil }
INSERT { ?elt rdf:rest [ rdf:first 98 ; rdf:rest rdf:nil ] }
WHERE
{
  ?x :p ?list .
  # List of length >= 1
  ?list rdf:rest+ ?elt .
  ?elt rdf:rest rdf:nil .
  # ?elt is last cons cell
} ;

# List of length = 0
DELETE { ?x :p rdf:nil . }
INSERT { ?x :p [ rdf:first 99 ; rdf:rest rdf:nil ] }
WHERE
{
   ?x :p rdf:nil .
}

This is a bit harder - there are two cases: lists of length 0 and lists of length one or more. The element before the insertion point needs changing, and that is either a cons cell (list length >= 1) or, for the empty list, the triple pointing to rdf:nil.

Do the lists of length one or more first; otherwise a list that started out empty would, after the first insertion, also be caught by the length-one-or-more case and gain a second new element.

For a list of length 1 or more: find the last cons cell. The WHERE finds ?elt by walking all the cells of the list with rdf:rest+, and checks that it's the last one by looking for ?elt rdf:rest rdf:nil.

Then delete the rdf:rest, and insert the new cons cell [ rdf:first 98 ; rdf:rest rdf:nil ].

For a list of length 0, the style is the same, but finding the triple to delete and re-insert in order to attach the new cons cell is different.
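
If I've followed the scripts correctly, the test data should end up roughly as follows (98 appended to the non-empty lists, 99 added to the previously empty one):

:x0 :p (99) .
:x1 :p (1 98) .
:x2 :p (1 2 98) .
:x3 :p (1 2 3 98) .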

Delete the element at the start of a list

PREFIX :      <http://example/> 
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

INSERT DATA {
  :x3 :p (1 2 3) .
  :x2 :p (1 2) .
  :x1 :p (1) .
  :x0 :p () .
} ;

DELETE { 
   ?x :p ?list .
   ?list rdf:first ?first ;
         rdf:rest  ?rest }
INSERT { ?x :p ?rest }
WHERE
{
  ?x :p ?list .
  ?list rdf:first ?first ;
        rdf:rest ?rest .
}

This can be done in one step - we are not interested in lists of length 0 because they have no element to delete. So find the pattern at the start of the list, delete it (note the WHERE pattern and the DELETE template are the same), and insert a new triple that links the subject directly to the rest of the list.
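
After the update, the test data should look roughly like this (:x0 is untouched because the empty list has nothing to delete, and :x1 becomes the empty list):

:x0 :p () .
:x1 :p () .
:x2 :p (2) .
:x3 :p (2 3) .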

Delete the element at the end of a list

PREFIX :     <http://example/> 
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

INSERT DATA {
  :x3 :p (1 2 3) .
  :x2 :p (1 2) .
  :x1 :p (1) .
  :x0 :p () .
} ;

# List of length 1
# Do before other lists.

DELETE { ?x :p ?elt .
         ?elt  rdf:first ?v .
         ?elt  rdf:rest  rdf:nil .
       }
INSERT { ?x :p rdf:nil . }
WHERE
{
  ?x :p ?elt .
  ?elt rdf:first ?v ;
       rdf:rest rdf:nil .
} ;

# List of length >= 2
DELETE { ?elt1 rdf:rest ?elt .
         ?elt  rdf:first ?v .
         ?elt  rdf:rest  rdf:nil .
       }
INSERT { ?elt1 rdf:rest rdf:nil }
WHERE
{
  ?x :p ?list .
  ?list rdf:rest* ?elt1 .

  # Second to end.
  ?elt1 rdf:rest ?elt .
  # End.
  ?elt rdf:first ?v ;
       rdf:rest rdf:nil .
}

The cases to consider are lists of exactly one element and lists of two or more elements. It's the treatment of the element before the one we're deleting that differs.

The style is the same though - find the place just before the cell being deleted, then delete that cons cell and repair the link.

For a list of length 2 or more, rdf:rest* is used, which matches all the cells including ?list itself (the zero-step case); then the structure beyond that is tested for being the end. There are two rdf:rest steps in the test for the end, hence lists of length 2 or more.
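
Again, a rough sketch of the expected outcome on the test data:

:x0 :p () .
:x1 :p () .
:x2 :p (1) .
:x3 :p (1 2) .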

Delete the whole list (common case)

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX :     <http://example/> 

INSERT DATA {
:x0 :p () .
:x0 :q "abc" .

:x1 :p (1) .
:x1 :q "def" .

:x2 :p (1 2) .
:x2 :q "ghi" .
} ;

# Delete the cons cells.
DELETE
    { ?z rdf:first ?head ; rdf:rest ?tail . }
WHERE { 
      [] :p ?list .
      ?list rdf:rest* ?z .
      ?z rdf:first ?head ;
         rdf:rest ?tail .
      } ;

# Delete the triples that connect the lists.
DELETE WHERE { ?x :p ?z . }

This version is not fully general because it assumes that :p links only to lists and not also to other RDF terms (non-lists) that we would want to keep.

The first DELETE finds and removes all the cons cells. The second DELETE removes the triples with :p connecting the lists to their subjects.
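
On the test data, the expected end state is just the :q triples - the cons cells and all the :p triples, including :x0 :p rdf:nil, are gone:

:x0 :q "abc" .
:x1 :q "def" .
:x2 :q "ghi" .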

Delete the whole list (general case)

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX :     <http://example/> 

INSERT DATA {
:x0 :p () .
:x0 :p "String 0" .
:x0 :p [] .

:x1 :p (1) .
:x1 :p "String 1" .
:x1 :p [] .

:x2 :p (1 2) .
:x2 :p "String 2" .
:x2 :p [] .

# A list not connected.
(1 2) .

# Not legal RDF.
# () .

} ;

INSERT { ?list :deleteMe true . }
WHERE {
   ?x :p ?list . 
   FILTER (?list = rdf:nil || EXISTS{?list rdf:rest ?z} )
} ;

# Delete the cons cells.
DELETE
    { ?z rdf:first ?head ; rdf:rest ?tail . }
WHERE { 
      [] :p ?list .
      ?list rdf:rest* ?z .
      ?z rdf:first ?head ;
         rdf:rest ?tail .
      } ;

# Delete the marked nodes
DELETE 
WHERE { ?x :p ?z . 
        ?z :deleteMe true . 
} ;

## ------
## Unconnected lists.

DELETE
    { ?z rdf:first ?head ; rdf:rest ?tail . }
WHERE { 
      ?list rdf:rest ?z2 .
      FILTER NOT EXISTS { ?s ?p ?list }
      ?list rdf:rest* ?z .
      ?z rdf:first ?head ;
         rdf:rest ?tail .
      } 

Deep breath.

This one is quite long.

The first step is to find and mark all the lists that are reached from a subject via :p. We will need to delete those connecting triples at the end of the process, but the property might also be used for non-lists, and after the middle DELETE step all evidence of which objects were lists is gone. The test:

    FILTER (?list = rdf:nil || EXISTS{?list rdf:rest ?z} )

catches both zero length lists and lists with elements.

Second step: delete all the list cells - that is, any subjects reachable from a :p link that have rdf:first and rdf:rest properties.

Third step: remove the connecting triples and the markers.

Finally, we delete any lists where the start isn't connected to anything, which is the

    FILTER NOT EXISTS { ?s ?p ?list }

test.
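
On the test data, the expected end state keeps the non-list values of :p (the strings and the plain blank nodes) and removes everything list-shaped, including the unconnected (1 2):

:x0 :p "String 0" .
:x0 :p [] .
:x1 :p "String 1" .
:x1 :p [] .
:x2 :p "String 2" .
:x2 :p [] .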

License and Copyright

This page and the SPARQL 1.1 Update scripts are (c) Epimorphics Ltd and licensed under a Creative Commons Attribution 3.0 License.

09 December 2010

Performance benchmarks for the TDB loader (version 2)

CAVEAT

There are "Lies, damned lies, and statistics" but worse are probably performance measurements done by someone else. The real test is what does it mean for any given application and is performance "fit for purpose". Database-related performance measurements are particular murky. The shape of the data matters, the usage made of the data matters, all in ways that can wildly affect whether a system is for for purpose.

Treat these figures with care - they are given to compare the current TDB bulk loader (up to version 0.8.7) and the new one (version 0.8.8 and later). Even then, the new bulk loader is new, so it is still subject to tweaking and tuning - hopefully only to improve performance, not worsen it.

See also

http://esw.w3.org/RdfStoreBenchmarking.

Summary

The new bulk loader is faster by a factor of 2 or more, depending on the characteristics of the data. As loads can take hours, this saving is very useful. It also produces smaller databases, and those databases perform as well as or better than the ones produced by the current bulk loader.

Setup

The tests were run on a small local server, not tuned or provisioned for database work, just a machine that happened to be easily accessible.

  • 8GB RAM
  • 4-core Intel i5 760 @ 2.8GHz
  • Ubuntu 10.10 - ext4 filing system
  • Disk: WD 2TB - SATA-300, 7200 rpm, 64MB buffer
  • Java version Sun/Oracle JDK 1.6.0_22 64-Bit Server VM

BSBM

The BSBM published results from Nov 2009.

The figures here are produced using a modified version of the BSBM tools set used for version 2 of BSBM. The modifications are to run the tests on a local database, not over HTTP. The code is available from github. See also this article.

BSBM - Loader performance

BSBM dataset   Triples       Loader 1 time   Loader 1 rate   Loader 2 time   Loader 2 rate
50k            50,057        3s              18,011 TPS      7s              7,151 TPS
250k           250,030       8s              31,702 TPS      11s             22,730 TPS
1m             1,000,313     26s             38,956 TPS      27s             37,049 TPS
5m             5,000,339     121s            41,298 TPS      112s            44,646 TPS
25m            25,000,250    666s            37,561 TPS      586s            42,663 TPS
100m           100,000,112   8,584s          11,650 TPS      3,141s          31,837 TPS
200m           200,031,413   30,348s         6,591 TPS       8,309s          24,074 TPS
350m           350,550,000   83,232s         4,212 TPS       21,146s         16,578 TPS

BSBM - Database sizes

Dataset   Size/loader1   Size/loader2
50k       10MB           7.2MB
250k      49MB           35MB
1m        198MB          137MB
5m        996MB          680MB
25m       4.9GB          3.3GB
100m      20GB           13GB
200m      39GB           26GB
350m      67GB           45GB

BSBM - Query Performance

Numbers are "query mix per hour"; larger numbers are better. The BSBM performance engine was run with 100 warmups and 100 timing runs over local databases.

Loader used   50k        250k      1m        5m        25m      100m   200m   350m
Loader 1      102389.1   87527.4   58441.6   5854.7    1798.4   673.0  410.7  250.0
Loader 2      106920.1   86726.1   62240.7   11384.5   3477.9   797.1  425.8  259.2

What this does show is that for a narrow range of database sizes, around 5m to 25m, the databases produced by loader2 are noticeably faster. This happens because the working set of the databases produced by loader1 did not fit mostly in memory, while the working set of those produced by loader2 does.

COINS

COINS is the Combined Online Information System from the UK Treasury. It's a real-world database that has been converted to RDF by my colleague, Ian - see the description of the conversion to RDF done by Ian for data.gov.uk.

General information about COINS.

COINS is all named graphs.

COINS - Loader Performance

Quads         Loader 1 time   Loader 1 rate   Loader 2 time   Loader 2 rate
417,792,897   26,425s         15,811 TPS      17,057s         24,494 TPS

COINS - Database sizes

Size/loader1 Size/loader2
152GB 77GB

LUBM

LUBM information

LUBM isn't a very representative benchmark for RDF and linked data applications - it is designed more for testing inference. But there are details of various systems published using this benchmark. To check the new loader on this data, I ran loads for a couple of the larger generated datasets: the 1000 and 5000 datasets, with inference applied during data creation. The 5000 dataset, at just under a billion triples, was only run through the new loader.

LUBM - Loader Performance

LUBM dataset   Triples       Loader 1 time   Loader 1 rate   Loader 2 time   Loader 2 rate
1000-inf       190,792,744   7,106s          26,849 TPS      3,965s          48,119 TPS
5000-inf       953,287,749   N/A             N/A             86,644s         11,002 TPS

LUBM - Database sizes

Database sizes:

Dataset Size/loader1 Size/loader2
1000-inf 25GB 16GB
5000-inf N/A 80GB