ARQtick: February 2008

In other work, I've needed with a storage and indexing components. To test them out, I've built a persistent Jena graph that behaves like an "RDF VM" whereby an application can handle more triples than memory alone, and it flexes to use and release its cache space based on the other applications running on the machine. Working name: TDB. Early days, having only just finished writing the core code, but the core is now working to the point where it can load and query reliably.

The RDF VM uses indexing code (currently, classical B-Trees) but in a way that matches the model of implementation of RDF. There is no translation between the indexing and the disk idea of data. To check that made sense, I also tried with the B-Trees replaced by Berkeley DB Java Edition. The BDB version behaves similarly with a constant slowdown. Of course, BDB-JE is more sophisticated with variable sized data items and duplicates (and transactions but I wasn't using them) so some overhead isn't surprising.

I have also tried some other indexing structures but B-Trees have proved to scale better, from situations where there isn't much free memory to 64-bit machines where there is.

Node Loads

The main area of difference between the custom and BDB-backed implementations is in loading speed. They handle RDF node representations differently. Storing them in a BDB database, or JDBM htable, was adequate, giving a load rate of around 12K triples/s but it does generate too many disk writes to disk in an asynchronous pattern. Changing to streaming writes in TDB fixed that. Because all the implementations fit the same framework, this technique can be rolled back into the BDB-implemented code. And BDB supports transactions. The node technique may also help with a SQL database backed system like SDB as well.

I did try Lucene - not a good idea. Loading is too slow, but then that's not what Lucene is designed for.

Testing

For testing, I used the Jena test suite for functional tests and the RDF Store Benchmarks with DBpedia dataset for performance.

TDB works and gives the right results for the queries. (It would be good to have the results published as well as described in the DAWG test suite so testing can be done.)

Query 2 benefits hugely from caching. If run completely cold, after a reboot, it can take up to 30s. Running cold is also a lot more variable on machine sdb1 because other projects use the disk array.

Still room for improvement though. The new index code doesn't quite pack the leaf nodes optimally yet and some more profiling may show up hotspots but for a first pass just getting the benchmark to run is fine. Rewriting queries, as an optimizer should, lowers the execution time for queries 3 and 5 to 0.48s and 1.46s respectively.

The results for query 4 show one possible hotspot. This query churns nodes executing the filters but the node retrieval code does not benefit from co-locality of disk access. Fortunately alternative code for the node table does make co-locality possible and still run almost as fast. Time to get out the profiler.

To illustrate the "RDF VM" effect; when run with Eclipse, Firefox etc all consuming memory, then my home PC is 5-10% slower than when run without them hogging bytes even on a dataset as small as 16 million triples.

First Results for TDB

Machine	sdb1	Home PC
Date	11/02/2008	11/02/2008

Load (seconds)	686.582	726.1
Load (triples/s)	23,478	22,961

Query 1 (seconds)	0.05	0.03
Query 2 (seconds)	1.30	0.73
Query 3 (seconds)	9.87	9.50
Query 4 (seconds)	30.99	35.32
Query 5 (seconds)	29.87	34.24

Breakdown of the sdb1 load:

Loading	Triples	Load time seconds	Load rate Triples/s
Overall	16,120,177	686.582s	23,478
infoboxes	15,472,624	651.543	24.084
geocordinates	447,517	24.084	18,581
homepages	200,036	10.955	18,259

Setup

My home PC is a media centre - quad core, 3Gbyte RAM, consumer grade disks, running Vista and Norton Internet Security anti-virus. I guess it's quicker on the short queries because there is less latency to getting to the disk - even if the disks are slower - but falls behind when the query requires some crunching or a lot of data drawn from the disk.

sdb1 is a machine in a blade rack in the data centre - details below.

(My work's desktop machine, running WindowsXP has various Symantec antivirus, anti-intrusion software components and is slower for database work generally.)

Disk: Data centre disk array over fiber channel.

/proc/cpuinfo (abbrev):

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
TLB size : 1088 4K pages
address sizes : 40 bits physical, 48 bits virtual

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual

/proc/meminfo (abbrev):

MemTotal: 8005276 kB
MemFree: 435836 kB
Buffers: 40772 kB
Cached: 7099840 kB
SwapCached: 0 kB
Active: 1165348 kB
Inactive: 6141392 kB
SwapTotal: 2048276 kB
SwapFree: 2048116 kB
Mapped: 202868 kB

ARQtick

19 February 2008

First time out for TDB (pt 2)

13 February 2008

First time out for TDB

Node Loads

Testing

First Results for TDB

Setup

Blog Archive

Links

About Me