19 February 2008

First time out for TDB (pt 2)

Follow-on from previous testing: a larger load of 100 million triples from UniProt 7.0a performed as follows on the SDB1 machine:
  • 115,517,840 triples
  • 3903s (which is 29594 triples/s) -- or about 65 minutes

13 February 2008

First time out for TDB

In other work, I've needed with a storage and indexing components.  To test them out, I've built a persistent Jena graph that behaves like an "RDF VM" whereby an application can handle more triples than memory alone, and it flexes to use and release its cache space based on the other applications running on the machine.  Working name: TDB. Early days, having only just finished writing the core code, but the core is now working to the point where it can load and query reliably.

The RDF VM uses indexing code (currently, classical B-Trees) but in a way that matches the model of implementation of RDF.  There is no translation between the indexing and the disk idea of data. To check that made sense, I also tried with the B-Trees replaced by Berkeley DB Java Edition. The BDB version behaves similarly with a constant slowdown. Of course, BDB-JE is more sophisticated with variable sized data items and duplicates (and transactions but I wasn't using them) so some overhead isn't surprising.

I have also tried some other indexing structures but B-Trees have proved to scale better, from situations where there isn't much free memory to 64-bit machines where there is.

Node Loads

The main area of difference between the custom and BDB-backed implementations is in loading speed.  They handle RDF node representations differently. Storing them in a BDB database, or JDBM htable, was adequate, giving a load rate of around 12K triples/s but it does generate too many disk writes to disk in an asynchronous pattern. Changing to streaming writes in TDB fixed that.  Because all the implementations fit the same framework, this technique can be rolled back into the BDB-implemented code. And BDB supports transactions.  The node technique may also help with a SQL database backed system like SDB as well.

I did try Lucene - not a good idea.  Loading is too slow, but then that's not what Lucene is designed for.

Testing

For testing, I used the Jena test suite for functional tests and the RDF Store Benchmarks with DBpedia dataset for performance.

TDB works and gives the right results for the queries. (It would be good to have the results published as well as described in the DAWG test suite so testing can be done.)

Query 2 benefits hugely from caching.  If run completely cold, after a reboot, it can take up to 30s. Running cold is also a lot more variable on machine sdb1 because other projects use the disk array.

Still room for improvement though. The new index code doesn't quite pack the leaf nodes optimally yet and some more profiling may show up hotspots but for a first pass just getting the benchmark to run is fine.  Rewriting queries, as an optimizer should, lowers the execution time for queries 3 and 5 to 0.48s and 1.46s respectively. 

The results for query 4 show one possible hotspot.  This query churns nodes executing the filters but the node retrieval code does not benefit from co-locality of disk access.  Fortunately alternative code for the node table does make co-locality possible and still run almost as fast. Time to get out the profiler.

To illustrate the "RDF VM" effect; when run with Eclipse, Firefox etc all consuming memory, then my home PC is 5-10% slower than when run without them hogging bytes even on a dataset as small as 16 million triples.

First Results for TDB

Machine sdb1 Home PC
Date 11/02/2008 11/02/2008
     
Load (seconds) 686.582 726.1
Load (triples/s) 23,478 22,961
     
Query 1 (seconds) 0.05 0.03
Query 2 (seconds) 1.30 0.73
Query 3 (seconds) 9.87 9.50
Query 4 (seconds) 30.99 35.32
Query 5 (seconds) 29.87 34.24

Breakdown of the sdb1 load:

Loading Triples Load time seconds Load rate Triples/s
Overall 16,120,177

686.582s

23,478
infoboxes 15,472,624 651.543 24.084
geocordinates 447,517 24.084 18,581
homepages 200,036 10.955 18,259

Setup

My home PC is a media centre - quad core, 3Gbyte RAM, consumer grade disks, running Vista and Norton Internet Security anti-virus. I guess it's quicker on the short queries because there is less latency to getting to the disk - even if the disks are slower - but falls behind when the query requires some crunching or a lot of data drawn from the disk.

sdb1 is a machine in a blade rack in the data centre - details below.

(My work's desktop machine, running WindowsXP has various Symantec antivirus, anti-intrusion software components and is slower for database work generally.)

Disk: Data centre disk array over fiber channel.

/proc/cpuinfo (abbrev):

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
TLB size : 1088 4K pages
address sizes : 40 bits physical, 48 bits virtual

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
/proc/meminfo (abbrev):

MemTotal: 8005276 kB
MemFree: 435836 kB
Buffers: 40772 kB
Cached: 7099840 kB
SwapCached: 0 kB
Active: 1165348 kB
Inactive: 6141392 kB
SwapTotal: 2048276 kB
SwapFree: 2048116 kB
Mapped: 202868 kB