For other work, I've needed storage and indexing components. To test them out, I've built a persistent Jena graph that behaves like an "RDF VM", whereby an application can handle more triples than fit in memory alone, and the store flexes to use and release its cache space depending on the other applications running on the machine. Working name: TDB. It's early days, having only just finished writing the core code, but the core is now working to the point where it can load and query reliably.
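One way to get this kind of behaviour (this is a sketch of the general idea, not necessarily how TDB does it internally) is to memory-map the database files, so the operating system's page cache - shared with every other process on the machine - decides which pages stay resident and which get evicted:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Minimal sketch of memory-mapping a database file: the OS page cache
 *  decides which pages stay in RAM, growing and shrinking the cache as
 *  other applications compete for memory.  Illustrative only. */
class MappedFileSketch {
    static MappedByteBuffer map(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel channel = raf.getChannel()) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```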
The RDF VM uses indexing code (currently, classical B-Trees), but in a way that matches the implementation model of RDF: there is no translation between the index's view of the data and its layout on disk. To check that this made sense, I also tried the same design with the B-Trees replaced by Berkeley DB Java Edition. The BDB version behaves similarly, with a constant slowdown. Of course, BDB-JE is more sophisticated, with variable-sized data items and duplicates (and transactions, though I wasn't using them), so some overhead isn't surprising.
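As a rough illustration of what "no translation" means here (a sketch of the general approach with made-up names, not TDB's actual classes): a triple can be held as three fixed-width node identifiers packed into a single key, so the bytes the B-Tree compares are exactly the bytes stored on disk.

```java
import java.nio.ByteBuffer;

/** Illustrative sketch: a triple as three fixed-width node ids packed
 *  into one 24-byte key, usable directly as a B-Tree key on disk.
 *  Class and field names are hypothetical, not TDB's API. */
final class TripleKey implements Comparable<TripleKey> {
    static final int WIDTH = 3 * Long.BYTES;   // 24 bytes per key
    final long s, p, o;                        // node ids, not node values

    TripleKey(long s, long p, long o) { this.s = s; this.p = p; this.o = o; }

    /** The bytes handed to the index are the same bytes kept on disk:
     *  no translation layer between index entry and storage record. */
    byte[] encode() {
        return ByteBuffer.allocate(WIDTH).putLong(s).putLong(p).putLong(o).array();
    }

    static TripleKey decode(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes);
        return new TripleKey(bb.getLong(), bb.getLong(), bb.getLong());
    }

    /** Ordering by S, then P, then O gives an SPO index; building the key
     *  in a different component order gives POS, OSP and so on. */
    public int compareTo(TripleKey other) {
        int c = Long.compare(s, other.s);
        if (c == 0) c = Long.compare(p, other.p);
        if (c == 0) c = Long.compare(o, other.o);
        return c;
    }
}
```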
I have also tried some other indexing structures but B-Trees have proved to scale better, from situations where there isn't much free memory to 64-bit machines where there is.
The main difference between the custom and BDB-backed implementations is loading speed, and it comes down to how they handle RDF node representations. Storing nodes in a BDB database, or a JDBM htable, was adequate, giving a load rate of around 12K triples/s, but it generates too many writes to disk in an asynchronous pattern. Changing to streaming writes in TDB fixed that. Because all the implementations fit the same framework, this technique can be rolled back into the BDB-backed code, and BDB supports transactions. The node technique may help a SQL-backed system like SDB as well.
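A minimal sketch of the streaming idea as described above (illustrative names, not the real TDB node table): append each encoded node to the node file sequentially and hand back the offset it was written at as its identifier, so bulk loading becomes one long sequential write rather than scattered updates.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

/** Sketch of an append-only node table.  Nodes are written sequentially
 *  and the returned id is the file offset, so loading streams to disk.
 *  Names and layout are illustrative only. */
class NodeFileSketch implements Closeable {
    private final DataOutputStream out;
    private long position = 0;

    NodeFileSketch(File file) throws IOException {
        out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
    }

    /** Append one node (here just its lexical form) and return its id. */
    long write(String nodeLexicalForm) throws IOException {
        byte[] bytes = nodeLexicalForm.getBytes(StandardCharsets.UTF_8);
        long id = position;
        out.writeInt(bytes.length);   // length-prefixed record
        out.write(bytes);
        position += Integer.BYTES + bytes.length;
        return id;
    }

    public void close() throws IOException { out.close(); }
}
```

Reading a node back is then a single seek to its offset.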
I did try Lucene - not a good idea. Loading is too slow, but then that's not what Lucene is designed for.
For testing, I used the Jena test suite for functional tests and the RDF Store Benchmarks with DBpedia dataset for performance.
TDB works and gives the right results for the queries. (It would be good to have the expected results published, as well as described, in DAWG test suite form so testing can be done against them.)
Query 2 benefits hugely from caching. If run completely cold, after a reboot, it can take up to 30s. Running cold is also a lot more variable on machine sdb1 because other projects use the disk array.
There's still room for improvement, though. The new index code doesn't quite pack the leaf nodes optimally yet, and some more profiling may show up hotspots, but for a first pass just getting the benchmark to run is fine. Rewriting the queries, as an optimizer should, lowers the execution times for queries 3 and 5 to 0.48s and 1.46s respectively.
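I'm not showing the specific rewrites here, but as a hypothetical example of the kind of transformation an optimizer performs, the sketch below reorders the triple patterns of a basic graph pattern so the most selective ones run first; the Pattern class and the crude selectivity estimate are made up for illustration and are not ARQ's optimizer.

```java
import java.util.*;

/** Hypothetical sketch of one simple rewrite: execute the triple patterns
 *  with the most bound (non-variable) terms first. */
class ReorderBySelectivity {
    static class Pattern {
        final String s, p, o;
        Pattern(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        /** Crude selectivity estimate: count the ground terms. */
        int boundTerms() {
            int n = 0;
            for (String t : new String[] { s, p, o })
                if (!t.startsWith("?")) n++;
            return n;
        }
    }

    /** More bound terms usually means fewer matches, so run those patterns first. */
    static List<Pattern> reorder(List<Pattern> bgp) {
        List<Pattern> sorted = new ArrayList<>(bgp);
        sorted.sort(Comparator.comparingInt(Pattern::boundTerms).reversed());
        return sorted;
    }
}
```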
The results for query 4 show one possible hotspot. This query churns through nodes while executing the filters, but the node retrieval code does not benefit from co-locality of disk access. Fortunately, alternative code for the node table does make co-locality possible and still runs almost as fast. Time to get out the profiler.
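A sketch of how that co-locality can be exploited (again illustrative, not the actual node table code): collect the node ids a filter needs, sort them by file offset, and read them in ascending order so disk access is roughly sequential rather than random.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.*;

/** Sketch: resolve a batch of node ids (file offsets) in ascending order
 *  so reads move forward through the file.  Assumes the length-prefixed,
 *  append-only node file layout sketched earlier; names are illustrative. */
class BatchNodeReader {
    static Map<Long, String> readAll(RandomAccessFile nodeFile, Collection<Long> ids)
            throws IOException {
        List<Long> sorted = new ArrayList<>(ids);
        Collections.sort(sorted);                 // visit offsets in disk order
        Map<Long, String> result = new HashMap<>();
        for (long id : sorted) {
            nodeFile.seek(id);
            byte[] bytes = new byte[nodeFile.readInt()];
            nodeFile.readFully(bytes);
            result.put(id, new String(bytes, StandardCharsets.UTF_8));
        }
        return result;
    }
}
```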
To illustrate the "RDF VM" effect: when run with Eclipse, Firefox etc. all consuming memory, my home PC is 5-10% slower than when run without them hogging bytes, even on a dataset as small as 16 million triples.
First Results for TDB
| ||sdb1||Home PC|
|Query 1 (seconds)||0.05||0.03|
|Query 2 (seconds)||1.30||0.73|
|Query 3 (seconds)||9.87||9.50|
|Query 4 (seconds)||30.99||35.32|
|Query 5 (seconds)||29.87||34.24|
Breakdown of the sdb1 load:
|Loading||Triples||Load time seconds||Load rate Triples/s|
My home PC is a media centre - quad core, 3GByte RAM, consumer-grade disks, running Vista and Norton Internet Security anti-virus. I guess it's quicker on the short queries because there is less latency in getting to the disk - even if the disks are slower - but it falls behind when the query requires some crunching or a lot of data drawn from the disk.
sdb1 is a machine in a blade rack in the data centre - details below.
(My work's desktop machine, running WindowsXP has various Symantec antivirus, anti-intrusion software components and is slower for database work generally.)
Disk: Data centre disk array over fiber channel.
/proc/cpuinfo (abbrev):
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
TLB size : 1088 4K pages
address sizes : 40 bits physical, 48 bits virtual
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
/proc/meminfo (abbrev):
MemTotal: 8005276 kB
MemFree: 435836 kB
Buffers: 40772 kB
Cached: 7099840 kB
SwapCached: 0 kB
Active: 1165348 kB
Inactive: 6141392 kB
SwapTotal: 2048276 kB
SwapFree: 2048116 kB
Mapped: 202868 kB