13 February 2008

First time out for TDB

In other work, I've needed with a storage and indexing components.  To test them out, I've built a persistent Jena graph that behaves like an "RDF VM" whereby an application can handle more triples than memory alone, and it flexes to use and release its cache space based on the other applications running on the machine.  Working name: TDB. Early days, having only just finished writing the core code, but the core is now working to the point where it can load and query reliably.

The RDF VM uses indexing code (currently, classical B-Trees) but in a way that matches the model of implementation of RDF.  There is no translation between the indexing and the disk idea of data. To check that made sense, I also tried with the B-Trees replaced by Berkeley DB Java Edition. The BDB version behaves similarly with a constant slowdown. Of course, BDB-JE is more sophisticated with variable sized data items and duplicates (and transactions but I wasn't using them) so some overhead isn't surprising.

I have also tried some other indexing structures but B-Trees have proved to scale better, from situations where there isn't much free memory to 64-bit machines where there is.

Node Loads

The main area of difference between the custom and BDB-backed implementations is in loading speed.  They handle RDF node representations differently. Storing them in a BDB database, or JDBM htable, was adequate, giving a load rate of around 12K triples/s but it does generate too many disk writes to disk in an asynchronous pattern. Changing to streaming writes in TDB fixed that.  Because all the implementations fit the same framework, this technique can be rolled back into the BDB-implemented code. And BDB supports transactions.  The node technique may also help with a SQL database backed system like SDB as well.

I did try Lucene - not a good idea.  Loading is too slow, but then that's not what Lucene is designed for.

Testing

For testing, I used the Jena test suite for functional tests and the RDF Store Benchmarks with DBpedia dataset for performance.

TDB works and gives the right results for the queries. (It would be good to have the results published as well as described in the DAWG test suite so testing can be done.)

Query 2 benefits hugely from caching.  If run completely cold, after a reboot, it can take up to 30s. Running cold is also a lot more variable on machine sdb1 because other projects use the disk array.

Still room for improvement though. The new index code doesn't quite pack the leaf nodes optimally yet and some more profiling may show up hotspots but for a first pass just getting the benchmark to run is fine.  Rewriting queries, as an optimizer should, lowers the execution time for queries 3 and 5 to 0.48s and 1.46s respectively. 

The results for query 4 show one possible hotspot.  This query churns nodes executing the filters but the node retrieval code does not benefit from co-locality of disk access.  Fortunately alternative code for the node table does make co-locality possible and still run almost as fast. Time to get out the profiler.

To illustrate the "RDF VM" effect; when run with Eclipse, Firefox etc all consuming memory, then my home PC is 5-10% slower than when run without them hogging bytes even on a dataset as small as 16 million triples.

First Results for TDB

Machine sdb1 Home PC
Date 11/02/2008 11/02/2008
     
Load (seconds) 686.582 726.1
Load (triples/s) 23,478 22,961
     
Query 1 (seconds) 0.05 0.03
Query 2 (seconds) 1.30 0.73
Query 3 (seconds) 9.87 9.50
Query 4 (seconds) 30.99 35.32
Query 5 (seconds) 29.87 34.24

Breakdown of the sdb1 load:

Loading Triples Load time seconds Load rate Triples/s
Overall 16,120,177

686.582s

23,478
infoboxes 15,472,624 651.543 24.084
geocordinates 447,517 24.084 18,581
homepages 200,036 10.955 18,259

Setup

My home PC is a media centre - quad core, 3Gbyte RAM, consumer grade disks, running Vista and Norton Internet Security anti-virus. I guess it's quicker on the short queries because there is less latency to getting to the disk - even if the disks are slower - but falls behind when the query requires some crunching or a lot of data drawn from the disk.

sdb1 is a machine in a blade rack in the data centre - details below.

(My work's desktop machine, running WindowsXP has various Symantec antivirus, anti-intrusion software components and is slower for database work generally.)

Disk: Data centre disk array over fiber channel.

/proc/cpuinfo (abbrev):

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
TLB size : 1088 4K pages
address sizes : 40 bits physical, 48 bits virtual

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 1804.121
cache size : 1024 KB
fpu : yes
fpu_exception : yes
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
/proc/meminfo (abbrev):

MemTotal: 8005276 kB
MemFree: 435836 kB
Buffers: 40772 kB
Cached: 7099840 kB
SwapCached: 0 kB
Active: 1165348 kB
Inactive: 6141392 kB
SwapTotal: 2048276 kB
SwapFree: 2048116 kB
Mapped: 202868 kB

6 comments:

Sameer said...

I am curious about the BTree node structure you are using. Do you intend to explain that anytime in future? I would love to know more.

Last year, I implemented my own Btree based RDF store - maybe we can share our findings.

Sameer.

AndyS said...

sameer - yes, let's share findings. Have you written anything up? Or drop me an email or something and we can chat or skype.

The structure I've used up till now is a node table (which happens to also be a btree but I'd like to try a different structure for that because it does not require the full sortedness that a btree gives), and a triple (or quad) table. NodeIds are 64bit numbers directly tied to their location in the disk storage.

There is no triple table as such - all the information is in the indexes so nothing needs to be stored as the triple table. For triples, up to 3 indexes (SPO, POS, OSP) each of which is 3 longs do fixed length, hence easy to use in btrees

There are some engineering-level improvements that can be made to increase the packing density of the datastructures (a sort of b+ / b-tree cross).

But there are also some structural designs I think offer very good prospects (they do on paper at least).

Brian said...

Hi Andy. Downloaded and took a look at TBD tonight. I kept thinking incorporating the Hadoop Distributed File System might be a good fit for TBD, helping scaling out large datasets.

let's image for a moment TDB ran over HDFS. What's your impression on how much of a boost this may give TBD?

AndyS said...

Hi Brian,

HDFS is certainly interesting and several people have looked at using it with Jena and/or TDB. TDB relies on indexing rather than streaming data and there are classes of query that streaming can be interesting (aggregation, extracting the data of interest from a large dataset into a graph to be queried).

Other groups are looking at SPARQL+HADOOP and SPARQL+HBASE.

Brian said...

Thanks,

Haven't seen much from any groups looking at SPARQL+HADOOP and SPARQL+HBASE other than heart. Doesn't seem like that project has much momentum at the moment.

http://rdf-proj.blogspot.com/2008/11/i-have-started-to-work-with-hbase.html

http://heart.korea.ac.kr/trac/

Ahmad said...

Brian and Andy,

I am also interested using SPARQL over hadoop. Other than heart project, I have seen two other contributions [1][2] but yet to find open source implementation to run customized experiments.



[1] https://opencirrus.org/content/sparql-query-over-hadoop-very-large-rdf-dataset

[2] www.biomanta.org/