This article could be subtitled "Good I/O and Bad I/O". By arranging to use good I/O, the new TDB loader achieves faster loading rates despite writing more data to disk. "Good I/O" means file operations that happen in a buffered, streaming fashion. "Bad I/O" means file operations that make the disk heads jump about randomly, or that work in small units of disk transfer.
The new TDB loader "loader2" is a standalone program that bulk loads data into a new TDB database. It does not support incremental loading, and it may destroy existing data. It has only been tested on Linux; it should run on Windows with Cygwin, but the performance there is hard to predict.
Figures demonstrating the loader in action for various large datasets are in a separate blog entry. It is faster than the current loader for datasets over about 1 million triples and comes into its own above 100 million triples.
Like the current bulk loader ("loader1"), loader2 can load triple and quad RDF formats, including gzipped files. It runs fastest from N-Triples or N-Quads because the parser for these formats is fastest and has the lowest overhead.
The loader is a shell script that coordinates the various phases. It's available in the TDB development code repository in bin/tdbloader2 and the current 0.8.8 snapshot build.
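For orientation, a run looks something like the sketch below; the database location and file names are examples, and the exact options may vary between builds, so check the script itself.

```
# Bulk load two gzipped N-Triples files into a fresh database directory.
# The location should be empty or new: loader2 is not incremental and
# may destroy existing data.
bin/tdbloader2 --loc /data/DB1 dump-part1.nt.gz dump-part2.nt.gz
```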
Loader2 is based on the observation that the speed of loader1 can drop sharply as the memory-mapped files fill up RAM (the "can" is because it does not always happen; slightly weird). This fall-off is steeper than one would expect simply from having to use some disk, and sometimes the rate of loader1 becomes erratic. This could be due to the OS and its management of memory-mapped files, but the effect is that the secondary index creation can become rather slow. Loader1 tends to do "bad I/O": as the caches fill up, blocks are written back in what, to the disk, looks like a random order, causing the disk heads to jump around.
Copying from the primary index to a secondary index involves a sort, because TDB uses B+Trees for its triple and quad indexing. A B+Tree keeps its records in sorted order, and each index holds the tuples in a different order (for triples: SPO, POS and OSP).
Loader1's approach is much faster than simply building all the indexes at once, because in that case the limited RAM is spread across caching parts of every index. It is better to build one index at a time, using the RAM to cache just that index.
Loader2 similarly has a data loading phase and an index creation phase.
The first phase is to build the node table and write out the data for index building. Loader2 takes the stream of triples and quads from the parser and writes the RDF terms (IRI, literal, blank node) into the internal node table. It also writes out text files of tuples of NodeIds (the internal 64-bit number used to identify each RDF term). This is "good I/O": the writes of the tuple files are buffered up and the files are written append-only. This phase is a Java program, which exits after the node table and working files have been written.
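Purely for illustration (the file name and NodeId values here are made up), a couple of lines of one of these working files would look like this, one tuple per line, each term replaced by its NodeId as 16 hex digits:

```
# data-triples.tmp (illustrative): subject, predicate, object as fixed-width hex NodeIds
0000000000001A2B 000000000000002F 0000000000004C01
0000000000001A2B 0000000000000030 0000000000007E44
```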
The next phase is to produce the indexes, including the primary index. Unlike loader1, loader2 does not write the primary index during node loading. Experimentation showed it was quicker to do it separately despite needing more I/O. This is slightly strange.
To build the indexes, loader2 uses the B+Tree rebuilder, which requires the data in index-sorted order. Index rebuilding is therefore a sort followed by B+Tree building. The sort is done by Unix sort, which is very easy to use and scales smoothly from a few lines to gigabytes of data. Having written the tuple data out as text files in the first phase (and as fixed-width hex numbers at that - quite wasteful), Unix sort can do a plain text sort on the files. That means a lot of I/O, but it is good I/O, and the sort program really knows how best to manage its temporary files.
For each index, a Unix sort is done to get a temporary file of tuple data in the right sort order. The B+Tree rebuilder is called with this file as the stream of sorted data it needs to create an index.
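As a concrete sketch of one such step (file names are illustrative, and the rebuilder is shown only as a placeholder rather than its real command name), building a POS index from the triples working file might look like:

```
# Sort the triples working file into predicate, object, subject order.
# With fixed-width hex, a plain byte-order text sort (LC_ALL=C) gives the
# same ordering as sorting the NodeIds numerically.
LC_ALL=C sort -k2,2 -k3,3 -k1,1 data-triples.tmp > POS.sorted

# Feed the sorted stream to the B+Tree rebuilder, which writes the index
# bottom-up with sequential ("good") I/O.
"$REBUILDER" POS < POS.sorted    # $REBUILDER is a placeholder, not the real program name
```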
There are still opportunities to tune the new loader, for example to see whether piping the output of the sorts directly into the rebuilder is better or worse than the current two-step approach that goes via temporary files. Using different disks for different temporary files should also help.
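Those two ideas, again as hedged sketches rather than what the script currently does:

```
# Variant 1: pipe the sort straight into the rebuilder, skipping the
# intermediate POS.sorted file.
LC_ALL=C sort -k2,2 -k3,3 -k1,1 data-triples.tmp | "$REBUILDER" POS

# Variant 2: keep the two-step approach but put sort's temporary files
# on a different disk from the database and working files.
LC_ALL=C sort -T /scratch -k2,2 -k3,3 -k1,1 data-triples.tmp > POS.sorted
```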
The index building phase is parallelisable. Because I/O and memory usage are the bottlenecks, not CPU cycles, the crossover point for this to become effective might be quite high.
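At the shell level, parallelising it is little more than backgrounding the per-index builds; build_index here is a hypothetical helper standing in for the sort-plus-rebuild step:

```
# Run the secondary index builds concurrently; each one sorts and rebuilds
# independently, so the contention is for disk bandwidth and RAM, not CPU.
for idx in POS OSP; do
    build_index "$idx" &    # hypothetical helper: sort + B+Tree rebuild for one index
done
wait
```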
To find out whether loader2 is better than loader1, I've run a number of tests: load and query tests with the Berlin SPARQL Benchmark (2009 version), a load test on the RDF version of COINS (the UK Treasury's Combined Online Information System - about 420 million quads, and it's real data), and a load test using the Lehigh University Benchmark with some inferencing. Details, figures and tables are in the next article.
2 comments:
Hi Andy,
This is somewhat similar to what the 4s-import loader does, though it's not completely tuned for bulk loading, so it works in 5MT stripes. We don't have a dedicated bulk loader anymore; it wasn't enough better than the 4s-import one to justify maintaining it (maybe 5% faster).
We load 5MT-worth of resource data (ID -> resource maps), then 5MT of quads gets sent to the indexes. Using the same trick though of stashing the quad data, then sorting it. We use binary data, mmap() and qsort() IIRC, but the idea is the same.
The various indexes then do their own internal sort to place the data in logical disk order, to get more write locality before writing. I doubt that matters so much on 2010 hardware though.
It was done that way because when we wrote that loader, the normal situation would be that we would get at most 10% of the indexes mapped into RAM during normal operation.
5store does something very different, but it's optimised for vastly more powerful hardware.
At the time 4s-import was tuned (2006ish) our data/user ratio was pretty high, so the key economic factor was triples/machine, but now we're much more focussed on users/second, so we make sure 100% of the index is in RAM during normal operation.
- Steve
This looks very interesting. I have been trying to load a very large dataset (~3B triples) and I have been observing the sharp fall off when the memmapped files start to fill up. This usually happens around the 350,000,000 statement mark and the load performance drops quickly from an average of about 70k TPS to about 3-6k TPS.
I'm now going to have to take loader2 for a spin.
Ryan-
http://damnhandy.com