08 June 2008

TDB: Loading UniProt

TDB passed a milestone this week - a load of the complete UniProt V13.3 dataset.

UniProt V13.3 is 1,755,773,303 triples (1.7 billion) of which 1,516,036,125 are unique after duplicate suppression.
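As a quick check on those two figures, the number of duplicates and their share of the input can be worked out directly. This is only arithmetic on the counts quoted above; the variable names are mine.

```python
# Triple counts quoted above for UniProt V13.3.
total_triples = 1_755_773_303
unique_triples = 1_516_036_125

# Triples dropped by duplicate suppression, and their share of the input.
duplicates = total_triples - unique_triples
duplicate_share = duplicates / total_triples

print(f"{duplicates:,} duplicates ({duplicate_share:.1%} of the input)")
```

So roughly one input triple in seven is a repeat of one already seen.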

This dataset is interesting in a variety of ways. Firstly, it's quite large. Secondly, it is the composite of a small number of different, related databases and has some large literals (complete protein sequences - some over 70k characters in a single literal) as well as the full text of many abstracts. (LUBM doesn't have literals at all. Testing using both synthetic data and real-world data is necessary.)

UniProt comes as a number of RDF/XML files. These had already been checked before loading by parsing them to N-Triples, which serves as a sort of dump format. The Jena RDF/XML parser does extensive checking, and the data had some bad URIs. I find that most large datasets throw up some warnings on URIs.

TDB also does value-based storage for the XSD datatypes decimal, integer, date, and dateTime. Except there aren't very many such values. For example, there are just 18 occurrences of the value "1", in any form, in the entire dataset. They are just xsd:ints in some cardinality constraints. I was a bit surprised by this. Given the size of the dataset, I expected none or lots of uses of the value 1, so I grepped the input data to check. It's much, much quicker to use SPARQL than to run grep on 1.7 billion triples in gzip'ed files, but working with N-Triples makes it easy to produce small tools that you can be sure work. And indeed they are the only 1 values. Trust the SPARQL query next time.
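A small checking tool of that kind could look something like the sketch below. It is not the tool used for this post, just an illustration of why the line-per-triple layout of N-Triples makes such checks easy. It assumes one triple per line, counts only typed literals in object position, and uses a list of numeric XSD datatypes that I picked for the example; the function names are hypothetical.

```python
import gzip
import re
from decimal import Decimal, InvalidOperation

# A typed literal in object position at the end of an N-Triples line:
#   "lexical form"^^<datatype IRI> .
TYPED_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"\^\^<([^>]*)>\s*\.\s*$')

XSD = "http://www.w3.org/2001/XMLSchema#"
# Assumed set of numeric datatypes to inspect; extend as needed.
NUMERIC_TYPES = {
    XSD + name
    for name in ("int", "integer", "long", "short", "byte", "decimal",
                 "nonNegativeInteger", "positiveInteger")
}


def is_numeric_one(line):
    """True if the line's object is a numeric typed literal with value 1."""
    match = TYPED_LITERAL.search(line)
    if match is None or match.group(2) not in NUMERIC_TYPES:
        return False
    try:
        # Compare by value, so "1", "01", "+1" and "1.0" all count.
        return Decimal(match.group(1).strip()) == 1
    except InvalidOperation:
        # Malformed lexical form: not a usable number.
        return False


def count_ones(path):
    """Count matching triples in a gzip'ed N-Triples file, streaming line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        return sum(1 for line in lines if is_numeric_one(line))
```

Calling `count_ones` on each gzip'ed N-Triples file and summing the results gives a figure to compare against the SPARQL answer. Comparing by value rather than by string is the same idea as value-based storage: different lexical forms of 1 are treated as the same value.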

3 comments:

Hany Azzam said...

I don't understand why you said that LUBM doesn't have any literals. I am pretty sure it does, but they are not meaningful. For example, for phone numbers you will find something like xx-xx-xx. Is that what you meant?

Hany Azzam said...

I forgot to mention something. I commented before on one of your posts about LARQ and said that it would be interesting to see if there is a possibility to incorporate some meaningful form of ranking with the free-text search in SPARQL. I think UniProt data can help in carrying out such an investigation.

AndyS said...

Hany - you're right, there are a few literals. There is one, repeated, for phone numbers, and the email addresses are literals (I think they should be <mailto:> URIs). All the literals are about the same length as the URIs, so they don't test my loading process any differently.