12 November 2006

LARQ = Lucene + ARQ

SPARQL is normally thought of as only querying fixed RDF data. At the core of SPARQL are basic graph patterns, and on top of these there is an algebra for building more complex patterns (OPTIONAL, UNION, FILTER, GRAPH).

The key question a basic graph pattern asks is "does this pattern match the graph?" The named variables record how the pattern matches.
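To make the idea of "variables record how the pattern matches" concrete, here is a minimal Python sketch of basic graph pattern matching. It is a toy, not how ARQ works internally: the graph is a plain set of tuples, the data and the names `match_triple` and `match_bgp` are invented for illustration, and any term starting with `?` is treated as a variable.

```python
# Toy illustration only: a "graph" is a set of (subject, predicate, object) tuples.
# Terms starting with "?" are variables. All names and data here are hypothetical.

def match_triple(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None if it cannot."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier, if any.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result

def match_bgp(patterns, graph, binding=None):
    """Yield every variable binding under which all triple patterns match the graph."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        extended = match_triple(first, triple, binding)
        if extended is not None:
            yield from match_bgp(rest, graph, extended)

graph = {
    ("doc1", "dc:title", "Free text search in RDF"),
    ("doc1", "dc:creator", "alice"),
    ("doc2", "dc:title", "Graph patterns"),
}

pattern = [("?doc", "dc:title", "?title"), ("?doc", "dc:creator", "alice")]
solutions = list(match_bgp(pattern, graph))
```

Each solution is one way the pattern fits the data; with this sample graph only `doc1` has both a title and the creator `alice`.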

Not all information needs to be in the raw data. ARQ property functions are a way to let the application add some relationships to be computed at query execution time.
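As a rough sketch of that idea (in Python, and not ARQ's actual Java interface), a query engine can check whether a predicate is registered as a computed relationship before looking in the stored data. The predicate `ex:upperCase`, the registry and the helper names are all hypothetical.

```python
# Toy illustration only: the registry, predicate names and data are hypothetical,
# not ARQ's property function API.

graph = {
    ("doc1", "dc:title", "free text"),
    ("doc2", "dc:title", "graph patterns"),
}

def upper_case(graph):
    """Computed relationship: relate each literal in the graph to its upper-cased form."""
    for _, _, obj in graph:
        yield obj, obj.upper()

# Predicates listed here are computed at query time instead of read from the data.
PROPERTY_FUNCTIONS = {"ex:upperCase": upper_case}

def pairs_for(predicate, graph):
    """Return (subject, object) pairs for a predicate, whether computed or stored."""
    if predicate in PROPERTY_FUNCTIONS:
        return sorted(set(PROPERTY_FUNCTIONS[predicate](graph)))
    return sorted((s, o) for s, p, o in graph if p == predicate)

stored = pairs_for("dc:title", graph)        # read from the data
computed = pairs_for("ex:upperCase", graph)  # produced by the function
```

From the query's point of view both predicates look the same; only the second has no triples behind it.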

LARQ adds free text search. The real work is done by Lucene. LARQ adds ways to create a Lucene index from RDF data and a property function to perform free text matching in a SPARQL query.

Example: find all the string literals that match '+keyword'

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>

SELECT *
  { ?lit pf:textMatch '+keyword' }

Any simple or complex Lucene query string can be used.

LARQ provides utilities to index string literals. Because the literal itself can be stored in the index as well, a query can find the subjects that have some property value matching the free text search.

So to find all the documents that have titles matching some free text search:

PREFIX pf: <java:com.hp.hpl.jena.query.pfunction.library.>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
  
SELECT ?doc {
    ?lit pf:textMatch '+text' .
    ?doc dc:title ?lit
  }
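The two steps in that query (match literals by text, then join back to the subjects that carry them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a hand-rolled inverted index stands in for Lucene, every query term is treated as required (so only the `+term` form is modelled), titles are assumed to live under `dc:title`, and the data and function names are invented.

```python
import re
from collections import defaultdict

# Toy illustration only: a tiny inverted index stands in for Lucene.
# Data, tokenizer and function names are hypothetical.
graph = [
    ("doc1", "dc:title", "Free text search for RDF"),
    ("doc2", "dc:title", "Basic graph patterns"),
    ("doc3", "rdfs:comment", "Plain text notes"),
]

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Index step: map each term to the set of literals that contain it.
index = defaultdict(set)
for _, _, literal in graph:
    for term in tokenize(literal):
        index[term].add(literal)

def text_match(query):
    """Return the literals containing every term in the query (all terms required)."""
    terms = tokenize(query)  # the tokenizer also drops the leading '+'
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits

def docs_with_title_matching(query):
    """Join step: subjects whose dc:title value is one of the matched literals."""
    literals = text_match(query)
    return sorted(s for s, p, o in graph if p == "dc:title" and o in literals)

matching_docs = docs_with_title_matching("+text")
```

Here `doc3` also contains the word "text", but in a comment rather than a title, so the join on the title property leaves it out.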

More details are in the ARQ documentation for LARQ.

This will be in ARQ 1.5 but is available from ARQ CVS now. Hopefully, this will be useful to users and application writers. Comments and feedback on the design are welcome, especially before the next ARQ release.

3 comments:

Danny said...

Wow!

I'm now half-asleep, so I'll just paste from a comment I made over at Shelley Powers' the other day.
[[
...I'd installed Longwell, the facetted browser, curious to see whether it'd be useful for my blog data. Longwell will eat any RDF files you dump in the appropriate directory. On a whim I collected a few random files from the web, including one about famous people. I'd glanced at the source and knew there was an entry for Beethoven. So I did a (plain text) search for the guy in Longwell. Sure enough it picked up his bio material. But in the same results I also had a blog post I'd forgotten about, pointing to some audio files of his symphonies.
...
Dunno, it thrilled me…
]]
http://burningbird.net/technology/semanticweb/deja-data

Hany Azzam said...

It will be interesting to see how RDF data rich in free text can help in returning a meaningful ranked list of results from a SPARQL query.

AndyS said...

The closest LARQ gives you is that the results from Lucene come back in score order, and the application can set the minimum score and/or a limit on the number of results returned.

These are just building blocks for combining free text and other information into an overall ranking.