29 October 2008

Walking the Web

It's nice to see Freebase providing an RDF interface:  http://rdf.freebase.com/. The example they give is <http://rdf.freebase.com/ns/en.blade_runner> so let's see what is actually there and how we might use the information.

Each graph describing something contains Freebase URLs to be explored.  What we want is the ability to load data into our local store while some query is running, enabling the dataset to be enlarged as the query makes choices about how to proceed.

This is similar to cwm's log:semantics: http://www.w3.org/2000/10/swap/doc/Reach

In SPARQL, the dataset is fixed before the query starts. That's no good if you want to write a graph-walking process without some glue in your favourite programming language. In a sense this is scripting for the web, but of a particular kind: it's not a sequence of queries and updates; it's changing the collection of graphs, expanding the RDF dataset known to the application.
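To make the "glue" concrete, here is a minimal Python sketch of what a hand-written graph walk might look like: a dataset of named graphs that grows as the walk decides which URLs to load next. Everything in it is an assumption for illustration: the ex: names, the FAKE_WEB dictionary standing in for the web, and the helpers fetch, load, objects and walk_actors. A real version would dereference URLs over HTTP and parse RDF.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]
Graph = Set[Triple]

# Stand-in for the web: URL -> triples served at that URL (made-up data).
FAKE_WEB: Dict[str, Graph] = {
    "ex:film": {("ex:film", "ex:starring", "ex:perf1"),
                ("ex:film", "ex:starring", "ex:perf2")},
    "ex:perf1": {("ex:perf1", "ex:actor", "ex:alice")},
    "ex:perf2": {("ex:perf2", "ex:actor", "ex:bob")},
}


def fetch(url: str) -> Graph:
    """Pretend to dereference a URL; unknown URLs yield an empty graph."""
    return set(FAKE_WEB.get(url, set()))


def load(dataset: Dict[str, Graph], url: str) -> Graph:
    """Add the graph at `url` to the dataset as a named graph (once) and return it."""
    if url not in dataset:
        dataset[url] = fetch(url)
    return dataset[url]


def objects(graph: Graph, subject: str, predicate: str) -> Set[str]:
    """All ?o such that (subject, predicate, ?o) is in `graph`."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}


def walk_actors(start: str) -> Tuple[Set[str], Dict[str, Graph]]:
    """Follow ex:starring links from `start`, loading each linked graph on the way."""
    dataset: Dict[str, Graph] = {}
    actors: Set[str] = set()
    film = load(dataset, start)
    for perf in objects(film, start, "ex:starring"):
        perf_graph = load(dataset, perf)  # the dataset grows mid-walk
        actors |= objects(perf_graph, perf, "ex:actor")
    return actors, dataset


actors, dataset = walk_actors("ex:film")
print(sorted(actors))   # ['ex:alice', 'ex:bob']
print(sorted(dataset))  # ['ex:film', 'ex:perf1', 'ex:perf2']
```

The point of the sketch is the shape of the loop: each step looks at what has been loaded so far and uses it to decide what to load next, which is exactly the part a fixed dataset cannot express.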

Query 1 : See what's in the graph

Let's first look at what's available at the example URL.  That does not require anything special: it's just a FROM clause (which in ARQ will content-negotiate for RDF; if you use a web browser you will see an HTML page):

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT *
FROM fb:en.blade_runner
{ ?s ?p ?o }

Hmm - 294 triples.

Query 2 : Look for interesting properties

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?p
FROM fb:en.blade_runner
{
  ?s ?p ?o
}

62 distinct properties used.  fb:film.film.starring looks interesting.

Query 3 : Follow the links

As an experimental feature, consider a new SPARQL keyword "FETCH" which takes a URL, or a variable bound to a URL by the time that part of the query is reached, and fetches the graph at that location.

Now we fetch the documents at each of the URLs that are objects of the fb:film.film.starring triples whose subject is fb:en.blade_runner.

FETCH loads the graph and places it in the dataset as a named graph, the name being the URL it was fetched from. We use GRAPH to access the loaded graph. Done this way, triples from different sources are kept separate, which might be important in deciding what sources to believe.

This also shows a critical limitation: placing the data in a named graph is only the basic requirement for deciding what to believe. Really there ought to be a lot more metadata about the graph, including when it was read and possibly why it was read (how we got here in the query), and so on. But we are not an agent system so we will note this and move on.
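For the curious, here is a rough Python sketch of what such per-graph metadata could look like if it were recorded at load time. This is not something FETCH does; the LoadedGraph class, its field names and the record helper are hypothetical, and the ex: names are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class LoadedGraph:
    triples: Set[Triple]
    fetched_at: datetime          # when the graph was read
    reached_from: Optional[str]   # graph that led us here (None for the start)
    via_property: Optional[str]   # property that was followed to get here


def record(dataset: Dict[str, LoadedGraph], url: str, triples: Iterable[Triple],
           reached_from: Optional[str] = None,
           via_property: Optional[str] = None) -> LoadedGraph:
    """Store a graph under its URL with provenance; the first load wins."""
    if url not in dataset:
        dataset[url] = LoadedGraph(set(triples), datetime.now(timezone.utc),
                                   reached_from, via_property)
    return dataset[url]


dataset: Dict[str, LoadedGraph] = {}
record(dataset, "ex:film", {("ex:film", "ex:starring", "ex:perf1")})
entry = record(dataset, "ex:perf1", {("ex:perf1", "ex:actor", "ex:alice")},
               reached_from="ex:film", via_property="ex:starring")
print(entry.reached_from, entry.via_property)  # ex:film ex:starring
```

Keeping only the first record for a URL is one arbitrary choice; a graph reached by several routes might deserve a list of routes instead.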

Poking around with GRAPH ?personUUID { ?s ?p ?o } (60 triples) turns up the property film.performance.actor, which looks hopeful.

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?actor
FROM fb:en.blade_runner
{
  fb:en.blade_runner fb:film.film.starring ?personUUID
  FETCH ?personUUID
  GRAPH ?personUUID
    { ?personUUID fb:film.performance.actor ?actor }
}

12 results.

--------------------------------------------
| actor                                    |
============================================
| fb:en.james_hong                         |
| fb:en.brion_james                        |
| fb:en.edward_james_olmos                 |
| fb:en.joanna_cassidy                     |
| fb:en.william_sanderson                  |
| fb:en.rutger_hauer                       |
| fb:authority.netflix.role.20000077       |
| fb:guid.9202a8c04000641f80000000054cbccc |
| fb:en.sean_young                         |
| fb:en.joe_turkel                         |
| fb:en.harrison_ford                      |
| fb:en.daryl_hannah                       |
--------------------------------------------

and more URLs to follow.

Looking in the next graph, there is fb:type.object.name so let's guess and use that.  But each time we chose a property we didn't have to guess; we could have followed the property URL itself:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT *
FROM fb:type.object.name
{
  ?s ?p ?o
}

but it's easier to read the description in HTML (and Freebase follows links internally to build the page).

Query 4 : The names of actors in Blade Runner

So a query to find the names of actors in "Blade Runner" is:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?actor ?name
FROM fb:en.blade_runner
{
  fb:en.blade_runner fb:film.film.starring ?personUUID
  FETCH ?personUUID
  GRAPH ?personUUID
    { ?personUUID fb:film.performance.actor ?actor }
  FETCH ?actor
  GRAPH ?actor
    { ?actor fb:type.object.name ?name }
}
ORDER BY ?actor

which gives:

-------------------------------------------------------------------
| actor                                    | name                 |
===================================================================
| fb:authority.netflix.role.20000077       | "M. Emmet Walsh"     |
| fb:authority.netflix.role.20000077       | "M・エメット・ウォルシュ"    |
| fb:en.brion_james                        | "Brion James"        |
| fb:en.daryl_hannah                       | "Daryl Hannah"       |
| fb:en.daryl_hannah                       | "Ханна, Дэрил"       |
| fb:en.daryl_hannah                       | "דריל האנה"          |
| fb:en.daryl_hannah                       | "ダリル・ハンナ"         |
| fb:en.edward_james_olmos                 | "Edward James Olmos" |
| fb:en.harrison_ford                      | "Harrison Ford"      |
| fb:en.harrison_ford                      | "Форд Гаррісон"      |
| fb:en.harrison_ford                      | "Форд, Харрисон"     |
| fb:en.harrison_ford                      | "Харисон Форд"       |
| fb:en.harrison_ford                      | "Харисън Форд"       |
| fb:en.harrison_ford                      | "האריסון פורד"       |
| fb:en.harrison_ford                      | "ハリソン・フォード"       |
| fb:en.harrison_ford                      | "哈里森·福特"         |
| fb:en.harrison_ford                      | "해리슨 포드"          |
| fb:en.james_hong                         | "James Hong"         |
| fb:en.joanna_cassidy                     | "Joanna Cassidy"     |
| fb:en.joe_turkel                         | "Joe Turkel"         |
| fb:en.rutger_hauer                       | "Rutger Hauer"       |
| fb:en.rutger_hauer                       | "Хауэр, Рутгер"      |
| fb:en.rutger_hauer                       | "ルトガー・ハウアー"      |
| fb:en.rutger_hauer                       | "魯格·豪爾"           |
| fb:en.sean_young                         | "Sean Young"         |
| fb:en.sean_young                         | "Шон Йънг"           |
| fb:en.sean_young                         | "Янг, Шон"           |
| fb:en.sean_young                         | "ショーン・ヤング"        |
| fb:en.william_sanderson                  | "William Sanderson"  |
| fb:guid.9202a8c04000641f80000000054cbccc | "Morgan Paull"       |
-------------------------------------------------------------------

 

We are left with a question: why use (extended) SPARQL? If you're doing it once, then a web browser is easier. After all, I used one to choose the properties to follow.

But a query can be sent to someone else so they can reuse your knowledge; you can rerun it to look for changes; and you can generalise it and let the computer do a brute-force search for things that would take you, the human, a long time to find.
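As a small illustration of "rerun it to look for changes", here is a sketch in Python that compares the rows from two runs of the same query. The diff_results helper and the ex: rows are invented for the example; they are not real query output.

```python
from typing import Set, Tuple

Row = Tuple[str, str]  # (actor, name)


def diff_results(previous: Set[Row], current: Set[Row]) -> Tuple[Set[Row], Set[Row]]:
    """Return (added, removed) rows between two runs of the same query."""
    return current - previous, previous - current


# Made-up rows standing in for two runs of the query.
earlier = {("ex:alice", "Alice"), ("ex:bob", "Bob")}
later = {("ex:alice", "Alice"), ("ex:carol", "Carol")}

added, removed = diff_results(earlier, later)
print(sorted(added))    # [('ex:carol', 'Carol')]
print(sorted(removed))  # [('ex:bob', 'Bob')]
```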