19 May 2011

Importing SourceForge code into Apache for Jena

The Jena project now has the legal paperwork done for the vast majority of the codebase. It's now time to move the code from SourceForge, where it's been for almost 10 years (the project was registered in November 2001).

During that time, the SourceForge infrastructure has been excellent. We're not moving because of dissatisfaction but because we want to put the post-HP project on a solid legal basis where the license and IP situation is well-understood and completely clear. We now have committers in 3 different organisations, and contributions from yet more - it's slowly getting more and more complicated.

The way Apache works is that software is granted to Apache, which gives Apache the right to re-license it. Any software you use from Apache comes under a license (with IP guarantees etc.) from Apache to you - not between you and the original contributor - so when you use the software commercially you only need to check one Apache license.

Until now, we have had a setup where any contributions are simply incorporated with the license and conditions of the contributor. It so happens that all the licensed code in the codebase is the same BSD-type license but in using Jena you don't get a single license, you get one from every contributor. For some people who are going to depend on Jena for commercial use or long term big deployment, this matters. We've had a user crawl the codebase to check each of the licenses (as they should but it's just work). With Apache it's different - one license, well-understood legal situation.

Contributors grant software two ways - either a software grant document or when they upload code to a mailing list or to Jira. When you add something to Jira there's a tick box to say you are making the grant to Apache, otherwise while it may illustrate some issue, we can't use it in the codebase.

Apache use subversion so Jena needs to import the code base to svn.

Subversion or git or Mercurial ...

Aside: one question I've been asked is why not a DVCS like git or Mercurial. Apache use Subversion. As I understand it, there are legal matters to consider. Suppose A pushes code to B and B pushes to Apache. A has not necessarily granted the software to Apache - B could check but it's a new burden for B, and pushing to Apache is B's responsibility but B does not own A's contribution. Maybe this will change sometime but at the moment, DVCS works for direct contributor-to-user licensing (and the user "should" then check every license) but not the consolidation offered by Apache.

Process

Jena has three repositories: Jena in CVS, Jena in SVN and Joseki in CVS. There are active projects in all of them but there is also a lot of history and legacy. We want to import everything as a record of ownership, not just copy the latest working copy.

This is the process I have put together:

1. Grab the repositories

SourceForge offer rsync access for backup, with history (the tarballs are just the current state).

2. Convert CVS to SVN

We have a multi-project layout so cvs2svn needs some arguments.

#!/bin/bash
MODS="ARQ BRQL DataGenerator Eyeball EyeballAcceptance Scratch extras grddl gvs iri jena jena-perf jena2 modeljeb owlsyntax rdf-html sparql2sql
tutorial"

SVN=ASF-Jena-CVS   # Destination
CVS=../Jena-CVS    # Local rsync backup

for m in $MODS
do
    echo "==== $m"
    # Reset per module so options do not pile up across iterations
    ARGS=""
    #ARGS="--dry-run"
    ARGS="$ARGS --encoding=utf8 --encoding=iso-8859-1"
    # Create trunk/branches/tags structure per project
    ARGS="$ARGS --trunk=$m/trunk --branches=$m/branches --tags=$m/tags"
    cvs2svn $ARGS --existing-svnrepos --svnrepos "$SVN" "$CVS/$m"
done

and much the same for Joseki except the modules list is just "Joseki1 Joseki2 Joseki3" and it is much faster.

Dry-run this first : it showed up two problems.

The "--encoding=utf8 --encoding=iso-8859-1" is to get the translation of some people's names right (non-ASCII characters).

A name clash in Joseki couldn't be resolved. Fortunately, it was with some old intermediate binaries so simply deleting from CVS (the joy of CVS using the filesystem layout) was simplest.

3. Dump the repositories

Use "svnadmin dump" and gzip the files. They are going to be uploaded to an Apache machine and they are quite large - 3.1G to upload from my home cable connection (1.5Mbit up).
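As a rough sketch of that step, the dump can be streamed straight into a gzip file so the uncompressed dump never has to sit on disk. The function name dump_repo and its dump_cmd parameter are my own for illustration; the only real command assumed is "svnadmin dump" with its --quiet option.

```python
import gzip
import shutil
import subprocess


def dump_repo(repo_dir, out_path, dump_cmd=("svnadmin", "dump", "--quiet")):
    """Stream the dump of repo_dir into a gzip file at out_path.

    dump_cmd is a hypothetical hook so a different command can be swapped in;
    by default this runs `svnadmin dump --quiet <repo_dir>`.
    """
    proc = subprocess.Popen([*dump_cmd, repo_dir], stdout=subprocess.PIPE)
    with gzip.open(out_path, "wb") as out:
        # Copy in chunks: the dumps here run to gigabytes.
        shutil.copyfileobj(proc.stdout, out)
    proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError(f"dump of {repo_dir} failed")
    return out_path
```

For the repository built by the script above, this would be called as dump_repo("ASF-Jena-CVS", "ASF-Jena-CVS.svn.gz").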

4. Import to subversion

This step has been done by the Apache Infrastructure team as it requires svnadmin access to the repository. See INFRA-3628 for the details.

It's good to check it's going to do the right thing first. We now have the files for three repositories. We want the imported svn to look like:

   .../Import/Jena-CVS/...
   .../Import/Jena-SVN/...
   .../Import/Joseki-CVS/...

so we have a permanent record of the code state at the start of the Apache svn. After import, active projects can be "svn copy"ed out to give the working versions going forward.

To test it's going to work when the Apache infrastructure team do the actual import, I built a local repo in the same layout.

# ---- Create the layout in Apache repository
mkdir -p Layout/incubator/jena/Import/Joseki-CVS
mkdir -p Layout/incubator/jena/Import/Jena-CVS
mkdir -p Layout/incubator/jena/Import/Jena-SVN
svnadmin create ApacheRepo
svn import Layout/ file://$PWD/ApacheRepo -m "Set layout"
rm -rf Layout

then it's just a matter of inserting the code in the right place:

# --- Imports
REPO=ApacheRepo

# Joseki-CVS
gzip -d < Imports/ASF-Joseki-CVS.svn.gz | \
     svnadmin load --parent-dir incubator/jena/Import/Joseki-CVS $REPO

# Jena-CVS
gzip -d < Imports/ASF-Jena-CVS.svn.gz | \
     svnadmin load --parent-dir incubator/jena/Import/Jena-CVS $REPO

# Jena-SVN
gzip -d < Imports/ASF-Jena-SVN.svn.gz | \
     svnadmin load --parent-dir incubator/jena/Import/Jena-SVN $REPO

The slow bits were cvs2svn (it's not bad but it's not instant: an hour or so), the upload to Apache (a couple of hours) and checking the "svnadmin load" (another couple of hours).

5. Extract working copies

We're keeping the imports unchanged as a record of the starting point at Apache (revision 1124118).

The whole process has been done now - Jena code at Apache

02 March 2011

Updating RDF Lists with SPARQL

Something the SPARQL Working Group has been thinking about recently is updates to RDF lists.

RDF lists are hard to deal with because they are not first class objects in the RDF data model. Instead they are "encoded" in triples. The encoding uses a cons-cell-like structure whereby each element of the list is held by a cell node (not necessarily a blank node, but it nearly always is).

RDF lists are correctly called "RDF collections" but as it's the list-nature (elements in order) that matters, I'll call them lists in this blog.

Turtle and SPARQL have syntax for lists, but it's only surface syntax, and there are really triples in the RDF graph:

@prefix : <http://example/> .
:x :p (1 2 3) .

is the RDF:

:x    :p         _:b0 .
_:b0  rdf:first  1 .
_:b0  rdf:rest   _:b1 .
_:b1  rdf:first  2 .
_:b1  rdf:rest   _:b2 .
_:b2  rdf:first  3 .
_:b2  rdf:rest   rdf:nil .

RDF toolkits help by presenting lists as programming language lists. This also helps in keeping the lists well formed. In all those triples, there is one rdf:rest and one rdf:first per list element - but it's legal RDF to have several uses of the properties, or none, on one subject.
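To make the encoding concrete, here is a minimal Python sketch of what a toolkit does underneath. It is not any particular toolkit's API: triples are plain tuples, terms are plain strings, and encode_list and decode_list are names made up for this illustration. The decoder refuses anything that doesn't have exactly one rdf:first and one rdf:rest per cell.

```python
import itertools

RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"
_labels = itertools.count()  # source of fresh blank node labels


def encode_list(items):
    """Return (head, triples) for the cons-cell encoding of items."""
    triples = set()
    head = RDF_NIL                    # the empty list is rdf:nil and no triples
    for item in reversed(items):      # build from the tail towards the head
        cell = f"_:c{next(_labels)}"
        triples.add((cell, RDF_FIRST, item))
        triples.add((cell, RDF_REST, head))
        head = cell
    return head, triples


def decode_list(head, triples):
    """Walk the cells from head; raise ValueError if the list is badly formed."""
    items, seen, node = [], set(), head
    while node != RDF_NIL:
        firsts = [o for s, p, o in triples if s == node and p == RDF_FIRST]
        rests = [o for s, p, o in triples if s == node and p == RDF_REST]
        if node in seen or len(firsts) != 1 or len(rests) != 1:
            raise ValueError(f"badly formed list at {node}")
        seen.add(node)
        items.append(firsts[0])
        node = rests[0]
    return items
```

Encoding [1, 2, 3] gives six triples, two per element, matching the expansion shown above.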

As an additional quirk, the empty list isn't any RDF triples, so looking for lists isn't just looking for rdf:rest properties.

@prefix : <http://example/> .
:x :p () .

is the RDF:

:x :p rdf:nil .

Lists, Property Paths and Update

SPARQL 1.1 Query adds property paths, which make working with lists a bit easier, but it's not perfect. List elements do not necessarily come out in order.

{ :list rdf:rest*/rdf:first ?element }
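Here is a small sketch of why order is lost, assuming the same tuples-as-triples model (the function names are mine, not from any SPARQL engine). The path first collects every node reachable by zero or more rdf:rest steps, then takes rdf:first from each; what comes back is a collection of matches with no position information.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

# Test data: the list (1 2 3) as cons cells.
TRIPLES = {
    ("_:b0", RDF_FIRST, 1), ("_:b0", RDF_REST, "_:b1"),
    ("_:b1", RDF_FIRST, 2), ("_:b1", RDF_REST, "_:b2"),
    ("_:b2", RDF_FIRST, 3), ("_:b2", RDF_REST, RDF_NIL),
}


def rest_star(start, triples):
    """Nodes reachable from start by zero or more rdf:rest steps."""
    reached, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for s, p, o in triples:
            if s == node and p == RDF_REST and o not in reached:
                reached.add(o)
                frontier.append(o)
    return reached


def path_elements(start, triples):
    """Matches for: start rdf:rest*/rdf:first ?element (an unordered set)."""
    cells = rest_star(start, triples)
    return {o for s, p, o in triples if s in cells and p == RDF_FIRST}
```

Note that rest_star includes the start node itself (the zero-step case) and also rdf:nil, which simply contributes no rdf:first.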

But what about SPARQL 1.1 Update? How can we work with RDF lists? Here are some scripts for list operations. By using property paths they work on arbitrary length lists.

All the scripts are self-contained - they include test data.

They are examples - they aren't necessarily fully general; for example, they may not cope if lists are badly formed or if the property :p is also used to relate the subject to things that aren't lists. The last example shows a way to address that by finding and marking relevant points in the graph, doing some work, then going back and tidying up. The graph being updated is also used as a scratch pad.

Add an element to the start of a list

PREFIX :    <http://example/> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

INSERT DATA {
  :x0 :p () .
  :x1 :p (1) .
  :x2 :p (1 2) .
  :x3 :p (1 2 3) .
} ;

DELETE { ?x :p ?list }
INSERT { ?x :p [ rdf:first 0 ; 
                 rdf:rest ?list ]
       }
WHERE
{
  ?x :p ?list .
}

This one is relatively easy. Find the list start ?x :p ?list, which works whether the list is zero length or already has elements, delete the old triple that connected to the start of the list, put in a new cons cell (the [...]) at the start, and link to it.
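The same steps, modelled in Python on a set of tuples, might look like the sketch below. This is only a model of what the update does to the triples, not how a SPARQL engine works; prepend and list_of are names invented here, and the test data is a cut-down version of the script's.

```python
import itertools

RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"
_labels = itertools.count()  # fresh blank node labels for the [...] cell

# Test data: :x0 :p () .  :x1 :p (1) .
DATA = {
    (":x0", ":p", RDF_NIL),
    (":x1", ":p", "_:a0"), ("_:a0", RDF_FIRST, 1), ("_:a0", RDF_REST, RDF_NIL),
}


def list_of(triples, subject, prop=":p"):
    """Read the list at (subject prop ?list) as a Python list."""
    node = next(o for s, p, o in triples if s == subject and p == prop)
    items = []
    while node != RDF_NIL:
        items.append(next(o for s, p, o in triples if s == node and p == RDF_FIRST))
        node = next(o for s, p, o in triples if s == node and p == RDF_REST)
    return items


def prepend(triples, value, prop=":p"):
    """Put value at the front of every list reached via prop."""
    matches = [(s, o) for s, p, o in triples if p == prop]  # WHERE { ?x :p ?list }
    result = set(triples)
    for x, old_head in matches:
        cell = f"_:new{next(_labels)}"                      # one new cell per match
        result.discard((x, prop, old_head))                 # DELETE
        result |= {(x, prop, cell),                         # INSERT
                   (cell, RDF_FIRST, value),
                   (cell, RDF_REST, old_head)}
    return result
```

The empty list needs no special case because the new cell's rdf:rest is whatever ?list was, including rdf:nil.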

Add an element to the end of a list

PREFIX :    <http://example/> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

INSERT DATA {
  :x0 :p () .
  :x1 :p (1) .
  :x2 :p (1 2) .
  :x3 :p (1 2 3) .
} ;

# The order here is important.
# Must do list >= 1 first.

# List of length >= 1
DELETE { ?elt rdf:rest rdf:nil }
INSERT { ?elt rdf:rest [ rdf:first 98 ; rdf:rest rdf:nil ] }
WHERE
{
  ?x :p ?list .
  # List of length >= 1
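  # rdf:rest* (zero or more steps) so ?elt can be ?list itself
  # when the list has exactly one element.
  ?list rdf:rest* ?elt .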
  ?elt rdf:rest rdf:nil .
  # ?elt is last cons cell
} ;

# List of length = 0
DELETE { ?x :p rdf:nil . }
INSERT { ?x :p [ rdf:first 99 ; rdf:rest rdf:nil ] }
WHERE
{
   ?x :p rdf:nil .
}

This is a bit harder - there are two cases, lists of length 0 and lists of length one or more. The element before the insertion point needs changing and that can be a cons cell (list length >= 1) or the empty list (the triple pointing to it).

Do the lists of length one or more first; otherwise a list of length zero that has just had an element added will be caught again by the update for lists of length one or more.

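For a list of length 1 or more: find the last element. The WHERE finds ?elt by visiting all the cons cells of the list with a path over rdf:rest, and checks it's the last one by looking for ?elt rdf:rest rdf:nil. The path has to allow zero steps (rdf:rest*), because in a one-element list the first cons cell is also the last.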

Then delete the rdf:rest, and insert the new cons cell [ rdf:first 98 ; rdf:rest rdf:nil ].

For a list of length 0, the style is the same but finding the triple to delete-insert to attach the cons cell is different.
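A sketch of the two cases in Python, again just modelling the triples as a set of tuples and assuming :p only ever points at well-formed lists (append and list_of are illustrative names). The two steps run in the same order as the script: non-empty lists first, then empty ones.

```python
import itertools

RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"
_labels = itertools.count()

# Test data: :x0 :p () .  :x1 :p (1) .  :x2 :p (1 2) .
DATA = {
    (":x0", ":p", RDF_NIL),
    (":x1", ":p", "_:a0"), ("_:a0", RDF_FIRST, 1), ("_:a0", RDF_REST, RDF_NIL),
    (":x2", ":p", "_:b0"), ("_:b0", RDF_FIRST, 1), ("_:b0", RDF_REST, "_:b1"),
    ("_:b1", RDF_FIRST, 2), ("_:b1", RDF_REST, RDF_NIL),
}


def list_of(triples, subject, prop=":p"):
    """Read the list at (subject prop ?list) as a Python list."""
    node = next(o for s, p, o in triples if s == subject and p == prop)
    items = []
    while node != RDF_NIL:
        items.append(next(o for s, p, o in triples if s == node and p == RDF_FIRST))
        node = next(o for s, p, o in triples if s == node and p == RDF_REST)
    return items


def append(triples, value, prop=":p"):
    """Add value to the end of every list reached via prop."""
    g = set(triples)
    # Step 1: length >= 1. Find the last cons cell (zero or more rest steps).
    last_cells = set()
    for _, p, head in triples:
        if p == prop and head != RDF_NIL:
            cell = head
            while (cell, RDF_REST, RDF_NIL) not in triples:
                cell = next(o for s2, p2, o in triples if s2 == cell and p2 == RDF_REST)
            last_cells.add(cell)
    for cell in last_cells:
        new = f"_:new{next(_labels)}"
        g.discard((cell, RDF_REST, RDF_NIL))
        g |= {(cell, RDF_REST, new), (new, RDF_FIRST, value), (new, RDF_REST, RDF_NIL)}
    # Step 2: length 0. The triple to rewrite is the one pointing at rdf:nil.
    for x in [s for s, p, o in g if p == prop and o == RDF_NIL]:
        new = f"_:new{next(_labels)}"
        g.discard((x, prop, RDF_NIL))
        g |= {(x, prop, new), (new, RDF_FIRST, value), (new, RDF_REST, RDF_NIL)}
    return g
```

Swapping the two steps would turn each empty list into a one-element list and then append to it a second time.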

Delete the element at the start of a list

PREFIX :      <http://example/> 
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

INSERT DATA {
  :x3 :p (1 2 3) .
  :x2 :p (1 2) .
  :x1 :p (1) .
  :x0 :p () .
} ;

DELETE { 
   ?x :p ?list .
   ?list rdf:first ?first ;
         rdf:rest  ?rest }
INSERT { ?x :p ?rest }
WHERE
{
  ?x :p ?list .
  ?list rdf:first ?first ;
        rdf:rest ?rest .
}

This can be done in one step - we are not interested in lists of length 0 because they have no element to delete. So find the pattern at the start of the list, delete it (note the WHERE pattern and DELETE template are the same), and insert the new triple that links the subject directly to what was the rdf:rest of the first cell.
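Modelled in Python on a set of tuples, the single step might look like this sketch (delete_first and list_of are names made up for the illustration). A subject whose :p value has no rdf:first and rdf:rest, such as the empty list, simply doesn't match and is left alone.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

# Test data: :x0 :p () .  :x1 :p (1) .  :x2 :p (1 2) .
DATA = {
    (":x0", ":p", RDF_NIL),
    (":x1", ":p", "_:a0"), ("_:a0", RDF_FIRST, 1), ("_:a0", RDF_REST, RDF_NIL),
    (":x2", ":p", "_:b0"), ("_:b0", RDF_FIRST, 1), ("_:b0", RDF_REST, "_:b1"),
    ("_:b1", RDF_FIRST, 2), ("_:b1", RDF_REST, RDF_NIL),
}


def list_of(triples, subject, prop=":p"):
    """Read the list at (subject prop ?list) as a Python list."""
    node = next(o for s, p, o in triples if s == subject and p == prop)
    items = []
    while node != RDF_NIL:
        items.append(next(o for s, p, o in triples if s == node and p == RDF_FIRST))
        node = next(o for s, p, o in triples if s == node and p == RDF_REST)
    return items


def delete_first(triples, prop=":p"):
    """Drop the first element of every non-empty list reached via prop."""
    g = set(triples)
    for x, p, head in triples:
        if p != prop:
            continue
        first = [t for t in triples if t[0] == head and t[1] == RDF_FIRST]
        rest = [t for t in triples if t[0] == head and t[1] == RDF_REST]
        if first and rest:                         # WHERE matched: head is a cons cell
            g -= {(x, prop, head), *first, *rest}  # DELETE (same shape as the WHERE)
            g |= {(x, prop, r[2]) for r in rest}   # INSERT { ?x :p ?rest }
    return g
```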

Delete the element at the end of a list

PREFIX :     <http://example/> 
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

INSERT DATA {
  :x3 :p (1 2 3) .
  :x2 :p (1 2) .
  :x1 :p (1) .
  :x0 :p () .
} ;

# List of length 1
# Do before other lists.

DELETE { ?x :p ?elt .
         ?elt  rdf:first ?v .
         ?elt  rdf:rest  rdf:nil .
       }
INSERT { ?x :p rdf:nil . }
WHERE
{
  ?x :p ?elt .
  ?elt rdf:first ?v ;
       rdf:rest rdf:nil .
} ;

# List of length >= 2
DELETE { ?elt1 rdf:rest ?elt .
         ?elt  rdf:first ?v .
         ?elt  rdf:rest  rdf:nil .
       }
INSERT { ?elt1 rdf:rest rdf:nil }
WHERE
{
  ?x :p ?list .
  ?list rdf:rest* ?elt1 .

  # Second to end.
  ?elt1 rdf:rest ?elt .
  # End.
  ?elt rdf:first ?v ;
       rdf:rest rdf:nil .
}

The cases to consider are lists of exactly one element and lists of two or more elements. It's the treatment of the element before the element we're deleting that is different.

The style is the same though - find the place before the deletion point, and then delete that cons cell.

For the list of length 2 or more, rdf:rest* is used, which matches all the cons cells of the list, including ?list itself (the case of zero steps) - then the structure beyond that is tested for being the end. There are 2 rdf:rest uses in the test for the end, hence list of length 2 or more.
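Here is a sketch of the two cases in Python, with the triples as a set of tuples and well-formed lists assumed (delete_last and list_of are illustrative names). Step 2 is matched against the graph as it stands after step 1, which is why the order matters: run the other way round, a two-element list would be cut to one element and then emptied.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

# Test data: :x0 :p () .  :x1 :p (1) .  :x2 :p (1 2) .
DATA = {
    (":x0", ":p", RDF_NIL),
    (":x1", ":p", "_:a0"), ("_:a0", RDF_FIRST, 1), ("_:a0", RDF_REST, RDF_NIL),
    (":x2", ":p", "_:b0"), ("_:b0", RDF_FIRST, 1), ("_:b0", RDF_REST, "_:b1"),
    ("_:b1", RDF_FIRST, 2), ("_:b1", RDF_REST, RDF_NIL),
}


def list_of(triples, subject, prop=":p"):
    """Read the list at (subject prop ?list) as a Python list."""
    node = next(o for s, p, o in triples if s == subject and p == prop)
    items = []
    while node != RDF_NIL:
        items.append(next(o for s, p, o in triples if s == node and p == RDF_FIRST))
        node = next(o for s, p, o in triples if s == node and p == RDF_REST)
    return items


def delete_last(triples, prop=":p"):
    """Drop the last element of every non-empty list reached via prop."""
    g = set(triples)
    # Step 1 (first): exactly one element, so prop points at the last cell.
    for x, p, elt in triples:
        if p == prop and (elt, RDF_REST, RDF_NIL) in triples:
            g -= {t for t in triples if t[0] == elt and t[1] in (RDF_FIRST, RDF_REST)}
            g.discard((x, prop, elt))
            g.add((x, prop, RDF_NIL))
    # Step 2: two or more elements, matched on the graph after step 1.
    snapshot = set(g)
    for _, p, cell in snapshot:
        if p != prop:
            continue
        while cell != RDF_NIL:
            nxt = next((o for s, q, o in snapshot if s == cell and q == RDF_REST), RDF_NIL)
            if nxt != RDF_NIL and (nxt, RDF_REST, RDF_NIL) in snapshot:
                # cell is second to end (?elt1), nxt is the end (?elt)
                g -= {t for t in snapshot if t[0] == nxt and t[1] in (RDF_FIRST, RDF_REST)}
                g.discard((cell, RDF_REST, nxt))
                g.add((cell, RDF_REST, RDF_NIL))
            cell = nxt
    return g
```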

Delete the whole list (common case)

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX :     <http://example/> 

INSERT DATA {
:x0 :p () .
:x0 :q "abc" .

:x1 :p (1) .
:x1 :q "def" .

:x2 :p (1 2) .
:x2 :q "ghi" .
} ;

# Delete the cons cells.
DELETE
    { ?z rdf:first ?head ; rdf:rest ?tail . }
WHERE { 
      [] :p ?list .
      ?list rdf:rest* ?z .
      ?z rdf:first ?head ;
         rdf:rest ?tail .
      } ;

# Delete the triples that connect the lists.
DELETE WHERE { ?x :p ?z . }

This version is not fully general because it assumes that :p is a link to the list and not also to any other RDF terms (non-lists) which we would want to keep.

The first DELETE finds and removes all cons cells. The second DELETE removes the triple with :p connecting the list to the subject.
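To see the limitation, here is a sketch of the two steps in Python over a set of tuples (delete_lists_simple is a made-up name, and the extra "String 1" triple is added to the test data purely to show the problem). The second step removes every :p triple, so the non-list value goes too.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

# Test data: :x1 :p (1) ; :q "def" ; plus a non-list value of :p.
DATA = {
    (":x1", ":p", "_:a0"), ("_:a0", RDF_FIRST, 1), ("_:a0", RDF_REST, RDF_NIL),
    (":x1", ":q", "def"),
    (":x1", ":p", "String 1"),
}


def delete_lists_simple(triples, prop=":p"):
    """Delete every list reached via prop, and every prop triple."""
    g = set(triples)
    # First DELETE: the cons cells reachable from any value of prop.
    for _, p, node in triples:
        if p != prop:
            continue
        seen = set()
        while node is not None and node not in seen:
            seen.add(node)
            cell = {t for t in triples if t[0] == node and t[1] in (RDF_FIRST, RDF_REST)}
            g -= cell
            node = next((t[2] for t in cell if t[1] == RDF_REST), None)
    # Second DELETE WHERE { ?x :p ?z }: list or not, it goes.
    g -= {t for t in triples if t[1] == prop}
    return g
```

On this data only the :q triple survives.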

Delete the whole list (general case)

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX :     <http://example/> 

INSERT DATA {
:x0 :p () .
:x0 :p "String 0" .
:x0 :p [] .

:x1 :p (1) .
:x1 :p "String 1" .
:x1 :p [] .

:x2 :p (1 2) .
:x2 :p "String 2" .
:x2 :p [] .

# A list not connected.
(1 2) .

# Not legal RDF.
# () .

} ;

INSERT { ?list :deleteMe true . }
WHERE {
   ?x :p ?list . 
   FILTER (?list = rdf:nil || EXISTS{?list rdf:rest ?z} )
} ;

# Delete the cons cells.
DELETE
    { ?z rdf:first ?head ; rdf:rest ?tail . }
WHERE { 
      [] :p ?list .
      ?list rdf:rest* ?z .
      ?z rdf:first ?head ;
         rdf:rest ?tail .
      } ;

# Delete the marked nodes
DELETE 
WHERE { ?x :p ?z . 
        ?z :deleteMe true . 
} ;

## ------
## Unconnected lists.

DELETE
    { ?z rdf:first ?head ; rdf:rest ?tail . }
WHERE { 
      ?list rdf:rest ?z2 .
      FILTER NOT EXISTS { ?s ?p ?list }
      ?list rdf:rest* ?z .
      ?z rdf:first ?head ;
         rdf:rest ?tail .
      } 

Deep breath.

This one is quite long.

The first step is to find the triples that connect a subject to a list via :p and mark the list node at the end of each. We will need to delete these triples at the end of the process, but the property might also be used for non-lists, and after the middle DELETE step all evidence of the lists is lost. The test:

    FILTER (?list = rdf:nil || EXISTS{?list rdf:rest ?z} )

catches both zero length lists and lists with elements.

Second step: delete all list elements, any subjects with properties rdf:first and rdf:rest.

Third step: remove the connecting triples and the markers.

Finally, we delete any lists where the start isn't connected to anything, which is the

    FILTER NOT EXISTS { ?s ?p ?list }

test.
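The mark, work, tidy-up sequence can be sketched in Python on a set of tuples as below. It is a model of the effect on the triples, not of a SPARQL engine; the function names are invented, and it assumes well-formed lists (it removes any rdf:first or rdf:rest on a reachable cell rather than insisting on both). The marker triples live in the graph itself, as in the script.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"
MARK = ":deleteMe"

DATA = {
    (":x0", ":p", RDF_NIL), (":x0", ":p", "String 0"),
    (":x1", ":p", "_:a0"), ("_:a0", RDF_FIRST, 1), ("_:a0", RDF_REST, RDF_NIL),
    (":x1", ":p", "String 1"),
    # An unconnected list (7 8).
    ("_:u0", RDF_FIRST, 7), ("_:u0", RDF_REST, "_:u1"),
    ("_:u1", RDF_FIRST, 8), ("_:u1", RDF_REST, RDF_NIL),
}


def cell_triples(g, head):
    """rdf:first/rdf:rest triples of the cons cells reachable from head."""
    found, seen, node = set(), set(), head
    while node is not None and node not in seen:
        seen.add(node)
        cell = {t for t in g if t[0] == node and t[1] in (RDF_FIRST, RDF_REST)}
        found |= cell
        node = next((t[2] for t in cell if t[1] == RDF_REST), None)
    return found


def delete_lists_general(triples, prop=":p"):
    g = set(triples)
    # Step 1: mark values of prop that are lists (rdf:nil or has an rdf:rest).
    has_rest = {s for s, p, o in g if p == RDF_REST}
    g |= {(o, MARK, True) for s, p, o in set(g)
          if p == prop and (o == RDF_NIL or o in has_rest)}
    # Step 2: delete the cons cells of lists reached via prop.
    for _, p, o in list(g):
        if p == prop:
            g -= cell_triples(g, o)
    # Step 3: the cells are gone, but the marks say which prop triples to drop.
    marked = {s for s, p, o in g if p == MARK}
    g -= {t for t in g if (t[1] == prop and t[2] in marked) or t[1] == MARK}
    # Step 4: unconnected lists, whose head is not the object of any triple.
    objects = {o for s, p, o in g}
    for head in {s for s, p, o in g if p == RDF_REST and s not in objects}:
        g -= cell_triples(g, head)
    return g
```

On this data only the two string-valued :p triples are left.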

License and Copyright

This page and the SPARQL 1.1 Update scripts are (c) Epimorphics Ltd and licensed under a Creative Commons Attribution 3.0 License.