Version support for datastore/Progress

Timeline

For reference, I've copied the timeline from the proposal.

2009-04-03: Application deadline
2009-04-12: Easter (Sunday); UI mockup submitted for review by Design Team
2009-04-20: start of (university) term; announcement of accepted GSoC proposals
2009-05-10: API draft submitted for review by Development Team
2009-05-16: SugarCamp Europe 2009
2009-05-23: start of GSoC
2009-05-31: current code examined and understood; API, on-disk format and UI design chosen
2009-06-07: data store enhanced to be able to deal with versions (basic API)
2009-06-14: added (working) prev/next buttons to Journal details view
2009-06-21: added support for importing from existing data store
2009-06-28: added unit tests (and potentially regression tests), fixed all known bugs, submitted for review by Design Team
2009-07-06: GSoC midterm evaluation ("working and 90% done"); added indexing (e.g. using sqlite)
2009-07-13: code integrated upstream for increased exposure (testing!); started discussion on extended UI design (version tree etc.)
2009-07-25: end of (university) term
2009-08-10: end of GSoC
2009-10-31: Fedora 12 release; Sugar 0.86 release a short time later?

Progress

2009-05-16

SugarCamp was great! I got to know a lot of the SugarLabs people - they're a cool bunch. :)

Time in general was too short to do much more than get to know each other, but Tomeu quickly showed me an old effort at introducing version support (covering both the data store and the Journal) that I didn't know about. It looks quite interesting API-wise: it simply adds another, optional parameter called vid that can be used to request a specific version. I also had some time with Bernie, who offered to set up a host for VMs (for our build bots).
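To make that shape concrete, here is a toy sketch of such an optional version parameter; the names and data below are made up for illustration and are not the actual data store API.

 # Toy in-memory model: omitting vid keeps the current "give me the entry"
 # behaviour, while passing it selects a specific stored version.
 _versions = {"entry-1": {1: {"title": "draft"}, 2: {"title": "final"}}}

 def get_properties(object_id, vid=None):
     entry = _versions[object_id]
     if vid is None:
         vid = max(entry)             # default to the latest version
     return entry[vid]

 print(get_properties("entry-1"))          # {'title': 'final'}
 print(get_properties("entry-1", vid=1))   # {'title': 'draft'}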

Unfortunately we also caught some virus in Paris, so I had to lie in bed for the next two weeks.

2009-06-01

Slowly getting up to speed again, diving a bit into the data store code (both old and new) while fixing the build infrastructure (at about the same time as SugarCamp, GNOME made some largish changes to jhbuild that broke our sugar-jhbuild).

Discussed some possible data store API changes with alsroot, but they didn't really interfere with my design, so we decided to discuss them again once version support is finished.

Tomeu had the idea of treating the current object_id (distinguishing between instances of an activity) as a combined instance and version identifier and introducing a new "super_object_id" that does what object_id does now. The advantage would be that activities could transparently access old versions. If we introduced a version_id in parallel to object_id (the naive approach taken by the old implementation), an activity (or at least the framework) would need to remember the resumed version_id and pass it through in order to save onto the corresponding branch.
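To illustrate the difference in bookkeeping, here is a hypothetical sketch; none of the names below are the real data store API, they only show what an activity (or the framework) would have to carry around under each scheme.

 class ToyDatastore:
     """In-memory stand-in that records which version a save branched off."""

     def __init__(self):
         self.saves = []

     def update(self, object_id, metadata, parent_version=None):
         self.saves.append((object_id, parent_version, metadata))
         return len(self.saves)    # id of the newly written version

 # (a) version_id in parallel to object_id: two identifiers travel together,
 #     and the resumed version_id has to be remembered until save time so the
 #     new version ends up on the right branch.
 def save_parallel(ds, object_id, resumed_version_id, metadata):
     return ds.update(object_id, metadata, parent_version=resumed_version_id)

 # (b) combined identifier: the id handed out on resume already names one
 #     specific version, so saving needs no extra state; a new
 #     "super_object_id" would take over object_id's current role of grouping
 #     all versions of one piece of work.
 def save_combined(ds, version_object_id, metadata):
     return ds.update(version_object_id, metadata)

 ds = ToyDatastore()
 save_parallel(ds, "entry-1", resumed_version_id=3, metadata={"title": "edit"})
 save_combined(ds, "entry-1@3", metadata={"title": "edit"})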

Started evaluating Version Control Systems (VCSs). Of the 27 systems on the comparison list compiled by the "Better SCM Initiative", 13 are open source, and 8 of those are shipped by all of Debian, Fedora and Ubuntu (the distributions currently officially supported by sugar-jhbuild).

While writing a benchmark to help eliminate further candidates, I noticed that our use case is actually quite distinct from the one targeted by most systems (storing a small number of projects, each carrying source code, i.e. a large number of related files):

  1. we're going to store a large number of unrelated entries (i.e. "projects" in traditional VCS nomenclature)
  2. most of our entries are going to be rather small (compared to entire source trees)
  3. for space efficiency, we don't want to keep working copies around after the activity using them has finished

Point 3 offers an excellent chance for VCSs that expose their low-level primitives (e.g. git) to be tuned to our use case, as we might be able to access the repository directly instead of using the working directory as intermediate storage. It will only affect timing, not repository size, though.
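As a rough illustration of the kind of low-level access meant here, the sketch below writes an entry's content straight into a bare git repository using plumbing commands, never touching a working directory. The repository path, content and branch name are made up for the example, and error handling is omitted.

 import os
 import subprocess

 def run(args, data=None, env=None):
     # Helper: run a git command and return its stripped stdout (bytes).
     return subprocess.run(args, input=data, env=env,
                           capture_output=True, check=True).stdout.strip()

 repo = "/tmp/ds-demo.git"
 env = dict(os.environ,
            GIT_DIR=repo,
            GIT_AUTHOR_NAME="demo", GIT_AUTHOR_EMAIL="demo@example.org",
            GIT_COMMITTER_NAME="demo", GIT_COMMITTER_EMAIL="demo@example.org")

 run(["git", "init", "--bare", repo])

 # Store the content as a blob, wrap it in a tree and a commit, all without
 # ever checking anything out into a working copy.
 blob = run(["git", "hash-object", "-w", "--stdin"], data=b"entry content", env=env)
 tree = run(["git", "mktree"], data=b"100644 blob %s\tdata\n" % blob, env=env)
 commit = run(["git", "commit-tree", tree, "-m", "autosave"], env=env)
 run(["git", "update-ref", "refs/heads/master", commit], env=env)
 print("stored version", commit.decode())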

The sample set chosen for the benchmark (789 text files from Project Gutenberg, 295 MB) occupied my desktop for about 10 hours, so I've only done a single run (in multi-user mode) so far; still, the numbers should be accurate enough for an initial impression.

Benchmark results

[Plot: time taken for operations common to our usage scenario, for the various Version Control Systems]

The operations still missing from the benchmark are looking up and checking out intermediate versions (prior to branching off them) instead of the latest version of the branch. Since many VCSs store the latest version as-is (as a "full" copy) and only deltas for the intermediate versions, this might change the timings quite a bit. I'll also need to do a weighted summary to account for the prospective usage pattern (more commits than checkouts due to autosave, and only a few branches created).
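The weighted summary itself would be simple arithmetic along these lines; the weights below are invented placeholders standing in for the expected usage pattern, not measured numbers.

 # Assumed (made-up) operation mix: many autosave commits, fewer checkouts,
 # only the occasional branch.
 weights = {"commit": 100, "checkout": 10, "branch": 1}

 def weighted_runtime(timings):
     """timings: operation name -> average seconds per operation."""
     return sum(weights[op] * seconds for op, seconds in timings.items())

 # Invented example numbers for one hypothetical VCS:
 print(weighted_runtime({"commit": 0.8, "checkout": 1.5, "branch": 4.0}))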


[Plot: space occupied after the named operations have finished]

While I included my favourite VCS, GNU arch, just out of curiosity (it's unfortunately not maintained anymore), it compares very well with the other systems: it's in second place for both total (unweighted) runtime and final repository size, giving a perfect balance between those two goals.

2009-06-08

Finished the planned enhancements to the benchmark. After two runs of about 13 hours each (with only 100 of the Project Gutenberg files), with timings differing by about 20% on average, there are a few losers but no clear winner. Monotone comes close, but doesn't seem to scale well with branches (in both size and runtime) and also has the highest memory consumption.

The results emphasize the need for an abstraction layer so we can switch between different VCSs (depending on what resource is scarcest) or even to a homegrown backend later. For the prototype, git seems to be a good candidate, as it combines average resource consumption with a rich API, including low-level access.
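A minimal sketch of what such an abstraction layer could look like, assuming hypothetical method names (the real interface would be shaped by the data store API rather than by any particular VCS):

 from abc import ABC, abstractmethod

 class VersionStoreBackend(ABC):
     """What the data store needs from whatever keeps the version history."""

     @abstractmethod
     def commit(self, entry_id, parent_version, content_path, metadata):
         """Store a new version branching off parent_version; return its id."""

     @abstractmethod
     def checkout(self, entry_id, version_id, target_path):
         """Materialize the content of the given version at target_path."""

     @abstractmethod
     def list_versions(self, entry_id):
         """Return all known version ids of an entry (e.g. for a version tree)."""

Concrete backends (a git-based one for the prototype, monotone, or a homegrown store later on) would then be interchangeable behind this interface.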


Benchmark results

[Plot: summary of the final run of the VCS benchmark]

Darcs is fast, but needs over four times the original size to store the data.

[Plot: repository size summary]
[Plot: runtime summary]
[Plot: repository sizes for monotone]
[Plot: timings for monotone]

Monotone starts out very small, but incurs high costs on branches.

Memory usage

Determining reliable figures for memory usage, especially peak usage, seems to require valgrind, which according to its documentation slows down program execution by a factor of 5 (which would mean 65 hours for the already reduced sample set). Instead I've done a few simple measurements using BSD process accounting for a run of the benchmark with a single file as the sample set. This should at least give a very rough idea of memory consumption, especially relative to the other VCSs in the run.

 arch (incl. tar)  1340k
 bazaar            5060k
 cvs                957k
 darcs             7385k
 git               1614k
 mercurial         3501k
 monotone          9865k
 subversion        4121k
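For comparison, a much simpler way to get a rough peak figure for a single command on Linux is to ask getrusage() about the child process instead of going through process accounting; this is not the setup used for the numbers above, and the command below is only an example.

 import resource
 import subprocess

 def peak_child_rss_kb(cmd):
     subprocess.run(cmd, check=True)
     # On Linux, ru_maxrss is in kilobytes and covers the largest child this
     # process has waited for so far, so run one command per process to get a
     # clean per-command figure.
     return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

 print(peak_child_rss_kb(["git", "--version"]), "kB")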