Version support for datastore/Progress

Revision as of 10:53, 5 June 2009 by Sascha silbe (talk | contribs) (→‎2009-06-01: report on VCS evaluation progress)

Timeline

For reference, I've copied the timeline from the proposal.

2009-04-03
Application deadline
2009-04-12
Easter (Sunday); UI mockup submitted for review by Design Team
2009-04-20
start of (university) term; announcement of accepted GSoC proposals
2009-05-10
submitted API draft for review by Development Team
2009-05-16
SugarCamp Europe 2009
2009-05-23
start of GSoC
2009-05-31
current code examined and understood; API, on-disk format and UI design chosen
2009-06-07
data store enhanced to be able to deal with versions (basic API)
2009-06-14
added (working) prev/next buttons to Journal details view
2009-06-21
added support for importing from existing data store
2009-06-28
added unit tests (and potentially regression tests), fixed all known bugs, submitted for review by Design Team
2009-07-06
GSoC midterm evaluation ("working and 90% done"); added indexing (e.g. using sqlite)
2009-07-13
code integrated upstream for increased exposure (testing!); started discussion on extended UI design (version tree etc.)
2009-07-25
end of (university) term
2009-08-10
end of GSoC
2009-10-31
Fedora 12 release; Sugar 0.86 release a short time later?

Progress

2009-05-16

SugarCamp was great! I got to know a lot of the SugarLabs people - they're a cool bunch. :)

Time was generally too short to do more than get to know each other, but Tomeu quickly showed me an old effort at introducing version support (in both the data store and the Journal) that I hadn't known about. It looks quite interesting API-wise: it simply adds another, optional parameter called vid that can be used to request a specific version. I also had some time with Bernie, who offered to set up a host for VMs (for our build bots).

Unfortunately we also caught some virus in Paris, so I had to stay in bed for the next two weeks.

2009-06-01

Slowly getting up to speed again, diving a bit into the data store code (both old and new) while fixing the build infrastructure (around the time of SugarCamp, GNOME made some largish changes to jhbuild that broke our sugar-jhbuild).

Discussed some possible data store API changes with alsroot, but they didn't really interfere with my design and we decided to discuss them again after the version support is finished.

Tomeu had the idea of treating the current object_id (which distinguishes between instances of an activity) as a combined instance and version identifier, and introducing a new "super_object_id" that does what object_id does now. The advantage would be that activities could transparently access old versions. If we introduced a version_id in parallel to object_id (the naive approach taken by the old implementation), an activity (or at least the framework) would need to remember and pass through the resumed version_id in order to use the corresponding branch on save.
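A minimal in-memory sketch of how that idea could look, just to illustrate the bookkeeping. The class name `VersionedStore`, its methods and the use of UUIDs are assumptions made for this example; none of this is the actual data store API.

```python
import uuid


class VersionedStore:
    """Illustration only: every saved version gets its own object_id."""

    def __init__(self):
        # object_id -> {"super_object_id", "parent_id", "data"}
        self._entries = {}

    def save(self, data, resumed_object_id=None):
        """Store data as a new version and return its new object_id.

        If resumed_object_id is given, the new version becomes its child
        and inherits its super_object_id; otherwise a new entry is started.
        """
        if resumed_object_id is None:
            super_id = str(uuid.uuid4())
        else:
            super_id = self._entries[resumed_object_id]["super_object_id"]
        object_id = str(uuid.uuid4())
        self._entries[object_id] = {
            "super_object_id": super_id,
            "parent_id": resumed_object_id,
            "data": data,
        }
        return object_id

    def load(self, object_id):
        """Return the data of exactly the version named by object_id."""
        return self._entries[object_id]["data"]

    def super_object_id(self, object_id):
        """Return the identifier shared by all versions of one entry."""
        return self._entries[object_id]["super_object_id"]

    def history(self, object_id):
        """Walk parent links from object_id back to the first version."""
        chain = []
        while object_id is not None:
            chain.append(object_id)
            object_id = self._entries[object_id]["parent_id"]
        return chain
```

In this sketch an activity only ever handles a single identifier: it loads the object_id it was resumed with and passes the same value back on save, so branching off an old version needs no extra parameter.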

Started evaluating Version Control Systems (VCS). Of the 27 systems on the comparison list compiled by the "Better SCM Initiative", 13 are open source, and 8 of those are shipped by all of Debian, Fedora and Ubuntu (the distributions currently officially supported in sugar-jhbuild).

While writing a benchmark to help in further elimination of candidates, I noticed our use case is actually quite distinct from that targeted by most systems (which is storing a small number of projects each carrying source code, i.e. a large number of related files):

  1. we're going to store a large number of unrelated entries (i.e. "projects" in traditional VCS nomenclature)
  2. most of our entries are going to be rather small (compared to entire source trees)
  3. for space efficiency, we don't want to keep working copies around after the activity using them has finished

Point 3 offers an excellent chance to tune VCSs that expose their low-level working primitives (e.g. git) to our use case, as we might be able to access the repository directly instead of using the working directory as intermediate storage. It will only affect timing, not repository size, though.

The sample set chosen for the benchmark (789 text files from Project Gutenberg, 295 MB) occupied my desktop for about 10 hours, so although I've done only a single run so far (in multi-user mode), the numbers should be accurate enough for an initial impression.
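As a rough sketch of what direct repository access could look like with git, the following stores and reads back an entry in a bare repository using only plumbing commands (`hash-object`, `mktree`, `commit-tree`, `update-ref`, `cat-file`), so no working copy is ever created. The function names, the one-branch-per-entry layout and the placeholder identity are assumptions made for this example, not the chosen design.

```python
import os
import subprocess

# Identity required by commit-tree; placeholder values for this sketch.
_IDENTITY = {
    "GIT_AUTHOR_NAME": "datastore",
    "GIT_AUTHOR_EMAIL": "datastore@localhost",
    "GIT_COMMITTER_NAME": "datastore",
    "GIT_COMMITTER_EMAIL": "datastore@localhost",
}


def run_git(git_dir, args, stdin=None):
    """Run one git command against a bare repository, return raw stdout."""
    result = subprocess.run(
        ["git", "--git-dir=" + git_dir] + list(args),
        input=stdin,
        stdout=subprocess.PIPE,
        env=dict(os.environ, **_IDENTITY),
        check=True,
    )
    return result.stdout


def commit_entry(git_dir, branch, name, data, parent=None, message="autosave"):
    """Commit data (bytes) as file `name` on `branch` without a working copy."""
    # Write the content straight into the object database.
    blob = run_git(git_dir, ["hash-object", "-w", "--stdin"], stdin=data)
    blob = blob.decode().strip()
    # Build a tree containing just this one file.
    tree_line = "100644 blob {}\t{}\n".format(blob, name).encode()
    tree = run_git(git_dir, ["mktree"], stdin=tree_line).decode().strip()
    # Create the commit, optionally chained to a parent version.
    args = ["commit-tree", "-m", message]
    if parent is not None:
        args += ["-p", parent]
    commit = run_git(git_dir, args + [tree]).decode().strip()
    # Point the branch at the new commit.
    run_git(git_dir, ["update-ref", "refs/heads/" + branch, commit])
    return commit


def read_entry(git_dir, revision, name):
    """Return the bytes of file `name` as stored in `revision`."""
    return run_git(git_dir, ["cat-file", "blob", "{}:{}".format(revision, name)])
```

Passing an older commit as `parent` together with a new branch name would create a branch from an intermediate version in the same way.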

Benchmark results

 
Plot showing the time taken for operations common to our usage scenario for various Version Control Systems

The operations still missing from the benchmark are looking up and checking out intermediate versions (those prior to branching) instead of the latest version of the branch. Since many VCSs store the latest version as-is ("full" copy) and only deltas for the intermediate versions, this might change the timings quite a bit. I'll also need to do a weighted summary to account for the prospective usage pattern (more commits than checkouts due to autosave, and only a few branches created).
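One simple way such a weighted summary could be computed is sketched below. The operation names, the weights and all numbers are placeholders invented for illustration; they are not benchmark results, and "vcs_a"/"vcs_b" do not refer to any real system.

```python
def weighted_total(timings, weights):
    """Sum per-operation times, each scaled by its expected frequency."""
    missing = set(weights) - set(timings)
    if missing:
        raise KeyError("no timing for: " + ", ".join(sorted(missing)))
    return sum(weights[op] * timings[op] for op in weights)


def rank(results, weights):
    """Return system names ordered from lowest to highest weighted total."""
    return sorted(results, key=lambda name: weighted_total(results[name], weights))


# Placeholder weights: many autosave commits, fewer checkouts, rare branches.
weights = {"commit": 10, "checkout": 3, "branch": 1}

# Placeholder timings in seconds for two hypothetical systems.
results = {
    "vcs_a": {"commit": 1.0, "checkout": 4.0, "branch": 2.0},
    "vcs_b": {"commit": 2.5, "checkout": 1.0, "branch": 1.0},
}

ranking = rank(results, weights)
```

With these made-up numbers the unweighted sums favour "vcs_b" (4.5 s against 7.0 s), while the weighted totals favour "vcs_a", which is exactly the kind of shift the weighting is meant to expose.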


 
Plot showing the space occupied after the named operations have finished

While I included my favourite VCS, GNU arch, just out of curiosity (unfortunately it is no longer maintained), it compares very well with the other systems: it is in second place for both total (unweighted) runtime and final repository size, giving a perfect balance between those two goals.