Version support for datastore/Progress
Timeline
For reference, I've copied the timeline from the proposal.
- 2009-04-03
- Application deadline
- 2009-04-12
- Easter (Sunday); UI mockup submitted for review by Design Team
- 2009-04-20
- start of (university) term; announcement of accepted GSoC proposals
- 2009-05-10
- submitted API draft for review by Development Team
- 2009-05-16
- SugarCamp Europe 2009
- 2009-05-23
- start of GSoC
- 2009-05-31
- current code examined and understood; API, on-disk format and UI design chosen
- 2009-06-07
- data store enhanced to be able to deal with versions (basic API)
- 2009-06-14
- added (working) prev/next buttons to Journal details view
- 2009-06-21
- added support for importing from existing data store
- 2009-06-28
- added unit tests (and potentially regression tests), fixed all known bugs, submitted for review by Design Team
- 2009-07-06
- GSoC midterm evaluation ("working and 90% done"); added indexing (e.g. using sqlite)
- 2009-07-13
- code integrated upstream for increased exposure (testing!); started discussion on extended UI design (version tree etc.)
- 2009-07-25
- end of (university) term
- 2009-08-10
- end of GSoC
- 2009-10-31
- Fedora 12 release; Sugar 0.86 release a short time later?
Progress
2009-05-16
SugarCamp was great! I got to know a lot of the SugarLabs people - they're a cool bunch. :)
Time in general was too short to do anything more than getting to know each other,
but Tomeu quickly showed me an old effort at introducing version support into the data store that I didn't know about (both data store and Journal). It looks quite interesting API-wise: it simply adds another, optional parameter called vid that can be used to request a specific version.
Also had some time with Bernie and he offered to set up a host for VMs
(for our build bots).
Unfortunately we also caught a virus in Paris, so I had to stay in bed for the next two weeks.
2009-06-01
Slowly getting up to speed again, diving a bit into the data store code (both old and new) while fixing the build infrastructure (around the time of SugarCamp, GNOME made some fairly large changes to jhbuild that broke our sugar-jhbuild).
Discussed some possible data store API changes with alsroot, but they didn't really interfere with my design and we decided to discuss them again once version support is finished.
Tomeu had the idea of treating the current object_id (distinguishing between instances of an activity) as a combined instance and version identifier and introducing a new "super_object_id" that does what object_id does now. The advantage would be that activities could transparently access old versions. If we introduced a version_id in parallel to object_id (the naive approach taken by the old implementation), an activity (or at least the framework) would need to remember and pass through the resumed version_id in order to use the corresponding branch on save.
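To make the difference concrete, here is a minimal in-memory sketch of the combined-identifier scheme; the Store class, the save method and the field names are hypothetical illustrations, not the actual data store API.

```python
import uuid


class Store:
    """Toy in-memory sketch of the combined-identifier scheme (hypothetical API)."""

    def __init__(self):
        self.entries = {}  # object_id -> record

    def save(self, data, parent_object_id=None):
        # object_id names one specific version; super_object_id groups
        # all versions of one Journal entry.
        if parent_object_id is None:
            super_id = uuid.uuid4().hex  # brand-new entry
        else:
            # Resuming an old version: inherit its group, so the new
            # version transparently branches off the resumed one.
            super_id = self.entries[parent_object_id]["super_object_id"]
        object_id = uuid.uuid4().hex
        self.entries[object_id] = {"super_object_id": super_id,
                                   "parent": parent_object_id,
                                   "data": data}
        return object_id


store = Store()
v1 = store.save("draft 1")
v2 = store.save("draft 2", parent_object_id=v1)
# The activity only ever passes around object_ids; the version grouping
# comes for free, with no version_id to remember:
assert (store.entries[v1]["super_object_id"]
        == store.entries[v2]["super_object_id"])
```

Under the parallel version_id scheme, save would instead need both identifiers as arguments, which is exactly the bookkeeping the combined scheme avoids.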
Started evaluating Version Control Systems (VCS). Of the 27 systems on the comparison list compiled by the "Better SCM Initiative", 13 are open source, with 8 of those shipped by all of Debian, Fedora and Ubuntu (the distributions currently officially supported by sugar-jhbuild).
While writing a benchmark to help in further elimination of candidates, I noticed our use case is actually quite distinct from the one targeted by most systems (which is storing a small number of projects, each carrying source code, i.e. a large number of related files):
- we're going to store a large number of unrelated entries (i.e. "projects" in traditional VCS nomenclature)
- most of our entries are going to be rather small (compared to entire source trees)
- for space efficiency, we don't want to keep working copies around after the activity using it has finished
Point 3 offers an excellent chance for VCSs that expose their low-level primitives (e.g. git) to be tuned to our use case, as we might be able to access the repository directly instead of using the working directory as intermediate storage. It will only affect timing, not repository size, though.
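As an illustration of the direct-repository idea, the sketch below drives git's plumbing commands (hash-object, mktree, commit-tree, update-ref) from Python to commit and read back an entry in a bare repository, without ever creating a working copy. The repository location, the file name "data" and the ref name "entry-1234" are made up for the example; this is not the data store's actual backend code.

```python
import os
import subprocess
import tempfile

# commit-tree needs an author/committer identity; set one explicitly.
ENV = dict(os.environ,
           GIT_AUTHOR_NAME="ds", GIT_AUTHOR_EMAIL="ds@example.org",
           GIT_COMMITTER_NAME="ds", GIT_COMMITTER_EMAIL="ds@example.org")


def git(store, *args, data=None):
    # Run a git plumbing command against the bare repository at `store`.
    return subprocess.run(["git", "--git-dir", store, *args],
                          input=data, stdout=subprocess.PIPE,
                          check=True, env=ENV).stdout


store = tempfile.mkdtemp(suffix=".git")
subprocess.run(["git", "init", "--bare", store], check=True,
               stdout=subprocess.DEVNULL)

# Write the entry's payload straight into the object database (no
# working copy), wrap it in a tree and a commit, then point a branch
# at the new version.
blob = git(store, "hash-object", "-w", "--stdin", data=b"entry payload").strip()
tree = git(store, "mktree", data=b"100644 blob %s\tdata\n" % blob).strip()
commit = git(store, "commit-tree", "-m", "autosave", tree.decode()).strip()
git(store, "update-ref", "refs/heads/entry-1234", commit.decode())

# Reading any stored version back is a single object lookup:
payload = git(store, "cat-file", "blob", blob.decode())
```

The working directory never enters the picture, which is exactly the timing advantage point 3 is after.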
The sample set chosen for the benchmark (789 text files from Project Gutenberg, 295 MB) occupied my desktop for about 10 hours, so although I've only done a single run (in multi-user mode) so far, the numbers should be accurate enough for a first impression.
Benchmark results
The operations still missing from the benchmark are looking up and checking out intermediate versions (prior to the branch point) instead of the latest version of a branch. Since many VCSs store the latest version as-is ("full" copy) and only deltas for the intermediate versions, this might change the timings quite a bit. I'll also need to do a weighted summary to account for the prospective usage pattern (more commits than checkouts due to autosave, and only few branches created).
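Such a weighted summary could look something like the sketch below; the weight values are invented placeholders reflecting the usage pattern just described (frequent autosave commits, fewer checkouts, rare branches), not measured figures.

```python
# Placeholder weights: commits dominate due to autosave, branches are rare.
WEIGHTS = {"commit": 10.0, "checkout": 3.0, "branch": 0.5}


def weighted_runtime(timings):
    """Combine per-operation timings (seconds) into a single score."""
    return sum(WEIGHTS[op] * t for op, t in timings.items())


# Example with made-up per-operation timings for one VCS:
score = weighted_runtime({"commit": 2, "checkout": 4, "branch": 8})
print(score)  # -> 36.0  (10*2 + 3*4 + 0.5*8)
```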
While I included my favourite VCS, GNU arch, just out of curiosity (it is unfortunately no longer maintained), it compares very well with the other systems: it comes in second both for total (unweighted) runtime and for final repository size, striking a good balance between those two goals.
2009-06-08
Finished the planned enhancements to the benchmark. After two runs (with timings differing by about 20% on average) of about 13h each (with only 100 of the Project Gutenberg files) there are a few losers, but no clear winner. Monotone comes close, but doesn't seem to scale well with branches (in both size and runtime) and also has the highest memory consumption.
The results emphasize the need for an abstraction layer so we can switch between different VCSs (depending on which resource is scarcest) or even to a homegrown backend later. For the prototype, git seems to be a good candidate as it combines average resource consumption with a rich API, including low-level access.
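Such an abstraction layer might look roughly like the interface below; the class and method names are illustrative guesses, not a design that exists in the code. A trivial dict-backed implementation stands in for a real git (or other VCS) backend.

```python
from abc import ABC, abstractmethod


class VersionBackend(ABC):
    """Hypothetical backend interface the data store would program against."""

    @abstractmethod
    def commit(self, object_id, data, parent_version=None):
        """Store a new version of an entry; return its version id."""

    @abstractmethod
    def checkout(self, object_id, version_id):
        """Return the data stored for the given version."""


class DictBackend(VersionBackend):
    # Trivial stand-in; a GitBackend driving git plumbing (or any other
    # VCS) could be swapped in without touching data store code.
    def __init__(self):
        self.versions = {}  # (object_id, version_id) -> (data, parent)
        self.next_id = 0

    def commit(self, object_id, data, parent_version=None):
        self.next_id += 1
        self.versions[(object_id, self.next_id)] = (data, parent_version)
        return self.next_id

    def checkout(self, object_id, version_id):
        return self.versions[(object_id, version_id)][0]


backend = DictBackend()
vid = backend.commit("entry-1", b"first draft")
assert backend.checkout("entry-1", vid) == b"first draft"
```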
Benchmark results
Darcs is fast, but needs over four times the original size to store the data.
Monotone starts out very small, but incurs high costs on branches.
Memory usage
Determining reliable figures for memory usage, especially peak usage, seems to require valgrind, which according to its documentation slows program execution down by a factor of 5 (which would mean 65h for the already reduced sample set). Instead I've done a few simple measurements using BSD process accounting for a benchmark run with a single file as the sample set. This should at least give a very rough idea of memory consumption, especially relative to the other VCSs in the run.
VCS | peak memory
arch (incl. tar) | 1340k
bazaar | 5060k
cvs | 957k
darcs | 7385k
git | 1614k
mercurial | 3501k
monotone | 9865k
subversion | 4121k
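For the same kind of rough, relative comparison, a lighter-weight alternative to BSD process accounting would be Python's resource module, which reports the peak resident set size of terminated child processes; this is a swapped-in technique for illustration, not what the measurements above actually used.

```python
import resource
import subprocess
import sys


def child_peak_rss_kb(cmd):
    """Run cmd and return the peak resident set size (kB on Linux)
    seen among all terminated child processes of this interpreter."""
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# e.g. measure one (here trivial) child invocation:
print(child_peak_rss_kb([sys.executable, "-c", "pass"]))
```

Note that ru_maxrss is cumulative across all waited-for children, so each VCS would need to be measured in a fresh interpreter for per-tool figures.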