Difference between revisions of "Version support for datastore/Progress"

From Sugar Labs

Revision as of 07:36, 14 July 2009

Timeline

For reference, I've copied the timeline from the proposal.

2009-04-03
Application deadline
2009-04-12
Easter (Sunday); UI mockup submitted for review by Design Team
2009-04-20
start of (university) term; announcement of accepted GSoC proposals
2009-05-10
submitted API draft for review by Development Team
2009-05-16
SugarCamp Europe 2009
2009-05-23
start of GSoC
2009-05-31
current code examined and understood; API, on-disk format and UI design chosen
2009-06-07
data store enhanced to be able to deal with versions (basic API)
2009-06-14
added (working) prev/next buttons to Journal details view
2009-06-21
added support for importing from existing data store
2009-06-28
added unit tests (and potentially regression tests), fixed all known bugs, submitted for review by Design Team
2009-07-06
GSoC midterm evaluation ("working and 90% done"); added indexing (e.g. using sqlite)
2009-07-13
code integrated upstream for increased exposure (testing!); started discussion on extended UI design (version tree etc.)
2009-07-25
end of (university) term
2009-08-10
end of GSoC
2009-10-31
Fedora 12 release; Sugar 0.86 release shortly afterwards?

Progress

2009-05-16

SugarCamp was great! I got to know a lot of the SugarLabs people - they're a cool bunch. :)

There was too little time to do more than get to know each other, but Tomeu quickly showed me an old effort at introducing version support (in both the data store and the Journal) that I didn't know about. It looks quite interesting API-wise: it simply adds another, optional parameter called vid that can be used to request a specific version. I also had some time with Bernie, and he offered to set up a host for VMs (for our build bots).

Unfortunately we also caught some virus in Paris, so I had to stay in bed for the next two weeks.

2009-06-01

Slowly getting up to speed again, diving a bit into the data store code (both old and new) while fixing the build infrastructure (around the time we ran SugarCamp, GNOME made some largish changes to jhbuild that broke our sugar-jhbuild).

Discussed some possible data store API changes with alsroot, but they didn't really interfere with my design and we decided to discuss them again after the version support is finished.

Tomeu had the idea of treating the current object_id (which distinguishes between instances of an activity) as a combined instance and version identifier, and introducing a new "super_object_id" that does what object_id does now. The advantage would be that activities could transparently access old versions. If we introduced a version_id in parallel to object_id (the naive approach taken by the old implementation), an activity (or at least the framework) would need to remember and pass through the resumed version_id in order to use the corresponding branch on save.

Started evaluating Version Control Systems (VCSs). Of the 27 systems on the comparison list compiled by the "Better SCM Initiative", 13 are open source, and 8 of those are shipped by all of Debian, Fedora and Ubuntu (the systems currently officially supported in sugar-jhbuild).

While writing a benchmark to help in further elimination of candidates, I noticed our use case is actually quite distinct from that targeted by most systems (which is storing a small number of projects each carrying source code, i.e. a large number of related files):

  1. we're going to store a large number of unrelated entries (i.e. "projects" in traditional VCS nomenclature)
  2. most of our entries are going to be rather small (compared to entire source trees)
  3. for space efficiency, we don't want to keep working copies around after the activity using it has finished

Point 3 offers an excellent chance for VCSs that expose their low-level working primitives (e.g. git) to be tuned to our use case, as we might be able to access the repository directly instead of using the working directory as intermediate storage. It will only affect timing, not repository size, though.
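To illustrate what "directly accessing the repository" could look like with git, here is a minimal sketch that stores and retrieves versions in a bare repository using only git's plumbing commands (hash-object, mktree, commit-tree, cat-file), so no working copy is ever created. The helper names (init_repo, store_version, load_version), the fixed file name "data" and the placeholder identity are assumptions made for illustration. This is not the code used in the benchmark or the prototype.

```python
import os
import subprocess


def _git(repo, *args, data=None):
    """Run a git command against the bare repository and return its stdout."""
    # Placeholder identity (an assumption) so commit-tree works without user config.
    env = dict(os.environ,
               GIT_AUTHOR_NAME="datastore", GIT_AUTHOR_EMAIL="datastore@localhost",
               GIT_COMMITTER_NAME="datastore", GIT_COMMITTER_EMAIL="datastore@localhost")
    result = subprocess.run(["git", "--git-dir", repo, *args], input=data,
                            stdout=subprocess.PIPE, check=True, env=env)
    return result.stdout


def init_repo(repo):
    """Create a bare repository, i.e. one without a working directory."""
    subprocess.run(["git", "init", "--quiet", "--bare", repo], check=True)


def store_version(repo, content, parent=None):
    """Write content (bytes) as a new commit and return the commit id."""
    # 1. Store the raw bytes as a blob object.
    blob = _git(repo, "hash-object", "-w", "--stdin", data=content).decode().strip()
    # 2. Wrap the blob in a tree with a single entry called "data".
    entry = "100644 blob {}\tdata\n".format(blob).encode()
    tree = _git(repo, "mktree", data=entry).decode().strip()
    # 3. Create a commit pointing at that tree, optionally chained to a parent.
    args = ["commit-tree", "-m", "autosave"]
    if parent is not None:
        args += ["-p", parent]
    return _git(repo, *args, tree).decode().strip()


def load_version(repo, commit):
    """Return the bytes stored in the given commit."""
    return _git(repo, "cat-file", "blob", commit + ":data")
```

A caller would run init_repo() once per entry, then chain store_version() calls by passing the previous commit id as parent. Starting a branch is just a matter of passing an older commit id as parent instead.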

The sample set chosen for the benchmark (789 text files from Project Gutenberg, 295 MB) occupied my desktop for about 10 hours, so while I've only done a single run so far (in multi-user mode), the numbers should be accurate enough for an initial impression.

Benchmark results

Plot showing the time taken for operations common to our usage scenario for various Version Control Systems

The operations still missing from the benchmark are looking up and checking out intermediate versions (prior to branching from them) instead of the latest version of a branch. Since many VCSs store the latest version as-is ("full" copy) and only deltas for the intermediate versions, this might change the timings quite a bit. I'll also need to do a weighted summary to account for the prospective usage pattern (more commits than checkouts due to autosave, and only a few branches created).
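A weighted summary of this kind could be computed roughly as follows. This is only a sketch: the operation names, the weights and the timings are made-up placeholders (not benchmark results), chosen merely to reflect "more commits than checkouts, few branches".

```python
def weighted_runtime(timings, weights):
    """Return the weighted sum of per-operation timings (in seconds)."""
    missing = set(weights) - set(timings)
    if missing:
        raise KeyError("no timing for: " + ", ".join(sorted(missing)))
    return sum(weights[op] * timings[op] for op in weights)


# Hypothetical usage pattern: autosave makes commits dominate, branches are rare.
USAGE_WEIGHTS = {"commit": 0.70, "checkout": 0.25, "branch": 0.05}

# Placeholder timings for two imaginary systems, NOT measured values.
example_timings = {
    "vcs_a": {"commit": 2.0, "checkout": 1.0, "branch": 4.0},
    "vcs_b": {"commit": 1.0, "checkout": 3.0, "branch": 1.0},
}

# Rank the systems from fastest to slowest under the assumed usage pattern.
ranking = sorted(
    example_timings,
    key=lambda name: weighted_runtime(example_timings[name], USAGE_WEIGHTS),
)
```

With these placeholder numbers the system with the cheaper commit ranks first even though its checkout is slower, which is the kind of shift a weighted summary is meant to expose.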


Plot showing the space occupied after the named operations have finished

While I included my favourite VCS, GNU arch, just out of curiosity (unfortunately it's not maintained anymore), it compares very well with the other systems: it's in second place for both total (unweighted) runtime and final repository size, striking a good balance between those two goals.

2009-06-08

Finished the planned enhancements to the benchmark. After two runs of about 13h each (with only 100 of the Project Gutenberg files, and timings differing by about 20% on average between runs) there are a few losers, but no clear winner. Monotone comes close, but doesn't seem to scale well with branches (both size and runtime) and also has the highest memory consumption.

The results emphasize the need to use an abstraction layer so we can switch between different VCSs (depending on what resource is scarcest) or even to a homegrown backend later. For the prototype git seems to be a good candidate as it combines average resource consumption with a rich API, including low-level access.
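Such an abstraction layer might look roughly like the sketch below: a small abstract interface plus a registry that selects a backend by name. All class, method and registry names here are hypothetical and do not reflect the actual backend API. The in-memory backend only stands in for a real VCS, so that the example is self-contained.

```python
from abc import ABC, abstractmethod


class VCSBackend(ABC):
    """Hypothetical minimal interface every storage backend would implement."""

    @abstractmethod
    def commit(self, object_id, data, parent=None):
        """Store data as a new version and return its version id."""

    @abstractmethod
    def checkout(self, object_id, version_id):
        """Return the data stored for the given version."""

    @abstractmethod
    def get_versions(self, object_id):
        """Return all version ids known for the object."""


class InMemoryBackend(VCSBackend):
    """Stand-in backend that keeps everything in a dict (no delta compression)."""

    def __init__(self):
        self._versions = {}  # object_id -> list of (data, parent version id)

    def commit(self, object_id, data, parent=None):
        versions = self._versions.setdefault(object_id, [])
        versions.append((data, parent))
        return len(versions) - 1  # version id is the position in the list

    def checkout(self, object_id, version_id):
        return self._versions[object_id][version_id][0]

    def get_versions(self, object_id):
        return list(range(len(self._versions.get(object_id, []))))


# A git, monotone or homegrown backend would be registered here as well.
BACKENDS = {"memory": InMemoryBackend}


def create_backend(name):
    """Instantiate the backend registered under the given name."""
    return BACKENDS[name]()
```

Because callers only ever see the VCSBackend interface, switching to a different VCS would come down to registering another class and changing the name passed to create_backend().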


Benchmark results

Summary of the final run of the VCS benchmark

Darcs is fast, but needs over four times the original size to store the data.

Plots showing the repository size summary, the runtime summary, the repository sizes for monotone and the timings for monotone

Monotone starts out very small, but incurs high costs on branches.

Memory usage

Determining reliable figures for memory usage, especially peak usage, seems to require using valgrind, which according to its documentation slows down program execution by a factor of 5 (which would mean 65h for the already reduced sample set). Instead I've done a few simple measurements using BSD process accounting for a run of the benchmark with a single file as the sample set. This should at least give a very rough idea of memory consumption, especially relative to the other VCSs in the run.

arch (incl. tar) 1340k
bazaar 5060k
cvs 957k
darcs 7385k
git 1614k
mercurial 3501k
monotone 9865k
subversion 4121k


2009-06-15

Created git VCS backend. There's room for performance improvement and code simplification, but it's fully working, including getBranches() and getVersions().

Got a working prototype (sugar, sugar-datastore, sugar-toolkit). It treats each version as a separate object, unrelated to the others - so no topology, no branches. But it's working and the basic user experience (details view allows selecting old versions, changing metadata and resuming them) is there.

The API has been changed to add a version_id in parallel to object_id. This seemed cleanest, but it also needs invasive code changes (though most activities will continue to work as-is, since the framework (sugar-toolkit) usually handles save/resume for them). Breaking the API is almost unavoidable - see my API choosing notes for details. But I noticed that in most cases we want to identify a single version, so we could alter the Python side of the API to do that by default, in a single parameter - e.g. by passing a tuple (object_id, version_id) - instead of always passing around two separate parameters. In the few places where we want to identify all versions of an object, we can do that explicitly.
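The single-parameter idea can be sketched as follows. VersionRef and ToyDataStore are hypothetical names invented for this illustration. The real data store API looks different and is accessed via D-Bus rather than through a plain Python class.

```python
from typing import NamedTuple


class VersionRef(NamedTuple):
    """Identifies exactly one version of one object."""
    object_id: str
    version_id: str


class ToyDataStore:
    """In-memory stand-in used only to demonstrate the calling convention."""

    def __init__(self):
        self._entries = {}  # (object_id, version_id) -> data

    def save(self, ref, data):
        # ref may be a VersionRef or a plain (object_id, version_id) tuple.
        object_id, version_id = ref
        self._entries[(object_id, version_id)] = data

    def load(self, ref):
        # The common case: one parameter identifies a single version.
        object_id, version_id = ref
        return self._entries[(object_id, version_id)]

    def find_versions(self, object_id):
        # The rare case: asking for all versions of an object is explicit.
        return sorted(v for (o, v) in self._entries if o == object_id)
```

Since a NamedTuple is still a tuple, callers that only pass the reference through (e.g. an activity resuming and later saving) never need to unpack it, while code that cares can use ref.object_id and ref.version_id.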

And as I'm breaking the API already, I'm thinking about tweaking it with regard to (a)synchronous operation - the code suggests it's currently half synchronous, half asynchronous. For UI responsiveness, the "heavy" operations should be done without blocking the activity. So basically the activity provides new content, and the data store accepts it into a queue and adds it to the VCS backend after the activity has regained control (this is partially being done today already - see the optimizer code). Hardlinking and crashes (power loss) need to be considered carefully.
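The queueing idea can be sketched with a worker thread: save() only enqueues and returns immediately, and the expensive write happens in the background. AsyncSaver and its methods are hypothetical names, and using a thread is an assumption made for the example (the data store itself might rather use main loop idle callbacks). Note that content which is queued but not yet written would be lost in a crash, which is exactly the concern raised above.

```python
import queue
import threading


class AsyncSaver:
    """Accepts content immediately and writes it to the backend later."""

    def __init__(self, backend_write):
        self._write = backend_write    # callable(object_id, data); the slow part
        self._pending = queue.Queue()
        self.errors = []               # exceptions raised by failed writes
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def save(self, object_id, data):
        """Enqueue the content and return without waiting for the write."""
        self._pending.put((object_id, data))

    def _run(self):
        while True:
            item = self._pending.get()
            try:
                if item is None:       # sentinel: shut the worker down
                    return
                self._write(*item)
            except Exception as exc:   # keep the worker alive on failed writes
                self.errors.append(exc)
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until everything queued so far has been processed."""
        self._pending.join()

    def close(self):
        """Process the remaining items, then stop the worker thread."""
        self._pending.put(None)
        self._worker.join()
```

For example, AsyncSaver(store.__setitem__) with a plain dict called store behaves like a write-behind cache: after save("a", b"1") followed by flush(), store contains the entry.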

I also need to choose how to save metadata: either use the current journal format, extended to save a separate entry per version (as in the prototype) while storing the data in a VCS per object, or move to a standard database like sqlite. As with the data, crash recovery needs to be carefully considered.
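For the sqlite option, a minimal sketch could keep one row per version, keyed by (object_id, version_id), and rely on sqlite transactions for crash safety of the metadata. The table layout, the JSON encoding of the properties and the function names are assumptions for illustration, not a chosen design.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS metadata (
    object_id  TEXT NOT NULL,
    version_id TEXT NOT NULL,
    properties TEXT NOT NULL,  -- JSON-encoded metadata dictionary
    PRIMARY KEY (object_id, version_id)
)
"""


def open_index(path=":memory:"):
    """Open (and if necessary create) the metadata database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_metadata(conn, object_id, version_id, properties):
    """Insert or overwrite the metadata entry of a single version."""
    # "with conn" commits on success and rolls back if an exception occurs,
    # so a failed write never leaves a half-written entry behind.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
            (object_id, version_id, json.dumps(properties)),
        )


def load_metadata(conn, object_id, version_id):
    """Return the metadata dictionary of a version, or None if unknown."""
    row = conn.execute(
        "SELECT properties FROM metadata WHERE object_id = ? AND version_id = ?",
        (object_id, version_id),
    ).fetchone()
    return None if row is None else json.loads(row[0])
```

The metadata entries stay per version while the data itself can live in a VCS per object, with the two linked only through the (object_id, version_id) pair.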

2009-06-29

Worked on a proposal (follow "raw blob data" link) for a full redesign of the datastore that can store data in a VCS with delta compression support and has a well-defined, asynchronous, ACID API (currently e.g. activity_id is used by Record in an unintended way because there's no real definition of it).

2009-07-06

Started changing current prototype to match the API for the proposed redesign. Lots of bugs in code written by other people (e.g. #1040, #1042 and #1053) made testing and finding my own bugs quite hard.

The datastore part is more or less done; the API consumer part (Journal etc.) currently uses some compatibility wrappers and needs to be adapted to use the new API. I'll probably change from distinct tree_id / version_id parameters to a combined object_id one (using a tuple) while I'm at it.