Changes

→‎2009-06-15: fix links
Line 76: Line 76:  
would need to remember and pass through the resumed <code>version_id</code> in order
 
 
to use the corresponding branch on save.
 
 +
 +
Started evaluating Version Control Systems (VCS). Of the 27 systems on the
 +
[http://better-scm.berlios.de/comparison/comparison.html comparison list] done by
 +
the "Better SCM Initiative", 13 are open source with 8 of them being shipped by
 +
all of Debian, Fedora and Ubuntu (the systems currently officially supported
 +
in [[Development Team/Jhbuild|sugar-jhbuild]]).
 +
 +
While writing a [http://git.sugarlabs.org/projects/versionsupport-project/repos/mainline/trees/master/benchmarks benchmark]
 +
to help in further elimination of candidates, I noticed our use case is actually quite distinct from that targeted by most systems
 +
(which is storing a small number of projects each carrying source code, i.e. a large number of '''related''' files):
 +
 +
# we're going to store a large number of '''un'''related entries (i.e. "projects" in traditional VCS nomenclature)
 +
# most of our entries are going to be rather small (compared to entire source trees)
 +
# for space efficiency, we don't want to keep working copies around after the activity using them has finished
 +
 +
Point 3 offers an excellent chance for VCSs that expose their low-level working primitives (e.g. git) to be tuned to our
 +
use case as we might be able to directly access the repository instead of using the working directory as intermediate
 +
storage. It will only affect timing, not repository size, though.
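
As an illustration of bypassing the working directory, here is a minimal sketch that writes an entry straight into a bare git repository using git's plumbing commands (<code>hash-object</code>, <code>mktree</code>, <code>commit-tree</code>, <code>update-ref</code>, <code>cat-file</code>). The helper names, the file name <code>data</code>, the branch name and the commit identity are all assumptions made for this example; this is not the code used in the benchmark.

```python
import os
import subprocess

# Hypothetical identity so commit-tree works without any user configuration.
_IDENTITY = {
    "GIT_AUTHOR_NAME": "datastore", "GIT_AUTHOR_EMAIL": "datastore@localhost",
    "GIT_COMMITTER_NAME": "datastore", "GIT_COMMITTER_EMAIL": "datastore@localhost",
}


def _git(repo, *args, data=None):
    """Run a git command against the bare repository and return its stdout."""
    env = dict(os.environ, **_IDENTITY)
    result = subprocess.run(["git", "--git-dir=" + repo, *args], input=data,
                            stdout=subprocess.PIPE, check=True, env=env)
    return result.stdout


def init_repo(repo):
    """Create a bare repository, i.e. one without a working directory."""
    subprocess.run(["git", "init", "--quiet", "--bare", repo], check=True)


def commit_bytes(repo, data, parent=None, branch="refs/heads/main"):
    """Store data as a new version without ever creating a working copy."""
    blob = _git(repo, "hash-object", "-w", "--stdin", data=data).decode().strip()
    # A tree with a single file called "data" (name chosen for this sketch).
    tree_spec = "100644 blob {}\tdata\n".format(blob).encode()
    tree = _git(repo, "mktree", data=tree_spec).decode().strip()
    args = ["commit-tree", tree, "-m", "autosave"]
    if parent is not None:
        args += ["-p", parent]
    commit = _git(repo, *args).decode().strip()
    _git(repo, "update-ref", branch, commit)
    return commit


def read_bytes(repo, commit):
    """Read the stored content of a given version directly from the repository."""
    return _git(repo, "cat-file", "blob", commit + ":data")
```

Each call to <code>commit_bytes()</code> returns the commit hash, which could serve as the version identifier; passing it as <code>parent</code> on the next save chains the versions together.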
 +
 +
The sample set chosen for the benchmark (789 text files from [http://www.gutenberg.org/ Project Gutenberg], 295 MB) occupied
 +
my desktop for about 10 hours, so I've only done a single run (in multi-user mode) so far, but the numbers should be accurate
 +
enough for an initial impression.
 +
 +
==== Benchmark results ====
 +
 +
[[Image:Op-vs-time.png|thumb|Plot showing the time taken for operations common to our usage scenario for various Version Control Systems]]
 +
 +
The operations still missing from the benchmark are looking up and checking out intermediate versions (prior to branching it)
 +
instead of the latest version of the branch. Since many VCSs store the latest version as-is ("full" copy) and only deltas
 +
of the intermediate versions, this might change the timings quite a bit. I'll also need to do a weighted summary to account
 +
for the prospective usage pattern (more commits than checkouts due to autosave and only few branches created).
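
A weighted summary of this kind could be computed roughly as follows. This is only a sketch: the weights, the system names and the timings are made-up placeholders chosen to show the arithmetic, not benchmark results.

```python
def weighted_total(timings, weights):
    """Sum per-operation timings, each scaled by how often the operation is expected."""
    missing = set(timings) - set(weights)
    if missing:
        raise KeyError("no weight for: {}".format(sorted(missing)))
    return sum(timings[op] * weights[op] for op in timings)


# Placeholder usage pattern: many commits (autosave), fewer checkouts, few branches.
weights = {"commit": 10, "checkout": 3, "branch": 1}

# Placeholder timings in seconds for two imaginary systems.
timings = {
    "vcs_a": {"commit": 2.0, "checkout": 1.0, "branch": 5.0},
    "vcs_b": {"commit": 1.0, "checkout": 4.0, "branch": 0.5},
}

# Systems ordered from lowest to highest weighted runtime.
ranking = sorted(timings, key=lambda name: weighted_total(timings[name], weights))
```

With these placeholder numbers a system with slow checkouts can still rank first, because commits dominate the assumed usage pattern.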
 +
 +
 +
[[Image:Op-vs-size.png|thumb|Plot showing the space occupied after the named operations have finished]]
 +
 +
While I included my favourite VCS, [http://www.gnu.org/software/gnu-arch/ GNU arch], just out of curiosity
 +
(unfortunately [http://lists.gnu.org/archive/html/gnu-arch-users/2008-11/msg00001.html not maintained anymore]),
 +
it compares very well with the other systems: it's in second place both for total (unweighted) runtime and
 +
final repository size, giving perfect balance between those two goals.
 +
 +
=== 2009-06-08 ===
 +
 +
Finished the planned enhancements to the benchmark. After two runs (with the timings differing by about 20% on average) of about 13h each
 +
(with only 100 of the Project Gutenberg files) there are a few losers, but no clear winner. Monotone comes close but doesn't seem
 +
to scale well with branches (both size and runtime) and also has the highest memory consumption.
 +
 +
The results emphasize the need to use an abstraction layer so we can switch between different VCSs (depending on what resource is
 +
scarcest) or even to a homegrown backend later. For the prototype git seems to be a good candidate as it combines average
 +
resource consumption with a rich API, including low-level access.
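
One possible shape for such an abstraction layer, sketched in Python. The class and method names are assumptions for illustration and not the interface of the actual backends; the in-memory backend only exists to show that callers never need to know which VCS sits underneath.

```python
from abc import ABC, abstractmethod


class VCSBackend(ABC):
    """Hypothetical interface that every backend (git, monotone, homegrown) would implement."""

    @abstractmethod
    def commit(self, entry_id, data, parent=None):
        """Store data as a new version of the entry and return its version id."""

    @abstractmethod
    def checkout(self, entry_id, version_id):
        """Return the content stored for the given version."""

    @abstractmethod
    def get_versions(self, entry_id):
        """Return all version ids known for the entry."""


class InMemoryBackend(VCSBackend):
    """Trivial backend keeping full copies in memory; stands in for a real VCS."""

    def __init__(self):
        self._entries = {}

    def commit(self, entry_id, data, parent=None):
        versions = self._entries.setdefault(entry_id, [])
        versions.append((parent, bytes(data)))
        return len(versions) - 1

    def checkout(self, entry_id, version_id):
        return self._entries[entry_id][version_id][1]

    def get_versions(self, entry_id):
        return list(range(len(self._entries.get(entry_id, []))))
```

Code written against <code>VCSBackend</code> could then be pointed at a different implementation without changes, which is the property needed to switch systems depending on the scarcest resource.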
 +
 +
 +
==== Benchmark results ====
 +
 +
[[Image:Vcs-benchmark-run-2-total.png|thumb|Summary of the final run of the VCS benchmark]]
 +
 +
Darcs is fast, but needs over four times the original size to store the data.
 +
 +
[[Image:Vcs-benchmark-run-2-op-vs-size.png|thumb|Repository size summary]]
 +
 +
[[Image:Vcs-benchmark-run-2-op-vs-time.png|thumb|Runtime summary]]
 +
 +
[[Image:Vcs-benchmark-run-2-op-vs-size-monotone.png|thumb|Repository sizes for monotone]]
 +
[[Image:Vcs-benchmark-run-2-op-vs-time-monotone.png|thumb|Timings for monotone]]
 +
 +
Monotone starts out very small, but incurs high costs on branches.
 +
 +
==== Memory usage ====
 +
 +
Determining reliable figures for memory usage, especially peak usage, seems to require using valgrind, which - according to its
 +
documentation - slows down program execution by a factor of 5 (which would mean 65h for the already reduced sample set).
 +
Instead I've done a few simple measurements using [http://www.gnu.org/software/acct/ BSD process accounting] for a run of the
 +
benchmark with a single file as sample set. This should at least give a very rough idea of memory consumption, especially
 +
relative to the other VCSs in the run.
 +
 +
{|
 +
|-
 +
|arch (incl. tar)
 +
|align="right" | 1340k
 +
|-
 +
|bazaar
 +
|align="right" | 5060k
 +
|-
 +
|cvs
 +
|align="right" | 957k
 +
|-
 +
|darcs
 +
|align="right" | 7385k
 +
|-
 +
|git
 +
|align="right" | 1614k
 +
|-
 +
|mercurial
 +
|align="right" | 3501k
 +
|-
 +
|monotone
 +
|align="right" | 9865k
 +
|-
 +
|subversion
 +
|align="right" | 4121k
 +
|-
 +
|}
 +
 +
 +
=== 2009-06-15 ===
 +
 +
Created [http://git.sugarlabs.org/projects/versionsupport-project/repos/mainline/trees/master/backends git VCS backend]. There's
 +
room for performance improvement and code simplification, but it's fully working, including getBranches() and getVersions().
 +
 +
Got a working prototype
 +
([http://git.sugarlabs.org/projects/sugar/repos/versionsupport sugar],
 +
[http://git.sugarlabs.org/projects/sugar-datastore/repos/versionsupport sugar-datastore],
 +
[http://git.sugarlabs.org/projects/sugar-toolkit/repos/versionsupport sugar-toolkit]). It treats each version as a separate object, unrelated to the others - so no topology, no branches. But it's working and the basic user experience (details view allows selecting old versions, changing metadata and resuming them) is there.
 +
 +
The API has been changed to add a version_id in parallel to object_id. This seemed cleanest, but it also requires invasive code changes (though most activities will continue to work as-is, since the framework (sugar-toolkit) usually handles save/resume for them). Breaking the API is almost unavoidable - see [http://git.sugarlabs.org/projects/versionsupport-project/repos/mainline/blobs/master/API-choosing-notes.txt my API choosing notes] for details. But I noticed that in most cases we want to identify a single version, so we could alter the Python side of the API to do that by default with a single parameter - e.g. by passing a tuple (object_id, version_id) - instead of always passing around two separate parameters. In the few places where we want to identify all versions of an object, we can do that explicitly.
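
A minimal sketch of the single-parameter idea, assuming a hypothetical <code>VersionRef</code> tuple type and a toy in-memory <code>Store</code> class (neither exists in sugar; they only illustrate the calling convention):

```python
from collections import namedtuple

# One value identifies exactly one version of one object.
VersionRef = namedtuple("VersionRef", ["object_id", "version_id"])


class Store:
    """Toy store keyed by (object_id, version_id) tuples."""

    def __init__(self):
        self._data = {}

    def save(self, ref, content):
        self._data[tuple(ref)] = content

    def load(self, ref):
        return self._data[tuple(ref)]

    def versions_of(self, object_id):
        """The explicit call for the rare case where all versions are wanted."""
        return sorted(vid for oid, vid in self._data if oid == object_id)
```

Because a named tuple compares equal to a plain tuple with the same values, callers could pass either <code>VersionRef("doc", "v1")</code> or <code>("doc", "v1")</code>.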
 +
 +
And as I'm already breaking the API, I'm thinking about tweaking it with regard to (a)synchronous operation - the code suggests it's currently half synchronous, half asynchronous. For UI responsiveness, the "heavy" operations should be done without blocking the activity. So basically the activity provides new content, the data store accepts it into a queue and adds it to the VCS backend after the activity has regained control (this is already partially being done today - see the optimizer code). Hardlinking and crashes (power loss) need to be considered carefully.
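
The queue-based flow could look roughly like this. It is a sketch using a worker thread; <code>AsyncStore</code> and the <code>backend_commit</code> callable are hypothetical names, and the real datastore (a DBus service) would more likely hook into its main loop instead.

```python
import queue
import threading


class AsyncStore:
    """Accepts content immediately and hands it to the backend in the background."""

    def __init__(self, backend_commit):
        self._commit = backend_commit      # callable(entry_id, data) doing the heavy work
        self._queue = queue.Queue()
        self.errors = []                   # failures collected for later inspection
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def save(self, entry_id, data):
        """Return to the caller right away; the commit happens later."""
        self._queue.put((entry_id, bytes(data)))

    def flush(self):
        """Block until everything queued so far has been processed."""
        self._queue.join()

    def _run(self):
        while True:
            entry_id, data = self._queue.get()
            try:
                self._commit(entry_id, data)
            except Exception as exc:
                self.errors.append((entry_id, exc))
            finally:
                self._queue.task_done()
```

Note that this sketch keeps pending data only in memory, so it does nothing about the crash (power loss) concern; anything still in the queue would be lost.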
 +
 +
Also need to choose how to save metadata: either use the current journal format, extended to save a separate entry per '''version''' (as in the prototype) while storing the data in a VCS per '''object''', or move to a standard database like sqlite. As with the data, crash recovery needs to be carefully considered.
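
For the database option, a per-version metadata table might look like the following sketch using Python's <code>sqlite3</code> module. The table layout and function names are assumptions, not a decided design.

```python
import sqlite3


def open_metadata(path=":memory:"):
    """Open (or create) the metadata database; one row per version and property."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS metadata (
               object_id  TEXT NOT NULL,
               version_id TEXT NOT NULL,
               key        TEXT NOT NULL,
               value      TEXT,
               PRIMARY KEY (object_id, version_id, key)
           )"""
    )
    return db


def set_metadata(db, object_id, version_id, properties):
    """Write all properties of one version in a single transaction."""
    with db:  # commits on success, rolls back if an exception is raised
        db.executemany(
            "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, ?)",
            [(object_id, version_id, key, value) for key, value in properties.items()],
        )


def get_metadata(db, object_id, version_id):
    """Return the properties of one version as a dictionary."""
    rows = db.execute(
        "SELECT key, value FROM metadata WHERE object_id = ? AND version_id = ?",
        (object_id, version_id),
    )
    return dict(rows)
```

Writing each version's properties inside one transaction means a crash leaves either all or none of them stored, which covers part of the crash recovery concern for metadata.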
 +
 +
=== 2009-06-29 ===
 +
 +
Worked on a [http://git.sugarlabs.org/projects/versionsupport-project/repos/mainline/blobs/master/datastore-redesign.html proposal] (follow "raw blob data" link) for a full redesign of the datastore that can store data in a VCS with delta compression support and has a well-defined, asynchronous, [http://en.wikipedia.org/wiki/ACID ACID] API (currently e.g. <code>activity_id</code> is used by Record in an unintended way because there's no real definition of it).
 +
 +
=== 2009-07-06 ===
 +
 +
Started changing current prototype to match the API for the proposed redesign. Lots of bugs in code written by other people (e.g. [http://dev.sugarlabs.org/ticket/1040 #1040], [http://dev.sugarlabs.org/ticket/1042 #1042] and [http://dev.sugarlabs.org/ticket/1053 #1053]) made testing and finding my own bugs quite hard.
 +
 +
The datastore part is more or less done, the API consumer part (Journal etc.) currently uses some compatibility wrappers and needs to be adapted to use the new API. I'll probably change from distinct <code>tree_id</code> / <code>version_id</code> parameters to a combined <code>object_id</code> one (using a tuple) while I'm at it.
 +
 +
=== 2009-07-13 ===
 +
 +
Worked on changing the Python APIs and code (i.e. sugar and sugar-toolkit) to use the new datastore API. This included changing several modules that directly interfaced with the datastore DBus API to use the Python-side API (sugar.datastore.datastore) instead. Now only the Journal uses the DBus API directly (because the way it currently works internally is too different from the way sugar.datastore.datastore works).
 +
 +
=== 2009-07-20 ===
 +
 +
Added support for Xapian prefix search and started working with Tomeu on merging my changes into mainline. Some small fixes are already in ([http://dev.sugarlabs.org/ticket/1040 #1040], [http://dev.sugarlabs.org/ticket/1053 #1053], [http://dev.sugarlabs.org/ticket/1059 #1059]); for the larger one ([http://dev.sugarlabs.org/ticket/1090 #1090]) I've now got feedback from Tomeu and am working on changing it accordingly (mostly stylistic issues).
 +
 +
=== 2009-07-27 ===
 +
 +
Will need to work on merging Tomeu's latest changes (esp. in sugar and sugar-toolkit) into my tree since the metacity change broke my sugar-jhbuild installation (even a standard sugar-jhbuild installation - i.e. without version support for datastore - is currently quite buggy; even Terminal isn't working properly).