Version support for datastore/Progress: Difference between revisions
Sascha silbe (talk | contribs) →2009-06-01: report on VCS evaluation progress |
Sascha silbe (talk | contribs) →2009-06-15: fix links |
it's comparing very well with the other systems: it's in second place for both total (unweighted) runtime and
final repository size, giving a perfect balance between those two goals.
=== 2009-06-08 ===
Finished the planned enhancements to the benchmark. After two runs of about 13h each (with the timings differing by about 20% on average,
and using only 100 of the Project Gutenberg files), there are a few losers but no clear winner. Monotone comes close, but doesn't seem
to scale well with branches (in both size and runtime) and also has the highest memory consumption.
The results emphasize the need for an abstraction layer, so we can switch between different VCSs (depending on which resource is
scarcest) or even to a homegrown backend later. For the prototype, git seems to be a good candidate, as it combines average
resource consumption with a rich API, including low-level access.
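A minimal sketch of what such an abstraction layer could look like in Python. The class and method names (<code>VCSBackend</code>, <code>store</code>, <code>load</code>, <code>get_versions</code>) are hypothetical and chosen for illustration only; they are not taken from the actual backend code, and the in-memory class merely stands in for a real VCS.

```python
from abc import ABC, abstractmethod


class VCSBackend(ABC):
    """Hypothetical interface the datastore would program against."""

    @abstractmethod
    def store(self, object_id, data):
        """Store data as a new version of object_id; return its version id."""

    @abstractmethod
    def load(self, object_id, version_id):
        """Return the data saved for the given version."""

    @abstractmethod
    def get_versions(self, object_id):
        """Return all version ids of object_id, oldest first."""


class InMemoryBackend(VCSBackend):
    """Toy backend keeping everything in a dict (stand-in for git etc.)."""

    def __init__(self):
        self._objects = {}  # object_id -> list of stored data, by version

    def store(self, object_id, data):
        versions = self._objects.setdefault(object_id, [])
        versions.append(data)
        return len(versions) - 1

    def load(self, object_id, version_id):
        return self._objects[object_id][version_id]

    def get_versions(self, object_id):
        return list(range(len(self._objects.get(object_id, []))))
```

Code written against <code>VCSBackend</code> would not need to change when the concrete backend is swapped, which is the property the benchmark results argue for.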
==== Benchmark results ====
[[Image:Vcs-benchmark-run-2-total.png|thumb|Summary of the final run of the VCS benchmark]]
Darcs is fast, but needs over four times the original size to store the data.
[[Image:Vcs-benchmark-run-2-op-vs-size.png|thumb|Repository size summary]]
[[Image:Vcs-benchmark-run-2-op-vs-time.png|thumb|Runtime summary]]
[[Image:Vcs-benchmark-run-2-op-vs-size-monotone.png|thumb|Repository sizes for monotone]]
[[Image:Vcs-benchmark-run-2-op-vs-time-monotone.png|thumb|Timings for monotone]]
Monotone starts out very small, but incurs high costs on branches.
==== Memory usage ====
Determining reliable figures for memory usage, especially peak usage, seems to require valgrind, which according to its
documentation slows down program execution by a factor of 5 (which would mean 65h for the already reduced sample set).
Instead I've done a few simple measurements using [http://www.gnu.org/software/acct/ BSD process accounting] for a run of the
benchmark with a single file as the sample set. This should at least give a very rough idea of memory consumption, especially
relative to the other VCSs in the run.
{|
! VCS
! Memory usage
|-
|arch (incl. tar)
|align="right" | 1340k
|-
|bazaar
|align="right" | 5060k
|-
|cvs
|align="right" | 957k
|-
|darcs
|align="right" | 7385k
|-
|git
|align="right" | 1614k
|-
|mercurial
|align="right" | 3501k
|-
|monotone
|align="right" | 9865k
|-
|subversion
|align="right" | 4121k
|}
=== 2009-06-15 ===
Created a [http://git.sugarlabs.org/projects/versionsupport-project/repos/mainline/trees/master/backends git VCS backend]. There's
room for performance improvement and code simplification, but it's fully working, including getBranches() and getVersions().
Got a working prototype
([http://git.sugarlabs.org/projects/sugar/repos/versionsupport sugar],
[http://git.sugarlabs.org/projects/sugar-datastore/repos/versionsupport sugar-datastore],
[http://git.sugarlabs.org/projects/sugar-toolkit/repos/versionsupport sugar-toolkit]). It treats each version as a separate object, unrelated to the others, so there is no topology and there are no branches. But it works, and the basic user experience (the details view allows selecting old versions, changing their metadata and resuming them) is there.
The API has been changed to add a version_id alongside object_id. This seemed cleanest, but it also requires invasive code changes (though most activities will continue to work as-is, since the framework (sugar-toolkit) usually handles save/resume for them). Breaking the API is almost unavoidable; see [http://git.sugarlabs.org/projects/versionsupport-project/repos/mainline/blobs/master/API-choosing-notes.txt my API choosing notes] for details. However, I noticed that in most cases we want to identify a single version, so we could alter the Python side of the API to do that by default with a single parameter, e.g. a tuple (object_id, version_id), instead of always passing around two separate parameters. In the few places where we want to identify all versions of an object, we can do that explicitly.
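As a rough illustration of the single-parameter idea, the following Python sketch passes one (object_id, version_id) tuple around. The names <code>VersionRef</code>, <code>describe</code> and <code>all_versions</code> are hypothetical and only serve the example; they are not part of the prototype.

```python
from collections import namedtuple

# Hypothetical combined identifier; the field names follow the text above.
VersionRef = namedtuple("VersionRef", ["object_id", "version_id"])


def describe(ref):
    """Accept a single parameter identifying exactly one version."""
    object_id, version_id = ref  # a plain 2-tuple unpacks the same way
    return "%s@%s" % (object_id, version_id)


def all_versions(refs, object_id):
    """The explicit case: select every known version of one object."""
    return [ref for ref in refs if ref[0] == object_id]
```

Because a named tuple is still a tuple, callers that only have a plain <code>(object_id, version_id)</code> pair can pass it unchanged.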
And since I'm breaking the API already, I'm thinking about tweaking it with regard to (a)synchronous operation: the code suggests it's currently half synchronous, half asynchronous. For UI responsiveness, the "heavy" operations should be done without blocking the activity. So basically the activity provides new content, and the data store accepts it into a queue and adds it to the VCS backend after the activity has regained control (this is already partially done today; see the optimizer code). Hardlinking and crashes (power loss) need to be considered carefully.
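The queueing idea could be sketched roughly as follows in Python. This is an illustration under assumptions, not the datastore's actual design: <code>QueuedStore</code> and the <code>commit</code> callable are hypothetical names, and the in-memory queue deliberately ignores the crash and hardlinking questions raised above.

```python
import queue
import threading


class QueuedStore:
    """Accept saves immediately and commit them on a background thread."""

    def __init__(self, commit):
        self._commit = commit  # hypothetical callable(object_id, data)
        self._pending = queue.Queue()
        self.errors = []       # exceptions raised by commit, for inspection
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def save(self, object_id, data):
        """Return right away; the slow commit happens later."""
        self._pending.put((object_id, data))

    def flush(self):
        """Block until every queued save has been processed."""
        self._pending.join()

    def _run(self):
        while True:
            object_id, data = self._pending.get()
            try:
                self._commit(object_id, data)
            except Exception as exc:  # keep the worker alive on failure
                self.errors.append(exc)
            finally:
                self._pending.task_done()
```

Anything still sitting in the queue is lost on power failure, so a real implementation would have to persist pending entries before returning from <code>save</code>.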
Also need to choose how to save metadata: either use the current journal format, extended to save a separate entry per '''version''' (as in the prototype) while storing the data in a VCS per '''object''', or move to a standard database like sqlite. As with the data, crash recovery needs to be carefully considered.
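For the sqlite option, a minimal sketch with one metadata entry per version might look like this, using Python's standard <code>sqlite3</code> module. The table layout and the function names are assumptions made for illustration, not a decided schema.

```python
import sqlite3


def open_metadata(path=":memory:"):
    """Open (or create) a hypothetical per-version metadata table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        " object_id TEXT NOT NULL,"
        " version_id TEXT NOT NULL,"
        " key TEXT NOT NULL,"
        " value TEXT,"
        " PRIMARY KEY (object_id, version_id, key))"
    )
    return conn


def save_metadata(conn, object_id, version_id, properties):
    """Write all properties of one version in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, ?)",
            [(object_id, version_id, k, v) for k, v in properties.items()],
        )


def load_metadata(conn, object_id, version_id):
    """Return the properties of one version as a dict."""
    rows = conn.execute(
        "SELECT key, value FROM metadata"
        " WHERE object_id = ? AND version_id = ?",
        (object_id, version_id),
    )
    return dict(rows.fetchall())
```

Writing each version's properties inside one transaction means a crash leaves either the complete entry or none of it, which addresses part of the recovery concern.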
=== 2009-06-29 ===
Worked on a [http://git.sugarlabs.org/projects/versionsupport-project/repos/mainline/blobs/master/datastore-redesign.html proposal] (follow the "raw blob data" link) for a full redesign of the datastore that can store data in a VCS with delta compression support and has a well-defined, asynchronous, [http://en.wikipedia.org/wiki/ACID ACID] API (currently, for example, <code>activity_id</code> is used by Record in an unintended way because there's no real definition of it).
=== 2009-07-06 ===
Started changing the current prototype to match the API for the proposed redesign. Lots of bugs in code written by other people (e.g. [http://dev.sugarlabs.org/ticket/1040 #1040], [http://dev.sugarlabs.org/ticket/1042 #1042] and [http://dev.sugarlabs.org/ticket/1053 #1053]) made testing and finding my own bugs quite hard.
The datastore part is more or less done; the API consumer part (Journal etc.) currently uses some compatibility wrappers and needs to be adapted to use the new API. I'll probably change from distinct <code>tree_id</code> / <code>version_id</code> parameters to a combined <code>object_id</code> one (using a tuple) while I'm at it.
=== 2009-07-13 ===
Worked on changing the Python APIs and code (i.e. sugar and sugar-toolkit) to use the new datastore API. This included changing several modules that directly interfaced with the datastore DBus API to use the Python-side API (sugar.datastore.datastore) instead. Now only the Journal uses the DBus API directly (because the way it currently works internally is too different from the way sugar.datastore.datastore works).
=== 2009-07-20 ===
Added support for Xapian prefix search and started working with Tomeu on merging my changes into mainline. Some small fixes are already in ([http://dev.sugarlabs.org/ticket/1040 #1040], [http://dev.sugarlabs.org/ticket/1053 #1053], [http://dev.sugarlabs.org/ticket/1059 #1059]); for the larger one ([http://dev.sugarlabs.org/ticket/1090 #1090]) I've now got feedback from Tomeu and am working on changing it accordingly (mostly stylistic issues).
=== 2009-07-27 ===
Will need to work on merging Tomeu's latest changes (esp. in sugar and sugar-toolkit) into my tree, since the metacity change broke my sugar-jhbuild installation (even a standard sugar-jhbuild installation, i.e. without version support for the datastore, is currently quite buggy; even Terminal isn't working properly).