Difference between revisions of "Version support for datastore/Progress"
Sascha silbe (talk | contribs) (report on progress) |
Sascha silbe (talk | contribs) (→2009-06-01: report on VCS evaluation progress) |
Revision as of 09:53, 5 June 2009
Time line
For reference, I've copied the timeline from the proposal.
- 2009-04-03
- Application deadline
- 2009-04-12
- Easter (Sunday); UI mockup submitted for review by Design Team
- 2009-04-20
- start of (university) term; announcement of accepted GSoC proposals
- 2009-05-10
- submitted API draft for review by Development Team
- 2009-05-16
- SugarCamp Europe 2009
- 2009-05-23
- start of GSoC
- 2009-05-31
- current code examined and understood; API, on-disk format and UI design chosen
- 2009-06-07
- data store enhanced to be able to deal with versions (basic API)
- 2009-06-14
- added (working) prev/next buttons to Journal details view
- 2009-06-21
- added support for importing from existing data store
- 2009-06-28
- added unit tests (and potentially regression tests), fixed all known bugs, submitted for review by Design Team
- 2009-07-06
- GSoC midterm evaluation ("working and 90% done"); added indexing (e.g. using sqlite)
- 2009-07-13
- code integrated upstream for increased exposure (testing!); started discussion on extended UI design (version tree etc.)
- 2009-07-25
- end of (university) term
- 2009-08-10
- end of GSoC
- 2009-10-31
- Fedora 12 release; Sugar 0.86 release a short time later?
Progress
2009-05-16
SugarCamp was great! I got to know a lot of the SugarLabs people - they're a cool bunch. :)
Time in general was too short to do anything more than getting to know each other,
but Tomeu quickly showed me an old effort at introducing version support into data store that I didn't know about (both data store and Journal).
It looks quite interesting API-wise (simply adds another, optional parameter called vid that can be used to request a specific version).
Also had some time with Bernie and he offered to set up a host for VMs
(for our build bots).
Unfortunately we also caught some virus in Paris, so I had to stay in bed for the next two weeks.
2009-06-01
Slowly getting up to speed again, diving a bit into the data store code (both old and new) while fixing the build infrastructure (around the time of SugarCamp, GNOME made some fairly large changes to jhbuild that broke our sugar-jhbuild).
Discussed some possible data store API changes with alsroot, but they didn't really interfere with my design and we decided to discuss them again after the version support is finished.
Tomeu had the idea of treating the current object_id (distinguishing between instances of an activity) as a combined instance and version identifier and introducing a new "super_object_id" that does what object_id does now. The advantage would be that activities could transparently access old versions. If we introduced a version_id in parallel to object_id (the naive approach taken by the old implementation), an activity (or at least the framework) would need to remember and pass through the resumed version_id in order to use the corresponding branch on save.
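To make the difference concrete, here is a minimal in-memory sketch of the combined-identifier idea. It is only an illustration: the class ToyVersionedStore and all of its method names are hypothetical and are not part of the existing data store API. Every version gets its own object_id, and a separate super_object_id groups the versions of one entry.

```python
import uuid


class ToyVersionedStore:
    """In-memory illustration only; NOT the Sugar data store API."""

    def __init__(self):
        # object_id -> (super_object_id, parent object_id or None, data)
        self._versions = {}

    def _add(self, super_object_id, parent_id, data):
        object_id = uuid.uuid4().hex
        self._versions[object_id] = (super_object_id, parent_id, data)
        return object_id

    def create(self, data):
        """Start a new entry; returns the object_id of its first version."""
        return self._add(uuid.uuid4().hex, None, data)

    def save(self, object_id, data):
        """Store a new version derived from object_id; returns the new object_id."""
        super_object_id, _parent, _data = self._versions[object_id]
        return self._add(super_object_id, object_id, data)

    def load(self, object_id):
        """Return the data of exactly the version named by object_id."""
        return self._versions[object_id][2]

    def parent(self, object_id):
        """Return the object_id this version was derived from (None for the first)."""
        return self._versions[object_id][1]

    def super_object_id(self, object_id):
        """Return the identifier shared by all versions of the same entry."""
        return self._versions[object_id][0]
```

An activity that resumes an old version only has to hand back the object_id it was resumed with. The store can then derive the branch point on its own, so no extra version_id needs to be passed through the activity or the framework.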
Started evaluating Version Control Systems (VCS). Of the 27 systems on the comparison list (http://better-scm.berlios.de/comparison/comparison.html) compiled by the "Better SCM Initiative", 13 are open source, and 8 of those are shipped by all of Debian, Fedora and Ubuntu (the systems currently officially supported in sugar-jhbuild).
While writing a benchmark (http://git.sugarlabs.org/projects/versionsupport-project/repos/mainline/trees/master/benchmarks) to help eliminate further candidates, I noticed that our use case is quite distinct from the one targeted by most systems (storing a small number of projects, each carrying source code, i.e. a large number of related files):
1. we're going to store a large number of unrelated entries (i.e. "projects" in traditional VCS nomenclature)
2. most of our entries are going to be rather small (compared to entire source trees)
3. for space efficiency, we don't want to keep working copies around after the activity using them has finished
Point 3 offers an excellent opportunity for VCSs that expose their low-level primitives (e.g. git) to be tuned to our use case, as we might be able to access the repository directly instead of using the working directory as intermediate storage. This will only affect timing, though, not repository size.
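As a rough illustration of what "directly accessing the repository" could look like for git, the sketch below writes file content straight into git's loose object store without any working copy. The function names are hypothetical, and this is not what the benchmark does. It relies only on git's documented loose object format (a "blob <size>" header, a NUL byte and the content, hashed with SHA-1 and compressed with zlib). A blob alone is not yet a version; tree and commit objects would still be needed on top, and a real implementation would more likely call git's own plumbing commands.

```python
import hashlib
import os
import zlib


def git_blob_id(data: bytes) -> str:
    """Compute the object ID git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def write_loose_blob(git_dir: str, data: bytes) -> str:
    """Store data as a loose blob under git_dir/objects; returns its ID."""
    object_id = git_blob_id(data)
    directory = os.path.join(git_dir, "objects", object_id[:2])
    path = os.path.join(directory, object_id[2:])
    if not os.path.exists(path):  # identical content is stored only once
        os.makedirs(directory, exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(zlib.compress(b"blob %d\0" % len(data) + data))
    return object_id


def read_loose_blob(git_dir: str, object_id: str) -> bytes:
    """Read a loose blob back without checking anything out."""
    path = os.path.join(git_dir, "objects", object_id[:2], object_id[2:])
    with open(path, "rb") as handle:
        raw = zlib.decompress(handle.read())
    _header, _separator, body = raw.partition(b"\0")
    return body
```

Because both functions operate on bytes in memory, an entry could be saved and restored without ever creating a working directory, which is the timing advantage described above.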
The sample set chosen for the benchmark (789 text files from Project Gutenberg, http://www.gutenberg.org/, 295 MB) occupied my desktop for about 10 hours. Although I have only done a single run so far (in multi-user mode), the numbers should be accurate enough for a first impression.
Benchmark results
Figure (Op-vs-time.png): Plot showing the time taken for operations common to our usage scenario for various Version Control Systems
The operations still missing from the benchmark are looking up and checking out intermediate versions (prior to branching) instead of the latest version of a branch. Since many VCSs store the latest version as-is (a "full" copy) and only deltas for the intermediate versions, this might change the timings quite a bit. I'll also need to do a weighted summary to account for the prospective usage pattern (more commits than checkouts due to autosave, and only a few branches created).
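One possible shape for such a weighted summary is sketched below. The function name, the operation names and all numbers are made-up placeholders, not benchmark results; the weights merely express "commits happen more often than checkouts, branches are rare".

```python
def weighted_total(timings, weights):
    """Sum per-operation timings, each scaled by its expected frequency."""
    missing = set(weights) - set(timings)
    if missing:
        raise KeyError("no timing for: %s" % ", ".join(sorted(missing)))
    return sum(timings[op] * weights[op] for op in weights)


# Placeholder values for illustration only (seconds per operation).
example_timings = {"commit": 2.0, "checkout": 1.0, "branch": 4.0}
# Placeholder weights: expected number of operations per usage cycle.
example_weights = {"commit": 10, "checkout": 3, "branch": 1}

example_score = weighted_total(example_timings, example_weights)
```

Computing one such score per VCS would allow ranking the candidates by expected real-world cost rather than by the unweighted sum of all operations.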
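Figure (Op-vs-size.png): Plot showing the space occupied after the named operations have finished
Although I included my favourite VCS, GNU arch (http://www.gnu.org/software/gnu-arch/), just out of curiosity (unfortunately it is no longer maintained, see http://lists.gnu.org/archive/html/gnu-arch-users/2008-11/msg00001.html), it compares very well with the other systems: it is in second place for both total (unweighted) runtime and final repository size, which gives it a very good balance between those two goals.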