Development Team/Datastore Rewrite


Revision as of 07:23, 30 September 2008

Goals

Reliability

A good DataStore doesn't lose data easily.

Performance

Queries should be fast enough for the journal to be very responsive when browsing its contents.

Activities should be able to store their data quickly and present a fast UI to their users.

The shell should be able to quickly query the DS to allow the user to resume entries from views other than the journal.

Maintainability

The original implementation pursued goals that were hard to achieve and that proved unnecessary at this stage. This made the code base needlessly complex, and several changes to the requirements added considerable confusion to it. We wish to focus the code on what is really needed and do it well.

Custom metadata properties

Activities should be able to store whatever metadata they wish in their entries, rather than being limited to a predefined set.

More efficient file storage

Identical files should be stored just once.

Versioned entries (not fulfilled yet)

Entries may be related in version trees.

Design

Filesystem knows which entries are stored

By examining the directory structure, we know where the data related to each entry is located. We no longer depend on a binary structure that could become corrupted and unusable as a whole.
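
As an illustration of this idea, here is a minimal Python 3 sketch that enumerates entries purely by walking the directory tree. It assumes the layout shown later on this page (a two-character directory containing one directory per entry); the function name list_entry_uids is hypothetical and not taken from the actual source code.

```python
import os


def list_entry_uids(root):
    """Return the uids of all entries found under a datastore root.

    Assumes the layout <root>/<first two characters of uid>/<uid>/.
    Directories such as 'checksums' and 'index' are skipped because
    their names are not two characters long.
    """
    uids = []
    for prefix in sorted(os.listdir(root)):
        prefix_dir = os.path.join(root, prefix)
        if len(prefix) != 2 or not os.path.isdir(prefix_dir):
            continue
        for name in sorted(os.listdir(prefix_dir)):
            # An entry directory is named by its uid, which starts with the prefix.
            if name.startswith(prefix) and os.path.isdir(os.path.join(prefix_dir, name)):
                uids.append(name)
    return uids
```

No separate catalogue file is consulted: if an entry directory exists, the entry exists.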

Metadata is stored in a single file per property

Metadata for each entry is stored in several files, one per property. This way, if one of those properties became corrupted, the rest of the entry (and the other entries in the DS) would be unaffected.
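
A rough Python 3 sketch of how one-file-per-property storage isolates damage. The helper names write_metadata and read_metadata are hypothetical, and the sketch assumes that all property values are UTF-8 text, which may not hold for every property (a preview, for instance, could be binary).

```python
import os


def write_metadata(entry_dir, properties):
    """Write each property to its own file under <entry_dir>/metadata."""
    metadata_dir = os.path.join(entry_dir, "metadata")
    os.makedirs(metadata_dir, exist_ok=True)
    for name, value in properties.items():
        with open(os.path.join(metadata_dir, name), "w", encoding="utf-8") as f:
            f.write(value)


def read_metadata(entry_dir):
    """Read every property file, skipping any that cannot be read or decoded."""
    metadata_dir = os.path.join(entry_dir, "metadata")
    properties = {}
    for name in os.listdir(metadata_dir):
        try:
            with open(os.path.join(metadata_dir, name), encoding="utf-8") as f:
                properties[name] = f.read()
        except (OSError, UnicodeDecodeError):
            # A damaged property is dropped; the others are still returned.
            continue
    return properties
```

With this arrangement an unreadable title file costs only the title; mime_type, timestamp and the rest are still available.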

Queries are accelerated with a disposable database

A database lets us query the stored entries efficiently. Because it is used only to accelerate queries, we can drop and recreate it if it becomes corrupted or if we update to an incompatible database format.
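
To show why such an index is disposable, the Python 3 sketch below rebuilds one from nothing but the metadata files. A plain dict stands in for the real search database, and the names rebuild_index and query are hypothetical; the actual implementation may work quite differently.

```python
import os


def rebuild_index(root):
    """Rebuild a {uid: {property: value}} mapping from the files on disk.

    The dict is only a stand-in for the search database. Nothing in it is
    authoritative, so it can be thrown away and rebuilt at any time.
    """
    index = {}
    for prefix in os.listdir(root):
        prefix_dir = os.path.join(root, prefix)
        if len(prefix) != 2 or not os.path.isdir(prefix_dir):
            continue  # skips 'checksums' and 'index'
        for uid in os.listdir(prefix_dir):
            metadata_dir = os.path.join(prefix_dir, uid, "metadata")
            if not os.path.isdir(metadata_dir):
                continue
            properties = {}
            for name in os.listdir(metadata_dir):
                path = os.path.join(metadata_dir, name)
                with open(path, encoding="utf-8", errors="replace") as f:
                    properties[name] = f.read()
            index[uid] = properties
    return index


def query(index, **wanted):
    """Return the sorted uids whose properties match all the given values."""
    return sorted(
        uid for uid, props in index.items()
        if all(props.get(key) == value for key, value in wanted.items())
    )
```

Because rebuild_index reads only the entry directories, losing the index costs time but never data.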

Detect identical files and hard-link them

This improves storage efficiency in general, but it matters more in our case because we wish to record in the journal several interactions that refer to the same file. For example, "Downloaded lesson3.pdf", "Read lesson3.pdf", "Sent lesson3.pdf to Juan" would all refer to the same file, which we need to store only once.
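
A minimal Python 3 sketch of this technique, assuming MD5 checksums and the checksums directory described in the layout section of this page. The function names file_md5 and store_file are hypothetical, and details such as when the checksum links are created are guesses rather than a description of the real code. It relies on os.link and os.symlink, so it assumes a filesystem that supports both.

```python
import hashlib
import os
import shutil


def file_md5(path):
    """Return the hexadecimal MD5 checksum of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_file(root, uid, source_path):
    """Store source_path as the data file of entry uid.

    If a file with the same checksum is already stored, the new entry
    gets a hard link to it instead of a second copy.
    """
    entry_dir = os.path.join(root, uid[:2], uid)
    os.makedirs(entry_dir, exist_ok=True)
    data_path = os.path.join(entry_dir, "data")

    checksum = file_md5(source_path)
    checksum_dir = os.path.join(root, "checksums", checksum)
    os.makedirs(checksum_dir, exist_ok=True)

    existing = os.listdir(checksum_dir)
    if existing:
        # Identical content is already stored: share it through a hard link.
        other_uid = existing[0]
        os.link(os.path.join(root, other_uid[:2], other_uid, "data"), data_path)
    else:
        shutil.copyfile(source_path, data_path)

    # Record that this entry holds a file with this checksum.
    os.symlink(os.path.abspath(data_path), os.path.join(checksum_dir, uid))
    return checksum
```

Storing lesson3.pdf under three different entries would then occupy the space of a single copy, since all three data files are links to the same content.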

Layout on disk

The proposed implementation relies heavily on the data structures provided by the filesystem, so the layout of files on disk is a fundamental part of its design.

Example of a datastore containing 5 entries, two of which refer to the same file (with checksum 464493d8d929436b6152e868867ed451). In the listing, -> marks a hard link and ~> marks a symbolic link:

1a
      1ab88287-766a-4d98-a7c0-4233dc48647a
            data
            metadata
                  activity_id
                  mime_type
                  preview
                  share-scope
                  timestamp
                  title
2b
      2b90597c-0912-4e7f-8eeb-71a0f004490d
            data
            checksum ~> checksums/464493d8d929436b6152e868867ed451
            metadata
                  activity_id
                  mime_type
                  preview
                  share-scope
                  timestamp
                  title
3c
      3cdf5f0e-7595-4166-b1f9-cbedfcfe1c4a
            data -> 2b/2b90597c-0912-4e7f-8eeb-71a0f004490d/data
            checksum ~> checksums/464493d8d929436b6152e868867ed451
            metadata
                  activity_id
                  mime_type
                  preview
                  share-scope
                  timestamp
                  title
4d
      4db11d29-2f07-4452-bd8e-22a6a483ac19
            data
            metadata
                  activity_id
                  mime_type
                  preview
                  share-scope
                  timestamp
                  title
      4d9f2027-b41e-4015-a848-6b3972193eb8
            data
            metadata
                  activity_id
                  mime_type
                  preview
                  share-scope
                  timestamp
                  title
checksums
      464493d8d929436b6152e868867ed451
            2b90597c-0912-4e7f-8eeb-71a0f004490d ~> 2b90597c-0912-4e7f-8eeb-71a0f004490d/data
            3cdf5f0e-7595-4166-b1f9-cbedfcfe1c4a ~> 3cdf5f0e-7595-4166-b1f9-cbedfcfe1c4a/data
index
      flintlock
      iamflint
      postlist.baseA
      postlist.baseB
      postlist.DB
      record.baseA
      record.baseB
      record.DB
      termlist.baseA
      termlist.baseB
      termlist.DB
      value.baseA
      value.baseB
      value.DB

1a: directory holding entries. Its only function is to avoid having too many directories in a single directory, as this is considered especially harmful on JFFS2.

1a/1ab88287-...-4233dc48647a: directory holding the files related to one entry

1a/1ab88287-...-4233dc48647a/data: file related to an entry

1a/1ab88287-...-4233dc48647a/metadata: directory containing a file for each metadata property of an entry

2b/2b90597c-...-71a0f004490d/metadata/activity_id: file containing the value of the activity_id property

3c/3cdf5f0e-...-cbedfcfe1c4a/data: hard link to the same file in the entry 2b90597c-...-71a0f004490d

3c/3cdf5f0e-...-cbedfcfe1c4a/checksum ~> checksums/464493d8d929436b6152e868867ed451: symbolic link to the directory in checksums. Used to get the checksum of the entry without having to recalculate it or read it from the metadata file

checksums: directory containing one directory per file contained in the DS, named by the file's MD5 checksum

checksums/464493d8d929436b6152e868867ed451: directory containing links to all the entries that contain a file with this checksum

checksums/464493d8d929436b6152e868867ed451/2b90597c-...-71a0f004490d: symbolic link to a file in an entry with this checksum.

index: directory containing all files that belong to the search database. Can be deleted and recreated from the rest of the DS if needed, without incurring data loss.
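
The path conventions listed above can be summarised in a few lines of Python 3. This is only a sketch: the helper names and the root directory name "datastore" are hypothetical and chosen for illustration.

```python
import os


def entry_dir(root, uid):
    # Entries are grouped by the first two characters of their uid.
    return os.path.join(root, uid[:2], uid)


def data_path(root, uid):
    # The file related to an entry.
    return os.path.join(entry_dir(root, uid), "data")


def property_path(root, uid, name):
    # One file per metadata property.
    return os.path.join(entry_dir(root, uid), "metadata", name)


def checksum_dir(root, checksum):
    # Directory linking to every entry whose file has this checksum.
    return os.path.join(root, "checksums", checksum)


root = "datastore"  # hypothetical root directory
uid = "2b90597c-0912-4e7f-8eeb-71a0f004490d"
print(property_path(root, uid, "activity_id"))
```

Every location is derived from the uid or the checksum alone, so no lookup table is needed to find an entry's files.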

Source code

http://dev.laptop.org/git?p=users/tomeu/datastore;a=summary