Difference between revisions of "Activities/Wikipedia/HowTo"

From Sugar Labs

Revision as of 16:32, 28 December 2011

"Create one, two, three... a thousand Wikipedias" Comandante Ernesto Wales

How to create a new Wikipedia activity or update an existing activity

This page describes how to generate the data files needed to create a Wikipedia activity like Wikipedia es or Wikipedia en.

The general idea is to download an XML dump file (backup) containing the current Wikipedia pages for a given language. This file is then processed to select certain pages and compress them into a self-contained Sugar activity. Whether or not you include the images from the wiki articles will have a large impact on the size of the activity.

Generating a Wikipedia activity requires a computer with a lot of available disk space, ideally lots of RAM, and a working Sugar environment. It is probably best to run Sugar from the packages provided by your favorite Linux distribution, or in a virtual machine. The Wikipedia XML file is very large (almost 6 GB for the Spanish Wikipedia, and even bigger for English), and you will need lots of space for the temporary files. The process takes a long time to run. It is mostly automated, but you will need to confirm that each stage succeeded before moving on to the next.

This page is a work in progress. If you have questions or the information provided is not adequate, please contact me at gonzalo at laptop dot org and I will try to improve it.

Download the Wikipedia base activity

You will need to download the Wikipedia base package from http://dev.laptop.org/~gonzalo/wikibase.zip. This package includes the activity and the tools to create the data files.

You need to create a directory in your Activities directory, for example WikipediaEs.activity, and unzip wikibase.zip inside it.
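If you prefer to script this step, the following is a minimal sketch using only the Python standard library. The function name unpack_base and its parameters are my own choices for illustration and are not part of the base package. It assumes wikibase.zip has already been downloaded.

```python
import zipfile
from pathlib import Path


def unpack_base(zip_path, activities_dir, activity_name="WikipediaEs.activity"):
    """Create the activity directory and extract the base package into it.

    unpack_base is a hypothetical helper, not part of wikibase.zip.
    """
    target = Path(activities_dir) / activity_name
    # Create Activities/WikipediaEs.activity if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, unpack_base("wikibase.zip", "/home/user/Activities") would extract the package into /home/user/Activities/WikipediaEs.activity (the home directory path is only an example).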

Download a Wikipedia dump file

Wikipedia provides daily (or nearly daily) XML dump files for each language.

This test was done with the Spanish dump. The file used was eswiki-20111112-pages-articles.xml.bz2 from http://dumps.wikimedia.org/eswiki/20110810/

You need to create a directory inside the activity directory you created earlier, and download the Wikipedia dump file there.

The first two letters of the directory name must be the language code, for example es_es or en_us.

mkdir es_lat
cd es_lat
bzip2 -d eswiki-20111112-pages-articles.xml.bz2
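As an alternative to the bzip2 command, the decompression can be sketched in Python. This is only an illustration: decompress_dump is a name I made up, and unlike bzip2 -d it keeps the compressed file, so you need disk space for both copies.

```python
import bz2
import shutil


def decompress_dump(bz2_path, xml_path, chunk_size=1024 * 1024):
    """Decompress a .bz2 dump to xml_path without loading it all in memory.

    decompress_dump is a hypothetical helper. The data is copied in
    chunks of chunk_size bytes because the XML file is several GB.
    """
    with bz2.open(bz2_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return xml_path
```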

Process the dump file

You need to edit the file tools2/config.py and modify the variable input_xml_file_name. If you are creating a Wikipedia in a language other than Spanish, modify the values of the other variables too.

After saving config.py, you can process the dump file:

../tools2/pages_parser.py

The process generates files with the same prefix as the uncompressed XML file. This first step creates the following files:

eswiki-20111112-pages-articles.xml.links
eswiki-20111112-pages-articles.xml.page_templates
eswiki-20111112-pages-articles.xml.redirects
eswiki-20111112-pages-articles.xml.templates
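Since each stage should be confirmed before moving on, a small check like the one below can tell you whether the four files listed above exist. This is a sketch of my own, not one of the tools in tools2; missing_outputs is a hypothetical name, and it only checks that the files are present, not that their content is complete.

```python
from pathlib import Path

# Suffixes of the files listed above, appended to the XML file name.
EXPECTED_SUFFIXES = (".links", ".page_templates", ".redirects", ".templates")


def missing_outputs(xml_file):
    """Return the names of the expected output files that do not exist."""
    xml_file = Path(xml_file)
    return [
        xml_file.name + suffix
        for suffix in EXPECTED_SUFFIXES
        if not (xml_file.parent / (xml_file.name + suffix)).is_file()
    ]
```

An empty list from missing_outputs("eswiki-20111112-pages-articles.xml") means all four files were found next to the XML file.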

If you want more information, use the option "--debug". It also creates the files prefix.all_pages, prefix.blacklisted and prefix.titles, which can help you investigate what the process did.

With the Spanish file and a bit more than 2.3M pages, this process takes approximately 1 hour 30 minutes on my system.

Make a selection of pages

To create a selection of pages to be included in the Wikipedia activity, you need to create two files: favorites.txt and blacklist.txt. Each file is a list of page titles.

The criterion used to select the pages is: all the pages in favorites.txt are included, and all the pages linked from these pages too, except the pages listed in blacklist.txt.
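The selection rule can be sketched as a small set operation. This is not the code of make_selection.py, only an illustration of the rule as described. The function select_pages and the in-memory links dictionary are hypothetical, and I am assuming that the blacklist also removes a page that appears in the favorites.

```python
def select_pages(favorites, blacklist, links):
    """Return the set of selected page titles.

    favorites, blacklist: iterables of page titles.
    links: dict mapping a title to the titles it links to
           (a hypothetical in-memory form of the link data).
    """
    selected = set()
    for title in favorites:
        selected.add(title)
        # Only one level of links is followed: pages linked from a
        # favorite are included, but their own links are not.
        selected.update(links.get(title, ()))
    # Assumption: blacklisted titles are removed from the whole selection.
    return selected - set(blacklist)
```

With favorites ["A"], links {"A": ["B", "C"], "B": ["D"]} and blacklist ["C"], the result contains A and B only: C is blacklisted and D is two links away from a favorite.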

Our Wikipedia activity has a static index page (static/index_lang.html), and you can create your own index page. I created my favorites.txt with all the pages linked from the index page (130 pages in the case of Spanish) plus 300 more pages selected from statistics of the most searched pages in Wikipedia (http://stats.grok.se).

The relation between the number of favorite pages and the final number of selected pages is not linear, but you can use these numbers as a reference:

           Favorites           Total selected pages
               130                     15788
               431                     45105
               544                     63347
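To see why the relation is not linear, you can compute the number of selected pages per favorite from the three rows above. The sketch below only uses the figures in the table; the names SAMPLES and pages_per_favorite are mine. The ratio moves between roughly 105 and 121, so treat it as a rough guide when estimating the size of a selection.

```python
# Figures copied from the table above: (favorites, total selected pages).
SAMPLES = [(130, 15788), (431, 45105), (544, 63347)]


def pages_per_favorite(samples):
    """Return the ratio total/favorites for each row of the table."""
    return [total / favorites for favorites, total in samples]


for (favorites, total), ratio in zip(SAMPLES, pages_per_favorite(SAMPLES)):
    print("{:>4} favorites -> {:>6} pages ({:.1f} per favorite)".format(
        favorites, total, ratio))
```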

The files favorites.txt and blacklist.txt should be in the same directory as the .xml file.

To create the selection, run:

../tools2/make_selection.py

After you have created the selection, you can look at the list of pages in the file pages_selected-level-1 and modify your favorites.txt and/or blacklist.txt files and rerun the process if you want to modify the selection.

Create the index

../tools2/create_index.py 

This process is faster. When it finishes, you can run a basic test:

../tools2/test_index.py page_title

This will show you the content of the selected page in wiki format. At this stage the lookup is slow, because it needs to search for every template and do all the template substitutions. To get faster results, we will apply the template substitutions to all the pages in advance.

Optimize the data and download images

To expand the templates, you need to leave the data directory:

cd ..
./tools2/expandtemplates.py es_lat

When it finishes, go back to the data directory:

cd es_lat

Then replace the processed file with the new file that has the templates expanded, and recreate the index:

mv eswiki-20111112-pages-articles.xml.processed_expanded eswiki-20111112-pages-articles.xml.processed
../tools2/create_index.py --delete_all

The option --delete_all is used to remove the old index.

If you want to include images in your Wikipedia activity, go to your data directory (skip the cd command if you are already there) and run:

cd es_lat
../tools2/download_images.py

This command downloads the images included in the pages listed in favorites.txt. If you want to include the images from all the pages, run:

../tools2/download_images.py --all

The option --cache_dir=directory is available to speed up the process if you already have images downloaded in another directory.

Modify your activity to use the data files

You can edit the file activity_es.py and change the lines:

       self.WIKIDB = 'es_new/eswiki-20111112-pages-articles.xml'
       self.HOME_PAGE = '/static/index_es.html'

to point to your new data files, or create a new file, for example activity_pt.py.

If you create a new file, you will need to modify the file activity/activity.info to point to it.

Now you can test your changes by starting the Wikipedia server:

./server.py es_lat/eswiki-20111112-pages-articles.xml 8000

The first parameter is your XML data file and the second parameter is the port number.

In any web browser on the same computer, you can open a page to check that it is working.

In this example, we look at the page "Energía" by pointing the browser to http://localhost:8000/wiki/Energía

Wikipedia test.png
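The same check can be done from a script instead of a browser. The following is a minimal sketch, assuming the server started above is listening on port 8000. The helper names page_url and page_is_served are hypothetical. Non-ASCII titles such as "Energía" are percent-encoded, which is what a browser does behind the scenes.

```python
from urllib.parse import quote
from urllib.request import urlopen


def page_url(title, port=8000, host="localhost"):
    """Build the URL of an article, percent-encoding non-ASCII characters."""
    return "http://{}:{}/wiki/{}".format(host, port, quote(title))


def page_is_served(title, port=8000):
    """Return True if the local server answers with HTTP 200 for the page.

    Raises urllib.error.URLError (or its subclass HTTPError) if the
    server is not running or answers with an error status.
    """
    with urlopen(page_url(title, port), timeout=10) as response:
        return response.status == 200
```

Calling page_is_served("Energía") only makes sense while server.py is running, so start the server first.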

You can create a new icon too, or modify the existing activity/activity-wikipedia-es.svg file.

If you are creating a new Wikipedia activity, it is important to change the name and the bundle_id in the activity/activity.info file. If you are updating the data in an existing activity, the activity_version value must be changed.

Finally, to create the new .xo file and distribute it, run:
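These changes to activity.info can also be made with a short script. The sketch below assumes the file is an INI-style file with an [Activity] section and that activity_version is a plain integer; check your own file before relying on it. The function update_activity_info is a hypothetical helper, and the values in the usage note are made-up examples.

```python
import configparser
import io


def update_activity_info(text, name=None, bundle_id=None, bump_version=False):
    """Return the activity.info text with the requested fields changed.

    Assumes an [Activity] section and an integer activity_version.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["Activity"]
    if name is not None:
        section["name"] = name
    if bundle_id is not None:
        section["bundle_id"] = bundle_id
    if bump_version:
        # Updating an existing activity: increase the version by one.
        section["activity_version"] = str(int(section["activity_version"]) + 1)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

For a new activity you would pass something like name="Wikipedia PT" and bundle_id="org.example.WikipediaPT". For a data update, pass bump_version=True and write the returned text back to activity/activity.info.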

./setup_new_wiki.py es_lat/eswiki-20111112-pages-articles.xml

A new .xo file will now be created in the directory dist, and you can distribute it.