Activities/Wikipedia/HowTo
"Crear uno, dos, tres... mil Wikipedias" Comandante Ernesto Wales
How to create a new Wikipedia activity or update an existing activity
This page describes how to generate the data files needed to create a Wikipedia activity like Wikipedia es or Wikipedia en.
The general idea is to download an xml file with a dump (backup) of the state of the Wikipedia pages, process it to select a number of pages, and compress them for inclusion in an activity. Optionally, it is possible to download the images used in those pages.
You will need a computer with a lot of disk space and a working Sugar environment, either using packages provided by your Linux distribution or in a virtual machine. The Wikipedia xml file is big (almost 6 GB for the Spanish Wikipedia, bigger for English), and you need more space to generate temporary files. The process takes a lot of time too, but it is automatic; you only need to check the status at the end of every stage.
This page is a work in progress. If you have doubts or the information provided is not good enough, please contact me at gonzalo at laptop dot org and I will try to improve it.
Download the Wikipedia base activity
You will need to download the Wikipedia base from http://dev.laptop.org/~gonzalo/wikibase.zip. This file includes the activity and the tools to create the data files.
You need to create a directory in your Activities directory, for example WikipediaEs.activity, and unzip wikibase.zip inside it.
Download a dump
Wikipedia regularly provides xml dump files for every language.
This test was done with the Spanish dump. The file used was eswiki-20111112-pages-articles.xml.bz2 from http://dumps.wikimedia.org/eswiki/20110810/
You need to create a directory inside the activity you just created and download the Wikipedia dump file there.
The first two letters of the directory name must be the language code, for example: es_es or en_us
mkdir es_lat
cd es_lat
bzip2 -d eswiki-20111112-pages-articles.xml.bz2
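If you do not already have the dump on disk, you can download it into this directory before running bzip2. The exact URL depends on the language and dump date you chose; for the example file above it would look something like:
wget http://dumps.wikimedia.org/eswiki/20111112/eswiki-20111112-pages-articles.xml.bz2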
Process the dump file
You need to edit the file tools2/config.py and modify the variable input_xml_file_name. If you are creating a Wikipedia in a language other than Spanish, modify the values of the other variables too.
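As a rough orientation, the relevant line in tools2/config.py could look like the one below; the value shown is only an example, use the name (or path) of the dump file you actually uncompressed:
input_xml_file_name = 'eswiki-20111112-pages-articles.xml'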
After saving config.py, you can process the dump file:
../tools2/pages_parser.py
The process will generate files with the same prefix as the uncompressed xml file. This first stage will create the following files:
eswiki-20111112-pages-articles.xml.links
eswiki-20111112-pages-articles.xml.page_templates
eswiki-20111112-pages-articles.xml.redirects
eswiki-20111112-pages-articles.xml.templates
If you want more information, you can use the option "--debug" to also create the files prefix.all_pages, prefix.blacklisted and prefix.titles, which help you investigate the process.
With the Spanish file, at a little more than 2.3M pages, this process takes approximately 1:30 hours.
Make a selection of pages
To create a selection of pages to be included in the Wikipedia activity, you need to create two files: favorites.txt and blacklist.txt. Both files are lists of page titles.
The criteria used to select the pages are: all the pages in favorites.txt will be included, plus all the pages linked from those pages, except the pages listed in blacklist.txt.
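For example, a minimal favorites.txt is just a plain text file, presumably with one page title per line (the titles below are only illustrative):
Argentina
Sistema Solar
Historia de América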
Our Wikipedia activity has a static index page (static/index_lang.html) and you can create your own index page. I have created my favorites.txt with all the pages linked from the index page (in the case of Spanish, 130 pages) plus 300 more pages selected from statistics of the most searched pages in Wikipedia (http://stats.grok.se)
There is not a linear relation between the number of favorite pages and the final number of pages selected, but as a reference, you can use these numbers:
Favorites   Total selected pages
130         15788
431         45105
544         63347
The files favorites.txt and blacklist.txt should be in the same directory as the .xml file.
To create the selection do:
../tools2/make_selection.py
After you have created the selection, you can look at the list of pages in the file pages_selected-level-1, modify your favorites.txt and/or blacklist.txt files, and rerun the process if you want to modify the selection.
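For example, assuming that file lists one title per line, you can quickly check the size of the selection or whether a given page was included (the title is only an example):
wc -l pages_selected-level-1
grep 'Sistema Solar' pages_selected-level-1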
Create the index
../tools2/create_index.py
This process is faster, and when it finishes you can do a basic test:
../tools2/test_index.py page_title
and this will show you the content of the selected page in wiki format. At this stage, the process will be slow, because it needs to search for every template and do all the template substitutions. To get faster results, we will apply the template substitutions to all the pages in the next step.
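For example, with the Spanish data you could try something like the following (the title is only an example; quote it if it contains spaces):
../tools2/test_index.py Argentina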
Optimize the data and download images
To expand the templates, you need to go out of the data directory:
cd ..
./tools2/expandtemplates.py es_lat
If you want to include images in your Wikipedia activity, go back to your data directory and do:
cd es_lat
../tools2/download_images.py
This command will download the images included in the pages listed in favorites.txt. If you want to include the images from all the pages, you should do:
../tools2/download_images.py --all
An option --cache_dir=directory is available to accelerate the process if you already have images downloaded in another directory.
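For example, if a previous version of the activity already has the images, something like this could reuse them (the cache directory path is only a placeholder):
../tools2/download_images.py --all --cache_dir=../../WikipediaEs.old/images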
Modify your activity to use the data files
You need to modify the file activity_es.py and change the lines:
self.WIKIDB = 'es_new/eswiki-20111112-pages-articles.xml'
self.HOME_PAGE = '/static/index_es.html'
to point to your new data files, or create a new file, for example activity_pt.py.
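For example, a hypothetical activity_pt.py would keep the same structure as activity_es.py and only change those two lines to point to your data (the file names below are only placeholders):
self.WIKIDB = 'pt_br/ptwiki-20111201-pages-articles.xml'
self.HOME_PAGE = '/static/index_pt.html'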
If you create a new file, you will need to modify the file activity/activity.info to point to it.
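In a standard Sugar activity.info, the line to change is the exec entry; keep the class name already used in the original exec line and only change the module (the class name below is a placeholder):
exec = sugar-activity activity_pt.WikipediaActivity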
You can create a new icon too, or modify the existing activity/activity-wikipedia-es.svg file.