Activities/Wikipedia/HowTo: Difference between revisions
Small redaction changes |
|||
| Line 6: | Line 6: | ||
[http://activities.sugarlabs.org/es-ES/sugar/addon/4401 Wikipedia es] or [http://activities.sugarlabs.org/es-ES/sugar/addon/4411 Wikipedia en] | [http://activities.sugarlabs.org/es-ES/sugar/addon/4401 Wikipedia es] or [http://activities.sugarlabs.org/es-ES/sugar/addon/4411 Wikipedia en] | ||
The general idea is to download an XML dump-file (backup) containing the current Wikipedia pages for a given language, this | The general idea is to download an XML dump-file (backup) containing the current Wikipedia pages for a given language, this will be processed to select certain pages and compress them into a self-contained Sugar activity. Whether or not to include the images from the wiki articles will have a large impact on the size of the activity. | ||
Generating a Wikipedia activity | Generating a Wikipedia activity requires a computer with a lot of available disk space, ideally lots of RAM, and a working Sugar environment. It is probably best to install Sugar from the packages provided by your favorite Linux distribution, or to run it in a virtual machine. The Wikipedia XML file is very large (almost 6 GB for the Spanish Wikipedia, and even bigger for the English one), so you will need plenty of space for temporary files. The process takes a long time to run but is mostly automated; you only need to confirm that each stage succeeded before moving on to the next. | ||
This page is a work in progress. If you have doubts or the information provided is not adequate, please contact me at gonzalo at laptop dot org and I will try to improve it. | This page is a work in progress. If you have doubts or the information provided is not adequate, please contact me at gonzalo at laptop dot org and I will try to improve it. | ||
| Line 14: | Line 14: | ||
== Download the wikipedia base activity == | == Download the wikipedia base activity == | ||
You will need download the wikipedia base from http://dev.laptop.org/~gonzalo/wikibase.zip. This | You will need to download the wikipedia base from http://dev.laptop.org/~gonzalo/wikibase.zip. This package includes the activity and the tools to create the data files. | ||
You need create a directory in your Activities directory for example WikipediaEs.activity and unzip wikibase.zip inside. | You need to create a directory in your Activities directory, for example WikipediaEs.activity and unzip wikibase.zip inside it. | ||
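The unpacking step above can also be scripted. The following is only a minimal sketch: the function name `unpack_base` and the location of the Activities directory are assumptions for illustration, not part of the wikibase package.

```python
import zipfile
from pathlib import Path


def unpack_base(zip_path, activities_dir, activity_name="WikipediaEs.activity"):
    """Create the activity directory and extract the base archive into it.

    activities_dir is wherever your Sugar Activities directory lives
    (an assumption left to the caller; it is not checked here).
    """
    target = Path(activities_dir) / activity_name
    # Create the activity directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    # Extract every member of the archive into the activity directory.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `unpack_base("wikibase.zip", "/path/to/Activities")` would return the path of the new WikipediaEs.activity directory.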
== Download a Wikipedia dump file== | == Download a Wikipedia dump file== | ||
| Line 24: | Line 24: | ||
This test was done with the Spanish dump. The file used was eswiki-20111112-pages-articles.xml.bz2 from http://dumps.wikimedia.org/eswiki/20110810/ | This test was done with the Spanish dump. The file used was eswiki-20111112-pages-articles.xml.bz2 from http://dumps.wikimedia.org/eswiki/20110810/ | ||
You need create a directory inside the | You need to create a directory inside the previously created activity directory, and download the wikipedia dump file there. | ||
The first two letters from your directory must be the language code example: es_es or en_us | The first two letters of the directory name must be the language code, for example es_es or en_us. | ||
mkdir es_lat | mkdir es_lat | ||
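The naming rule above can be checked before starting the long processing run. This is a hypothetical helper, not something provided by the tools; it assumes the code is two lowercase ASCII letters, as in the examples es_es, en_us and es_lat.

```python
from pathlib import Path


def language_code(dump_dir):
    """Return the two-letter language code at the start of a directory name."""
    name = Path(dump_dir).name
    code = name[:2]
    # Assumption: codes are two lowercase ASCII letters, as in the examples.
    if len(code) != 2 or not (code.isascii() and code.isalpha() and code.islower()):
        raise ValueError(f"{name!r} does not start with a two-letter language code")
    return code
```

Calling `language_code("WikipediaEs.activity/es_lat")` yields `"es"`, while a name such as `"1x_foo"` raises `ValueError`.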
| Line 34: | Line 34: | ||
== Process the dump file == | == Process the dump file == | ||
You need edit the file tools2/config.py, and modify the variable input_xml_file_name. | You need to edit the file tools2/config.py, and modify the variable input_xml_file_name. | ||
If you are creating a wikipedia in a different language | If you are creating a Wikipedia in a language other than Spanish, also modify the values of the other variables. | ||
After | After saving config.py, you can process the dump file: | ||
../tools2/pages_parser.py | ../tools2/pages_parser.py | ||
| Line 49: | Line 49: | ||
eswiki-20111112-pages-articles.xml.templates | eswiki-20111112-pages-articles.xml.templates | ||
If you want have more information, can use the option "--debug" to create the files prefix.all_pages, prefix.blacklisted and prefix.titles and help you | If you want more information, you can use the option "--debug" to create the files prefix.all_pages, prefix.blacklisted and prefix.titles, which help you | ||
to investigate the process. | to investigate the process. | ||
With the spanish file and a | With the Spanish file and a bit more than 2.3M pages, this process takes approximately 1 hour 30 minutes on my system. | ||
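Because each stage has to be confirmed before moving on, a small check of the generated files can help. The sketch below is an assumption-laden illustration: `missing_outputs` is a hypothetical helper, and it only knows about the file suffixes named on this page (`.templates`, plus the three `--debug` files); the parser may write other files that are not checked here.

```python
from pathlib import Path

# Suffixes of the extra files the text says "--debug" produces.
DEBUG_SUFFIXES = (".all_pages", ".blacklisted", ".titles")


def missing_outputs(prefix, debug=False, directory="."):
    """Return the expected output file names that are not present on disk.

    prefix is the dump file name, e.g. "eswiki-20111112-pages-articles.xml".
    Only suffixes mentioned in the text are checked.
    """
    suffixes = (".templates",) + (DEBUG_SUFFIXES if debug else ())
    base = Path(directory)
    return [prefix + s for s in suffixes if not (base / (prefix + s)).is_file()]
```

An empty list means every file the sketch knows about was found in `directory`.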
== Make a selection of pages == | == Make a selection of pages == | ||
| Line 62: | Line 62: | ||
The criterion used to select the pages is: all the pages in '''favorites.txt''' | The criterion used to select the pages is: all the pages in '''favorites.txt''' | ||
will be included, along with all the pages linked from these pages, except the pages | will be included, along with all the pages linked from these pages, except the pages | ||
in the file '''blacklist.txt''' | in the file '''blacklist.txt'''. | ||
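The selection rule described above can be illustrated with a small sketch. This is not the code used by the tools: `select_pages` and the `links` mapping are hypothetical, only one level of links is followed, and the blacklist is assumed to exclude a page even when it is itself a favorite (the text does not say which way that case goes).

```python
def select_pages(favorites, links, blacklist):
    """Select favorites plus the pages they link to, minus blacklisted pages.

    favorites: iterable of page titles
    links:     dict mapping a page title to the titles it links to
    blacklist: iterable of page titles to exclude
    """
    blocked = set(blacklist)
    selected = set()
    for title in favorites:
        selected.add(title)
        # Follow links one level deep only.
        selected.update(links.get(title, ()))
    # Assumption: the blacklist wins over everything, including favorites.
    return selected - blocked
```

This also shows why the number of selected pages does not grow in step with the number of favorites: pages linked from several favorites are counted once, and heavily linked favorites pull in many more pages than sparse ones.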
Our wikipedia activity | Our wikipedia activity has a static index page (static/index_lang.html) and you can create your own index page. | ||
I created my '''favorites.txt''' from all the pages linked from the index page (130 pages in the Spanish case) plus 300 more pages selected from | I created my '''favorites.txt''' from all the pages linked from the index page (130 pages in the Spanish case) plus 300 more pages selected from | ||
statistics on the most searched pages in Wikipedia (http://stats.grok.se) | statistics on the most searched pages in Wikipedia (http://stats.grok.se) | ||
There | There is no linear relation between the number of favorite pages and the final | ||
number of pages selected, but as a reference, you can use these numbers: | number of pages selected, but as a reference, you can use these numbers: | ||
| Line 85: | Line 85: | ||
After you have created the selection, you can look at the list of pages in the file | After you have created the selection, you can look at the list of pages in the file | ||
pages_selected-level-1 and modify your favorites.txt and/or blacklist.txt | pages_selected-level-1 and modify your favorites.txt and/or blacklist.txt | ||
files and rerun the process if you want modify the selection. | files and rerun the process if you want to modify the selection. | ||
== Create the index == | == Create the index == | ||