[http://activities.sugarlabs.org/es-ES/sugar/addon/4401 Wikipedia es] or [http://activities.sugarlabs.org/es-ES/sugar/addon/4411 Wikipedia en]

The general idea is to download an XML dump file (backup) containing the current Wikipedia pages for a given language; this is then processed to select certain pages and compress them into a self-contained Sugar activity. Whether or not to include the images from the wiki articles will have a large impact on the size of the activity.

Generating a Wikipedia activity requires a computer with a lot of available disk space, ideally lots of RAM, and a working Sugar environment. It is probably best to use packages provided by your favorite Linux distribution, or to work in a virtual machine. The Wikipedia XML file is very large (almost 6 GB for the Spanish Wikipedia, and even bigger for English), and you will need lots of space to generate temporary files. The process has a long run time, but it is mostly automated, although you will need to confirm success at each stage before moving on to the next.

This page is a work in progress. If you have doubts or the information provided is not adequate, please contact me at gonzalo at laptop dot org and I will try to improve it.
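
Before you start, it is worth checking how much free disk space is available where you plan to work; for example, from a terminal:

 df -h .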
== Download the wikipedia base activity ==

You will need to download the wikipedia base from http://dev.laptop.org/~gonzalo/wikibase.zip. This package includes the activity and the tools to create the data files.

You need to create a directory in your Activities directory, for example WikipediaEs.activity, and unzip wikibase.zip inside it.
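
For example, from a terminal (a sketch for the Spanish case; adjust the directory name for your language):

 # directory name is an example; Sugar activities normally live in ~/Activities
 mkdir -p ~/Activities/WikipediaEs.activity
 cd ~/Activities/WikipediaEs.activity
 wget http://dev.laptop.org/~gonzalo/wikibase.zip
 unzip wikibase.zip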
== Download a Wikipedia dump file ==

This test was done with the Spanish dump. The file used was eswiki-20111112-pages-articles.xml.bz2 from http://dumps.wikimedia.org/eswiki/20110810/

You need to create a directory inside the previously created activity directory, and download the wikipedia dump file there.

The first two letters of the directory name must be the language code, for example: es_es or en_us

 mkdir es_lat
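
For example, to fetch the Spanish dump named above into that directory and decompress it (the later steps reference the plain .xml file; the exact URL depends on the dump date you choose):

 cd es_lat
 # illustrative URL; pick the current dump for your language
 wget http://dumps.wikimedia.org/eswiki/20110810/eswiki-20111112-pages-articles.xml.bz2
 bunzip2 eswiki-20111112-pages-articles.xml.bz2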
== Process the dump file ==

You need to edit the file tools2/config.py and modify the variable input_xml_file_name. If you are creating a wikipedia in a language other than Spanish, modify the values of the other variables as well.
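
A minimal sketch of the relevant line in tools2/config.py, assuming the Spanish dump used in this example (whether a path prefix is needed depends on where you run the tools from):

 # illustrative value; use the name of the dump file you actually downloaded
 input_xml_file_name = "eswiki-20111112-pages-articles.xml"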
After saving config.py, you can process the dump file:

 ../tools2/pages_parser.py
 eswiki-20111112-pages-articles.xml.templates

If you want more information, you can use the option "--debug" to create the files prefix.all_pages, prefix.blacklisted and prefix.titles, which will help you investigate the process.
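
For example:

 ../tools2/pages_parser.py --debug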
With the Spanish file, which has a bit more than 2.3 million pages, this process takes approximately 1.5 hours on my system.
== Make a selection of pages ==
The criteria used to select the pages are: all the pages in '''favorites.txt''' will be included, and so will all the pages linked from them, except the pages in the file '''blacklist.txt'''.

Our wikipedia activity has a static index page (static/index_lang.html), and you can create your own index page. I have created my '''favorites.txt''' with all the pages linked from the index page (in the case of Spanish, 130 pages) plus 300 more pages selected from a statistic of the most searched pages in wikipedia (http://stats.grok.se).
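
For illustration, assuming '''favorites.txt''' holds one page title per line (the titles below are examples only), it would look something like:

 Argentina
 Sistema Solar
 Miguel de Cervantes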
There is no linear relation between the number of favorite pages and the final number of pages selected, but as a reference you can use these numbers:

After you have created the selection, you can look at the list of pages in the file pages_selected-level-1, modify your favorites.txt and/or blacklist.txt files, and rerun the process if you want to modify the selection.
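
For example, to check whether a particular page ended up in the selection (assuming pages_selected-level-1 is a plain-text list of titles):

 # "cervantes" is just an example title
 grep -i "cervantes" pages_selected-level-1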
== Create the index ==