[http://activities.sugarlabs.org/es-ES/sugar/addon/4401 Wikipedia es] or [http://activities.sugarlabs.org/es-ES/sugar/addon/4411 Wikipedia en]
 
The general idea is to download an XML dump file (backup) containing the current Wikipedia pages for a given language. This dump is then processed to select certain pages and compress them into a self-contained Sugar activity. Whether or not to include the images from the wiki articles will have a large impact on the size of the activity.

Generating a Wikipedia activity requires a computer with a lot of available disk space, ideally lots of RAM, and a working Sugar environment. It is probably best to use packages provided by your favorite Linux distribution, or to work in a virtual machine. The wikipedia xml file is very large (almost 6 GB for the Spanish wikipedia, and even bigger for English), and you will need lots of space to generate temporary files. The process has a long run-time, but it is mostly automated, although you will need to confirm success at each stage before moving on to the next.
    
This page is a work in progress. If you have doubts or the information provided is not adequate, please contact me at gonzalo at laptop dot org and I will try to improve it.
 
== Download the wikipedia base activity ==
 
You will need to download the wikipedia base from http://dev.laptop.org/~gonzalo/wikibase.zip. This package includes the activity and the tools to create the data files.

You need to create a directory in your Activities directory, for example WikipediaEs.activity, and unzip wikibase.zip inside it.
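
For example, assuming your Activities directory is ~/Activities (the usual location in a Sugar environment; adjust the path if yours differs) and that wget and unzip are available, the steps could look like this:

  mkdir ~/Activities/WikipediaEs.activity
  cd ~/Activities/WikipediaEs.activity
  wget http://dev.laptop.org/~gonzalo/wikibase.zip
  unzip wikibase.zip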
    
== Download a Wikipedia dump file ==
 
This test was done with the Spanish dump. The file used was eswiki-20111112-pages-articles.xml.bz2 from http://dumps.wikimedia.org/eswiki/20110810/
 
You need to create a directory inside the previously created activity directory, and download the wikipedia dump file there.

The first two letters of the directory name must be the language code, for example: es_es or en_us
    
  mkdir es_lat
 
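
For example, from inside the activity directory created earlier, the dump could be fetched and decompressed like this (the exact URL depends on the dump date you pick, and I am assuming the tools read the decompressed .xml file, since the config variable described below is called input_xml_file_name):

  cd es_lat
  wget http://dumps.wikimedia.org/eswiki/20110810/eswiki-20111112-pages-articles.xml.bz2
  bunzip2 eswiki-20111112-pages-articles.xml.bz2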
== Process the dump file ==
 
You need to edit the file tools2/config.py and modify the variable input_xml_file_name. If you are creating a wikipedia in a language other than Spanish, modify the values of the other variables too.
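
A minimal sketch of this step, run from inside the language directory created above, assuming config.py uses plain variable assignments (check the actual file, since I have not verified its exact layout):

  # open with your preferred editor and point input_xml_file_name at the
  # decompressed dump, for example (assumed form):
  #   input_xml_file_name = "eswiki-20111112-pages-articles.xml"
  nano ../tools2/config.py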
After saving config.py, you can process the dump file:
    
  ../tools2/pages_parser.py
 
This process generates several data files, among them:

  eswiki-20111112-pages-articles.xml.templates
If you want more information, you can use the option "--debug" to create the files prefix.all_pages, prefix.blacklisted and prefix.titles, which will help you research the process.
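
For example (assuming the flag is passed directly to pages_parser.py, which the text above suggests):

  ../tools2/pages_parser.py --debug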
With the Spanish file, a bit more than 2.3 million pages, this process takes approximately one and a half hours on my system.
    
== Make a selection of pages ==
 
The selection criteria are: all the pages in '''favorites.txt''' will be included, along with all the pages linked from them, except the pages in the file '''blacklist.txt'''.

Our wikipedia activity has a static index page (static/index_lang.html), and you can create your own index page.
 
I have created my '''favorites.txt''' with all the pages linked from the index page (in the case of Spanish, 130 pages) plus 300 more pages selected from a statistic of the most searched pages in wikipedia (http://stats.grok.se).
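
As an illustration, favorites.txt presumably contains one page title per line; this format and the example titles below are assumptions, so check any sample file shipped with the tools before relying on them:

  Argentina
  Sistema Solar
  Música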
There is not a linear relation between the number of favorite pages and the final number of pages selected, but as a reference, you can use these numbers:

After you have created the selection, you can look at the list of pages in the file pages_selected-level-1, modify your favorites.txt and/or blacklist.txt files, and rerun the process if you want to change the selection.
    
== Create the index ==
 