Difference between revisions of "Activities/Wikipedia/HowTo"
+ | == Old information == | ||
+ | |||
+ | http://wiki.laptop.org/go/User:Godiard/WkipediaDataRebuild |
Latest revision as of 12:00, 6 July 2015
"Crear uno, dos, tres... mil Wikipedias" Comandante Ernesto Wales
Object of this HowTo
This HowTo explains how to update the data files in the wikipedia activities or how to create new activities with other languages or different selections of articles.
The procedure is not very difficult if you already have a Sugar environment set up. If you have questions or the information provided here is not adequate, please contact me at gonzalo at laptop dot org or on the sugar-devel mailing list, and I will try to help and improve this page.
If you want to create a Wikipedia activity in your language and do not have the technical resources, but can help by translating a few files and doing quality control, contact me and I will help you create the activity.
How to Create a new wikipedia activity or update an existing activity
This page describes how to generate the data files needed to create a wikipedia activity like Wikipedia es or Wikipedia en.
The general idea is to download an XML dump-file (backup) containing the current Wikipedia pages for a given language, then process the dump and select certain pages and compress them into a self-contained Sugar activity. Whether or not to include the images from the wiki articles will have a large impact on the size of the activity.
Generating a Wikipedia activity requires a computer with a lot of available disk space, ideally lots of RAM and a working Sugar environment. It is probably best to use packages provided by your favorite Linux distribution or in a virtual machine. The wikipedia xml file is very large (almost 6 GB for the Spanish wikipedia, and it is even bigger in English), and you will need lots of space to generate temporary files. The process does take a lot of time, but it is mostly automated, although you will need to confirm success at each stage of the process before moving on to the next one.
Download the wikipedia base activity
You will need to download the wikipedia base from http://dev.laptop.org/~gonzalo/wikiserver/WikipediaBase-35.xo. This package includes the activity and the tools to create the data files.
Unzip it in your Activities directory or, if you do not have another Wikipedia activity already installed, install it.
The git repository is at https://github.com/godiard/wikipedia-activity
Download a Wikipedia dump file
Wikipedia provides daily (or nearly daily) XML dump files for each language.
This test was done with the Spanish dump. The file used was eswiki-20111112-pages-articles.xml.bz2 from http://dumps.wikimedia.org/eswiki/20110810/
You need to create a directory inside the previously created activity directory, and download the wikipedia dump file there.
The directory name must start with the two-letter language code, followed by an underscore ('_') and then any string. For example: es_es, en_us or en_history.
mkdir es_lat
cd es_lat
bzip2 -d eswiki-20111112-pages-articles.xml.bz2
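The naming rule for the data directory can be checked before starting the long processing steps. The following is only a sketch of the rule as stated in this HowTo; the function name and the regular expression are illustrative assumptions, and the real tools may accept or reject other names.

```python
import re

# Assumed rule, taken from the text: two-letter language code, '_', then any non-empty string.
DATA_DIR_PATTERN = re.compile(r"^([a-z]{2})_(.+)$")


def language_code(dir_name):
    """Return the language code of a data directory name, or None if the name is invalid."""
    match = DATA_DIR_PATTERN.match(dir_name)
    return match.group(1) if match else None


for name in ("es_lat", "en_history", "spanish"):
    print(name, "->", language_code(name))
```

A name such as "spanish" is rejected because the third character is not an underscore.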
Process the dump file
You need to edit the file tools2/config.py and modify the variable input_xml_file_name. If you are creating a Wikipedia in a language other than Spanish, also modify the values of the other variables.
After saving config.py, you can process the dump file:
../tools2/pages_parser.py
The process generates files with the same prefix as the uncompressed XML file. This first pass creates the following files:
eswiki-20111112-pages-articles.xml.links
eswiki-20111112-pages-articles.xml.page_templates
eswiki-20111112-pages-articles.xml.redirects
eswiki-20111112-pages-articles.xml.templates
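Because each stage should be confirmed before moving on, a small check that the four files exist and are not empty can save time. This is a sketch, not part of the tools2 scripts; the helper name missing_outputs is hypothetical, and the suffix list is copied from the file names shown in this HowTo.

```python
import os

# Suffixes listed in this HowTo for the first processing pass.
EXPECTED_SUFFIXES = (".links", ".page_templates", ".redirects", ".templates")


def missing_outputs(xml_path):
    """Return the expected output files that are missing or empty."""
    missing = []
    for suffix in EXPECTED_SUFFIXES:
        candidate = xml_path + suffix
        if not os.path.isfile(candidate) or os.path.getsize(candidate) == 0:
            missing.append(candidate)
    return missing


if __name__ == "__main__":
    # Run from inside the data directory (es_lat in this HowTo).
    print(missing_outputs("eswiki-20111112-pages-articles.xml"))
```

An empty list means all four files are present; it does not prove that their content is correct.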
For more information, you can use the option "--debug", which also creates the files prefix.all_pages, prefix.blacklisted and prefix.titles to help you investigate the process.
With the Spanish file, which has a bit more than 2.3M pages, this process takes approximately 1 hour 30 minutes on my system.
Make a selection of pages
To create a selection of pages to be included in the Wikipedia activity, you need to create two files: favorites.txt and blacklist.txt. Each file is a list of page titles.
The selection criterion is: all the pages in favorites.txt are included, together with all the pages linked from those pages, except the pages listed in blacklist.txt.
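The criterion can be illustrated with a few lines of Python. This is not the code of make_selection.py; it is a sketch that assumes a single level of link following and assumes the blacklist is applied to the whole result. The function name and the links dictionary are made up for the example.

```python
def select_pages(favorites, blacklist, links):
    """One level of link following: favorites plus the pages they link to, minus the blacklist."""
    selected = set(favorites)
    for title in favorites:
        # links maps a page title to the titles it links to.
        selected.update(links.get(title, ()))
    return selected - set(blacklist)


links = {"Energía": ["Física", "Trabajo"], "Física": ["Materia"]}
print(sorted(select_pages(["Energía"], ["Trabajo"], links)))
# prints ['Energía', 'Física']
```

"Materia" is not selected because it is linked only from "Física", which is not itself a favorite.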
Our Wikipedia activity has a static index page (static/index_lang.html), and you can create your own index page. I created my favorites.txt with all the pages linked from the index page (130 pages in the case of Spanish) plus 300 more pages selected from statistics of the most searched pages in Wikipedia (http://stats.grok.se).
The relation between the number of favorite pages and the final number of selected pages is not linear, but you can use these numbers as a reference:
Favorites   Total selected pages
130         15788
431         45105
544         63347
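To see why the relation is not linear, the number of selected pages per favorite can be computed from the three reference rows. The snippet below only does arithmetic on the numbers in the table above; it makes no prediction for other languages or other favorites lists.

```python
# (favorites, total selected pages) pairs copied from the table above.
samples = [(130, 15788), (431, 45105), (544, 63347)]


def pages_per_favorite(favorites, total):
    """Average number of selected pages contributed by each favorite."""
    return total / favorites


for favorites, total in samples:
    print(favorites, round(pages_per_favorite(favorites, total), 1))
```

The ratio varies between roughly 105 and 121 pages per favorite, so it is only a rough guide for sizing a new selection.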
The files favorites.txt and blacklist.txt should be in the same directory as the .xml file.
To create the selection, run:
../tools2/make_selection.py
After you have created the selection, you can look at the list of pages in the file pages_selected-level-1 and modify your favorites.txt and/or blacklist.txt files and rerun the process if you want to modify the selection.
The case of wikipedias with few articles
If the Wikipedia in the selected language has fewer than 40000 articles, it is possible to include all the articles in the activity, and you do not need to create the favorites.txt file. You can select all the articles with:
../tools2/make_selection.py --all
Create the index
../tools2/create_index.py
This process is faster, and when it has finished you can do a basic test:
../tools2/test_index.py page_title
This will show you the content of the selected page in wiki format.
At this stage, the test will be slow, because it needs to search for every template and do all the template substitutions. To get faster results, we will apply the template substitutions to all the pages.
Optimize the data and download images
To expand the templates, you need to leave the data directory:
cd ..
./tools2/expandtemplates.py es_lat
When it finishes, go back to the data directory:
cd es_lat
Then rename the new file, which has the templates expanded, and recreate the index:
mv eswiki-20111112-pages-articles.xml.processed_expanded eswiki-20111112-pages-articles.xml.processed
../tools2/create_index.py --delete_old
The option --delete_old removes the old index.
If you want to include images in your wikipedia activity, you can go again to your data directory and do:
cd es_lat
../tools2/download_images.py
This command downloads the images included in the pages listed in favorites.txt. If you want to include the images from all the pages, run:
../tools2/download_images.py --all
The option --cache_dir=directory is available to accelerate the process if you already have images downloaded in another directory.
Modify your activity to use the data files
To create a Wikipedia in a new language, you will need to create the following files:
- activity/activity.info.lang: the activity.info file for your language. You can copy one from another language and modify the name, the bundle_id, the icon and the exec line.
- activity/activity-wikipedia-lang.svg: the activity icon. You can copy the file from another language and edit the last text element with a text editor to put in the language code. If you need to edit the image with a graphics editor (like Inkscape), remember afterwards to add the entity lines to the header and to put back the stroke_color and fill_color entities.
- DEPRECATED, SEE BELOW: activity_lang.py: the startup class; it sets the configuration values and starts the server. You can copy the class from another language and set the parameters. The name of the class must match the value in the exec line of the activity/activity.info.lang file.
- static/about_lang.html: a static "about" page. Translate it from the equivalent page in another language.
- static/index_lang.html: the activity home page. It contains links to good pages from which to start exploring. If you created your favorites list from a translation of the home page in another language, it is a good idea to translate the home page too.
DEPRECATED, SEE BELOW: Now you can test your changes by starting the Wikipedia server:
./activity_lang.py es_lat/eswiki-20111112-pages-articles.xml 8000
The first parameter is your XML data file and the second parameter is a port number.
In any web browser on the same computer you can open a page to check that the server is working.
In this example, we look at the page "Energía" by pointing the browser to http://localhost:8000/wiki/Energía
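If you prefer to check the page from a script instead of a browser, note that a title such as "Energía" has to be percent-encoded in the URL. The sketch below only builds the URL; the helper name page_url is an assumption, and how the server handles titles with spaces is not covered here.

```python
from urllib.parse import quote


def page_url(title, port=8000, host="localhost"):
    """Build the local URL of a wiki page, percent-encoding non-ASCII characters."""
    return "http://%s:%d/wiki/%s" % (host, port, quote(title))


print(page_url("Energía"))
# prints http://localhost:8000/wiki/Energ%C3%ADa
```

While the server is running, the resulting URL can be passed to urllib.request.urlopen to confirm that the page is served.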
Finally, to create the new .xo file and distribute it, you must do:
./setup_new_wiki.py es_lat/eswiki-20111112-pages-articles.xml
Now, in the directory dist, a new .xo file will be created and you can distribute it.
Notes on updates in the process
After version 38, to make the process more standard and allow the activity to be packaged in distributions, we added a standard setup.py. To use it, you need to add the Wikipedia initialization parameters to the activity.info file, as shown in the file activity.info.en_simple:
https://github.com/godiard/wikipedia-activity/blob/master/activity/activity.info.en_simple
[Wikipedia]
path = en_simple/simplewiki-20130724-pages-articles.xml
port = 8011
home_page = /static/index_en_simple.html
templateprefix = Template:
wpheader = From Wikipedia, The Free Encyclopedia
wpfooter = Content available under the
  <a href="/static/es-gfdl.html">GNU Free Documentation License</a>.
  <br/> Wikipedia is a registered trademark of the non-profit
  Wikimedia Foundation, Inc.<br/><a href="/static/about_en.html">
  About Wikipedia</a>
resultstitle = Search results for '%s'.
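Before building the package, it can be useful to confirm that the [Wikipedia] section parses and that the port is a number. The following sketch uses Python's standard configparser module on a shortened copy of the section; it is not the code the activity itself uses to read activity.info, and the function name is hypothetical.

```python
import configparser

# Shortened copy of the [Wikipedia] section shown in this HowTo.
SAMPLE = """\
[Wikipedia]
path = en_simple/simplewiki-20130724-pages-articles.xml
port = 8011
home_page = /static/index_en_simple.html
templateprefix = Template:
resultstitle = Search results for '%s'.
"""


def read_wikipedia_section(text):
    # interpolation=None keeps the literal '%s' in resultstitle; the default
    # interpolation would treat '%' as special and raise an error on get().
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["Wikipedia"]
    return {
        "path": section["path"],
        "port": section.getint("port"),
        "home_page": section["home_page"],
        "resultstitle": section["resultstitle"],
    }


settings = read_wikipedia_section(SAMPLE)
print(settings["path"], settings["port"])
```

To check a real file, read its text and pass it to the same function; a missing key raises KeyError, which points at the parameter that still has to be added.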
Another important change is that you no longer need to create an activity_<lang>.py file, because the activity reads its configuration from the activity.info file when it starts. The "exec" line needs to be:
exec = sugar-activity activity.WikipediaActivity
Then to create the .xo you can do:
./setup.py dist_xo es_lat/eswiki-20111112-pages-articles.xml
or to create the sources tar.bz2 file:
./setup.py dist_source es_lat/eswiki-20111112-pages-articles.xml
With this new version, you can test the wiki from the command line with:
./test_server.py es_lat/eswiki-20111112-pages-articles.xml 8000
The two parameters are optional; if they are not provided, the parameters in the activity.info file are used.
Other changes needed
Image identifiers
If the images are not displayed in the pages after the files have been processed, check whether the image identifier is included in the set imageKeywords in the file mwlib/parser.py. For example, in the Quechua Wikipedia the image identifier is "rikcha", and we needed to add it because it was not included.
More tools
Big image files
In some cases a small group of images is very big. If you want to remove them to get a smaller activity, you can run:
mkdir big-images
find images -size +100k -exec mv {} big-images \;
(In this example, images larger than 100k are moved to another directory.)
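The same clean-up can be sketched in Python, which also reports which files were moved. This is an illustrative equivalent and not one of the provided tools: the function name is hypothetical, and unlike find it looks only at files directly inside the source directory, not in subdirectories.

```python
import shutil
from pathlib import Path


def move_big_files(source_dir, target_dir, min_bytes=100 * 1024):
    """Move files larger than min_bytes from source_dir to target_dir; return their names."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(source.iterdir()):
        # Only regular files are considered; directories are left in place.
        if path.is_file() and path.stat().st_size > min_bytes:
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)
    return moved
```

From the data directory, move_big_files("images", "big-images") would mirror the command above; keeping the moved files in a separate directory makes it easy to put some of them back later.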