<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.sugarlabs.org/index.php?action=history&amp;feed=atom&amp;title=Google_Code-In_2012%2FOCR_document_reformatting</id>
	<title>Google Code-In 2012/OCR document reformatting - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.sugarlabs.org/index.php?action=history&amp;feed=atom&amp;title=Google_Code-In_2012%2FOCR_document_reformatting"/>
	<link rel="alternate" type="text/html" href="https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;action=history"/>
	<updated>2026-06-13T15:11:17Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.0</generator>
	<entry>
		<id>https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84245&amp;oldid=prev</id>
		<title>Cjl at 15:08, 23 November 2012</title>
		<link rel="alternate" type="text/html" href="https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84245&amp;oldid=prev"/>
		<updated>2012-11-23T15:08:23Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 11:08, 23 November 2012&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l17&quot;&gt;Line 17:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 17:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;3) Save file as either Pagennn.xls or Pagennn.ods.  Both formats are acceptable.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;3) Save file as either Pagennn.xls or Pagennn.ods.  Both formats are acceptable.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;4) Visually review (copyedit) each of the PDF file and spreadsheet file pairs.  Try to correct any errors introduced by the OCR process (for example a a &quot;b&quot; is sometimes OCR&#039;ed as an &quot;h&quot; and vice-versa. Pay special attention to characters with accents or other diacritical marks.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;4) Visually review (copyedit) each of the PDF file and spreadsheet file pairs.  Try to correct any errors introduced by the OCR process (for example a a &quot;b&quot; is sometimes OCR&#039;ed as an &quot;h&quot; and vice-versa.&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;fo &lt;/ins&gt;Pay special attention to characters with accents or other diacritical marks.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;5) Repeat process from step 3 with next 9 Page files in the zip.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;5) Repeat process from step 3 with next 9 Page files in the zip.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;6) Submit the spreadsheet files to the task mentor for review/&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;sing&lt;/del&gt;-off.  &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;6) Submit the spreadsheet files to the task mentor for review/&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;sign&lt;/ins&gt;-off.  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Cjl</name></author>
	</entry>
	<entry>
		<id>https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84244&amp;oldid=prev</id>
		<title>Cjl at 15:07, 23 November 2012</title>
		<link rel="alternate" type="text/html" href="https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84244&amp;oldid=prev"/>
		<updated>2012-11-23T15:07:41Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 11:07, 23 November 2012&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l17&quot;&gt;Line 17:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 17:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;3) Save file as either Pagennn.xls or Pagennn.ods.  Both formats are acceptable.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;3) Save file as either Pagennn.xls or Pagennn.ods.  Both formats are acceptable.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;4) Visually review (copyedit) each of the PDF file and spreadsheet file pairs.  Try to correct any errors introduced by the OCR process (for example a a &quot;b&quot; is sometimes OCR&#039;ed as an &quot;h&quot; and vice-versa. Pay special attention to characters with accents &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;of &lt;/del&gt;other diacritical marks.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;4) Visually review (copyedit) each of the PDF file and spreadsheet file pairs.  Try to correct any errors introduced by the OCR process (for example a a &quot;b&quot; is sometimes OCR&#039;ed as an &quot;h&quot; and vice-versa. Pay special attention to characters with accents &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;or &lt;/ins&gt;other diacritical marks.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;5) Repeat process from step 3 with next 9 Page files in the zip.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;5) Repeat process from step 3 with next 9 Page files in the zip.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Cjl</name></author>
	</entry>
	<entry>
		<id>https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84243&amp;oldid=prev</id>
		<title>Cjl at 15:07, 23 November 2012</title>
		<link rel="alternate" type="text/html" href="https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84243&amp;oldid=prev"/>
		<updated>2012-11-23T15:07:04Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 11:07, 23 November 2012&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR&#039;ed digital text.  That digitized text is the starting point for any number of efforts.  A batch processing &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;fo &lt;/del&gt;the PDFs into text files has already been performed.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR&#039;ed digital text.  That digitized text is the starting point for any number of efforts.  A batch processing &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;of &lt;/ins&gt;the PDFs into text files has already been performed.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The next critical step is to take the imperfectly formatted OCR output and restructure it into something &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;tha &lt;/del&gt;corresponds to the two column format of the original PDF.  The format of choice for this is a spreadsheet containing each entry in side by side columns.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The next critical step is to take the imperfectly formatted OCR output and restructure it into something &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;that &lt;/ins&gt;corresponds to the two column format of the original PDF.  The format of choice for this is a spreadsheet containing each entry in side by side columns.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Here is the full image-PDF [[File:Dt19.pdf]], which has been broken down to individual pages and grouped in batches below containing both the original PDF page and th eOCR&amp;#039;ed text.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Here is the full image-PDF [[File:Dt19.pdf]], which has been broken down to individual pages and grouped in batches below containing both the original PDF page and th eOCR&amp;#039;ed text.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Download the zips &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;blow &lt;/del&gt;(each contains 10 PDF image files and ten text files).&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Download the zips &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;below &lt;/ins&gt;(each contains 10 PDF image files and ten text files).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Recommended approach:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Recommended approach:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Cjl</name></author>
	</entry>
	<entry>
		<id>https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84241&amp;oldid=prev</id>
		<title>Cjl at 14:43, 23 November 2012</title>
		<link rel="alternate" type="text/html" href="https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84241&amp;oldid=prev"/>
		<updated>2012-11-23T14:43:15Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 10:43, 23 November 2012&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l15&quot;&gt;Line 15:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 15:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;3) Save file as either Pagennn.xls or Pagennn.ods.  Both formats are acceptable.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;3) Save file as either Pagennn.xls or Pagennn.ods.  Both formats are acceptable.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;3&lt;/del&gt;) &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Repeat &lt;/del&gt;process &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;from step 3 &lt;/del&gt;with &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;next 9 Page files in the zip&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;4&lt;/ins&gt;) &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Visually review (copyedit) each of the PDF file and spreadsheet file pairs.  Try to correct any errors introduced by the OCR &lt;/ins&gt;process &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;(for example a a &quot;b&quot; is sometimes OCR&#039;ed as an &quot;h&quot; and vice-versa. Pay special attention to characters &lt;/ins&gt;with &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;accents of other diacritical marks&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;4&lt;/del&gt;) &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Visually review (copyedit&lt;/del&gt;) &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;each of &lt;/del&gt;the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;PDF file and &lt;/del&gt;spreadsheet &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;file pairs.  Try &lt;/del&gt;to &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;correct any errors introduced by &lt;/del&gt;the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;OCR process (&lt;/del&gt;for &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;example a a &quot;b&quot; is sometimes OCR&#039;ed as an &quot;h&quot; and vice&lt;/del&gt;-&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;versa. Pay special attention to characters with accents of other diacritical marks&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;5&lt;/ins&gt;) &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Repeat process from step 3 with next 9 Page files in the zip.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;6&lt;/ins&gt;) &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Submit &lt;/ins&gt;the spreadsheet &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;files &lt;/ins&gt;to the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;task mentor &lt;/ins&gt;for &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;review/sing&lt;/ins&gt;-&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;off&lt;/ins&gt;.  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[File:Batch-cni-dict-01.zip]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[File:Batch-cni-dict-01.zip]] &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt; (Completed example)&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[File:Batch-cni-dict-02.zip]]  &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[File:Batch-cni-dict-02.zip]]  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Cjl</name></author>
	</entry>
	<entry>
		<id>https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84214&amp;oldid=prev</id>
		<title>Cjl: Created page with &quot;The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR&#039;ed digital text.  That digitized text is the starting po...&quot;</title>
		<link rel="alternate" type="text/html" href="https://wiki.sugarlabs.org/index.php?title=Google_Code-In_2012/OCR_document_reformatting&amp;diff=84214&amp;oldid=prev"/>
		<updated>2012-11-23T14:25:11Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR&amp;#039;ed digital text.  That digitized text is the starting po...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR&amp;#039;ed digital text.  That digitized text is the starting point for any number of efforts.  A batch processing fo the PDFs into text files has already been performed.&lt;br /&gt;
&lt;br /&gt;
The next critical step is to take the imperfectly formatted OCR output and restructure it into something tha corresponds to the two column format of the original PDF.  The format of choice for this is a spreadsheet containing each entry in side by side columns.&lt;br /&gt;
&lt;br /&gt;
Here is the full image-PDF [[File:Dt19.pdf]], which has been broken down to individual pages and grouped in batches below containing both the original PDF page and th eOCR&amp;#039;ed text.&lt;br /&gt;
&lt;br /&gt;
Download the zips blow (each contains 10 PDF image files and ten text files).&lt;br /&gt;
&lt;br /&gt;
Recommended approach:&lt;br /&gt;
&lt;br /&gt;
1) Copy the text from the text file (Pagennn.txt) and paste it into two side-by-side columns in an MSExcel or OpenOffice .ods spreadsheet (Pagennn.xls or Pagennn.ods)&lt;br /&gt;
&lt;br /&gt;
2) Carefully delete certain entries in the two columns so that the result mirrors the aligned organization of the PDF file columns.  Please note that line wraps in the PDF should be ignored and all of an entry should be placed into a single cell.  In addition, the header and footer information present on the PDF image should be excluded from the spreadsheet.  It is irrelevant whether the entries are separated by a blank row or not as long as the entry in column A corresponds to the entry in column B.&lt;br /&gt;
&lt;br /&gt;
3) Save file as either Pagennn.xls or Pagennn.ods.  Both formats are acceptable.&lt;br /&gt;
&lt;br /&gt;
3) Repeat process from step 3 with next 9 Page files in the zip.&lt;br /&gt;
&lt;br /&gt;
4) Visually review (copyedit) each of the PDF file and spreadsheet file pairs.  Try to correct any errors introduced by the OCR process (for example a a &amp;quot;b&amp;quot; is sometimes OCR&amp;#039;ed as an &amp;quot;h&amp;quot; and vice-versa. Pay special attention to characters with accents of other diacritical marks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-01.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-02.zip]] &lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-03.zip]] &lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-04.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-05.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-06.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-07.zip]] &lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-08.zip]] &lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-09.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-10.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-11.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-12.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-13.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-14.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-15.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-16.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-17.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-18.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-19.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-20.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-21.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-22.zip]] &lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-23.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-24.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-25.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-26.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-27.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-28.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-29.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-30.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-31.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-32.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-33.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-34.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-35.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-36.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-37.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-38.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-39.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-40.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-41.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-42.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-43.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-44.zip]]&lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-45.zip]] &lt;br /&gt;
&lt;br /&gt;
* [[File:Batch-cni-dict-46.zip]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category: GCI2012]]&lt;/div&gt;</summary>
		<author><name>Cjl</name></author>
	</entry>
</feed>