Google Code-In 2012/OCR document reformatting

The goal is to transform a 456-page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text. That digitized text is the starting point for any number of efforts. The PDF pages have already been batch-processed into text files.

The next critical step is to take the imperfectly formatted OCR output and restructure it into something that corresponds to the two-column format of the original PDF. The format of choice for this is a spreadsheet containing each entry in side-by-side columns.
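As a rough illustration of the target layout, the Python sketch below writes two lists of entries into side-by-side columns of a CSV file, which Excel or OpenOffice can open and re-save as .xls or .ods. The function names and the placeholder entries are hypothetical, and the sketch assumes the left-column and right-column entries have already been separated by hand. It is not part of the task workflow itself.

```python
import csv
import io
from itertools import zip_longest


def join_wrapped(lines):
    """Join the wrapped lines of one entry into a single cell value."""
    return " ".join(part.strip() for part in lines if part.strip())


def write_two_columns(left_entries, right_entries, out):
    """Write entries as rows: column A = left PDF column, column B = right."""
    writer = csv.writer(out)
    # zip_longest pads the shorter column with empty cells.
    for left, right in zip_longest(left_entries, right_entries, fillvalue=""):
        writer.writerow([left, right])


if __name__ == "__main__":
    # Placeholder text only; real entries come from Pagennn.txt.
    left = [join_wrapped(["left entry 1, first line", "second line"]),
            "left entry 2"]
    right = ["right entry 1"]
    buffer = io.StringIO()
    write_two_columns(left, right, buffer)
    print(buffer.getvalue())
```

Writing to a file opened with `newline=""` and `encoding="utf-8"` instead of the in-memory buffer would keep accented characters intact when the CSV is opened in a spreadsheet program.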

Here is the full image-PDF File:Dt19.pdf, which has been broken down into individual pages and grouped in the batches below, each containing both the original PDF pages and the OCR'ed text.

Download the zips below (each contains 10 PDF image files and 10 text files).

Recommended approach:

1) Copy the text from the text file (Pagennn.txt) and paste it into two side-by-side columns in an MS Excel or OpenOffice .ods spreadsheet (Pagennn.xls or Pagennn.ods).

2) Carefully delete certain entries in the two columns so that the result mirrors the aligned organization of the PDF file columns. Please note that line wraps in the PDF should be ignored and all of an entry should be placed into a single cell. In addition, the header and footer information present on the PDF image should be excluded from the spreadsheet. It is irrelevant whether the entries are separated by a blank row or not as long as the entry in column A corresponds to the entry in column B.

3) Save file as either Pagennn.xls or Pagennn.ods. Both formats are acceptable.

4) Visually review (copyedit) each of the PDF file and spreadsheet file pairs. Try to correct any errors introduced by the OCR process (for example, a "b" is sometimes OCR'ed as an "h" and vice versa). Pay special attention to characters with accents or other diacritical marks.

5) Repeat the process from step 1 with the remaining 9 Page files in the zip.

6) Submit the spreadsheet files to the task mentor for review/sign-off.
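
The visual review in step 4 could be supported by a small helper that points out which characters deserve a second look against the PDF. The Python sketch below flags every "b" or "h" and every character carrying an accent or other diacritical mark. The function name `review_flags` is hypothetical, and the script only highlights candidates; it cannot decide whether the OCR result is correct.

```python
import unicodedata


def has_diacritic(ch):
    """True if the character decomposes into a base letter plus a mark."""
    decomposed = unicodedata.normalize("NFD", ch)
    return any(unicodedata.combining(c) for c in decomposed)


def review_flags(entry):
    """Return (position, character, reason) for characters worth checking."""
    # Normalize first so an accented letter is a single character.
    entry = unicodedata.normalize("NFC", entry)
    flags = []
    for position, ch in enumerate(entry):
        if ch in "bhBH":
            flags.append((position, ch, "possible b/h confusion"))
        elif has_diacritic(ch):
            flags.append((position, ch, "accent or diacritic"))
    return flags


if __name__ == "__main__":
    for flag in review_flags("Ash\u00e1ninka"):
        print(flag)
```

Each flagged position still has to be compared by eye with the original PDF page before the spreadsheet is submitted.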