Changes

Google Code-In 2012/OCR document reformatting (view source)

Revision as of 11:07, 23 November 2012

4 bytes added, 11:07, 23 November 2012

no edit summary

Line 1: Line 1:

−

The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text. That digitized text is the starting point for any number of efforts. A batch processing fo the PDFs into text files has already been performed.

+

The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text. That digitized text is the starting point for any number of efforts. A batch processing of the PDFs into text files has already been performed.

−

The next critical step is to take the imperfectly formatted OCR output and restructure it into something ~~tha~~ corresponds to the two column format of the original PDF. The format of choice for this is a spreadsheet containing each entry in side by side columns.

+

The next critical step is to take the imperfectly formatted OCR output and restructure it into something that corresponds to the two column format of the original PDF. The format of choice for this is a spreadsheet containing each entry in side by side columns.
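As a rough illustration of the restructuring step, here is a minimal Python sketch that turns OCR text into two-column CSV that a spreadsheet can open. It is not the approach prescribed on this page. It assumes, hypothetically, that each OCR line holds one entry and that the headword and its gloss are separated by a tab or by two or more spaces; the real OCR output may need different rules. The names `split_entry`, `ocr_text_to_csv`, and the column headers `entry` and `gloss` are invented for this example.

```python
import csv
import io
import re

# Hypothetical assumption: one entry per OCR line, with headword and gloss
# separated by a tab or by a run of two or more whitespace characters.
SEPARATOR = re.compile(r"\t+|\s{2,}")


def split_entry(line):
    """Split one OCR line into (headword, gloss).

    The gloss is an empty string when no separator is found, so the row
    can be spotted and fixed by hand in the spreadsheet.
    """
    parts = SEPARATOR.split(line.strip(), maxsplit=1)
    headword = parts[0]
    gloss = parts[1] if len(parts) > 1 else ""
    return headword, gloss


def ocr_text_to_csv(text):
    """Return CSV text with a header row and one row per non-empty OCR line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["entry", "gloss"])  # hypothetical column names
    for line in text.splitlines():
        if line.strip():  # skip blank lines left over from OCR
            writer.writerow(split_entry(line))
    return out.getvalue()
```

Rows that come out with an empty second column are the ones the separator rule could not split, which makes them easy to find and correct manually against the original PDF page.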

Here is the full image-PDF [[File:Dt19.pdf]], which has been broken down into individual pages and grouped in batches below containing both the original PDF page and the OCR'ed text.

−

Download the zips ~~blow~~ (each contains 10 PDF image files and ten text files).

+

Download the zips below (each contains 10 PDF image files and ten text files).

Recommended approach:

Cjl

