Changes

101 bytes added, 11:08, 23 November 2012

no edit summary

Line 1: Line 1:

−

The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text. That digitized text is the starting point for any number of efforts. A batch processing fo the PDFs into text files has already been performed.

+

The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text. That digitized text is the starting point for any number of efforts. A batch processing of the PDFs into text files has already been performed.

−

The next critical step is to take the imperfectly formatted OCR output and restructure it into something ~~tha~~ corresponds to the two column format of the original PDF. The format of choice for this is a spreadsheet containing each entry in side by side columns.

+

The next critical step is to take the imperfectly formatted OCR output and restructure it into something that corresponds to the two column format of the original PDF. The format of choice for this is a spreadsheet containing each entry in side by side columns.

Here is the full image-PDF [[File:Dt19.pdf]], which has been broken down into individual pages and grouped in the batches below, each containing both the original PDF pages and the OCR'ed text.

−

Download the zips ~~blow~~ (each contains 10 PDF image files and ten text files).

+

Download the zips below (each contains 10 PDF image files and ten text files).

Recommended approach:

Line 15: Line 17:

3) Save file as either Pagennn.xls or Pagennn.ods. Both formats are acceptable.

−

3) Repeat process from step 3 with next 9 Page files in the zip.

+

4) Visually review (copyedit) each PDF file and spreadsheet file pair. Try to correct any errors introduced by the OCR process (for example, a "b" is sometimes OCR'ed as an "h" and vice versa). Pay special attention to characters with accents or other diacritical marks.

+

5) Repeat process from step 3 with next 9 Page files in the zip.

−

4) ~~Visually review (copyedit) each of~~ the ~~PDF file and~~ spreadsheet ~~file pairs. Try~~ to ~~correct any errors introduced by~~ the ~~OCR process (~~for ~~example a a "b" is sometimes OCR'ed as an "h" and vice~~-~~versa. Pay special attention to characters with accents of other diacritical marks~~.

+

6) Submit the spreadsheet files to the task mentor for review/sign-off.

−

* [[File:Batch-cni-dict-01.zip]]

+

* [[File:Batch-cni-dict-01.zip]] (Completed example)

* [[File:Batch-cni-dict-02.zip]]
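The visual review in step 4 could be supported by a small script that lists the words most likely to need a second look. The following is only a sketch under stated assumptions: the function names `has_diacritic` and `flag_words` are hypothetical and not part of the task materials, and the script only highlights candidates (words containing "b" or "h", or any accented character). It does not correct anything, so the comparison against the PDF page still has to be done by eye.

```python
import unicodedata
from typing import List, Tuple


def has_diacritic(word: str) -> bool:
    """Return True if any character in the word carries a combining mark."""
    # NFD splits an accented letter into a base letter plus combining marks,
    # which have the Unicode category "Mn".
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.category(ch) == "Mn" for ch in decomposed)


def flag_words(line: str) -> List[Tuple[str, str]]:
    """Return (word, reason) pairs worth checking against the PDF page.

    Hypothetical helper: the two reasons mirror the review hints in step 4
    (accents or other diacritics, and the b/h confusion).
    """
    flagged = []
    for word in line.split():
        token = word.strip('.,;:()[]"')
        if not token:
            continue
        if has_diacritic(token):
            flagged.append((token, "diacritic"))
        elif "b" in token.lower() or "h" in token.lower():
            flagged.append((token, "b/h"))
    return flagged


if __name__ == "__main__":
    # Example line; real input would come from one of the Page text files.
    for token, reason in flag_words("Asháninka casa nube"):
        print(token, reason)
```

Each flagged word is a prompt for the reviewer, not a detected error; most flagged words will be correct as OCR'ed.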

Cjl


Google Code-In 2012/OCR document reformatting (view source)

Revision as of 11:08, 23 November 2012
