Google Code-In 2012/OCR document reformatting: Difference between revisions
Created page with "The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text. That digitized text is the starting po..." |
No edit summary |
||
| Line 15: | Line 15: | ||
3) Save file as either Pagennn.xls or Pagennn.ods. Both formats are acceptable. | 3) Save file as either Pagennn.xls or Pagennn.ods. Both formats are acceptable. | ||
4) Visually review (copyedit) each PDF file and spreadsheet file pair. Try to correct any errors introduced by the OCR process (for example, a "b" is sometimes OCR'ed as an "h" and vice versa). Pay special attention to characters with accents or other diacritical marks. | |||
5) Repeat the process from step 3 with the next 9 Page files in the zip. | |||
6) Submit the spreadsheet files to the task mentor for review/sign-off. | |||
* [[File:Batch-cni-dict-01.zip]] | * [[File:Batch-cni-dict-01.zip]] (Completed example) | ||
* [[File:Batch-cni-dict-02.zip]] | * [[File:Batch-cni-dict-02.zip]] | ||
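Before submitting a batch, it may help to confirm that every page PDF has a matching spreadsheet. The following is a minimal sketch in Python, not part of the task instructions. It assumes the PDFs in the zip are named Pagennn.pdf and that nnn is a three-digit page number; the function name `find_unpaired_pages` is hypothetical.

```python
import re
from pathlib import Path

# Assumption: page files are named "Page" plus three digits, e.g. Page001.
PAGE_STEM = re.compile(r"^Page\d{3}$")
# Step 3 accepts either spreadsheet format.
SHEET_SUFFIXES = {".xls", ".ods"}


def find_unpaired_pages(filenames):
    """Return the sorted page names that have a PDF but no spreadsheet."""
    pdfs, sheets = set(), set()
    for name in filenames:
        path = Path(name)
        if not PAGE_STEM.match(path.stem):
            continue  # ignore anything that is not a Pagennn file
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            pdfs.add(path.stem)
        elif suffix in SHEET_SUFFIXES:
            sheets.add(path.stem)
    return sorted(pdfs - sheets)


if __name__ == "__main__":
    # Hypothetical folder holding the unzipped batch.
    batch_dir = Path("Batch-cni-dict-02")
    if batch_dir.is_dir():
        missing = find_unpaired_pages(p.name for p in batch_dir.iterdir())
        print("Pages still missing a spreadsheet:", missing)
```

An empty result only means each PDF has a spreadsheet file next to it; the visual review in step 4 is still needed to catch OCR errors.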