Google Code-In 2012/OCR document reformatting: Difference between revisions

Created page with "The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text. That digitized text is the starting po..."
 
No edit summary
Line 15: Line 15:
3) Save file as either Pagennn.xls or Pagennn.ods.  Both formats are acceptable.


3) Repeat the process from step 3 with the next 9 Page files in the zip.
4) Visually review (copyedit) each pair of PDF and spreadsheet files.  Try to correct any errors introduced by the OCR process (for example, a "b" is sometimes OCR'ed as an "h" and vice versa).  Pay special attention to characters with accents or other diacritical marks.


4) Visually review (copyedit) each pair of PDF and spreadsheet files.  Try to correct any errors introduced by the OCR process (for example, a "b" is sometimes OCR'ed as an "h" and vice versa).  Pay special attention to characters with accents or other diacritical marks.
5) Repeat the process from step 3 with the next 9 Page files in the zip.
 
6) Submit the spreadsheet files to the task mentor for review/sign-off.






* [[File:Batch-cni-dict-01.zip]]
* [[File:Batch-cni-dict-01.zip]] (Completed example)


* [[File:Batch-cni-dict-02.zip]]  