Changes

no edit summary
Line 1: Line 1:
   −
The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text.  That digitized text is the starting point for any number of efforts.  A batch processing fo the PDFs into text files has already been performed.
+
The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text.  That digitized text is the starting point for any number of efforts.  A batch processing of the PDFs into text files has already been performed.
   −
The next critical step is to take the imperfectly formatted OCR output and restructure it into something tha corresponds to the two column format of the original PDF.  The format of choice for this is a spreadsheet containing each entry in side by side columns.
+
 
 +
 
 +
The next critical step is to take the imperfectly formatted OCR output and restructure it into something that corresponds to the two column format of the original PDF.  The format of choice for this is a spreadsheet containing each entry in side by side columns.
    
Here is the full image-PDF [[File:Dt19.pdf]], which has been broken down into individual pages and grouped into the batches below, each containing both the original PDF pages and the OCR'ed text.
   −
Download the zips blow (each contains 10 PDF image files and ten text files).
+
Download the zips below (each contains 10 PDF image files and 10 text files).
    
Recommended approach:
Line 15: Line 17:  
3) Save file as either Pagennn.xls or Pagennn.ods.  Both formats are acceptable.
   −
3) Repeat process from step 3 with next 9 Page files in the zip.
+
4) Visually review (copyedit) each PDF file and spreadsheet file pair.  Try to correct any errors introduced by the OCR process (for example, a "b" is sometimes OCR'ed as an "h" and vice versa).  Pay special attention to characters with accents or other diacritical marks.
 +
 
 +
5) Repeat process from step 3 with next 9 Page files in the zip.
   −
4) Visually review (copyedit) each of the PDF file and spreadsheet file pairs.  Try to correct any errors introduced by the OCR process (for example a a "b" is sometimes OCR'ed as an "h" and vice-versa. Pay special attention to characters with accents of other diacritical marks.
+
6) Submit the spreadsheet files to the task mentor for review/sign-off.  
         −
* [[File:Batch-cni-dict-01.zip]]
+
* [[File:Batch-cni-dict-01.zip]] (Completed example)
    
* [[File:Batch-cni-dict-02.zip]]
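
The visual review in step 4 could be supported by a small script that points the reviewer at the words most likely to be wrong. The following is only a minimal sketch, not part of the task instructions: the function name `flag_for_review` is hypothetical, and it assumes the OCR'ed page text has already been read into a Python string. It flags any word that contains a "b" or an "h" (the confusion named above) or a character carrying a diacritical mark. It does not correct anything; each flagged word still has to be compared against the PDF by eye.

```python
import unicodedata


def flag_for_review(line):
    """Return the words in an OCR'ed line that deserve a closer look.

    A word is flagged if it contains "b" or "h" (often confused by OCR)
    or any character with a diacritical mark (accents, tildes, etc.).
    """
    flagged = []
    for word in line.split():
        has_b_or_h = any(ch in "bhBH" for ch in word)
        # NFD splits an accented letter into base letter + combining mark,
        # so a non-zero combining class signals a diacritic.
        decomposed = unicodedata.normalize("NFD", word)
        has_diacritic = any(unicodedata.combining(ch) for ch in decomposed)
        if has_b_or_h or has_diacritic:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    for word in flag_for_review("Asháninka kitamaro niño"):
        print(word)
```

Because "b" and "h" are common letters, this will flag many correct words; it is meant to narrow attention, not to replace the side-by-side check against the PDF page.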
