Changes

no edit summary
Line 1:
− The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text.  That digitized text is the starting point for any number of efforts.  A batch processing fo the PDFs into text files has already been performed.
+ The goal is to transform a 456 page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text.  That digitized text is the starting point for any number of efforts.  A batch processing of the PDFs into text files has already been performed.
− The next critical step is to take the imperfectly formatted OCR output and restructure it into something tha corresponds to the two column format of the original PDF.  The format of choice for this is a spreadsheet containing each entry in side by side columns.
+
+
+ The next critical step is to take the imperfectly formatted OCR output and restructure it into something that corresponds to the two column format of the original PDF.  The format of choice for this is a spreadsheet containing each entry in side by side columns.
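
The restructuring step described above could be sketched roughly as follows. This is only an illustration, not the project's actual tooling: it assumes that each OCR line keeps the left and right column text on the same line, separated by a gap of three or more spaces, and that a line starting with such a gap belongs to the right column only. The names `split_line`, `page_to_rows` and `rows_to_csv` are hypothetical, and the sample text is a placeholder rather than real dictionary content.

```python
import csv
import io
import re

# Assumption: a run of three or more whitespace characters separates
# the left column from the right column in the OCR output.
COLUMN_GAP = re.compile(r"\s{3,}")


def split_line(line):
    """Split one OCR line into a (left, right) pair, or None if blank."""
    stripped = line.rstrip()
    if not stripped.strip():
        return None
    # Assumption: a wide leading indent means the left column is empty.
    indent = len(stripped) - len(stripped.lstrip())
    if indent >= 3:
        return ("", stripped.strip())
    parts = COLUMN_GAP.split(stripped.strip(), maxsplit=1)
    left = parts[0]
    right = parts[1] if len(parts) > 1 else ""
    return (left, right)


def page_to_rows(text):
    """Turn the OCR text of one page into a list of (left, right) rows."""
    rows = []
    for line in text.splitlines():
        pair = split_line(line)
        if pair is not None:
            rows.append(pair)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["left_column", "right_column"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "left one    right one\n\n      right two\nleft three\n"
    print(rows_to_csv(page_to_rows(sample)))
```

Real OCR output will likely break these assumptions in places (merged columns, wrapped entries, stray characters), so the resulting spreadsheet would still need manual review against the original PDF page.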
    
Here is the full image-PDF [[File:Dt19.pdf]], which has been broken down to individual pages and grouped in batches below containing both the original PDF page and the OCR'ed text.
− Download the zips blow (each contains 10 PDF image files and ten text files).
+ Download the zips below (each contains 10 PDF image files and ten text files).
    
Recommended approach:
 