Workflow Overview

The overview below shows how the OCR library types and functions relate to each other:

 

OCR Preparation

 

PXODocument is the main document structure that the PDF-XChange PRO SDK uses.

OCR_Init sets up a new PXODocument in order to load input files and perform OCR.

OCR_LoadA and OCR_LoadW load input files into the PXODocument object’s input layer.

OCR_GetText processes a PXODocument and then formats and returns the plain text.

OCR_MakeSearchable processes a PXODocument and then generates a new output layer that contains searchable PDF results.

OCR_SaveW and OCR_SaveA can be used to save these results.

OCR_GetField performs OCR on a single input field of a PXODocument and then formats and returns the plain text.

OCR_GetFields performs OCR on multiple input fields of a PXODocument and then formats and returns the plain text.

PXO_FieldInputFlags is an enumerated type that determines the style of input coordinates that OCR_GetField and OCR_GetFields use.

OCR_SetCallBack sets the callback function for the PDF rasterization process of PXODocument structures.

PXO_CallbackStage is an enumerated type that is passed to the user-defined callback function registered with OCR_SetCallback.

OCRp_Page performs OCR on a specified page of a PXODocument, then returns the results in a structure that can be queried for text layout details.

OCRp_Field performs OCR on a specified area of a PXODocument, then returns the results in a structure that can be queried for text layout details.

OCR_GetNumInputPages returns the number of pages in the input layer of the PXODocument.

OCR_Delete deletes the PXODocument, which is a necessary step once all functions are complete.
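The order of operations described above (initialize, load, process, save, delete) can be sketched with a small stand-in model. This is only an illustration of the lifecycle: the `MockDocument` type and every `mock_` function are hypothetical and do not reproduce the SDK's real signatures, return codes, or behavior. In particular, the rule that saving fails until an output layer exists is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PXODocument; NOT the SDK's real layout. */
typedef struct MockDocument {
    int input_pages;  /* pages currently in the input layer */
    int has_output;   /* nonzero once a searchable output layer exists */
} MockDocument;

/* Plays the role of OCR_Init: create an empty document. */
MockDocument *mock_init(void) {
    return calloc(1, sizeof(MockDocument));
}

/* Plays the role of OCR_LoadA/OCR_LoadW: add a file's pages to the input layer. */
int mock_load(MockDocument *doc, int pages_in_file) {
    if (doc == NULL || pages_in_file <= 0) return -1;
    doc->input_pages += pages_in_file;
    return 0;
}

/* Plays the role of OCR_GetNumInputPages. */
int mock_num_input_pages(const MockDocument *doc) {
    return doc != NULL ? doc->input_pages : 0;
}

/* Plays the role of OCR_MakeSearchable: needs input pages, produces an output layer. */
int mock_make_searchable(MockDocument *doc) {
    if (doc == NULL || doc->input_pages == 0) return -1;
    doc->has_output = 1;
    return 0;
}

/* Plays the role of OCR_SaveA/OCR_SaveW: assumed to fail without an output layer. */
int mock_save(const MockDocument *doc) {
    return (doc != NULL && doc->has_output) ? 0 : -1;
}

/* Plays the role of OCR_Delete: release the document when all work is done. */
void mock_delete(MockDocument *doc) {
    free(doc);
}
```

A caller would invoke these in the order `mock_init`, `mock_load`, `mock_make_searchable`, `mock_save`, `mock_delete`. The matching real calls should be checked against the SDK function reference.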

 

PXO_Options is the OCR options input structure that determines variables for the OCR process. It utilizes:

PXO_Language to determine the language used for OCR.

OCR_RegionMode to improve the accuracy of page segmentation for specific text formats.

OCR_ImageProcessingFlags to enable additional operations when images are processed.

 

PXO_Pagelist is an input type used to store PDF page numbers for OCR operations.

OCR_NewPagelist initializes a new PXO_Pagelist structure.

OCR_AddPage adds a new input document page number to the PXO_Pagelist structure.

OCR_NumPages returns the number of input page numbers stored in the PXO_Pagelist structure.

OCR_GetPageByIndex returns a specified input document page number from the PXO_Pagelist structure.

OCR_PagesToInputFields duplicates an input field for the pages that the input PXO_Pagelist structure specifies.

OCR_ReleasePagelist releases the memory that PXO_Pagelist structures use, which is a necessary step once all functions are complete.
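The page list operations above (create, add, count, get by index, release) follow a familiar growable-array pattern. The sketch below shows that pattern only. `MockPagelist` and the `mock_` functions are hypothetical names, and the SDK's actual storage, indexing base, and error codes may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PXO_Pagelist. */
typedef struct MockPagelist {
    int *pages;       /* stored page numbers */
    size_t count;     /* number of page numbers in use */
    size_t capacity;  /* allocated slots */
} MockPagelist;

/* Plays the role of OCR_NewPagelist. */
MockPagelist *mock_new_pagelist(void) {
    return calloc(1, sizeof(MockPagelist));
}

/* Plays the role of OCR_AddPage: append a page number, growing as needed. */
int mock_add_page(MockPagelist *list, int page) {
    if (list == NULL) return -1;
    if (list->count == list->capacity) {
        size_t new_cap = list->capacity ? list->capacity * 2 : 4;
        int *grown = realloc(list->pages, new_cap * sizeof *grown);
        if (grown == NULL) return -1;
        list->pages = grown;
        list->capacity = new_cap;
    }
    list->pages[list->count++] = page;
    return 0;
}

/* Plays the role of OCR_NumPages. */
size_t mock_num_pages(const MockPagelist *list) {
    return list != NULL ? list->count : 0;
}

/* Plays the role of OCR_GetPageByIndex (zero-based index assumed here). */
int mock_get_page_by_index(const MockPagelist *list, size_t index, int *page_out) {
    if (list == NULL || page_out == NULL || index >= list->count) return -1;
    *page_out = list->pages[index];
    return 0;
}

/* Plays the role of OCR_ReleasePagelist: free all memory owned by the list. */
void mock_release_pagelist(MockPagelist *list) {
    if (list == NULL) return;
    free(list->pages);
    free(list);
}
```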

 

PXO_InputField is an input structure for OCR.

OCR_NewInputFields initializes a new PXO_InputField structure.

OCR_GetInputFieldByIndex returns the PXO_InputField stored at the specified zero-based index of a PXO_InputFields structure.

PXO_FieldInputFlags is an enumerated type that specifies the flags used to determine the style of input coordinates that PXO_InputField, OCR_GetField and OCR_GetFields use.

OCR_ReleaseInputFields deletes PXO_InputField structures, which is a necessary step once all functions are complete.

 

PXO_InputFields contains a list of PXO_InputField structures for zonal/regional OCR.

OCR_AddInputField adds a new PXO_InputField to a PXO_InputFields structure.

OCR_LoadTemplateW loads a list of input fields into a PXO_InputFields structure from an ASCII text input file.

OCR_SaveTemplateW saves a list of input fields from a PXO_InputFields structure to a template file.

OCR_NumInputFields returns the number of input fields stored in PXO_InputFields.

OCR_ReleaseInputFields frees the memory that PXO_InputFields structures use, which is a necessary step once all functions are complete.
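Zonal OCR often reads the same rectangle on several pages, which is the job OCR_PagesToInputFields is described as doing. A minimal sketch of that duplication step follows. The `MockField` layout, the coordinate members, and `mock_pages_to_fields` are all hypothetical. The real PXO_InputField members and the way the SDK returns the duplicated fields are not given in this overview.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PXO_InputField: a rectangle on one page. */
typedef struct MockField {
    int page;                         /* page the field applies to */
    double left, top, right, bottom;  /* rectangle; units are an assumption */
} MockField;

/*
 * Copies `template_field` once per entry in `pages`, changing only the
 * page number. Returns the number of fields written, or 0 if an argument
 * is NULL or `out` is too small.
 */
size_t mock_pages_to_fields(const MockField *template_field,
                            const int *pages, size_t num_pages,
                            MockField *out, size_t out_capacity) {
    if (template_field == NULL || pages == NULL || out == NULL) return 0;
    if (num_pages > out_capacity) return 0;
    for (size_t i = 0; i < num_pages; ++i) {
        out[i] = *template_field;  /* same rectangle on every page */
        out[i].page = pages[i];    /* only the page number varies */
    }
    return num_pages;
}
```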

 

OCR_RasterPageSettings converts PDF coordinates to/from rasterized page image coordinates, which is a necessary step for some of the Low-Level Functions.

OCRp_RasterRectToPDF utilizes this structure.

OCRp_Page and OCRp_Field return OCR_RasterPageSettings.

OCRp_RasterRectToPDF converts results from OCRp_Page or OCRp_Field into PDF coordinates to make it possible to query them for text layout details.

OCR_SymbolBox is a structure that contains a single character and, when available, descriptive information from the OCR process. It uses OCR_Baseline to store the baseline for text elements.

 

OCR Results Hierarchy

 

Results are returned in a hierarchy after the OCR process is performed:

 

PXO_Page is the top level of the OCR results hierarchy.

PXO_Page may contain PXO_Region members (see below).

OCRp_PageText returns plain text from specified PXO_Page structures.

OCRp_RegionCountFromPage returns the number of regions in the specified PXO_Page structure.

OCRp_GetRegionFromPage returns the requested output region from the specified PXO_Page structure.

OCRp_FreePage is used to delete PXO_Page structures and free the memory that they use.

 

PXO_Region is the second level of the OCR results hierarchy.

OCRp_GetRegionFromPage can be used to return PXO_Region members. (N.B. OCRp_FreePage must be used to free memory when this process is complete, which also deletes the associated PXO_Region members.)

PXO_Region structures may contain OCR_SymbolBox members, which OCRp_GetSymbolFromRegion can be used to access.

OCRp_SymbolCountFromRegion returns the number of symbols in the specified PXO_Region structure.

OCRp_GetSymbolFromRegion returns a requested symbol from the specified PXO_Region structure.
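The page, region, and symbol levels are typically walked with two nested loops driven by the count and get functions. The sketch below illustrates that traversal on stand-in types. `MockPage`, `MockRegion`, `MockSymbol`, and all `mock_` functions are hypothetical and do not reflect the real structure members or signatures. Writing a newline after each region is also an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for OCR_SymbolBox, PXO_Region and PXO_Page. */
typedef struct MockSymbol { char ch; } MockSymbol;
typedef struct MockRegion { const MockSymbol *symbols; size_t symbol_count; } MockRegion;
typedef struct MockPage   { const MockRegion *regions; size_t region_count; } MockPage;

/* Plays the role of OCRp_RegionCountFromPage. */
size_t mock_region_count(const MockPage *page) {
    return page != NULL ? page->region_count : 0;
}

/* Plays the role of OCRp_GetRegionFromPage; NULL if the index is out of range. */
const MockRegion *mock_get_region(const MockPage *page, size_t index) {
    return (page != NULL && index < page->region_count) ? &page->regions[index] : NULL;
}

/* Plays the role of OCRp_SymbolCountFromRegion. */
size_t mock_symbol_count(const MockRegion *region) {
    return region != NULL ? region->symbol_count : 0;
}

/* Plays the role of OCRp_GetSymbolFromRegion; NULL if the index is out of range. */
const MockSymbol *mock_get_symbol(const MockRegion *region, size_t index) {
    return (region != NULL && index < region->symbol_count) ? &region->symbols[index] : NULL;
}

/* Appends one character if there is room for it plus the terminator. */
static int append_char(char *buf, size_t buf_size, size_t *len, char c) {
    if (*len + 1 >= buf_size) return 0;
    buf[(*len)++] = c;
    return 1;
}

/* Walks page -> regions -> symbols and collects the characters into buf. */
size_t mock_page_text(const MockPage *page, char *buf, size_t buf_size) {
    size_t len = 0;
    int ok = 1;
    if (buf == NULL || buf_size == 0) return 0;
    for (size_t r = 0; ok && r < mock_region_count(page); ++r) {
        const MockRegion *region = mock_get_region(page, r);
        for (size_t s = 0; ok && s < mock_symbol_count(region); ++s) {
            ok = append_char(buf, buf_size, &len, mock_get_symbol(region, s)->ch);
        }
        if (ok) ok = append_char(buf, buf_size, &len, '\n');
    }
    buf[len] = '\0';
    return len;
}
```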