This section describes the Optical Character Recognition (OCR) feature in Astera.
In this example, we have a PDF file that is a scanned copy of an invoice for a consulting service. We will use the OCR option in Astera to extract its data.
1. Go to File > New > Report Model.
Select the PDF and click Open.
2. Because the file contains no text and consists only of images, Astera presents it as follows:
Select Use OCR to begin OCR processing on the document.
While OCR is running, a green progress bar is shown. To cancel processing, uncheck the Use OCR option in the OCR Options section of the Report Options panel.
4. For the current file, OCR gives the following result:
As can be seen, there are a few misread items in the quantity column.
Another option, OCR Resolution, lets you select a suitable resolution at which to apply OCR. To determine which option is best for your case, see the Best Practices for OCR document.
Here, we set OCR Resolution to Low and get the following result.
5. After processing, the data is presented on the designer. Along with the useful data, the result contains noise elements from the original file that were incorrectly converted to text, as well as some erroneous readings.
Astera provides an Edit Mode in which you can edit the extracted data to clean and correct the OCR result.
Click the Start Edit Mode icon on the toolbar above the designer.
6. In Edit Mode, a separate toolbar provides tools for editing.
We will edit the data in the file to remove the noise elements and correct a few numbers.
7. The edited data can be saved to a .txt file at any point for later use.
8. Once you are done with the edits, click the End Edit Mode icon on the toolbar above the designer. Ending Edit Mode after making edits changes the source file path of the Report Model to the .txt file you just saved.
Now, you can proceed to interact with the data on the designer as normal in a Report Model.
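If the same kinds of misreads recur across many scanned invoices, corrections like the ones made by hand in Edit Mode can also be scripted against the saved .txt file. The Python sketch below is not part of Astera. It assumes the typical OCR confusions between letters and digits (for example, O for 0 and l for 1); the function names `fix_numeric_token` and `fix_line` and the substitution table are illustrative assumptions.

```python
# Illustrative sketch only; this is not an Astera API.
# Letters that OCR engines commonly confuse with digits (assumed list).
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def fix_numeric_token(token: str) -> str:
    """Replace look-alike letters with digits when the token is plausibly a number.

    The token is changed only if it already contains at least one real digit
    and the substituted result is purely numeric (allowing commas and one
    decimal point). Anything else is returned unchanged.
    """
    candidate = token.translate(DIGIT_LOOKALIKES)
    digits_only = candidate.replace(",", "").replace(".", "", 1)
    if digits_only.isdigit() and any(ch.isdigit() for ch in token):
        return candidate
    return token


def fix_line(line: str) -> str:
    """Apply fix_numeric_token to every space-separated token in a line."""
    # Splitting on a single space and re-joining preserves runs of spaces.
    return " ".join(fix_numeric_token(tok) for tok in line.split(" "))


print(fix_line("Consulting   1O   l2.5O"))
```

Each substitution replaces one character with one character, so column positions in the text file are unchanged. The check is deliberately conservative: a token made up entirely of letters is left alone, so such cases still need a manual fix in Edit Mode.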
This concludes our discussion on the usage of OCR in loading files in Astera.
The Edit Mode toolbar provides the following tools:

Name | Purpose & Functionality |
---|---|
Save | Saves the extracted data to a .txt file. |
Find and Replace | Finds and/or replaces a particular string within the text. |
Cut | Cuts the selected data to the clipboard. |
Copy | Copies the selected data to the clipboard. |
Paste | Pastes the last item from the clipboard. |
Undo | Undoes the previous action. |
Redo | Redoes the undone action. |
Revert to Original | Reverts all edits back to the original text. |
End Edit Mode | Ends Edit Mode and returns to the Report Model. |
Optical Character Recognition (OCR) detects and recognizes the text in a scanned image or scanned document. OCR technologies cannot convert 100% of the text found in such documents. However, certain properties of the scanned document help the OCR technology perform better.
This document lists the properties that a quality scan should possess for the OCR in Astera to recognize and process the data with at least 95% accuracy.
1. It is recommended to maintain a scan quality of at least 250 dots per inch (DPI). A better camera or better lighting can improve the quality of the captured image. PDF readers can detect the DPI of a scanned document.
2. Avoid tilted document scans. Whenever you scan a document, ensure the orientation is set to portrait, and the image is aligned with the borders.
3. Avoid pencil/pen marks near the text. If you have to sign a document, it is recommended to do it at the bottom corner of the document to be as far away as possible from the central text.
4. Avoid watermarks on a document.
5. The original document should maintain adequate spacing between columns and records so that text does not overlap in the scan.
6. Black-and-white documents are recommended, with black text on a white background.
7. Table headers should preferably not have fill colors.
8. High contrast between text and the background is helpful for better visibility.
9. Avoid highlighting of text on the document.
10. Consistent font size within a line is helpful for better data processing.
11. The minimum font size is 12 pt in Calibri. Any font type can be used; Calibri is given only as a reference.
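Items 1 and 11 above can be checked with simple arithmetic before a batch of documents is scanned. The Python sketch below is an illustration only and is not part of Astera. It assumes you know the pixel dimensions of the scanned image and the physical page size in inches; the function names are hypothetical. It relies on the standard typographic definition of 72 points per inch.

```python
MIN_DPI = 250          # recommended minimum from item 1
POINTS_PER_INCH = 72   # standard typographic points per inch


def effective_dpi(width_px: int, height_px: int,
                  page_width_in: float, page_height_in: float) -> float:
    """Return the lower of the horizontal and vertical pixel densities."""
    return min(width_px / page_width_in, height_px / page_height_in)


def font_height_px(font_size_pt: float, dpi: float) -> float:
    """Nominal height in pixels of a font of the given point size at this DPI."""
    return font_size_pt * dpi / POINTS_PER_INCH


# Example: a US Letter page (8.5 x 11 in) scanned to 2550 x 3300 pixels.
dpi = effective_dpi(2550, 3300, 8.5, 11.0)
print("DPI:", dpi, "meets minimum:", dpi >= MIN_DPI)
print("12 pt text height in pixels:", font_height_px(12, dpi))
```

The same page scanned to 1700 x 2200 pixels works out to 200 DPI, which falls below the recommended minimum.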
In its OCR implementation, Astera provides the OCR Resolution option. The resolutions Low, Medium, and High are based on the zoom factor, that is, how much zoom OCR applies when converting the image data.
Just as a phone camera captures more pixels at a higher resolution and fewer at a lower one, Astera’s OCR converts images based on the selected resolution.
Low resolution is recommended when the scanned PDF contains little text; it also takes the least time. For PDFs with more text, shift to Medium, and for even more text, to High. High takes more time than Low and Medium.
Medium and High can often be used interchangeably. Sometimes, however, Medium does not convert all the text and the result has missing values. In that case, High can be beneficial, as it can bring in all the data in a more structured manner.
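One way to read the guidance above is as an escalation strategy: start at the fastest resolution and move up only when the result is incomplete. The Python sketch below illustrates that logic only. Astera exposes OCR Resolution as a setting in the user interface; the `run_ocr` callable and the `is_complete` check are hypothetical placeholders and do not correspond to any Astera API.

```python
from typing import Callable, Sequence, Tuple

# Ordered from fastest to slowest, as described above.
RESOLUTIONS: Sequence[str] = ("Low", "Medium", "High")


def ocr_with_escalation(
    run_ocr: Callable[[str], str],
    is_complete: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try each resolution in order and stop at the first complete result.

    Returns (resolution, text). If no result passes is_complete, the
    result from the last resolution tried (High) is returned.
    """
    resolution, text = RESOLUTIONS[0], ""
    for resolution in RESOLUTIONS:
        text = run_ocr(resolution)
        if is_complete(text):
            break
    return resolution, text


# Stand-in for a real OCR run, used only to demonstrate the control flow.
def fake_ocr(resolution: str) -> str:
    results = {"Low": "Qty: ", "Medium": "Qty: 1", "High": "Qty: 10"}
    return results[resolution]


chosen, text = ocr_with_escalation(fake_ocr, lambda t: t.endswith("10"))
print(chosen, repr(text))
```

In practice, the completeness check would be whatever tells you that values are missing, such as an empty field in a column that should always be populated.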
This concludes our discussion on best practices for OCR in Astera.