Text Converter
The Text Converter object enables users to extract text from various file formats, including documents, images, and audio files. It supports Optical Character Recognition (OCR) for extracting text from scanned and image-based content.
The Text Converter object provides conversion for:
Document to Text: Extract text from PDF, DOC/DOCX, and TXT files.
Image to Text: Use Optical Character Recognition (OCR) to extract text from images in JPG, PNG, and JPEG formats.
HTML to Text: Extract text from HTML, HTM, and XHTML files.
Markdown to Text: Extract text from MD, MARKDOWN, MKD, MKDN, MDWN, and MDOWN files.
Excel to Text: Extract text from XLS, XLSX, and CSV files.
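As a quick reference, the supported formats listed above can be expressed as a lookup from file extension to conversion type. The Python sketch below is illustrative only: the names `CONVERSIONS` and `conversion_for` are hypothetical and are not part of the product. Audio formats are mentioned in the introduction but not enumerated in the list, so they are left out here.

```python
from pathlib import Path

# Extensions are taken from the list above; the dictionary itself is illustrative.
CONVERSIONS = {
    "Document to Text": {".pdf", ".doc", ".docx", ".txt"},
    "Image to Text": {".jpg", ".jpeg", ".png"},
    "HTML to Text": {".html", ".htm", ".xhtml"},
    "Markdown to Text": {".md", ".markdown", ".mkd", ".mkdn", ".mdwn", ".mdown"},
    "Excel to Text": {".xls", ".xlsx", ".csv"},
}


def conversion_for(file_path):
    """Return the conversion type for a file, or None if its extension is not listed."""
    suffix = Path(file_path).suffix.lower()  # compare extensions case-insensitively
    for name, extensions in CONVERSIONS.items():
        if suffix in extensions:
            return name
    return None
```

For example, `conversion_for("scan.png")` falls under Image to Text, while an extension that is not listed returns `None`.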
Overview
In this guide, we will cover how to:
Convert a PDF document to text.
Extract text from PNG images using OCR.
Use the Text Converter object as a transformation.
How to Use the Text Converter Object
Getting the Text Converter Object
To get a Text Converter object, go to Toolbox > Sources > Text Converter. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag and drop the Text Converter object onto the designer.
Configuring the Text Converter Object
To configure the object, right-click on its header and select Properties from the context menu.
A dialog box will open.
This is where you can configure the properties for the Text Converter object.
Convert a Scanned PDF Document to Text
The first step is to specify the File Path to the PDF file that needs to be converted.
Next, optionally define an Output Directory where the converted text will be stored in another PDF file.
Configure the PDF Converter Options.
Pdf Password: Provide the password if the PDF file is password-protected.
Pages To Read: Specify which pages need to be read. Leaving this empty means read all pages.
Text Converter Model: Select from the available models:
Google OCR
TesseractOCR (Beta)
PaddleOCR (Beta)
TextractOCR (Beta)
Note: PaddleOCR and TesseractOCR are free to use, while Google OCR and TextractOCR require subscriptions. TesseractOCR and PaddleOCR are third-party open-source components and may not provide the highest quality. For the best results, Google OCR is recommended.
Resolution: Choose from four resolution settings: Auto, High, Medium, and Low.
Force OCR: When checked, OCR is applied to every file, whether digital or scanned. When unchecked (the default setting), the system first determines whether the incoming file is scanned or an image, and applies OCR only if the file is detected as scanned or image-based.
Split Output: Check this box to split the text for each page into a separate output record.
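The way Pages To Read, Force OCR, and Split Output behave, as described above, can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function names are hypothetical, pages are represented as a list of integers because the accepted input format for Pages To Read is not specified here, and joining pages with a newline in the non-split case is an assumption.

```python
def pages_to_process(total_pages, pages_to_read=None):
    """An empty (or missing) page selection means every page is read."""
    if not pages_to_read:
        return list(range(1, total_pages + 1))
    # Assumption: page numbers outside the document are simply ignored.
    return [p for p in pages_to_read if 1 <= p <= total_pages]


def use_ocr(force_ocr, detected_as_scanned):
    """OCR runs when forced, or when the file was detected as scanned or image-based."""
    return force_ocr or detected_as_scanned


def build_records(page_texts, split_output):
    """One output record per page when split_output is set, otherwise a single record."""
    if split_output:
        return list(page_texts)
    # Assumption: pages are joined with a newline when output is not split.
    return ["\n".join(page_texts)]
```

With Force OCR unchecked, `use_ocr` is true only for files detected as scanned, which matches the default behavior described above; checking Split Output turns a three-page document into three records instead of one.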
Configure the Excel Converter Options.
Work Sheet Name: Specify the name of your worksheet that you want to read data from.
Space Between Excel Columns: Specify the space between the Excel Columns.
Blank Lines Before End of File: Specify the number of consecutive blank lines that mark the end of the file.
Tab Size: Specify the tab spacing to be used in the extracted text.
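To make the Excel options more concrete, the sketch below shows one plausible way such settings could shape the extracted text. It is an assumption-laden illustration rather than the product's actual logic: `rows_to_text` and its parameter names are hypothetical, columns are joined with a fixed number of spaces, reading stops once the given number of consecutive blank rows is seen, and tabs inside cells are expanded to the chosen tab size.

```python
def rows_to_text(rows, column_spacing=2, blank_lines_before_eof=1, tab_size=4):
    """Render worksheet rows (lists of cell values) as plain text."""
    lines = []
    blank_run = 0  # number of consecutive blank rows seen so far
    for row in rows:
        cells = ["" if cell is None else str(cell) for cell in row]
        if not any(cell.strip() for cell in cells):
            blank_run += 1
            if blank_run >= blank_lines_before_eof:
                break  # treat this run of blank rows as the end of the file
            lines.append("")
            continue
        blank_run = 0
        line = (" " * column_spacing).join(cells)
        lines.append(line.expandtabs(tab_size))  # replace tabs with spaces
    # Blank rows that directly preceded the end of the file are not output.
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines)
```

Under these assumptions, a sheet with two data rows, one blank row, and a further data row yields only the first two rows when `blank_lines_before_eof` is 1, and all three data rows (with the blank line preserved) when it is 2.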
Once you have configured the Text Converter object, click OK.
Right-click on the Text Converter object’s header and select Preview Output from the context menu.
A Data Preview window will open and will show you the preview of the extracted text.
Extract Text from PNG Images using OCR
The first step is to specify the File Path to the PNG image that needs to be processed for text extraction.
Next, define an Output Directory where the extracted text will be stored. (optional)
Configure the Text Converter Options.
Text Converter Models: Select from the available models:
Google OCR
TesseractOCR
PaddleOCR
TextractOCR
Once you have configured the Text Converter object, click OK.
Right-click on the Text Converter object’s header and select Preview Output from the context menu.
A Data Preview window will open, displaying the extracted text from the PNG image.
Use the Text Converter Object as a Transformation
We can also use the Text Converter object as a transformation. To do so, right-click on the header of the object and select Transformation.
You’ll see the color of the header change from green to purple, indicating its transition from a source to a transformation. An input node is also added alongside the output node.
You can now provide the input from any source object to the Text Converter object in your dataflow directly.
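Conceptually, the difference between the two modes can be sketched as follows. This is only an analogy in Python, not how the dataflow engine works internally: the function names, the `FilePath` and `Text` field names, and the `extract` callable are all hypothetical, and the sketch assumes the upstream object supplies file paths.

```python
def as_source(file_path, extract):
    """Source mode: the file path is fixed in the object's own properties."""
    return [{"FilePath": file_path, "Text": extract(file_path)}]


def as_transformation(upstream_records, extract, path_field="FilePath"):
    """Transformation mode: inputs arrive from an upstream object's records."""
    for record in upstream_records:
        # Keep the upstream fields and add the extracted text alongside them.
        yield {**record, "Text": extract(record[path_field])}
```

In source mode a single configured file produces the output, whereas in transformation mode each incoming record drives one extraction.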
This concludes working with the Text Converter object in Astera Data Stack.