# Text Converter

The *Text Converter* object enables users to extract text from various file formats, including documents, images, and audio files. It supports Optical Character Recognition (OCR) for extracting text from scanned documents and images.

The Text Converter object provides conversion for:

* **Document to Text**: Extract text from PDF, DOC/DOCX, and TXT files.
* **Image to Text**: Use Optical Character Recognition (OCR) to extract text from images in JPG, PNG, and JPEG formats.
* **HTML to Text**: Extract text from HTML, HTM, and XHTML files.
* **Markdown to Text**: Extract text from MD, MARKDOWN, MKD, MKDN, MDWN, and MDOWN files.
* **Excel to Text**: Extract text from XLS, XLSX, and CSV files.
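
As a rough illustration of the groupings above, the following Python sketch maps a file extension to its conversion category. The dictionary and the `conversion_type` function are hypothetical names used only for this example; they are not part of the product and only restate the extensions listed above.

```python
from pathlib import Path
from typing import Optional

# Extension groups copied from the list above; the dictionary itself is illustrative.
CONVERSIONS = {
    "Document to Text": {"pdf", "doc", "docx", "txt"},
    "Image to Text": {"jpg", "png", "jpeg"},
    "HTML to Text": {"html", "htm", "xhtml"},
    "Markdown to Text": {"md", "markdown", "mkd", "mkdn", "mdwn", "mdown"},
    "Excel to Text": {"xls", "xlsx", "csv"},
}


def conversion_type(file_path: str) -> Optional[str]:
    """Return the conversion category for a file, or None if the extension is not listed."""
    extension = Path(file_path).suffix.lower().lstrip(".")
    for category, extensions in CONVERSIONS.items():
        if extension in extensions:
            return category
    return None


print(conversion_type("invoice.PDF"))   # extension matching is case-insensitive
print(conversion_type("notes.mdown"))
```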

## Overview

In this guide, we will cover how to:

* [Convert documents using OCR to text](#convert-documents-using-ocr-to-text)
* [Extract text from Excel files](#extract-text-from-excel-files)
* [Use the Text Converter object as a transformation](#use-the-text-converter-object-as-a-transformation)

## How to Use the Text Converter Object

### Getting the Text Converter Object

1. To get a *Text Converter* object, go to *Toolbox > Sources > Text Converter*. If you are unable to see the Toolbox, go to *View > Toolbox* or press Ctrl + Alt + X.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/yU1HtWz4CuStyrVp8wih/image.png" alt="" width="350"><figcaption></figcaption></figure>

2. Drag and drop the *Text Converter* object onto the designer.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/zbUYy6PyYCh5sQ7QHT6V/image.png" alt=""><figcaption></figcaption></figure>

### Configuring the Text Converter Object

1. To configure the object, right-click its header and select *Properties* from the context menu.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/AYDFBVrFdLtjiDqLQ6z3/image.png" alt="" width="386"><figcaption></figcaption></figure>

A dialog box will open.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/8RmecP4lIUlcFswTS7ax/image.png" alt="" width="563"><figcaption></figcaption></figure>

This is where you can configure the properties for the Text Converter object.

#### Convert documents using OCR to Text

1. Specify the *File Path* to the scanned PDF/Image file that needs to be converted.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/4DYTU6CP7m4VCBb4L25a/image.png" alt="" width="563"><figcaption></figcaption></figure>

2. Next, optionally define an *Output Directory* where the converted text will be saved as a separate file.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/Spinr1uLh9cdbrVtJTCU/image.png" alt="" width="563"><figcaption></figcaption></figure>

3. Configure the *PDF Converter Options*.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/43AzA2W2G2yqNAKbVaKo/image.png" alt="" width="563"><figcaption></figcaption></figure>

* *Pdf Password:* Provide the password if the PDF file is password-protected.
* *Pages To Read:* Specify which pages need to be read. If this field is left empty, all pages are read.
* *Text Converter Model:* Select from the available models:
  * Google OCR
  * TesseractOCR (Beta)
  * PaddleOCR (Beta)
  * TextractOCR (Beta)

{% hint style="info" %}
**Note:** PaddleOCR and TesseractOCR are free to use, while Google OCR and TextractOCR require subscriptions. TesseractOCR and PaddleOCR are third-party open-source components and may not provide the highest quality. For the best results, Google OCR is recommended.
{% endhint %}

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/GhoiuCD4d9h5n5wnW2C7/image.png" alt="" width="563"><figcaption></figcaption></figure>

* *Resolution*: There are four resolution options to choose from: Auto, High, Medium, and Low. To learn which resolution best suits your needs, click [here](https://documentation.astera.com/report-model/optical-character-recognition/best-practices-for-ocr-usage#resolution-option-in-astera).

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/0nAGwrqYU3wGmyUFzW3K/image.png" alt="" width="563"><figcaption></figcaption></figure>

* *Force OCR*: When checked, OCR is applied to both digital and scanned files, regardless of their format. When unchecked (the default setting), the system first determines whether an incoming file is scanned or an image, and applies OCR only if the file is detected as scanned or image-based.
* *Split Output:* Check this box to split the text for each page into a separate output record.
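
The behavior of *Pages To Read*, *Force OCR*, and *Split Output* described above can be sketched in Python as follows. This is only an illustration of the documented logic, not the product's implementation: all function names are hypothetical, page selection is assumed to be a list of 1-based page numbers (the actual field syntax is not covered here), and joining pages with a newline when *Split Output* is unchecked is an assumption.

```python
from typing import List, Optional


def should_apply_ocr(force_ocr: bool, is_scanned_or_image: bool) -> bool:
    """Force OCR applies OCR to every file; otherwise only scanned or image-based files get OCR."""
    return force_ocr or is_scanned_or_image


def select_pages(page_texts: List[str], pages_to_read: Optional[List[int]] = None) -> List[str]:
    """Keep only the requested 1-based pages; an empty selection means all pages."""
    if not pages_to_read:
        return list(page_texts)
    return [page_texts[n - 1] for n in pages_to_read if 1 <= n <= len(page_texts)]


def build_records(page_texts: List[str], split_output: bool) -> List[str]:
    """Return one record per page when split_output is set, else a single combined record."""
    if split_output:
        return list(page_texts)
    # Assumption: unsplit output is a single record with pages joined by newlines.
    return ["\n".join(page_texts)]


pages = ["Page one text", "Page two text", "Page three text"]
print(should_apply_ocr(force_ocr=False, is_scanned_or_image=True))
print(build_records(select_pages(pages, [1, 3]), split_output=True))
```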

4. Once you have configured the *Text Converter* object, click *OK*.
5. Right-click on the *Text Converter* object’s header and select *Preview Output* from the context menu.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/pOq8QGUk0sC5HqeehBnk/image.png" alt="" width="391"><figcaption></figcaption></figure>

6. A *Data Preview* window will open, displaying a preview of the extracted text.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/7syegLtn926g8X9rLRlD/image.png" alt=""><figcaption></figcaption></figure>

#### Extract Text from Excel files

1. Specify the *File Path* to the Excel file that needs to be converted.
2. Next, configure the *Excel Converter Options*.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/krYytiniAa9jLCQ58rVD/TextConverter_Excel.png" alt="" width="563"><figcaption></figcaption></figure>

* *Work Sheet Name:* Specify the name of the worksheet that you want to read data from.
* *Space Between Excel Columns:* Specify the amount of space to place between Excel columns in the extracted text.
* *Blank Lines Before End of File:* Specify the number of blank lines that marks the end of the file.
* *Tab Size:* Specify the tab spacing to be used in the extracted text.
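
To make the effect of these options more concrete, here is a minimal Python sketch of one plausible interpretation. It is not the product's implementation: `rows_to_text` and its parameters are hypothetical, and it assumes that column spacing is a number of spaces, that reading stops once the given number of consecutive blank rows is reached, and that tab size controls how tab characters in cells are expanded.

```python
from typing import List


def rows_to_text(rows: List[List[str]], column_spacing: int = 2,
                 blank_lines_before_eof: int = 3, tab_size: int = 4) -> str:
    """Render worksheet rows as plain text under the assumptions stated above."""
    lines: List[str] = []
    blank_run = 0
    for row in rows:
        # Expand tab characters inside cells to the configured tab size.
        cells = [cell.expandtabs(tab_size) for cell in row]
        if not any(cell.strip() for cell in cells):
            blank_run += 1
            if blank_run >= blank_lines_before_eof:
                break  # Treat this run of blank rows as the end of the file.
            lines.append("")
            continue
        blank_run = 0
        lines.append((" " * column_spacing).join(cells))
    # Drop blank lines left at the end of the output.
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines)


sheet = [["Name", "Qty"], ["Pen", "2"], ["", ""], ["", ""], ["Ignored", "x"]]
print(rows_to_text(sheet, column_spacing=2, blank_lines_before_eof=2))
```

With `blank_lines_before_eof=2`, the two consecutive blank rows end the read, so the final row is not included in the output.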

3. Once you have configured the *Text Converter* object, click *OK*.
4. Right-click on the *Text Converter* object’s header and select *Preview Output* from the context menu.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/HuJxaTS8x2tEo8tqBiWd/image.png" alt="" width="391"><figcaption></figcaption></figure>

5. A *Data Preview* window will open, displaying the extracted text from the Excel file.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/YYteQJxMXroMkNwHv9im/TextConverter_ExcelPreview.png" alt=""><figcaption></figcaption></figure>

### Use the Text Converter Object as a Transformation

1. You can also use the Text Converter object as a transformation. To do so, right-click the header of the object and select *Transformation*.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/zXom9uYUTrrXp9AViDA8/image.png" alt="" width="542"><figcaption></figcaption></figure>

2. The header color changes from green to purple, indicating the object's transition from a source to a transformation. An input node is also added alongside the output node.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/6JHOYANPnMyJkVklbwZM/image.png" alt=""><figcaption></figcaption></figure>

3. You can now provide input from any source object directly to the Text Converter object in your dataflow.

<figure><img src="https://content.gitbook.com/content/zEifS4h8yurLAAwiGNX2/blobs/0ONPY1LhLo6fJVcHsXR9/image.png" alt=""><figcaption></figcaption></figure>
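
Conceptually, the transformation mode resembles the following Python sketch, in which each incoming record from a source is converted to text as it flows through. The field names `FilePath` and `Text`, the function `text_converter_transformation`, and the stand-in converter are all assumptions made for illustration; they do not describe the object's actual layout or internals.

```python
from typing import Callable, Dict, Iterable, Iterator


def text_converter_transformation(
    records: Iterable[Dict[str, str]],
    convert: Callable[[str], str],
) -> Iterator[Dict[str, str]]:
    """For each incoming record, convert the referenced file and emit its text."""
    for record in records:
        path = record["FilePath"]  # hypothetical input field name
        yield {"FilePath": path, "Text": convert(path)}


def fake_convert(path: str) -> str:
    """Stand-in for the real conversion; returns a placeholder string."""
    return f"<text extracted from {path}>"


source_records = [{"FilePath": "scan1.pdf"}, {"FilePath": "photo.png"}]
for output in text_converter_transformation(source_records, fake_convert):
    print(output)
```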

This concludes working with the Text Converter object in Astera Data Stack.
