Client Application Processor
Dual Core or greater (recommended); 2.0 GHz or greater
Server Application Processor
8 Cores or greater (recommended)
Repository Database
MS SQL Server 2008R2 or newer, or PostgreSQL v15 or newer for hosting repository database
Operating System - Client
Windows 10 (recommended)
Operating System - Server
Windows: Windows 10 or Windows Server 2012 or newer
Memory
Client: 4GB or greater (recommended)
Server: 8GB or greater (recommended)
32GB or greater for large data processing
32GB or greater for AI processing
Hard Disk Space
Client: 300 MB (if .NET Framework is pre-installed)
Server: 700 MB (if .NET Framework is pre-installed)
Additional 300 MB if .NET Framework is not installed
AI Subscription Requirements
OpenAI API (provided as part of the package)
LLAMA API
Together AI
Postgres Db (if knowledgebase is needed)
Other
Requires ASP.NET Core 8.0.x and Windows Desktop Runtime 8.0.x for the client, and .NET Core Runtime 8.0.x for the server
For advanced features such as AGL, OCR, and Text Converter to work in Astera, additional packages such as Python and Java must be installed. To avoid the tedious process of installing these separately, Astera provides a built-in Install Manager within the tool.
There are two types of packages which are required to be installed:
Prerequisites for Python Server: This package needs to be installed on the server machine only.
Prerequisites for ReportMiner: This package needs to be installed on both the client and server machines.
The packages being installed for AGL are listed as follows:
The packages being installed for OCR are listed as follows:
Python packages
The packages being installed for Python Server are listed as follows:
Python Server executable is installed (comes with all packages necessary for Python Server)
In the following documents, we will look at how to use the Install Manager to install these packages on client and server machines.
Open Astera as an administrator.
Once Astera is open, go to the Tools > Run Install Manager.
The Install Manager welcome window will appear. Click on Next.
If the prerequisite packages are already installed, the Install Manager will inform you about them and give you the option to uninstall or update them as needed.
If the prerequisite packages are not installed, the Install Manager will present you with the option to install them. Check the box next to the prerequisite package, and then click Install.
During the installation, the Install Manager window will display a progress bar showing the installation progress.
You can also cancel the installation at any point if necessary.
Once the installation is complete, the Install Manager will prompt you. Click on Close to exit out of the Install Manager.
The packages for AGL and OCR usage are now installed, and the features are ready to use.
In case the Integration Server is installed on a separate machine, we will need to install the packages for AGL and OCR there as well.
To access the Install Manager on the server machine, open the Start menu and search for “Install Manager for Integration Server”.
Run this Install Manager as an administrator.
The Install Manager welcome window will appear. Click on Next.
If the prerequisite packages are already installed, the Install Manager will inform you about them and give you the option to uninstall or update them as needed.
If the prerequisite packages are not installed, the Install Manager will present you with the option to install them. Check the box next to the prerequisite package, and then click Install.
During the installation, the Install Manager window will display a progress bar showing the installation progress.
You can also cancel the installation at any point if necessary.
Once the installation is complete, the Install Manager will prompt you. Click on Close to exit out of the Install Manager.
This concludes our discussion on how to use the install manager for Astera.
After you have successfully installed Astera client and server applications, open the client application and you will see the Server Connection screen as pictured below.
Enter the Server URI and Port Number to establish the connection.
The server URI will be the IP address of the machine where Astera Integration server is installed.
Server URI: (HTTPS://IP_address)
The default port for the secure connection between the client and the Astera Integration server is 9264.
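For example, if the Astera Integration Server is installed on a machine with the hypothetical IP address 192.168.1.25, the Server URI would be https://192.168.1.25 and the Port Number would be 9264.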
If you have connected to any server recently, you can automatically connect to that server by selecting that server from the Recently Used drop-down list.
Click Connect after you have filled out the information required.
The client will now connect to the selected server. You should be able to see the server listed in the Server Explorer tree when the client application opens.
To open Server Explorer go to Server > Server Explorer or use the keyboard shortcut Ctrl + Alt + E.
Before you can start working with the Astera client, you will have to create a repository and configure the server.
You can connect to different servers right from the Server Explorer window in the Client. Go to the Server Explorer window and click on the Connect to Server icon.
A prompt will appear that will confirm if you want to disconnect from the current Server and establish connection to a different server. Click Yes to proceed.
You will be directed to the Server Connection screen. Enter the required server information (Server URI and Port Number) to connect to the server and click Connect.
If the connection is successfully established, you should be able to see the connected server in the Server Explorer window.
Before you start using the Astera server, a repository must be set up. Astera supports SQL Server and PostgreSQL for building cluster databases, which can then be used for maintaining the repository. The repository is where job logs, job queues, and schedules are kept.
To see these options, go to Server > Configure > Step 1: Build repository database and configure server.
The first step is to point to the SQL Server or PostgreSQL instance where you want to build the repository and provide the credentials to establish the connection.
Go to Server > Configure > Step 1: Build repository database and configure server.
Select SQL Server from the Data Provider drop-down list and provide the credentials for establishing the connection.
From the drop-down list next to the Database option, select the database on the SQL instance where you want to host the repository.
Click Test Connection to test whether the connection is successfully established or not. You should be able to see the following message if the connection is successfully established.
Click OK to exit the test connection window, then click OK again. The following message will appear; select Yes to proceed.
The repository is now set up and configured with the server to be used.
The next step is to log in using your credentials.
Go to Server > Configure > Step 1: Build repository database and configure server.
Select PostgreSQL from the Data Provider drop-down list and provide the credentials for establishing the connection.
From the drop-down list next to the Database option, select the database on the PostgreSQL instance where you want to host the repository.
Click Test Connection to test whether the connection is successfully established or not. You should be able to see the following message if the connection is successfully established.
Click OK and the following message will appear. Select Yes to proceed.
The repository is now set up and configured with the server to be used.
The next step is to log in using your credentials.
In this section we will discuss how to install and configure Astera Server and Client applications.
Run ‘IntegrationServer.exe’ from the installation package to start the server installation setup.
The Astera Software License Agreement window will appear; check the I agree to the license terms and conditions checkbox, then click Install.
Run the ‘ReportMiner’ application from the installation package to start the client installation setup.
The Astera Software License Agreement window will appear; check the I agree to the license terms and conditions checkbox, then click Install.
When the installation is successfully completed, click Close.
Your server installation is now complete. If you want to use advanced features such as OCR, Text Converter, etc., click on Install Python Server and follow the steps.
The license key provided to you contains information about how many Astera clients can connect to a single server as well as the functionality available to the connected clients.
After you have configured the server, and logged in with the admin credentials, the last step is to insert your license key.
Go to Server > Configure > Step 4: Enter License Key.
On the License Management window, click on Unlock using a key.
Enter the details to unlock Astera – Name, Organization, and Product Key and select Unlock.
You’ll be shown the message that your license has been successfully activated.
Your client is now activated. To check your license status, you can go to Tools > Manage Server License.
This opens a window containing information about your license.
This concludes unlocking Astera client and server applications using a single licensing key.
ReportMiner 11.1 marks our step into the realm of AI, empowering users to harness the full potential of data extraction and processing with newly added AI capabilities. From user interface improvements to advanced AI-powered features, this release sets a new standard for template-less data extraction. Template-less data extraction uses AI and machine learning to pull data from documents without the need to use pre-configured templates, making it easier to handle different and unorganized document layouts.
ReportMiner 11.1 sets a new benchmark in efficiency and usability, seamlessly integrating AI into your data workflows.
Elevate your data journey with ReportMiner 11.1 – where performance, visibility, and efficiency converge effortlessly, all within an intuitive drag-and-drop interface.
LLM Generate is a core component of Astera’s AI capabilities, enabling the creation of AI-powered solutions when combined with other objects on the flow, including sources, transformations, and destinations. It retrieves an output from a Large Language Model based on a user-defined input prompt, with support for various LLM providers such as OpenAI, Llama, and custom models.
The object features:
Input Port: Maps fields to be included in the prompt for the LLM model.
Output Port: Contains the generated result from the LLM model.
LLM Generate’s flexibility in processing input and generating output through natural language instructions makes it a versatile and powerful tool in your data pipeline. Numerous use cases are made possible thanks to this new object, which will be explored in the product documentation.
The Text Converter object is another addition to the dataflow’s toolbox, enabling users to extract text from various file formats, including documents, images, and scanned files. It enhances text extraction performance using Optical Character Recognition (OCR) technology. Currently, the Text Converter supports the Google OCR, PaddleOCR (Beta), TesseractOCR (Beta), and TextractOCR (Beta) platforms.
Key conversion features include:
Document to Text: Extract text from PDFs, Doc/DocX, and TXT files.
Image to Text: Use OCR to extract text from image formats such as JPG, PNG, and JPEG.
HTML to Text: Extract text from HTML, HTM, and XHTML files.
Markdown to Text: Extract text from MD, MARKDOWN, MKD, MKDN, MDWN, and MDOWN files.
Excel to Text: Extract text from XLS, XLSX, and CSV files.
Astera ReportMiner's Auto-Generate Layout (AGL) feature uses AI to automatically identify data regions and fields in your source document, making it easier to create layouts in different document types and extract data.
Normally, creating a template can take over 10 minutes, but with the AGL feature, it can take as little as 5 seconds. This feature also checks the extracted data for accuracy, identifying any errors or issues that may require the user’s review.
This helps speed up data extraction with less manual work involved, allowing the user to focus on other tasks.
To enable advanced extraction features, such as the Text Converter, the Python server is required. As part of the installation, the Python server is embedded within the integration server in v11.1, and it runs seamlessly as a component within the server. Python server activation is required to utilize the Text Converter object within the tool.
The Install Manager now includes the Python Server installation as well for features like Text Converter and more. The Python Server must be installed on the same machine as the Integration Server, and the Install Manager will launch automatically if selected during the installation. If the user doesn't open the install manager upon server installation (via Wix installer), they can launch it from the Start menu later on.
With the WiX installer, users can customize their installation by choosing the installation directory and modifying the service port for the server installer, offering greater flexibility during setup compared to the previous versions. The v11.1 installation is designed to run alongside any pre-11 versions, so you can test your existing flows on either version side-by-side while transitioning to the new version. While we recommend setting up a new repository database for your new 11.1 installation, upgrading an existing repository is also supported.
This concludes ReportMiner 11.1 release notes.
In some cases, it may be necessary to supply a license key without prompting the end user to do so. For example, in a scenario where the end user does not have access to install software, a systems administrator may do this as part of a script.
One possible solution is to place the license key in a text file. This way, the administrator can easily license each machine without having to go through the licensing prompt for each user.
Here’s a step-by-step guide to supplying a license key without prompting the user:
To get started, create a new text document that will hold the license key required to access the application.
In the text document, enter a valid license key. The key must be the only thing in the document, and it must be on the very first line. Make sure there are no unnecessary leading or trailing spaces, lines, or any characters other than those of the license key.
Name the text document “Serial” and save it in the Integration Server Folder of the application located in Program Files on your PC. For instance, if the application is Astera, save the Text Document in the “Astera Integration Server 10” folder. This folder contains the files and settings for the server application. The directory path would be as follows:
C:\Program Files\Astera Software\Astera Integration Server 10.
Finally, restart the Astera Integration Server 10 service to complete the process. This step ensures that, from now on, when the user launches Astera or any other application by Astera, they will not be prompted to enter a license key.
Also, please keep in mind that all license restrictions are still in effect, and this process only bypasses the user prompt for the key.
In conclusion, by following these simple steps, system administrators can easily supply a license key without prompting the end user. This approach is particularly useful when installing software remotely or when licensing multiple machines.
The Python server is embedded in the Astera server and is required to use the Text Converter object in the tool. It is disabled by default, and this document will guide us through the process of enabling it.
Launch the client and navigate to Server > Manage > Server Properties.
In the Server Properties, check the Start Python Server checkbox and press Ctrl + S to save your changes.
Now open the Start Menu > Services and restart the service of Astera Integration Server 11.1.
After restarting the service, wait for a few minutes and run cmd as administrator.
Type the following command in the Command Prompt to check whether your Python server is running:
netstat -ano | findstr :5001
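If the Python server is listening on port 5001, the output should contain a line similar to the one below (the process ID at the end will differ on your machine):
TCP    0.0.0.0:5001    0.0.0.0:0    LISTENING    4372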
We’ve successfully enabled the Python server. You can now close this window and use any Python server-dependent features in the Client.
Astera Data Stack is built on a client-server architecture. The client is the part of the application which a user can run locally on their machine, whereas the server performs processing and querying requested by the client. In simple words, the client sends a request to the server, and the server, in turn, responds to the request. Therefore, database drivers are installed only on the Astera Data Stack server. This enables horizontal scaling by adding multiple clients to an existing cluster of servers and eliminating the need to install drivers on every machine.
The Astera client and server applications communicate using REST architecture. REST-compliant systems, often called RESTful systems, are characterized by statelessness and a separation of concerns between the client and server, which means that both can be implemented independently as long as each side knows what format of messages to send to the other. The server communicates with the client using HTTPS commands, which are encrypted using a certified key/certificate signed by an authority. This protects the data from being intercepted by an attacker, as the plaintext is encrypted into a random string of characters.
Once you have created the repository and configured the server, the next step is to log in using your Astera account credentials.
You will not be able to design any dataflows or workflows on the client if you haven’t logged in to your Astera account. The options will be disabled.
Go to Server > Configure > Step 2: Login as admin.
This will direct you to a login screen where you can provide your user credentials.
If you are using Astera 10 for the first time, you can log in using the default credentials as follows:
Username: admin Password: Admin123
After you log in, you will see that the options in the Astera Client are enabled.
You can use these options while your trial period is active. To fully activate the options and the product, you’ll have to enter your license.
If you don’t want Astera to show you the server connection screen every time you run the client application, you can skip that by modifying the settings.
To do that go to Tools > Options > Client Startup and select the Auto Connect to Server option. On enabling the option, Astera will store the server details you entered previously and will use those details to automatically reconnect to the server every time you run the application.
The next step after logging in is to unlock Astera using the License key.
Note: You cannot use your existing set of keys (from version 6 or 7). If you are planning to migrate from version 7 (or earlier) to version 8, 9 or 10, please contact , as you will need a new license key.
This article introduces the role-based access control mechanism in Astera. This means that administrators can grant or restrict access to various users within the organization, based on their role in the entire data management cycle.
In this article, we will look at the user lists and role management features in detail.
Username: admin
Password: Admin123
Once you have logged in, you have the option to create new users, and we recommend doing this as a first step.
To create/register a new user, right-click on the DEFAULT server node in the Server Explorer window and select User List from the context menu.
This will open the Server Browser panel.
Under the Security node, right-click on the User node and select Register User from the context menu.
This will open a new window. You can see quite a few fields required to be filled here to register a new user.
Once the fields have been filled, click Register and a new user will be registered.
Now that a new user is registered, the next step is to assign roles to the user.
Select the user you want to assign the role(s) to and right-click on it. From the context menu, select Edit User Roles.
A new window will open where you can see all the roles available in Astera, whether default or custom created. Since we haven’t created any custom roles, we’ll see the three default roles: Developer, Operator, and Root.
Select the role that you want to assign to the user and click on the arrows in the middle section of the screen. You’ll see that the selected role will get transferred from the All Roles section to the User Roles section.
After you have assigned the roles, click OK and the specific role(s) will be assigned to the user.
Astera lets the admin manage the resources available to any user; the admin can either grant or restrict access to resources.
To edit role resources, right-click on any of the roles and select Edit Role Resources from the context menu.
This will open a new window. Here, you can see four nodes on the left under which resources can be assigned.
The admin can grant a role resources from the Url node and the Cmd node, access to deployments from the REST node, and access to Catalog artifacts from the Catalog node.
Expanding the Url node shows us the following resources:
Expanding the Cmd node will give us the following checkboxes as resources.
If we expand the REST node, we can see a list of available API resources, including endpoints you might have deployed.
Upon expanding the Catalog node, we can see the artifacts that have been added to the Catalog, along with the endpoint permissions that can be granted.
This concludes User Roles and Access Control in Astera Data Stack.
Existing Astera customers can upgrade to the latest version of Astera Data Stack by executing an .exe script that automates the repository update to the latest release. This streamlined approach enhances the efficiency and effectiveness of the upgrade process, ensuring a smoother transition for users.
To start, download and run the latest server and client installers to upgrade the build.
Run the Repository Upgrade Utility to upgrade the repository.
Once it runs, you will see the following prompt.
Click OK and the repository will be upgraded.
Once done, you will be able to view all jobs, schedules, and deployments that you previously worked with in the Job Monitor, Scheduler, and Deployment windows.
This concludes the working of the Repository Upgrade Utility in Astera Data Stack.
To activate Astera on your machine, you need to enter the license key provided with the product. When the license key is entered, the client sends a request to the licensing server to grant the permission to connect. This action can only be performed when the client machine is connected to the internet.
However, Astera provides an alternative method to activate the license offline by providing an activation code to users who request it. Follow the steps given below for offline activation:
1. Go to the menu bar and click on Server > Configure > Step 4: Enter License Key
2. Click on Unlock using a key.
3. Type your Name, Organization and paste the Key provided to you. Then, click Unlock. Do the same if you are changing (instead of activating) your license offline.
4. Another pop-up window will show an error that the licensing server is unavailable, since you cannot connect to it while offline. Click OK.
5. Click on Activate using a key button.
6. Now, copy the Key and the Machine Hash and email them to the support staff. The Machine Hash is unique for every machine. Make sure you send the correct Key and Machine Hash, as they are essential for generating the correct Activation Code.
7. You will receive an activation code from the support staff via e-mail. Paste this code into the Activation Code textbox and click on Activate.
8. You have successfully activated Astera on your machine offline using the activation code. Click OK.
9. A pop-up window will notify you that the client needs to restart for the new license to take effect. Click OK and restart the client.
You have successfully completed the offline activation of Astera.
Once you have logged into the Astera client, you can set up an admin email to access the Astera server. This will also allow you to be able to use the “Forgot Password” option at the time of log in.
In this document, we will discuss how to verify admin email in Astera.
1. Once logged in, we will now proceed to enter an email address to associate with the admin user by verifying the email address.
Go to Server > Configure > Step 3: Verify Admin Email
2. Unless you have already set up an email address in the Mail Setup section of Cluster settings, the following dialogue box will pop up asking you to configure your email settings.
Click on Yes to open your cluster settings.
Click on the Mail Setup tab.
3. Enter your email server settings.
4. Now, right-click on the Cluster Settings active tab and click on Save & Close in order to save the mail setup.
5. Re-visit the Verify Admin Email step by going to Server > Configure > Step 3: Verify Admin Email.
This time, the Configure Email dialogue box will open.
6. Enter the email address you previously set up and click on Send OTP.
7. Use the OTP from the email you received and enter it in the Configure Email dialogue and proceed.
On correct entry of the OTP, an email successfully configured dialogue will appear.
8. Click OK to exit it. We can confirm our email configuration by going to the User List.
Right-click on DEFAULT under Server Connections in the Server Explorer and go to User List.
9. This opens the User List where you can confirm that the email address has been configured with the admin user.
The feature is now configured and can be utilized when needed by clicking on Forgot Password in the log in window.
This opens the Password Reset window, where you can enter the OTP sent to the specified e-mail for the user and proceed to reset your password.
This concludes our discussion on verifying admin email in Astera.
The Text Converter object enables users to extract text from various file formats, including documents, images, and scanned files. It supports Optical Character Recognition (OCR) for enhanced performance in text extraction.
The Text Converter object provides conversion for:
Document to Text: Extract text from PDFs, Doc/Docx and TXT files.
Image to Text: Use Optical Character Recognition (OCR) to extract text from images in JPG, PNG, and JPEG formats.
HTML to Text: Extract text from HTML, HTM, and XHTML files.
Markdown to Text: Extract text from MD, MARKDOWN, MKD, MKDN, MDWN, and MDOWN files.
Excel to Text: Extract text from XLS, XLSX, and CSV files.
In this guide, we will cover how to:
Convert a PDF document to text.
Extract text from PNG images using OCR.
Use the Text Converter object as a transformation.
To get a Text Converter object, go to Toolbox > Sources > Text Converter. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Text Converter object onto the designer.
Configure the object by right-clicking on its header and selecting Properties from the context menu.
A dialog box will open.
This is where you can configure the properties for the Text Converter object.
1. The first step is to specify the File Path to the PDF file that needs to be converted.
Next, define an Output Directory where the converted text file will be stored (optional).
Configure the PDF Converter Options.
PDF Password: Provide the password if the PDF file is password-protected.
Pages To Read: Specify which pages need to be read. Leaving this empty means all pages will be read.
Text Converter Model: Select from the available models:
Google OCR
TesseractOCR (Beta)
PaddleOCR (Beta)
TextractOCR (Beta)
Force OCR: This option applies OCR to both digital and scanned files, regardless of their format. When unchecked (the default setting), the system first determines whether an incoming file is scanned or an image. OCR is applied only if the file is detected as scanned or image based.
Split Output: Check this box to split the text for each page into a separate output record.
Configure the Excel Converter Options.
Work Sheet Name: Specify the name of your worksheet that you want to read data from.
Space Between Excel Columns: Specify the space between the Excel Columns.
Blank Lines Before End of File: Specify the number of blank lines at which the file ends.
Tab Size: Specify the tab spacing to be used in the extracted text.
Once you have configured the Text Converter object, click OK.
Right-click on the Text Converter object’s header and select Preview Output from the context menu.
A Data Preview window will open and will show you the preview of the extracted text.
Extract Text from PNG Images using OCR
The first step is to specify the File Path to the PNG image that needs to be processed for text extraction.
Next, define an Output Directory where the extracted text will be stored. (optional)
Configure the Text Converter Options.
Text Converter Models: Select from the available models:
Google OCR
TesseractOCR
PaddleOCR
TextractOCR
Once you have configured the Text Converter object, click OK.
Right-click on the Text Converter object’s header and select Preview Output from the context menu.
A Data Preview window will open, displaying the extracted text from the PNG image.
We can also use the Text Converter object as a transformation. To do so, right-click on the header of the object and select Transformation.
You’ll see the color of the header change from green to purple, depicting its transition from a source to a transformation. You can also notice an input node being added along with the output node.
You can now provide the input from any source object to the Text Converter object in your dataflow directly.
This concludes working with the Text Converter object in Astera Data Stack.
Resolution: there are four resolution options to choose from: Auto, High, Low, and Medium. To learn more about which resolution would suit your needs best, click .
Astera Data Stack can read data from a wide range of file sources and database providers. In this article, we have compiled a list of file formats, data providers, and web-applications that are supported for use in Astera Data Stack.
Amazon Aurora
Azure SQL Server
MySQL
Amazon Aurora Postgres
Amazon RDS
Amazon Redshift
DB2
Google BigQuery
Google Cloud SQL
MariaDB
Microsoft Azure
Microsoft Dynamics CRM
MongoDB (as a Source)
MS Access
MySQL
Netezza
Oracle
Oracle ODP .Net
Oracle ODP .Net Managed
PostgreSQL
PowerBI
Salesforce (Legacy)
Salesforce Rest
SAP SQL Anywhere
SAP Hana
Snowflake
SQL Server
SQLite
Sybase
Tableau
Teradata
Vertica
In addition, Astera features an ODBC connector that uses the Open Database Connectivity (ODBC) interface by Microsoft to access data in database management systems using SQL as a standard.
COBOL
Delimited files
Fixed length files
XML/JSON
Excel workbooks
PDFs
Report sources
Text files
Microsoft Message Queue
EDI formats (including X12, EDIFACT, HL7)
Microsoft Dynamics CRM
Microsoft Azure Blob Storage
Microsoft SharePoint
Amazon S3 Bucket Storage
Amazon Aurora MySQL
Azure Data Lake Gen 2
PowerBI
Salesforce
SAP
Tableau
AS2
FTP (File Transfer Protocol)
HDFS (Hadoop Distributed File System)
SCP (Secure Copy Protocol)
SFTP (Secure File Transfer Protocol)
SOAP (Simple Object Access Protocol)
REST (REpresentational State Transfer)
Using the SOAP and REST web services connector, you can easily connect to any data source that uses SOAP protocol or can be exposed via REST API.
Here are some applications that you can connect to using the API Client object in Astera Data Stack:
FinancialForce
Force.com Applications
Google Analytics
Google Cloud
Google Drive
Hubspot
IBM DB2 Warehouse
Microsoft Azure
OneDrive
Oracle Cloud
Oracle Eloqua
Oracle Sales Cloud
Oracle Service Cloud
Salesforce Lightning
ServiceMAX
SugarCRM
Veeva CRM
The list is non-exhaustive.
You can also build a custom transformation or connector from the ground up quickly and easily using the Microsoft .NET APIs, and retrieve data from various other sources.
The Database Table Source object provides the functionality to retrieve data from a database table. It also provides change data capture functionality to perform incremental reads, and supports multi-way partitioning, which partitions a database table into multiple chunks and reads these chunks in parallel. This feature brings about major performance benefits for database reads.
The object also enables you to specify a WHERE clause and sort order to control the result set.
In this article, we will be discussing how to:
Get a Database Table Source object on the dataflow designer.
Configure the Database Table Source object according to the required layout and settings.
We will also be discussing some best practices for using a Database Table Source object.
To get a Database Table Source from the Toolbox, go to Toolbox > Sources > Database Table Source. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Database Table Source object onto the designer.
You can see that the dragged source object is empty right now. This is because we have not configured the object yet.
To configure the Database Table Source object, right-click on its header and select Properties from the context menu.
A dialog box will open.
This is where you can configure the properties for the Database Table Source object.
The first step is to specify the Database Connection for the source object.
Provide the required credentials. You can also use the Recently Used drop-down menu to connect to a recently connected database.
You will find a drop-down list next to the Data Provider.
This is where you select the specific database provider to connect to. The connection credentials will vary according to the provider selected.
Click Test Connection to make sure that your database connection is successful, then click Next.
Next, you will see a Pick Source Table and Reading Options window. On this window, you will select the table from the database that you previously connected to and configure the table from the given options.
From the Pick Table field, choose the table that you want to read the data from.
Once you pick a table, an icon will show up beside the Pick Table field.
View Data: You can view data in a separate window in Astera.
View Schema: You can view the schema of your database table from here.
View in Database Browser: You can see the selected table in the Database Source Browser in Astera.
Table Partition Options
This feature substantially improves the performance of large data movement jobs. Partitioning is done by selecting a field and defining value ranges for each partition. At runtime, Astera generates and runs multiple queries against the source table and processes the result set in parallel.
Check the Partition Table for Reading option if you want your table to be read in partitions.
You can specify the Number of Partitions.
The Pick Key for the Partition drop-down will let you choose the key field for partitioning the table.
If you have specific key values based on which you want to partition the table, you can use the Specify Key Values (separated by comma) option.
Next, two checkboxes can be configured according to the user application.
The Favor Centerprise Layout option is useful when your source database table layout has changed over time but the layout built in Astera is static, and you want to continue using your dataflows with the updated source table layout. Check this option and Astera will favor its own layout over the database layout.
Check the Trim Trailing Spaces option if you want to remove the trailing whitespaces.
Dynamic Layout
Checking the Dynamic Layout option enables the two following options, Add Fields in Subsequent Objects, and Delete Fields in Subsequent Objects. These options can also be unchecked by users.
Add Fields in Subsequent Objects: Checking this option ensures that a field is added to subsequent objects in the flow whenever additional fields are added to the source database table.
Delete Fields in Subsequent Objects: Checking this option ensures that a field is deleted from subsequent objects in the flow whenever it is deleted from the source database table.
Incremental Read Options
The Database Table Source object provides incremental read functionality based on the concept of audit fields. Incremental read is one of the three change data capture approaches supported by Astera. Audit fields are fields that are updated when a record is created or modified. Examples of audit fields include created date time, modified date time, and version number.
Incremental read works by keeping track of the highest value for the specified audit field. On the next run, only the records with a value higher than the saved value are retrieved. This feature is useful in situations where two applications need to be kept in sync and the source table maintains audit field values for rows.
Select Full Load if you want to read the entire table.
Select Incremental Load Based on Audit Fields to perform an incremental read. Astera will start reading the records from the last read.
Checking the Perform full load on next run option will override the incremental load on the next run and perform a full load instead.
Use Audit Field to compare when the last read was performed on the dataset.
In File Path, specify the path to the file that will store the incremental transfer information.
The next window is the Layout Builder. In this window you can modify the layout of your database table.
If you want to delete a field from your dataset, click on the serial column of the row that you want to delete. The selected row will be highlighted in blue.
Right-click on the highlighted line; a context menu will appear with the option to Delete.
Selecting Delete will delete the entire row.
The field is now deleted from the layout and will not appear in the output.
If you want to change the position of any field and want to move it below or above another field in the layout, you can do this by selecting the row and using the Move up/Move down keys.
For example: We want to move the Country field right below the Region field. We will select the row and use the Move up key to move the field from the 9th row to the 8th.
After you are done customizing the Layout Builder, click Next. You will be taken to a new window, Where Clause. Here, you can provide a WHERE clause, which will filter the records from your database table.
For instance, you can add a WHERE clause such as Country = 'Mexico' to select all the customers from Mexico in the Customers table.
Your output will be filtered, and only the records that satisfy the WHERE condition will be read by Astera.
Once you have configured the Database Table Source object, click Next.
A new window, Config Parameters will open. Here, you can define parameters for the Database Table Source object.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
Click OK.
You have successfully configured your Database Table Source object. The fields from the source object can now be mapped to other objects in a dataflow.
To get the Database Table Source object from the Data Source Browser, go to View > Data Source > Data Source Browser or press Ctrl + Alt + U.
A new window will open. You can see that the pane is empty right now. This is because we are not connected to any database source yet.
To connect the browser to a database source, go to the Add Data Source icon located at the top left corner of the pane and click on Add Database Connection.
A Database Connection box will open.
This is where you can connect to your database from the browser.
You can either connect to a Recently Used database or create a new connection.
To create a new connection, select your Data Provider from the drop-down list.
The next step is to fill in the required credentials. Also, to ensure that the connection is successfully made, select Test Connection.
Once you test your connection, a dialog box will indicate whether the test was successful or not.
Click OK.
Once you have connected the browser, your Data Source Browser will now have the databases that you have on your server.
Select the database that you want to work with and then choose the table you want to use.
Drag-and-drop Customers table onto the designer in Astera.
If you expand the dropped object, you will see that the layout for the source file is already built. You can even preview the output at this stage.
Right-clicking on the Database Table Source object will also display options for the database table.
Show in Data Source Browser - Will show where the table resides in the database in the Database Browser.
View Table Data - Builds a query and displays all the data from the table.
View Table Schema - Displays the schema of the database table.
Create Table - Creates a table on a database based on the schema.
The ETL and ELT functionality of Astera Data Stack is represented by Dataflows. When you open a new Dataflow, you’re provided with an empty canvas known as the dataflow designer. This is accompanied by a Toolbox that contains an extensive variety of objects, including Sources, Destinations, Transformations, and more.
Using the Toolbox objects and the user-friendly drag-and-drop interface, you can design ETL pipelines from scratch on the Dataflow designer.
The Dataflow Toolbar also consists of various options.
These include:
Undo/Redo: The Dataflow designer supports unlimited Undo and Redo capability. You can quickly Undo/Redo the last action done, or Undo/Redo several actions at once.
Auto Layout Diagram: The Auto Layout feature allows you to arrange objects on the designer, improving its visual representation.
Zoom (%): The Zoom feature helps you adjust the display size of the designer. Additionally, you can select a custom zoom percentage by clicking on the Zoom % input box and typing in your desired value.
Auto-Size All: The Auto-Size All feature resizes all the objects so that all fields of the expanded nodes are visible and the empty area inside each object is cropped out.
Expand All: The Expand All feature expands or enlarges the objects on the designer, improving the visual representation.
Collapse All: The Collapse All feature closes or collapses the objects on the designer, improving the visual representation and reducing clutter.
Use Orthogonal Links: The Use Orthogonal Links feature replaces the links between objects with orthogonal curves instead of straight lines.
Data Quality Mode: Data Quality Mode in Astera enhances Dataflows with advanced profiling and debugging by adding a Messages node to objects. This node captures statistical information such as TotalCount, ErrorCount, and WarningCount.
Safe Mode: The Safe Mode option allows you to study and debug your Dataflows in cases when access to source files or databases is not available. You can open a Dataflow/Subflow and then proceed to debug or understand it after activating Safe Mode.
Show Diagram Overview: This feature opens a Diagram Overview panel, allowing you to get an overview of the whole Dataflow designer.
Link Actions to Create Maps Using AI: The AI Auto-mapper semantically maps fields between different data layouts, automatically linking related fields, for example, "Country" to "Nation."
In the next sections, we will go over the object-wise documentation for the various Sources, Destinations, Transformations, and other objects in the Dataflow Toolbox.
Silent installation refers to the installation of software or applications on a computer system without requiring any user interaction or input. In a silent installation, the installation process occurs in the background, without displaying any user interfaces, prompts, or dialog boxes that would normally require the user to make choices or provide information. This type of installation is particularly useful in scenarios where an administrator or IT professional needs to deploy software across multiple computers or systems efficiently and consistently.
Obtain the installer file you want to install silently. This could be an executable (.exe), Microsoft Installer (MSI), or any other installer format.
For example, in this article we will be using the ReportMiner.exe file to perform the silent installation.
To initiate the silent installation, you'll need to use a command-line interface. Open the Command Prompt as an administrator.
To achieve this, first search for “Command Prompt” in the Windows search bar then right-click the Command Prompt app, and select Run as administrator from the context menu. This will launch the Command Prompt with administrative privileges.
Locate the installation file and open its location in Windows Explorer. Once you have located the file in Windows Explorer, the full path will be displayed in the address bar at the top of the window. The address bar shows the complete path from the drive letter to the file's location.
For example, this file is located at "C:\Users\muhammad.hasham\Desktop\Silent Installation Files" as evident with the full path displayed in the address bar.
Alternatively, you can also right-click the file and select Properties from the context menu. In the Properties Window, you'll find a Location field that displays the full path to the file.
To silently install the file, change your current directory to the folder containing the installer using the Command Prompt. To do so, enter the following command in the Command Prompt:
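The quoted path below is a placeholder for the installer folder identified in the previous step.
cd "<full path to the folder containing the installer>"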
For example:
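cd "C:\Users\muhammad.hasham\Desktop\Silent Installation Files"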
Use the appropriate command to run the installer in silent mode. This command might involve specifying command-line switches that suppress dialogs and prompts.
General File:
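The exact switch depends on the installer technology; as a minimal sketch, assuming the WiX-based installer mentioned in the release notes, the standard quiet switch can be used:
<InstallerName>.exe /quiet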
Example:
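ReportMiner.exe /quiet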
General File:
Example:
To run it in this manner:
The silent installation might take some time. Wait for the installation process to finish. Depending on the software, you might receive an output indicating the progress and success of the installation.
After the installation is complete, verify that the software is installed as expected. You might want to check the installation directory, program shortcuts, or any other relevant indicators.
Use the provided command in the Command Prompt to remove the silently installed file.
General File:
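Again as a sketch assuming a WiX-based installer, the uninstall and quiet switches can be combined:
<InstallerName>.exe /uninstall /quiet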
Example:
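ReportMiner.exe /uninstall /quiet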
This concludes our discussion on Silent Installations.
LLM Generate is the primary object of Astera’s AI offerings. When used in a logical combination with other objects, we can use it to create AI-powered solutions.
LLM Generate allows the user to retrieve an output from an LLM model based on the input prompt. The user can select from a choice of LLM providers, including OpenAI, Llama, and others. The user also has the option to use custom LLM models.
It has an input port and an output port.
Input port allows us to map fields that we want to include in the input prompt to the LLM model to generate the result.
Output port is populated with the result of LLM Generate.
LLM Generate can be used in countless use cases to generate unique applications. Here, we will cover a basic use case, where LLM Generate will be used to create an invoice extraction solution.
The source file is a PDF invoice. In the output, we want the extracted data from the invoice in a structured JSON file. We want to create a flexible extraction solution that can take invoices of various unpredictable layouts and generate the JSON output in a fixed format.
Create a new dataflow. Here we will design our invoice extraction pipeline.
In the output port of the source object, we have the entire content of the PDF file as a single string.
This string can now be mapped to the LLM Generate object as input, along with our instructions in the prompt to generate the output.
To do this, we will drag-and-drop the LLM Generate object from the AI section of the Toolbox onto the dataflow designer.
To use an LLM Generate object, we need to map input field/s and define a prompt. In the Output node we get the response of the LLM model, which we can map to downstream objects in our data pipeline. Other configurations of the LLM Generate are set as default but may be adjusted if required by the use case.
As the first step, we will map our input fields to the LLM Generate object’s input node. We can map any number of input fields as required by our use case. For our use case, we will map a single input field, the invoice text from the Text Converter. This field will have the invoice content as a string. We can rename the input fields, if needed, inside the LLM Generate object.
The next step is to write the prompt that will act as a set of instructions to the LLM for the response that we would like in the output. Go into the properties of the LLM Generate object, right-click on the Prompts node, and select ‘Add Prompt’. You can also use the ‘Add Prompt’ button at the top of the layout window.
A Prompt node will appear containing the Properties and Text fields.
Prompt Properties
Properties are set by default. Clicking the Properties field opens the Prompt options. The default settings are as shown in the image below:
Run Strategy Type: Defines the execution of the object based on the input.
Once Per Item means that the object will run once per input record. This option is used in cases where the input has multiple records and LLM Generate is to be executed for each record. The output of LLM Generate will have the same number of records as the input.
Chain means that the object will use the output of one prompt and feed it as input to the subsequent prompt within the LLM Generate object. To use the output of the last prompt within the current prompt, use the syntax {LLM.LastPrompt.Result}. To use the text of the last prompt within the current prompt, use the syntax {LLM.LastPrompt.Text}.
Conditional Expression: Here, you can provide the condition that must be satisfied for this prompt to be used in the LLM Generate execution. It works in conjunction with multiple prompts, in cases where one of the multiple prompts are to be used based on some criteria.
For our use case, we have used the default settings of the Prompt Properties.
Prompt Text
Prompt text allows us to write the prompt that is sent to the LLM model to get the response in the output.
In the prompt, we can include the contents of the input fields using the syntax:
{Input.field}
In the above syntax, we can provide the input field name in place of field.
We can also use functions to customize our prompt by clicking the functions icon.
For instance, the following syntax will resolve to the first 1000 characters of the input field value in the prompt:
{Left(Input.field,1000)}
For our use case, we will write a prompt that instructs the LLM to extract data from the provided invoice and generate the output in the JSON structure we have provided in the prompt.
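For illustration, a prompt along the following lines could be used (the JSON field names are hypothetical placeholders and should match the layout your downstream JSON Parser expects; the example assumes the mapped input field is named InvoiceText):
Extract the data from the following invoice text and return only valid JSON in this exact structure: {"InvoiceNumber": "", "InvoiceDate": "", "VendorName": "", "Total": ""}. Do not include any text outside the JSON. Invoice text: {Input.InvoiceText}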
Click Next to move to the next screen. This is the LLM Generate Properties screen.
General Options
We can select the AI provider and model for our object. Additionally, we can use LLM models not provided in the list or a custom LLM model by configuring their API connection as part of a shared connection file inside the project. However, custom fine-tuned models are only supported when using the "Open AI" AI provider.
AI SDK Options
AI SDK Options allow fine-tuning the output or behavior of the model.
Evaluation Metrics: Enabling this option introduces three additional fields in the output: OutputTokens, LogProbs, and PerplexityScore.
OutputTokens: The total number of tokens in the result generated by LLM Generate. It can help understand the volume/length of the generated content.
LogProbs: Log-probabilities associated with each generated token. These values represent the likelihood (in logarithmic scale) of a specific token being generated based on the model's understanding of the input and context. It depicts the model’s confidence in generating each token.
To understand Log-Probs better, we have a second flow here that identifies the document type from the available options we have provided in the prompt.
We also want the confidence score of the AI model’s result. For this, we can parse the LogProbs value and calculate its exponential. This linear probability calculation is only possible when there is a single logprob value, which means the output must be set to generate a single token. It is useful for classification cases such as this one, or where a boolean response is expected.
Applying the exponential to the logprob converts it into a linear probability. In the output, we will have the result and the linear probability, or confidence score. A value closer to 1 means higher confidence in the result.
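For example, if the model returns a LogProbs value of -0.05 for its single output token, the linear probability is exp(-0.05) ≈ 0.95, which corresponds to roughly 95% confidence in the classification.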
PerplexityScore: It measures how well a language model predicts a sequence of words. Lower perplexity (closer to 1) means better predictions, while higher perplexity indicates greater uncertainty in predicting the next word.
Max Tokens: Limits the output tokens. Limit is specific to the model. Each model has its max token limit, and we need to set our limit within that threshold.
Temperature: It controls randomness in model predictions. At temperature 0 (default), the model's output is deterministic and consistent, always choosing the most likely token. This also means the model will always produce the same output for the same input. Higher temperatures increase randomness and creativity in the output.
Top P: Also called nucleus sampling, it controls the diversity of the generated output by selecting tokens from a subset of the most likely options. A Top P of 0.1 (the default) means the model will only consider the smallest set of tokens whose cumulative probability is at least 10%. This significantly narrows down the possible token choices, making the output more focused and less random. Increasing the Top P results in less constrained and more creative responses.
For our primary use case of invoice extraction, we are using OpenAI GPT-4 and the default configurations for the other options on the Properties screen to generate the result.
Now, we’ll click OK to complete the configuration of the LLM Generate object. We can preview the output to confirm that we are getting the desired response.
Now we want to write this text output to a JSON file. We will first drag and drop a JSON Parser onto the designer. We will map the output field of the LLM Generate object to the input field of the JSON Parser object.
Open the Properties of the JSON Parser. On the layout screen, we can create our preferred layout or provide a JSON sample to generate the layout automatically. We have copied the same layout we provided in the prompt and pasted it into the ‘Generate Layout by Providing Sample Text’ option in the JSON layout window.
Once the JSON Parser object is configured, drag and drop a JSON Destination object, configure its file path, and map all the fields from the JSON Parser output.
Our dataflow is now configured, and we can run it to create the JSON file for our invoice.
To automate the process for extracting multiple invoices, we will create a workflow. To parameterize the source and destination file paths, we will add and configure a Variables object in our dataflow.
In our workflow, we’ll configure three objects:
Once our workflow is configured, we can run it to extract data and write to JSON files for all of our invoices.
The flexibility of LLM Generate to take any input and manipulate it through natural language instructions to generate the output makes it a dynamic, universal transformation object in a data pipeline. There can be countless use cases for LLM Generate; we will cover some of these in the next documents.
The File System Items Source in Astera Data Stack is used to provide metadata information to a task in a dataflow or workflow. In a dataflow, it can be used in conjunction with a source object, especially in cases where you want to process multiple files through the transformation and loading process.
In a workflow, the File System Items Source object can be used to provide input paths to a subsequent object such as a RunDataflow task.
Let’s see how it works in a dataflow.
Here we have a dataflow that we want to run on multiple source files containing Customer_Data from a fictitious organization. We are going to use the source object as a transformation and provide the location of the source files using a File System Items Source object. The File System Items Source will provide the path to the location where our source files reside, and the source object will pick the source files from that location, one by one, and pass them on for further processing in the dataflow.
Here, we want to sort the data, filter out records of customers from Germany and write the filtered records into a database table. The source data is stored in delimited (.csv) files.
First, change the source object into a Transformation object. This is because the data is stored in multiple delimited files and we want to process all of them in the dataflow. For this, right-click on the source object’s header and click Transformation in the context menu.
You can see that the color of the source object has changed from green to purple, which indicates that the source object has been changed into a transformation object.
Notice that the source object now has two nodes: Input and Output. The Input node has an input mapping port which means that it can take the path to the source file from another object.
Now we will use a File System Items Source object to provide a path to Customer_Data Transformation object. Go to the Sources section in the Toolbox and drag-and-drop the File System Items Source object onto the designer.
If you look at the File System Items Source object, you can see that the layout is pre-populated with fields such as FileName, FileNameWithoutExtension, Extension, FullPath, Directory, ReadOnly, Size, and other attributes of the files.
To configure the properties of the File System Items Source object, right-click on the File System Items Source object’s header and go to Properties.
This will open the File System Properties window.
The first thing you need to do is point the Path to the directory or folder where your source files reside.
You can see a couple of other options on this screen:
Filter: If your specified source location contains multiple files in different formats, you can use this option to filter and read files in the specified format only. For instance, our source folder contains multiple PDF, .txt, .doc, .xls, and .csv files, so we will write “*.csv” in the Filter field to filter and read delimited files only (a short sketch of the equivalent file matching follows this list).
Include items in subdirectories: Check this option if you want to process files present in the sub-directories.
Include Entries for Directories: Check this option if you want to include all items in the specified directory
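For readers who prefer to see the behavior in code, the following Python sketch approximates what the Filter and Include items in subdirectories options do together. The folder path is hypothetical, and Astera performs this matching internally; the sketch only illustrates the idea.

```python
from pathlib import Path

source_dir = Path(r"C:\Data\Customers")  # hypothetical source folder

# Filter = "*.csv" with "Include items in subdirectories" checked:
# rglob searches the folder and all sub-folders; plain glob would search the top level only.
for f in sorted(source_dir.rglob("*.csv")):
    print(f.name, f.suffix, f.stat().st_size)  # FileName, Extension, Size metadata
```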
Once you have specified the Path and other options, click OK.
Now right-click on the File System Items Source object’s header and select Preview Output.
You can see that the File System Items Source object has filtered out delimited files from the specified location and has returned the metadata in the output. You can see the FileName, FileNameWithoutExtension, Extension, FullPath, Directory, and other attributes such as whether the file is ReadOnly, FileSize, LastAccessed, and other details in the output.
Now let’s start mapping. Map the FullPath field from the File System Items Source object to the FullPath field under the Input node in the Customer_Data Transformation object.
Once mapped, when we run the dataflow, the File System Items Source will pass the path to the source files, one by one, to the Customer_Data Transformation object. The Customer_Data Transformation object will read the data from the source file and pass it to the subsequent transformation object to be processed further in the dataflow.
In a workflow, the File System Items Source object can be used to provide input paths to a subsequent task such as a RunDataflow task. Let’s see how this works.
We want to design a workflow to orchestrate the process of extracting customer data stored in delimited files, sorting that data, filtering out records of customers from Germany and loading the filtered records in a database table.
We have already designed a dataflow for the process and have called this dataflow in our workflow using the RunDataflow task object.
We have multiple source files that we want to process in this dataflow. So, we will use a File System Items Source object to provide the path to our source files to the RunDataFlow task. For this, go to the Sources section in the Toolbox and drag-and-drop the File System Items Source onto the designer.
If you look at the File System Items Source, you can see that the layout is pre-populated with fields such as FileName, FileNameWithoutExtension, Extension, FullPath, Directory, ReadOnly, Size, and other attributes of the files. Also, notice the small blue icon with the letter ‘s’ on the header; it indicates that the object is set to run in Singleton mode.
By default, all objects in a workflow are set to execute in Singleton mode. However, since we have multiple files to process in the dataflow, we will set the File System Items Source object to run in loop. For this, right-click on the File System Items Source and click Loop in the context menu.
You can see that the color of the object has changed to purple, and it now has this purple icon over the header which denotes the loop function.
It also has these two mapping ports on the header to map the File System Items Source object to the subsequent action in the workflow. Let’s map it to the RunDataflowTask.
To configure the properties of the File System Items Source, right-click on the File System Items Source object’s header and go to Properties.
This will open the File System Items Source Properties window.
The first thing you need to do is point the Path to the directory or folder where your source files reside.
You can see a couple of other options on this window:
Filter: If your specified source location contains multiple files in different formats, you can use this option to filter and read files in the specified format only. For instance, our source folder contains multiple PDF, .txt, .doc, .xls, and .csv files, so we will write “*.csv” in the Filter field to filter and read delimited files only.
Include items in subdirectories: Check this option if you want to process files present in the sub-directories.
Include Entries for Directories: Check this option if you want to include all items in the specified directory.
Once you have specified the Path and other options, click OK.
Now right-click on the File System Items Source object’s header and click Preview Output.
You can see that the File System Items Source object has filtered out delimited files from the specified location and has returned the metadata in the output. You can see the FileName, FileNameWithoutExtension, Extension, FullPath, Directory, and other attributes such as whether the file is ReadOnly, FileSize, LastAccessed, and other details in the output.
Now let’s start mapping. Map the FullPath field from the File System Items Source object to the FilePath variable in the RunDataflow task.
Once mapped, upon running the dataflow, the File System Items Source object will pass the path to the source files, one by one, to the RunDataflow task. In other words, the File System Items Source acts as a driver to provide source files to the RunDataflow tasks, which will then process them in the dataflow.
When the File System Items Source is set to run in a loop, the dataflow will run ‘n’ number of times, where ‘n’ is the number of files passed by the File System Items Source to the RunDataflow task. For instance, you can see that we have six source files in the specified folder. The File System Items Source object will pass these six files, one by one, to the RunDataflow task to be processed in the dataflow.
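Conceptually, loop mode behaves like the sketch below. The run_dataflow function is a hypothetical stand-in for the RunDataflow task, shown only to illustrate the one-run-per-file behavior; it is not an Astera API.

```python
# Hypothetical stand-in for the RunDataflow task.
def run_dataflow(dataflow_path, parameters):
    print(f"Running {dataflow_path} with {parameters}")

# Paths that the File System Items Source would pass, one per iteration.
source_files = [
    r"C:\Data\Customers_1.csv",
    r"C:\Data\Customers_2.csv",
    r"C:\Data\Customers_3.csv",
]

for path in source_files:  # loop mode: n files -> n dataflow runs
    run_dataflow("Filter_Germany.Df", parameters={"FilePath": path})
```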
This concludes using the File System Items Source object in Astera Data Stack.
The Fixed-Length File Source object in Astera provides a high-speed reader for files containing fixed length records. It supports files with record delimiters as well as files without record delimiters.
In this section, we will cover how to get Fixed Length File Source object on the dataflow designer from the Toolbox.
To get a Fixed Length File Source object from the Toolbox, go to Toolbox > Sources > Fixed Length File Source. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Fixed Length File Source object onto the designer.
You can see that the dragged source object is empty right now. This is because we have not configured the object yet.
To configure the Fixed Length File Source object, right-click on its header and select Properties from the context menu.
When you select the Properties option from the context menu, a dialog box will open.
This is where you configure the properties for Fixed Length File Source object.
The first step is to provide the File Path for the Fixed Length File Source object. By providing the File Path you are building the connectivity to the source dataset.
File Contains Headers
Record Delimiter is specified as
The dialog box has some other configuration options:
If the source file contains a header and you want the Astera source layout to read headers from the source file, check the File Contains Header option.
If you want the file to be read in portions, check the Partition File for Reading option, and Astera will read your file according to the specified Partition Count. For example, a file with 1000 rows and a Partition Count of 2 will be read in two partitions of 500 rows each. This is a back-end process that makes data reading more efficient and helps in processing data faster. It will not have any effect on your output.
The Record Delimiter field allows you to select the delimiter for the records in the source file. The choices available are the carriage-return line-feed combination <CR/LF>, carriage-return <CR>, and line-feed <LF>. You can also type the record delimiter of your choice instead of choosing from the available options.
In case the records do not have a delimiter and you rely on knowing the size of a record, the number in the Record Length field is used to specify the character length for a single record.
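To illustrate the idea, the following sketch splits an undelimited file into records purely by record length; the file name and the 80-character length are assumptions for illustration only.

```python
RECORD_LENGTH = 80  # assumed fixed record length, in characters

with open("orders.dat", "r", encoding="utf-8") as f:
    data = f.read()

# With no record delimiter, records are cut every RECORD_LENGTH characters.
records = [data[i:i + RECORD_LENGTH] for i in range(0, len(data), RECORD_LENGTH)]
print(f"{len(records)} records read")
```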
The Encoding field allows you to choose the encoding scheme for the delimited file from a list of choices. The default value is Unicode (UTF-8)
To define a hierarchical file layout and process the data file as a hierarchical file check the This is a Hierarchical File option. Astera IDE provides extensive user interface capabilities for processing hierarchical structures.
Advanced File Options
In the Header spans over field, give the number of rows that your header takes. Refer to this option when your header spans over multiple rows.
Check the Enforce exact header match option if you want the header to be read as it is.
Check the Column order in file may be different from the layout option, if the field order in your source layout is different from the field order in Astera’s layout.
Check the Column headers in file may be different from the layout option if you want to use alternate header values for your fields. The Layout Builder lets you specify alternate header values for the fields in the layout.
To skip any unwanted rows at the beginning of your file, you can specify the number of records that you want to omit through the Skip initial records option.
Raw text filter
If you do not want to apply any filter and process all records, check the No filter. Process all records option.
If there is a specific value which you want to filter out, you can check the Process if begins with option and specify the value that you want Astera to read from the data, in the provided field.
If there is a specific expression which you want to filter out, you can check the Process if matches this regular expression option and give the expression that you want Astera to read from the data, in the provided field.
String Processing
String processing options come into play when you are reading data from a file system and writing it to a database destination.
Check the Treat empty string as null value option when you have empty cells in the source file and want them to be treated as null values in the database destination that you are writing to; otherwise, Astera will omit them accordingly in the output.
Check the Trim strings option when you want to omit any extra spaces in the field value.
Once you have specified the data reading options on this window, click Next.
The next window is the Length Markers window, where you can place markers to specify the column boundaries in your data.
Using the Length Markers window, you can create the layout of your fixed-length file. To insert a field length marker, you can click in the window at any point. For example, if you want to set the length of a field to contain five characters and the field starts at five, then you need to click at the marker position nine.
If you point your cursor to where the data starts (in this case, next to OrderID) and double-click, Astera will automatically detect the columns and place markers in your data. Blue lines will appear as markers on the detected columns.
You can modify the markers manually. To delete a marker, double-click on the column which has been marked.
In this case we removed the second marker and instead added a marker after CustomerID and EmployeeID.
In this way you can add as many markers as the number of columns/fields there are in the data set.
You can also use the Build from Specs feature to help you build destination fields based on an existing file instead of manually specifying the layout.
After you have built the layout by inserting the field markers, click Next.
The next window is the Layout Builder. On this window, you can modify the layout of your fixed length source file.
If you want to add a new field to your layout, go to the last row of your layout (Name column), which will be blank and double-click on it, and a blinking text cursor will appear. Type in the name of the field you want to add and select subsequent properties for it. A new field will be added to the source layout.
If you want to delete a field from your dataset, click on the serial column of the row that you want to delete. The selected row will be highlighted in blue.
Right-click on the highlighted line, a context menu will open where you will have the option to Delete.
Selecting Delete will delete the entire row.
The field is now deleted from the layout and will not appear in the output.
Other options that the Layout Builder provides are:
After you are done customizing the layout in the Object Builder window, click Next. You will be taken to a new window, Config Parameters. Here, you can define parameters for the Fixed Length File Source.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
Once you have been through all configuration options, click OK.
The FixedLengthFileSource object is now configured.
The Fixed Length File Source object has now been modified from its previous configuration. The new object has all the modifications that we specified in the Layout Builder.
In this case, the modifications that we made were:
Separated the EmployeeID column from the OrderDate column.
Added the CustomerName column.
You have successfully configured your Fixed Length File Source object. The fields from the source object can now be mapped to other objects in a dataflow.
The COBOL File Source object lets you fetch data from a COBOL source file if you have the copybook file available. The data present in this file can then be processed further in the dataflow and written to a destination of your choice.
Expand the Sources section of the Toolbox and select the COBOL Source object.
Drag-and-drop the COBOL Source object onto the dataflow. It will appear like this:
By default, the COBOL Source object is empty.
To configure it according to your requirements, right-click on the object and select Properties from the context menu.
Alternatively, you can open the properties window by double-clicking on the COBOL Source object header.
The following is the properties tab of the COBOL Source object.
File Path: Clicking on this option allows you to define a path to the data file of a COBOL File.
For our use case, we will be using a sample file with an .EBC extension.
Encoding: This drop-down option allows us to select the encoding from multiple options.
In this case, we will be using the IBM EBCDIC (US-Canada) encoding.
Record Delimiter: This allows you to select the kind of delimiter from the drop-down menu.
<CR> (Carriage Return): Moves the cursor to the beginning of the line without advancing to the next line.
<LF> (Line Feed): Moves the cursor down to the next line without returning to the beginning of the line.
<CR><LF>: Does both.
For our use case, we have selected the following.
Copybook: This option allows us to define a path to the schema file of a COBOL File.
For our use case, we are using a file with the .cpy extension.
Next, three checkboxes can be configured according to the user application. There is also a Record Filter Expression field given under these checkboxes.
Ignore Line Numbers at Start of Lines: Check this option when the data file has incremental line numbers at the start of each line; they will be ignored.
Zone Decimal Sign Explicit: Controls whether there is an extra character for the minus sign of a negative integer.
Fields with COMP Usage Store Data in a Nibble: Checking this box will ignore the COMP storage formats in which the data is stored.
COMP formats range from 1 to 6 in COBOL files.
Record Filter Expression: Here, we can add a filter expression that we wish to apply to the records in the COBOL File.
On previewing output, the result will be filtered according to the expression.
Once done with this configuration, click Next, and you will be taken to the next part of the properties tab.
The COBOL Source Layout window lets the user check values which have been read as an input.
Expand the Source node, and you will be able to check each of the values and records that have been selected as an input.
Expanding the nodes further shows the data definition and field details.
Once these values have been checked, click Next.
The Config Parameters window will now open. Here, you can further configure and define parameters for the COBOL Source Object.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
Click Next.
Now, a new window, General Options, will appear.
Here, you can add any Comments that you wish to add. The rest of the options in this window have been disabled for this object.
Once done, click OK.
The COBOL Source object has now been configured. The extracted data can now be transformed and written to various destinations.
This concludes our discussion on the COBOL Source Object and its configuration in Astera Data Stack.
Each source on the dataflow is represented as a source object. You can have any number of sources in the dataflow, and they can feed into zero or more destinations.
The following source types are supported by the dataflow engine:
Flat File Sources:
Tree File Sources:
Database Sources:
Data Model
All sources can be added to the dataflow by picking a source type on the Toolbox and dropping it on the dataflow. File sources can also be added by dragging-and-dropping a file from an Explorer window. Database sources can be drag-and-dropped from the Data Source Browser. For more details on adding sources to the dataflow, see Introducing Dataflows.
Adding a Delimited File Source object allows you to transfer data from a delimited file. An example of what a delimited file source object looks like is shown below.
To configure the properties of a Delimited File Source object after it is added to the dataflow, right-click on its header and select Properties from the context menu.
Adding a Fixed-Length File Source object allows you to transfer data from a fixed-length file. An example of what a Fixed-Length File Source object looks like is shown below.
To configure the properties of a Fixed-Length File Source object after it is added to the dataflow, right-click on its header and select Properties from the context menu.
Adding an Excel Workbook Source object allows you to transfer data from an Excel file. An example of what an Excel Workbook Source object looks like is shown below.
To configure the properties of an Excel Workbook Source object after it is added to the dataflow, right-click on its header and select Properties from the context menu.
Adding a COBOL File Source object allows you to transfer data from a COBOL file. An example of what a COBOL File Source object looks like is shown below.
To configure the properties of a COBOL File Source object after it is added to the dataflow, right-click on its header and select Properties from the context menu.
Adding an XML/JSON File Source object allows you to transfer data from an XML file. An example of what an XML/JSON File Source object looks like is shown below.
To configure the properties of an XML/JSON File Source object after it is added to the dataflow, right-click on its header and select Properties from the context menu. The following properties are available:
General Properties window:
File Path – Specifies the location of the source XML file. Using UNC paths is recommended if running the dataflow on a server.
Schema File Path – Specifies the location of the XSD file controlling the layout of the XML source file.
Optional Record Filter Expression – Allows you to enter an expression to selectively filter incoming records according to your criteria. You can use the Expression Builder to help you create your filter expression. For more information on using Expression Builder, see Expression Builder.
Adding a Database Table Source object allows you to transfer data from a database table. An example of what a Database Table Source object looks like is shown below.
To configure the properties of a Database Table Source object after it is added to the dataflow, right-click on its header and select Properties from the context menu. The following properties are available:
Source Connection window – Allows you to enter the connection information for your source, such as Server Name, Database, and Schema, as well as credentials for connecting to the selected source.
Pick Source Table window:
Select a source table using the Pick Table dropdown.
Select Full Load if you want to read the entire table.
Select Incremental Load Based on Audit Fields to perform an incremental read starting at a record where the previous read left off.
Incremental load based on Audit Fields is built around the concept of Change Data Capture (CDC), a set of reading and writing patterns designed to optimize large-scale data transfers by minimizing database writing in order to improve performance. CDC is implemented in Astera using the Audit Fields pattern, which uses the create time or last update time to determine the records that have been inserted or updated since the last transfer and transfers only those records.
Advantages
Most efficient of the CDC patterns. Only records that were modified since the last transfer are retrieved by the query, thereby putting little stress on the source database and network bandwidth.
Disadvantages
Requires update date time and/or create date time fields to be present and correctly populated
Does not capture deletes
Requires index on the audit field(s) for efficient performance
To use the Audit Fields strategy, select the Audit Field and an optional Alternate Audit Field from the appropriate dropdown menus. Also, specify the path to the file that will store incremental transfer information.
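The Audit Fields pattern itself is simple to reason about. The Python sketch below shows the general idea of persisting a high-water mark between runs; it is not Astera's implementation, and the table, column, and file names are hypothetical.

```python
import json
from pathlib import Path

STATE_FILE = Path("incremental_state.json")  # hypothetical file storing the last transfer point

# Load the high-water mark saved by the previous run (None on the first run).
last_value = json.loads(STATE_FILE.read_text())["last_modified"] if STATE_FILE.exists() else None

# Only rows modified since the last transfer are selected (hypothetical table and audit field).
query = "SELECT * FROM Orders"
if last_value:
    query += f" WHERE ModifiedDtTm > '{last_value}'"
print(query)

# ... run the query and load the records, then persist the new high-water mark,
# e.g. the maximum ModifiedDtTm among the transferred rows.
new_value = "2024-01-31 23:59:59"
STATE_FILE.write_text(json.dumps({"last_modified": new_value}))
```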
Where Clause window:
You can enter an optional SQL expression serving as a filter for the incoming records. The expression should start with the WHERE keyword, followed by the filter you wish to apply.
For example, WHERE CreatedDtTm >= ‘2001/01/05’
General Options window:
The Comments input allows you to enter comments associated with this object.
Adding a SQL Query Source object allows you to transfer data returned by the SQL query. An example of what an SQL Query Source object looks like is shown below.
To configure the properties of a SQL Query Source object after it is added to the dataflow, right-click on its header and select Properties from the context menu. The following properties are available:
Source Connection window – Allows you to enter the connection information for your SQL Query, such as Server Name, Database, and Schema, as well as credentials for connecting to the selected database.
SQL Query Source window:
Enter the SQL expression controlling which records should be returned by this source. The expression should follow SQL syntax conventions for the chosen database provider.
For example, select OrderId, OrderName, CreatedDtTm from Orders.
Source or Destination is a Delimited File
If your source or destination is a Delimited File, you can set the following properties
First Row Contains Header - Check this option if you want the first row of your file to display the column headers. In the case of Source file, this indicates if the source contains headers.
Field Delimiter - Allows you to select the delimiter for the fields from the available choices. You can also type the delimiter of your choice instead of choosing from the available options.
Record Delimiter - Allows you to select the delimiter for the records. The choices available are the carriage-return line-feed combination <CR/LF>, carriage-return <CR>, and line-feed <LF>. You can also type the record delimiter of your choice instead of choosing from the available options. For more information on Record Delimiters, please refer to the Glossary.
Encoding - Allows you to choose the encoding scheme for the delimited file from a list of choices. The default value is Unicode (UTF-8)
Quote Char - Allows you to select the type of quote character to be used in the delimited file. This quote character tells the system to overlook any special characters inside the specified quotation marks. The options available are " and '.
You can also use the Build fields from an existing file feature to help you build destination fields based on an existing file instead of manually typing the layout.
Source or Destination is a Microsoft Excel Worksheet
If the Source and/or the Destination chosen is a Microsoft Excel Worksheet, you can set the following properties:
First Row Contains Header - Check this option if you want the first row of your file to display the column headers. In the case of Source file, this indicates if the source contains headers.
Worksheet - Allows you to select a specific worksheet from the selected Microsoft Excel file.
You can also use the Build fields from an existing file feature to help you build destination fields based on an existing file instead of manually typing the layout.
Source or Destination is a Fixed Length File
If the Source and/or the Destination chosen is a Fixed Length File, you can set the following properties:
First Row Contains Header - Check this option if you want the first row of your file to display the column headers. In the case of Source file, this indicates if the source contains headers.
Record Delimiter - Allows you to select the delimiter for the records. The choices available are the carriage-return line-feed combination <CR/LF>, carriage-return <CR>, and line-feed <LF>. You can also type the record delimiter of your choice instead of choosing from the available options. For more information on Record Delimiters, please refer to the Glossary.
Encoding - Allows you to choose the encoding scheme for the delimited file from a list of choices. The default value is Unicode (UTF-8)
You can also use the Build fields from an existing file feature to help you build destination fields based on an existing file instead of manually typing the layout.
Using the Length Markers window, you can create the layout of your fixed-length file. The Length Markers window has a ruled marker placed at the top of the window. To insert a field length marker, you can click in the window at a particular point. For example, if you want to set the length of a field to contain five characters and the field starts at five, then you need to click at the marker position nine.
In case the records don’t have a delimiter and you rely on knowing the size of a record, the number in the RecordLength box is used to specify the character length for a single record.
You can delete a field length marker by clicking the marker.
Source or Destination is an XML file
If the source is an XML file, you can set the following options:
Source File Path specifies the file path of the source XML file.
Schema File Path specifies the file path of the XML schema (XSD file) that applies to the selected source XML file.
Record Filter Expression allows you to optionally specify an expression used as a filter for incoming source records from the selected source XML file. The filter can refer to a field or fields inside any node inside the XML hierarchy.
The following options are available for destination XML files.
Destination File Path specifies the file path of the destination XML file.
Encoding - Allows you to choose the encoding scheme for the XML file from a list of choices. The default value is Unicode (UTF-8).
Format XML Output instructs Astera to add line breaks to the destination XML file for improved readability.
Read From Schema File specifies the file path of the XML schema (XSD file) that will be used to generate the destination XML file.
Root Element specifies the root element from the list of the available elements in the selected schema file.
Generate Destination XML Schema Based on Source Layout creates the destination XML layout to mirror the layout of the source.
Root Element specifies the name of the root element for the destination XML file.
Generate Fields as XML Attributes specifies that fields will be written as XML attributes (as opposed to XML elements) in the destination XML file.
Record Node specifies the name of the node that will contain each record transferred.
Note: To ensure that your dataflow is runnable on a remote server, please avoid using local paths for the source. Using UNC paths is recommended.
When importing from a fixed-width, delimited, or Excel file, you can specify the following advanced reading options:
Header Spans x Rows - If your source file has a header that spans more than 1 row, select the number of rows for the header using this control.
Skip Initial Records - Sets the number of records which you want skipped at the beginning of the file. This option can be set whether or not your source file has a header. If your source file has a header, the first record after the specified number of rows to skip will be used as the header row.
Raw Text Filter - Only records starting with the filter string will be imported. The rest of the records will be filtered out.
You can optionally use regular expressions to specify your filter. For example, the regular expression ^[12][4] will only include records starting with 1 or 2, and whose second character is 4.
Raw Text Filter setting is not available for Excel source files.
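As a quick illustration of the regular-expression filter described above, the sketch below applies the same pattern to a few made-up records.

```python
import re

# Keep records whose first character is 1 or 2 and whose second character is 4.
raw_filter = re.compile(r"^[12][4]")

records = ["14023,Smith", "24987,Jones", "34011,Brown", "15500,Lee"]
kept = [r for r in records if raw_filter.match(r)]
print(kept)  # ['14023,Smith', '24987,Jones']
```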
If your source is a fixed-length file, delimited file, or Excel spreadsheet, it may contain an optional header row. A header row is the first record in the file that specifies field names and, in the case of a fixed-length file, the positioning of fields in the record.
If your source file has a header row, you can specify how you want the system to handle differences between your actual source file and the source layout specified in the setting. Differences may arise because the source file has a different field order than the source layout, or it has extra fields compared to the source layout. Conversely, the source file may have fewer fields than what is defined in the source layout, and the field names may also differ, or may have changed since the time the layout was created.
By selecting from the available options, you can have Astera handle those differences exactly as required by your situation. These options are described in more detail below:
Enforce exact header match – Lets Astera Data Stack proceed with the transfer only if the source file’s layout matches the source layout defined in the setting exactly. This includes checking for the same number and order of fields and field names.
Columns order in file may be different from the layout – Lets Astera Data Stack ignore the sequence of fields in the source file, and match them to the source layout using the field names.
Column headers in file may be different from the layout – This mode is used by default whenever the source file does not have a header row. You can also enable it manually if you want to match the first field in the layout with the first field in the source file, the second field in the layout with the second field in the source file, and so on. This option will match the fields using their order as described above even if the field names are not matched successfully. We recommend that you use this mode only if you are sure that the source file has the same field sequence as what is defined in the source layout.
The Field Layout window is available in the properties of most objects on the dataflow to help you specify the fields making up the object. The table below explains the attributes you can set in the Field Layout window.
The table below provides a list of all the attributes available for a particular layout type.
Astera supports a variety of formats for each data type. For example, for Dates, you can specify the date as “April 12” or “12-Apr-08”. Data Formats can be configured independently for source and for destination, giving you the flexibility to correctly read source data and change its format as it is transferred to destination.
If you are transferring from a flat file (for example, Delimited or Fixed-Width), you can specify the format of a field so that the system can correctly read the data from that field.
If you do not specify a data format, the system will try to guess the correct format for the field. For example, Astera is able to correctly interpret any of the following as a Date:
April 12
12-Apr-08
04-12-2008
Saturday, 12 April 2008
and so on
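For illustration only, the sketch below parses a few of these representations using Python's datetime module; the strptime codes shown are Python's, not Astera's format strings.

```python
from datetime import datetime

# The same calendar date written in several formats, with a matching parse pattern for each.
samples = {
    "12-Apr-08": "%d-%b-%y",
    "04-12-2008": "%m-%d-%Y",
    "2008-04-12": "%Y-%m-%d",
}
for text, fmt in samples.items():
    print(text, "->", datetime.strptime(text, fmt).date())  # all resolve to 2008-04-12
```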
Astera comes with a variety of pre-configured formats for each supported data type. These formats are listed in the Sample Formats section below. You can also create and save your own data formats.
To select a data format for a source field, go to Source Fields and expand the Format dropdown menu next to the appropriate field.
Sample Formats
Dates:
Booleans:
Integers:
Real Numbers:
Numeric Format Specifiers:
Some possible layouts of input invoice:
To read unstructured invoice as a source in our pipeline, we can drag and drop the object from the Sources section in the toolbox. Configure it by providing the path of your source file.
To understand these options better click .
Provide the folder path where all of our invoices are stored.
To generate an output JSON file path for each invoice using the source file name.
Provide the file path for the pre-configured dataflow.
Check the This is a COBOL data file option if you are working with COBOL files and do not have COBOL copybooks; you can still import this data by visually marking fields in the layout builder and specifying field data types. For more advanced parsing of COBOL files, you can use Astera’s .
Check the Use SmartMatch with Synonym Dictionary option when the header values vary in the source layout and Astera’s layout. You can create a file to store the values for alternate headers. You can also use the Synonym Dictionary file to facilitate automapping between objects on the flow diagram that use alternate names in field layouts.
Note: To open the source file for editing in a new tab, click the icon next to the File Path input, and select Edit File.
To generate the schema, click the icon next to the Schema File Path input, and select Generate.
To edit an existing schema, click the icon next to the Schema File Path input, and select Edit File. The schema will open for editing in a new tab.
Note: Astera makes it possible to generate an XSD file from the layout of the selected source XML file. This feature is useful when you don’t have the XSD file available. Note that all fields are assigned the data type of String in the generated schema. To use this feature, expand the control and select Generate.
To open the Data Formats window, click the icon located in the Toolbar at the top of the designer.
Column Name - Description
Data Type - Specifies the data type of a field, such as Integer, Real, String, Date, or Boolean.
Start Position - Specifies the position from where that column/field starts.
Length - Defines the length of a column/field.
Alignment - Specifies the alignment of the values in a column/field. The options provided are right, left, and center.
Allows Null - Controls whether the field allows blank or NULL values in it.
Expressions - Defines functions through expressions for any field in your data.
Name - The system pre-fills this item for you based on the field header. Field names do not allow spaces. Field names are used to refer to the fields in the Expression Builder or tools where a field is used in a calculation formula.
Header - Represents the field name specified in the header row of the file. Field headers may contain spaces.
Data Type - Specifies the data type of a field, such as Integer, Real, String, Date, or Boolean.
Format - Specifies the format of the values stored in that field, depending on the field’s data type. For example, for dates you can choose between DD-MM-YY, YYYY-MM-DD, or other available formats.
Start Position - Specifies the position of the field’s first character relative to the beginning of the record. Note: This option is only available for the fixed length layout type.
Length - Specifies the maximum number of characters allotted for a value in the field. The actual value may be shorter than what is allowed by the Length attribute. Note: This option is only available for fixed length and database layout types.
Column Name - Specifies the column name of the database table. Note: This option is only available in database layout.
DB Type - Specifies the database-specific data type that the system assigns to the field based on the field's data. Each database (Oracle, SQL, Sybase, etc.) has its own DB types. For example, Long is only available in Oracle for data type string. Note: This option is only available in database layout.
Decimal Places - Specifies the number of decimal places for a data type specified as real. Note: This option is only available in database layout.
Allows Null - Controls whether the field allows blank or NULL values in it.
Default Value - Specifies the value that is assigned to the field in any one of the following cases: the source field does not have a value, the field is not found in the source layout, or the destination field is not mapped to a source field. Note: This option is only available in destination layout.
Sequence - Represents the column order in the source file. You can change the column order of the data being imported by simply changing the number in the sequence field. The other fields in the layout will then be reordered accordingly.
Description - Contains information about the field to help you remember its purpose.
Alignment - Specifies the positioning of the field’s value relative to the start position of the field. Available alignment modes are LEFT, CENTER, and RIGHT. Note: This option is only available for the fixed length layout type.
Primary Key - Denotes the primary key field (or part of a composite primary key) for the table. Note: This option is only available in database layout.
System Generated - Indicates that the field will be automatically assigned an increasing Integer number during the transfer. Note: This option is only available in database layout.
Source Delimited file and Excel worksheet - Name, Header, Data type, Format
Source Fixed Length file - Name, Header, Data type, Format, Start position, Length
Source Database Table and SQL query - Column name, Name, Data type, DB type, Length, Decimal places, Allows null
Destination Delimited file and Excel worksheet - Name, Header, Data type, Format, Allows null, Default value
Destination Fixed Length file - Sequence, Name, Header, Description, Data type, Format, Start position, Length, Allows null, Default value, Alignment
Destination Database Table - Column name, Name, Data type, DB type, Length, Decimal places, Allows null, Primary key, System generated
Format → Sample
dd-MMM-yyyy → 12-Apr-2008
yyyy-MM-dd → 2008-04-12
dd-MM-yy → 12-04-08
MM-dd-yyyy → 04-12-2008
MM/dd/yyyy → 04/12/2008
MM/dd/yy → 04/12/08
dd-MMM-yy → 12-Apr-08
M → April 12
D → 12 April 2008
mm-dd-yyyy hh:mm:ss tt → 04-12-2008 11:04:53 PM
M/d/yyyy hh:mm:ss tt → 4/12/2008 11:04:53 PM
Y/N → Y/N
1/0 → 1/0
T/F → T/F
True/False → True/False
###### → 123456
#### → 1234
####;0;(####) → -1234
.##%;0;(.##%) → 123456789000%
.##%;(.##%) → 1234567800%
$###,###,###,### → $1,234,567,890,000
$###,###,###,##0 → $1,234,567,890,000
###,### → 123450
#,# → 1,000
##.00 → 35
###,###.## → 12,345.67
##.## → 12.34
$###,###,###,### → $1,234,567,890,000
$###,###,###,##0 → $1,234,567,890,000
.##%;(.##%); → .1234567800%
.##%;0;(.##%) → .12345678900%
0 (Zero placeholder) - If the value being formatted has a digit in the position where the '0' appears in the format string, then that digit is copied to the result string; otherwise, a '0' appears in the result string. The position of the leftmost '0' before the decimal point and the rightmost '0' after the decimal point determines the range of digits that are always present in the result string. The "00" specifier causes the value to be rounded to the nearest digit preceding the decimal, where rounding away from zero is always used. For example, formatting 34.5 with "00" would result in the value 35.
# (Digit placeholder) - If the value being formatted has a digit in the position where the '#' appears in the format string, then that digit is copied to the result string. Otherwise, nothing is stored in that position in the result string. Note that this specifier never displays the '0' character if it is not a significant digit, even if '0' is the only digit in the string. It will display the '0' character if it is a significant digit in the number being displayed. The "##" format string causes the value to be rounded to the nearest digit preceding the decimal, where rounding away from zero is always used. For example, formatting 34.5 with "##" would result in the value 35.
. (Decimal point) - The first '.' character in the format string determines the location of the decimal separator in the formatted value; any additional '.' characters are ignored.
, (Thousand separator and number scaling) - The ',' character serves as both a thousand separator specifier and a number scaling specifier. Thousand separator specifier: if one or more ',' characters is specified between two digit placeholders (0 or #) that format the integral digits of a number, a group separator character is inserted between each number group in the integral part of the output. Number scaling specifier: if one or more ',' characters is specified immediately to the left of the explicit or implicit decimal point, the number to be formatted is divided by 1000 each time a number scaling specifier occurs. For example, if the string "0,," is used to format the number 100 million, the output is "100".
% (Percentage placeholder) - The presence of a '%' character in a format string causes a number to be multiplied by 100 before it is formatted. The appropriate symbol is inserted in the number itself at the location where the '%' appears in the format string.
E0, E+0, E-0, e0, e+0, e-0 (Scientific notation) - If any of the strings "E", "E+", "E-", "e", "e+", or "e-" are present in the format string and are followed immediately by at least one '0' character, then the number is formatted using scientific notation with an 'E' or 'e' inserted between the number and the exponent. The number of '0' characters following the scientific notation indicator determines the minimum number of digits to output for the exponent. The "E+" and "e+" formats indicate that a sign character (plus or minus) should always precede the exponent. The "E", "E-", "e", or "e-" formats indicate that a sign character should only precede negative exponents.
'ABC', "ABC" (Literal string) - Characters enclosed in single or double quotes are copied to the result string, and do not affect formatting.
; (Section separator) - The ';' character is used to separate sections for positive, negative, and zero numbers in the format string. If there are two sections in the custom format string, the leftmost section defines the formatting of positive and zero numbers, while the rightmost section defines the formatting of negative numbers. If there are three sections, the leftmost section defines the formatting of positive numbers, the middle section defines the formatting of zero numbers, and the rightmost section defines the formatting of negative numbers.
Other (All other characters) - Any other character is copied to the result string, and does not affect formatting.
Transformations are used to perform a variety of operations on data as it moves through the dataflow pipeline. Astera Data Stack provides a full complement of built-in transformations, enabling you to build sophisticated data maps. Astera Data Stack transformations are divided into two types: record level and set level.
Record level transformations are used to create derived values by applying a lookup, function, or expression to fields from a single record. Examples of record level transformations include lookup, expression, and function transformations.
Set level transformations, on the other hand, operate on a group of records and may result in joining, reordering, elimination, or aggregation of records. Set transformations include join, sort, route, distinct, etc. Data sources and destinations are also considered set transformations.
In Astera, data records flow between set transformations. Record level transformations are used to transform or augment individual fields during these movements.
A record level transformation can be connected to only one set transformation. For instance, a lookup or expression cannot receive input from two different set transformations.
Other than transformations that enable combining multiple data sources—such as join, merge, and union—transformations can receive input from only one set transformation. Transformations, however, can receive input from any number of record level transformations as long as all these record level transformations receive input from the same transformation.
Delimited files are one of the most commonly used data sources and are used in a variety of situations. The Delimited File Source object in Astera provides the functionality to read data from a delimited file.
In this article, we will cover how to use a Delimited File Source object.
To get a Delimited File Source object from the Toolbox, go to Toolbox > Sources > Delimited File Source. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Delimited File Source object onto the designer.
You can see that the dragged source object is empty right now. This is because we have not configured the object yet.
To configure the Delimited File Source object, right-click on its header and select Properties from the context menu.
As soon as you have selected the Properties option from the context menu, a dialog box will open.
This is where you can configure the properties for the Delimited File Source object.
The first step is to provide the File Path for the delimited source file. By providing the file path, you are building the connectivity to the source dataset.
File Contains Headers
Record Delimiter is specified as CR/LF:
The dialog box has some other configuration options:
If the source file contains headers, and you want Astera to read headers from the source file, check the File Contains Header option.
If you want your file to be read in portions, upon selecting the Partition File for Reading option, Astera will read your file according to the specified Partition Count. For instance, if a file with 1000 rows has a Partition Count of 2 specified, the file will be read in two partitions of 500 each. This is a back-end process that makes data reading more efficient and helps in processing data faster. This will not have any effect on your output.
The Record Delimiter field allows you to select the delimiter for the records in the fields. The choices available are carriage-return line-feed combination <CR/LF>, carriage-return - CR and line-feed - LF. You can also type the record delimiter of your choice instead of choosing from the available options.
In case the records do not have a delimiter and you rely on knowing the size of a record, the number in the Record Length field can be used to specify the character length for a single record.
The Encoding field allows you to choose the encoding scheme for the delimited file from a list of choices. The default value is Unicode (UTF-8).
A Text Qualifier is a symbol that identifies where text begins and ends. It is used specifically when importing data. For example, when importing a comma-delimited text file (where commas separate the fields that will be placed in adjacent cells), the text qualifier ensures that commas appearing inside a qualified value are not treated as field separators.
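The following Python sketch shows the effect of a double-quote text qualifier on comma-delimited data; the sample record is made up.

```python
import csv
import io

# The second field contains a comma; the double-quote qualifier keeps it as one value.
sample = 'ALFKI,"Alfreds Futterkiste, Ltd.",Germany\n'

reader = csv.reader(io.StringIO(sample), delimiter=",", quotechar='"')
print(next(reader))  # ['ALFKI', 'Alfreds Futterkiste, Ltd.', 'Germany']
```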
To define a hierarchical file layout and process the data file as a hierarchical file, check the This is a Hierarchical File option. Astera IDE provides extensive user interface capabilities for processing hierarchical structures.
Use the Null Text option to specify a certain value that you do not want in your data, and instead want it to be replaced by a null value.
Check the Allow Record Delimiter Inside a Field Text option when you have the record delimiter as text inside your data and want that to be read as it is.
Check the Make All Fields As String In Build Layout option when you want to set all the fields in the layout builder as String.
Dynamic Layout
Checking the Dynamic Layout option enables the two following options, Delete Fields in Subsequent Objects, and Add Fields in Subsequent Objects. These options can also be unchecked by users.
Add Fields in Subsequent Objects: Checking this option ensures that a field is definitely added in subsequent objects in a flow in case additional fields are manually added in the source database by the user or need to be added into the source database.
Delete Fields in Subsequent Objects: Checking this option ensures that a field is definitely deleted from subsequent objects in a flow in case of deletion from the source database.
Advanced File Options
In the Header spans over field, specify the number of rows that your header takes. Refer to this option when your header spans over multiple rows.
Check the Enforce exact header match option if you want the header to be read as it is.
Check the Column order in file may be different from the layout option, if the field order in your source layout is different from the field order in Astera’s layout.
Check the Column headers in file may be different from the layout option if you want to use alternate header values for your fields. The Layout Builder lets you specify alternate header values for the fields in the layout.
To skip any unwanted rows at the beginning of your file, you can specify the number of records that you want to omit through the Skip initial records option.
Raw text filter
If you do not want to apply any filter and process all records, check No filter. Process all records.
If there is a specific value which you want to filter out, you can check the Process if begins with option and give the value that you want Astera to read from the data, in the provided field.
If there is a specific expression which you want to filter out, you can check the Process if matches this regular expression option and give the expression that you want Astera to read from the data, in the provided field.
String Processing
String Processing options come in use when you are reading data from a file system and writing it to a database destination.
Check the Treat empty string as null value option when you have empty cells in the source file and want those to be treated as null objects in the database destination that you are writing to, otherwise Astera will omit those accordingly in the output.
Check the Trim strings option when you want to omit any extra spaces in the field value.
Once you have specified the data reading options on this window, click Next.
The next window is the Layout Builder. On this window, you can modify the layout of the delimited source file.
If you want to add a new field to your layout, go to the last row of your layout (Name column), which will be blank and double-click on it, and a blinking text cursor will appear. Type in the name of the field you want to add and select subsequent properties for it. A new field will be added to the source layout.
If you want to delete a field from your dataset, click on the serial column of the row that you want to delete. The selected row will be highlighted in blue.
Right-click on the highlighted line, a context menu will appear where you will have the option to Delete.
Selecting this option will delete the entire row.
The field is now deleted from the layout and will not appear in the output.
After you are done customizing the layout, click Next. You will be directed to a new window, Config Parameters. Here, you can define parameters for the Delimited File Source object.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
Once you have configured the source object, click OK.
The Delimited File Source object is now configured according to the changes made.
The Delimited File Source object has now been modified from its previous configuration. The new object has all the modifications that were made in the builder.
In this case, the modifications that were made are:
Added the CustomerName column.
Deleted the ShipCountry column.
You have successfully configured your Delimited File Source object. The fields from the source object can now be mapped to other objects in the dataflow.
The Excel File Source object in Astera supports all formats of Excel. In this article, we will be discussing:
Various ways to get the Excel Workbook Source object on the dataflow designer.
Configuring the Excel Workbook Source object according to our required layout and settings.
In this section, we will cover the various ways to get an Excel Workbook Source object on the dataflow designer.
To get an Excel File Source object from the Toolbox, go to Toolbox > Sources > Excel Workbook Source. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Excel Workbook Source object onto the designer.
You can see that the dragged source object is empty right now. This is because we have not configured the object yet. We will discuss the configuration properties for the Excel Workbook Source object in the next section.
If you already have a project defined and Excel source files are a part of that project, you can directly drag-and-drop the Excel file sources from the project tree onto the dataflow designer. The Excel File Source objects in this case will already be configured. Astera Data Stack detects the connectivity and layout information from the source file itself.
To get an Excel File Source object from the Project Explorer, go to the Project Explorer window and expand the project tree.
Select the Excel file you want to bring in as the source and drag-and-drop it on the designer. In this case, we are working with Customers -Excel Source.xls file so we will drag-and-drop it onto the designer.
If you expand the dropped object, you will see that the layout for the source file is already built. You can even preview the output at this stage.
To get an Excel Workbook Source directly from the file location, open the folder containing the Excel file.
Drag-and-drop the Excel file from the folder onto the designer in Astera.
If you expand the dropped object, you will see that the layout for the source file is already built. You can even preview the output at this stage.
To configure the Excel Workbook Source object, right-click on its header and select Properties from the context menu.
As soon as you have selected the Properties option from the context menu, a dialog box will open.
This is where you can configure your properties for the Excel Workbook Source object.
The first step is to provide the File Path for the Excel source. By providing the file path, you are building the connectivity to the source dataset.
The dialog box has some other configuration options:
If your source file contains headers and you want your Astera source layout to read headers from the source file, check the First Row Contains Header box.
If you have blank rows in your file, you can use the Consecutive Blank Rows to Indicate End of File option to specify the number of blank rows that will indicate the end of the file.
Use the Worksheet option to specify if you want to read data from a specific worksheet in your Excel file.
In the Start Address option, you can indicate the cell value from where you want Astera to start reading the data.
Check the Make All Fields As String In Build Layout option when you want to set all the fields in the layout builder as String.
Dynamic Layout
Checking the Dynamic Layout option enables the following two options: Delete Fields in Subsequent Objects and Add Fields in Subsequent Objects. Either option can be unchecked individually.
Add Fields in Subsequent Objects: Checking this option ensures that any fields added to the source are also added to subsequent objects in the flow.
Delete Fields in Subsequent Objects: Checking this option ensures that any fields deleted from the source are also deleted from subsequent objects in the flow.
Advanced File Options
In the Header spans over option, specify the number of rows that your header occupies. Use this option when your header spans multiple rows.
Check the Enforce exact header match option if you want the header to be read as it is.
Check the Column order in file may be different from the layout option if the field order in your source file differs from the field order in the Astera layout.
Check the Column headers in file may be different from the layout option if you want to use alternate header values for your fields. The Layout Builder lets you specify alternate header values for the fields in the layout.
String Processing
String processing options are useful when you are reading data from a file system and writing it to a database destination.
Check the Treat empty string as null value option when you have empty cells in the source file and want them treated as null values in the database destination you are writing to; otherwise, empty cells are written as empty strings in the output.
Check the Trim strings option when you want to omit any extra spaces in the field value.
Once you have specified the data reading options on this screen, click Next.
The next window is the Layout Builder. On this window, you can modify the layout of your Excel source file.
If you want to add a new field to your layout, go to the last row of your layout (Name column) and double-click on it. A blinking text cursor will appear. Type in the name of the field you want to add and select subsequent properties for it. A new field will be added to the source layout.
If you want to delete a field from your dataset, click on the serial column of the row that you want to delete. The selected row will be highlighted in blue.
Right-click on the highlighted line and select Delete from the context menu.
This will delete the entire row from the layout.
If you want to change the position of any field and move it above or below another field in the layout, you can do so by selecting the row and using the Move up/Move down icons.
For example: to move the Country field right below the Region field, we will select the row and use the Move up icon to move this field from the 9th row to the 8th.
Other options that the Layout Builder provides:
Column Name
Description
Alternate Header
Assigns an alternate header value to the field.
Data Type
Specifies the data type of a field, such as Integer, Real, String, Date, or Boolean.
Allows Null
Controls whether the field allows blank or NULL values in it.
Output
The output checkbox allows you to choose whether or not you want to enable data from a particular field to flow through further in the dataflow pipeline.
Calculation
Defines functions through expressions for any field in your data.
After you are done customizing the Object Builder, click Next. You will be taken to a new window, Config Parameters. Here, you can further configure and define parameters for the Excel source.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
Once you have been through all the configuration options, click OK.
The ExcelSource object is now configured according to the changes made.
You have successfully configured your Excel Workbook Source object. The fields from the source object can now be mapped to other objects in the dataflow.
The SQL Query Source object enables you to retrieve data from a database using an SQL query or a stored procedure. You can specify any valid SELECT statement or a stored procedure call as a query. In addition, you can parameterize your queries dynamically, thereby allowing you to change their values at runtime.
In this article, we will be looking at how you can configure the SQL Query Source object and use it to retrieve data in Astera Data Stack.
Before moving on to the actual configuration, we will have to get the SQL Query Source object from the Toolbox onto the dataflow designer. To do so, go to Toolbox > Sources > SQL Query Source. In case you are unable to view the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the SQL Query Source object onto the designer.
The source object is currently empty as we have not configured it yet.
To configure the SQL Query Source object, right-click on its header and select Properties from the context menu. Alternatively, you can double-click the header of the source object.
A new window will pop up when you click on Properties in the context menu.
In this window, we will configure properties for the SQL Query Source object.
On this Database Connection window, enter information for the database you wish to connect to.
Use the Data Provider drop-down list to specify which database provider you want to connect to. The required credentials will vary according to your chosen provider.
Provide the required credentials. Alternatively, use the Recently Used drop-down list to connect to a recently connected database.
Click Test Connection to ensure that you have successfully connected to the database. A separate window will appear, showing whether your test was successful. Close this window by clicking OK, and then click Next.
The next window will present a blank page for you to enter your required SQL query. Here, you can enter any valid SELECT statement or stored procedure to read data from the database you connected to in the previous step.
The curly brackets located on the right side of the window indicate that the use of parameters is supported, which implies that you can replace a regular value with one that is parameterized and can be changed during runtime.
In this example, we will be reading the Orders table from the Northwind database.
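For illustration, a minimal query of this sort could be entered here (Orders and the columns below come from the standard Northwind sample database; any valid SELECT statement or stored procedure call works equally well):

SELECT OrderID, CustomerID, EmployeeID, OrderDate, ShipCountry
FROM Orders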
Once you have entered the SQL query, click Next.
The following window will allow you to check or uncheck certain options that may be utilized while processing the dataset, if needed.
When checked, the Trim Trailing Spaces option will refine the dataset by removing any extra whitespace after the last character on a line. This option is checked by default.
The Dynamic Layout option is unchecked by default. When checked, it will automatically enable two other sub-options.
Delete Field In Subsequent Objects: When checked, fields deleted from the source object will also be deleted from subsequent objects.
Add Fields In Subsequent Objects: When checked, fields added to the source object will also be added to subsequent objects.
Choose your desired options and click Next.
The next window is the Layout Builder. Here, you can modify the layout of the table that is being read from the database. However, these modifications will only persist within Astera and will not apply to the actual database table.
To delete a certain field, right-click on its serial column and select Delete from the context menu. In this example, we have deleted the OrderDate field.
To change the position of a field, click its serial column and use the Move up/Move down icons located in the toolbar of the window. In this example, we have moved up the EmployeeID field using the Move up icon, thus shifting the CustomerID field to the third row. You can move other fields up or down in a similar manner, allowing you to modify the entire order of the fields present in the table.
Once you are done customizing your layout, click Next.
In the Config Parameters window, you can define certain parameters for the SQL Query Source object.
These parameters facilitate an easier deployment of flows by excluding hardcoded values and providing a more convenient method of configuration. If left blank, they will assume the default values that were initially assigned to them.
Enter your desired values for these parameters, if any, and click Next.
Finally, a General Options window will appear. Here, you are provided with:
A text box to add Comments.
A set of General Options that have been disabled.
To conclude the configuration, click OK.
You have successfully configured the SQL Query Source object. The fields are now visible and can be mapped to other objects in the dataflow.
Adding an XML/JSON File Source object to a dataflow allows you to read and transfer data from an XML or a JSON file.
In this section, we will cover how to get an XML/JSON File Source object on the dataflow designer from the Toolbox.
To get an XML/JSON File Source from the Toolbox, go to Toolbox > Sources > XML/JSON File Source. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the XML/JSON File Source object onto the designer.
You can see that the dragged source object is empty right now. This is because we have not configured the object yet.
To configure the XML/JSON File Source object, right-click on the header and select Properties from the context menu.
When you select the Properties option from the context menu, a dialog box will open.
This is where you can configure properties for the XML/JSON File Source object.
The first step is to provide the File Path and Schema Location for the XML/JSON Source object. By providing the file path and schema, you are building the connectivity to the source dataset.
Check the JSON Format checkbox if your source file is a JSON.
Check the Provide Inner XML checkbox to get the XML markup representing only the child nodes of the parent node.
Note: In this case we are going to be using an XML/JSON file with Orders sample data in the parent node and Order Details sample data in the child node.
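To make the note above concrete, a JSON source of roughly the following shape would fit this scenario (a hypothetical sketch of sample data, not a required schema):

{
  "Orders": [
    {
      "OrderID": 10248,
      "CustomerID": "VINET",
      "OrderDetails": [
        { "ProductID": 11, "Quantity": 12, "UnitPrice": 14.0 }
      ]
    }
  ]
}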
Once you have specified the data reading options in this window, click Next.
On the XML Layout window, you can view the layout of your XML/JSON source file.
After you are done viewing the layout, click Next. You will be taken to a new window, Config Parameters. Here, you can define the parameters for the XML/JSON File Source.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
After you have configured the source object, click OK.
You have successfully configured your XML/JSON File Source object. The fields from the source object can now be mapped to other objects in a dataflow.
A PDF Form Source object provides users with the functionality of extracting data from a fillable PDF document. A fillable PDF document consists of certain data points, or digital fields, that a user can edit in any modern PDF viewer. Such forms are often used in place of official paper documents on the web. The PDF Form Source object detects these fields, extracts the written data, and creates corresponding fields for them.
In this article, we will explore how to make use of the PDF Form Source object in Astera to retrieve data.
Select the PDF Form Source object from the Toolbox and drag-and-drop it onto the dataflow designer.
Right-click on the PDF Form Source object’s header and select the Properties option from the context menu.
A configuration window will open, as shown below.
Provide the File Path for the fillable PDF document.
Owner Password: If the file is protected, enter the password configured by the owner of the fillable PDF document. If the file is not protected, this option can be left blank.
Use UTF-8 Encoding: Check this option if the file is encoded in UTF-8 (Unicode Transformation Format, 8-bit).
Click Next.
This is the Layout Builder window, where you can see the data fields extracted from the fillable PDF document. Click Next.
This is the Config Parameters window. Click Next.
This is the General Options window. Click OK.
Right-click on the PDF Form Source object’s header and select Preview Output from the context menu.
View the data through the Data Preview window.
The data is now available for mapping. For simplicity, we will delete the non-required data fields and store the output in a separate file. To store the data, we must write it to a destination file.
We are using a Delimited Destination object. Drag-and-drop the Delimited Destination object onto the dataflow designer and map the fields from the PDF Form Source object to the destination object.
Right-click on the fields that you do not want to store and select the Remove Element option.
Simply double-click or right-click on the Delimited Destination object’s header and select the Properties option from the context menu. Specify the File Path where you want to store the destination file. Click OK.
To preview the data, right-click on the destination object’s header and select Preview Output from the context menu.
Here, you can see data of the selected fields.
This is how a PDF Form Source object is used in Astera Data Stack to extract data points/digital fields from fillable PDF documents.
Report Model extracts data from an unstructured file into a structured file format using an extraction logic. It can be used through the Report Source object inside dataflows in order to leverage the advanced transformation features in Astera Data Stack.
In this section, we will cover how to get the Report Source object onto the dataflow designer from the Toolbox.
To get a Report Source object from the Toolbox, go to Toolbox > Sources > Report Source. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Report Source object onto the designer.
You can see that the dragged source object is empty right now. This is because we have not configured the object yet.
To configure the Report Source object, right-click on its header and select Properties from the context menu.
A configuration window for Report Source will open.
First, provide the File Path of the unstructured file (your report) for which you have created a Report Model.
Then, specify the File Path for the associated Report Model.
Click OK, and the fields added in the extraction model will appear inside the Report Source object with a sub-node, Items_Info, in our case.
Right-click on the Report Source object’s header and select Preview Output from the context menu.
A Data Preview window will open, showing you the data extracted through the Report Model.
The Email Source object in Astera enables users to retrieve data from emails and process the incoming email attachments.
In this section, we will cover how to get the Email Source object onto the dataflow designer from the Toolbox.
To get an Email Source object from the Toolbox, go to Toolbox > Sources > Email Source. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Email Source object onto the designer.
You can see some built-in fields and an Attachments node.
Double-click on the header of the Email Source object to go to the Properties window.
A configuration window for the Email Source object will open. The Email Connection window is where you will specify the connection details.
Url: The address of the mail server on which the connection will be configured.
Login Name: The login name of the user account connecting to the mail server.
Password: Password of the user.
Port: The port of the mail server to connect on. Some examples of SMTP provider ports are 587 for Outlook and 25 for Google.
Connection Logging: Connection logging records different types of messages or events exchanged between the client and the server, which the user can review for errors or debugging.
Astera supports four types of Connection Logging methods:
Verbose: Captures everything.
Debug: Captures only the content that can be used in debugging.
Info: Captures information and general messages.
Error: Captures only the errors.
If you have configured email settings before, you can access the configured settings from the drop-down list next to the Recent option. Otherwise, provide server settings for the mailing platform that you want to use. In this case, we are using an Outlook server.
Test your connection by clicking Test Connection; this will give you the option to send a test mail to the login email.
Click Next. This is the Email Source Properties window. There are two important parts in this window:
Download attachment options
Email reading options
Check the Download Attachments option if you want to download the contents of your email. Specify the directory where you want to save the email attachments in the field next to Directory.
The second part of the Email Source Properties window has the email reading options that you can work with to configure various settings.
Read Unread Only – Check this option if you only want to process unread emails.
Mark Email as Read – Check this option if you want to mark processed emails as read.
Folder – From the drop-down list next to Folder, you can select the specific folder to check, for example, Inbox, Outbox, Sent Items etc.
Filters - You can apply various filters to only process specific emails in the folder.
From Filter: Filters out emails based on the sender’s email address.
Subject Filter: Filters out emails based on the text of the subject line.
Body Filter: Filters out emails based on the body text.
Click OK.
Right-click on the Email Source object’s header and select Preview Output from the context menu.
A Data Preview window will open and will show you the preview of the extracted data.
Notice that the output only contains emails from the email address specified in the Filter section.
Constant Value Transformation returns a single, predefined value for all records in a dataset.
In this example, we have an Excel worksheet containing Employees' data. The data is being written to a database table, which contains an additional field that stores department information. We want to pass a constant value to the Department field. To do this, we will use the Constant Value transformation object in Astera. The Constant Value transformation will be mapped to the Department field to append the name of the department to the final output.
Map the Employees dataset from the source object to the destination table, Employees_Database.
Now, drag-and-drop the Constant Value transformation object from Toolbox > Transformations > Constant Value.
Right-click on the Constant Value transformation object and select Properties.
The Constant Value Map Properties window will now open. Here you will see a Constant Value section where you can write any value to be appended to your output dataset.
In this case, the value will be ‘Marketing’, to specify the department of the employees in the source dataset.
Click Next. A General Options window will now open. Click OK.
General Options window:
This window shares options common to most objects in the dataflow.
Clear Incoming Record Messages
When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors
When this option is checked, records with errors will not be output by the object. When this option is off, records with errors will be output by the object, and a record message will be attached to the record. This record message can then be fed into downstream objects in the dataflow, for example a destination file that will capture record messages, or a log that will capture messages and collect statistics as well.
Your transformation object will now look like this:
Now map the Value field from the Constant Value object to the Department field in the Employees_Database destination object.
Right-click on the Employees_Database destination object and click Preview Output.
Your final output will look like this:
The Department field in the destination has been successfully populated with the specified value for all records in the final output, through the use of a Constant Value transformation object.
Apache Parquet is a columnar storage file format used by Hadoop systems such as Pig, Spark, and Hive. The file format is language-independent and has a binary representation. Parquet is used to efficiently store large datasets and uses the .parquet extension.
The key features of Parquet with respect to Astera Data Stack are:
It offers the option of compression, resulting in a smaller file size after compression.
It encodes the data.
It stores data in a column layout.
In Astera Data Stack, you can use a Parquet file in which the cardinality of the data is maintained, i.e., all columns must contain the same number of values.
Drag and drop the Parquet File Source from the Sources section of the Toolbox onto the dataflow designer.
Right-click on the Parquet File Source object and select Properties from the context menu.
This will open a new window.
Let’s have a look at the options present here.
File Location
File Path: This is where you will provide the path to the .parquet file.
Data Load Option
If you wish to control memory consumption (at the cost of increased read time), the Data Load option can be used to read the file in batches.
Batch Size: This is where the size of each batch is defined.
Advanced File Processing: String Processing
Treat empty string as null value: Checking this option treats every empty string as a null value.
Trim strings: Checking this box trims extra spaces from string values.
Once done, click Next and you will be led to the Layout Builder screen.
The layout will be automatically built. Otherwise, you can build it using the Build Layout from layout spec option at the top of the screen.
Once done, click Next and you will be taken to the Config Parameters screen.
This allows you to further configure and define dynamic parameters for the Parquet source file.
Click Next and you will be taken to the General Options screen.
Here, you can add any comments that you wish to add.
Click OK and the Parquet File Source object will be configured.
You can now map these fields to other objects as part of the dataflow.
The Parquet File Source object supports the following data types:
Integer
Time/Timestamp
Date
String
Float
Real
Decimal
Double
Byte Array
Guid
Base64
Integer96
Image
Hierarchy is not supported.
This concludes our discussion on the definition and configuration of the Parquet File Source object in Astera Data Stack.
The Filter transformation object in Astera is used to filter out records based on some pre-defined rule. The records that match the specified rule are filtered out and can be mapped further in the dataflow whereas the records that do not satisfy the specified condition are omitted. The rule or logic to filter out records from data can either be chosen from an extensive library of built-in functions or you can write one of your own.
Filter transformation is quite similar to Data Quality Rules in its functionality. However, unlike Data Quality Rules, which return an error or warning when the rule condition fails while still passing the record downstream, the Filter transformation will completely filter out any such records. The filtered records, as well as their status will not be accessible to any downstream object on the dataflow, including any type of log.
In this case, we have Customers data for a fictitious organization stored in a Delimited file source. We want to filter out the records in which:
Country = Germany
Contact title = Owner
To filter out these records from our source data, we will use the Filter transformation object and write the relevant expression in the Expression Builder to achieve our desired output. We will then write our filtered output to a Fixed Length destination.
Next, drag and drop the Filter transformation object from the Transformations section of the Toolbox to the designer and map fields from the source object to the Filter transformation object.
Right-click on the Filter transformation object and select Properties.
A Layout Builder window will now open. Here, you can modify fields and the object layout.
Click Next. This will take you to the Filter Transformation Properties window. Here, you can see the following three sections:
Functions: An extensive library of built-in functions organized in various categories. From here, you can select functions according to your requirement.
Expression: The filter expression will appear in this Expression box. You can either write your own expression or choose from the built-in functions library.
Objects: Contains the object layout. You can double click on any element in the layout to write it in the Expression field.
In this example, we want to filter out records of customers with the ContactTitle, ‘Owner’, and Country, ‘Germany’. For this, we will write the following expression in the Expression box:
Country = "Germany" and ContactTitle = "Owner"
After writing the expression, click on the Compile button to check if the expression is correct. If the Compile Status is Successful, the expression is correct. If not, then you need to check and rectify your expression before proceeding to the next window.
Click Next. This will take you to the Config Parameters window where you can further configure and define parameters for the Filter transformation object.
Click Next to proceed to the General Options window. This window consists of options common to most objects in a dataflow, such as:
Clear Incoming Record Messages: When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors: When this option is checked, records with errors will not be outputted by the object. When this option is unchecked, records with errors will be outputted by the object, and a record message will be attached to the record. This record message can then feed into downstream objects in the dataflow, for example a destination file that will capture record messages, or a log that will capture messages, as well as collect statistics.
The Comments input allows you to enter comments associated with this object.
The Filter transformation object has been configured. Click OK.
To preview filtered records, right-click on the Filter transformation object and select Preview Output.
The output would look like this:
You can now write your output to a destination or further transform it by applying some transformation.
Rename your destination object by double-clicking its header. Here, we will rename it as German_Customers.
This concludes using the Filter transformation object in Astera.
The Join transformation object joins records from two record sets. The join functionality is similar to standard SQL joins, but the distinguishing advantage of Astera's implementation is that you can join records from any two sources and not just two database tables.
This article covers how you can use Join transformation in Astera.
Suppose we have two database tables - Customers and Orders, as shown in the screenshot below, and we want to join these two tables.
Let’s see how we can join the two tables using the Join transformation object in Astera:
Drag-and-drop the Join transformation object from the Transformations section in the Toolbox. To open the Toolbox, go to View > Toolbox.
Map the fields from the source objects to the Join transformation object.
To set the properties for the Join transformation, double-click on the object or right-click and select Properties.
The first window is the Layout Builder window. You can manage the layout for your transformation (add or remove fields) from this window. Click Next to go to the next window.
The next window is the Relation Join Transformation Properties window. Select the Join Type from the drop-down menu. Astera supports four types of joins:
Inner Join – Joins records from two record sets based on matching values in key fields. Any unmatched records are discarded.
Left Outer Join – Similar to Inner Join, but unmatched records from the left record set (also called ‘first record set’) are preserved, and null values are written for the unmatched record in the right record set (also called ‘second record set’).
Right Outer Join – Similar to Inner Join, but unmatched records from the right record set (also called ‘second record set’) are preserved, and null values are written for the unmatched record in the left record set (also called ‘first record set’).
Full Outer Join – Similar to Inner Join, but unmatched records from either record set are preserved, and null values are written for the unmatched record in the other record set.
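As an analogy only (Astera can join any two record sets, not just database tables), a Left Outer Join of the Customers and Orders tables from this example corresponds roughly to the following SQL, assuming CustomerID is the key field in both record sets:

SELECT c.CustomerID, c.CompanyName, o.OrderID, o.OrderDate
FROM Customers c
LEFT OUTER JOIN Orders o ON c.CustomerID = o.CustomerID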
Other options in this window:
Join in Database: Check this option if you want to join the tables in the database.
Case Sensitivity: Check this option if you want a case sensitive match of the values in the key fields.
Sort (Left/Right) Input: Specify whether the left input, the right input, or both, need to be sorted.
Select the key fields from the Left Field and Right Field drop-down lists. Click Next, then OK.
You can now preview the output and see the consolidated data.
This window consists of options common to most objects in a dataflow.
Clear Incoming Record Messages: When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors: When this option is checked, records with errors will not be outputted by the object. When this option is unchecked, records with errors will be outputted by the object, and a record message will be attached to the record. This record message can then feed into downstream objects in the dataflow, for example a destination file that will capture record messages, or a log that will capture messages, as well as collect statistics.
The Comments input allows you to enter comments associated with this object.
The List Lookup transformation object is a type of lookup that stores information in the metadata, which means that your lookup data is stored in the dataflow itself. List Lookup uses a list of values for both the input and output fields. You can use it to look up certain values in your source data and replace them with other desired information. Alternatively, you can define a list of values in the lookup grid in the properties, and the value is then looked up in the grid when you run your dataflow.
Let’s see how this object functions in Astera.
In this example, we are working with a Customers Fixed-Length File Source that contains customer information for a fictitious organization. The Customers data contains information about customers belonging to different countries. We want to convert the country names in this data into CountryCodes by using the List Lookup transformation object.
To preview the incoming data, right-click on the source object’s header and select Preview Output.
To start working, drag-and-drop the List Lookup object from Toolbox>Transformations>List Lookup.
This is what a List Lookup object looks like:
Map the field from the source dataset you want to look up values for, to the Value field in the List Lookup object.
Now, right-click on the List Lookup object and select Properties from the context menu. The List Lookup Map Properties window will now open.
Here, the first option is the Case Sensitive Lookup checkbox, which is checked by default. When this option is checked, the List Lookup will look up values on a case-sensitive basis. If you do not want to perform a case-sensitive lookup, you can uncheck this option.
Next, you can see that there is a table where we can specify the Source Value and the Destination Value. Source Values are the values from your source data, and Destination Values are the values you want to replace the source values with.
For example, if we write the Destination Value as 'DE' against the Source Value 'Germany', Astera will write 'DE' in place of 'Germany' in the output.
This is one way of specifying the lookup values. However, there can be a lot of source values, and typing them in manually can be a tedious task. There is a more efficient way of doing this in Astera.
If you right-click on the List Lookup object, you can see that there is an option called Fill Lookup List with Unique Input Values.
Selecting this option prepopulates the source values in the Source Value column with unique source values.
Now, all you have to do is type in the Destination Values, that is, the codes corresponding to each country name.
Once you have populated the lookup list, click Next to proceed to the Lookup Options window.
In case the lookup field does not return any value for a given source value, one of the following options should be selected:
No Message – Will not mark the unmatched source value as an error or warning
Add Error – The List Lookup table will trigger an error for the records that found no match in the lookup field
Add Warning – The List Lookup will generate a warning and return a null value for records from the source that do not have any matches in the lookup table
Additionally, when the value is not found in the lookup list, you can choose from the following options to assign it a value:
Assign Source Value – Will return the original value from the source.
Assign Null – Will return a null value for each unmatched source record.
This Value – You can type in a specific value in the given field, and the List Lookup will return the same value for each unmatched source value.
In this example, we will add an error and return the source value if the lookup value is not found. We will select the Add Error and Assign Source Value options. You can choose your preferred option and click OK.
Now, if we preview the output, you can see that for each country name from the source table, the List Lookup has returned a corresponding code value.
These CountryCodes will flow through the annotated output port if you want to write your data to a destination.
This is how we can map the lookup values to a target or a transformation in the dataflow using the output port.
This concludes using the List Lookup transformation object in Astera.
Astera Data Stack gives the user the ability to use a MongoDB Source as part of the ETL pipeline. MongoDB is a fully cloud-based application data platform.
It is also a NoSQL platform that provides a mechanism of storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
It can be configured in accordance with the user application in Astera.
To start, drag and drop the MongoDB Source object from the Sources section of the toolbox onto the dataflow.
To configure the MongoDB Source object, right-click on it and select Properties from the context menu.
This will open the Properties screen.
User Name: The name of the user connecting to the MongoDB cluster.
Password: The password of the user connecting to the MongoDB cluster.
Primary Server Name: The address of the primary server cluster for connection.
Database: The database to be selected from the MongoDB server.
Authentication Database: The database used for authentication.
Enable set of replica: Allows the server to access the secondary cluster in case the primary server is unavailable.
Use TLS: Check this option if you are using TLS authentication.
Secondary Server Name: The address of the secondary server cluster for connection.
Read Preference: This drop-down menu allows the user to select which server is given preference when reading the data.
Primary: Means that data will only be read from the primary server.
Secondary: Means that data will only be fetched from the secondary server.
Primary Preferred: Means that preference will be given to the primary server, but in case of its unavailability, data will be fetched from the secondary server.
Secondary Preferred: Means that preference will be given to the secondary server, but in case of its unavailability, data will be fetched from the primary server.
Nearest: Means that the preference will be given to the server closest to the connection in region and IP.
Once the credentials have been filled in, you can test the connection by selecting Test Connection.
Once done, click Next and you will be led to the MongoDB Collection screen.
Here, you can pick a collection that you wish to fetch the data from using the Pick Collection drop-down menu.
Once the collection is selected, the layout will be built.
There are three ways to generate the layout:
Astera auto-generates the layout based on the first 100 records by default.
The user can provide a JSON schema and Astera will generate the layout.
The user can manually create the layout.
Once the layout has been built, click Next and you will be led to the MongoDB Filter screen.
Here, you can provide a query to filter out your records based on some criteria.
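For example, a filter of the following form could be provided here; this is a standard MongoDB query document, and the field name and value are purely hypothetical:

{ "Country": "Germany" }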
Click Next and you will be taken to the MongoDB SortBy screen.
Here, you can set a limit to fetch a specified number of records, or provide a number to skip the first 'n' records.
You can also sort your data based on a single field or a collection of fields.
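The sort specification also follows standard MongoDB syntax; as a hypothetical example, the document below sorts by OrderDate in descending order and then by CustomerID in ascending order (1 for ascending, -1 for descending):

{ "OrderDate": -1, "CustomerID": 1 }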
Click Next and you will be taken to the Config Parameters screen.
Parameters can provide easier deployment of flows by eliminating hardcoded values and can also provide a dynamic way of changing multiple configurations with a simple value change.
Click Next and you will be taken to the General Options screen.
Here, you can add any comments that you wish to add.
Click OK and the MongoDB Source object will be configured.
The source data can now be further used in an ETL pipeline with transformation or destination objects.
This concludes our discussion on the MongoDB Source object in Astera Data Stack.
You can create many-to-one mappings with the help of a Denormalize transformation object in Astera. Denormalizing, also known as pivoting, allows you to combine a number of records into a single record (simply put, it brings data from rows to columns). It is useful for reducing the number of tables in the schema, which simplifies querying and can improve read performance.
The TaxInfo source data contains information about TaxType (City Tax, County Tax, State Tax, and Federal Tax), Tax Amount, and SSN (Social Security Number) of taxpayers.
We want to reduce the number of rows and create separate fields for City tax, County tax, State tax, and Federal tax.
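To illustrate with hypothetical sample values (the actual field names depend on your layout), the pivot reshapes rows like these:

Before (one row per tax type):
SSN          TaxType      TaxAmount
111-22-3333  City Tax     120
111-22-3333  County Tax   80
111-22-3333  State Tax    450
111-22-3333  Federal Tax  900

After (one row per SSN):
SSN          CityTax  CountyTax  StateTax  FederalTax
111-22-3333  120      80         450       900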
Let’s see how we can use the Denormalize transformation object to achieve this.
First, we will use the Sort object to sort our source data based on the key field, SSN in our case.
Drag-and-drop the Denormalize transformation object from the Transformations section in the Toolbox.
Right-click on the Denormalize transformation object and select Properties from the context menu.
Following are the properties available for the Denormalize transformation object:
Layout Builder Window:
The Layout Builder window is used to add and/or remove fields, as well as to select their data type. The fields added in the Layout Builder will show in the Output node inside the object, as well as in all Input nodes corresponding to the number of mapping groups created (see below), with the exception of the key field(s).
Denormalize (Many-to-One) Transformation Properties Window:
Select Keys: Using the Select Keys dropdown, select the field or fields that uniquely identify the record. These keys will be used to match records between the normalized source and the denormalized destination.
Sort Input: Check this option only if values in the matching field (or fields) are not already sorted.
Driver Field Value: Enter the pivot values for your Denormalize transformation object. Using the example below, the pivot values would be City, State, Federal, and County.
General Options Window: This window shares options common to most objects in the dataflow.
Clear Incoming Record Messages: When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors: When this option is checked, records with errors will not be output by the object. When this option is off, records with errors will be output by the object, and a record message will be attached to the record. This record message can then be fed into downstream objects in the dataflow, for example a destination file that will capture record messages, or a log that will capture messages and collect statistics as well.
The Comments input allows you to enter comments associated with this object.
After you have configured the properties, click OK.
An Input mapping node will be created for each value previously specified in the Driver Field Value grid.
Map the fields and preview the output to view the denormalized data.
The Distinct transformation object in Astera removes duplicate records from the incoming dataset. You can use all fields in the layout to identify duplicate records, or specify a subset of fields, also called key fields, whose combination of values will be used to filter out duplicates.
Consider a scenario where we have data coming in from an Excel Workbook Source and the dataset contains duplicate records. We want to filter out all the duplicate records from our source data and create a new dataset with distinct records from our source data. We can do this by using the Distinct transformation object in Astera Data Stack. To achieve this, we will specify data fields with duplicate records as Key Values.
In order to add a separate node for duplicate records inside the Distinct transformation object, we will check the option: Add Duplicate Records. Then we will map both distinct and duplicate outputs to a Delimited File Destination.
Let’s see how to do that.
Drag-and-drop an Excel Workbook Source from the Toolbox to the dataflow as our source data is stored in an Excel file.
To apply the Distinct transformation to your source data, drag-and-drop the Distinct transformation object from the Transformations section in the Toolbox (Toolbox > Transformations > Distinct). Map the fields from the source object by dragging the top node of the ExcelSource object to the top node of the Distinct transformation object.
Now, right-click on the Distinct transformation object and select Properties. This will open the Layout Builder window where you can modify fields (add or remove fields) and the object layout.
Click Next. The Distinct Transformation Properties window will now open.
Data Ordering:
Data is Presorted on Key Fields: Select this option if the incoming data is already sorted based on defined key fields.
Sort Incoming Data: Select this option if your source data is unsorted and you want to sort it.
Work with Unsorted Data: When this option is selected, the Distinct transformation object will work with unsorted data.
On this window, the distinct function can be applied on the fields containing duplicate records by adding them under Key Field.
Right-click on Delimited Destination object and click Preview Output.
Your output will look like this:
To add duplicate records in your dataset check the Add Duplicates Output option in the Distinct Transformation Properties window.
When you check this option, three nodes will be added to the Distinct transformation object:
Input
Output_Distinct
Output_Duplicate
Now, map the objects by dragging the top node of ExcelSource object to the Input node of the Distinct transformation object.
Distinct output:
Duplicate output:
As evident, the duplicate records have been successfully separated from your source data.
The Data Model Query object in Astera Data Stack allows you to extract multiple tables from a deployed data model. This is especially useful when you’re writing data to a fact table via the Fact Loader object, since the fact table contains attributes from multiple source tables.
In this article, we’ll be looking at how you can configure the Data Model Query object and use it to extract data from a source model.
Let’s assume that we have the following source model.
In this example, we’ll extract all of these tables as a source via the Data Model Query object.
To get the Data Model Query object from the toolbox, go to Toolbox > Sources > Data Model Query.
Drag and drop the Data Model Query object onto the dataflow designer.
The object is currently empty because we are yet to configure it.
To configure the Data Model Query object, right-click on its header and select Properties from the context menu. Alternatively, you can double-click on the object header.
A configuration window will pop up.
Using this window, you can configure the properties of the Data Model Query object.
On the Database Connection screen, you’ll notice that the Data Provider dropdown menu is limited to just one option: Astera Data Model. This option represents the data models that are deployed on the server and are available for usage.
Once you’ve provided your Astera Data Stack credentials and a server connection, you can select a deployed model from the Database dropdown menu.
In this case, we’ll select DW_Source, which represents the source model that was shown earlier.
Once you’ve chosen a deployed model, click Next.
The Query Source Layout screen will appear.
On the Query Source Layout screen, you can select a root entity from a list of entities present in the source model, via the Root Entity dropdown menu.
The root entity serves as a starting point for a tree layout that includes all of the entities that you need to extract data from.
In this case, the root entity in the source data model is InvoiceLines.
Once you’ve chosen a root entity, a tree/hierarchical layout starting from the root entity will appear on the left side of the screen. You can expand the inner nodes to reveal the fields present in other entities of the source model.
Checking the Allow Collections option will enable collection nodes in the tree layout.
In the Where Clause textbox, you can add an optional SQL statement that will serve as a filter for incoming records.
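As a rough example, assuming the entities in this model expose a numeric Quantity column, a condition such as the one below could be entered (shown here as a bare condition; adjust it to whatever form the textbox expects):

Quantity > 10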
Click OK once you've chosen a root entity. You've now configured the Data Model Query object. The tree layout, starting from the root entity, will be visible in the object.
The fields present in this layout can now be mapped further to other objects in the dataflow.
This concludes our discussion on the Data Model Query object.
The Aggregate transformation object provides the functionality to create aggregations of your dataset, using aggregate functions such as Sum, Count, First, Last, Min, Max, Average, Var or Standard Deviation. The dataset can be split into groups so that the aggregate value(s) can be generated for the group instead of the whole dataset. For example, calculate product count by month of year, or get average sales price by region and year.
The Aggregate Transformation can be applied to unsorted data or to data sorted on group-by values. When applied to an input stream that is sorted on group-by fields, the Aggregate Transformation performs substantially better and consumes very little memory. Conversely, when applied to unsorted datasets, the Aggregate Transformation may consume substantial memory for large datasets and may slow down the performance of the server.
In this scenario, we have product data stored in a CSV file. The source file contains information such as the ProductID, SupplierID, and UnitPrice of the various products, the QuantityPerUnit of products available, etc. We want to derive the following information from our source data:
Number of products per category
Total price of all the products per category
Minimum price per category
Maximum price per category
We will use the Aggregate Transformation object to derive the required information.
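For readers who think in SQL terms, the aggregation we are about to configure is roughly equivalent to the query below (an analogy only; Products stands in for the source CSV file, and the column names come from the example):

SELECT CategoryID,
       COUNT(ProductID) AS ProductsPerCategory,
       SUM(UnitPrice)   AS TotalPricePerCategory,
       MIN(UnitPrice)   AS MinPricePerCategory,
       MAX(UnitPrice)   AS MaxPricePerCategory
FROM Products
GROUP BY CategoryID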
To work with the Aggregate Transformation, drag-and-drop the Aggregate Transformation object from Toolbox > Transformations > Aggregate.
Right-click on the transformation object and select Properties. The Layout builder window will now open.
Here, you can write names of fields that you want to map to the transformation object in the Name column and specify the relevant Aggregate Functions for them.
For this case:
CategoryID: We will select the Group-By option from the Aggregate Function drop-down list for this field as we want to group the records based on the available product categories.
ProductID: For this field we will select the Aggregate Function Count, in order to calculate the number of products per category.
UnitPrice: We will map this field thrice.
To calculate TotalPricePerCategory, select the Sum function.
To calculate MaxPricePerCategory, select the Max function.
To calculate MinPricePerCategory, select the Min function.
Click on Next. The Aggregate Transformation Properties window will now open.
There are three sorting options in Aggregate transformation:
Incoming data is pre-sorted on group by fields: This option requires the incoming data to already be sorted on the specified Group-By field(s).
Sort Incoming data before building aggregate: This option will first sort the incoming data, then build its aggregate.
Build aggregate using unsorted data: This option will build aggregate using the incoming data whether it is sorted or not.
Click on Next. The Config Parameters window will now open, where you can further configure and define parameters for the Aggregate transformation.
Click Next. This is the General Options window. Click OK.
General Options Window:
This window shares options common to most objects in the dataflow.
Clear Incoming Record Messages
When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors
When this option is checked, records with errors will not be output by the object. When this option is off, records with errors will be output by the object, and a record message will be attached to the record. This record message can then be fed into downstream objects in the dataflow, for example a destination file that will capture record messages, or a log that will capture messages and collect statistics as well.
The Comments input allows you to enter comments associated with this object.
After you have configured the properties, click OK.
You will see the fields in the object that were added in the Layout Builder window.
Map the data fields from the source object to the transformation object. You can auto-map the entire dataset from the source to the transformation object, or only map the selected fields that you want to work with. In this case, we will map CategoryID, ProductID, and UnitPrice, as those are the fields we want to find aggregations for.
Right-click on the Aggregate transformation object and click Preview Output.
You will see that the specified Aggregate Functions have been applied.
The Expression transformation object in Astera defines an expression or logic to process an incoming value (or values). As a result, it may return a new set of values which do not depend on any user-provided input data. Expressions can also be modified and used as variables for various other calculations.
The Expression Transformation object uses an expression as a logic to transform data. You can write an expression of your own or use different functions or operations from Astera’s extensive library of built-in functions, such as string manipulations, data conversion, date and time manipulation, etc. You can also perform various operations such as mathematical calculations and comparisons, etc. using the Expression transformation object.
In this example, we have a sample dataset, Customers, stored in an Excel file. The address information in this source is split into multiple fields such as Address, Region, Country, and PostalCode. We want to concatenate the information in these fields and return it as full address in a new data field. For this, we will use Expression transformation object.
Drag-and-drop the Expression transformation object from Toolbox>Transformations>Expression onto the designer.
Map the fields to be concatenated from the source object to the Expression transformation object. In this example, we have mapped the Address, Region, City, Country and PostalCode fields.
Now right-click on the Expression transformation object and select Properties from the context menu.
This will open the Layout Builder window where you can add or remove fields and modify your layout.
These are the following options on the Layout Builder window:
Name: This is where the field name is specified. You can change the name of existing fields if required.
Data Type: Specifies data type of the mapped fields.
Input: When checked, the field will be mapped as an input, with an input mapping port, to take data input from a source.
Output: When checked, the field will be mapped as an output. If an expression is present, the expression will be applied to this output.
Variable: Turns the field into a variable which can then be applied to other fields. These expressions are calculated first and then assigned to other expressions using it. Once a field turns into a Variable, it cannot be assigned for input or output mapping.
Expression: This is where the expression used for modifying the field or group of fields is specified.
Since we want to write the address details from multiple fields into a single field, let’s create a new field named Full_Address, and specify the Data Type as String and check the Output option.
You will find the following options in the Expression Builder window:
Functions: An extensive library of built-in functions from where you can select any function according to your requirement.
Expressions: Here, you can write an expression rule or choose one from the built-in functions in Astera.
Objects: In this panel, you can find all the fields in your layout listed under the Expression node. You can double click on any field name to map it to your expression.
In this example, we can either use a concatenate function from the built-in functions or write an expression of our own to return the complete address information in a single field.
Address + ' ' + Region + ' ' + City + ' ' + Country + ' ' + PostalCode
You can now see your expression appear in the Expression field. Click Next.
A General Options screen will now open where you can add Comments and specify other General Options. Once you’re through with these general settings, click OK.
General Options screen: This window shares options common to most objects in the dataflow.
Clear Incoming Record Messages: When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors: When this option is checked, records with errors will not be output by the object. When this option is off, records with errors will be output by the object, and a record message will be attached to the record. This record message can then be fed into downstream objects in the dataflow, for example, a destination file that will capture record messages or a log that will capture messages and collect statistics as well.
In the Comments input section, you can write comments related to the object.
To preview the output, right-click on the Expression transformation object and select Preview Output from the context menu.
Here’s a preview of the concatenated output:
You may rename your Destination object from the context menu options for this object. Here, we will rename it as Full_Address.
This concludes using the Expression transformation object in Astera.
The Reconcile Transformation object in Astera enables users to identify and reconcile new, updated, or deleted entries across two versions of a data source. It can be applied in a wide variety of business scenarios that require a user to identify changes in multiple data records and capture them efficiently to drive critical business decisions.
Consider an example where we have sample data of complaints filed by customers regarding the products and services provided by a company. Assume that source file 1 contains the details and status of complaints on January 1st, 2018, and source file 2 contains the details and status of complaints on February 1st, 2018. We want to track the progress of the resolved complaints during that one month.
To do so, we will reconcile the information contained in the source data files and capture changes using the Reconcile Transformation object.
Drag-and-drop the Reconcile Transformation object from Toolbox > Transformations > Reconcile onto the dataflow designer.
This is what a Reconcile Transformation object looks like:
You can see the transformation object contains three child nodes (Output, Input_1, and Input_2) under the parent node, Reconcile.
Expand the input nodes to map fields from the source files.
Map the data fields from the source objects that you want to reconcile to the respective input node in the Reconcile Transformation object.
Right click on the Reconcile Transformation object’s header and select Properties.
This will open the Reconcile Transformation Properties window where you will see the following options:
Case Sensitive – Check this option if you want to derive a case-sensitive output
Sort Input 1 – Check this option if the incoming data from source 1 is not sorted
Sort Input 2 – Check this option if the incoming data from source 2 is not sorted
You can choose the Reconcile Output Type from the following options (a brief sketch of the change-flag logic follows this list):
Side By Side Element With Change Flag – If you want to get values from both sources presented side by side, with a separate column that flags the reconciled output: true if the record was updated, false if it remained unchanged.
Original Layout – If you want to get the reconciled output for each record and corresponding information in the reconciled field.
Original Layout With Changed Element Collection – Applies when working with hierarchical data, to reconcile the information contained in child nodes.
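As a rough illustration of the change-flag idea (a sketch of the comparison logic only, not Astera’s implementation), assuming both inputs are keyed on Complaint_ID:
# Hypothetical snapshot data: Complaint_ID -> Status on each date
jan = {101: "Open", 102: "Open", 103: "Resolved"}
feb = {101: "Resolved", 102: "Open", 103: "Resolved"}

# Side-by-side output with a change flag: True if the status was updated, False otherwise
for complaint_id in sorted(jan.keys() & feb.keys()):
    print(complaint_id, jan[complaint_id], feb[complaint_id], jan[complaint_id] != feb[complaint_id])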
Once you have selected the preferred Output Type, you can specify the records to be shown in the output by applying the Record Filter and Inner Node Filter. You may choose one, multiple, or all of the following options by checking the corresponding boxes.
Click Next to proceed to the Layout Builder window. Here, you will have to specify a Key. This is the common identifier in both source files that will be used to identify and reconcile records. In this case, we want to reconcile the progress on complaints made against each Complaint_ID; therefore, we will select Complaint_ID as our Key.
Now go to the Survivor Value drop-down list to set the Survivor Value for each data field. Survivor Values are the values from your source datasets which you want to retain in the output.
You may select from the following Survivor Value options:
Second – If you want to derive the output value from the second source
First – If you want to derive the output value from the first source
First If Not Null, Otherwise Second – If you want to output a value from the first source if the record is not null, otherwise from the second source.
Second If not Null, Otherwise First – If you want to output a value from the second source if it is not null, otherwise from the first source.
Higher – If the input values are integers, and you want to choose the higher value
Lower – If the input values are integers, and you want to select the lower value
Expression – If you want to derive the output value based on a formula expression
Click Next to proceed to the General Options window, then click OK.
General Options window - This window shares options common to most objects in the dataflow.
Clear Incoming Record Messages - When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors - When this option is checked, records with errors will not be output by the object. When this option is unchecked, records with errors will be output by the object, and a record message will be attached to the record. This record message can then be fed into downstream objects in the dataflow, for example a destination file that will capture record messages, or a log that will capture messages, as well as collect their statistics.
Now, right-click on the Reconcile Transformation object’s header and select Preview Output to get the reconciled output.
You will get one of the following outputs according to the output type selected in the Reconcile Transformation Properties window.
Side by Side Element with Change Flag
Original Layout
Original Layout With Changed Element Collection
Reconcile Transformation objects can be applied in a variety of business cases, particularly those where monitoring the changes in assorted data records is crucial in driving critical business decisions. Here are some of the benefits and uses of the Reconcile Transformation object:
Reconciles data by deriving old and new values for specific fields in the source data
Allows users to choose from various layout options to reconcile changes in the most appropriate way
Works effectively with structured and unstructured (hierarchical) data formats
Offers the flexibility to select the information to be retained through different survivor value options
The Merge transformation object in Astera is designed to merge data fragments from disparate sources, based on some predefined logic, and present them in a consolidated form to draw actionable insights.
Let’s assume that there is an organization that maintains customer data in two different departments – Marketing and Sales. The Marketing department stores its information in a database table, while the Sales department maintains an Excel sheet of customer information. We want to merge the information from both sources so that we have consolidated data.
Since the Merge transformation object merges data from a single source only, we will first combine both sets of records using the Union transformation object. We will then map fields from the data sources to the Union transformation object and add a new field, DataSource, to keep track of which information is coming from which source.
Drag the Merge transformation object from the transformations section in the Toolbox and drop it on the dataflow designer.
This is what a Merge transformation object looks like:
Map the Union transformation object’s output to the Merge transformation object.
Right-click on the Merge transformation object and select Properties to set up the transformation properties in the Layout Builder window. This is what the Layout Builder window looks like:
In the Layout Builder window, specify the Primary Key. This is a common identifier that identifies similar records from the various sources and merges the information against these records.
(Since we are consolidating different customer records, we will set up CustomerID as the Primary Key in this case.)
Next, you have to specify the field to be used as Version. If your data is coming from multiple sources, the Version field shows which source the data is coming from in the final merged output. In this case, we will use the Data Source field we added in the Union transformation object as the Version field.
Next, specify the Survivor Type for each field. Survivor Type allows you to choose the survivor values – the values you want to retain from your data sources – for each field. Survivor Types are set to First by default. However, depending on your case, you can choose the Survivor Type from the following options (a brief sketch of this logic follows the list):
First – Returns data from the first data source for that field
Last – Returns data from the last data source for that field
Maximum – Returns the maximum value from all available input data sources
Minimum – Returns the minimum value from all available input data sources
Count – Returns the total count of all values that exist in the field
Sum – Aggregates the values that exist in that field across all the input sources and returns their arithmetic sum
Comma Separated Values – Separates the values that exist in that field across all the input sources with a comma and returns that representation. This option is only available when the output field is assigned the 'String' data type.
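As a rough sketch of how a Survivor Type resolves two candidate values for a single field (an illustration only, not Astera’s implementation; Count is omitted for brevity):
# Hypothetical sketch: choose the surviving value for one field from two sources
def survive(first, last, survivor_type):
    if survivor_type == "First":
        return first
    if survivor_type == "Last":
        return last
    if survivor_type == "Maximum":
        return max(first, last)
    if survivor_type == "Minimum":
        return min(first, last)
    if survivor_type == "Sum":
        return first + last
    if survivor_type == "Comma Separated Values":
        return f"{first},{last}"  # only meaningful for String fields
    raise ValueError(f"Unsupported survivor type: {survivor_type}")

print(survive("Marketing", "Sales", "Comma Separated Values"))  # Marketing,Sales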
Since CustomerID, CompanyName, and ContactName records are common in both the source files (Customers_Marketing and Customers_Sales), we will set the Survivor Type as First for these fields. For the other fields with missing records, the Survivor Type will be set as follows:
Field – Survivor Type
ContactTitle – First
Address – First
City – First
Region – Last
PostalCode – First
Country – First
Phone – Last
Fax – Last
DataSource – Comma Separated Values
Once you have set the Survivor Type, specify Precedence for each field. Precedence is the order in which you want the source data to be assessed. For instance, we have common data fields in both the sources, but different and missing records. We can set appropriate Precedence values to bring data from the desired data source.
Next, you can set a specific Condition, and the Merge transformation will process records based on the criteria specified for a particular field.
(In this case, we have specified ‘IsNotNull’ for Address and Region fields since we want none of these fields to be empty or have missing records.)
Depending on the requirements of the business case, you can add a logical expression in the Expression field to process the incoming data value and transform it into the output according to the logic defined. The Expression field can be used for mathematical and financial calculations, date and time manipulations, comparisons and conversion functions.
Click Next to proceed to the Merge Transformation Properties window. Here, you will see the following three checkboxes:
Case Sensitive – Check if data is to be assessed on a case-sensitive basis
Sort Input – Check if the incoming data is not already sorted
Version Order Descending – Check if you want the data to be sorted in a descending version order
Click Next to proceed to the General Options window. Here, you can add Comments, instructions, or any relevant information about the transformation. This will not change or alter your transformation action in any way.
You may also skip this step by clicking OK in the previous step (on the Merge Transformation Properties window) to close the properties window.
To get the output, right-click on the Merge transformation object, and click on Preview Output. You will get the merged records based on your specified transformation properties.
Data Preview before applying Merge transformation:
Data Preview after applying Merge transformation:
Merge transformations can be applied in cases where related data is scattered across different records and sources. Astera makes it convenient for users to consolidate data stored in different sources, while also giving them the flexibility to choose how the output should appear through the various transformation properties.
The Sequence Generator Transformation object in Astera is used to add sequences of integer values to a dataflow. The sequences can start with any number and have any step, for example, 50, 55, 60, 65 etc.
The Astera Data Stack can either create a sequence instantly at the dataflow’s run-time (this is called an in-memory sequence), or it can read sequence control data from a database table as your dataflow is executed.
In the case of in-memory sequences, a sequence always starts at the Start Value provided in the SeqGenerator: Context Information Properties window. In the case of database sequences, the last value used is recorded in the control database, and a new start value is used every time the sequence is invoked.
This makes it possible to generate ever-increasing values for the sequence each time the dataflow runs. In effect, such a sequence is a chain of sequences with non-overlapping values.
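Conceptually, an in-memory sequence behaves like a simple counter that restarts at the Start Value on every run; a minimal Python sketch (illustration only):
import itertools

# In-memory sequence: starts at the Start Value and advances by Step
def make_sequence(start, step):
    return itertools.count(start, step)

seq = make_sequence(start=50, step=5)
print(next(seq), next(seq), next(seq))  # 50 55 60
# A database-backed sequence would instead persist the last value used between runs.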
Here, we have retrieved data from an Orders table using a Database Table Source object. We will use the Sequence Generator Transformation object to generate a sequence for the OrderNo field in our source data. Let’s see how this works.
Drag the Sequence Generator Transformation object from the Transformations section in the Toolbox and drop it on to the dataflow designer.
Map the required fields from the source object to a destination object.
To configure the properties of the Sequence Generator Transformation object, right-click on its header and select Properties from the context menu.
This will open the Context Information Properties window.
In this window, you can choose between three different types of sequence generations and specify the Sequence Details.
A description of these three methods is given below:
In Memory: The sequence will be created in memory at the dataflow run-time. The sequence always starts at the specified Start Value in the sequence properties.
Sequence Details:
Start Value – The initial value for the sequence
Step – The increment value
Database Table: The sequence control information for the database table can be managed within Astera through the Manage Sequences option.
Connection: Specify the connection to the database where the sequences will be stored
Sequence: Select the sequence from the list of available sequences in the database.
Note: To manage database sequences, go to Menu > Tools > Sequences.
Batch Size: Specifies the minimum number of values to be allocated to the sequence.
Use Memory Sequence during preview: Prevents the user from breaking the sequence cycle during a data preview by substituting with an in-memory sequence, which does not affect (i.e. increase) the database sequence’s current value.
Sequence Object - The sequence control information is read from a sequence object in a SQL Server or Oracle database.
Connection: Specify the connection to the database that stores your sequences.
Sequence: Select the sequence from the list of available sequences.
Use Memory Sequence during previews: Prevents the user from breaking the sequence cycle during a data preview by substituting with an in-memory sequence.
Let’s specify the Sequence Details as follows:
Start Value: 0
Step: 1
In the destination object, a new field will be created where the sequence generator value will be mapped.
The NextVal field will be mapped to the OrderNo field in the destination object.
You can see the output of the Excel destination object in the Data Preview window.
The sequence has been generated in the new field, OrderNo.
This is how the Sequence Generator Transformation is used in Astera.
A Route transformation object invokes one or more paths in the dataflow, in accordance with some decision logic expressed as a set of Rules. Using the Route transformation object, you can create some custom logic involving multiple paths and adapt it to suit your data processing scenario.
For example, a record passing some rule will be written to Destination1, a record passing another rule will be written to Destination2, and a record which fails to pass any rules can still be written to a Destination, and be fed to a downstream transformation object.
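The routing idea can be sketched in Python as follows (the rule names, field, and first-match behaviour below are assumptions made for illustration, not Astera’s implementation):
# Hypothetical sketch: send each record down the first path whose rule it satisfies
records = [{"Amount": 1200}, {"Amount": 80}, {"Amount": -5}]

rules = [
    ("HighValue", lambda r: r["Amount"] >= 1000),      # e.g. mapped to Destination1
    ("LowValue",  lambda r: 0 <= r["Amount"] < 1000),  # e.g. mapped to Destination2
]

routes = {name: [] for name, _ in rules}
routes["Default"] = []  # records that fail every rule can still be written somewhere

for record in records:
    for name, rule in rules:
        if rule(record):
            routes[name].append(record)
            break
    else:
        routes["Default"].append(record)

print(routes)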
To add a Route transformation object, drag the Route object from the Transformations section in the Toolbox, and drop it on the dataflow designer.
An example of what a Route object might look like is shown below:
To configure the properties of a Route object after it was added to the dataflow, right-click on the object’s header, and select Properties from the context menu. The following properties are available:
Layout Builder window:
The Layout Builder window allows you to add or remove fields in the field layout, as well as select their data type. The fields added in the Layout Builder will be added to the Input node inside the object box, as well as in all Rule nodes corresponding to the number of rules created (see below).
Route Transformation Properties window:
The Route Transformation Properties window provides the interface for managing Route rules.
Type a descriptive name for the rule in the Description field.
Click Compile to check for any syntax errors in your rule. The Compile Status should read “Successful” for a successful compilation.
To activate the rule, check the Active checkbox.
General Options window: This window shares options common to most objects in the dataflow.
Clear Incoming Record Messages: When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors: When this option is checked, records with errors will not be output by the object. When this option is unchecked, records with errors will be output by the object, and a record message will be attached to the record. This record message can then be fed into downstream objects in the dataflow, for example a destination file that will capture record messages, or a log that will capture messages, as well as collect their statistics.
The Comments input allows you to enter comments associated with this object.
An example of using the Route transformation object is shown below.
The Passthru Transformation object creates a new dataset based on the elements that were passed to the transformation. This is useful for organizing datasets for better readability and grouping of values that are otherwise calculated over and over again (e.g. a Sequence Generator Transformation).
In this document, we will learn how to use the Passthru Transformation object in Astera.
The source file contains customers’ information in the parent node and their order and shipping details in the collection/child node.
Preview the data by right-clicking on the source object’s header and selecting Preview Output.
A Data Preview window will open, showing you the preview of the hierarchical data.
Now, we want to create a field in the parent node that contains the count of orders that arrived late for each customer and write these records to a destination file. This new field in the parent node will depend on two fields, RequiredDate and ShippedDate, that are already present in the collection/child node.
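Conceptually, the value we want in the new parent field can be sketched in Python like this (the dates are made up; this is not Astera syntax):
from datetime import date

# One customer's Orders collection (child records)
orders = [
    {"RequiredDate": date(2018, 1, 10), "ShippedDate": date(2018, 1, 12)},  # shipped late
    {"RequiredDate": date(2018, 1, 15), "ShippedDate": date(2018, 1, 14)},  # on time
]

# LateOrders: count of orders shipped after their required date
late_orders = sum(1 for o in orders if o["ShippedDate"] > o["RequiredDate"])
print(late_orders)  # 1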
For this purpose, we will use the Passthru Transformation object.
To get a Passthru Transformation object from the Toolbox, go to Toolbox > Transformations > Passthru and drag-and-drop the Passthru object onto the designer.
You can see that the dragged transformation object is empty right now. This is because we have not mapped any fields to it yet.
Auto-map source fields to the transformation object by dragging-and-dropping the top node of the source object, SampleData to the top node of transformation object, Passthru.
Now the mapping is done. Let’s configure the Passthru Transformation object.
Right-click on the transformation object’s header and select Properties from the context menu. A configuration window for the Passthru Transformation object will open. The first window is the Layout Builder window. This is where we can create or delete fields, change their name or data types, mark any field as a variable, or attach expressions to any field.
Create a new field under the parent node. Let’s call it LateOrders and mark it as an Output field.
Click Next, and you will be directed to a Layout Modifications window.
Select the LateOrders field. Here you can modify any field by applying expressions to it.
Write the expression that counts the number of late arriving orders. Click “…” , located next to the expression box to open the Expression Builder window where you can make use of Astera’s extensive library of built-in functions and expressions.
Click on the Orders node and a panel (Options for Collection) will appear. These options are only available for collection nodes. This is where we will specify the rule for routing late arriving orders. Select Route Based on Rules and a new section for adding rules will appear on the screen.
Add a new rule by clicking on the Add switch condition icon
Now, write an expression to route late arriving orders and name this rule as “LateArrivals”.
Click OK. Now observe that a new collection node, Orders_LateArrivals, has been added to the Passthru Transformation object.
To preview data, right-click on the header of the transformation object and select Preview Output from the context menu.
A Data Preview window will open. On expanding records, you will get corresponding order details and appended details of late arriving orders.
To store the output for late arriving orders, write it to a destination file.
Configure settings for Excel Workbook Destination object.
Click on the Start Dataflow icon located in the toolbar at the top of the designer window to create the destination file.
Upon clicking this icon, an Excel file will be created. You can find the link to this file in the Job progress window.
The Normalize transformation object in Astera Data Stack is used to create one-to-many mappings. It allows users to create multiple records from a single record by transposing the columns in a dataset into rows. In other words, you can take a dataset that has many columns and turn it into one that has many rows.
In this use case, we have a sample Taxpayers Excel dataset that contains information on the types and amounts of taxes paid by taxpayers. This includes taxpayers’ Social Security Number (SSN) and the different types of taxes that they have paid. These types are divided into different fields, such as City, County, State, and Federal, with each column containing the amount paid by each customer for a particular tax type. Our goal is to reduce the number of fields and increase the number of records by specifying the tax type in one consolidated field. To do this, we will use the Normalize object in Astera.
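The wide-to-long reshaping the Normalize object performs can be sketched in Python (the amounts below are made up; this is an illustration, not Astera’s implementation):
# Hypothetical sketch: turn one wide taxpayer record into several narrow records
record = {"SSN": "123-45-6789", "City": 120.0, "County": 80.0, "State": 300.0, "Federal": 950.0}

normalized = [
    {"SSN": record["SSN"], "TaxType": tax_type, "TaxAmount": record[tax_type]}
    for tax_type in ("City", "County", "State", "Federal")
]

for row in normalized:
    print(row)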
Drag the relevant source object from the Toolbox and drop it onto the designer. In this case, we will select the Excel Workbook Source object from Toolbox > Sources > Excel Workbook Source and configure it so that it reads data from the Taxpayers’ dataset.
To preview the data, right-click on the object header and select Preview Output from the context menu. Here is a look at the dataset:
Drag the Normalize object from Toolbox > Transformations > Normalize and drop it onto the designer.
You will notice that the object header contains one Output node and two Input nodes by default.
Any field mapped as a new member to one Input node will appear in all of the input nodes as well as the Output node. In this case, we will map the SSN field from the source object to an Input node.
Right-click on the header of the Normalize object and select Properties from the context menu.
A dialogue box will appear.
This dialogue box is used to configure the Normalize object.
In the Layout Builder window, create the layout of your normalized dataset by providing field names. In this case, we have already mapped SSN from the source and will create two new fields, one for the TaxAmount and the other for the TaxType.
Once you have created the layout, click Next.
In the Normalize (One to Many) Transformation Properties window, make appropriate selections for the following options:
Number of Mapping Groups: Here, you can specify the number of mapping groups that are required. Increasing this number from 2 will also increase the number of Input nodes in the object. In this case, there are four tax types. Hence, we will increase the number to 4.
Omit Record If this Element is Null: From this drop-down menu, you can select a field from your layout. If an element in this field is null, the entire record containing that null element will be omitted from the dataset. In this case, we will keep the default option, which denotes that this setting will not apply to any field.
Once you have made the required selections, click Next.
On the last window, which is the General Options window, you will be provided with an empty text box for Comments. Moreover, you can also select a few options that are common to most objects in Astera.
Clear Incoming Record Messages: When this option is checked, any messages coming in from the preceding object will be cleared.
Do Not Overwrite Default Values with Nulls: When this option is checked, actual values are not overwritten with null values in the output.
In this case, we will leave the options unchecked. Once you are done, click OK.
Now that you have configured the Normalize object, you will notice that new input nodes have been added to the object based on our selection for the Number of Mapping Groups option. Each node contains the layout we specified in the Layout Builder window.
The next step is to make the required mappings from the source object to the Normalize object. These are the mappings needed for this particular use case:
Map SSN from the Excel Workbook Source object to SSN in all four input nodes of the Normalize object.
Map City to TaxAmount in the first input node, County to TaxAmount in the second input node, State to TaxAmount in the third input node, and Federal to TaxAmount in the fourth input node.
Map the City Field Name to TaxType in the first input node, the County Field Name to TaxType in the second input node, the State Field Name to TaxType in the third input node, and the Federal Field Name to TaxType in the fourth input node. To map field names, right-click on the mapping link, hover over Change Map Type, and select Field Name.
Here is what the final dataflow should look like:
Preview the output to have a look at the normalized dataset.
You can map these fields further to other objects in the dataflow using the output node of the Normalize object.
This concludes using the Normalize object in Astera.
The Tree Transform object in Astera enables users to transform data in a hierarchical structure. Users can create new fields in the parent node based on the data in the collection/child node. Tree Transform supports rule-based routing, filtering, merging and sorting of data while maintaining its hierarchy.
In this document, we will learn to use the Tree Transform object in Astera.
The source file contains customers’ information in the parent node and their orders and shipping details in the collection/child node.
You can preview this data by right-clicking on source object’s header > Preview Output.
A Data Preview window will open, displaying a preview of the hierarchical source data.
Now, we want to create a new field in the parent node that contains the count of orders that arrived late for each customer and route these records to a destination file while maintaining the hierarchical format of the dataset.
This new field in the parent node will depend on two fields: RequiredDate and ShippedDate, that are already present in the collection/child node.
In other words, we are trying to transform hierarchical data without flattening its structure. For this purpose, we will use the Tree Transform object.
To get a Tree Transform object from the Toolbox, go to Toolbox > Transformations > Tree Transform and drag-and-drop the Tree Transform object onto the designer.
The transformation object is empty right now. This is because we have not mapped any fields to it yet.
Auto-map source fields onto the transformation by dragging and dropping the root node of the source object CustomerOrders onto the root node of the transformation object – TreeXform.
Now that the mapping is done, let’s configure the TreeXform object.
Right-click on the object’s header and select Properties from the context menu. A configuration window for Tree Transform will open. The first window is the Layout Builder window. This is where we can create or delete fields, change their name or data types, mark any field as a variable or attach expressions to fields.
Create a new field under the parent node. Let’s name it LateOrders.
Click Next, and you will be directed to a Layout Modifications window.
Select the LateOrders field. Here you can modify any field by applying expressions to it.
Write the expression that counts the number of late arriving orders.
Click “…” next to the expression box to open the Expression Builder where you can make use of the extensive library of built-in functions and expressions in Astera.
This creates the field values for LateOrders based on the field values of RequiredDate and ShippedDate fields.
Click on Orders node and a panel (Options for Collection) will appear. These options are only available for collection nodes.
Options for Collection:
Show – Shows the collection node in the Tree Transform object.
Flatten With Item Count – Flattens data based on the record count in the collection node against each item in the parent node.
Flatten Based on Rules – Flattens a part of hierarchical data based on predefined rules.
Route Based on Rules – Routes and creates subsets of data based on predefined rules.
Merge With Item Count – Merges data based on the record count in the collection node against each item in the parent node.
Hide – Hides collection node from the Tree Transform object.
Calculation Formula – An expression box used for writing rules to route or flatten hierarchical data.
Sort Based On Keys – Sorts hierarchical data based on the field in the collection node. Only available for Show and Flatten With Item Count options.
This is where we will specify the rule for routing late arriving orders. Select Route Based on Rules. A new section for adding rules will appear in the window.
Now, write an expression to route late arriving orders and name this rule “LateArrivals”.
Click OK. Now observe that a new collection node Orders_LateArrivals has been added to the Tree Transform object.
To preview data, right-click on the header of the transformation object and select Preview Output from the context menu.
A Data Preview window will open. On expanding the records, you will get corresponding order details and appended details of late arriving orders.
To store the output for late arriving orders, you can write it to a destination file or use that data further in the dataflow.
This concludes using the Tree Transform transformation object in Astera.
The Sort Transformation object in Astera is used to sort an incoming data stream. It also provides the option to remove duplicate values from the input.
It is a blocking transformation, which means that input records are accumulated until the end of the input. Blocking transformations affect the performance of the overall dataflow because subsequent steps cannot be executed until all the records have been received and processed by the blocking transformation.
The Sort Transformation uses storage on the server for temporary data during sorting. The server must have enough capacity to store the entire data set and index.
We have retrieved the OrderDetails data from a database table. The dataset contains fields such as OrderID, ProductID, UnitPrice, Quantity, and Discount. This data is unsorted and we want to sort it in the ascending order of UnitPrice.
Drag the Sort Transformation object from the Transformations section in the Toolbox and drop it on the dataflow designer.
Map fields from the source object to the Sort Transformation object.
To configure the properties of the Sort Transformation object, right-click on its header and select Properties from the context menu.
A Layout Builder window will appear.
In this window you can either:
Add Member Objects or Collection Objects to the layout.
Edit the elements of the Sort object. The Layout Builder allows you to add or remove fields in the layout, as well as select their data type. The fields added in the Layout Builder will be added to the Input node inside the object box. Once you’re done making changes to the layout, click Next.
The next window is the Sort Transformation Properties window.
Here, you can specify the sorting criteria. You will see the following options on this screen:
Return Distinct Values Only: Check this option if you want to remove duplicate values from the output.
Treat Null as the Lowest Value: Check this option if you want a null value to be returned first in the ascending sort order, and conversely, have the null value returned last in the descending sort order.
Case Sensitive: Check this option if you require case sensitive comparison for strings.
On the same screen, you need to select the sorting Field from the drop-down list and set the Sort Order as Ascending or Descending.
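For reference, an ascending sort that treats null as the lowest value (the Treat Null as the Lowest Value option above) can be sketched in Python with made-up values:
# Hypothetical sketch: None (null) sorts first in ascending order when treated as the lowest value
unit_prices = [14.0, None, 9.8, 31.0, None]

sorted_prices = sorted(unit_prices, key=lambda v: (v is not None, v if v is not None else 0))
print(sorted_prices)  # [None, None, 9.8, 14.0, 31.0]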
The last window is the General Options window. Here you can add Comments or specify some General Options. Once done, click OK and the window will close.
You can now map the Sort Transformation object to a destination and preview the output.
The output now shows the entire source data sorted in the ascending order of UnitPrice.
This is how the Sort Transformation can be used in Astera.
The Switch Transformation object matches source data against criteria specified by the user and, wherever the criteria are met, replaces the information in the particular field with the desired output (also specified in the layout). This gives users more control over their data and helps them manage it better.
There are two modes in Switch transformation:
Basic Mode
Enhanced Mode
The Basic Mode in the Switch transformation object matches specific values in the source data and replaces them with the desired output. Enhanced Mode enables users to set lookup criteria by writing expressions, which makes the feature more flexible.
Select your source by dragging the relevant object from the Sources section in the Toolbox onto the dataflow designer, and configure the connection by entering the relevant details.
After setting up the connection and configuring the source file, drag the Switch transformation object from the Toolbox. If the Toolbox is hidden, go to View > Toolbox > Transformations > Switch.
Map the required fields from the source to the Switch transformation object.
Either double-click on the Switch Transformation object to open the Properties window or right-click on the object and go to Properties from the list.
The first window is the Layout Builder window. Here you can manage the fields (add and/or remove the fields) to make your Switch field layout.
After specifying the layout and selecting the relevant output field, click Next. This will take you to the Switch Map Properties window. At this stage, you can select the mode of the Switch transformation and assign the rules in the Case Value and Output Value sections.
Astera will look for the values specified in the Case Value column in the source file and replace them with the corresponding values in the Output Value column.
In this example, the source table contains information about departments in numbers. We will use the Switch transformation object in basic mode, to switch the stored numeric information with the descriptive information.
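In Python terms, Basic Mode behaves like a simple value lookup; the department codes and names below are made up, and the pass-through behaviour for unmatched values is an assumption:
# Hypothetical sketch: replace numeric department codes with descriptive names
case_map = {1: "Sales", 2: "Marketing", 3: "Finance"}  # Case Value -> Output Value

records = [{"Name": "A. Khan", "Department": 1}, {"Name": "B. Lee", "Department": 3}]
for r in records:
    r["Department"] = case_map.get(r["Department"], r["Department"])  # unmatched values left as-is

print(records)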
Data Preview (Before Switch)
Data Preview (After Switch)
Steps 1-5 are going to remain the same even when working with the Enhanced Mode in Astera.
After you have created the layout in the Layout Builder window in the object’s properties, click Next and go to the Switch Map Properties window and select Enhanced Mode.
An organization stores information about employees’ salaries. It has set criteria for issuing credit cards, which depend on an individual’s salary. In this scenario, to see which individual is eligible for which perk, define the salary range in the Case Expression field and specify the corresponding output in the Output Expression section (see the screenshot above). To store the information in a separate field, we created a new field (CreditCard) in the Layout Builder and selected it as the Output.
Data Preview (Before Switch)
Data Preview (After Switch)
The Subflow transformation object is used to call a Subflow that will run as part of your Dataflow. The Subflow acts like a wrapper for the objects it contains. Subflows can be seen as ‘black boxes’ inside your Dataflow, simplifying and streamlining the Dataflow design, increasing reusability, achieving an easier-to-understand view of your Dataflow, and possibly eliminating the need to know what is going on inside the Subflow so that you can focus on the output it creates. Over time, if the logic inside your Subflow changes, you can modify the Subflow, and the modified Subflow can now be used by the Dataflow calling your Subflow.
Subflows can be nested, meaning that a Subflow can call other Subflows. The output of the Subflow can be fed into downstream objects on your Dataflow, just like the output of any Dataflow object.
To add a Subflow transformation object, drag the Subflow object from the Transformations section in the Toolbox and drop it on to the dataflow designer.
An example of what a Subflow object might look like is shown below.
To configure the properties of a Subflow object after it was added to the dataflow, right-click on its header and select Properties from the context menu. The following properties are available:
Subflow Properties window:
Enter the file path of your subflow in the Path input. Using UNC paths is recommended to allow for remote execution of your dataflow on a server.
General Options Window: This window shares options common to most objects in the dataflow.
Clear Incoming Record Messages: When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object, and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors: When this option is checked, records with errors will not be output by the object. When unchecked, records with errors will be output by the object, and a record message will be attached to the record. This record message can then be fed into downstream objects in the dataflow, for example a destination file that will capture record messages, or a log that will capture messages, as well as collect their statistics.
The Comments input allows you to enter comments associated with this object.
Creating a Subflow is similar to creating a regular Dataflow, because a Subflow is essentially a Dataflow or a sub-Dataflow. The difference between the two, however, is that a Subflow may optionally have an input and output.
The Subflow input makes it possible to feed data into the Subflow from an upstream object on the Dataflow that calls the Subflow. The Subflow’s output is used to send data to the downstream Dataflow object connected to the Subflow.
To create a new Subflow, go to File > New > Subflow on the main menu.
Designing a Subflow is similar to designing a Dataflow. For more information on working with Dataflows, see the Creating Dataflow chapter.
When a Subflow tab is active, the flow Toolbox has an additional group labeled Subflow. This group has two objects that control the input and output properties of your subflow.
Subflow Input object is a connector controlling the input layout of your Subflow. Any data feeding into the Subflow will pass through the Subflow Input when the Subflow is called by a Dataflow or another Subflow.
To add the Subflow Input, drag the Subflow Input object from the Subflow group in the Toolbox and drop it on to the Subflow designer.
To configure the properties of a Subflow Input object after it is added to the Subflow, right-click on it and select Properties from the context menu. The following properties are available:
Layout Builder screen:
The Meta Object Builder screen allows you to add or remove fields in the field layout, as well as select their data type. The fields added in the Meta Object Builder will show in the SubflowInput1 node inside the object box.
General Options screen:
This screen shares the options common to most objects on the dataflow.
Subflow Output object is a connector controlling the output layout of your subflow. Any data leaving the subflow will pass through the Subflow Output when the subflow is called by a dataflow or another subflow.
To add the subflow output, drag the Subflow Output object from the Subflow group in the Flow toolbox and drop it on to the subflow.
To configure the properties of a Subflow Output object after it was added to the Subflow, right-click on it and select Properties from the context menu. The following properties are available:
Layout Builder window:
The Meta Object Builder window allows you to add or remove fields in the field layout, as well as select their data type. The fields added in the Meta Object Builder will show in the SubflowOutput1 node inside the object box.
General Options window:
This screen shares the options common to most objects on the Dataflow.
Some examples of using Subflows are shown below:
The Tree Join Transformation object in Astera enables users to create complex, hierarchical data structures such as EDI or XML documents with ease. Unlike the standard relational join which combines left and right elements to create a new record, the Tree Join Transformation object allows users to create collection and member nodes. It also enables users to join datasets in parent-child hierarchies using a key field. It is a set level transformation that operates on a group of records.
In this document, we will learn to use the Tree Join Transformation object in Astera.
In this use case, we have two different source datasets. The first dataset contains information about Customers and has fields such as CustomerName, CustomerID, Address, etc.
The second dataset contains details of Orders placed by customers. It includes fields such as OrderID, CustomerID, and order details such as RequiredDate, ShippedDate and other shipping details.
We will join these two datasets using the Tree Join Transformation object and create a hierarchical dataset in which all orders placed by a customer along with the order details are represented in a parent-child hierarchy.
Contrary to the regular Join that joins two datasets in a flat layout, the Tree Join Transformation joins data from two different datasets into a hierarchical data structure.
In this use case, each record from the first dataset that contains Customer details will be a parent node, and under the parent node, the details of Orders placed by that customer will be returned in a child node.
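The parent-child shape being built can be sketched in Python as a grouping on CustomerID (the sample records below are made up; this is an illustration, not Astera’s implementation):
# Hypothetical sketch: nest each customer's orders under the customer record
customers = [{"CustomerID": "C001", "CustomerName": "Example Traders"}]
orders = [
    {"OrderID": 1001, "CustomerID": "C001", "ShippedDate": "2018-01-12"},
    {"OrderID": 1002, "CustomerID": "C001", "ShippedDate": "2018-02-03"},
]

tree = []
for customer in customers:
    children = [o for o in orders if o["CustomerID"] == customer["CustomerID"]]
    tree.append({**customer, "Orders": children})  # Orders becomes the child collection

print(tree)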
1. To get the Tree Join Transformation object from the Toolbox, go to Toolbox > Transformations > Tree Join and drag-and-drop the Tree Join Transformation object onto the designer.
2. Now map fields from the Customer source dataset to the TreeJoin object.
3. Right-click on the Tree Join object header and go to Properties from the context menu. In the Tree Join Layout Builder window, you can see the fields from the Customer dataset listed under the root node.
4. Next, click on the TreeJoin node, you will see that the small icons or buttons at the top of the screen will become active. If you click on the icon, you will get two options:
Add Member Object – To add a new member node to your layout
Add a Collection Object – To add a new collection node under the parent node. It will return all corresponding records as a collection under the parent node.
In this case we will Add a Member Object to create a separate record for each order placed by a customer, under a separate customer record node.
5. Add a Member Object to this root node. A dialogue box will open to name your member object.
In this case, let’s name it ‘Orders’ and click OK. A member object has been added to our parent node.
6. Click OK, to close the properties window. Now map the Orders dataset to the member node that we created in the previous step to complete the layout.
7. Go to the properties of the TreeJoin object again. We have already created the layout, so we will proceed to the next window.
8. In the TreeJoin Transformation Properties window, we must specify the Join Key.
The join key is a common field or identifier in both datasets that is used to identify and join records in a tree-like structure. The parent and child fields are the same field, present in both source datasets, and serve as the key identifier to join records.
Parent Field – Join field from the first dataset.
Child Field – Same field as the parent field, selected from the second dataset.
In this case, the CustomerID field is common in both the datasets, so we will use it as the join Key.
9. Click on the Parent field dropdown button. Expand the TreeJoin node and select the CustomerID field.
10. Click on the Child field column and expand the TreeJoin root node. Scroll down to your member node, expand this node and select the CustomerID field from the second dataset.
Let’s discuss the other options on the properties window:
Join In Database – Lets you join the tables in the database itself rather than in-memory. However, it applies only when both the tables are sourced from the same database.
Case Sensitive – To process and join records on a case sensitive basis.
11. Now that the layout and the TreeJoin properties are ready, click OK.
12. Right-click on the TreeJoin object and select Preview Output.
The TreeJoin object has returned the customer records in parent nodes. Upon expanding the node, you can see the order placed by the customer listed as its member unit under the parent node.
If we choose to Add a Collection Object in the Layout Builder, all the records for orders placed by a customer will be returned in a collection under a single parent node for each customer.
13. The joined dataset can now be written to a desired destination. In this case we will write it to an XML File Destination object.
This concludes using the Tree Join Transformation object in Astera.
You can download the file for this use case from the following link:
The File Lookup Transformation object in Astera is used to look up values coming from a source. It uses an Excel or delimited file, which contains the lookup values as well as the output values, to perform the lookup. A file lookup can be performed based on a single lookup field as well as a combination of fields.
Similarly, a File Lookup Transformation object can return a single output field from a lookup table or a combination of fields. In either case, the output field or fields are returned from the records in which the lookup values match the incoming values.
In this scenario, we have a Customers dataset from a fictitious organization stored in a database source. It contains information about customers from different countries. We want to replace the country names in our database with country codes, by switching them with the lookup values (country codes) stored in an Excel file. To achieve this, we will use the File Lookup Transformation object.
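The lookup idea can be sketched in Python using a small delimited lookup table (the file content, column names, and values below are assumptions made for illustration):
import csv
import io

# Hypothetical lookup file: Country -> Code
lookup_file = io.StringIO("Country,Code\nGermany,DE\nMexico,MX\nUK,GB\n")
lookup = {row["Country"]: row["Code"] for row in csv.DictReader(lookup_file)}

records = [{"CustomerID": "ALFKI", "Country": "Germany"}, {"CustomerID": "ANATR", "Country": "Mexico"}]
for r in records:
    r["Country"] = lookup.get(r["Country"], r["Country"])  # keep the source value if no match is found

print(records)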
Select the relevant source object from the Sources section in the Toolbox. In this example, we will use Customers data stored in a database table.
Right-click on the source object’s header and select Preview Output. You can see the country names in the Country field which we want to convert into country codes.
Drag-and-drop the File Lookup Transformation object from Toolbox > Transformations > File Lookup onto the dataflow designer.
Now, let’s configure the Transformation object.
Right-click on the header of File Lookup Transformation object and select Properties.
A File Lookup Map Properties window will open where you can see an option for Source File Type.
Source File Type: Here, you need to specify the type of your lookup file.
You can perform the task using an Excel or Delimited lookup file.
Select the Source File Type from the dropdown menu. In this example, our country codes are stored in an Excel file so we will specify the Source File Type as Excel.
Click Next to proceed to the File Lookup Details window. You will see two options:
File Location: Here, you need to specify the File Path to the lookup source file.
Options:
First Row Contains Header: You can check this option if your lookup file contains a header in the first row. Otherwise, you can leave it unchecked.
Worksheet: If your lookup file contains multiple worksheets, you can select the worksheet you want to use to perform the lookup.
Click Next to proceed to the Layout Builder. Here, you can make changes to the object’s layout by modifying the existing fields or creating new fields.
Once you are done, click Next.
On the next window, you will see various Lookup Options.
If Multiple Values Are Found
Multiple Matches Found Option: This option provides the flexibility to choose the output value if more than one match is found for a single value in your lookup file. The option expands into a drop-down list where you can select one of the following three options:
Return First: Will return the first matched value found.
Return Last: Will return the last value among all the matched values.
Return All: Will return all the values in the lookup file that match a source value.
If Value Is Not Found In The Lookup List:
In case no lookup value is found against a source value, you can choose one of the following three options to be appended to your output.
No Message: There will be no message and the output will be the same as the input value.
Add Error: An error message will appear with the output.
Add Warning: A warning will appear with the output.
If Value Is Not Found In The Lookup List, Assign Value:
If no lookup value is found against a source value, you can assign an output value of your choice.
Assign Source Value: Will return the source value in the output.
Assign Null: Will assign null to your output value.
This Value: You can select this option and assign any value of your choice.
Click Next. This will take you to the Config Parameters window, where you can further configure and define parameters for the File Lookup Transformation object.
Once you have configured the File Lookup Transformation object, click OK.
Map the Country field from source object to the Country field in the File Lookup Transformation object. Now map the Code field from the transformation object to the Country field in the Database Table Destination object.
This is what your dataflow will look like:
Map the remaining fields from the source object to the destination object.
Right-click on the destination object’s header and select Preview Output.
You can see that the country names in the database table have been successfully converted into country codes.
This concludes using the File Lookup Transformation object in Astera.
The Union Transformation object in Astera is used to combine incoming data from two or more inputs into a single output. Its functionality is similar to the UNION operator in a SQL query. It has multiple input nodes and a single output node. It puts together two sets of data irrespective of any repetition that might occur in the datasets. To perform this transformation on two datasets, their cardinality must be the same.
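The combining behaviour can be sketched in Python (field names follow the example in this section; this is an illustration, not Astera’s implementation):
# Hypothetical sketch: append two customer lists into one output, tagging each record's source
marketing = [{"CustomerID": 1, "CompanyName": "Acme"}]
sales = [{"CustomerID": 2, "CompanyName": "Globex"}]

union = (
    [{**r, "Category": "Marketing"} for r in marketing]
    + [{**r, "Category": "Sales"} for r in sales]
)  # duplicate records, if any, are kept

print(union)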
To work with a Union Transformation object, drag-and-drop the Union Transformation object from Toolbox > Transformations > Union onto the dataflow designer.
Map the Customers_Marketing data to Input_1 and Customers_Sales data to Input_2 in the Union Transformation object.
Now, right-click on the Union Transformation object’s header and select Properties.
The first window is the Layout Builder window, where you can customize your layout or modify your fields. You can also provide a default value to be used in place of null values.
Add a new field, name it Category and specify its Data Type as String.
Click Next.
Next is the Union Transformation Properties window, where two input nodes, Input_1 and Input_2, are defined by default. You can rename them if you want, and you can define any number of input nodes based on the number of datasets you want to combine using the Union Transformation object.
Click OK.
Now, map the categories of respective departments from the Variables resource object to the Category field in the Union Transformation object. This is done to identify which department a particular record is coming from.
Now, we have successfully configured the Union Transformation object.
Right-click on the Union Transformation object’s header and select Preview Output.
You can see that the Union Transformation has successfully combined the two datasets into a single, unified dataset.
You can now further transform your dataset or write it to a destination.
This concludes working with the Union Transformation object in Astera.
The Data Cleanse Transformation object is a new addition to Astera's library of transformations. It makes it convenient for business users to cleanse raw data and present it in a more refined, standardized, and enterprise-ready format. Using the Data Cleanse Transformation object, users can strip data of null values and of redundant text and characters, and prepare raw data for transformation, validation, profiling, and record-matching functions.
Now drag the Data Cleanse Transformation object from the Transformations section in the Toolbox and drop it onto the designer.
This is what a Data Cleanse transformation object looks like.
Map data from the source object to the Data Cleanse Transformation object. You can either auto-map the entire data set or map a few fields manually.
Now you have to specify the criteria for data cleansing. Right-click on the Data Cleanse Transformation object and select Properties from the context menu.
This will open a new window where you have to set up the properties for data cleansing. The first window is the Layout Builder window. Here you can customize the layout of your dataset by adding, removing or renaming fields. Once you have created the layout, click Next to proceed to the next window.
This is where you set up the data cleanse criteria for your source data.
You can find various data cleanse options arranged in different sections. Let’s explore them one by one.
The options provided within this category allow you to remove values, spaces, tabs, and line breaks from your data. You can find the following options within this category:
All whitespaces – Removes all whitespaces from the data
Leading and trailing whitespaces – Removes whitespaces preceding and succeeding the values
Tabs and line breaks – Removes tabs and line breaks within source values
Duplicate whitespaces – Removes double spaces from the data
Letters – Removes all letters from the data
Digits – Removes all digits from the data
Punctuation – Removes all punctuation from the data
Specified Character – Removes any specific character from the data
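As a rough, non-Astera illustration of these removal options, the following Python snippet (standard re module; the sample value is made up) applies a few of them to a single string:

```python
import re

value = "  John\tDoe,  42 Main St.\r\n"

no_tabs_breaks   = re.sub(r"[\t\r\n]", "", value)          # Tabs and line breaks
no_dup_spaces    = re.sub(r" {2,}", " ", no_tabs_breaks)   # Duplicate whitespaces
trimmed          = no_dup_spaces.strip()                   # Leading and trailing whitespaces
no_digits        = re.sub(r"\d", "", trimmed)              # Digits
no_punctuation   = re.sub(r"[^\w\s]", "", no_digits)       # Punctuation
no_specific_char = no_punctuation.replace("J", "")         # Specified Character

print(no_specific_char)
```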
As the name suggests, the options within this category allow you to replace null values inside a string or numeric field with a corresponding value – a blank in the case of a string and a zero in the case of a numeric field.
Null strings with blanks: Replaces all null strings with blanks
Null numerics with zeros: Replaces all null numeric values with zeros
The Find and Replace options enable users to replace a value in the source dataset with another value.
It also provides users the option to choose whether the find and replace function should be performed on a case-sensitive basis. You can select a search mode from three options:
Normal – Will perform a normal find and replace function
As in this example, we want to change the status from ‘Planned’ to ‘Scheduled.’
So, we will type in ‘Planned’ in the Find field and ‘Scheduled’ in the Replace field.
Now, if we look at the output, we can see that the Data Cleanse Transformation Object has found and replaced the status values from ‘Planned’ to ‘Scheduled.’
Extended – Allows you to search for tabs (\t), newlines (\r\n), or a character by its value (\o, \x, \b, \d, \t, \n, \r and \) and replace it with the desired value
In the example below, we want to replace whitespaces within our source values with a hyphen (-).
So, we will type ‘\s’ in the Find field and ‘-’ in the Replace field.
Now, if we look at the output, we can see that the Data Cleanse Transformation object has found and replaced whitespaces from within the values with a hyphen.
Preview before applying the “Extended” Find and Replace function.
Preview after applying the “Extended” Find and Replace function.
Regular Expressions – Allows you to find and replace a value based on a regular expression.
In the example below, we want to replace the “ALFKI” value(s) in the CustomerID field with “A1234”.
For this, we will write a regex in the Find field and the desired value in the Replace field.
Now, if we look at the preview, you can see that Astera has replaced values in the source data with the desired values.
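For reference, the three search modes above behave roughly like the following plain Python/regex operations (illustration only; the sample values mirror the examples above):

```python
import re

status = "Planned"
address = "42 Main Street"
customer_id = "ALFKI"

# Normal: literal find and replace.
print(status.replace("Planned", "Scheduled"))      # -> Scheduled

# Extended: escape sequences such as \s for whitespace, replaced with a hyphen.
print(re.sub(r"\s", "-", address))                  # -> 42-Main-Street

# Regular Expressions: match a pattern and replace it with the desired value.
print(re.sub(r"^ALFKI$", "A1234", customer_id))     # -> A1234
```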
Case options allow users to convert the letter case of source data to Upper, Lower, or Title case.
You can choose from the following options:
None – Keeps the letter case as is.
Upper – Changes all letters to upper case.
Lower – Changes all letters to lower case.
Title – Changes all letters to title case.
The Modify Data option provides you the flexibility and convenience of applying an expression to all fields in your data. Check the Run expression on all fields option to activate this feature.
The Run Expression on all fields feature was previously called ApplyToAll and was offered as a standalone transformation in Astera 7.5 and earlier releases. However, its functionality was limited compared to the all-new Data Cleanse Transformation object, which is why it was replaced altogether by the Data Cleanse Transformation object in Astera 7.6 and, now, in Astera 8.0.
The Run Expression on all fields feature is enabled by default for any existing flows created prior to Astera 7.6. This means that existing flows created in Astera 7.5 or an earlier release will continue to work seamlessly on 8.0 or later upgrades and won't require any modification at all.
Here, you can choose from the extensive library of built-in expressions and apply it to all mapped fields by adding it to a “$FieldValue” parameter.
As in this example, we have mapped a regular expression to the “$FieldValue” parameter.
Now if we look at the preview, you can see that Astera has applied the regular expression to all fields and removed whitespaces from the values.
Preview before running the expression on all fields:
Preview after running the expression on all fields:
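Conceptually, running one expression on all fields is similar to the following Python sketch (pandas assumed; field names and values are hypothetical), where the same whitespace-removing expression is applied to every mapped column, much like the $FieldValue parameter:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":    [" Ana  Trujillo ", " Antonio  Moreno "],
    "Country": [" Mexico ",        " Mexico "],
})

# The "expression" removes all whitespace from a field value, mirroring the
# regular expression used in the example above.
expression = lambda column: column.str.replace(r"\s+", "", regex=True)

# Apply the same expression to every field (column) in the dataset.
cleansed = df.apply(expression)
print(cleansed)
```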
This function was previously performed using the ApplyToAll transformation in Astera 7.5 and previous releases. However, now you can perform this and other data cleanse tasks using the Data Cleanse Transformation object.
Astera provides an array of source options to read and extract data from. Different source objects can be found in Toolbox > Sources.
In this article, we will discuss:
How various sources in dataflows can be used as a transformation.
Some common scenarios where you could use a source as a transformation.
While the basic function of source objects in dataflows is to extract data and bring it to the designer for further integration, a source object can also be used as a transformation function.
To use a source object as a transformation, you will need to:
Select the relevant source object from Toolbox > Sources and drag-and-drop it onto the designer.
Right-click on the source header and select Transformation from the context menu.
As soon as the Transformation option is selected from the context menu, the header color of the source object will change from green to purple. This is because, by default, Source objects in Astera's dataflows are indicated by a green header and Transformation objects are indicated by a purple header. Hence, the change in color.
Listed below are the source objects that can be used as a transformation:
Generally, source objects are used as transformations when the source file path is dynamic.
In the next section of the article, we will discuss how to use a Delimited File Source object as a transformation.
A Delimited File Source object can be used as a transformation when it takes a dynamic file path, that is, when multiple files with the same layout are processed in a single dataflow or workflow.
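Conceptually, this is what the source-as-transformation pattern achieves; the sketch below (plain Python, with a hypothetical folder path and field name) reads every delimited file in a folder with the same layout and processes each one in turn:

```python
import csv
import glob

# The file path is dynamic: each file found in the folder is fed to the same reader.
for file_path in glob.glob(r"C:\Data\Orders\*.csv"):
    with open(file_path, newline="") as f:
        reader = csv.DictReader(f)                 # same delimited layout for every file
        for record in reader:
            print(file_path, record["OrderID"])    # downstream mapping/processing
```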
Drag-and-drop the Delimited File Source object onto the designer.
Go to the object’s properties and provide the File Path for the delimited source file.
Once you have provided the File Path and configured the properties of the source object, click OK. Now, right-click on the header and select Transformation from the context menu, to change it to a Transformation object.
The header of the Delimited Source object will change to purple indicating that the source object is now converted into a Transformation object.
The transformed DelimitedSourceTrans object will now have two nodes:
Input node: To map the file path of the folder that contains delimited files that are to be processed.
Output node: On expanding the Output node, you will see the data fields in the delimited source file. Map these fields to other objects in a dataflow through the output mapping ports.
Use the File System Item Source to pass the file path information in the input node of the Delimited source-transformation object. Drag-and-drop it from the Toolbox > Sources section.
In the File System Source Properties, point the path to the directory and folder where the delimited files are located.
Map the FullPath field from FileSystem to the DelimitedSource object’s input node (FilePath).
Now our Delimited Source Transformation object is ready. To preview the data, right-click on the DelimitedSourceTrans object and select Preview Output.
Once you select Preview Output, you will be able to view the data in the Data Preview pane. The data that you see in the above image is in its condensed preview format. Click on the + icon right next to the root node of the DelimitedSourceTran object to expand the node and preview your data.
You now have an expanded version of your data:
Root Node: Object Path – DelimitedSourceTran
Sub Node:
Input: Displays the path of the file that is being used as the input for this data.
Output: Displays the fields in the source data.
This is how you use a Delimited File Source as a transformation.
Next, we will see how to use the Excel Workbook Source object as a transformation.
The Excel Workbook Source can be used as a transformation when you have multiple Excel files with the same layout, and want to process them together in a dataflow or workflow.
Drag-and-drop the Excel Workbook Source object onto the designer.
Go to the object’s properties and provide the File Path for the Excel source file.
The header of the ExcelSource object will change to purple indicating that the ExcelSource object is now converted to a transformation object.
The transformed ExcelSource object will now have two nodes:
Input node:
FilePath: To map the path of the folder that contains the Excel files that are to be processed.
Output node: On expanding this node, you will be able to see the data fields in the Excel source file. Map these fields to other objects in the dataflow through the output mapping ports.
Use the File System Item Source to pass the file path information in the input node of the Excel source-transformation object. Drag-and-drop it from the Toolbox > Sources section.
In the File System Source Properties, provide the path of the directory and folder where the Excel files are located.
Map the FullPath field from FileSystem to the ExcelSource object’s Input node (FilePath).
Map the Value field from ConstantValue to the ExcelSource object’s Input node (Worksheet).
Now our Excel Source transformation object is ready. To preview the data, right-click on the ExcelSourceTrans object and select Preview Output.
On selecting Preview Output, you will be able to view the data in the Data Preview pane. The data that you see in the above image is in its condensed preview format. Click on the + icon right next to the root node, ExcelSourceTran, to expand the node and preview your data.
You will see the following nodes:
Root Node: Object Path – ExcelSourceTran
Sub Node:
Input: Gives the file path of the file that is being used as the input for this data.
Output: Displays the fields in the source data.
This is how you use an Excel Workbook Source as a transformation.
Now we will discuss how to use an XML/JSON File Source as a transformation in Astera.
The XmlJson File Source object can be used as a transformation when you have multiple XML or JSON files with the same layout, and want to process them in a dataflow or a workflow.
Drag-and-drop the XML/JSON File Source object onto the designer.
Go to the object’s properties and provide the File Path for the XML source file and its schema.
The header of the XmlJsonSource object will change to purple indicating the conversion from a source object to a transformation object.
The transformed XmlJsonSource object will now have two nodes:
Input node: To map the file path of the folder that contains XmlJson files that are to be processed.
Output node: Once expanded, you will be able to see the data fields that are in the XmlJson source file. You can map these fields to other objects in a dataflow through the output mapping ports.
Use the File System Item Source to pass the file path information in the input node of the XmlJson source-transformation object. Drag-and-drop it from the Toolbox > Sources section.
In the File System Source Properties, provide the Path of the directory and folder where the XML/JSON files are located.
Map the FullPath field from FileSystem to XmlJsonSource object’s Input node (FilePath).
Now our XmlJson source transformation object is ready. To preview the data, right-click on the XmlJsonSourceTrans object and select Preview Output.
On selecting Preview Output, you will be able to view the data in the Data Preview pane. The data that you see in the above image is in its condensed form. To expand the data and preview your output, click on the + icon right next to the root node – XmlJsonSourceTran.
You now have an expanded version of your data:
Root Node: Object Path – XmlJsonSourceTran
Sub Node:
Input: Gives the file path of the file that is used as the input for this data.
Output: Displays the fields in the source data.
This is how you use an XmlJson File Source as a transformation.
In the next section of the article, we will discuss how to use Report Source as a transformation in dataflows.
The Report Source object can be used as a transformation when you have multiple report files with the same report model layout and want to process them in a dataflow or a workflow.
Drag-and-drop the Report Source object onto the designer.
Go to the properties and provide the File Path for the report source and its report model.
The header of the ReportSource object will change to purple, indicating the conversion from a source object to a transformation object.
The transformed ReportSource object will now have two nodes:
Input node: Map the file path of the folder that contains report files that are to be processed.
Output node: When expanded, you will be able to see the data fields that are in the report source file. You can map these fields to other objects in the dataflow through the output mapping ports.
Use the File System Item Source to pass the file path information in the input node of the Report source-transformation object. Drag-and-drop it from the Toolbox > Sources section.
In the File System Source Properties, provide the path of the directory folder where report files are located.
Map the FullPath field from FileSystem to the ReportModel object’s Input node (FilePath).
Now our Report Source Transformation object is ready. To preview the data, right-click on the report source object and select Preview Output.
On selecting Preview Output, you will be able to view the data in the Data Preview pane. The data that you see in the above image is in its condensed form. To expand the data and preview your output, click on the + icon right next to the root node – ReportModelTran. To expand the data further, click on the + icon right next to the sub node – Output.
You now have an expanded version of your data:
Root Node: Object Path – ReportModelTrans
Sub Node:
Input: Gives the file path of the file that is being used as the input for this data.
Output: On further expansion it will show the fields/data that is there in the report model.
This is how you use the Report Source object as a transformation object.
Drag-and-drop the appropriate source objects and point them towards the files that you want to reconcile. In this example, we will be working with an
Drag-and-drop the relevant source objects from the Toolbox to the designer. (Click here to find how to .)
Note: We have the Orders table as our source from a . We will map the fields OrderDate, RequiredDate, ShippedDate, ShipVia and Freight to an object.
Click the icon to create a new rule.
In the Expression input, enter an expression for the rule. For example, LTV > 60 and LTV <= 80, or any rule or condition you want to apply to your data. Alternatively, you can click on the button to open the Expression Builder window - a tool that allows you to visually build your rule using Record tree and IntelliSense.
Add other Route rules if necessary. To delete an existing Route rule, select it and click the icon.
In this case, we are using an .
Right-click on the collection node, Orders_LateArrivals, and go to Write to > Excel Workbook Destination. An object will be added to the dataflow designer with auto-mapped fields from the collection field.
To learn how you can configure an Excel Workbook Source object, click
In this case, we are using the to extract our source data. You can download this sample data from here.
Add a new rule by clicking on the Add switch condition icon .
Note: In this case we will write the data to an .
Or you can expand the dropdown located in the main toolbar and select Subflow as shown below.
To know more about writing to an XML File Destination object, click .
In this example, we have a customers data from two different departments: Sales and Marketing, stored in two separate files. We want to combine this data into a single dataset using a Union Transformation object. To keep track of records coming in from each department, we will also add a new field, Category, in the layout of the Union Transformation object and pass the value using a Variables object.
Retrieve the data you want to cleanse using the relevant Source object. (Click to learn more about setting up Sources.)
Now click on this button to open the Expression Builder.
Look for the supported sources and data providers
For a detailed overview of different source objects in Astera's dataflows, see
Transformations in dataflows are used to perform a variety of operations on data as it moves through the dataflow pipeline. Astera provides an extensive library of built-in transformations enabling you to cleanse, convert, and transform data as per your business needs. Transformations can be found in Toolbox > Transformations. For a detailed review on transformations, see .
Once you have provided the file path and configured the properties of the object, click OK. Now right-click on the header and select Transformation from the context menu, to change it into a transformation object.
Worksheet: This option can be used when you have more than one worksheet in an Excel source file and want to use a particular worksheet in the dataflow/workflow. This can be done by specifying the worksheet name using a Constant Value object, which you can find in Toolbox > Transformations > Constant Value.
Once you’ve provided both paths and configured the source object, click OK. Now, right-click on the header and select Transformation from the context menu to change it into a transformation object.
Once you’ve provided both the paths and configured the properties of the object, click OK. Now right-click on the header and select Transformation from the context menu, to change it to a transformation object.
The SQL Statement Lookup object in Astera is used to look up certain values that are mapped to it from a source object. It uses an SQL statement to access a table that contains the lookup values and their corresponding output values. Once the lookup is performed, the SQL Statement Lookup object returns either a single or multiple output fields, depending on the nature of the lookup table. Similarly, the lookup can be performed based on one lookup field or multiple lookup fields. When the incoming values match the lookup values, the output field or fields for those particular records are returned by the SQL Statement Lookup object.
In this use case, we will read data from the Customers table in the Northwind database using a Database Table Source object. This table contains customer information from a fictitious organization and will serve as the source table. Our purpose is to use an SQL Statement Lookup object to find some information about the orders placed by customers. This data is stored in a separate table called Orders, which will serve as the lookup table.
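Before walking through the steps, the sketch below shows, outside of Astera, roughly what this lookup produces (pandas assumed; the order values are sample data): Customers records joined with all matching Orders records on CustomerID.

```python
import pandas as pd

customers = pd.DataFrame({"CustomerID": ["ALFKI", "ANATR"],
                          "CompanyName": ["Alfreds Futterkiste", "Ana Trujillo"]})
orders = pd.DataFrame({"OrderID": [10643, 10692, 10308],
                       "CustomerID": ["ALFKI", "ALFKI", "ANATR"]})

# Left join on the lookup field; every matching order is appended to its
# customer, similar to the "Return All" option chosen later in this use case.
looked_up = customers.merge(orders, on="CustomerID", how="left")
print(looked_up)
```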
Drag-and-drop the Database Table Source object from Toolbox > Sources > Database Table Source onto the dataflow designer. Configure the object so that it reads data from the Customers table.
Now, drag-and-drop the SQL Statement Lookup Transformation object from Toolbox > Transformations > SQL Statement Lookup onto the dataflow designer, next to the source object.
Right-click on the header of the SQL Statement Lookup object and select Properties from the context menu.
This will open a new window.
Here, we need to configure the properties of the SQL Statement Lookup object.
In the Database Connection window, enter details for the database you wish to connect to.
Use the Data Provider drop-down list to specify which database provider you wish to connect to. The required credentials will vary according to your chosen provider.
Alternatively, use the Recently Used drop-down list to connect to a recently connected database.
Test Connection to ensure that you have successfully connected to the database. A separate window will appear, showing whether your test is successful. When the connection has been successfully established, close it by clicking OK, and then click Next.
The next window will present a blank space for you to write an SQL statement. Here, you can enter any valid SELECT statement or stored procedure to read any table from the database that was specified earlier. This table will serve as the lookup table.
In this case, we will be reading data from the Orders table.
Enter the SQL statement and click OK. This will take you back to the dataflow designer.
As you can see, the SQL Statement Lookup object has been populated with all the fields present in the Orders table.
The next step is to choose an incoming field or multiple incoming fields from the source object, based on which the lookup action will be performed. This field needs to be mapped to the transformation object.
In this case, we can clearly see that CustomerID is a common element between the two tables. Hence, this field will be used to perform the lookup. It will be mapped from the Database Table Source object to the SQL Statement Lookup object as a new member.
Right-click on the transformation object’s header and select Properties to open the Properties window. Keep clicking Next until you reach the Layout Builder window. Here, you can customize the layout by modifying the existing fields or creating new fields.
Once you are done, click Next.
On the next window, you can define one or more lookup conditions. These conditions will determine what values are returned when the lookup is complete.
You will have to make appropriate selections from three drop-down lists:
Database Element Name: This list contains all the elements present in the SQL Lookup object. Select the element that you wish to use as a lookup field. In this case, it is CustomerID.
Operator: This list contains a set of operators that are used to define the condition. In this case, we will be using the ‘equals to’ operator because the lookup value is supposed to match the incoming value.
Input Element: This list contains the elements that have been mapped to the lookup object. In this case, the only input element available is CustomerID from the Customers table.
Once you are done defining the condition, click Next.
The next window will allow you to choose a Lookup Caching Type. The following options are available:
No Caching: No data will be stored in cache. This option is selected by default.
Static: The lookup values are stored in a cache. Once the cache is created, the lookup object will always query the cache instead of the lookup table. When you select this option, the following sub-options are enabled:
Fill Cache With All Lookup Values at Start: Fills the cache with all of the lookup values at the start and continues to use this cache for every lookup.
Cache After First Use: Uses the database table for the first lookup and fills the cache right after it is done. This cache is then used for every subsequent lookup. Checking this option enables another sub-option:
Cache Commit Count: Defines the number of records collected per cache chunk before they are committed to the cache.
Persistent: Saves the lookup values in a cache file that can be reused for future lookups. When you choose this option, the following sub-options are enabled:
Rebuild Persistent Cache on Next Run: Checking this option will allow the contents of the cache file to be modified after every run.
Cache File Name: Here, you can enter a name for your cache file.
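As a minimal sketch of the "cache after first use" idea (plain Python; query_database is a hypothetical callable standing in for the lookup table query), the first lookup for a key queries the database and later lookups for the same key are served from the cache:

```python
cache = {}

def lookup(customer_id, query_database):
    if customer_id not in cache:                  # first use: query the lookup table
        cache[customer_id] = query_database(customer_id)
    return cache[customer_id]                     # subsequent uses: served from the cache
```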
In this case, we will select the No Caching option. Once you are done, click Next.
On the next window, you will see multiple lookup options.
The page provides a set of options for different scenarios that could be faced during a lookup.
If Multiple Values Are Found
Multiple Matches Found Option: This option provides the flexibility to choose the output value if more than one match is found for a single value in the lookup table. You can select one of the three options that appear in the drop-down list:
Return First: Returns the first matched value.
Return Last: Returns the last value among all matched values.
Return All: Returns all the matched values.
If Value Is Not Found In the Lookup List
If no lookup values are found for a source value, you can choose from the following options to be appended with the output:
No Message: The output value will be the same as the input value and no message will appear with it.
Add Error: An error message will appear with the output.
Add Warning: A warning message will appear with the output.
If Value Is Not Found in the Lookup List, Assign Value
If no lookup value is found for a source value, you can assign an output value of your choice.
Assign Source Value: Returns the source value in the output.
Assign Null: Returns null in the output.
This Value: Allows you to enter any value that will be returned in the output.
In this case, we want to look up the details for all of the orders placed by every customer. Hence, we will select Return All from the drop-down list in the Multiple Matches Found Option. This will automatically disable the rest of the options available on the screen.
Once you are done choosing the option, click Next.
On the next window, you can define certain parameters for the SQL Statement Lookup object.
These parameters facilitate an easier deployment of flows by excluding hardcoded values and providing a more convenient method of configuration. If left blank, they will assume the default values that were initially assigned to them.
In this case, we will be leaving them blank. Click Next.
On the last window, you will be provided with a text box to add Comments. The General Options in this window have been disabled.
You are now done configuring the SQL Statement Lookup object. Click OK.
Right-click on the SQL Lookup object’s header and select Preview Output.
You will be able to see the following results:
Scroll down the Data Preview window to see the rest of the results.
The SQL Statement Lookup object has successfully returned the details for the orders placed by every customer in the Customers table (Source table) by comparing the CustomerID to its counterpart in the Orders table (lookup table).
This concludes using the SQL Statement Lookup Transformation object in Astera.
Astera Data Stack introduces an innovative AI Matching feature which leverages the power of Artificial Intelligence to perform intelligent matching. This feature works based on semantic similarity, ensuring more accurate and comprehensive matching results.
In Astera Data Stack, the AI Match object can be found in the Toolbox and can be used within the scope of the Dataflow.
For our use case, we have a Customers dataset from the sales department as shown below:
We want to replace the values in the Country column of the sales dataset by semantically matching them with Country values from the Customers dataset provided by the marketing team, ensuring both departments follow a unified naming standard.
To get started, let’s drag-and-drop an Excel Workbook Source object and configure it with the customers dataset provided by the sales department.
Next, drag-and-drop the AI Match object from the Toolbox onto the Dataflow and auto-map the fields from the Excel Workbook Source onto the AI Match object.
Once all the fields have been mapped, right-click on the AI Match object and select Properties from the context menu.
This will open the Layout Builder screen, which shows the layout of the incoming dataset. Click Next.
The AIMatch Transformation Properties screen will open. Let’s configure these properties.
File Path: This is where we provide the path of the file on the basis of which we want to perform our semantic matching.
Worksheet: This is where we can define which Excel sheet data to use if there are multiple sheets.
Lookup Field: This is the field based on which we are performing the lookup.
Incoming Field: This lets us define the lookup field from the incoming dataset.
For our use case, let’s select the Country field for both.
Once done, click OK and right-click on the AI Match object to preview its output.
As you can see below, the values in the Country field have been semantically matched and replaced from the file using AI. We can also see that, since the country Pakistan did not have a matching value in the marketing dataset, it hasn’t been replaced.
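As a loose illustration of semantic matching in general (this is not Astera's actual AI pipeline), the sketch below compares a source value against the lookup values using vector embeddings and cosine similarity; embed() is a hypothetical function that returns an embedding vector for a string, and the threshold is arbitrary:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_match(value, lookup_values, embed, threshold=0.8):
    value_vec = embed(value)
    scored = [(lv, cosine(value_vec, embed(lv))) for lv in lookup_values]
    best, score = max(scored, key=lambda pair: pair[1])
    # If nothing is similar enough (e.g. "Pakistan" with no counterpart in the
    # lookup file), keep the original value, mirroring the behaviour shown above.
    return best if score >= threshold else value
```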
Now, let’s drag-and-drop a Database Table Destination object and map the matched data onto it.
Running this Dataflow will write the data to the destination table.
This concludes the working of the AI Match object in Astera Data Stack.
Astera’s Delimited File Destination provides the functionality to write data to a delimited file. The Delimited File Destination gives you the ability to control the structure and content of the file, including numeric, date, and Boolean formats, encodings, text qualifiers (quotes), and character sets. You can choose to create a new file or append data to an existing file.
To get a Delimited File Destination object from the Toolbox, go to Toolbox > Destinations > Delimited File Destination. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Delimited File Destination object onto the designer.
The dragged destination object is empty right now. This is because the object has not been configured yet.
For a Delimited Destination object to work, data fields should be mapped to the object so that the mapped data can be written to the destination.
Configure the source object and place it onto the designer next to the Delimited File Destination object.
Note: In this case a Customers sample table has been used from a Database Table Source to write to the Delimited File Destination.
Now map the source object to the destination object by dragging the source parent node onto the destination parent node.
You can also directly write the source layout to a Delimited File Destination through the source context menu of its parent node.
The fields are now mapped.
To configure the Delimited File Destination object, right-click on the header and select Properties from the context menu. A dialog box will open.
Provide the File Path. This is where the delimited destination file will be saved.
The dialog box has some other configuration options. Let’s go over these options:
Options:
Check the File Contains Header box if you want the destination file to include a header row.
Field Delimiter - Allows you to select a delimiter from the drop-down list for the fields.
Record Delimiter - Allows you to select the delimiter for the records. The available choices are a carriage-return and line-feed combination, carriage-return, or line-feed. You can also type a record delimiter of your choice instead of choosing one of the available options.
Encoding - Allows you to choose the encoding scheme for the delimited file from a list of choices. The default value is Unicode (UTF-8).
A Text Qualifier is a symbol that identifies where text begins and ends. It is used specifically when importing data.
Apply Text Qualifier to all Fields will add the specified qualifier to all the fields that have been mapped
Say the file will later be imported as a comma-delimited text file (commas separate the different fields that will be placed in adjacent cells). If a field value itself contains a comma, the text qualifier marks where that value begins and ends so that the embedded comma is not treated as a delimiter.
Use Null Text to specify a certain value that you do not want in your data; it will be replaced by a null value.
Choose Append to File (If Exists) to append to an existing file or create a new file. Creating a new file will overwrite any existing file data.
Check on Hierarchical Destination when the data in the source file needs to be sorted into hierarchies in the destination file.
Check on Write to Multiple Files for the data to be saved to multiple files instead of one single file. This can be done within a single dataflow through the destination object and supporting transformations.
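For reference, the sketch below shows the same ideas (field delimiter, header row, a text qualifier applied to all fields, and append versus overwrite) using Python's standard csv module; the path and records are hypothetical:

```python
import csv

records = [{"CustomerID": "ALFKI", "City": "Berlin"},
           {"CustomerID": "ANATR", "City": "Mexico D.F."}]

append = False                                    # Append to File (If Exists)
mode = "a" if append else "w"                     # "w" overwrites any existing file

with open(r"C:\Data\Customers.csv", mode, newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["CustomerID", "City"],
        delimiter=",",                            # Field Delimiter
        quoting=csv.QUOTE_ALL,                    # apply text qualifier to all fields
        quotechar='"',
    )
    if not append:
        writer.writeheader()                      # write the header row
    writer.writerows(records)
```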
Once the data reading options have been specified in this window, click Next.
The next window is the Layout Builder. On this window, the layout of the delimited destination file can be modified.
To add a new field to the layout, go to the last, blank row of the layout (Name column) and double-click on it; a blinking text cursor will appear. Type in the name of the field to be added and select its properties. A new field will be added to the layout.
To delete a field from the layout, click on the serial column of the row that is to be deleted. The selected row will be highlighted in blue.
Right-click on the highlighted line; a context menu will appear with the option to Delete.
Selecting Delete will delete the entire row.
The field is now deleted from the layout and will not appear in the output.
To change the position of any field and move it below or above another field in the layout, select the row and use the Move up/Move down buttons.
For example: To move the Country field right below the Region field, select the row and use the Move up button to move it from the 9th row to the 8th row.
Once the object layout is configured, click Next. A new window will appear, Config Parameters, which allows you to further configure and define parameters for the delimited destination file.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
Parameters left blank will use their default values assigned on the properties page.
Next a General Options window will appear. In this window:
Comments can be added.
General Options are given, which relate to processing of records in the destination file.
Click OK.
The DelimitedDest object is now configured according to the changes that were made in the properties window.
The Delimited File Destination object is successfully configured, and the destination file can now be created by running the dataflow.
The Database Table Destination object in Astera provides the functionality to write data to a database table. This destination option provides a great deal of control over how data is written to a database table with its extended data loading options.
Astera supports a wide range of on-premise and cloud-based databases including SQL Server, Oracle, DB2, Sybase, MySQL, Salesforce, Microsoft Dynamics CRM, and more. Astera delivers highly-optimized implementations for these database connectors including high-performance bulk insert, set-based updates and transaction management. This, combined with Astera’s parallel-processing architecture, delivers industrial-strength performance and scalability.
To add a Database Table Destination object to your dataflow, go to Toolbox > Destinations > Database Table Destination. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Database Table Destination object onto the designer.
The Database Table Destination object is empty right now, meaning it does not have any fields or mappings. This is because the object has not been configured yet. There are two empty sub-nodes, Input and Output, under the DatabaseDest root node.
To configure the properties of the Database Table Destination object, right-click on the header and select Properties from the context menu.
This will open a new window, Database Connection, in Astera.
First, you will need to select the relevant data provider from the Data Provider drop-down list.
This is where you select the specific database provider you want to connect to. For instance, if you want to write your data to an SQL Server database, select SQL Server from the list. The connection details will vary according to the data provider selected.
Test Connection to make sure that your database connection is successful and click Next.
Now, you need to provide details to configure a connection with the destination database.
Enter your User ID and Password. You can also use the Recently Used drop-down list to connect to a recently-connected database.
The next window is the Pick Table window. Here, you can choose from the following options:
Pick Table: To append data into an existing table.
Create/Replace: To write data to a new table or replace an existing table.
Truncate Table: To overwrite data in an existing table.
Choose the option based on your requirement. In this case we will select the Create/Replace Table option and create a new table in the database.
For a database destination object to work, data fields should be mapped to the object so that the mapped data can be written to the destination.
Configure the source object and place it onto the designer next to the Database Table Destination object.
Map the source object to the destination object. Data mapping from source to the destination can be done in the following ways:
By dragging and dropping the parent node from the source object onto the destination object.
By mapping the output port of the source object onto the input port of the destination object.
By right-clicking on the parent node inside the source object and selecting Write to > Database Table Destination from the context menu.
The fields are now mapped.
The Pick Table window has some other configuration options.
Define Input Ports for Mapping
Single Port: Works only for the records that have been updated and won’t treat records individually.
Individual Ports for Actions: Works for all the records individually as per the selected action. The actions that are provided are: Insert, Delete, Update and Upsert.
Database Options
Use constraint-based write: Use this when the layout has certain constraints that you want to enforce while writing.
Preserve system generated key values: To generate unique values for the selected primary key in the dataset. This option is only available if you assign at least one field in your destination layout as the System Generated field.
Use transaction
Always commit transaction on completion: When you want the whole transaction to be processed regardless of errors.
Rollback if there are any errors: When you don’t want the dataset to process in case of errors and roll back completely.
Check field lengths: Compares the lengths of incoming values with the lengths defined in the destination layout.
Check for null values: Checks the incoming dataset for null values.
Write null strings as zero length strings: Where string values are null, they are written as zero-length (empty) strings.
Write null numeric values as zeros: For numeric data types, null values are written as zeros.
Disable indexes during load: Disables indexes while loading; intended for lengthy processing.
Data Load Options
Bulk insert with batch size when you want the whole dataset to be loaded in batches of the specified size. Typically, larger batch sizes result in better transfer speeds; however, performance gains may be smaller with very large batch sizes.
Bulk insert with all records in one batch when you want all the records to be loaded into a table in one batch. In this case, any database-specific error in your transfer won’t show until the end of the transfer.
Use single record insert when you want records to be loaded individually. Records are inserted into a destination table one-by-one. This loading option renders the slowest performance among the three insert types. However, any errors or warnings during the transfer are displayed immediately as the transfer progresses.
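The difference between these loading strategies can be sketched as follows (sqlite3 is used purely for illustration; the table and rows are hypothetical):

```python
import sqlite3

rows = [("ALFKI", "Berlin"), ("ANATR", "Mexico D.F."), ("ANTON", "Mexico D.F.")]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID TEXT, City TEXT)")

batch_size = 2
# Bulk insert with batch size: load the dataset in fixed-size chunks.
for start in range(0, len(rows), batch_size):
    conn.executemany("INSERT INTO Customers VALUES (?, ?)", rows[start:start + batch_size])
    conn.commit()

# Bulk insert with all records in one batch: errors surface only at the end.
# conn.executemany("INSERT INTO Customers VALUES (?, ?)", rows); conn.commit()

# Single record insert: slowest option, but errors are reported row by row.
# for row in rows:
#     conn.execute("INSERT INTO Customers VALUES (?, ?)", row); conn.commit()
```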
Bulk Copy Options
Use Internal Transaction: When specified, each batch of the bulk-copy operation will occur within a transaction.
Fire Triggers: When specified, it will cause the server to fire the insert triggers for rows being inserted into the database.
Keep Nulls: Preserves null values in the destination table regardless of the settings for default values. When not specified, null values are replaced by default values where applicable.
Table Lock: Obtain a bulk update lock for the duration of the bulk copy operation. When not specified, row locks are used.
Check Constraints: Check constraints while data is being inserted. By default, constraints are not checked.
Keep Identity: Preserve source identity values. When not specified, identity values are assigned by the destination.
Default: Use the default values for all options.
Parallel Writing is used when you want to expedite the data loading process by increasing the number of writers for that dataset.
Once you have specified your options on this screen, click Next.
The next window you will see is the Layout Builder. Here, the layout of the database destination file can be modified.
To add a new field to the layout, go to the last, blank row of the layout (Name column) and double-click on it; a blinking text cursor will appear. Type in the name of the field to be added and select its properties. A new field will be added to the destination table’s layout.
To delete a field from the layout, click on the serial column of the row that is to be deleted. The selected row will be highlighted in blue.
Right-click on the highlighted line, a context menu will appear which will have the option to Delete.
Selecting Delete will delete the entire row.
The field is now deleted from the layout and won’t appear in the output.
To change the position of any field and move it below or above another field in the layout, select the row and use Move up/Move down buttons.
For example: To move the Country field right below the Region field, select the row and click the Move up button in the toolbar at the top, to move the field up from the 9th to the 8th position.
Once the object layout is configured, click Next. This will take you to the Config Parameters window where you can further configure and define parameters for the database destination file.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change during the runtime.
Click Next. A General Options window will appear. Here you have the following options:
Comments can be added.
General Options are given, which relate to the processing of records in the destination file.
Clear Incoming Record Messages: Clears any messages coming in from objects preceding the current object.
Do Not Process Records With Errors: Prevents erroneous records from being processed further for the output.
Do Not Overwrite Default Values with Nulls: Ensures that default values are not overwritten with null values in the output.
Click OK.
The DatabaseDest object is now configured according to the settings made in the properties window.
The Database Table Destination object is now successfully configured, and the destination file can now be created by running the dataflow.
The job can be traced through the Job Progress window once the job starts running.
The Excel Workbook Destination object in Astera provides the functionality to write data to Microsoft Excel workbooks. Note that Microsoft Excel does not need to be installed on the machine for the Excel destination object in Astera to work. This feature gives you the option to specify the worksheet and the starting cell where the data write begins.
To get the object from the Toolbox, go to Toolbox > Destinations > Excel Workbook Destination. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Excel Workbook Destination object onto the designer.
The dragged destination object is empty right now. This is because the data fields are not mapped to it yet. In this case, we will use a simple source to excel destination mapping scenario as an example.
Configure the source object and place it onto the designer next to the Excel Workbook Destination object.
Note: We are using a sample table containing Customers data from an SQL database.
Now map the data fields from the source object to the destination object. Mapping can be done in the following ways:
i. By dragging and dropping the parent node of the source object onto the parent node of the destination object for auto-mapping the layout.
ii. By creating a map from the source parent node to the destination parent node.
iii. By directly writing the fields in the source layout to an Excel Destination through the source context menu of its parent node.
The fields are now mapped.
To configure the Excel Workbook Destination object, right-click on the header, select Properties from the context menu and a dialog box will open.
Provide the File Path. This is where the Excel destination file will be saved.
The dialog box has some other configuration options. Let’s go over these options:
Options:
Check the First Row Contains Header option if you want the first row of the destination file to contain the field headers.
The Worksheet field can be used to specify the name of a worksheet for either overwriting the data in an already existing worksheet or adding a new worksheet.
Choose Append to File (If Exists) to append to an existing file or create a new file. Creating a new file will overwrite any existing file.
Check on Write to Multiple Files for the data to be saved to multiple files instead of one single file. This can be done within a single dataflow through the destination object and supporting transformations.
Once the data reading options have been specified on this screen, click Next.
The next window is the Layout Builder. On this window, the layout of the Excel destination file can be modified.
To add a new field to the layout, go to the last, blank row of the layout (Name column) and double-click on it; a blinking text cursor will appear. Type in the name of the field to be added and select its properties. A new field will be added to the layout.
To delete a field from the layout, click on the serial column of the row that is to be deleted. The selected row will be highlighted in blue.
Right-click on the highlighted line, a context menu will appear which will have the option to Delete.
Selecting Delete will delete the entire row.
The field is now deleted from the layout and won’t appear in the output.
To change the position of any field and move it below or above another field in the layout, select the row and use the Move up/Move down buttons.
For example: To move the Country field right below the Region field, select the row and use the Move up button to move it from the 9th row to the 8th row.
The row is now moved from the 9th position to the 8th position.
Once the object layout is configured, click Next. A new window will appear, Config Parameters, which allows you to further configure and define parameters for the Excel destination file.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
A General Options window will appear. On this window:
Comments can be added.
General Options are given, which relate to processing of records in the destination file.
Click OK.
The ExcelDest object is now configured according to the changes that were made in the properties window.
The Excel Workbook Destination object is successfully configured, and the destination file can now be created by running the dataflow.
The Database Lookup object in Astera is used to look up values from a source. It uses a database table that contains the lookup values as well as a set of corresponding output values. When the lookup is performed, the object returns either a single output field or multiple output fields, depending on the nature of the lookup table. Similarly, the lookup can be performed based on one lookup field or multiple lookup fields. In each case, the output field or fields are returned from the records in which the lookup values match the incoming values.
In this use case, we have a sample Customers dataset that is stored in a database table. Within this dataset, there is a field that contains the country of residence for each customer. We have another database table that contains all of these countries and their corresponding codes. Our goal is to replace the full country names with codes while writing the customer dataset to an Excel file. To do this, we will use a Database Lookup object.
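As a quick, non-Astera illustration of the end result (pandas assumed; the codes shown are sample values), the lookup replaces each country name with its code and leaves unmatched names as null, which matches the Assign Null option chosen later in this walkthrough:

```python
import pandas as pd

customers = pd.DataFrame({"ContactName": ["Maria Anders", "Ana Trujillo"],
                          "Country": ["Germany", "Mexico"]})
country_codes = {"Germany": "DE", "Mexico": "MX"}     # stand-in for CountryCodeLookup

# Replace full country names with codes; unmatched names become null (NaN).
customers["Country"] = customers["Country"].map(country_codes)
# customers.to_excel(r"C:\Data\Customers.xlsx", index=False)   # Excel destination (requires openpyxl)
print(customers)
```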
Drag the relevant source object from the Toolbox and drop it onto the designer. In this case, we will select the Database Table Source object from Toolbox > Sources > Database Table Source and configure it so that it reads data from the Customers dataset.
To preview the data, right-click on the object header and select Preview Output from the context menu. Here, you can see that there is a field that contains each customer’s country of residence.
Drag the Database Lookup object from Toolbox > Transformations > Database Lookup and drop it onto the designer.
Right-click on the header of the Database Lookup object and select Properties from the context menu.
This will open a new window on your screen.
Here, you are required to configure the properties for the Database Lookup object.
On the Database Connection window, enter the details for the database you wish to connect to.
Use the Data Provider drop-down list to specify which database provider you wish to connect to. The required credentials will vary according to your chosen provider.
Provide the required credentials. Alternatively, use the Recently Used drop-down list to connect to a recently connected database.
Test Connection to ensure that you have successfully connected to the database. A new window will open, showing whether your test is successful or has ended in an error. When the connection has been successfully established, close it by clicking OK, and then click Next.
The next window is the Database Lookup Map Properties window. Here, you can pick a table from the database that you have connected to.
In this case, we will select the table named CountryCodeLookup. This table contains the code for each country and will serve as the lookup table in our use case.
In the text box provided under the Pick Table option, you can enter a where clause to modify the lookup query. In this case, we will leave it empty.
Once you have chosen a table, click Next.
On the next window, you can choose a Lookup Cache Type from the following options:
No Caching: No data will be stored in a cache. This option is selected by default.
Static: The lookup values are stored in a cache. Once the cache is created, the transformation object will always query the cache instead of the lookup table. When you select this option, the following sub-options are enabled:
Fill Cache With All Lookup Values at Start: Fills the cache with all of the lookup values at the start and continues to use this cache for every lookup.
Cache After First Use: Uses the database table for the first lookup and fills the cache right after it is done. This cache is then used for every subsequent lookup. Checking this option enables another sub-option:
Cache Commit Count: Defines the number of records collected per cache chunk before they are committed to the cache.
Dynamic: The lookup values are stored in a temporary cache file, which is deleted once the dataflow has been executed. When you select this option, the following sub-options are enabled:
Fill Cache With All Lookup Values at Start: Fills the cache with all of the lookup values at the start and continues to use this cache for every lookup.
Cache After First Use: Uses the database table for the first lookup and fills the cache right after it is done. This cache is then used for every subsequent lookup. Checking this option enables other sub-options:
Cache Commit Count: Defines the number of records collected per cache chunk before they are committed to the cache.
Cache Key Column: Defines a matching key field to check whether a record already exists in the cache.
Persistent: Saves the lookup values in a cache file that can be reused for future lookups. When you choose this option, the following sub-options are enabled:
Rebuild Persistent Cache on Next Run: Checking this option will allow the contents of the cache file to be modified after every run.
Cache File Name: Here, you can enter a name for your cache file.
In this case, we will select the No Caching option. Once you are done, click Next.
On the Lookup Options window, you can choose between multiple lookup options.
This page provides a set of options for different scenarios that could be faced during a lookup.
If Multiple Values Are Found
Multiple Matches Found Option: This option provides the flexibility to choose the output value if more than one match is found for a single value in the lookup table. You can select one of the three options that appear in the drop-down list:
Return First: Returns the first matched value.
Return Last: Returns the last value among all matched values.
Return All: Returns all matched values.
If Value Is Not Found In the Lookup List
If no lookup values are found for a source value, you can choose from the following options to be appended with the output:
No Message: The output value will be the same as the input value and no message will appear with it.
Add Error: An error message will appear with the output.
Add Warning: A warning message will appear with the output.
If Value Is Not Found in the Lookup List, Assign Value
If no lookup value is found for a source value, you can assign an output value of your choice.
Assign Source Value: Returns the source value in the output.
Assign Null: Returns null in the output.
This Value: Allows you to enter any value that will be returned in the output.
In this case, there is only one code for each country. Therefore, we will choose Return First from the drop-down list in the Multiple Matches Found Option. Moreover, we will leave the other options at their default selection i.e. No Message under If Value Is Not Found in the Lookup List, and Assign Null under If Value Is Not Found, Assign Value.
Once you are done choosing the options, click Next.
On the Config Parameters window, you can define certain parameters for the Database Lookup object.
These parameters facilitate an easier deployment of flows by excluding hardcoded values and providing a more convenient method of configuration. If left blank, they will assume the default values that were initially assigned to them.
In this case, we will leave them blank. Click Next.
On the last window, which is the General Options window, you will be provided with a text box to add Comments. The General Options in this window have been disabled.
You are now done configuring the Database Lookup object. Click OK to close the configuration window.
Expand the Database Lookup object to view the layout of the lookup table. In this case, it contains two fields, Country and Code. The former contains the full name of each country and the latter contains each country’s code.
Map the Country field from the Database Table Source object to its counterpart in the Database Lookup object.
Drag an Excel Workbook Destination object from Toolbox > Destinations > Excel Workbook Destination and drop it onto the designer. Configure the object by providing a name and the path to the directory and folder where you want to save your destination file.
Auto-map the source dataset to the destination object.
Delete the mapping link between the Country fields in the source and destination. To do this, right-click on the mapping link and select Delete from the context menu.
Map the Code field from the Database Lookup object to the Country field in the destination object. This is what the final dataflow should look like:
Right-click on the destination object’s header and select Preview Output from the context menu.
In the Data Preview window, you will see that each country name has been replaced by its corresponding code.
This concludes using the Database Lookup Transformation object in Astera.
Each destination on the dataflow is represented as a destination object. You can have any number of destinations on the dataflow. Each destination can only receive data from a single source. To feed multiple sources into a destination, you need to connect them through a transformation object, for example, Merge or Union. For more information on Transformations, see the Creating Transformations article.
The following destination types are supported by the dataflow engine:
Flat File Destinations:
Tree File Destinations:
Database Destinations:
All destinations can be added to the dataflow by grabbing a destination type from the Toolbox and dropping it on the dataflow. File destinations can also be added by dragging-and-dropping a file from an Explorer window while pressing the ‘Shift’ key. Database destinations can be dragged-and-dropped from the Data Source Browser while holding down the ‘Shift’ key. For more details on adding destinations, see the Introducing Dataflows article.
Adding a Delimited File Destination object allows you to write to a delimited file. An example of what a Delimited File Destination object looks like is shown below.
To configure the properties of a Delimited File Destination object after it was added to the dataflow, right-click on its header and select Properties from the context menu.
Adding a Fixed-Length File Destination object allows you to write to a fixed-length file. An example of what a Fixed-Length File Destination object looks like is shown below.
To configure the properties of a Fixed-Length File Destination object after it was added to the dataflow, right-click on its header and select Properties from the context menu.
Adding an Excel Workbook Destination object allows you to write to an Excel file. An example of what an Excel Workbook Destination object looks like is shown below.
To configure the properties of an Excel Workbook Destination object after it was added to the dataflow, right-click on its header and select Properties from the context menu.
Adding an XML/JSON File Destination object allows you to write to an XML file. An example of what an XML/JSON File Destination object looks like is shown below.
To configure the properties of an XML/JSON File Destination Object after it was added to the dataflow, right-click on its header and select Properties from the context menu. The following properties are available:
General Properties window:
File Path – Specifies the location of the destination XML file. Using UNC paths is recommended if running the dataflow on a server.
File Options:
Using the Encoding dropdown, select the appropriate encoding scheme for your destination file.
Check the Format XML Output checkbox to have line breaks inserted into the destination XML file for improved readability.
Schema Options:
Read From Schema File Specified Below – Specifies the location of the XSD file controlling the layout of the XML destination file.
To generate the schema, click the icon next to the Schema File input, and select Generate.
To edit an existing schema, click the icon next to the Schema File input, and select Edit File. The schema will open for editing in a new tab.
Using the Root Element dropdown, select the node that should be the root of your destination schema. Any nodes up the tree will be excluded.
Adding a Database Table Destination object allows you to write to a database table. An example of what a Database Table Destination object looks like is shown below.
Destination Connection screen – Allows you to enter the connection information for your destination, such as Server Name, Database and Schema, as well as credentials for connecting to the selected destination.
Pick Table window:
Database Transaction Management: Enable Transaction Management if you want to wrap your transfer inside a transaction. Depending on your database settings, this can give you performance improvements during the transfer. When Transaction Management is enabled, you should choose between always committing the transaction at the end of the transfer, or only committing it if there were no errors. Any errors would result in the entire transaction being rolled back.
Preserve System Generated Key Values: This option is only available if you have assigned at least one field in your destination layout as a System Generated field. If enabled, Astera will pass the incoming value from the source to the system generated field. Otherwise, the incoming source value will be ignored, and the system will write auto-increasing values to the destination System Generated field.
Data Load Options: Specifies how your records are inserted into the destination database. The available types are Use Single Record Insert, Bulk Insert with Batch Size, and Bulk Insert with All Records in One Batch.
These types allow you to customize your transfer to balance performance vs. logging needs. Bulk inserts typically result in a better performance (faster transfer for a given number of records), but they also come with less logging, and less ability to undo unwanted inserts should you need to.
Use Single Record Insert: Records are inserted into a destination table one-by-one. Performance is the slowest among the three insert types. However, any errors or warnings during the transfer are displayed to you immediately as the transfer progresses.
Bulk Insert with All Records in One Batch: Typically a quick method of transferring large amounts of data. Keep in mind, however, that any database-specific errors in your transfer will not show until the end of the transfer, when the entire batch is written to the destination database.
Bulk Insert with Batch Size: A good tradeoff between performance and logging needs. Records are inserted in batches of the specified size. Typically, larger batch sizes result in better transfer speeds; however, performance gains diminish as batch sizes become very large.
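To make the tradeoff concrete, here is a minimal Python sketch using sqlite3 (the table and columns are hypothetical) contrasting single-record inserts with batched bulk inserts:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderId INTEGER, OrderName TEXT)")
records = [(i, f"Order {i}") for i in range(10000)]

# Use Single Record Insert: one statement per record; slow, but errors surface per row
for rec in records:
    conn.execute("INSERT INTO Orders VALUES (?, ?)", rec)

# Bulk Insert with Batch Size: insert in chunks of 1000, balancing speed and logging
batch_size = 1000
for i in range(0, len(records), batch_size):
    conn.executemany("INSERT INTO Orders VALUES (?, ?)", records[i:i + batch_size])
conn.commit()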
The SQL Statement Destination object offers extra flexibility over database destination objects by letting you apply custom INSERT or UPDATE SQL code that controls what is written to the destination table. An example of what an SQL Statement Destination object looks like is shown below.
To configure the properties of an SQL Statement Destination object after it was added to the dataflow, right-click on its header and select Properties from the context menu. The following properties are available:
Database Connection window – Allows you to enter the connection information for your SQL Statement, such as Server Name, Database, and Schema, as well as credentials for connecting to the selected database.
SQL Query window: In the SQL Query window, you can enter an SQL expression controlling which fields and records should be written to the destination. The SQL expression should follow standard SQL syntax conventions for the chosen database provider.
For example,
Insert into Orders values (@OrderId, '@OrderName', '@CreatedDtTm')
Notice the @ symbol in front of a field name. This makes the field appear in the field list inside the object box so that the field can be mapped. The fields without the @ symbol in front of them will not appear in the list of fields, but they can still receive values according to the logic of the SQL statement itself.
For example,
Insert into Orders (OrderId, OrderName, CreatedDtTm) values (@OrderId, '@OrderName', '2010/01/01')
The Excel Workbook Report object in Astera is designed to tabulate information from selected fields and present the results in a one- or two-dimensional matrix. This feature enables deeper analysis of data by organizing it in a way that facilitates the identification of trends, patterns, and insights.
To get the object from the Toolbox, go to Toolbox > Destinations > Excel Workbook Report. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Excel Workbook Report object onto the designer.
The dragged report object is empty right now. This is because the data fields are not mapped to it yet. While any source can be used, for this particular use case, we will demonstrate using a Report Source that is extracting data from a PDF source file.
Configure the source object and place it onto the designer next to the Excel Workbook Report object.
Now map the data fields from the source object to the report object.
To configure the Excel Workbook Report object, right-click on the header, select Properties from the context menu and a dialog box will open.
Provide the File Path. This is where the Excel report file will be saved.
Once the File Path and data reading options have been specified on this screen, click Next.
The next window is the Layout Builder. On this window, the layout of the Excel report file can be modified.
Here, you can write names of fields as you want them to appear in your destination in the Header column and specify the relevant Aggregate Functions for them.
Aggregate Functions define how the data will be summarized in the report:
Group By: Groups records based on unique values in the specified field.
Sum: Calculates the total sum of the specified field.
Count: Counts the number of records.
Average: Calculates the average value of the specified field.
Max: Finds the maximum value in the specified field.
Min: Finds the minimum value in the specified field.
First: Returns the first record in a sorted list based on the specified field.
Last: Returns the last record in a sorted list based on the specified field.
Variance: Calculates the variance of the specified field.
Standard Deviation: Calculates the standard deviation of the specified field.
None: Includes the field in the report without applying any aggregation. This is useful when you want certain field values in the data lines but don’t want to apply any aggregation on them.
For this case:
AccountID: We will select the Group By option from the Aggregate Function drop-down list for this field as we want to group the records based on individual accounts.
OrderID: We will select the Group By option from the Aggregate Function drop-down list for this field as we want to see orders within each account.
TOTAL: For this field we will select the Aggregate Function Sum, to calculate the total amount per order.
QUANTITY: For this field we will select the Aggregate Function Sum, to calculate the total quantity per order.
ITEM: Since we want to show item names in the data lines but do not want to apply any aggregates on them, we will select Aggregate Function None.
The same approach will be applied to the ITEM CODE, DESCRIPTION, and PRICE fields. We will select Aggregate Function None for each of these fields to ensure that their specific values are displayed in separate data lines without any aggregation.
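For reference, the grouping and aggregation configured above correspond roughly to the following pandas sketch with made-up values (Astera builds the full report, including data lines and totals, internally when the dataflow runs):

import pandas as pd

df = pd.DataFrame({
    "AccountID": ["A1", "A1", "A2"],
    "OrderID":   [101, 101, 202],
    "ITEM":      ["Pen", "Pad", "Ink"],
    "QUANTITY":  [3, 2, 5],
    "TOTAL":     [6.0, 8.0, 25.0],
})

# Subtotals per account and order: Group By AccountID and OrderID, Sum TOTAL and QUANTITY
summary = df.groupby(["AccountID", "OrderID"], as_index=False)[["TOTAL", "QUANTITY"]].sum()
print(summary)
# ITEM (Aggregate Function None) would appear only in the detail data lines, which this sketch omits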
Click Next. The Report Options window will now open.
Report Type: You can select from three report types: Summary, Cross Tab, or Time Series.
Title: To provide a meaningful title to your report, enter a new title into the Title field.
Subtotal Text: You can specify the name for the Subtotal field.
Grand Total Text: You can specify the name for the Grand Total field.
Enable Case Sensitive Match – Check this option if you want to Group your data on a case sensitive basis. For example, if you have customer names like "john" and "John," enabling this option will treat them as distinct groups rather than combining them into a single group.
Style
You can also modify the style of your report.
Show Data Lines: Check this option if you want to see the actual data records along with the subtotals and grand totals.
Insert Blank Line Before Grand Total: Inserts a blank line before the grand total in the report.
Write Grand Total: Adds the grand total to the report. If unchecked, the grand total won't be included.
Insert Blank Line Before Subtotal: Inserts a blank line before each subtotal in the report.
Insert Blank Line After Subtotal: Inserts a blank line after each subtotal in the report.
Click Next. The Aggregate Transformation Properties window will now open.
There are three sorting options in Aggregate transformation:
Incoming data is pre-sorted on group by fields: This option requires the incoming data to already be sorted by the specified Group-By field(s).
Sort Incoming data before building aggregate: This option will first sort the incoming data, then build its aggregate.
Build aggregate using unsorted data: This option will build aggregate using the incoming data whether it is sorted or not.
The Excel Report object is successfully configured, and the report file can now be created by running the dataflow.
Below, you can see a sample of how the summary report appears.
A Crosstab Summary displays summarized information about two fields in a two-dimensional matrix. The values for one field are displayed down the left-most column of the matrix and the values for the other key field are displayed across the top row as columns. This two-dimensional arrangement displays only a single measure at a time.
Let’s see how we can make a Cross Tab Summary using Excel Report Source.
Since we want to use information from two tables (Orders and Order Details), we have joined them and used an Expression object to calculate the total. We can then map the required data fields from both tables to the Excel Workbook Report object.
For this case:
CustomerID: We will select the Group By option from the Aggregate Function drop-down list for this field as we want to group the records based on individual customers.
ProductID: We will select the None option from the Aggregate Function drop-down list for this field as we want the product values to spread across the top row (columns).
Total: We will select the Sum option from the Aggregate Function drop-down list for this field as we want totals for each type of product and the totals for each customer.
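Conceptually, the resulting matrix resembles a pandas pivot table, sketched here with made-up values: CustomerID down the rows, ProductID across the columns, and summed totals in the cells (Astera produces the equivalent crosstab internally):

import pandas as pd

df = pd.DataFrame({
    "CustomerID": ["ALFKI", "ALFKI", "ANATR"],
    "ProductID":  [11, 42, 11],
    "Total":      [140.0, 60.0, 95.0],
})

crosstab = df.pivot_table(index="CustomerID", columns="ProductID",
                          values="Total", aggfunc="sum",
                          margins=True, margins_name="Grand Total")
print(crosstab)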
Title: To provide a meaningful title to your report, enter a new title into the Title field.
Subtotal Text: You can specify the name for the Subtotal field.
Row Total Text: You can specify the name for the Row Total field.
Grand Total Text: You can specify the name for the Grand Total field.
Enable Case Sensitive Match – Check this option if you want to Group your data on a case sensitive basis.
Crosstab Options
Column Field: You can select which field or attribute you want to use as the column headers in the resulting cross-tabulation report.
Row Totals – Check this option if you want to add each row's total to your report.
Style
You can also modify the style of your report.
Add Grand Total: Inserts the grand total in the report.
Add Blank Line Before Grand Total: Inserts a blank line before the grand total in the report.
Add Row Total: Inserts the row totals in the report.
Add Blank Line After Subtotal: Inserts a blank line after each subtotal in the report.
Click on Next. The Aggregate Transformation Properties window will now open.
There are three sorting options in Aggregate transformation:
Incoming data is pre-sorted on group by fields: This option requires the incoming data to already be sorted by the specified Group-By field(s).
Sort Incoming data before building aggregate: This option will first sort the incoming data, then build its aggregate.
Build aggregate using unsorted data: This option will build aggregate using the incoming data whether it is sorted or not.
After defining the options, click OK.
The Excel Report object is successfully configured, and the report file can now be created by running the dataflow.
Below, you can see a sample of how the summary report appears. The summary in the table shows the sales data for different products purchased by various customers, identified by their CustomerID.
A Time Series Summary displays summarized information about two key fields in a two-dimensional matrix. The values for one field are displayed down the left-most column of the matrix and the time intervals (such as days, months, quarters, or years) are displayed across the top row as columns.
Let’s see how we can make a Time Series Summary using Excel Report Source.
Since we want to use information from two tables (Orders and Order Details), we have joined them and used an Expression object to calculate the total. We can then map the required data fields from both tables to the Excel Workbook Report object.
For this case:
CustomerID: We will select the Group By option from the Aggregate Function drop-down list for this field as we want to group the records based on individual customers.
OrderDate: We will select the None option from the Aggregate Function drop-down list for this field as we want to use this date field across the top row (columns).
Total: We will select the Sum option from the Aggregate Function drop-down list for this field as we want the totals for each customer.
Title: To provide a meaningful title to your report, enter a new title into the Title field.
Subtotal Text: You can specify the name for the Subtotal field.
Row Total Text: You can specify the name for the Row Total field.
Grand Total Text: You can specify the name for the Grand Total field.
Enable Case Sensitive Match – Check this option if you want to Group your data on a case sensitive basis.
Timeseries Report Options
Time Unit Drop-down: You can specify the time interval for the time series analysis. Available options include:
Year: Analyze data on a yearly basis.
Month: Analyze data on a monthly basis.
Day: Analyze data on a daily basis.
Week: Analyze data on a weekly basis.
Quarter: Analyze data on a quarterly basis.
Start Date: You can specify the start date for the time series analysis. This defines the beginning of the time period for which data will be analyzed.
End Date: You can specify the end date for the time series analysis. This defines the end of the time period for which data will be analyzed.
Date Field: Field from the dataset that contains the date or timestamp information. The selected date field will be used to create the time series.
Style
You can also modify the style of your report.
Add Grand Total: Inserts the grand total in the report.
Add Blank Line Before Grand Total: Inserts a blank line before the grand total in the report.
Add Row Total: Inserts the row totals in the report.
Add Blank Line After Subtotal: Inserts a blank line after each subtotal in the report.
Click on Next. The Aggregate Transformation Properties window will now open.
There are three sorting options in Aggregate transformation:
Incoming data is pre-sorted on group by fields: This option requires the incoming data to already be sorted by the specified Group-By field(s).
Sort Incoming data before building aggregate: This option will first sort the incoming data, then build its aggregate.
Build aggregate using unsorted data: This option will build aggregate using the incoming data whether it is sorted or not.
After defining the options, click OK.
The Excel Report object is successfully configured, and the report file can now be created by running the dataflow.
Below, you can see a sample of how the summary report appears. This summary table shows the number of sales across different years for customers, identified by their CustomerID.
MongoDB is a document-oriented database in which one collection holds different documents. The MongoDB Destination object in Astera Data Stack provides the functionality to write data to it, along with options to control how data is written to collections.
While writing data, the number of fields, content, and size of the document can differ from one document to another. This can be easily catered to by configuring write concerns, which describe the level of acknowledgment MongoDB returns for write operations.
MongoDB is mainly used for Big Data.
The MongoDB Destination object can be used to map incoming data from a source to the MongoDB server. MongoDB makes it easier for users to store both structured and unstructured data.
For our use case, we already have an XML/JSON Source object configured in a Dataflow.
To start, drag-and-drop the MongoDB Destination object from the Destinations section of the Toolbox onto the Dataflow.
Right-click on the MongoDB Destination object and select Properties from the context menu.
This will open a new window.
User Name: This is where we enter the user name for the MongoDB or local server.
Password: The password for the MongoDB or local server is entered here.
Primary Server Name: The name of the primary cluster is entered here.
Database: This is where we select the database to which we wish to write the data.
Authentication Database: This is the database used for authentication.
Port: The port is used to handle incoming and outgoing requests to the server.
Enable Set of Replica: Selecting this checkbox allows the use of a secondary cluster.
Secondary Server Name: The name of the secondary cluster is entered here.
Use TLS: Select this option if the server requires TLS security.
Once your credentials have been filled, test the connection, and click Next.
For our use case, we have input the credentials to use the MongoDB Destination for our local server.
We will now be taken to the MongoDB Pick Collection screen.
For our use case, we will select Create/Replace and add a new Collection.
Database Operations – These operations are used when we are picking an already existing collection.
Insert: To insert a new record into the collection.
Update: To update an existing record in the collection.
Delete: To delete a record from the collection.
Upsert: To update a record if it already exists in the collection, or insert it if it does not.
Select Fields for matching database records: Selecting from this drop-down menu lets the user select fields based on which to match the records for the selected database operation.
Write Concern Options – Selecting from these options lets the server provide an acknowledgment to the user based on how the process was carried out.
ACKNOWLEDGED: This will return an acknowledgment in the Job trace window if the process stops after getting an error or if the process successfully completes.
UNACKNOWLEDGED: This option will not return an acknowledgment, no matter how the data write is carried out.
MAJORITY: If there are multiple primary and secondary servers, this option will return an acknowledgment when the majority of the servers have been processed.
W1: Selecting this option will return an acknowledgment when the primary server has been processed.
W2: Selecting this option will return an acknowledgment when the primary server and one secondary server have been processed.
W3: Selecting this option will return an acknowledgment when the primary server and two secondary servers have been processed.
Data Load Options – These options let the user define how the data is going to be loaded into the database.
Bulk insert with batch size: This will insert all records, divided into batches of the size that the user has defined.
Bulk insert with all records in one batch: This will insert all records in a single batch.
Use single record insert: This option will treat every record individually and insert them one by one.
Select Type of Bulk Insert: Selecting from this drop-down menu lets the user define whether the Bulk Insert will be Ordered or UnOrdered.
In the case of Ordered, data writing will be stopped if an error is encountered between record insertions.
In the case of UnOrdered, data writing will continue despite any errors being encountered.
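For comparison, the same concepts expressed directly against MongoDB with the pymongo driver look roughly like this; the connection string, database, and collection names are placeholders:

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client["SalesDb"]

# Write concern controls how many servers must acknowledge the write (e.g. W1, MAJORITY)
collection = db.get_collection("Orders", write_concern=WriteConcern(w="majority"))

docs = [{"OrderId": 1, "Item": "Pen"}, {"OrderId": 2, "Item": "Pad"}]

# Ordered bulk insert stops at the first error; ordered=False keeps writing past errors
collection.insert_many(docs, ordered=True)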
Click Next and you will be led to the MongoDB Layout screen.
Currently, our layout is empty since we have not mapped any fields to it.
We will map the incoming fields from the XML/JSON Source object to the MongoDB Destination object.
We will then reopen the MongoDB Layout screen.
As you can see below, the entire layout has now been defined.
Click OK and the MongoDB Destination object will be configured.
Select the Start Dataflow option in the main toolbar and the data will be written to the destination.
As you can see in the Job Progress window, the data has been successfully written to the destination.
This concludes the configuration of the MongoDB Destination object in Astera.
The Parquet File Destination object allows the user to write data mapped from various kinds of sources to Parquet files, which can efficiently store large datasets. It can also be used with various transformations.
The Parquet File Destination object in Astera offers compression methods to reduce file size and control memory consumption.
Drag and drop the Parquet File Destination object from the Destinations section of the Toolbox.
Right-click on the Parquet File Destination object and select Properties from the context menu.
This will open the Properties screen.
Now, let’s look at the options present on this screen.
File Location
File Path: This is where the file path to the destination file is to be defined. It will be created once the dataflow is executed.
Options
Compression Method - You can select a compression method from this drop-down menu.
Snappy: This method offers high speed and reasonable compression.
Gzip: This method allows the reduction of data size at a fast rate.
Append to File (If Exists): This option will append data to the destination if there is a previously existing file present with data.
Write Numeric Nulls As Zero: Checking this box will write all null values as zero.
Write Null Booleans As False: Checking this box will write all Null Boolean values as false.
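Outside of Astera, the effect of the compression choice can be illustrated with a short pyarrow sketch (the file names and data are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"OrderID": [1, 2, 3], "Total": [6.0, 8.0, 25.0]})

# Snappy: fast with reasonable compression; Gzip: smaller files at some CPU cost
pq.write_table(table, "orders_snappy.parquet", compression="snappy")
pq.write_table(table, "orders_gzip.parquet", compression="gzip")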
Once done, click Next and you will be led to the Layout Builder screen.
Here, the layout is going to be mapped for the destination. It can be built from the incoming data source or can be altered by the user.
We will be using our pre-configured Excel Workbook Source to map the incoming data to the Parquet File Destination object.
Open the Layout Builder again and it will be populated.
Click Next and you will be taken to the Config Parameters screen.
Parameters allow the deployment of flows by eliminating hardcoded values and provide a dynamic way of changing multiple configurations with a simple value change.
Click Next and you will be led to the General Options screen.
Here, you can add any comments that you wish to add.
Clear Incoming Record Messages: When this option is checked, any messages coming in from objects preceding the current object will be cleared. This is useful when you need to capture record messages in the log generated by the current object and filter out any record messages generated earlier in the dataflow.
Do Not Process Records with Errors: When this option is checked, records with errors will not be processed by the object.
Do Not Overwrite Default Values with Nulls: Selecting this option will make sure that values are not overwritten with null values in the output.
Click OK and the Parquet File Destination object will be configured.
This concludes the configuration of the Parquet File Destination object in Astera.
The SQL Statement Destination in Astera offers extra flexibility over other destination objects by providing the option to apply custom INSERT, UPDATE, or DELETE SQL statements to control what will be written to the destination table. The object can also be used to call stored procedures. Moreover, you can parameterize your SQL statement using the Parameterize Replacement functionality.
In this article, we will be looking at how you can configure and use the SQL Statement Destination object in Astera.
Before moving on to the actual configuration, we will have to get an SQL Statement Destination object from the Toolbox. To do so, go to Toolbox > Destinations > SQL Statement Destination. In case you are unable to view the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the SQL Statement Destination object onto the designer.
The destination object is currently empty because we are yet to map any data fields to it.
To configure the SQL Statement Destination object, right-click on its header and select Properties from the context menu. Alternatively, you can double-click the header of the destination object to go to its Properties.
A new window will open when you click on Properties from the context menu.
Here, you need to configure the properties for the SQL Statement Destination object.
On the Database Connection window, enter the details for the database you wish to connect to.
Use the Data Provider drop-down list to specify which database provider you wish to connect to. The required credentials will vary according to your chosen provider.
Provide the required credentials. Alternatively, use the Recently Used drop-down list to connect to a recently connected database.
Test Connection to ensure that you have successfully connected to the database. A separate window will appear, showing whether your test is successful. Close it by clicking OK, and then click Next.
The next window will present a blank page for you to enter an appropriate SQL statement for the required outcome. This can consist of an INSERT, UPDATE, or DELETE statement that manipulates the data being written to the database.
The curly brackets on the right side of the window indicate that the use of parameters is supported, which implies that you can replace a regular value with a parameterized value that can be changed during runtime.
In this use-case, we will be inserting new records into an existing table, named TESTTABLE, that has three columns: OrderID, CustomerID, and EmployeeID.
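An example statement for this use case, following the quoting conventions described below, might be:
Insert into TESTTABLE (OrderID, CustomerID, EmployeeID) values (@OrderID, '@CustomerID', @EmployeeID)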
Notice the @ symbol in front of a field name. This makes the field appear in the field list inside the object box so that the field can be mapped. The fields that do not have a @ symbol in front of them will not show in the list of fields, but they can still receive values according to the logic of the SQL statement itself. String fields need to be surrounded by single quotes, whereas Integer fields do not. In this case, CustomerID is a String field, while OrderID and EmployeeID are Integer fields.
The Database Options given at the bottom of the window provide support for transaction management. Checking the Use Transaction option will enable two other sub-options:
Always commit transaction on completion: Ensures that the job is completed regardless of any erroneous records.
Rollback if there are any errors: Aborts the job in case of one or more erroneous records.
Once you have entered an SQL statement and chosen your desired option, click Next.
On the new Config Parameters window, you can define certain parameters for the SQL Statement Destination object.
These parameters facilitate an easier deployment of flows by excluding hardcoded values and providing a more convenient method of configuration. If left blank, they will assume the default values that were initially assigned to them.
At the end, a General Options window will appear. Here, you are provided with:
A text box to add Comments.
A set of General Options related to the processing of records.
To conclude the configuration, click OK.
For a destination object to work, data fields must be mapped to it from a source. In this case, we will be using an SQL Query Source object to get data from the Orders table in the Northwind database.
Configure the source object and place it next to the SQL Statement Destination object.
Map the required data fields from the source object to the destination object. This can be done in the following ways:
By dragging and dropping the parent node of the source object onto that of the destination object.
By individually dragging and dropping the required fields from the source object onto their respective nodes in the destination object.
To preview the output, right-click on the destination object’s header and select Preview Output from the context menu. In this case, you will see the following result:
You can now write data to the destination table by running the dataflow.
This is how we use the SQL Statement Destination object in Astera.
Astera’s Fixed Length File Destination provides the functionality to write data to a Fixed Length File.
To get a Fixed Length File Destination object from the Toolbox, go to Toolbox > Destinations > Fixed Length File Destination. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Fixed Length File Destination object onto the designer.
The dragged destination object is empty right now. This is because the object has not been configured yet.
For the Fixed Length File Destination object to work, it needs to be provided with a data source.
Configure the source object and place it onto the designer next to the Fixed Length File Destination object.
Now map the source object to the destination object. The following ways can be used for mapping:
i. By dragging and dropping the parent nodes onto each other for automatic mapping.
ii. By mapping the source parent node by dragging it to the destination parent node manually.
iii. By directly writing the source layout to a Fixed Length File Destination through the source context menu of its parent node.
The fields are now mapped.
To configure the Fixed Length File Destination object, right-click on the header, select Properties from the context menu, and a dialogue box will open.
Provide the File Path. This is where the fixed length destination file will be saved.
The dialog box has some other configuration options. Let’s go over these options:
Options:
Check the First Row Contains Header box if you want a header row to be written to the destination file.
Record Delimiter - Allows you to select the delimiter for the records. The choices available are the carriage-return line-feed combination, carriage-return, and line-feed. You can also type a custom delimiter instead of choosing from the available options.
In case the records don’t have a delimiter, the Record Length field is used to specify the character length for a single record.
Encoding - Allows you to choose the encoding scheme for the delimited file from a list of choices. The default value is Unicode (UTF-8).
Choose Append to File (If Exists) to append to an existing file or create a new file. Creating a new file will overwrite any existing file.
Check the Write to Multiple Files option for the data to be saved to multiple files instead of one single file. This can be done within a single dataflow through the destination object and supporting transformations.
To define a hierarchical file layout and process the data file as a hierarchical file, check the This is a Hierarchical File option. The Astera IDE provides extensive user interface capabilities for processing hierarchical structures.
Once the data reading options have been specified on this window, click Next.
The next window is the Layout Builder. On this window, the layout of the fixed length destination file can be modified.
To add a new field to the layout, go to the last row of the layout (Name column), which will be blank, and double-click on it; a blinking text cursor will appear. Type in the name of the field to be added and select the subsequent properties for it. A new field will be added to the source layout.
To delete a field from the layout, click on the serial column of the row that is to be deleted. The selected row will be highlighted in blue.
Right-click on the highlighted line, a context menu will appear which will have the option to Delete.
Selecting Delete will delete the entire row.
The field is now deleted from the layout and won’t appear in the output.
To change the position of any field and move it below or above another field in the layout, select the row and use Move up/Move down keys.
For example: To move the Country field right below the Region field, select the row and use the Move up key to move it from the 9th row to the 8th row.
The row is now moved from the 9th position to the 8th position.
Once the object layout is configured, click Next. A new window will appear, Config Parameters, which allows you to further configure and define parameters for the fixed length destination file.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
A General Options window will appear. On this window:
Comments can be added.
General Options are given, which relate to processing of records in the destination file.
Click OK.
The FixedDest object is now configured according to the changes that were made in the properties window.
The Fixed Length File Destination object is successfully configured, and the destination file can now be created by running the dataflow.
Astera’s XML/JSON File Destination object provides the functionality to write data to an XML or JSON file when the data is in hierarchical format.
To understand how to write to an XML/JSON File Destination object, we will work through a use case in which we convert flat datasets to a hierarchical dataset and then write the transformed data to an XML file.
Customers and Orders data from database tables will be used as source objects. We will then join them using the TreeJoin Transformation to create a hierarchical dataset.
To get an XML/JSON File Destination object from the Toolbox, go to Toolbox > Destinations > XML/JSON File Destination and drag-and-drop it onto the designer. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
The dragged destination object is empty right now. This is because the object has not been configured yet.
A schema file is needed to write data to an XML/JSON File Destination. To create the schema file, right-click on the source object (the TreeJoin transformation in this case) and select Generate XML Schema for Layout from the context menu.
A new dialog box will open where you will be asked to save the XML schema file. Give the File Name and file path for the schema file and click Save.
The schema file has been created in the specified location. To view this file, go to the file location and open the file in Astera.
The opened file would look like the one below:
To configure the XML/JSON File Destination object, right-click on its header and select Properties from the context-menu.
A window, Destination XML File, will open. Here, we will specify the file locations: the File Path and the Schema File for the XmlJsonDest object.
The destination file will now be saved in the provided file location. Click OK, and map the destination object to the source object before further configuration.
The XmlJsonDest object will now have the layout of the source object (Treejoin Transformation in this case).
To map the source object to the destination object, the following ways of mapping can be used:
By dragging and dropping the parent node of the source object (TreeJoin node in the TreeJoin table) onto the child node of the destination object (TreeJoin node in the XmlJsonDest object) for automatic mapping.
By manually mapping the source parent node (TreeJoin in the TreeJoin table) by dragging it to the respective destination child node (TreeJoin in the XmlJsonDest object).
The fields are now mapped.
Once the file locations have been specified and the mappings have been done, further properties can be defined.
XML Layout
The next window after the Destination XML File window is the XML Layout window.
This window shows the XML layout for the XmlJsonDest object. The collection nodes for the object can be seen in this window with their fields.
Config Parameters
Click Next, and a window, Config Parameters, will open, which will allow us to further configure and define parameters for the XML/JSON Destination object.
Parameters can provide easier deployment of flows by eliminating hardcoded values and provide an easier way of changing multiple configurations with a simple value change.
General Options
Click Next, and a new window, General Options, will open. On this window:
Comments can be added.
General Options are given, which relate to processing of records in the destination file.
Click OK.
The XmlJsonDest object has been successfully configured and the destination file can now be created by running the dataflow.
The Quick Profile option in Astera gives users the ability to preview field statistics of any set-level object in the dataflow at design time. It provides information such as the data type, minimum/maximum values, data count, error count, etc., which can be used to identify and correct data quality issues while designing flows.
The Quick Profile window can be viewed for an entire flow by clicking View > Data Pipeline > Quick Profile or using the shortcut key Ctrl+Alt+A.
To view field statistics at a particular object in the dataflow, right-click on the object’s header and select Quick Profile.
A window like this will slide up from the bottom of the screen:
Quick Profile provides an overview of the content and quality of all the fields, allowing us to determine whether the data is suitable for further transformation. When creating the flows, we may use this functionality at any point to identify any erroneous data that might be affecting the final results.
Enter a valid file name and click Save.
This concludes the use of Quick Profile feature in Astera.
The Field Profile feature captures statistics for selected fields from one or several objects. Field Profile is essentially a transformation object as it provides Input and Output ports similar to other transformations. These output ports make it possible to feed the statistics collected to another object on the dataflow.
In this document, we will learn how to create a Field Profile in Astera.
We want to collect detailed statistics from some of these fields of data and write it to a Delimited File Destination. For this purpose, we will use Astera's Field Profile feature.
To get a Field Profile object from the Toolbox, go to Toolbox > Data Profiling > Field Profile. If you are unable to see the toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Field Profile object onto the dataflow designer.
You can see that the dragged Field Profile object contains an Input node and an Output node. The Input node is empty as we have not mapped any fields to it yet.
One-by-one map ShipName, CustomerID, Country, OrderDate, ProductName, UnitPrice, and Quantity from the source object to the Field Profile object’s Input node.
To configure the Field Profile object, right-click on its header and select Properties from the context menu.
A configuration window will open. The first screen you will see is the Layout Builder. This is where we can create or delete fields and change their names and data types.
Click Next. On the Properties window, specify the Statistics Type from the dropdown list.
The Field Statistics dropdown allows you to select the detail level of statistics to collect. Select from the following detail levels:
Basic Statistics: This is the default mode. It captures the most common statistical measures for the field’s data type.
No Statistics: No statistics are captured by the Data Profile.
Detailed Statistics – Case Sensitive Comparison: Additional statistical measures are captured by the Data Profile, for example Mean, Mode, Median etc. using case-sensitive comparison for strings.
Detailed Statistics – Case Insensitive Comparison: Additional statistics are captured by the Data Profile, using case insensitive comparison for strings.
In this case, we will select Detailed Statistics – Case Sensitive Comparison.
Click OK.
Right-click on Field Profile object’s header and select Preview Output from the context menu.
A Data Preview window will open and show you the statistics of each mapped field as a record.
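For a sense of what the basic and detailed measures represent, here is a small pandas sketch with made-up values (this is only an illustration, not Astera's profiler; the column name is hypothetical):

import pandas as pd

prices = pd.Series([10.0, 12.5, 12.5, 30.0], name="UnitPrice")

# Basic-style measures: data type, min/max, count
print(prices.dtype, prices.min(), prices.max(), prices.count())

# Detailed-style measures: mean, median, mode
print(prices.mean(), prices.median(), prices.mode().tolist())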
Observe that the Field Profile object contains an Output node. Once expanded, you will see various statistical measures as fields with output ports.
We can write these statistical measures to a destination file.
Right-click on the Output node and go to Write To > Delimited File Destination.
A Delimited File Destination object will be added to the dataflow designer with auto-mapped fields from the Output node.
A Job Progress window will open at this instant and will show you the trace of the job.
You can open the delimited file that contains field statistics from the link provided in the Job Progress window.
The Data Profile feature provides complete data field statistics – basic and detailed – containing information such as the data type, minimum/maximum values, data count, error count, etc. The statistics are collected for each of the selected fields at the time the dataflow runs.
In this document, we will learn how to create a Data Profile in Astera.
We want to collect statistics on these fields of data. For this purpose, we will use Astera’s Data Profile feature.
To get the Data Profile object from the Toolbox, go to Toolbox > Data Profiling > Data Profile. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Data Profile object onto the dataflow designer.
You can see that the Data Profile object is empty right now. This is because we have not mapped any fields to it yet.
Auto-map the fields from the source object onto the profile object.
To configure the Data Profile object, right-click on its header and select Properties from the context menu.
A configuration window will open. The first screen you will see is the Layout Builder. This is where we can create or delete fields, change field names, and their data type.
Click Next. This is the Properties window.
Here we will provide the Profile File path to specify where the profile should be stored.
Specify the type of Field Statistics to be collected.
The Field Statistics dropdown allows you to choose the detail level of statistics to collect. Select from the following detail levels:
Basic Statistics: This is the default mode. It captures the most common statistical measures for the field’s data type.
No Statistics: No statistics are captured by the Data Profile.
Detailed Statistics – Case Sensitive Comparison: Additional statistical measures are captured by the Data Profile, for example Mean, Mode, Median etc. using case-sensitive comparison for strings.
Detailed Statistics – Case Insensitive Comparison: Additional statistics are captured by the Data Profile, using case insensitive comparison for strings.
In this case, we will select Detailed Statistics – Case Sensitive Comparison.
Click OK.
A Job Progress window will open at this instant and will show you the trace of the job.
Click on the Profile link provided in the Job Progress window and the profile will open in Astera. Expand the Profile node to see each field inside the object. Click on these fields to see the collected statistical values.
This section talks about the various database write strategies offered within Astera.
Database Diff Processor is one of the four Database Write Strategies offered in Astera. Its purpose is to synchronize the data present in two separate datasets. The object compares the two datasets and performs write actions (insert and update) on the destination table so that both tables contain the same information.
In this use case, we have a sample dataset of customers that is stored in a database table. Currently, this dataset contains 10 records, but two more customer records are to be added later on. Furthermore, updated phone numbers are to be added for two other customers.
We want to write the initial dataset to another database table and ensure that whenever the aforementioned changes are made, they are applied to both tables. To achieve this, we will be using the Database Diff Processor object in Astera.
Drag-and-drop the Database Table Source object from Toolbox > Sources > Database Table Source onto the dataflow designer. Configure this object so that it reads data from the source table.
Drag-and-drop the Database Diff Processor object from Toolbox > Database Write Strategies > Database Diff Processor onto the dataflow designer. Auto-map all of the elements from the source object to the Database Diff Processor object.
Right-click on the header of the Database Diff Processor object and select Properties.
This will open the Database Connection window. Here, you will have to enter the credentials for the database you want to connect to. Alternatively, you can connect to a recently used database by selecting it from the Recently Used dropdown list.
In this case, we will connect to a test database that contains an empty Customers table.
Once you have entered the required credentials, click Next.
On the next window, you will have to pick a destination table, where the write actions (Insert and Update) will be performed.
In this case, we will pick the empty Customers table that has already been created in this database.
There are a couple of options related to Record Matching at the bottom of the screen:
Select a field to be used for matching records between the source table and the destination table. In this case, we will select CustomerID because it cannot change for each customer.
Check the Case Sensitive option if you want the comparison to be case sensitive. In this case, we will leave this option unchecked.
Now that the required table and Record Matching options have been selected, click OK to close the configuration window.
Run the dataflow to write your source data to the destination table. To preview the source data, right-click on the Database Table Source object and select Preview Output.
This is what the source data looks like:
To check whether this data has been written to the destination table, right-click on the Database Diff Processor object and go to Database Table > View Table Data.
The destination table will open in a separate tab within Astera.
The data present in the destination table is the same as that in the source table, showing that we have successfully written the data by running the dataflow.
The source dataset has been updated to include two more customer records. Moreover, two other customers have updated their phone numbers. This is what the source data looks like after the changes have been implemented:
Run the dataflow again to apply these changes to the destination table. This is what the destination table should look like when you open it in Astera after running the dataflow again:
The changes that were made to the source table have automatically been applied to the destination table as well, showing that the Database Diff Processor object has achieved its task.
This concludes using the Database Diff Processor write strategy in Astera.
The Source Diff Processor object is one of the Database Write Strategies offered in Astera. It works like the Database Diff Processor; however, unlike the Database Diff Processor, it is used to perform write actions (such as Insert, Update, and Delete) on file destinations. It stores a snapshot of the data processed in the first run in a CDC file, so the next time you run it, it will only import the new records.
We have a sample Employees dataset coming in from an Excel Workbook Source. Initially, we had records of 10 employees but later on, 2 more were added in the source dataset. We wish to apply a database write strategy that can read the data incrementally from file sources. To achieve this, we will use the Source Diff Processor in Astera.
Drag-and-drop the Source Diff Processor object from Toolbox > Database Write Strategy > Source Diff Processor onto the dataflow designer and map the source data to it.
Right-click on the Source Diff Processor object’s header and select Properties.
A Layout Builder window will open where you can modify your layout. Click Next.
The next window is the Incremental Write Options window.
Here, you have to specify the Record Matching field. This field is used to match and compare the incoming and existing records. We will select EmployeeID as the Record Matching field.
Case Sensitive – Check this option if you want to compare records on a case sensitive basis.
Sort Input – Check this option if you want to sort the incoming data.
Now, if the incoming dataset has a record with a new EmployeeID, i.e., an ID that is not present in the existing file being compared against the incoming file, Astera will perform the INSERT action.
If the EmployeeID is already present in the existing file, Astera will compare the records against that ID and will perform the UPDATE action on the fields where the information has changed.
If the EmployeeID is present in the existing file but not in the incoming file, it means that the record has been deleted. In this case, Astera will perform the DELETE action.
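Conceptually, the diff logic can be sketched in a few lines of Python, keyed on EmployeeID as in this example (an illustration only, not Astera's internal implementation):

existing = {10: {"EmployeeID": 10, "Phone": "111"}}          # snapshot from the last run
incoming = {10: {"EmployeeID": 10, "Phone": "222"},          # changed record -> UPDATE
            11: {"EmployeeID": 11, "Phone": "333"}}          # new key -> INSERT

actions = []
for emp_id, record in incoming.items():
    if emp_id not in existing:
        actions.append(("INSERT", record))
    elif record != existing[emp_id]:
        actions.append(("UPDATE", record))
for emp_id, record in existing.items():
    if emp_id not in incoming:
        actions.append(("DELETE", record))                   # missing from incoming -> DELETE

print(actions)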
In the Output Options section, you can either select the Single Output option or One Port for Each Action.
The Single Output option is selected if you wish to load your data into the destination without modifying it further on the basis of individual write actions. If you select Single Output, the database action such as INSERT, UPDATE, SKIP or ERROR will be chosen by the database write strategy’s logic rather than being specified by the user. Using a Single Output is recommended when a database write strategy is applied.
One Port for Each Action is used when you want to further transform or log your data. If you select One Port for Each Action, you will get separate nodes for each Diff action in the Source Diff Processor’s object.
In this example, we will select Single Output.
The third section in the Incremental Write Options window is the Incremental Transfer Information File Path option. Here, you must specify the file path where you want to store information related to the last run.
Now, if you have worked with Excel Workbook and Database table Sources in Astera, you would have noticed that the Database Table Source object gives you the option to read incremental changes. However, no such option is available in Excel or other file source objects. This option in the Source Diff Processor enables you to read incrementally from different file formats such as Excel, Delimited, and Fixed Length.
Click OK.
Now, right-click on the Source Diff Processor object’s header and select Preview Output.
Output preview for Single Output:
Output preview if One Port for Each Action is selected:
You can now write your data to any destination or perform any transformation on the dataset.
This concludes using the Source Diff Processor write strategy in Astera.
Data Driven Write Strategy is a set level functionality, which means that the entire incoming dataset must flow through it. It allows a user to stamp a database directive on the record so that when it reaches its destination, it will be loaded according to that directive to perform the specified write action. You can specify multiple rules within the properties of a Data Driven write strategy object. These rules are tried against each record from top to bottom. If a record passes the first rule, it will not be tried against the next rule(s).
Assume a scenario in which Orders data from a Database Table Source is written to a Database Table Destination. We want to DELETE those records where ShippedDate is before the year 2000 and declare those records where Freight is less than 10 as ERROR. We will use the Data Driven write strategy object to achieve this task.
Drag-and-drop the Data Driven object from Toolbox > Database Write Strategy > Data Driven onto the dataflow designer and map the source data to the Data Driven object.
Right-click on the header of the Data Driven object and select Properties.
A Layout Builder window will open where you can modify your layout. Click Next.
The next window is the Data Driven Write Strategy Conditions window, where you can specify rules to route your data. Click on the fx option to enable the Expression Builder.
Once you select this option, the Expression Builder will be enabled.
Now, specify the following rules in the Expression Builder and select Database Action Type as ERROR for the Freight rule and DELETE for the Date rule.
Year(ShippedDate) < 2000
Freight < 10
There are five Database actions: Insert, Update, Delete, Skip and Error. From these, you can select the action you want to be taken for a certain rule.
Once you are done specifying the rule(s), click OK.
You can now write your data to a Database Table Destination.
It is important to note here that while working with Database Write Strategies, Single Port is selected. Once you check the Single Port option in the Database Table Destination object, a box will appear in which you have to specify a field for matching database records. In our case, we will select OrderID.
We have successfully configured the settings and built the layout.
Let’s preview the output.
Data Driven output:
You can see that Astera has appended an error message with the records where Freight is less than 10. You can create an error log of these records or load them into a database if you want.
Now, whenever you access the same database table, you will see that the records where ShippedDate is before the year 2000 have been deleted.
This concludes using the Data Driven Write Strategy in Astera.
The Fixed Length Parser processes a stream of fixed-length data as input and returns its elements as parsed output. This is useful when the source data is in text format (all information is contained in a single field) and needs to be split into multiple fields for further integration.
In this document, we will learn how you can use Fixed Length Parser in Astera.
The sample data that we are using is a fixed length text file containing customers’ information.
Observe that all the information such as name, street number, street type, city, state, and zip code are concatenated in a single field. We want to parse the Contact_Details field into Name, Street Number, Street Type, and other respective fields.
To parse the data into separate fields, we will use the Fixed Length Parser object in Astera.
To parse the data, we first have to extract it. Since the data is stored in a continuous stream, we cannot extract it directly using a Fixed Length File Source object. Therefore, we will use a combination of transformations to get the desired outcome. The diagram below explains the process that we’ll be following, from extracting the data to parsing it.
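Before walking through the dataflow, here is a rough Python sketch of what the combined steps accomplish conceptually: read the text as one stream, then slice each line into named fields by fixed character widths. The field names, widths, and sample line are hypothetical and not taken from the actual sample file.

```python
# Illustrative sketch only: slicing a fixed-length text stream into named fields.
layout = [("Name", 12), ("StreetNumber", 4), ("StreetType", 8),
          ("City", 12), ("State", 2), ("ZipCode", 5)]

sample = "Jane Smith  42  Avenue  Springfield IL62704"

def parse_fixed_length(text):
    records = []
    for line in text.splitlines():
        record, pos = {}, 0
        for field, width in layout:
            record[field] = line[pos:pos + width].strip()
            pos += width
        records.append(record)
    return records

print(parse_fixed_length(sample))
```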
Let’s begin with extracting the data.
Go to Toolbox > Transformations > Constant Value and drag-and-drop the Constant Value object onto the designer.
Right-click on the object’s header and select Properties from the context-menu.
In the Constant Value field, provide the file path of the source file (file containing customers data in a stream). Click OK.
Next, go to Toolbox > Function Transformations > Files > ReadFileText(String filePath) - String, and drag-and-drop the object onto the designer.
You can see that the dragged transformation object has two sub-nodes: Input and Output.
Expand the Input node and map the Value field from the Constant Value Transformation object to the filePath field inside the Read File Text object.
This will redirect Astera to read the data from the given file path. Now we can use the Fixed Length Parser object to parse the text data into separate fields.
To get the Fixed Length Parser object, go to Toolbox > Text Processors > Fixed Length Parser and drag-and-drop the object onto the designer. Map the Value field, under the Read File Text object, onto the Text field inside the Fixed Length Parser object.
You can see that the dragged object also contains an Output sub-node, which is currently empty.
Configure the Fixed Length Parser object by right-clicking on its header and selecting Properties.
A properties window will open. Here, you will see three options; make sure the first two are checked.
Click Next, and you will be directed to a Source Fields window.
Here, you have to provide the name of each field that you want to parse the source data into, as shown below.
Click OK. Now, expand the Output node inside the Fixed Length Parser object. You will see all the fields that you have created in the previous step.
Right-click on the object’s header and select Preview Output from the context menu.
A Data Preview window will open. Expand the nodes, and you will see a parsed output of each record.
To store this parsed output, write it to a destination file.
Right-click on the Output sub-node inside the Fixed Length Parser object, and go to Write to > Excel Workbook Destination. An Excel Workbook Destination object will be added to the dataflow designer with auto-mapped fields.
Configure settings for the Excel Workbook Destination object.
Click on the Start Dataflow icon, located in the toolbar at the top, to create this destination file.
In addition to the standard logging functionality, Astera provides a special Data Quality Mode option, useful for advanced profiling and debugging. When a dataflow is created/opened in Data Quality Mode, most objects on the dataflow show the Messages node with output ports.
In this document, we will learn how to use the Data Quality Mode in Astera.
If you preview the Customers dataset output at this stage, you will see that some of the records have missing values in the Region and Fax fields.
Data quality rules are set so that records with empty Region values are marked as errors and records with empty Fax values are marked as warnings. A red exclamation sign in the Data Preview window identifies the records that have failed to match the rule and returned an error or a warning as a result.
Now, for instance, we want to collect information regarding the number of errors/warnings in a single record, the error/warning messages attached to these records, and write this information to a destination. For this purpose, we will use Data Quality Mode.
Once Data Quality Mode is activated, a Messages node will be added to all the objects in the dataflow.
The Messages node captures the following statistical information:
TotalCount
ErrorCount
WarningCount
InfoCount
MessagesText
DbAction
Custom
In addition, the FirstItem, LastItem, and Items sub-nodes provide a way to collect quality control data for each record. The quality control data includes details such as ElementName, MessageType, and Action, and can be written to a destination object for record-keeping purposes.
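To make this concrete, here is a hedged sketch of the kind of information the Messages node surfaces, shown as plain Python dictionaries rather than any real Astera API; all values are made-up examples.

```python
# Illustrative sketch only: summary statistics and one record-level item, mirroring
# the node and sub-node names above. These are not real Astera objects.
summary = {
    "TotalCount": 91,
    "ErrorCount": 4,
    "WarningCount": 22,
    "InfoCount": 0,
}

record_level_item = {
    "ElementName": "Region",
    "MessageType": "Error",
    "Action": "Reject",
    "MessagesText": "Region must not be null.",
}

print(summary)
print(record_level_item)
```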
Connecting the Messages node’s output ports to another object’s input ports on the dataflow makes it possible to get both summary statistics and record-level statistics for the dataset, which are useful for analysis and debugging. To do this:
Right-click on the Messages node inside the NoNullValues_Rule object and go to Write to > Delimited File Destination.
A Delimited Destination object is added to the designer with mapped fields.
Right-click on the header of the destination object and select Preview Output from the context menu.
A Data Preview window will open, showing error and warning information.
A Record Level Log captures the status (Success, Error, Warning, or Skip) for each of the records transferred, and includes snapshots of the source record and the destination record. It also provides additional details, such as error messages.
You can have any number of record level logs on the dataflow. Each record level log will collect the status of the records in the object that it is connected to.
In this document, we will learn how to use the Record Level Log object in Astera.
If you Preview Output for the Customers dataset, you will see that some of the records have empty Region and Fax fields.
If you hover over these warning signs, the error message will be displayed.
Now, when we run this dataflow, we want to know which records passed the validation check, which failed it, which contained errors, and which ended in only warnings.
For this purpose, we will use Record Level Log.
To get a Record Level Log object from the Toolbox, go to Toolbox > Data Profiling > Record Level Log. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Record Level Log object onto the dataflow designer.
Another way to get a Record Level Log object is to right-click on the Output node inside the Database Table Destination object and go to Write to > Record Level Log.
You can see that the dragged Record Level Log object is empty right now. This is because we have not mapped any fields to it yet.
Auto-map the fields from the source object to the Log object.
To configure the Log object, right-click on its header and select Properties from the context menu.
A configuration window will open. The first window you will see is the Layout Builder window. This is where we can create or delete fields, change their name and data type.
Click Next, and you will be directed to a Properties window where you can configure settings for creating the log file.
Specify the Profile File path where Astera will save this log file. Log files are saved with a .prof extension.
Specify the Log Level Type from the dropdown list.
All – all records (including Success records) are logged
Errors – only error records are logged
Warnings – only warning records are logged
Errors and Warnings – both error and warning records are logged
Off – no logging
In this case, we will select Errors and Warnings as our log type.
Stop Logging After … Records with Errors – allows you to limit excessive logging by setting a cap on the maximum number of errors to be logged. The logging stops after the cap has been reached.
The default value is 1000 errors.
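Taken together, the log level and the error cap determine which records end up in the log. A minimal Python sketch of that filtering logic (illustrative only, not Astera’s implementation; the constants mirror the options above):

```python
# Illustrative sketch only: filtering records by log level with a cap on errors.
LOG_LEVEL = "Errors and Warnings"   # one of: All, Errors, Warnings, Errors and Warnings, Off
MAX_ERRORS = 1000                   # "Stop Logging After ... Records with Errors" default

def should_log(status, errors_logged_so_far):
    if LOG_LEVEL == "Off" or errors_logged_so_far >= MAX_ERRORS:
        return False                # logging stops once the cap is reached
    if LOG_LEVEL == "All":
        return True
    if LOG_LEVEL == "Errors":
        return status == "Error"
    if LOG_LEVEL == "Warnings":
        return status == "Warning"
    return status in ("Error", "Warning")   # Errors and Warnings

print(should_log("Warning", errors_logged_so_far=3))   # True
print(should_log("Success", errors_logged_so_far=3))   # False
```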
Click OK.
Click on the log file link provided in the Job Progress window.
Record Level Log will open in Astera showing you the status of logged records.
Astera stores the error logs in XML format. If you expand each record, it will give you the Field Name to which the error/warning message was attached, the Processing Step of the data check that resulted in the error, as well as the error Message.
If you click on View Record, Astera will show you the field values of the record failing the data quality rule.
The Data Quality Rules object found in the Data Profiling section of the Toolbox is used to apply one or more conditions, called Data Quality Rules, against incoming records. Records that do not meet the data quality rule criteria will be assigned the ‘Error’ status and may be optionally excluded from processing by the downstream objects.
Data Quality Rules is a record-level component which means that it does not require the entire dataset to flow through it. In other words, you can map a single or a couple of fields to the Data Quality Rules component to set up quality validation criteria and the transformed records can be mapped further in the dataflow.
Let’s understand the application and usage of Data Quality Rules with the following example.
Here we have sample data of employees of a fictitious organization which we have retrieved using an Excel Workbook Source.
If we look at the preview of the Employee_Report dataset, the values in the SalariedFlag column specify whether an employee is salaried in terms of 0 and 1.
1 = the employee is salaried
0 = the employee is non-salaried and therefore is eligible for overtime.
We can apply data quality rules to these values and identify which employees are not salaried and therefore, are eligible for overtime. The Data Quality Rules object will process all records and those that do not match the criteria will be returned with an error. This means that in this example, the salaried employees with the salary flag ‘True’ will return an error, whereas the records of employees with the salary flag ‘False’ will pass the data quality rule.
To do this, drag the Data Quality Rules object from the Data Profiling section in the Toolbox and drop it onto the dataflow designer.
Now, map the SalariedFlag field to the Data Quality Rules object.
Right-click on the Data Quality Rules object and select Properties from the context menu.
This will open a new window. This is the Layout Builder, where you can see the ‘SalariedFlag’ field we have mapped from our source.
Click Next to proceed to the Data Quality Rules window.
Once a new rule is added, the options on this window will activate and the rule will be added to the grid.
Let’s explore these options one by one:
Description: The Description field contains the name or description of the rule. By default, the rules are termed as Rule1, Rule2 and so on, depending on the number of rules you add. But you can also rename the rules for better understanding and convenience.
In our case, as we want to set a data quality criteria to identify non-salaried employees, we can rename the rule as “NonSalariedEmployeesRule.”
Attach rule to the field: This is a drop-down list from which you can attach a rule to a particular field. You can see that there is a root node named Data Quality Rules.
Listed within the Data Quality Rules node are the fields mapped to the Data Quality Rules object. Here we have only one field mapped to which we want to apply this rule. In case you want to apply a rule to the whole dataset, you can simply double-click on the Data Quality Rules root node and the rule will be applied to all fields mapped to the Data Quality Rules object.
In this case, we will map the rule to the SalariedFlag field.
Expression box: This is where you can type in the expression for your rule.
In this example, we want to validate records with the Salary Flag ‘False.’ To do this we will write the expression:
‘SalariedFlag = 0’ in the Expression field.
Observe that Astera simultaneously shows the compile status of your expression below the expression box.
It says ‘Successful,’ so we can click OK. If the expression is incorrect, it will give you an error instead, and you will have to correct the expression before clicking OK.
Show Message: We can also write a message to show up with the errors, which can also be written to the error log. Let’s write a message:
‘Salaried employees are not eligible for overtime.’
This message will help identify why a particular record was marked erroneous. And in case multiple rules are applied, the message will point out which rule a particular record failed to satisfy.
Next, we have two checkboxes:
Active – to activate a rule.
Is Error – when this is checked, all records that return an error will not be written to a target, which means that only the records that have passed the data quality rule will flow further in the dataflow pipeline.
However, if we uncheck this option, the Warning checkbox will automatically be checked. Records that fail to match the rule will then be returned with a warning and will still be written to the target.
In this case, let’s keep the errors as errors by checking the Is Error box.
Now we have set up a data quality rule.
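Conceptually, the rule behaves like the following Python sketch (illustrative only, not Astera’s implementation): records that satisfy SalariedFlag = 0 pass, and the rest are flagged as errors with the message we defined.

```python
# Illustrative sketch only: records satisfying the rule pass, the rest are flagged
# as errors with an attached message (mirroring the SalariedFlag = 0 rule).
RULE_MESSAGE = "Salaried employees are not eligible for overtime."

def apply_rule(record):
    if record["SalariedFlag"] == 0:          # non-salaried: passes the rule
        return record, None
    return record, RULE_MESSAGE              # salaried: marked as an error

for rec in [{"EmployeeID": 1, "SalariedFlag": 0}, {"EmployeeID": 2, "SalariedFlag": 1}]:
    record, error = apply_rule(rec)
    print(record, "->", "OK" if error is None else f"Error: {error}")
```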
Now, let’s look at the preview. Right-click on the Data Quality Rules object and select Preview Output from the context menu.
You can see that the records that matched the rule (those with a ‘False’ salary flag) have been validated. On the other hand, the records that failed to match the rule (those with a ‘True’ flag) have returned an error, denoted by a red warning sign.
If you move the cursor over this warning sign, it will show the error message in the tooltip. This is especially useful in cases where you have applied more than one rule and you want to track which records have failed to match which rule or when you want to store the erroneous records in an error log.
So now that we have validated the records against our data quality rule, we can map it to a target, which is a Delimited File Destination in this case. We will name this file ‘Employees eligible for overtime,’ so the records of employees with the ‘False’ salaried flag will pass through the Data Quality Rules object and consequently be mapped to the destination file. Let’s do the mapping.
Now, if we open the Properties window of the destination file, you can see the Do Not Process Records With Errors option on the last window. It is checked by default in all target formats in Astera. Therefore, when we run this dataflow, all records that matched the data quality rule will be written to the destination file, whereas records that failed to match the rule and returned an error will be omitted.
The Delimited Parser in Astera reads and processes a single stream of text in delimited format as input and returns its elements as parsed output. It enables users to transform otherwise semi-structured data into a structured format.
In this document, we will learn to use the Delimited Parser to parse an incoming text stream in Astera.
In this case, we are using the Delimited File Source to extract our source data. You can download this sample data from the following link:
The source file contains customers’ contact information including their name, address, postal code, phone number, etc.
Upon previewing the data, you can see that it is difficult to decipher fields and elements, since the data is in a single text stream with fields and records separated by delimiters. To make sense of this data, each record needs to be parsed into its elements in respective fields.
To do this, we will use the Delimited Parser object.
To get the Delimited Parser object, go to Toolbox > Text Processors > Delimited Parser and drag-and-drop the object onto the designer.
You can see that the dragged object contains a single Text field.
Map the Customer_Info field inside the source object onto the Text field inside DelimitedParser object.
Right-click on the object’s header and select Properties.
A configuration window will open as shown below.
Let’s look at the properties on this window.
Parse Data Pattern – Contains three patterns in which the dataset can be parsed:
Single Record – Data is parsed into a single record with multiple fields. Users need to provide a field delimiter, and a text qualifier, if necessary.
Multiple Records – Data is parsed into multiple records with a single or multiple fields. Users need to provide a field delimiter as well as a record delimiter.
Field Arrays – Data is parsed into an array of records and fields. Users need to provide a field value delimiter and an array separator.
The source data in this case contains multiple records with many different fields. Therefore, we will set the Parse Data Pattern option to Multiple Records.
Provide a Field Delimiter and a Record Delimiter. Since the source file also contains a Text Qualifier, specify it as well.
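For intuition, the Multiple Records pattern behaves roughly like the following Python sketch using the standard csv module (illustrative only; the delimiters, qualifier, field names, and sample stream are assumptions, not taken from the actual sample file).

```python
import csv, io

# Illustrative sketch only: the "Multiple Records" pattern, with an assumed field
# delimiter (,), record delimiter (newline), and text qualifier (").
stream = ('ALFKI,"Maria Anders","Obere Str. 57",Berlin,12209\n'
          'ANATR,"Ana Trujillo","Avda. de la Constitucion 2222",Mexico D.F.,05021')
field_names = ["CustomerID", "ContactName", "Address", "City", "PostalCode"]

reader = csv.reader(io.StringIO(stream), delimiter=",", quotechar='"')
records = [dict(zip(field_names, row)) for row in reader]
for record in records:
    print(record)
```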
Click Next. This is the Layout Builder screen.
Here, write the names of the fields that you want to create.
Click OK. The Delimited Parser object now has new fields in the Output node.
To preview data, right-click on the object’s header and select Preview Output from the context menu.
A Data Preview window will open. Upon expanding the records, you can view the parsed output.
To store this parsed output, you can write it to a destination file or use it for some transformation further in the dataflow.
This concludes using the Delimited Parser in Astera.
The Delimited Serializer converts a structured dataset into a single text stream with fields and records separated by delimiters or identified by text qualifiers. Serialized data with delimiters can be shared or stored in a form that allows recovery of its original structure.
In this document, we will learn how to use a Delimited Serializer to serialize structured data in Astera.
In this case, we are using the Customers table from the Northwind database. You can download this sample data from the following link:
The source file contains customers’ contact information, including their ContactName, Address, PostalCode, Phone, etc., in a structured format.
We want to convert the information contained in multiple fields into a single text stream separated by a delimiter.
To perform this task, we will use the Delimited Serializer object in Astera.
To get the Delimited Serializer object, go to Toolbox > Text Processors > Delimited Serializer and drag-and-drop the object onto the designer.
You can see that the dragged object contains a Text field with an output port and an Input sub-node which is currently empty.
Auto-map source fields by dragging and dropping the top node of the source object, that is Customers, onto the Input node of the transformation object – Delimited Serializer.
Right-click on the object’s header and select Properties.
A configuration window will open as shown below.
Let’s look at the properties on this window.
Field Delimiter – Allows users to specify a delimiter for the source fields from the dropdown list.
Text Qualifier – Allows users to specify qualifiers at the start and end of a text stream. In most cases, the text qualifier encloses an entire record.
Build Operation Type – Contains two options in which a dataset can be serialized:
One Record Per Input – creates a single text record separated by delimiters for the entire data set.
One Record Per Transaction – creates as many text records as there are records in the source file, separated by the field delimiter only.
Let’s leave the properties as default, and click OK. The data has been serialized.
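The difference between the two build operation types can be pictured with a short Python sketch (illustrative only; the sample records, qualifier handling, and the separator used between records in the One Record Per Input case are assumptions):

```python
# Illustrative sketch only: serializing structured records into delimited text.
records = [
    {"ContactName": "Maria Anders", "City": "Berlin", "Phone": "030-0074321"},
    {"ContactName": "Ana Trujillo", "City": "Mexico D.F.", "Phone": "(5) 555-4729"},
]

def serialize(record, field_delimiter=",", qualifier='"'):
    return field_delimiter.join(f"{qualifier}{v}{qualifier}" for v in record.values())

# One Record Per Transaction: one text record per source record.
per_transaction = [serialize(r) for r in records]

# One Record Per Input: the whole dataset collapsed into a single text record
# (the separator between records here is an assumption).
per_input = ",".join(per_transaction)

print(per_transaction)
print(per_input)
```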
To preview the data, right-click on the Delimited Serializer object’s header and select Preview Output from the context menu.
A Data Preview window will open, showing the serialized data with field delimiters.
To store this serialized output, write it to a destination file or you can use this data further in the dataflow.
This concludes using the Delimited Serializer in Astera.
The Fixed Length Serializer is useful when source data is stored in multiple fields and needs to be converted to text format (the entire data stored in a single field). This is helpful in scenarios where you have to:
Store large datasets, so you compress the information in a single field
Transport volumes of data over a network
In this document, we will learn how to use Fixed Length Serializer in Astera.
The source file contains customers’ ContactDetails.
To preview this data, right-click on the source object’s header, and select Preview Output from the context menu.
A Data Preview window will open, displaying the source data.
Now, we want to convert the information contained in multiple fields into a single field in text format.
To perform this task, we will use the Fixed Length Serializer object in Astera.
To get the Fixed Length Serializer object, go to Toolbox > Text Processors > Fixed Length Serializer, and drag-and-drop the object onto the designer.
You can see that the dragged object contains a Text field and an Input sub-node, which is currently empty.
Auto-map source fields by dragging-and-dropping top node of the source object, ContactDetails, onto the Input node of the transformation object – Fixed Length Serializer.
Right-click on the object’s header and select Properties.
A configuration window will open, as shown below.
Check the options according to the contents and format of your source file.
In this example, both options have been checked, as the first row in the source file contains a header, and the data also contains multiple records.
Specify the Record Delimiter of the source data. In this case, it is <CR><LF>.
You can specify any delimiter based on the format of your source file.
Under the Builder Options, select the relevant type from the drop-down list. For this dataset, we will use One Record Per Input.
Click Next. This will open the Destination Fields screen, where you can see all the incoming fields from the source object.
Here you can delete fields, change their Header or Data Type, modify the string’s Start Position, define field Length, and customize the fields according to your requirements.
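Under the hood, fixed-length serialization amounts to padding (or truncating) each field value to its defined length and concatenating the results into one text string. A minimal Python sketch of that idea (illustrative only; the layout and record are hypothetical):

```python
# Illustrative sketch only: serializing fields into a fixed-length string by padding
# or truncating each value to its defined length.
layout = [("Name", 20), ("City", 15), ("ZipCode", 5)]

def serialize_fixed_length(record):
    return "".join(str(record[field]).ljust(length)[:length] for field, length in layout)

# Pads Name to 20, City to 15, and ZipCode to 5 characters, then concatenates them.
print(serialize_fixed_length({"Name": "Maria Anders", "City": "Berlin", "ZipCode": "12209"}))
```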
Click OK.
Right-click on the Fixed Length Serializer object, and select Preview Output.
A Data Preview window will open, displaying the serialized data.
To store this serialized output, write it to a destination file.
Right-click on the FixedLengthSerializer node, and go to Write to > Fixed Length File Destination. A Fixed Length File Destination object is added to the dataflow designer with a Text field auto-mapped to it.
Click on the Start Dataflow icon, located in the toolbar at the top, to create the destination file.
A Fixed Length File Destination file will successfully be created. You can find its link in the Job progress window.
We have added Language Parser functionality to the Expression Builder to enable advanced expressions such as interpolated strings and verbatim strings. This enables users to compile different values from incoming datasets into an expression and present it as an interpolated string in the output.
In this article, we will look at a use case and understand how the Language Parser works in Astera.
We have some EmployeeData from a fictitious organization stored in an Excel spreadsheet. We will work with that data to create:
An interpolated string
A verbatim string
We will use the fields in the EmployeeData Excel sheet and interpolate the data into an EmployeeInfo string. After string interpolation, we will create a new field to apply escape characters inside an Expression transformation object through a verbatim string.
Astera also provides extensive and detailed error information for issues that may occur during data parsing and string interpolation. We will use some examples to see how error information works in Astera.
Follow through the steps below to see how it works:
Retrieve the source data.
Next, we will drag-and-drop the Expression Transformation object to the designer, to create string expressions. Map fields from the Excel source object (EmployeeData).
Right-click on the Expression transformation object and select Properties. A Layout Builder will open, where you can use the Expression Editor to define the expressions for each field.
Let’s discuss how to create an interpolated string in an Expression transformation object.
Create a new field in the Layout Builder. Since this field will only return an output, we will check the Output box and set the data type as ‘String.’ In this case, we have named the new field as EmployeeInfo.
In the Expression Editor, we will define the Interpolated String Expression for the new field. You can either write the expression in the given field or click the ellipses in the left corner of the field to go to the Expression Editor.
The interpolated string expression will be built in the Expression box.
Here, we will create an interpolated string using the EmployeeID, LastName, FirstName, Title, TitleOfCourtesy, HireDate, and City fields from our source data.
The interpolated string expression in this case will be written as:
During both the preview and the runtime, the variables or parameters enclosed in {} will take input from the source data and return the output for each record in the new EmployeeInfo field. Here’s an example of the parameters and the source values for the first record:
Hence, the output of the interpolated string for the first record will return as: “Ms. Nancy Davolio from Seattle was hired on 01/04/06 and works as Sales Representative.”
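As an analogue, the same interpolation can be sketched with a Python f-string (illustrative only; this is not Astera’s expression syntax, and the record values below simply mirror the first record’s values listed in the table at the end of this section):

```python
# Illustrative sketch only: a Python f-string analogue of the interpolated
# EmployeeInfo expression, using the first record's values from the source data.
record = {
    "TitleOfCourtesy": "Ms.", "FirstName": "Nancy", "LastName": "Davolio",
    "City": "Seattle", "HireDate": "01/04/06", "Title": "Sales Representative",
}

employee_info = (
    f"{record['TitleOfCourtesy']} {record['FirstName']} {record['LastName']} "
    f"from {record['City']} was hired on {record['HireDate']} "
    f"and works as {record['Title']}."
)
print(employee_info)
# Ms. Nancy Davolio from Seattle was hired on 01/04/06 and works as Sales Representative.
```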
Now, if we look at the preview, you can see that a new field ‘EmployeeInfo’ has been created and the interpolated string output for each record has been returned in that field.
Let’s go over how to use a Verbatim String with a line break in an Expression transformation object.
Create a new field in the Layout Builder for the verbatim string. Since this field will only give an output, we will check the Output box and set the data type as String. In this case, we will name the new field as Office.
In the Expression field of the Layout Editor, we will define the verbatim string expression for that field. You can either write the expression in the corresponding expression field or click on the ellipses in the left corner of the field to enter the Expression Editor.
The verbatim string expression will be built in the Expression box on the Expression Builder screen.
It is important to note that a verbatim string only works with an at sign (@) before the expression and returns the value as-is throughout the output field.
In this example, we want the following output to be returned in the Office field that we created:
Output:
Office
North Street, 301, San Francisco
The verbatim string expression to achieve this output will be written as:
@”Office
North Street, 301, San Francisco”
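For comparison, the closest Python analogue is a multi-line string literal, which likewise preserves the line break exactly as written (illustrative only; the @ prefix is specific to the verbatim syntax described above):

```python
# Illustrative sketch only: a multi-line string preserves the line break as-is,
# similar to how the verbatim string is returned unchanged in the Office field.
office = """Office
North Street, 301, San Francisco"""
print(office)
```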
The output preview will show the verbatim string when you hover over values in the verbatim field, or you can export it to any destination to view the data.
The error information feature in Astera has now been improved to provide more comprehensive information about errors to users in real time. The information it provides is now more precise and specific, which helps in debugging errors as you build expression logic.
Let’s work on a few examples to explore how the error information functionality has been improved and how it can help in resolving errors and successfully compiling expressions in the Expression Builder.
For an incomplete expression, the user is instantly notified that the expression status is Not Compiled, along with a message explaining that there is an unexpected token and that a further expression is expected for successful compilation.
‘Unterminated String Literal’ means that a string variable is not closed properly, either because of an un-escaped character in it or because of a line break. Hence, the expression won’t compile until the string is closed.
Successful compilation after fixing the error:
The string has now been closed, and the Compile Status has been updated to ‘Successful’.
The ‘Invalid Identifier’ error message appears when a field name is not valid. If you look at the given object, there is no field by the name TitleCourtesy; hence the invalid identifier error. The available field is TitleOfCourtesy.
This concludes using the Language Parser functionality in Astera.
To learn how you can configure a Database Table Source object, click .
Learn how to configure settings for Excel Workbook Destination from .
In this case, we have a simple dataflow designed to perform a data quality check. It contains customers’ data coming in from an . A object is added to validate data for null values and perform warning checks.
To activate this feature, click on the Data Quality Mode icon located at the top of the dataflow designer.
Configure settings for the to save this data.
In this case we have a simple dataflow performing a data quality check process. It contains a Customers dataset stored in an . Then, a is applied to validate data for error and warning checks and finally, data is written to a .
A is applied to identify null records in the Region field as errors, and empty records in the Fax field as warnings. Upon previewing its output you will see that the records that failed to match the rule have returned an error, denoted by a red warning sign.
After configuring settings for the Log object, click on the Start Dataflow icon from the toolbar located at the top of the window. A Job Progress window will open at this instant and will show you the trace of the job.
Here, we will set rules or the data quality criteria. Click this button to add a new rule.
Or you can click this button to enter the Expression Builder window where you can choose an expression from Astera's library of built-in expressions, or you can write one of your own.
You can add as many rules as you want by clicking this button. Similarly, you can also delete a rule by pointing to it in the grid and selecting right-click > Delete. In this example, we will work with the single rule that has already been set, so let’s go ahead and click OK.
The records that fail to match the data quality rule can be written and stored in a separate error log. Click to learn how you can store erroneous records using a Record Level Log object.
In this case, we are using an .
Configure settings for the object.
Learn to configure settings for a Fixed Length File Destination from .
Note: In this case, the source data of Employees is stored in an Excel file. We will extract it using the object in a dataflow.
| Field Name | Source Value |
| --- | --- |
| TitleOfCourtesy | Ms. |
| FirstName | Nancy |
| LastName | Davolio |
| City | Seattle |
| HireDate | 01/04/06 |
| Title | Sales Representative |