The secret to enterprise success lies in the expansion of business networks in a way that maximizes operational efficiency. These networks include suppliers, resellers, vendors, and countless other business partners that open the door to vast markets that are inaccessible through in-house logistics alone. However, to secure these advantages, the external data coming in from these partners must be integrated through an efficient, reliable, and scalable process. With Astera Data Stack, you can achieve exactly that.
In this article, we will be looking at a detailed use case that showcases the effectiveness of Astera Data Stack in Partner Onboarding.
Wheel Dealers, an online automotive marketplace, will be the subject of our discussion. The Wheel Dealers portal provides customers with a comprehensive list of available cars from multiple dealerships, with added options to compare features, prices, and other services. The company currently serves over one million customers and acquires data from over 200 dealerships across New England and the Mid-Atlantic, with tentative plans to add a thousand new vendors from the Midwest and Northeast.
At the moment, Wheel Dealers runs a manual ETL process that requires ample time and resources. Moreover, with major expansion plans in the pipeline, this approach will not suffice in the near future. Using Astera Data Stack, we will be designing an automated data integration solution that extracts data from disparate sources with minimal manual intervention, thus providing Wheel Dealers with a seamless Partner Onboarding experience.
This implementation of an ETL process will employ numerous items and features available in Astera Data Stack, such as the following:
Dataflow
Excel Workbook Source
Variables
Data Cleanse Transformation
Data Quality Rules
Database Table Destination
Record Level Log
Workflows
File Transfer Action
List FTP Directory Contents
Expression
Run Workflow
Context Information
Run Dataflow
Job Scheduler
A Dataflow will be used to create the ETL process, whereas two Workflows and the Job Scheduler will be used to orchestrate it.
Let’s take a detailed look at the methodologies being used.
All of the items being used will be consolidated in one Project. To open a new Integration Project in Astera Data Stack, go to Project > New > Integration Project.
You can add new items to the Project by right-clicking the Project tab and selecting Add New Item.
As mentioned earlier, we will be creating this ETL Process in one Dataflow. The Astera Data Stack Toolbox is packed with various source, destination, and transformation objects that can serve numerous different purposes. For Wheel Dealers, we want to design a system that captures data, presents it in a refined manner, and uploads it to the dealer database.
For this demonstration, we will be using an Excel Workbook Source object to capture the partner data. We are assuming that all of the incoming data is in an Excel format. However, any other source object could have been used depending on the nature of the incoming files.
While the actual source files will be coming in from an FTP server, we will configure the source object through a template file. This will populate the object with all of the required fields.
To ensure that the source object picks up the files that are received from the FTP server, we will define a different file path at runtime by using the Config Parameters screen in the configuration window.
First, create a file path variable using the Variables object from the Toolbox.
Enter this variable as a parameterized value in the FilePath section of the Config Parameters screen.
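Conceptually, this works like a function parameter with a default value: the template path configures the object at design time, and the runtime path replaces it when the flow is executed. Below is a minimal Python sketch of the idea, with hypothetical file paths, offered only as an analogy:

```python
from pathlib import Path
from typing import Optional

TEMPLATE_PATH = Path("templates/partner_template.xlsx")  # hypothetical design-time template

def resolve_source_path(runtime_path: Optional[str] = None) -> Path:
    """Return the path the Excel source should read.

    A blank runtime value falls back to the design-time template, mirroring
    how an empty Config Parameter falls back to the object's configured path.
    """
    return Path(runtime_path) if runtime_path else TEMPLATE_PATH

# At runtime, the orchestration layer supplies the downloaded file's path:
print(resolve_source_path("downloads/dealer_42.xlsx"))
```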
For more information on how you can configure an Excel Workbook Source object, click here.
To refine and cleanse Wheel Dealers’ data, we will be using two separate transformations.
Sort Transformation will be used to sort the data in ascending order.
A simple sort will make the data easier to query and analyze, which will consequently reduce the burden on the integration server.
For more information on how you can configure and use the Sort Transformation object, click here.
Data Cleanse Transformation will be used to further refine the data by removing unwanted characters and spaces.
The data received by Wheel Dealers is usually well-formatted. Therefore, we will only be using this transformation to remove leading and trailing whitespaces.
For more information on how you can use the Data Cleanse Transformation object, click here.
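For readers who think in code, the combined effect of the two transformations is roughly the following pandas sketch; the column names are hypothetical, and pandas is used here only as an analogy, not as part of the product:

```python
import pandas as pd

# Hypothetical partner data with stray whitespace.
df = pd.DataFrame({
    "DealerName": ["  North End Motors", "Bay Auto  "],
    "Price": [18999, 15499],
})

# Sort Transformation: sort records in ascending order (here, by price).
df = df.sort_values(by="Price", ascending=True)

# Data Cleanse Transformation: strip leading and trailing whitespace.
df["DealerName"] = df["DealerName"].str.strip()

print(df)
```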
Wheel Dealers needs a method to fix the errors present in the partner data and integration flows. For this purpose, we will be using the Data Quality Rules and Record Level Log objects. Moreover, the final data will be uploaded to the dealer database by using a Database Table Destination object.
The Data Quality Rules object can be used to define conditions, known as data quality rules, that incoming records must satisfy. Records that do not meet these conditions are flagged as errors or warnings.
In this case, we will be applying two data quality rules.
Rule 1: The mileage of a used car cannot be null.
Rule 2: The transmission cannot be null.
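In code terms, the two rules above amount to a null check on each record. Here is a rough pandas sketch; the column names are hypothetical but follow the rules above:

```python
import pandas as pd

# Hypothetical partner records; column names are illustrative only.
records = pd.DataFrame({
    "VIN": ["VIN-001", "VIN-002", "VIN-003"],
    "Mileage": [42000, None, 73500],
    "Transmission": ["Automatic", "Manual", None],
})

# Rule 1: the mileage of a used car cannot be null.
# Rule 2: the transmission cannot be null.
records["RuleViolation"] = (
    records["Mileage"].isna() | records["Transmission"].isna()
)

# Flagged rows are the ones that would surface as errors in the Record Level Log.
print(records[records["RuleViolation"]])
```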
For more information on how to use the Data Quality Rules object, click here.
Next, we will use a Database Table Destination object to access the dealer database and upload the dataset to it. For information on how to configure and use a Database Table Destination object, click here.
Notice that some of the element names in the Data Quality Rules object are inconsistent with those in the database. With an ample amount of partner data coming in from disparate sources, discrepancies are bound to occur. Despite these discrepancies, Astera Data Stack has still managed to match some of the elements with their counterparts due to naming similarities, for example, ID and Dealer_ID. However, there are still a few elements that have not been matched with any others, such as MethodOfPayment, SP, DateOfOrder, and DateOfDelivery. For these, we will be using the Synonym Dictionary.
Add a Synonym Dictionary file to the Project and create alternatives for element names, as needed.
Press Shift and map the parent node of the Data Quality Rules object to that of the Database Destination object.
For more information on the Synonym Dictionary feature, click here.
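A synonym dictionary is essentially a lookup of alternative element names. The sketch below illustrates the idea in Python; the destination column names are hypothetical, while the source names are the unmatched elements listed above:

```python
# A synonym dictionary maps source element names to their destination counterparts.
# Destination names below are hypothetical and used only for illustration.
SYNONYMS = {
    "MethodOfPayment": "PaymentMethod",
    "SP": "SalePrice",
    "DateOfOrder": "OrderDate",
    "DateOfDelivery": "DeliveryDate",
}

def resolve(source_field: str) -> str:
    """Return the destination name for a source element, if a synonym exists."""
    return SYNONYMS.get(source_field, source_field)

print(resolve("MethodOfPayment"))  # -> PaymentMethod
print(resolve("ID"))               # unchanged; already matched by naming similarity
```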
Finally, a Record Level Log will be used to flag the errors that do not meet the data quality rules that were defined earlier.
Once you preview the output, you can see that the relevant erroneous records have been indicated.
Wheel Dealers’ business analysts can investigate this further and follow up with the concerned dealership.
For more information on how you can use the Record Level Log object, click here.
The ETL process has now been created. This is what the final Dataflow should look like:
The orchestration of this ETL process requires two main Workflows, an inner Workflow, and the Job Scheduler. Wheel Dealers wants to implement self-operating and self-regulating processes wherever possible, so the purpose here is to automate the entire system. Workflows will be used to coordinate how the integration flows are executed; Workflow 1 will collect newly uploaded documents from Wheel Dealers’ FTP server and download them to their local network, whereas Workflow 2 will dynamically pick up the input files and feed them to the Dataflow we have designed. The Job Scheduler will be used to schedule Workflow runs.
Workflow 1 will consist of a main Workflow and another inner Workflow. The former’s job will be to fetch new files from the FTP server, whereas the latter will download this new data to the local directory while deleting older files to ensure that only fresh partner files are downloaded during each run.
Inner Workflow
First, we will create a file path variable using the Variables object from the Toolbox. This will be used to specify the path of the files coming in from the FTP server.
The File Transfer object in the Workflow Tasks section of the Toolbox will be used to download and delete the files. Two objects, one for each purpose, will be added to the Workflow designer.
File Transfer Task 1: Downloading files.
File Transfer Task 2: Deleting files.
Connect the File Transfer objects to run sequentially so that the older files are deleted only after the new files have been downloaded.
Map the file path variable onto the RemotePath input element of both File Transfer Task objects. As a result, this is what the final inner Workflow should look like:
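For readers who prefer code, here is a minimal sketch of what the two sequential File Transfer tasks accomplish for a single remote path, using Python's ftplib. The host, credentials, and folder names are hypothetical, and we assume the delete acts on the remote copy once it has been downloaded:

```python
from ftplib import FTP
from pathlib import Path

def fetch_and_remove(remote_path: str, local_dir: str = "partner_files") -> None:
    """Download one remote file, then delete it on the server, mirroring the
    two sequential File Transfer tasks in the inner Workflow."""
    local_file = Path(local_dir) / Path(remote_path).name
    local_file.parent.mkdir(parents=True, exist_ok=True)

    with FTP("ftp.example.com") as ftp:            # hypothetical host
        ftp.login("wd_user", "wd_password")        # hypothetical credentials
        with open(local_file, "wb") as fh:
            ftp.retrbinary(f"RETR {remote_path}", fh.write)   # Task 1: download
        ftp.delete(remote_path)                    # Task 2: delete the older copy
```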
Main Workflow
First, we will use the List FTP Directory Contents object, present in the Sources section of the Toolbox, to extract data from the FTP server. The object will be used to specify the FTP Connection and the remote path where our required files are present.
Notice the S on the top-left of the object header. This denotes ‘Singleton’, which means that the object will only work in a single instance and will thus only retrieve one file from the FTP server. However, our purpose is to retrieve all of the new files present. For this to happen, right-click on the object header and select Loop from the context menu.
A Loop icon will now appear on the top-left of the object header.
The Expression object will be used to concatenate the file names coming in from the FTP server with the path of the directory, thus creating a complete path for each file in every loop iteration.
First, map the FileName element from the List FTP Directory Contents object to the FileName element in the Expression object.
A new element by the name of FilePath will be created in the Expression object. Using the Expression Builder in the Layout Builder screen, concatenate the directory path with the file name. This will dynamically update the file path as new files are fetched from the FTP server.
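For intuition, here is a minimal Python sketch of the same listing-and-concatenation step, using ftplib with a hypothetical host and directory; the complete path built in each iteration is what gets handed to the inner Workflow in the next step:

```python
from ftplib import FTP
from posixpath import join as remote_join

REMOTE_DIR = "/incoming/dealers"            # hypothetical FTP directory

with FTP("ftp.example.com") as ftp:         # hypothetical host and credentials
    ftp.login("wd_user", "wd_password")
    # List FTP Directory Contents: fetch the names of all new partner files.
    file_names = ftp.nlst(REMOTE_DIR)

# Loop mode: build a complete path for each file, mirroring the Expression
# object's concatenation of the directory path and the file name.
for name in file_names:
    # Some servers already return full paths from NLST; adjust the join accordingly.
    file_path = remote_join(REMOTE_DIR, name)
    print(file_path)   # handed to the inner Workflow (Run Workflow) in the next step
```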
The next step is to load the fetched files into the inner Workflow so that they may be downloaded or deleted. For this, we will be using the Run Workflow object from the Workflow Tasks section of the Toolbox.
Configure the Run Workflow object by providing the file Path for the inner Workflow.
The FP element in the Run Workflow object denotes the file path variable that we had defined in the inner Workflow. Map the FilePath element from the Expression object onto FP in the Run Workflow object.
In the inner Workflow, this variable has been mapped onto two File Transfer objects. As a result, their respective actions will take place on the files that are being fetched from the FTP server in a loop.
This is what the final main Workflow should look like:
Workflow 2 will feed the downloaded partner files to the integration flow that we had designed earlier for cleansing, validating, and consolidating data.
The Context Information object in the Resources section of the Toolbox will be used to dynamically provide our Dataflow with the dropped file path, i.e., the complete path of each file that is downloaded from the FTP server and dropped into a particular folder.
This object, paired with the Job Scheduler, will automate the process of running this Workflow whenever a new partner file is downloaded.
The Run Dataflow object in the Workflow Tasks section of the Toolbox will be used to run our Dataflow whenever a new partner file is downloaded.
Configure the Run Dataflow object by providing the file Path to our Dataflow.
The FilePath input element in the Run Dataflow object denotes the file path variable that we had defined in the Dataflow. Map the DroppedFilePath parameter from the Context Information object onto FilePath in the Run Dataflow object. As a result, this is what Workflow 2 should look like:
The FilePath variable has been used to dynamically provide the source object in our Dataflow with a file path. Consequently, whenever a new file is downloaded from the FTP server, a new file path will be fed to the source object through this mapping.
As mentioned earlier, Wheel Dealers wants to automate its data integration process wherever possible. The Job Scheduler in Astera Data Stack can be used to schedule any job to run at specific intervals or whenever a file is dropped into a folder, without any manual intervention. In this case, we will be making two schedules – one for each Workflow.
For detailed guidelines on how to open and use the Job Scheduler, click here.
Scheduling Workflow 1
For Workflow 1, we will assume that partner files come in twice a week, on Mondays and Wednesdays. Thus, Workflow 1 will be scheduled to run every week on those two days.
Scheduling Workflow 2
The files being fetched from the FTP server are added to a specific folder. Therefore, we will schedule Workflow 2 to run whenever a new file is dropped into that folder.
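Conceptually, the file-drop trigger behaves like a folder watcher. The Job Scheduler provides this natively, but as a rough illustration, here is a sketch using the third-party watchdog library with a hypothetical drop folder:

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

WATCH_DIR = "downloads/partner_files"   # hypothetical drop folder

class DropHandler(FileSystemEventHandler):
    def on_created(self, event):
        # A new file landing in the watched folder plays the role of the
        # "file dropped" trigger that starts Workflow 2.
        if not event.is_directory:
            print(f"New partner file dropped: {event.src_path}")

observer = Observer()
observer.schedule(DropHandler(), WATCH_DIR, recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```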
Everything is now in place, and Wheel Dealers can rely on this automated ETL Process to handle their Partner Onboarding needs. With Astera Data Stack, most of the tasks associated with Partner Onboarding can be automated, significantly cutting down on the time and resources required.
This concludes creating and orchestrating an ETL process for Partner Onboarding in Astera Data Stack.
Establishing a data warehousing system infrastructure that enables you to meet all your business intelligence targets is by no means an easy task. With Astera Data Stack, you can cut down the numerous standard and repetitive tasks involved in the data warehousing lifecycle to just a few simple steps.
In this article, we will examine a use case that describes the process of building a data warehouse with a step-by-step approach using Astera Data Stack.
Shop-Stop is a fictitious online retail store that currently maintains its sales data in a SQL database. The company has recently decided to implement a data warehouse across its enterprise to improve business intelligence and gain a more solid reporting architecture. However, their IT team and technical experts have warned them about the substantial amount of capital and resources needed to execute and maintain the entire process.
As an alternative to the traditional data warehousing approach, Shop-Stop has decided to use Astera Data Stack to design, develop, deploy, and maintain their data warehouse. Let’s look at the process we’d follow to build a data warehouse for them.
The first step in building a data warehouse with Astera Data Stack is to identify and model the source data. But before we can do that, we need to create a data warehousing project that will contain all the work items needed as part of the process. To learn how you can create a data warehousing project and add new items to it, click here.
Once we’ve added a new data model to the project, we’ll reverse engineer Shop-Stop’s sales database using the Reverse Engineer icon on the data model toolbar.
To learn more about reverse engineering from an existing database, click here.
Here’s what Shop-Stop’s source data model looks like once we’ve reverse engineered it:
Note: Each entity in this model represents a table that contains Shop-Stop’s source data.
Next, we’ll verify the data model to perform a check for errors and warnings. You can verify a model through the Verify for Read and Write Deployment option in the main toolbar.
For more information on verifying a data model, click here.
After the model has been verified successfully, all that’s left to do is deploy it to the server and make it available for use in ETL/ELT pipelines or for data analytics. In Astera Data Stack, you can do this through the Deploy Data Model option in the data model toolbar.
For more information on deploying a data model, click here.
We’ve successfully created, verified, and deployed a source data model for Shop-Stop.
The next step in the process is to design a dimensional model that will serve as a destination schema for Shop-Stop’s data warehouse. You can use the Entity object available in the data model toolbox and the data modeler’s drag-and-drop interface to design a model from scratch.
However, in Shop-Stop’s case, the destination schema has already been designed in a SQL database, so we’ve reverse engineered it as well.
Note: Each entity in this model represents a table in Shop-Stop’s final data warehouse.
Next, we’ll convert this model into a dimensional model by assigning facts and dimensions. The type for each entity, when a database is reverse engineered, is set as General by default. You can conveniently change the type to Fact or Dimension by right-clicking on the entity, hovering over Entity Type in the context menu, and selecting an appropriate type from the given options.
In this model, the Order and OrderDetails entities are the fact entities and the rest of them are dimension entities. To learn more about converting a data model into a dimensional model from scratch, click here.
Alternatively, you can use the Build Dimensional Model option to automate the process of dimensional modelling. For more information on using the Build Dimensional Model option, click here.
Here, we have used the Build Dimensional Model option with the following configurations:
Here is a look at the dimensional model created:
Once the dimensions and facts are in place, we’ll configure each entity for enhanced data storage and retrieval by assigning specified roles to the fields present in the layout of each entity.
For dimension entities, the Dimension Role column in the Layout Builder provides a comprehensive list of options. These include the following:
Surrogate Key and Business Key.
Slowly Changing Dimension types (SCD1, SCD2, SCD3, and SCD6).
Record identifiers (Effective and Expiration dates, Current Record Designator, and Version Number) to keep track of historical data.
Placeholder Dimension to keep track of early arriving facts and late arriving dimensions.
As an example, here is the layout of the Employee entity in the dimensional model after we’ve assigned dimension roles to its fields.
To learn more about dimension entities, click here.
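To give a feel for what the SCD roles listed above imply, here is a rough, hypothetical sketch of an SCD2 update for the Employee dimension. Astera handles this logic internally; the column names below are illustrative only:

```python
from datetime import date

# Hypothetical SCD2 dimension rows: surrogate key, business key, a tracked
# attribute, effective/expiration dates, and a current-record designator.
employee_dim = [
    {"EmployeeKey": 1, "EmployeeID": "E-100", "Title": "Sales Rep",
     "EffectiveDate": date(2020, 1, 1), "ExpirationDate": None, "IsCurrent": True},
]

def apply_scd2(dim, business_key, new_title, as_of):
    """Expire the current row for the business key and insert a new version."""
    for row in dim:
        if row["EmployeeID"] == business_key and row["IsCurrent"]:
            row["ExpirationDate"] = as_of
            row["IsCurrent"] = False
    dim.append({
        "EmployeeKey": max(r["EmployeeKey"] for r in dim) + 1,  # new surrogate key
        "EmployeeID": business_key,                             # same business key
        "Title": new_title,
        "EffectiveDate": as_of, "ExpirationDate": None, "IsCurrent": True,
    })

apply_scd2(employee_dim, "E-100", "Sales Manager", date(2024, 6, 1))
print(employee_dim)
```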
The fact entity’s Layout Builder contains a Fact Role column, through which you can assign the Transaction Date Key role to one of the fields.
Here is a look at the layout of the OrderDetails entity once we’ve assigned the Transaction Date Key role to a field:
To learn more about fact entities, click here.
Now that the dimensional model is ready, we’ll first verify it for forward engineering, then forward engineer it to the destination where Shop-Stop wants to maintain its data warehouse, and finally deploy it for further usage.
In this step, we’ll populate Shop-Stop’s data warehouse by designing ETL pipelines to load relevant source data into each table. In Astera Data Stack, you can create ETL pipelines in the dataflow designer.
Once you’ve added a new dataflow to the data warehousing project, you can use the extensive set of objects available in the dataflow toolbox to design an ETL process. The Fact Loader and Dimension Loader objects can be used to load data into fact and dimension tables, respectively.
Here is the dataflow that we’ve designed to load data into the Customer dimension table in the data warehouse:
On the left, we’ve used a Database Table Source object to fetch data from a table present in the source data model. On the right, we’ve used the Dimension Loader object to load data into a table present in the destination dimensional model.
You’ll recall that both models mentioned above were deployed to the server and made available for usage. While configuring the objects in this dataflow, we connected each of them to the relevant model via the Astera Data Model connection in the list of data providers.
The Database Table Source object was configured with the source data model’s deployment.
On the other hand, the Dimension Loader object was configured with the destination dimensional model’s deployment.
Note: ShopStop_Source and ShopStop_Destination represent the source data model and the dimensional model respectively.
We’ve designed separate dataflows to populate each table present in Shop-Stop’s data warehouse.
The dataflow that we designed to load data into the fact table is a bit different from the rest of the dataflows because the fact table contains fields from multiple source tables. The Database Table Source object that we saw in the Customer_Dimension dataflow can only extract data from one table at a time. An alternative is the Data Model Query Source object, which allows you to extract data from multiple tables in the source model by selecting a root entity.
To learn more about the Data Model Query Source object, click here.
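Conceptually, selecting a root entity pulls in its related tables, much like the joins in this rough pandas sketch; the table and column names are hypothetical and only loosely based on the entities mentioned above:

```python
import pandas as pd

# Hypothetical slices of the source tables related to the OrderDetails root entity.
orders = pd.DataFrame({"OrderID": [1, 2], "CustomerID": [10, 11]})
order_details = pd.DataFrame({"OrderID": [1, 1, 2], "ProductID": [5, 7, 5],
                              "Quantity": [2, 1, 4]})
customers = pd.DataFrame({"CustomerID": [10, 11], "City": ["Boston", "Albany"]})

# Choosing OrderDetails as the root entity effectively brings in its related
# tables, so the fact rows can carry fields from all of them.
fact_rows = (order_details
             .merge(orders, on="OrderID", how="left")
             .merge(customers, on="CustomerID", how="left"))

print(fact_rows)
```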
Now that all the dataflows are ready, we’ll execute each of them to populate Shop-Stop’s data warehouse with their sales data. You can execute or start a dataflow through the Start Dataflow icon in the main toolbar.
To avoid executing all the dataflows individually, we’ve designed a workflow to orchestrate the entire process.
To learn about workflows, click here.
Finally, we’ll automate the process of refreshing this data through the built-in Job Scheduler. To access the job scheduler, go to Server > Jobs > Job Schedules in the main menu.
In the Scheduler tab, you can create a new schedule to automate the execution process at a given frequency.
For a detailed guideline on how to use the job scheduler, click here.
In this case, we’ve scheduled the sales data to be refreshed daily.
Shop-Stop’s data warehouse can now be integrated with industry-leading visualization and analytics tools such as Power BI, Tableau, and Domo through a built-in OData service. The company can use these tools to effectively analyze their sales data and gain valuable business insights from it.
This concludes our discussion on building a data warehouse with a step-by-step approach using Astera Data Stack.
Astera gives users the ability to reuse a report model, i.e., the extraction template, for files of a similar layout. In this article, we will learn how to orchestrate the whole process.
A report model template contains the extraction logic to mine data from unstructured documents. The extraction process can be customized by the users via the properties and options available in Astera. To learn more about a report model, refer to this article.
Astera offers several methods for reusing a report model. These methods enable users to obtain meaningful data from multiple unstructured documents of a similar layout using the same report model, i.e., the extraction template. In this article, we will look at the techniques we can apply to achieve this goal.
By using a workflow with a File System Items Source object in Loop mode: As part of this technique, we will cover the creation of a workflow using a File System Items Source object in Loop mode as the source.
By using the Context Info object and the Job Scheduler: As part of this technique, we will cover the creation of a workflow using the Context Info object as the source. Moreover, we will also apply scheduling to the workflow.
In the following sections, let’s go over the whole process of how the above-mentioned techniques can be implemented.
It is standard practice to create a report model containing all the extraction logic and settings so that it can be applied to multiple files of a similar layout.
We are using an example extraction template. To learn more about how to extract data from an unstructured document, click here.
The example template we are using contains extraction logic to obtain details related to Accounts, the number of Orders placed by each account, and description of each Order. This is what the extraction template looks like:
To ensure better accessibility and manageability of the files, let’s create a project and add all the relevant documents to it.
To create a project, go to Project > New > Integration Project.
The file explorer will open. Navigate to the path where you want to save the project and enter a name for the project.
Right-click on the project in the Project Explorer panel and select the Add New Folder option.
Here, we have created a folder named Files to store the flows, a folder named Source to store the unstructured source documents, and a folder named Output to store the output files.
Right-click on the Files folder and select the Add New Item option from the context menu. Add a dataflow and a workflow to the Files folder.
Similarly, use Add Existing Items to add the unstructured source files to the Source folder and ‘SampleOrders.Rmd’ to the Files folder. ‘SampleOrders.Rmd’ is the extraction template.
We have successfully created a project. Now, let’s head towards designing the dataflow.
Double-click on the dataflow to open the empty designer.
Drag-and-drop the Report Source object onto the dataflow designer from Toolbox > Sources > Report Source. Right-click on the object’s header and select Properties from the context menu.
The Report Model Properties window will open. Here, you have to provide file paths for the Report Location and Report Model Location.
Report Location: File path of the unstructured document.
Report Model Location: File path of the extraction template.
Click OK to proceed.
To store the extracted data, drag and drop a destination object onto the dataflow designer. Here, we are using a Delimited Destination object. Right-click on the destination object’s header and select Properties from the context menu.
This is the Configuration window.
Here, specify the destination File Path and utilize the relevant configuration options available according to your needs.
Now, click OK.
Map the data fields from the Report Source object to the Delimited Destination object.
Now, for the dataflow to use multiple source files, you have to parameterize the source file path. Similarly, to provide a destination path for each source file, parameterize the destination file path. Parameterization will allow the source files and their respective destination files to be replaced at runtime.
In the following section, let’s see how we can achieve this.
Drag-and-drop the Variables object from Toolbox > Resources > Variables onto the dataflow designer.
Right-click on the Variables object’s header and select Properties from the context menu.
The Properties window will open.
You have to create two fields: one for the source file path, and the other for the destination file path. Set the Variable Type of both as Input and provide their file paths in the Default Value column.
Note: Default Value is optionally added to verify if the dataflow is accurately configured. At runtime, parameters passed as blank are replaced by the Default Value entry. In other words, the Default Value entry will only take effect when no other value is available.
Now, click OK.
Double-click on the Report Source object’s header and click Next. This is the Config Parameters window. Set the FilePath variable as the Value of the FilePath field.
Double-click on the destination object’s header and click Next till you reach the Config Parameters window. Set the FilePathDest variable as the Value of the DestinationFile field. Click OK.
We have now created the final dataflow. Let’s proceed towards designing the workflow.
Drag-and-drop the Run Dataflow object from Toolbox > Workflow Tasks > Run Dataflow onto the workflow designer.
Double-click on the Run Dataflow object’s header to open the Start Dataflow Job Properties window.
Specify the path of the dataflow that you want to execute in the Job Info section. To learn more about the Run Dataflow object, click here.
Now, click OK.
Drag-and-drop the File System Items Source object from Toolbox > Sources > File System onto the workflow designer.
Double-click on the object’s header. The File System Items Source Properties window will open.
Here, specify the path to the directory where the source files reside. You can add an entry to the Filter textbox if you want to read files of a specific format only. Additionally, you may choose to include items in subdirectories and/or include entries for the subdirectories themselves by checking the options at the bottom. To learn more about the File System Items Source object, click here.
Click OK to proceed.
Note: Here, the unstructured files in the Source folder have a “.txt” extension. Hence, only “.txt” files are being processed.
Map the FilePath data field of the File System Items Source object to the FilePath data field of the Run Dataflow object.
Now, let’s proceed towards the construction of the dynamic destination path.
Drag-and-drop the Constant Value transformation object from Toolbox > Transformation > Constant Value onto the workflow designer.
Double-click on the Constant Value object’s header. The Constant Value Map Properties window will open. Provide the path to the directory or folder where you want to store the output files. Click OK.
Go to Toolbox > Transformation > Expression and drag-and-drop the Expression transformation object onto the workflow designer.
Right-click on the Expression object’s header and select Properties from the context menu.
This will open the Layout Builder window. Here, you must create three data fields.
FileDirectory, set as Input.
FileName, set as Input.
FilePathDest, set as Output.
Click OK.
Note: Write the expression for the FilePathDest field as FileDirectory + FileName + “.csv”. At runtime, this expression will create the destination file path for each source file.
To learn more about how to utilize the Expression transformation object, click here.
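For intuition, the expression corresponds to a simple path-building function. Here is a hypothetical Python equivalent using pathlib (which also inserts the path separator); the directory and file name values are illustrative only:

```python
from pathlib import Path

def build_destination_path(file_directory: str, file_name: str) -> str:
    """Rough equivalent of the FileDirectory + FileName + ".csv" expression."""
    return str(Path(file_directory) / f"{file_name}.csv")

# FileDirectory comes from the Constant Value object; FileName comes from the
# File System Items Source object's FileNameWithoutExtension field.
print(build_destination_path("C:/Project/Output", "SampleOrders01"))
```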
To construct the dynamic destination path, the directory specified in the Constant Value object needs to be combined with the file name provided by the File System Items Source object, and the resulting file path needs to be mapped to the Run Dataflow object. To achieve this, define the field mappings of the objects as follows:
Map the Value field of the Constant Value object to the FileDirectory field of the Expression transformation object.
Map the FileNameWithoutExtension field of the File System Items Source object to the FileName field of the Expression transformation object.
Map the FilePathDest field of the Expression transformation object to the FilePathDest field of the Run Dataflow object.
After mapping the objects, this is what the final workflow looks like:
We have now created the final workflow. In the next section, let’s discuss the reusability techniques of the extraction template.
Let’s summarize what we have done so far. We have created a dataflow by using a Report Source object, Variables object, and a Delimited Destination object. The purpose of this dataflow is to apply the extraction logic on the unstructured document, parameterize the file paths using variables, and write the extracted data to a destination file.
We have also created a workflow and included objects such as File System Items Source, Constant Value, Expression transformation, and Run Dataflow. The purpose of this workflow is to execute the previously made dataflow and construct a dynamic destination path to store each source file at runtime.
Now, we will see how we can use the workflow and apply the extraction process to multiple unstructured files of a similar format.
Applying Looping on File System Items Source Object
Right-click on the File System Items Source object’s header and select Loop from the context menu.
Note: By selecting the Loop option, we are ensuring that the File System Items Source object iterates through the entire folder. This will enable us to provide multiple source files to the Run Dataflow object. By default, the selected Singleton option only picks the first file in the folder.
Link the File System Items Source object to the Run Dataflow object.
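In plain Python terms, Loop mode corresponds to iterating over every matching file in the Source folder and running the dataflow once per file. The sketch below uses hypothetical folder names and a placeholder function in place of the actual Run Dataflow task:

```python
from pathlib import Path

SOURCE_DIR = Path("Source")        # folder holding the unstructured .txt files
OUTPUT_DIR = Path("Output")        # folder for the extracted .csv output

def run_dataflow(file_path: str, file_path_dest: str) -> None:
    """Placeholder for the Run Dataflow task; prints instead of executing."""
    print(f"Extracting {file_path} -> {file_path_dest}")

# Loop mode: every .txt file in the folder is pushed through the same
# extraction template, instead of only the first one (Singleton behaviour).
for source_file in sorted(SOURCE_DIR.glob("*.txt")):
    destination = OUTPUT_DIR / f"{source_file.stem}.csv"
    run_dataflow(str(source_file), str(destination))
```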
Note: You can use Job Schedules on this workflow by following the steps defined under the Scheduling heading.
Instead of a File System Items Source object, you can use the Context Info object from Toolbox > Resources > Context Info to process a file whenever it is dropped at the path mentioned in the scheduled task.
Map the DroppedFilePath data field from Context Info object to the FilePath data field in the Run Dataflow object.
Go to Server > Job Schedules.
To add a new task for a schedule, click on Add Scheduler Task.
Specify the Name of the task and select the Schedule Type. In our case, we want to schedule a workflow, so we will select the File type for scheduling. Provide the path of the workflow in the File Location field. The remaining options, such as Server, Dataflow, Job, and Frequency, can be set according to your requirements.
Click on the drop-down menu of Frequency and select When File is Dropped. To learn more about each Frequency type, click here.
Here, provide the path of the directory you want the scheduler to watch for a file drop. You can use the File Filter option to process a specific file format only. Other options, including Watch Subdirectories, Process Existing Files on Startup, Rename File, and Use Polling, are available and can be used according to your requirements.
Save the task by clicking on the Save Selected Task icon in the top left corner.
To gain further insights on how to schedule a job on the server using Job Schedules, click here.
This is how you can automate the whole process of extracting data from multiple files with a similar layout using the same report model in Astera.