Astera represents a new generation of data integration platforms designed for superior usability, productivity, and performance. Astera has worked closely with customers from a variety of industries, including financial services, healthcare, government, and nonprofits, to build and continually refine the platform, with a specific focus on creating and maintaining complex hierarchical Dataflows and Workflows.
Over the years, we have assisted customers in successfully deploying Astera in a variety of usage scenarios. The Astera Best Practices Series represents a synthesis of these experiences, and each part will cover a specific key area of Astera's technologies and processes. The objective is to assist customers in developing world-class solutions by establishing and communicating these best practices.
Dataflows are integral to any ETL (Extract, Transform, Load) integration project, and their cornerstone is the Astera Dataflow Designer, which leads the industry in power and ease of use. Our customers consistently give high marks to its visual interface, drag-and-drop capabilities, instant preview, and full complement of sources, targets, and transformations.
The Dataflow best practices outlined herein are intended to facilitate the initial creation and ongoing maintenance of Astera ETL Dataflows. Whether you are developing small Dataflows with a few steps or large ones comprising scores of steps, these practices will help you manage your ETL integration project more effectively.
Modularity is a key design principle in the development of integration projects. Modularity enhances the maintainability of your Dataflows by making them easier to read and understand. It also promotes reusability by isolating frequently used logic into individual components that can be leveraged as “black boxes” by other flows.
Astera supports multiple types of reusable components. These components and their usage recommendations are discussed in the following subsections.
Subflows are reusable blocks of Dataflow steps that have inputs and outputs. Subflows enable users to isolate frequently used logic in reusable components. Once created, Subflows can be used just like built-in Astera transformations. Examples of reusable logic that can be housed in Subflows include:
Validations that are applied to data coming from multiple sources, frequently in incompatible formats.
Transformation sequences such as a combination of lookup, expression, and function transformations that occur in multiple places in the project.
Processing of incoming data that arrives in different formats but must be normalized, validated, and onboarded.
Shared Actions are similar to Subflows but contain only a single action. Shared Actions should be employed when a source or destination is used in multiple places within the project. This way, if a field is added to or removed from the source, every flow that uses the Shared Action inherits the change automatically.
Shared Connections contain database connection information that can be shared by multiple actions within a Dataflow. Shared connections can also be used to enforce transaction management across many database destinations. Use shared connections whenever multiple actions in a Dataflow use the same database connection information.
To make an API call, an API Connection object needs to be configured first. This object stores all the common information that can be shared across multiple API requests.
Cloud Connections in Astera allow users to connect to various cloud data systems, such as Amazon S3, Azure Blob Storage, Data Lake, and SharePoint. This makes it easier to manage and integrate data across different platforms.
Detached Transformations are a capability within Astera developed for scenarios where a lookup or expression is used in multiple places within a dataflow. Detached Transformations enable you to create a single instance and use it in multiple places.
They are available in expressions as callable functions, enabling you to use them in multiple expressions. Additionally, Detached Transformations allow you to use lookups conditionally.
For example, you can use a single detached expression for both an Excel source field and a database source field. Instead of creating a distinct expression for each source, streamline the process by invoking the shared expression in each field's calculation.
Astera has been designed as a parallel processing platform to deliver superior speed and performance, so designing dataflows to take advantage of the software’s abilities can significantly affect your data integration performance. The performance best practices discussed here can result in a major performance boost in many common situations.
Frequently, a Dataflow can be optimized by some fine-tuning at the data source. Some of the optimization techniques are discussed in this section.
When loading data from a database, enter a WHERE clause to filter data at the source. Loading all the data into Astera and then filtering it using the Filter transformation can significantly degrade performance.
For instance, you can add a WHERE clause that selects only the customers from the country “Mexico” in the Customers table.
[Screenshot: Filter transformation]
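The idea can be sketched outside of Astera with plain SQL. The snippet below is a minimal illustration, assuming a hypothetical Customers table in SQLite: the source-side WHERE clause returns only the matching rows, whereas the "late filter" approach pulls the whole table across and discards most of it afterward, which is what an unfiltered source followed by a Filter transformation effectively does.

```python
import sqlite3

# Hypothetical Customers table, mirroring the "Mexico" example from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER, CompanyName TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [(1, "Ana Trujillo", "Mexico"), (2, "Alfreds Futterkiste", "Germany"),
     (3, "Antonio Moreno", "Mexico")],
)

# Preferred: filter at the source -- the database returns only matching rows.
filtered = conn.execute(
    "SELECT CustomerID, CompanyName FROM Customers WHERE Country = 'Mexico'"
).fetchall()

# Slower equivalent: pull everything, then filter downstream in the flow.
all_rows = conn.execute("SELECT CustomerID, CompanyName, Country FROM Customers").fetchall()
filtered_late = [(cid, name) for cid, name, country in all_rows if country == "Mexico"]

assert sorted(filtered) == sorted(filtered_late)  # same result, far more data moved
```

Both paths produce identical output; the difference is how many rows travel from the database to the integration engine.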
The Database Table Source automatically creates a query to load only the mapped fields. To take advantage of this optimization, map only the fields that are used in subsequent actions.
The performance of joins improves by an order of magnitude when working with presorted data. Where possible, avoid sorting data in Astera; instead, sort in your database query by adding an Order By clause.
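As a minimal sketch of the same principle, using a hypothetical Orders table in SQLite: an ORDER BY clause lets the database do the sorting (often backed by an index), while the in-flow alternative forces the engine to buffer and sort the whole stream itself.

```python
import sqlite3

# Hypothetical Orders table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER)")
conn.executemany("INSERT INTO Orders VALUES (?, ?)", [(10, 3), (11, 1), (12, 2)])

# Preferred: let the database sort -- an index on CustomerID can make this nearly free.
sorted_rows = conn.execute(
    "SELECT OrderID, CustomerID FROM Orders ORDER BY CustomerID"
).fetchall()

# In-flow sort: the engine must hold and sort the entire stream in memory.
unsorted = conn.execute("SELECT OrderID, CustomerID FROM Orders").fetchall()
sorted_late = sorted(unsorted, key=lambda row: row[1])

assert sorted_rows == sorted_late
```

With presorted inputs, a downstream join can run as a single-pass merge instead of buffering both sides.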
Astera Database and File Sources enable data partitioning, which speeds up reading by breaking a dataset into chunks and reading these chunks in parallel. Use partitioning if you are moving a large data table.
If you periodically transfer incremental data changes, consider using one of Astera's change data capture (CDC) patterns to ensure your data is as up-to-the-minute as you need it to be. Astera supports a variety of CDC strategies enabling you to select the appropriate strategy to fit your environment and requirements.
Incremental Load Based on Audit Fields relies on fields such as created datetime, modified datetime, or version number. The process tracks the highest audit field value seen; during subsequent runs, only records with values surpassing the saved value are retrieved. This is valuable for keeping two applications synchronized without reading the entire table on every run.
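The high-water-mark pattern described above can be sketched in a few lines. This is a generic illustration, not Astera's implementation; the Accounts table, the ModifiedAt audit field, and the incremental_load helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (Id INTEGER, Name TEXT, ModifiedAt TEXT)")
conn.executemany("INSERT INTO Accounts VALUES (?, ?, ?)", [
    (1, "Acme", "2024-01-01T09:00:00"),
    (2, "Globex", "2024-01-02T10:30:00"),
])

def incremental_load(conn, high_water_mark):
    """Fetch only rows whose audit field exceeds the saved high-water mark,
    and return the new mark to persist for the next run."""
    rows = conn.execute(
        "SELECT Id, Name, ModifiedAt FROM Accounts "
        "WHERE ModifiedAt > ? ORDER BY ModifiedAt",
        (high_water_mark,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else high_water_mark
    return rows, new_mark

# First run: no saved mark, so everything is new.
rows, mark = incremental_load(conn, "")
assert len(rows) == 2

# A record changes later; the next run picks up only that record.
conn.execute("INSERT INTO Accounts VALUES (3, 'Initech', '2024-01-03T08:15:00')")
rows, mark = incremental_load(conn, mark)
assert [r[0] for r in rows] == [3]
```

The saved mark would normally be persisted between runs (for example, in a control table) rather than held in a variable.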
Trigger-based CDC shares a similar concept with incremental load, but it offers greater flexibility. This method allows you to select multiple audit fields of any data type, whose values are then saved in your database. This flexibility is particularly useful when working with both ETL and ELT methods.
Join transformations enable you to join multiple data sources. Joining often involves sorting the incoming data streams, which makes it one of the most common causes of performance issues. Keep the following practices in mind when using joins:
Sort data in the source where appropriate. Joining sorted data streams is much faster.
Reducing the number of fields in the Join transformation also improves performance, so be sure to remove any unnecessary fields, especially when a large amount of data is involved.
Efficient data joining is crucial for optimal performance in Astera. Whenever possible, leverage ELT for enhanced speed and streamlined data integration. However, "Join in Database" is also a valuable alternative, providing flexibility based on specific use cases or preferences.
The Enable Sort Optimization option in the Join and Tree Join transformations helps users optimize ETL (Extract, Transform, and Load) performance and minimize job execution time. When sort optimization is enabled, the engine can choose the most suitable join order and algorithm, leading to faster query processing and better resource utilization. At runtime, it also issues recommendations in the job trace for optimizing your integration flows, such as removing repetitive sort operations on presorted data or applying an Order By clause to database sources so that sorting runs in the database. As a result, it can deliver a significant performance increase by managing RAM, CPU, network, and disk utilization more efficiently while executing jobs.
Lookups are another source of performance issues in integration jobs. Astera offers several caching techniques to improve the performance of lookups. Experiment with lookup caching options and select the options that work best for specific situations. Some tips are:
If you work with a large dataset that does not change often, consider using the new Astera Persistent Lookup Cache, which stores a snapshot of the lookup table on the server’s local drive and uses it in subsequent runs. In situations where the lookup table is updated daily, a snapshot can be taken on the first run after the update and can be used throughout the day to process incremental data.
If you work with a large dataset but use only a small fraction of its items in a single run, consider using the Cache on First Use option.
If a lookup is used in multiple places within the same flow, consider using a detached lookup.
Where appropriate, use a database join instead of a lookup function.
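To make the caching idea concrete, here is a minimal sketch of the "cache on first use" strategy, with an in-memory dictionary standing in for the lookup table. All names here (lookup_table, cached_lookup, db_hits) are illustrative, not Astera APIs.

```python
# Sketch of a "cache on first use" lookup: each key hits the backing table
# at most once per run; repeated lookups are served from memory.
lookup_table = {"US": "United States", "MX": "Mexico", "DE": "Germany"}
db_hits = 0
cache = {}

def cached_lookup(key):
    global db_hits
    if key not in cache:
        db_hits += 1                      # simulated round trip to the lookup source
        cache[key] = lookup_table.get(key)
    return cache[key]

# Five lookups, but only two distinct keys -> only two "database" hits.
results = [cached_lookup(k) for k in ["MX", "US", "MX", "MX", "US"]]
assert db_hits == 2
```

This is why Cache on First Use pays off when a run touches only a small fraction of a large lookup table: nothing is fetched until it is actually needed, and nothing is fetched twice.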
Astera supports bulk load for popular databases. Where possible, use bulk load for database destinations and experiment with bulk load batch sizes to fine-tune performance.
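The effect of batching can be illustrated generically with SQLite: rows are inserted in fixed-size batches, each inside its own transaction, instead of one commit per row. The table, row data, and BATCH_SIZE value below are assumptions for the sketch; the optimal batch size depends on your database, row width, and network.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Target (Id INTEGER, Value TEXT)")

rows = [(i, f"value-{i}") for i in range(10_000)]
BATCH_SIZE = 1_000  # tune this; larger batches mean fewer round trips and commits

# Load in batches inside explicit transactions rather than one INSERT per row.
for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    with conn:  # one transaction (and one commit) per batch
        conn.executemany("INSERT INTO Target VALUES (?, ?)", batch)

count = conn.execute("SELECT COUNT(*) FROM Target").fetchone()[0]
assert count == 10_000
```

Real bulk-load paths (e.g., a database's native bulk copy interface) go further than batched inserts, but the tuning principle is the same: measure throughput at several batch sizes and keep the best one.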
Avoid unnecessary steps. Some steps such as Record Level Log may incur major performance overhead and should only be used during development to isolate data issues.
The Job Monitor window shows the execution time for each step. These are not precise numbers but provide a rough approximation of the time taken by each step and can be used to optimize specific steps.
[Screenshot: Job Monitor]
Astera offers Diff Processor transformations (Source Diff Processor and Database Diff Processor) that compare an incoming data stream with the existing data in a table and apply only the differences. In certain situations, these transformations can speed up Dataflow execution.
Input parameters and output variables make it possible to supply values to Dataflows at runtime and return output results from Dataflows. Well-designed parameter and output variable structures promote reusability and reduce ongoing maintenance costs.
Input parameter values can be supplied from a calling Workflow using data mapping. When designing Dataflows, analyze the values that could change between different runs of the flow and define these values as input parameters. Input parameters can include file paths, database connection information, and other data values.
If you would like to make decisions about subsequent execution paths based on the result of a Dataflow run, define output variables and use an Expression transformation to store values into them. These output variables can then be used in subsequent Workflow actions to control the execution path.
The Toolbar offers several benefits for Dataflows and other artifacts: it simplifies and streamlines the design of integration jobs, increases reusability, and results in an easier-to-understand overall diagram. The Dataflow Designer also supports unlimited undo and redo, letting users reverse or reapply several actions, which adds flexibility and ease of use when designing Dataflows. Furthermore, Dataflows created with Astera can be ported to any number of target environments that may use different data connections, providing flexibility and reusability. Overall, the Toolbar provides a user-friendly interface and features that contribute to efficient data integration and management.
Pressing Ctrl+F opens the Find feature in the Dataflow, which lets users search for artifacts, saving time and increasing efficiency. Instead of manually hunting for a specific object or component in the Dataflow, users can locate it quickly with the search feature. This is especially helpful in larger Dataflows with many objects, where finding a specific object can be time-consuming.
Instead of searching for the desired destination or transformation in the Toolbox, you can simply right-click on the node inside the object and select Write To or Transform, and then select the desired destination or transformation, which will be added to the Dataflow Designer with auto-mapped fields. This can save you time and effort.
[Screenshot: Write To]
[Screenshot: Transform]