Distinct Transformation
Overview
The Distinct transformation object in Astera Data Stack removes duplicate records from the incoming dataset. You can use all fields in the layout to identify duplicate records or specify a subset of fields, also called key fields, whose combination of values will be used to filter out duplicates.
Use Case
Consider a scenario where we have data coming in from an Excel Workbook Source and the dataset contains duplicate records. We want to filter out all the duplicate records from our source data and create a new dataset with distinct records from our source data. We can do this by using the Distinct transformation object in Astera. To achieve this, we will specify data fields with duplicate records as Key Values.
In order to add a separate node for duplicate records inside the Distinct transformation object, we will check the option: Add Duplicate Records. Then we will map both distinct and duplicate outputs to a Delimited File Destination.
Using the Distinct Transformation
Drag-and-drop an Excel Workbook Source from the Toolbox to the dataflow as our source data is stored in an Excel file.
To apply the Distinct transformation to your source data, drag and drop the Distinct transformation object from the Transformations section in the Toolbox. Map the fields from the source object by dragging the top node of the ExcelSource and to the top node of the Distinct transformation object. To do this, go to Toolbox > Transformations > Distinct.
Now, right-click on the Distinct transformation object and select Properties. This will open the Layout Builder window where you can modify fields (add or remove fields) and the object layout.
Click Next. The Distinct Transformation Properties window will now open.
Data Ordering:
Data is Presorted on Key Fields: Select this option if the incoming data is already sorted based on defined key fields.
Sort Incoming Data: Select this option if your source data is unsorted and you want to sort it.
Work with Unsorted Data: When this option is selected, the Distinct transformation object will work with unsorted data.
On this window, the distinct function can be applied to the fields containing duplicate records by adding them under the Key Field.
Note: In this case, we will specify the Name and Type fields as Key Fields.
You can now write the Distinct output to a destination object. In this case, we will write our output into a Delimited File Destination object.
Right-click on the Delimited File Destination object and click Preview Output.
Your output will look like this:
Adding Duplicate Records
To add duplicate records in your dataset check the Add Duplicates Output option in the Distinct Transformation Properties window.
When you check Add Duplicates Output, three output nodes will be added in the Distinct transformation object.
Input
Output_Distinct
Output_Duplicate
Note: When you check the Add Duplicate Records option, mappings from the source object to the Distinct transformation object will be removed.
Now, map the objects by dragging the top node of the ExcelSource object to the Input node of the Distinct transformation object.
You can now write the Output_Distinct and Output_Duplicate nodes to two different destination objects. In this case, we will write our output into a Delimited File Destination object.
Distinct output:
Duplicate output:
As evident, the duplicate records have been successfully separated from your source data.
Last updated