Distinct Transformation
Overview
The Distinct transformation object in Astera removes duplicate records from the incoming dataset. You can use all fields in the layout to identify duplicate records, or specify a subset of fields, also called key fields, whose combination of values will be used to filter out duplicates.
Video
Use Case
Consider a scenario where we have data coming in from an Excel Workbook Source and the dataset contains duplicate records. We want to filter out all the duplicate records from our source data and create a new dataset with distinct records from our source data. We can do this by using the Distinct transformation object in Astera Data Stack. To achieve this, we will specify data fields with duplicate records as Key Values.
In order to add a separate node for duplicate records inside the Distinct transformation object, we will check the option: Add Duplicate Records. Then we will map both distinct and duplicate outputs to a Delimited File Destination.
Let’s see how to do that.
How to work with Distinct Transformation
Drag-and-drop an Excel Workbook Source from the Toolbox to the dataflow as our source data is stored in an Excel file.
To apply the Distinct transformation to your source data, drag-and-drop the Distinct transformation object from the Transformations section in the Toolbox. Map the fields from the source object by dragging the top node of the ExcelSource and to the top node of the Distinct transformation object. To do this, go to Toolbox>Transformations>Distinct.
Now, right-click on the Distinct transformation object and select Properties. This will open the Layout Builder window where you can modify fields (add or remove fields) and the object layout.
Click Next. The Distinct Transformation Properties window will now open.
Data Ordering:
Data is Presorted on Key Fields: Select this option if the incoming data is already sorted based on defined key fields.
Sort Incoming Data: Select this option if your source data is unsorted and you want to sort it.
Work with Unsorted Data: When this option is selected, the Distinct transformation object will work with unsorted data.
On this window, the distinct function can be applied on the fields containing duplicate records by adding them under Key Field.
Note: In this case, we will specify the Name and Type fields as Key Fields
You can now write the Distinct output to a destination object. In this case, we will write our output into a Delimited destination object.
Right-click on Delimited Destination object and click Preview Output.
Your output will look like this:
To add duplicate records
To add duplicate records in your dataset check the Add Duplicates Output option in the Distinct Transformation Properties window.
When you check this option, three output nodes would be added in the Distinct transformation object.
Input
Output_Distinct
Output_Duplicate
Note: When you check the Add Duplicate Records option, mappings from the source object to the Distinct transformation object will be removed.
Now, map the objects by dragging the top node of ExcelSource object to the Input node of the Distinct transformation object.
You can now write the Output_Distinct and Output_Duplicate nodes to two different destination objects. In this case we will write our output into a Delimited destination object.
Distinct output:
Duplicate output:
As evident, the duplicate records have been successfully separated from your source data.
Last updated