# Distinct Transformation

## **Overview**

The *Distinct* transformation object in Astera Data Stack removes duplicate records from the incoming dataset. You can use all fields in the layout to identify duplicate records or specify a subset of fields, also called key fields, whose combination of values will be used to filter out duplicates.
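The difference between using all fields and using key fields can be sketched in a few lines of Python. This is only an illustration of the concept, not Astera's implementation: the `distinct` helper, the sample rows, and the assumption that the first record for each key is the one kept are all hypothetical.

```python
def distinct(records, key_fields=None):
    """Keep the first record seen for each combination of key values.

    When key_fields is None, every field in the record is part of the key.
    Assumes all records share the same layout (the same field names).
    """
    seen = set()
    result = []
    for record in records:
        fields = key_fields if key_fields is not None else sorted(record)
        key = tuple(record[f] for f in fields)
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


# Hypothetical sample data, for illustration only.
rows = [
    {"Name": "Alice", "Type": "A", "City": "Paris"},
    {"Name": "Alice", "Type": "A", "City": "Lyon"},
    {"Name": "Bob", "Type": "B", "City": "Rome"},
    {"Name": "Alice", "Type": "A", "City": "Paris"},
]

all_fields = distinct(rows)                # only exact copies are dropped
by_key = distinct(rows, ["Name", "Type"])  # City is ignored when comparing
```

With all fields as the key, only the exact copy of the first row is dropped. With `Name` and `Type` as key fields, the `Lyon` row is also treated as a duplicate because its key values match an earlier record.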

## **Use Case**

Consider a scenario where data coming in from an *Excel Workbook Source* contains duplicate records. We want to filter out the duplicates and create a new dataset that contains only distinct records. We can do this with the *Distinct* transformation object in Astera by specifying the fields that contain duplicate values as *Key Fields*.

To add a separate output node for duplicate records inside the *Distinct* transformation object, we will check the *Add Duplicates Output* option. Then we will map both the distinct and duplicate outputs to *Delimited File Destination* objects.

## **Using the Distinct Transformation**

1. Drag and drop an [*Excel Workbook Source*](https://documentation.astera.com/v/astera-data-stack-v8/dataflows/sources/excel-workbook-source) from the *Toolbox* onto the dataflow, as our source data is stored in an Excel file.
2. To apply the *Distinct* transformation to your source data, drag and drop the *Distinct* transformation object from *Toolbox > Transformations > Distinct* onto the dataflow. Then map the fields from the source object by dragging the top node of the *ExcelSource* object to the top node of the *Distinct* transformation object.

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2FceWF0JKyFPyxC4AXcPmp%2FDistinct%20Transformation%20Gif%201.gif?alt=media&#x26;token=2e13cec0-e485-440b-9664-04d1ecaef0ad" alt=""><figcaption></figcaption></figure>

3. Now, right-click on the *Distinct* transformation object and select *Properties*. This will open the *Layout Builder* window where you can modify fields (add or remove fields) and the object layout.

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2F85zQf5OqyX03Wej8JQfy%2Fimage.png?alt=media&#x26;token=25f8abe6-68d6-4977-9dbb-826aa5fa15ef" alt=""><figcaption></figcaption></figure>

4. Click *Next*. The *Distinct Transformation Properties* window will now open.

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2Fu9MFJ7cFFB4PdQTNH6og%2Fimage.png?alt=media&#x26;token=c6e9dfaa-cdd4-4bcf-9b1e-964db697d352" alt=""><figcaption></figcaption></figure>

*Data Ordering*:

* *Data is Presorted on Key Fields:* Select this option if the incoming data is already sorted based on defined key fields.
* *Sort Incoming Data:* Select this option if your source data is unsorted and you want to sort it.
* *Work with Unsorted Data:* Select this option if your source data is unsorted and you want the *Distinct* transformation object to process it as-is, without sorting it first.
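To see why data ordering matters for duplicate removal in general, the three options can be compared in a short Python sketch. This is not a description of how Astera processes data internally; the function names and sample rows are hypothetical, and the sketch only shows the general trade-off: presorted data allows duplicates to be found by comparing neighboring records, while unsorted data requires remembering every key seen so far.

```python
from itertools import groupby
from operator import itemgetter


def distinct_presorted(records, key_fields):
    """Assumes records are already sorted on key_fields, so duplicates are adjacent."""
    key = itemgetter(*key_fields)
    for _, group in groupby(records, key=key):
        yield next(group)  # first record of each run of equal keys


def distinct_sorting(records, key_fields):
    """Sorts the incoming records on key_fields first, then removes adjacent duplicates."""
    key = itemgetter(*key_fields)
    return list(distinct_presorted(sorted(records, key=key), key_fields))


def distinct_unsorted(records, key_fields):
    """Leaves the order untouched and tracks every key seen so far."""
    key = itemgetter(*key_fields)
    seen = set()
    result = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            result.append(record)
    return result


# Hypothetical unsorted sample data, for illustration only.
rows = [
    {"Name": "Bob", "Type": "B"},
    {"Name": "Alice", "Type": "A"},
    {"Name": "Bob", "Type": "B"},
]

sorted_result = distinct_sorting(rows, ["Name", "Type"])
unsorted_result = distinct_unsorted(rows, ["Name", "Type"])
# Treating unsorted data as presorted misses the non-adjacent duplicate.
wrong_result = list(distinct_presorted(rows, ["Name", "Type"]))
```

In this sketch, `wrong_result` still contains both `Bob` records, which suggests why the presorted option should only be chosen when the incoming data really is sorted on the key fields.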

5. In this window, apply the distinct function to the fields containing duplicate records by adding them under *Key Field*.

{% hint style="info" %}
**Note:** In this case, we will specify the *Name* and *Type* fields as *Key Fields.*
{% endhint %}

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2FdLuRAJMPIn4fZWLCJ9x2%2Fimage.png?alt=media&#x26;token=213be1d5-f569-48e6-a9cb-368ac07e2329" alt=""><figcaption></figcaption></figure>

You can now write the *Distinct* output to a destination object. In this case, we will write our output into a [*Delimited File Destination*](https://documentation.astera.com/v/astera-data-stack-v8/dataflows/destinations/delimited-file-destination) object.

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2FRJRn04dBpNYRFESglOL8%2Fimage.png?alt=media&#x26;token=623aa573-f220-40c3-916e-cc99d48bb3f3" alt=""><figcaption></figcaption></figure>

6. Right-click on the *Delimited File Destination* object and click *Preview Output*.

Your output will look like this:

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2FhbtpWqutqGXf0qTDWUmC%2Fimage.png?alt=media&#x26;token=b900ea5e-7d54-446a-9439-6ef3ddbce44d" alt=""><figcaption></figcaption></figure>

## **Adding Duplicate Records**

1. To add a separate output for duplicate records, check the *Add Duplicates Output* option in the *Distinct Transformation Properties* window.

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2FNE6Dib1A2rIIBEFLG5zx%2Fimage.png?alt=media&#x26;token=f954107d-0561-480d-b37d-f7c88bab17cf" alt=""><figcaption></figcaption></figure>

2. When you check *Add Duplicates Output*, three nodes will be added to the *Distinct* transformation object:
   1. *Input*
   2. *Output\_Distinct*
   3. *Output\_Duplicate*

{% hint style="info" %}
**Note:** When you check the *Add Duplicates Output* option, mappings from the source object to the *Distinct* transformation object will be removed.
{% endhint %}

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2Fz9v7xiZ0mNK4nL8My0yQ%2Fimage.png?alt=media&#x26;token=40f3af5b-be3c-4d28-976f-450f58f84799" alt=""><figcaption></figcaption></figure>

3. Now, map the objects by dragging the top node of the *ExcelSource* object to the *Input* node of the *Distinct* transformation object.

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2FcxoMYkYvd3Sh6s2CLTIV%2Fimage.png?alt=media&#x26;token=c6bb403b-a580-4073-a526-e638655a4d99" alt=""><figcaption></figcaption></figure>

4. You can now write the *Output\_Distinct* and *Output\_Duplicate* nodes to two different destination objects. In this case, we will write each output to its own [*Delimited File Destination*](https://documentation.astera.com/v/astera-data-stack-v8/dataflows/destinations/delimited-file-destination) object.

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2F2gPdipo6I4m5ecxBh5nc%2Fimage.png?alt=media&#x26;token=53ba3e16-216f-4299-8cc0-1281e820b77a" alt=""><figcaption></figcaption></figure>

Distinct output:

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2F8fu4AW1DWmujOr94rmBA%2Fimage.png?alt=media&#x26;token=9dd1922b-d131-4bfd-8f4e-243aafb3e925" alt=""><figcaption></figcaption></figure>

Duplicate output:

<figure><img src="https://750977703-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqHxyVNGb7tSdIWecl6Ru%2Fuploads%2FsWoftiVAh2ef0gXj9w2s%2Fimage.png?alt=media&#x26;token=1f4d7491-6268-4a86-a91b-ceb9baac1201" alt=""><figcaption></figcaption></figure>

As the previews show, the duplicate records have been successfully separated from your source data.
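The split into a distinct output and a duplicate output can be approximated outside Astera with a minimal Python sketch. Everything here is an assumption made for illustration: the `split_distinct` and `to_delimited` helpers, the sample rows, and in particular the rule that the first record for each key goes to the distinct output while later matches go to the duplicate output. Verify the actual behavior against your own previews.

```python
import csv
import io


def split_distinct(records, key_fields):
    """Return (distinct, duplicates); the first record per key counts as distinct."""
    seen = set()
    distinct, duplicates = [], []
    for record in records:
        key = tuple(record[f] for f in key_fields)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
            distinct.append(record)
    return distinct, duplicates


def to_delimited(records, fieldnames, delimiter=","):
    """Render records as delimited text, standing in for a delimited file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter=delimiter)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample data, for illustration only.
rows = [
    {"Name": "Alice", "Type": "A", "City": "Paris"},
    {"Name": "Alice", "Type": "A", "City": "Lyon"},
    {"Name": "Bob", "Type": "B", "City": "Rome"},
    {"Name": "Alice", "Type": "A", "City": "Paris"},
]

distinct_rows, duplicate_rows = split_distinct(rows, ["Name", "Type"])
distinct_text = to_delimited(distinct_rows, ["Name", "Type", "City"])
duplicate_text = to_delimited(duplicate_rows, ["Name", "Type", "City"])
```

Here the two text values play the role of the two delimited destinations: one receives the records kept as distinct, the other receives the records flagged as duplicates.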
