A Record Level Log captures the status (Success, Error, Warning, or Skip) for each of the records transferred, and includes snapshots of the source record and the destination record. It also provides additional details, such as error messages.
You can have any number of record level logs on the dataflow. Each record level log will collect the status of the records in the object that it is connected to.
In this document, we will learn how to use the Record Level Log object in Astera.
In this case, we have a simple dataflow performing a data quality check. It contains a Customers dataset stored in an Excel Workbook Source. A Data Quality Rule is applied to validate the data for error and warning checks, and finally, the data is written to a Database Table Destination.
If you Preview Output for the Customers dataset, you will see that some of the records have empty values in the Region and Fax fields.
A Data Quality Rule is applied to identify null records in the Region field as errors and empty records in the Fax field as warnings. Upon previewing its output, you will see that the records that failed to match the rule have returned an error, denoted by a red warning sign.
If you hover over these warning signs, the error message will appear.
Now, when we run this dataflow, we want to know which records passed the validation check, which failed it, which contained errors, and which ended with only warnings.
For this purpose, we will use Record Level Log.
To get a Record Level Log object from the Toolbox, go to Toolbox > Data Profiling > Record Level Log. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Record Level Log object onto the dataflow designer.
Another way to get a Record Level Log object is to right-click on the Output node inside the Database Table Destination object and go to Write to > Record Level Log.
You can see that the dragged Record Level Log object is empty right now. This is because we have not mapped any fields to it yet.
Auto-map the fields from the source object to the Log object.
To configure the Log object, right-click on its header and select Properties from the context menu.
A configuration window will open. The first window you will see is the Layout Builder window. This is where we can create or delete fields and change their names and data types.
Click Next, and you will be directed to a Properties window where you can configure settings for creating the log file.
Specify the Profile File path where Astera will save this log file. Log files are saved with a .prof extension.
Specify the Log Level Type from the dropdown list:
All – all records (including Success records) are logged
Errors – only error records are logged
Warnings – only warning records are logged
Errors and Warnings – both error and warning records are logged
Off – no logging
In this case, we will select Errors and Warnings as our log type.
Stop Logging After … Records with Errors – allows you to limit excessive logging by setting a cap on the maximum number of errors to be logged. The logging stops after the cap has been reached.
The default value is 1000 errors.
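To make the log level and error cap behavior concrete, here is a minimal Python sketch of the same filtering logic. It is purely illustrative: the statuses and the cap mirror the options described above, but none of this is Astera's internal code.

```python
# Illustrative sketch only -- not Astera's internals. It mimics how a
# record-level log filters entries by log level and stops logging once
# an error cap is reached. All names here are hypothetical.
MAX_ERRORS = 1000  # "Stop Logging After ... Records with Errors"

def should_log(status, level):
    """Return True if a record with this status is kept at this log level."""
    rules = {
        "All": {"Success", "Error", "Warning", "Skip"},
        "Errors": {"Error"},
        "Warnings": {"Warning"},
        "Errors and Warnings": {"Error", "Warning"},
        "Off": set(),
    }
    return status in rules[level]

def build_log(records, level="Errors and Warnings", max_errors=MAX_ERRORS):
    log, error_count = [], 0
    for rec in records:
        if error_count >= max_errors:
            break  # cap reached: logging stops
        if should_log(rec["status"], level):
            log.append(rec)
            if rec["status"] == "Error":
                error_count += 1
    return log

sample = [
    {"id": 1, "status": "Success"},
    {"id": 2, "status": "Error", "message": "Region is null"},
    {"id": 3, "status": "Warning", "message": "Fax is empty"},
]
print(build_log(sample))  # keeps records 2 and 3
```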
Click OK.
After configuring settings for the Log object, click the Start Dataflow icon from the toolbar located at the top of the window. A Job Progress window will open and show you the trace of the job.
Click on the log file link provided in the Job Progress window.
The Record Level Log will open in Astera, showing you the status of the logged records.
Astera stores the error logs in XML format. If you expand each record, you will see the Field Name to which the error/warning message was attached, the Processing Step of the data check that produced the error, and the error Message.
If you click on View Record, Astera will show you the field values of the record failing the data quality rule.
The Data Quality Rules object found in the Data Profiling section of the Toolbox is used to apply one or more conditions, called Data Quality Rules, against incoming records. Records that do not meet the data quality rule criteria will be assigned the ‘Error’ status and may be optionally excluded from processing by the downstream objects.
Data Quality Rules is a record-level component, which means that it does not require the entire dataset to flow through it. In other words, you can map just one or a few fields to the Data Quality Rules object to set up validation criteria, and the validated records can then be mapped further in the dataflow.
Let’s understand the application and usage of Data Quality Rules with the following example.
Here, we have sample data for the employees of a fictitious organization, retrieved using an Excel Workbook Source.
If we look at the preview of the Employee_Report dataset, the values in the SalariedFlag column specify whether an employee is salaried, using 0 and 1:
1 = the employee is salaried
0 = the employee is non-salaried and therefore is eligible for overtime.
We can apply data quality rules to these values and identify which employees are not salaried and therefore eligible for overtime. The Data Quality Rules object will process all records, and those that do not match the criteria will be returned with an error. This means that in this example, the salaried employees with the salary flag 'True' will return an error, whereas the records of employees with the salary flag 'False' will pass the data quality rule.
To do this, drag the Data Quality Rules object from the Data Profiling section in the Toolbox and drop it onto the dataflow designer.
Now, map the SalariedFlag field to the Data Quality Rules object.
Right-click on the Data Quality Rules object and select Properties from the context menu.
This will open a new window. This is the Layout Builder, where you can see the ‘SalariedFlag’ field we have mapped from our source.
Click Next to proceed to the Data Quality Rules window.
Here, we will set the rules, or the data quality criteria. Click this button to add a new rule.
Once a new rule is added, the options on this window will activate and the rule will be added to the grid.
Let’s explore these options one by one:
Description: The Description field contains the name or description of the rule. By default, the rules are named Rule1, Rule2, and so on, depending on the number of rules you add, but you can rename them for better understanding and convenience.
In our case, since we want to set data quality criteria to identify non-salaried employees, we can rename the rule "NonSalariedEmployeesRule."
Attach rule to the field: This drop-down list lets you attach a rule to a particular field. You can see that there is a root node named Data Quality Rules.
Listed within the Data Quality Rules node are the fields mapped to the Data Quality Rules object. Here we have only one field mapped to which we want to apply this rule. In case you want to apply a rule to the whole dataset, you can simply double-click on the Data Quality Rules root node and the rule will be applied to all fields mapped to the Data Quality Rules object.
In this case, we will attach the rule to the SalariedFlag field.
Expression box: This is where you can type in the expression for your rule.
In this example, we want to validate records with the salary flag 'False.' To do this, we will write the expression SalariedFlag = 0 in the Expression box.
Observe that, as you type, Astera shows the compile status of your expression below the expression box.
Here it says 'Successful,' so we can proceed. If the expression is incorrect, it will show an error instead, and you will have to fix the expression before proceeding.
You can also click this button to enter the Expression Builder window, where you can choose an expression from Astera's library of built-in expressions or write one of your own.
Show Message: We can also write a message to show up with the errors, which can also be written to the error log. Let’s write a message:
‘Salaried employees are not eligible for overtime.’
This message will help identify why a particular record was marked erroneous. And if multiple rules are applied, the message will point out which rule a particular record failed.
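Conceptually, the rule attaches a pass/fail status and an optional message to each record. Here is a small illustrative Python sketch of that behavior; the function and record layout are hypothetical, not Astera's engine or API.

```python
# Illustrative sketch of what the rule does per record; the names here
# are hypothetical and this is not Astera's engine or API.
RULE_MESSAGE = "Salaried employees are not eligible for overtime."

def non_salaried_employees_rule(record):
    """Return (passed, message); the message is attached only on failure."""
    passed = record["SalariedFlag"] == 0   # the expression: SalariedFlag = 0
    return passed, None if passed else RULE_MESSAGE

for rec in [{"EmployeeID": 101, "SalariedFlag": 0},
            {"EmployeeID": 102, "SalariedFlag": 1}]:
    passed, msg = non_salaried_employees_rule(rec)
    print(rec["EmployeeID"], "Pass" if passed else f"Error: {msg}")
```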
Next, we have two checkboxes:
Active – to activate a rule.
Is Error – when this is checked, records that fail the rule are treated as errors and will not be written to a target. This means that only the records that have passed the data quality rule will flow further down the dataflow pipeline.
However, if we uncheck this option, the Warning checkbox is checked automatically. Records that fail to match the rule will then be returned with a warning and will still be written to the target.
In this case, let’s keep the errors as errors by checking the Is Error box.
Now we have set up a data quality rule.
You can add as many rules as you want by clicking this button, and you can similarly delete a rule by pointing to it in the grid and selecting right-click > Delete. In this example, we will work with the single rule we have set, so let's go ahead and click OK.
Now, let’s look at the preview. Right-click on the Data Quality Rules object and select Preview Output from the context menu.
You can see that the records that matched the rule (those with a 'False' salary flag) have been validated. On the other hand, the records that failed to match the rule (those with a 'True' flag) have returned an error, denoted by a red warning sign.
If you move the cursor over this warning sign, it will show the error message in the tooltip. This is especially useful in cases where you have applied more than one rule and you want to track which records have failed to match which rule or when you want to store the erroneous records in an error log.
Now that we have validated the records against our data quality rule, we can map them to a target, in this case a Delimited File Destination. We will name this file 'Employees eligible for overtime,' so the records of employees with the 'False' salaried flag will pass through the Data Quality Rules object and be mapped to the destination file. Let's do the mapping.
Now, if we open the Properties window of the destination file, you can see the Do Not Process Records With Errors option on the last window. It is checked by default for all target formats in Astera. Therefore, when we run this dataflow, all records that matched the data quality rule will be written to the destination file, whereas records that failed to match the rule and returned an error will be omitted.
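Putting the Is Error checkbox and the destination's Do Not Process Records With Errors option together, the routing behaves roughly like the following Python sketch (hypothetical names; a conceptual model, not Astera's implementation):

```python
# Conceptual model of error routing: with "Is Error" checked and
# "Do Not Process Records With Errors" enabled, failing records are
# withheld from the target. Hypothetical names, not Astera's API.
def route(records, rule, treat_failures_as_errors=True):
    passed, failed = [], []
    for rec in records:
        (passed if rule(rec) else failed).append(rec)
    if treat_failures_as_errors:
        return passed, failed        # errors are omitted from the target
    return passed + failed, []       # warnings still reach the target

rule = lambda rec: rec["SalariedFlag"] == 0
to_target, omitted = route([{"SalariedFlag": 0}, {"SalariedFlag": 1}], rule)
print(len(to_target), "written,", len(omitted), "omitted")
```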
The records that fail to match the data quality rule can be written to and stored in a separate error log. See the Record Level Log section above to learn how you can store erroneous records using a Record Level Log object.
The Quick Profile option in Astera gives users the ability to preview field statistics of any set-level object in the dataflow at design time. It provides information such as the data type, minimum/maximum values, data count, error count, etc., which can be used to identify and correct data quality issues while designing flows.
The Quick Profile window can be viewed for an entire flow by clicking View > Quick Profile or using the shortcut key Ctrl+Alt+A.
Note: If there are multiple objects in a flow, using Quick Profile from the View tab or the shortcut key will provide information on the columns of the last object in the flow.
To view field statistics at a particular object in the dataflow, right-click on the object’s header and select Quick Profile.
In this case, we are using data from a Loans Excel File Source:
A Quick Profile window will slide up from the bottom of the screen.
Quick Profile provides an overview of the content and quality of all the fields, allowing us to determine whether the data is suitable for further transformation. When creating the flows, we may use this functionality at any point to identify any erroneous data that might be affecting the final results.
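For intuition, the kind of statistics Quick Profile reports can be approximated in a few lines of pandas. This sketch is illustrative only, with made-up column names; Astera computes its profile in memory at design time.

```python
# Rough pandas analogue of the per-field statistics Quick Profile shows
# (data type, min/max, counts, nulls). Column names are made up.
import pandas as pd

df = pd.DataFrame({
    "LoanAmount": [25000, 180000, 92000, None],
    "Grade": ["A", "B", "A", "C"],
})
for col in df.columns:
    s = df[col]
    print(col, s.dtype, "min:", s.min(), "max:", s.max(),
          "count:", int(s.count()), "nulls:", int(s.isna().sum()))
```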
Note: The data processed by Quick Profile is in-memory and can only be stored permanently if exported to an Excel file. To do that, right-click anywhere in the Quick Profile window and select Export to Excel.
Enter a valid file name and click Save.
This concludes the use of the Quick Profile feature in Astera.
In addition to the standard logging functionality, Astera provides a special Data Quality Mode option, useful for advanced profiling and debugging. When a dataflow is created/opened in Data Quality Mode, most objects on the dataflow show the Messages node with output ports.
In this document, we will learn how to use the Data Quality Mode in Astera.
In this case, we have a simple dataflow designed to perform a data quality check. It contains customers’ data coming in from an Excel Workbook Source. A Data Quality Rule object is added to validate data for null values and perform warning checks.
If you preview the Customers dataset output at this stage, you will see that some of the records have missing values in the Region and Fax fields.
Data quality rules are set so that records with empty Region values are marked as errors and records with empty Fax values are marked as warnings. A red exclamation sign in the Data Preview window identifies the records that have failed to match the rule and returned an error or a warning as a result.
Now, suppose we want to collect information such as the number of errors/warnings in each record and the error/warning messages attached to these records, and write this information to a destination. For this purpose, we will use Data Quality Mode.
Note: The Record Level Log feature also collects and records this information, but it does not let us process that information further in the dataflow.
To activate this feature, click on the Data Quality Mode icon located at the top of the dataflow designer.
Once Data Quality Mode is activated, a Messages node will be added to all the objects in the dataflow.
The Messages node captures the following statistical information:
TotalCount
ErrorCount
WarningCount
InfoCount
MessagesText
DbAction
Custom
In addition, the FirstItem, LastItem, and Items sub-nodes provide a way to collect quality control data for each of the records. The quality control data includes items such as ElementName, MessageType, and Action, and can be written to a destination object for record-keeping purposes.
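To visualize the two levels of information the Messages node carries, here is a hypothetical Python sketch. The field names echo the node layout above, but the code itself is only a conceptual model, not Astera's API.

```python
# Conceptual model of the Messages node: dataset-level counts plus
# record-level items. Names mirror the node layout; the code itself
# is hypothetical, not Astera's API.
def summarize(messages):
    summary = {
        "TotalCount": len(messages),
        "ErrorCount": sum(m["MessageType"] == "Error" for m in messages),
        "WarningCount": sum(m["MessageType"] == "Warning" for m in messages),
        "InfoCount": sum(m["MessageType"] == "Info" for m in messages),
    }
    # Record-level detail, analogous to the Items sub-node.
    items = [(m["ElementName"], m["MessageType"], m["Text"]) for m in messages]
    return summary, items

msgs = [
    {"ElementName": "Region", "MessageType": "Error", "Text": "Region is null"},
    {"ElementName": "Fax", "MessageType": "Warning", "Text": "Fax is empty"},
]
print(summarize(msgs))
```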
Connecting the Messages node's output ports to another object's input ports on the dataflow makes it possible to get both summary statistics and record-level statistics for the dataset, which are useful for analysis and debugging. To do this:
Right-click on the Messages node inside the NoNullValues_Rule object and go to Write to > Delimited File Destination.
A Delimited Destination object is added to the designer with mapped fields.
Configure settings for the Delimited File Destination to save this data.
Right-click on the header of the destination object and select Preview Output from the context menu.
A Data Preview window will open, showing error and warning information.
The Data Profile feature provides complete field statistics, basic and detailed, containing information such as the data type, minimum/maximum values, data count, error count, etc. The statistics are collected for each of the selected fields at the time the dataflow runs.
In this document, we will learn how to create a Data Profile in Astera.
In this case, we will use data from a Customers Database Table Source.
We want to collect statistics on these fields of data. For this purpose, we will use Astera’s Data Profile feature.
To get the Data Profile object from the Toolbox, go to Toolbox > Data Profiling > Data Profile. If you are unable to see the Toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Data Profile object onto the dataflow designer.
You can see that the Data Profile object is empty right now. This is because we have not mapped any fields to it yet.
Auto-map the fields from the source object onto the profile object.
Note: A Data Profile object is designed to capture statistics for an entire field layout. For this reason, it should be linked to the main Output port of the object whose field statistics you wish to collect.
To configure the Data Profile object, right-click on its header and select Properties from the context menu.
A configuration window will open. The first screen you will see is the Layout Builder. This is where we can create or delete fields and change field names and data types.
Click Next. This is the Properties window.
Here we will provide the Profile File path to specify where the profile should be stored.
Specify the type of Field Statistics to be collected.
The Field Statistics dropdown allows you to choose the level of detail of the statistics to collect. Select from the following levels:
Basic Statistics: This is the default mode. It captures the most common statistical measures for the field’s data type.
No Statistics: No statistics are captured by the Data Profile.
Detailed Statistics – Case Sensitive Comparison: Additional statistical measures are captured by the Data Profile, for example Mean, Mode, and Median, using case-sensitive comparison for strings.
Detailed Statistics – Case Insensitive Comparison: Additional statistics are captured by the Data Profile, using case-insensitive comparison for strings.
In this case, we will select Detailed Statistics – Case Sensitive Comparison.
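To see what the case-sensitivity choice means for string statistics, consider this small illustrative Python snippet (not Astera's algorithm):

```python
# Case sensitivity changes counts such as distinct values for strings.
values = ["Paris", "paris", "LONDON", "London"]

distinct_case_sensitive = len(set(values))                    # 4
distinct_case_insensitive = len({v.lower() for v in values})  # 2
print(distinct_case_sensitive, distinct_case_insensitive)
```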
Click OK.
After configuring the settings for the Data Profile object, click on the Start Dataflow icon from the toolbar located at the top of the window.
A Job Progress window will open at this instant and will show you the trace of the job.
Click on the Profile link provided in the Job Progress window and the profile will open in Astera. Expand the Profile node to see each field inside the object. Click on these fields to see the collected statistical values.
The Field Profile feature captures statistics for selected fields from one or several objects. Field Profile is essentially a transformation object as it provides Input and Output ports similar to other transformations. These output ports make it possible to feed the statistics collected to another object on the dataflow.
In this document, we will learn how to create a Field Profile in Astera.
In this case, we have extracted data from a sample Invoices dataset.
We want to collect detailed statistics from some of these fields and write them to a Delimited File Destination. For this purpose, we will use Astera's Field Profile feature.
To get a Field Profile object from the Toolbox, go to Toolbox > Data Profiling > Field Profile. If you are unable to see the toolbox, go to View > Toolbox or press Ctrl + Alt + X.
Drag-and-drop the Field Profile object onto the dataflow designer.
You can see that the dragged Field Profile object contains an Input node and an Output node. The Input node is empty as we have not mapped any fields to it yet.
One by one, map the ShipName, CustomerID, Country, OrderDate, ProductName, UnitPrice, and Quantity fields from the source object to the Field Profile object's Input node.
Note: Statistics will be collected only for the fields linked to the Input node of the Field Profile object. This way, you can selectively collect statistics for a subset of fields from the selected field layout.
To configure the Field Profile object, right-click on its header and select Properties from the context menu.
A configuration window will open. The first screen you will see is the Layout Builder. This is where we can create or delete fields and change their names and data types.
Click Next. On the Properties window, specify the Statistics Type from the dropdown list.
The Field Statistics dropdown allows you to select the level of detail of the statistics to collect. Select from the following levels:
Basic Statistics: This is the default mode. It captures the most common statistical measures for the field’s data type.
No Statistics: No statistics are captured by the Data Profile.
Detailed Statistics – Case Sensitive Comparison: Additional statistical measures are captured by the Data Profile, for example Mean, Mode, and Median, using case-sensitive comparison for strings.
Detailed Statistics – Case Insensitive Comparison: Additional statistics are captured by the Data Profile, using case-insensitive comparison for strings.
In this case, we will select Detailed Statistics – Case Sensitive Comparison.
Click OK.
Right-click on the Field Profile object's header and select Preview Output from the context menu.
A Data Preview window will open and show you the statistics of each mapped field as a record.
Observe that the Field Profile object contains an Output node. Once expanded, you will see various statistical measures as fields with output ports.
We can write these statistical measures to a destination file.
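The pivot at work here, where each input field becomes one output record of statistics, can be sketched in a few lines of Python. The helper below is hypothetical and only models the idea, not Astera's implementation.

```python
# Sketch of the Field Profile idea: one output record per input field,
# carrying that field's statistics, so the statistics themselves can be
# mapped onward to a destination. Hypothetical helper, not Astera's code.
def field_profile(rows, fields):
    records = []
    for field in fields:
        col = [row[field] for row in rows if row.get(field) is not None]
        records.append({
            "FieldName": field,
            "Count": len(col),
            "Min": min(col) if col else None,
            "Max": max(col) if col else None,
        })
    return records  # each dict is one record of statistics

rows = [{"UnitPrice": 14.0, "Quantity": 12},
        {"UnitPrice": 9.8, "Quantity": 10}]
print(field_profile(rows, ["UnitPrice", "Quantity"]))
```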
Right-click on the Output node and go to Write To > Delimited File Destination.
A Delimited File Destination object will be added to the dataflow designer with auto-mapped fields from the Output node.
Configure settings for your Delimited File Destination. After configuring the settings, click on the Start Dataflow icon from the toolbar located at the top of the window.
A Job Progress window will open at this instant and will show you the trace of the job.
You can open the delimited file that contains field statistics from the link provided in the Job Progress window.