Parquet File Source (Beta)

What is Apache Parquet File Format?

Apache Parquet is a columnar storage file format used by Hadoop ecosystem tools such as Pig, Spark, and Hive. The format is language-independent and binary. Parquet stores large data sets efficiently, and its files use the .parquet extension.

The key features of Parquet with respect to Astera Data Stack are:

  • It supports compression, which reduces the size of the stored file.

  • It encodes the data, for example with dictionary or run-length encoding.

  • It stores data in a column layout.
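
To make the column layout and encoding ideas concrete, here is a minimal toy sketch in plain Python. It does not read or write real Parquet files and does not reflect how Astera processes them; the sample rows and the function names (to_columns, dictionary_encode, dictionary_decode) are made up for illustration.

```python
# Toy illustration only: real Parquet is a binary format with its own
# page and row-group structure. Nothing here reads or writes Parquet.

rows = [
    {"id": 1, "country": "US", "amount": 10.5},
    {"id": 2, "country": "US", "amount": 7.0},
    {"id": 3, "country": "DE", "amount": 3.25},
    {"id": 4, "country": "US", "amount": 8.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values in order of first appearance
    positions = {}    # value -> index in `dictionary`
    indices = []      # one index per original value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


columns = to_columns(rows)
country_dict, country_idx = dictionary_encode(columns["country"])
print(columns["country"])   # values of one column stored together
print(country_dict, country_idx)
```

Storing the values of one column together places similar values next to each other, which is what makes encodings such as the dictionary scheme above, and compression applied on top of them, effective.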

Using Parquet File Source in Astera

In Astera Data Stack, you can use a Parquet file in which the cardinality of the data is maintained, i.e., all columns must have the same number of values.

Note: There should only be one row for each data field.
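
The cardinality requirement can be pictured as a simple check that every column holds the same number of values. The sketch below is a hypothetical helper written for illustration, not part of Astera; it assumes the data is already available as a dictionary of column name to list of values.

```python
def column_lengths_match(columns):
    """Return True when every column holds the same number of values."""
    lengths = {len(values) for values in columns.values()}
    # Zero or one distinct length means all columns line up.
    return len(lengths) <= 1


good = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
bad = {"id": [1, 2, 3], "name": ["a", "b"]}  # "name" is one value short

print(column_lengths_match(good))
print(column_lengths_match(bad))
```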

  1. Drag and drop the Parquet File Source from the Sources section of the Toolbox onto the dataflow designer.

  2. Right-click on the Parquet File Source object and select Properties from the context menu.

This will open a new window.

Let’s have a look at the options present here.

File Location

File Path: This is where you will provide the path to the .parquet file.

Data Load Option

Use the Data Load option to control memory consumption and read time.

Batch Size: This is where the size of each batch is defined.
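
As a rough illustration of why a batch size helps control memory, the sketch below reads records in fixed-size batches so that only one batch is held in memory at a time. It is generic Python, not Astera's implementation; read_in_batches is a hypothetical name, and treating the batch size as a number of records is an assumption.

```python
from itertools import islice


def read_in_batches(records, batch_size):
    """Yield lists of at most `batch_size` records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Pull the next `batch_size` records; the last batch may be shorter.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Ten records read four at a time give batches of 4, 4, and 2 records.
batches = list(read_in_batches(range(10), batch_size=4))
print(batches)
```

A smaller batch size lowers peak memory use at the cost of more read iterations, while a larger one does the opposite.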

Advanced File Processing: String Processing

Treat empty string as null value: Checking this box converts every empty string to a null value.

Trim strings: Checking this box removes leading and trailing whitespace from string values.
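
The effect of the two string processing options can be sketched in Python as follows. This is an illustration only: process_string and its parameter names are hypothetical, and applying the trim before the empty-string check is an assumption about how the two options interact, not documented Astera behavior.

```python
def process_string(value, treat_empty_as_null=False, trim=False):
    """Apply the two string options to a single value."""
    if value is None:
        return None
    if trim:
        # Remove leading and trailing whitespace.
        value = value.strip()
    # Assumption: the empty-string check runs after trimming.
    if treat_empty_as_null and value == "":
        return None
    return value


print(process_string("  Parquet ", trim=True))
print(process_string("", treat_empty_as_null=True))
```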

  3. Once done, click Next and you will be led to the Layout Builder screen.

The layout is built automatically. If it is not, you can build it using the Build Layout from layout spec option at the top of the screen.

  4. Once done, click Next and you will be taken to the Config Parameters screen.

This allows you to further configure and define dynamic parameters for the Parquet source file.

Note: Parameters left blank will use their default values assigned on the properties page.
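
The fallback rule in the note above can be pictured as a simple merge in which a blank parameter keeps the value from the properties page. The sketch below is illustrative only; resolve_parameters, the parameter names FilePath and BatchSize, and the file paths are hypothetical and not taken from Astera.

```python
def resolve_parameters(property_values, parameter_values):
    """Overlay config parameters on property values, skipping blanks."""
    resolved = dict(property_values)
    for name, value in parameter_values.items():
        if value is None or value == "":
            # Blank parameter: keep the value from the properties page.
            continue
        resolved[name] = value
    return resolved


# Hypothetical names and values, for illustration only.
properties = {"FilePath": "C:/data/sales.parquet", "BatchSize": 1000}
parameters = {"FilePath": "C:/data/sales_2023.parquet", "BatchSize": ""}

resolved = resolve_parameters(properties, parameters)
print(resolved)
```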

  5. Click Next and you will be taken to the General Options screen.

Here, you can add any comments you wish.

  6. Click OK and the Parquet File Source object will be configured.

You can now map these fields to other objects as part of the dataflow.

Data Types Supported by the Parquet File Source in Astera:

  • Integer

  • Time/Timestamp

  • Date

  • String

  • Float

  • Real

  • Decimal

  • Double

  • Byte Array

  • Guid

Data Types Not Supported by the Parquet File Source in Astera:

  • Base64

  • Integer96

  • Image
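
Before pointing the source at a file, it can help to check its fields against the two lists above. The sketch below is a hypothetical pre-check in Python; the type names are copied from the lists, while the helper name, the sample schema, and the idea of representing a schema as (field name, type name) pairs are assumptions made for illustration.

```python
# Type names copied from the supported and unsupported lists above.
SUPPORTED_TYPES = {
    "Integer", "Time/Timestamp", "Date", "String", "Float",
    "Real", "Decimal", "Double", "Byte Array", "Guid",
}
UNSUPPORTED_TYPES = {"Base64", "Integer96", "Image"}


def find_unsupported_fields(schema):
    """Return the names of fields whose type is on the unsupported list."""
    return [name for name, type_name in schema
            if type_name in UNSUPPORTED_TYPES]


# Hypothetical schema, for illustration only.
schema = [
    ("OrderId", "Integer"),
    ("Photo", "Image"),
    ("LoadedAt", "Integer96"),
]

print(find_unsupported_fields(schema))
```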

Limitations:

  • Hierarchy is not supported.

This concludes our discussion on the definition and configuration of the Parquet File Source object in Astera Data Stack.

© Copyright 2023, Astera Software