Apache Parquet is a columnar storage file format used by Hadoop ecosystem tools such as Pig, Spark, and Hive. The format is language-independent and binary. It is designed to store large data sets efficiently, and its files use the .parquet extension.
The key features of Parquet with respect to Astera Data Stack are:
It supports compression, which reduces the size of the stored file.
It encodes the data.
It stores data in a column layout.
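To make the column layout and encoding points more concrete, here is a minimal Python sketch. It is not how Astera Data Stack or the Parquet libraries are implemented. It pivots a few made-up rows into columns and then dictionary-encodes one column, which is one of the encodings the Parquet format uses. The function names and the sample data are assumptions made for illustration.

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    indices = []      # one small integer per original value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


# Hypothetical sample data, not taken from any real file.
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "PK"},
    {"id": 3, "country": "US"},
]
columns = rows_to_columns(rows)
dictionary, indices = dictionary_encode(columns["country"])
print(columns)
print(dictionary, indices)
```

Because all values of a column sit together, repeated values such as country codes can be stored once and referenced by index, which is part of why columnar files tend to compress well.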
In Astera Data Stack, you can use a Parquet file in which the cardinality of the data is maintained, i.e., all columns have the same number of values.
Note: There should only be one row for each data field.
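The cardinality requirement can be pictured with a short Python sketch. It assumes the file contents are already available as a mapping of column names to lists of values. The function name check_cardinality is hypothetical and is not part of Astera Data Stack.

```python
def check_cardinality(columns):
    """Return the shared row count, or raise ValueError if columns differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # All columns agree; the row count is that shared length (0 if no columns).
    return next(iter(lengths.values()), 0)


# Hypothetical data: both columns hold three values, so the check passes.
row_count = check_cardinality({"id": [1, 2, 3], "name": ["a", "b", "c"]})
print(row_count)
```

A file in which one column holds three values and another holds two would fail this check, which is the situation the requirement above rules out.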
Drag and drop the Parquet File Source from the Sources section of the Toolbox onto the dataflow designer.
Right-click on the Parquet File Source object and select Properties from the context menu.
This will open a new window.
Let’s have a look at the options present here.
File Location
File Path: This is where you will provide the path to the .parquet file.
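Before pointing the File Path at a file, it can help to confirm that the file really is Parquet. The sketch below is an optional pre-check written in plain Python, not something Astera Data Stack requires. It relies on the fact that standard Parquet files with an unencrypted footer begin and end with the 4-byte marker PAR1. The function name looks_like_parquet is an assumption made for illustration.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # marker at the start and end of a standard Parquet file


def looks_like_parquet(path):
    """Cheap sanity check: .parquet extension plus leading and trailing magic bytes."""
    path = Path(path)
    if path.suffix.lower() != ".parquet":
        return False
    # Too small to hold both markers. stat() raises FileNotFoundError if the path is missing.
    if path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        handle.seek(-len(PARQUET_MAGIC), 2)  # 2 = seek relative to end of file
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

This only inspects the markers. It does not validate the footer metadata, so a file that passes can still be corrupt.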
Data Load Option
Use the Data Load option to control memory consumption and read performance.
Batch Size: This is where the size of each batch is defined.
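The effect of a batch size can be sketched in Python as follows. This illustrates batching in general and makes no claim about how Astera Data Stack reads files internally. The function name read_in_batches is hypothetical.

```python
from itertools import islice


def read_in_batches(records, batch_size):
    """Yield lists of at most `batch_size` records from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(records)
    while True:
        # Only one batch is held in memory at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical input: ten records read four at a time.
batches = list(read_in_batches(range(10), batch_size=4))
print([len(batch) for batch in batches])
```

A smaller batch size keeps less data in memory at once but needs more read passes. A larger one does the opposite.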
Advanced File Processing: String Processing
Treat empty string as null value: When checked, every empty string is read as a null value.
Trim strings: When checked, leading and trailing whitespace is removed from string values.
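The two string processing options can be pictured with the Python sketch below. The function name, the parameter names, and the order of operations are all assumptions. In particular, the sketch trims before checking for an empty string, which may not match what Astera Data Stack does when both options are enabled.

```python
def process_string(value, treat_empty_as_null=False, trim=False):
    """Apply the two string options to a single value (None stands for null)."""
    if value is None:
        return None
    if trim:
        # Remove leading and trailing whitespace.
        value = value.strip()
    if treat_empty_as_null and value == "":
        # Assumption: the check runs after trimming, so "   " also becomes null.
        return None
    return value


print(repr(process_string("  Lahore ", trim=True)))
print(repr(process_string("", treat_empty_as_null=True)))
```

With neither option enabled, values pass through unchanged.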
Once done, click Next and you will be led to the Layout Builder screen.
The layout is built automatically. If it is not, you can build it using the Build Layout from layout spec option at the top of the screen.
Once done, click Next and you will be taken to the Config Parameters screen.
This allows you to further configure and define dynamic parameters for the Parquet source file.
Note: Parameters left blank will use their default values assigned on the properties page.
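The fallback rule in the note above can be sketched in Python. The parameter names FilePath and BatchSize and their values are hypothetical placeholders, not the names Astera Data Stack uses on the Config Parameters screen.

```python
def resolve_parameters(defaults, overrides):
    """Use an override when one is given; fall back to the default when it is blank."""
    resolved = dict(defaults)
    for name, value in overrides.items():
        # A blank parameter (None or empty string) keeps the default value.
        if value is not None and value != "":
            resolved[name] = value
    return resolved


# Hypothetical values for illustration only.
defaults = {"FilePath": "C:/data/orders.parquet", "BatchSize": 1000}
overrides = {"FilePath": "", "BatchSize": 500}
print(resolve_parameters(defaults, overrides))
```

Here FilePath is left blank and keeps the value from the properties page, while BatchSize takes the value supplied as a parameter.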
Click Next and you will be taken to the General Options screen.
Here, you can add any comments about the object.
Click OK and the Parquet File Source object will be configured.
You can now map these fields to other objects as part of the dataflow.
The Parquet File Source supports the following data types:
Integer
Time/Timestamp
Date
String
Float
Real
Decimal
Double
Byte Array
Guid
Base64
Integer96
Image
Note: Hierarchical data is not supported.
This concludes our discussion on the definition and configuration of the Parquet File Source object in Astera Data Stack.