# Parquet File Source (Beta)

### What is Apache Parquet File Format?

Apache Parquet is a columnar storage file format used by Hadoop ecosystem tools such as Pig, Spark, and Hive. The format is language-independent and binary. Parquet is used to store large data sets efficiently, and its files use the .parquet extension.

The key features of Parquet with respect to Astera Data Stack are:

* It supports compression, which reduces the file size.
* It encodes the data.
* It stores data in a column layout.
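To make the column layout and encoding ideas concrete, here is a minimal sketch in plain Python. It is not Parquet's actual on-disk format, nor how Astera reads the file. The function names `to_columns` and `dictionary_encode` and the sample records are hypothetical and exist only for illustration. The sketch pivots row records into columns and then dictionary-encodes a column with repeated values, dictionary encoding being one of the encodings Parquet supports.

```python
from typing import Any


def to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into a column-oriented layout."""
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def dictionary_encode(values: list[Any]) -> tuple[list[Any], list[int]]:
    """Replace each value with an index into a list of distinct values."""
    dictionary: list[Any] = []
    positions: dict[Any, int] = {}
    indices: list[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Hypothetical sample data, not taken from any real file.
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "PK"},
    {"id": 3, "country": "US"},
]
columns = to_columns(rows)
dictionary, indices = dictionary_encode(columns["country"])
print(columns)              # {'id': [1, 2, 3], 'country': ['US', 'PK', 'US']}
print(dictionary, indices)  # ['US', 'PK'] [0, 1, 0]
```

Because values of the same column sit next to each other, repeated values can be stored once and referenced by index, which is part of why columnar files tend to compress well.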

### Using Parquet File Source in Astera

In Astera Data Stack, you can use a Parquet file in which the cardinality of the data is maintained, i.e., all columns must have the same number of values.

{% hint style="info" %}
**Note:** There should only be one row for each data field.
{% endhint %}
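The cardinality requirement above can be pictured with a small check written in plain Python. This is only a sketch: the helper name `check_equal_column_lengths` and the in-memory dictionary of columns are assumptions made for illustration, not part of Astera Data Stack.

```python
def check_equal_column_lengths(columns: dict[str, list]) -> None:
    """Raise ValueError if the columns do not all hold the same number of entries."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")


# Hypothetical data: both columns hold three entries, so this passes silently.
check_equal_column_lengths({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Hypothetical data: "name" is one entry short, so the check raises.
try:
    check_equal_column_lengths({"id": [1, 2, 3], "name": ["a", "b"]})
except ValueError as error:
    print(error)
```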

1. Drag and drop the *Parquet File Source* from the *Sources* section of the Toolbox onto the dataflow designer.

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2F2Z2yFxXPwdaWLSByA6C9%2F01-Drag-Drop-Parquet.PNG?alt=media\&token=c4762486-b56c-429f-9426-952c7f96b9f1)

2. Right-click on the *Parquet File Source* object and select *Properties* from the context menu.

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2FvOL7awGZkxdC1sdBQ7d6%2F02-Right-Click-Parquet-1655366432309.PNG?alt=media\&token=dcea112d-e843-4f95-a324-4454538a80ca)

This will open a new window.

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2F7jHJJRpwZ6L35M9Hgh9e%2F03-Parquet-Properties-Source.PNG?alt=media\&token=7b45072f-9e1e-4c6e-a045-005904c610c9)

Let’s have a look at the options present here.

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2FxxZmqVzFUYb6JsjPuzWi%2F04-Parquet-File-Path.png?alt=media\&token=43fa028e-d258-4bb1-868b-aebabaf68252)

**File Location**

*File Path:* This is where you will provide the path to the .parquet file.

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2FV9pOSEA7bqcECNkMLknx%2F05-01-Data-Load-Options-Parquet.PNG?alt=media\&token=48b48cc5-93c3-4f08-ad49-d7443b6d07f8)

**Data Load Option**

If you wish to control memory consumption and improve read performance, you can use the Data Load option.

*Batch Size:* This is where the size of each batch is defined.
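As a rough illustration of why a batch size helps control memory, the sketch below reads records in fixed-size batches so that only one batch is held at a time. It is written in plain Python under stated assumptions: `read_in_batches` and the generated records are hypothetical stand-ins, and this is not how Astera implements the option internally.

```python
from itertools import islice
from typing import Iterable, Iterator


def read_in_batches(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most batch_size records, one batch at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(records)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Hypothetical stand-in for records read from a .parquet file.
records = ({"id": i} for i in range(10))
batch_lengths = [len(batch) for batch in read_in_batches(records, batch_size=4)]
print(batch_lengths)  # [4, 4, 2]
```

A smaller batch size generally lowers peak memory use at the cost of more read iterations, while a larger one does the opposite.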

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2FjHkckqzLcEXC9FlsuhYT%2F05-Checkboxes-Parquet.PNG?alt=media\&token=3a3b5e75-b27d-4eaa-9dcc-51ea155f7cb8)

**Advanced File Processing: String Processing**

*Treat empty string as null value:* Checking this box returns a null value for every empty string.

*Trim strings:* Checking this box trims leading and trailing whitespace from string values.
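The effect of the two string processing options can be sketched in plain Python as follows. The function `process_string` and its parameter names are hypothetical. Applying the trim before the empty-string check is an assumption of this sketch, so whether Astera treats a whitespace-only string as empty after trimming should be verified in the product.

```python
from typing import Optional


def process_string(value: Optional[str], trim: bool = False,
                   empty_as_null: bool = False) -> Optional[str]:
    """Apply the two options to a single string value."""
    if value is None:
        return None
    if trim:
        # Assumption: trimming removes leading and trailing whitespace.
        value = value.strip()
    if empty_as_null and value == "":
        return None
    return value


print(process_string("  Astera  ", trim=True))       # Astera
print(process_string("", empty_as_null=True))        # None
print(process_string("", empty_as_null=False) == "")  # True
```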

3. Once done, click *Next* and you will be led to the *Layout Builder* screen.

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2F6WUyy4VPgV05pjGHgFOI%2F06-Layout-Builder-Parquet.PNG?alt=media\&token=ee4dee9d-141d-49ed-8607-30333024caa3)

The layout is built automatically. Alternatively, you can build it using the *Build Layout from layout spec* option at the top of the screen.

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2FiYBzbKw9LeuJqh2umQ3p%2F07-Build-Layout-Spec-Parquet.png?alt=media\&token=059b2666-c368-4cd4-81c7-6ba3ddcc1167)

4. Once done, click *Next* and you will be taken to the *Config Parameters* screen.

![08-Config-Paramet](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2Fs1oSvA8UKF1ijMeVlpeZ%2F08-Config-Parameters-Parquet.PNG?alt=media\&token=4f9e2b26-3772-4c5f-813d-55eabdd3dad5)

This allows you to further configure and define dynamic parameters for the Parquet source file.

{% hint style="info" %}
**Note:** Parameters left blank will use their default values assigned on the properties page.
{% endhint %}

5. Click *Next* and you will be taken to the *General Options* screen.

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2Fby7WHYkxGNNeYUsHCwjA%2F09-General-Options-Parquet.PNG?alt=media\&token=26c16dcd-821e-4aec-898f-d689a6e694b0)

Here, you can add any comments.

6. Click *OK* and the *Parquet File Source* object will be configured.

![](https://3083465318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FsR50Wa7EwZGlmPSAMkkf%2Fuploads%2FRLQKtB2BXO9ZOKT1u8ir%2F10-Parquet-Source-Configured.PNG?alt=media\&token=3b4d90c4-a74f-4e78-938e-425b6c5584ae)

You can now map these fields to other objects as part of the dataflow.

### Data Types Supported by the Parquet File Source in Astera

* Integer
* Time/Timestamp
* Date
* String
* Float
* Real
* Decimal
* Double
* Byte Array
* Guid

### Data Types Not Supported by the Parquet File Source in Astera

* Base64
* Integer96
* Image
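Before pointing the source at a file, it can help to compare the file's column types against the two lists above. The sketch below does this in plain Python. The type names are copied from the lists, with Time/Timestamp split into two entries. The `unsupported_columns` helper and the sample schema are hypothetical, and how a given tool labels Parquet types may differ from these names.

```python
# Type names taken from the supported list in this document.
SUPPORTED_TYPES = {
    "Integer", "Time", "Timestamp", "Date", "String", "Float",
    "Real", "Decimal", "Double", "Byte Array", "Guid",
}


def unsupported_columns(schema: dict[str, str]) -> dict[str, str]:
    """Return the columns whose type name is not in the supported list."""
    return {name: type_name for name, type_name in schema.items()
            if type_name not in SUPPORTED_TYPES}


# Hypothetical schema: column name -> type name.
schema = {"order_id": "Integer", "created_at": "Timestamp", "legacy_ts": "Integer96"}
print(unsupported_columns(schema))  # {'legacy_ts': 'Integer96'}
```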

### Limitations

* Hierarchical (nested) data is not supported.
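To illustrate what this limitation means in practice, the plain-Python sketch below flags columns whose type is itself a structure rather than a single flat value. The schema representation (a dictionary in which nested columns map to a `dict` or `list`) and the name `find_nested_fields` are assumptions made for this example only.

```python
def find_nested_fields(schema: dict) -> list[str]:
    """Return the names of columns whose type is a nested structure."""
    return [name for name, field_type in schema.items()
            if isinstance(field_type, (dict, list))]


# Hypothetical schema: "address" is a nested record and "tags" is a repeated field.
schema_with_hierarchy = {
    "id": "Integer",
    "address": {"city": "String", "zip": "String"},
    "tags": ["String"],
}
print(find_nested_fields(schema_with_hierarchy))  # ['address', 'tags']
```

Columns flagged this way would need to be flattened upstream before the file fits the flat layout described here.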

This concludes our discussion on the definition and configuration of the *Parquet File Source* object in Astera Data Stack.
