# How to Work with Microsoft Word (Doc/Docx) Files in a Report Model

Astera supports the extraction of data from a wide range of unstructured file formats including PDF files, PDF forms, TXT, XLS/XLSX, PRN and RTF.

By integrating Astera with a third-party tool, you can also extract unstructured data from MS Word doc/docx files in bulk.

In this document, we will use the open-source tool *OfficeToPDF.exe* to convert MS Word documents into PDF files, and we will orchestrate the conversion through a workflow in Astera.

The following system requirements must be met in order to use *OfficeToPDF.exe*:

* .NET Framework 4
* Office 2016, 2013, 2010 or 2007

Read more about *OfficeToPDF.exe* [here](https://github.com/cognidox/OfficeToPDF).

## **Integration with Astera**

1. Download the zip file [*WordtoPDF.zip*](https://www.astera.com/Downloads/Misc/WordtoPDF.zip) and extract the three files shown below.

![](https://627607815-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6xzBT0roYJkfVS5klkLl%2Fuploads%2FbrD48q9YdzpFfJB7FEhz%2F0.png?alt=media)

2. In the Astera client, go to the *File* menu in the menu bar at the top, click *Open*, and browse to the *Sample\_Workflow\.Wfs* file extracted in the first step.

![](https://627607815-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6xzBT0roYJkfVS5klkLl%2Fuploads%2FBMUj9iocVsIKiZalt9U5%2F1.png?alt=media)

*Sample\_Workflow\.Wfs* will open in your application as shown below.

![](https://627607815-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6xzBT0roYJkfVS5klkLl%2Fuploads%2Fe94dF8e54tzvPHmcdF1P%2F2.png?alt=media)

Follow the steps below to configure this workflow.

3. Right-click on the *FilesToConvert* object and select *Properties* from the context menu. A configuration window will open.

![](https://627607815-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6xzBT0roYJkfVS5klkLl%2Fuploads%2FJfSmkeFFQ6W7hoaQcm0n%2F3.gif?alt=media)

Here, provide the path to the source folder that contains the doc/docx files. Apply the filter "**\*.doc\***" and click *OK*.

![](https://627607815-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6xzBT0roYJkfVS5klkLl%2Fuploads%2Fs55M5nz7fkrEcFenanSF%2F4.png?alt=media)
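To get a feel for which files the "**\*.doc\***" filter is likely to pick up, the short Python sketch below applies the same wildcard to a list of made-up file names. It assumes the filter follows standard wildcard semantics and ignores case; Astera's own matching may differ in details, and the file names are purely illustrative.

```python
from fnmatch import fnmatch


def matches_filter(file_name: str, pattern: str = "*.doc*") -> bool:
    """Return True when file_name matches the wildcard pattern, ignoring case."""
    # Lower-casing both sides makes the comparison case-insensitive on every OS.
    return fnmatch(file_name.lower(), pattern.lower())


# Hypothetical folder contents, for illustration only.
names = ["report.doc", "invoice.DOCX", "notes.txt", "template.docm", "summary.pdf"]

selected = [name for name in names if matches_filter(name)]
print(selected)  # ['report.doc', 'invoice.DOCX', 'template.docm']
```

Under these assumptions the pattern also matches other extensions that begin with ".doc", such as ".docm", so it is worth checking that the source folder contains only the files you intend to convert.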

4. Right-click on the header of the *Exe\_FilePath* object and select *Properties* from the context menu. In the constant value box, paste the local path to *OfficeToPDF.exe*, extracted in the first step, and click *OK*.

![](https://627607815-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6xzBT0roYJkfVS5klkLl%2Fuploads%2FTVFkh0QLeDkw7Gt8z65g%2F5.jpeg?alt=media)

5. Right-click on the header of the *Bat\_FilePath* object and select *Properties* from the context menu. In the constant value box, paste the local path to the *officetopdf.bat* Windows batch file, extracted in the first step.

![](https://627607815-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6xzBT0roYJkfVS5klkLl%2Fuploads%2FBXhRVgJ7ZhnQuE8VdVRG%2F6.jpeg?alt=media)

6. Right-click on the header of the *RunExe* object to open its configuration window. Here, set the *Program Path* to the local path of the *officetopdf.bat* Windows batch file.

![](https://627607815-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6xzBT0roYJkfVS5klkLl%2Fuploads%2FSwffpbJ7mq0KMFRUfIpo%2F7.jpeg?alt=media)

7. Click on the *Run Workflow* icon to execute this workflow. This will generate PDF files for all .doc/.docx files residing in the folder specified in the *FilesToConvert* object (step 3).

![](https://627607815-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6xzBT0roYJkfVS5klkLl%2Fuploads%2F4r4eP7sR3o5CJ3ANzpXW%2F8.png?alt=media)

Now, these PDF files can be loaded into the Astera designer to create an extraction template (report model) and extract the data.
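For readers who want to see roughly what the workflow does for each file, the Python sketch below builds one conversion command per document. It is an illustration only, not the contents of *officetopdf.bat* (which this document does not show). It assumes that *OfficeToPDF.exe* accepts an input file followed by an output file on the command line, as its project page describes; the paths and the helper names `build_command` and `convert` are hypothetical.

```python
import subprocess
from pathlib import PureWindowsPath


def build_command(exe_path: str, doc_path: str) -> list[str]:
    """Build the argument list that converts one Word file to a PDF beside it."""
    doc = PureWindowsPath(doc_path)
    pdf = doc.with_suffix(".pdf")  # same folder and name, .pdf extension
    # Assumed invocation: OfficeToPDF.exe <input_file> <output_file>
    return [exe_path, str(doc), str(pdf)]


def convert(command: list[str]) -> int:
    """Run one conversion and return the process exit code (Windows with Office only)."""
    return subprocess.run(command).returncode


# Hypothetical paths; replace them with the locations used in steps 3 and 4.
exe = r"C:\Tools\OfficeToPDF.exe"
documents = [r"C:\Docs\report.docx", r"C:\Docs\invoice.doc"]

for document in documents:
    # Only print the command here; call convert(...) to actually run it.
    print(build_command(exe, document))
```

Keeping command construction separate from execution makes it easy to review the exact paths before any conversion runs.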
