How to Work with Microsoft Word (Doc/Docx) Files in a Report Model
© Copyright 2023, Astera Software
Astera supports the extraction of data from a wide range of unstructured file formats including PDF files, PDF forms, TXT, XLS/XLSX, PRN and RTF.
By integrating Astera with a third-party tool, you can also extract unstructured data from MS Word doc/docx files in bulk.
In this document, we will use the open-source tool OfficeToPDF.exe to convert MS Word documents into PDF files. We will orchestrate the conversion through the workflow component in Astera.
The following system requirements must be met in order to use OfficeToPDF.exe:
.NET Framework 4
Office 2016, 2013, 2010, or 2007
Read more about OfficeToPDF.exe here.
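The conversion that the workflow orchestrates can be sketched outside Astera as a short Python script. This is a minimal sketch, not the contents of the supplied batch file: the function names build_commands and convert_all are hypothetical, and it assumes OfficeToPDF.exe is called with an input file followed by an output file. Running convert_all requires Windows with Microsoft Office installed.

```python
from pathlib import Path
import subprocess


def build_commands(exe_path, source_dir):
    """Return one OfficeToPDF command per .doc/.docx file in source_dir.

    Assumption: OfficeToPDF.exe accepts "<input file> <output file>".
    """
    commands = []
    for doc in sorted(Path(source_dir).iterdir()):
        if doc.is_file() and doc.suffix.lower() in {".doc", ".docx"}:
            # Write the PDF next to the source file, with the same base name.
            pdf = doc.with_suffix(".pdf")
            commands.append([str(exe_path), str(doc), str(pdf)])
    return commands


def convert_all(exe_path, source_dir):
    """Run every conversion command and stop at the first failure."""
    for cmd in build_commands(exe_path, source_dir):
        subprocess.run(cmd, check=True)
```

Building the command list separately from running it makes it possible to inspect which files would be converted before anything is executed.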
Download the zip folder WordtoPDF.zip and extract its three files: Sample_Workflow.Wfs, OfficeToPDF.exe, and officetopdf.bat.
In the Astera client, go to the File menu in the menu bar at the top, click Open, and browse to the Sample_Workflow.Wfs file extracted in the first step.
The Sample_Workflow.Wfs will open in your application as shown below.
Follow the steps below to configure this workflow.
Right-click on the FilesToConvert object and select Properties from the context menu. A configuration window will open.
Here, provide the path to the source folder that contains the doc/docx files. Apply the filter "*.doc*" and click OK.
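To see which file names a filter such as "*.doc*" selects, the pattern can be tried out with Python's standard fnmatch module. This is only an illustration: it assumes the filter in Astera behaves like ordinary wildcard matching, and the helper matches_filter and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_filter(filename, pattern="*.doc*"):
    # Compare in lower case so that REPORT.DOCX is treated like report.docx.
    return fnmatchcase(filename.lower(), pattern.lower())


names = ["invoice.doc", "report.docx", "macro.docm", "notes.txt", "summary.pdf"]
selected = [n for n in names if matches_filter(n)]
print(selected)  # ['invoice.doc', 'report.docx', 'macro.docm']
```

Under this assumption, the pattern matches any name containing ".doc", so it picks up .doc and .docx files as well as other extensions that start with "doc", such as .docm.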
Right-click on the header of the Exe_FilePath object and select Properties from the context menu. In the constant value box, paste the local path to OfficeToPDF.exe, extracted in the first step, and click OK.
Right-click on the header of the Bat_FilePath object and select Properties from the context menu. In the constant value box, paste the local path to the officetopdf.bat Windows batch file, extracted in the first step.
Right-click on the header of the RunExe object to open its configuration window. Here, set the Program Path to the local path of the officetopdf.bat Windows batch file.
Click the Run Workflow icon to execute this workflow. This will generate a PDF file for every .doc/.docx file residing in the folder specified in the FilesToConvert object.
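After the workflow has run, it can be useful to confirm that every Word file actually produced a PDF. The following is a minimal sketch of such a check. It assumes that each PDF is written to the same folder as its source document with the same base name, which may not match your configuration, and the function name missing_pdfs is hypothetical.

```python
from pathlib import Path


def missing_pdfs(source_dir):
    """List Word files in source_dir that have no same-named .pdf beside them."""
    missing = []
    for doc in sorted(Path(source_dir).iterdir()):
        if doc.is_file() and doc.suffix.lower() in {".doc", ".docx"}:
            # Assumption: the PDF sits next to the source file.
            if not doc.with_suffix(".pdf").exists():
                missing.append(doc.name)
    return missing
```

An empty list means every document in the folder has a matching PDF. Any names returned are candidates for a second conversion attempt.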
These PDF files can now be loaded into the Astera designer to create an extraction template (report model) and extract the data.