Skip to main content

Extracting data from Word (.docx) files into Power Query

Word and Excel don’t usually get along too well so it's no surprise that Power Query isn't directly compatible with its estranged cousin Word either. If you are presented with the need to import data from Word into Power Query you'll be please to hear it is possible however it requires a couple of manual steps to make it work. 

The manual steps could fairly easily be completed by a batch file which would automate the process further.

Here is the Excel data pasted 'as values' in a Word file which i'll use for the first example

Here is the Excel data pasted with 'keep source formatting' which i'll reference a couple of times in the article.

Although the steps I've covered below aren't complex, this whole process has some unknowns around it so you may find the result in your instance varies from mine. The Word file I've used contains the contents of a range of excel cells that I deliberately pasted as values into Word to create a test file for this example. I've just repeated the process with another file where I pasted using the default paste option (keep source formatting) and the below example didn’t work because the data was located in a different place in the XML structure of the document.xml file. After a bit of exploratory work I managed to locate the data which would allow this method to work. To summarise, if your word file is formatted differently to this example, you may need to open the xml file in a browser to locate the values from the table and then you'll be able to navigate to it in Power Query.

Here is the XML Mapping from the two examples i've mentioned

This is how I found the data in the XML structure for the 2 examples I mentioned earlier

Word file containing a table with data pasted as values

body, p, r  (to reveal column 't' which contains the values from the table)

Word file containing table with data pasted with 'keep source formatting'

body, tbl, tr, tc, p, r (to reveal column 't' with the values from the table)

Even with these caveats, this process is one that demonstrates some of the potential of Power Query when combined with some creative manipulation of source data.

Warning! This method will make the file unopenable in Word, I'd suggest making a copy of the file and working with the copy.

I must give a nod to Matt Allington who documented a similar process with PDF files https://exceleratorbi.com.au/import-tabular-data-pdf-using-power-query/ which possibly had some influence in Microsoft providing a PDF connector into Power Query in 2019. I would assume that the demand for a Word connector would be significant so I expect this article to have a limited lifespan.

The M code from these queries is at the bottom of this article.

The manual steps

  • In windows Explorer, change the file extension from .docx to .zip and then extract (unzip) the file to a new folder. This will create a folder with the following data within it
  • In the 'word' folder there will be additional items, if you look at it you'll notice the 'document.xml' file is larger than the other files. This is the file we're interested in

Connecting Power Query to the document.xml file

In Power Query, use the 'new source, more' option to display the available connectors

Select the XML option, navigate to the 'document.xml' file and select it. Then you'll be presented with the navigator window which can be used to load only portions of the file. See the comment earlier about the XML location in the file. In this example, the data in the Word file has been pasted as values so I'll need the following selection.

Once loaded, column 'r' needs to be expanded which can be done by clicking the 'expand' option in the header row

This will then display the 't' column which contains the values from the table in the word file

Then we'll need to remove all the other columns apart from the 't' column using the 'Remove Other Columns' step

For some reason there are some 'null' values in the first few rows which I know are not columns so we'll remove them using the filter options

This will leave the following data in the 't' column

This post could end here as the column displayed above is (technically) the contents of the word file. However, the structure isnt the same as the table in the word file. To make it the same we need to do some transformations.

Transforming the data to match the word file

I suspect each application of this approach may vary so this method is tailored for the file I'm using. It is possible your attempt creates something identical, if that’s the case then the following should work for you too

To transform the data we need to do the following

  • Create a column that groups each data point into a row
  • Create a column that groups each data point into a column
  • Pivot the data so it is displayed in columns and rows
  • Tidy up any headers/helper columns so we're left with just the data we need

Column to group each data point into rows

The first 6 items in the above list are the items in the first row with the next 6 items in the next row. We need to have something in an adjacent column that can be used to group these items accordingly

First we'll add a new index column from 0 using the following ribbon option

Next we'll select the 'Index' column and then divide it by 6 using the 'Standard -> divide' options in the 'Transform' ribbon. Note these options are also in the 'new column' ribbon but we want to transform an existing column (not add a new column) so we use the ones in the 'transform' ribbon.

Next we'll use the 'round down' option in the same ribbon under 'rounding'

Now we have a column, that identifies the row number of the datapoint which matches the table in the word file. The last step is to rename the column to 'RowID'

Column to group each data point into columns

Now we have a column to identify the 

Add another index column from zero

Next we'll add another column that multiplies the RowID column by 6


Next we'll need to select both the new index and the Multiplication columns and then in the 'add column' ribbon, select 'Standard, Subtract' which will create a new column which will contain the column number

Rename the 'Subtraction' column to 'ColumnID'

Lastly delete the 'index' and 'multiplication columns


Pivot the data so its displayed in columns and rows

Now we have 3 columns, one with the data to display in each cell, another with the RowID and the third with the ColumnID we can pivot the data into a table.

Rename column 't', 'Values'. This isn't strictly part of the process but it'll make sense after the next step.

Select 'ColumnID' and then click 'Pivot Column' which is located in the 'Transform' ribbon

Note, when pivoting data, Its useful to remember the selected column is the one that will become the column headers. 

This will present the following selection

As you can see there is a single drop down option which is titled 'Values Column', we want to have the column where the information to be displayed in each cell is located, in this instance the 'Values' column (see what I did there!)

By default, the pivot function assumes that the data in the values column needs to be counted. As we just want to display the single value in the 'values' column we need to change the aggregation option. To do this, click on the 'Advanced options' toggle to display the advanced option… Once it is displayed, change the 'Aggregate Value Function' drop down to 'Don’t Aggregate' and click 'ok'.

This will return the following table, notice how the values in the ColumnID column (0 to 6) are now displayed, once, in the header row.

All that’s left is to remove the 'RowID' column and then promote the first row as headers and the table is identical to the one we saw at the start of the article in Word.


M Code used to create these queries

Excel data pasted in Word as values

let

    Source = Xml.Tables(File.Contents("C:\Users\...\Documents\Data\SampleData\TableInWordDoc\word\document.xml")),

    Table0 = Source{0}[Table],

    Table1 = Table0{0}[Table],

    #"Expanded r" = Table.ExpandTableColumn(Table1, "r", {"t", "tab"}, {"t", "tab"}),

    #"Removed Other Columns" = Table.SelectColumns(#"Expanded r",{"t"}),

    #"Filtered Rows" = Table.SelectRows(#"Removed Other Columns", each ([t] <> null))

in

    #"Filtered Rows"


Excel data pasted in Word as Values with transformation steps

Let

    Source = Xml.Tables(File.Contents("C:\Users\...\Documents\Data\SampleData\TableInWordDoc\word\document.xml")),

    Table0 = Source{0}[Table],

    Table1 = Table0{0}[Table],

    #"Expanded r" = Table.ExpandTableColumn(Table1, "r", {"t", "tab"}, {"t", "tab"}),

    #"Removed Other Columns" = Table.SelectColumns(#"Expanded r",{"t"}),

    #"Filtered Rows1" = Table.SelectRows(#"Removed Other Columns", each ([t] <> null)),

    #"Added Index" = Table.AddIndexColumn(#"Filtered Rows1", "Index", 0, 1, Int64.Type),

    #"Divided Column" = Table.TransformColumns(#"Added Index", {{"Index", each _ / 6, type number}}),

    #"Rounded Down" = Table.TransformColumns(#"Divided Column",{{"Index", Number.RoundDown, Int64.Type}}),

    #"Renamed Columns" = Table.RenameColumns(#"Rounded Down",{{"Index", "RowID"}}),

    #"Added Index1" = Table.AddIndexColumn(#"Renamed Columns", "Index", 0, 1, Int64.Type),

    #"Inserted Multiplication" = Table.AddColumn(#"Added Index1", "Multiplication", each [RowID] * 6, type number),

    #"Inserted Subtraction" = Table.AddColumn(#"Inserted Multiplication", "Subtraction", each [Index] - [Multiplication], type number),

    #"Renamed Columns1" = Table.RenameColumns(#"Inserted Subtraction",{{"Subtraction", "ColumnID"}}),

    #"Removed Columns" = Table.RemoveColumns(#"Renamed Columns1",{"Index", "Multiplication"}),

    #"Renamed Columns2" = Table.RenameColumns(#"Removed Columns",{{"t", "Values"}}),

    #"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU"), List.Distinct(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU")[ColumnID]), "ColumnID", "Values"),

    #"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"RowID"}),

    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns1", [PromoteAllScalars=true])

in

    #"Promoted Headers"

Excel data pasted in Word as 'keep source formatting' with transformation steps

let

    Source = Xml.Tables(File.Contents("C:\Users\...\Documents\Data\SampleData\TableInWord2\word\document.xml")),

    Table = Source{0}[Table],

    Table1 = Table{0}[Table],

    Table2 = Table1{2}[Table],

    #"Expanded tc" = Table.ExpandTableColumn(Table2, "tc", {"tcPr", "p"}, {"tcPr", "p"}),

    #"Expanded p" = Table.ExpandTableColumn(#"Expanded tc", "p", {"pPr", "proofErr", "r", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}, {"pPr", "proofErr", "r", "http://schemas.openxmlformats.org/wordprocessingml/2006/main.1"}),

    #"Expanded r" = Table.ExpandTableColumn(#"Expanded p", "r", {"rPr", "t", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}, {"rPr", "t", "http://schemas.openxmlformats.org/wordprocessingml/2006/main.2"}),

    #"Removed Other Columns" = Table.SelectColumns(#"Expanded r",{"t"}),

    #"Filtered Rows1" = Table.SelectRows(#"Removed Other Columns", each ([t] <> null)),

    #"Added Index" = Table.AddIndexColumn(#"Filtered Rows1", "Index", 0, 1, Int64.Type),

    #"Divided Column" = Table.TransformColumns(#"Added Index", {{"Index", each _ / 6, type number}}),

    #"Rounded Down" = Table.TransformColumns(#"Divided Column",{{"Index", Number.RoundDown, Int64.Type}}),

    #"Renamed Columns" = Table.RenameColumns(#"Rounded Down",{{"Index", "RowID"}}),

    #"Added Index1" = Table.AddIndexColumn(#"Renamed Columns", "Index", 0, 1, Int64.Type),

    #"Inserted Multiplication" = Table.AddColumn(#"Added Index1", "Multiplication", each [RowID] * 6, type number),

    #"Inserted Subtraction" = Table.AddColumn(#"Inserted Multiplication", "Subtraction", each [Index] - [Multiplication], type number),

    #"Renamed Columns1" = Table.RenameColumns(#"Inserted Subtraction",{{"Subtraction", "ColumnID"}}),

    #"Removed Columns" = Table.RemoveColumns(#"Renamed Columns1",{"Index", "Multiplication"}),

    #"Renamed Columns2" = Table.RenameColumns(#"Removed Columns",{{"t", "Values"}}),

    #"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU"), List.Distinct(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU")[ColumnID]), "ColumnID", "Values"),

    #"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"RowID"}),

    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns1", [PromoteAllScalars=true])

in

    #"Promoted Headers"


Comments

Popular posts from this blog

How to combine multiple files with Power Query (with no VBA and just 10 mouse clicks!)

The need to combine information from multiple files is one that most users of Excel will have come across at some point in the use of Excel. I've personally spent far too many hours aggregating data from multiple files, that are identical in structure, so that I can analyse larger datasets and provide insights into products and processes. For anyone who has also done this and not yet discovered Power Query you'll probably be amazed how simple the process has become. I realise there might be some who will say "just use VBA, its easy once you learn how to code..." and they would be right. The method using Power Query provides a zero code solution that is an evolution of the Excel interface that many will already be familiar with. In this example, I've created a sample file and created a number of duplicates of the file which I've saved in a folder. The folder contains only these files and i'd recommend you do the same if you're looking to try out this pr

How to automate the import of all files in a Google Drive folder to PowerBI, now updated, please read the first paragraphs!! (PowerQuery)

Update!!! This method no longer works although there is a new Google Sheets connector for PowerQuery that is currently in beta. If you're using PowerBI, you'll need to enable the preview features to enable it. https://docs.microsoft.com/en-us/power-query/connectors/googlesheets In the interest if demonstrating how it 'did' work, the original post is provided below. If you attempt to replicate this, you'll quickly realise the website doesn't behave as it used to. (here is the original article)  Using PowerQuery to access multiple files within the same folder on a local or network drive is a game-changing feature that will almost certainly save many people hours of effort. This functionality is great if your data exists in a place that is easily accessible but what do you do if your data is somewhere less accessible like Sharepoint, OneDrive or even Google Docs? I have previously connected to data on SharePoint and found it fairly straight forward which raises the