Word and Excel don’t usually get along too well so it's no surprise that Power Query isn't directly compatible with its estranged cousin Word either. If you are presented with the need to import data from Word into Power Query you'll be please to hear it is possible however it requires a couple of manual steps to make it work.
The manual steps could fairly easily be completed by a batch file which would automate the process further.
Here is the Excel data pasted 'as values' in a Word file which i'll use for the first example
Here is the Excel data pasted with 'keep source formatting' which i'll reference a couple of times in the article.
Although the steps I've covered below aren't complex, this whole process has some unknowns around it so you may find the result in your instance varies from mine. The Word file I've used contains the contents of a range of excel cells that I deliberately pasted as values into Word to create a test file for this example. I've just repeated the process with another file where I pasted using the default paste option (keep source formatting) and the below example didn’t work because the data was located in a different place in the XML structure of the document.xml file. After a bit of exploratory work I managed to locate the data which would allow this method to work. To summarise, if your word file is formatted differently to this example, you may need to open the xml file in a browser to locate the values from the table and then you'll be able to navigate to it in Power Query.
Here is the XML Mapping from the two examples i've mentioned
This is how I found the data in the XML structure for the 2 examples I mentioned earlier
Word file containing a table with data pasted as values
body, p, r (to reveal column 't' which contains the values from the table)
Word file containing table with data pasted with 'keep source formatting'
body, tbl, tr, tc, p, r (to reveal column 't' with the values from the table)
Even with these caveats, this process is one that demonstrates some of the potential of Power Query when combined with some creative manipulation of source data.
Warning! This method will make the file unopenable in Word, I'd suggest making a copy of the file and working with the copy.
I must give a nod to Matt Allington who documented a similar process with PDF files https://exceleratorbi.com.au/import-tabular-data-pdf-using-power-query/ which possibly had some influence in Microsoft providing a PDF connector into Power Query in 2019. I would assume that the demand for a Word connector would be significant so I expect this article to have a limited lifespan.
The M code from these queries is at the bottom of this article.
The manual steps
- In windows Explorer, change the file extension from .docx to .zip and then extract (unzip) the file to a new folder. This will create a folder with the following data within it
- In the 'word' folder there will be additional items, if you look at it you'll notice the 'document.xml' file is larger than the other files. This is the file we're interested in
Connecting Power Query to the document.xml file
In Power Query, use the 'new source, more' option to display the available connectors
Select the XML option, navigate to the 'document.xml' file and select it. Then you'll be presented with the navigator window which can be used to load only portions of the file. See the comment earlier about the XML location in the file. In this example, the data in the Word file has been pasted as values so I'll need the following selection.
Once loaded, column 'r' needs to be expanded which can be done by clicking the 'expand' option in the header row
This will then display the 't' column which contains the values from the table in the word file
Then we'll need to remove all the other columns apart from the 't' column using the 'Remove Other Columns' step
For some reason there are some 'null' values in the first few rows which I know are not columns so we'll remove them using the filter options
This will leave the following data in the 't' column
This post could end here as the column displayed above is (technically) the contents of the word file. However, the structure isnt the same as the table in the word file. To make it the same we need to do some transformations.
Transforming the data to match the word file
I suspect each application of this approach may vary so this method is tailored for the file I'm using. It is possible your attempt creates something identical, if that’s the case then the following should work for you too
To transform the data we need to do the following
- Create a column that groups each data point into a row
- Create a column that groups each data point into a column
- Pivot the data so it is displayed in columns and rows
- Tidy up any headers/helper columns so we're left with just the data we need
Column to group each data point into rows
The first 6 items in the above list are the items in the first row with the next 6 items in the next row. We need to have something in an adjacent column that can be used to group these items accordingly
First we'll add a new index column from 0 using the following ribbon option
Next we'll select the 'Index' column and then divide it by 6 using the 'Standard -> divide' options in the 'Transform' ribbon. Note these options are also in the 'new column' ribbon but we want to transform an existing column (not add a new column) so we use the ones in the 'transform' ribbon.
Next we'll use the 'round down' option in the same ribbon under 'rounding'
Now we have a column, that identifies the row number of the datapoint which matches the table in the word file. The last step is to rename the column to 'RowID'
Column to group each data point into columns
Now we have a column to identify the
Add another index column from zero
Next we'll add another column that multiplies the RowID column by 6
Next we'll need to select both the new index and the Multiplication columns and then in the 'add column' ribbon, select 'Standard, Subtract' which will create a new column which will contain the column number
Rename the 'Subtraction' column to 'ColumnID'
Lastly delete the 'index' and 'multiplication columns
Pivot the data so its displayed in columns and rows
Now we have 3 columns, one with the data to display in each cell, another with the RowID and the third with the ColumnID we can pivot the data into a table.
Rename column 't', 'Values'. This isn't strictly part of the process but it'll make sense after the next step.
Select 'ColumnID' and then click 'Pivot Column' which is located in the 'Transform' ribbon
Note, when pivoting data, Its useful to remember the selected column is the one that will become the column headers.
This will present the following selection
As you can see there is a single drop down option which is titled 'Values Column', we want to have the column where the information to be displayed in each cell is located, in this instance the 'Values' column (see what I did there!)
By default, the pivot function assumes that the data in the values column needs to be counted. As we just want to display the single value in the 'values' column we need to change the aggregation option. To do this, click on the 'Advanced options' toggle to display the advanced option… Once it is displayed, change the 'Aggregate Value Function' drop down to 'Don’t Aggregate' and click 'ok'.
This will return the following table, notice how the values in the ColumnID column (0 to 6) are now displayed, once, in the header row.
All that’s left is to remove the 'RowID' column and then promote the first row as headers and the table is identical to the one we saw at the start of the article in Word.
M Code used to create these queries
Excel data pasted in Word as values
let
Source = Xml.Tables(File.Contents("C:\Users\...\Documents\Data\SampleData\TableInWordDoc\word\document.xml")),
Table0 = Source{0}[Table],
Table1 = Table0{0}[Table],
#"Expanded r" = Table.ExpandTableColumn(Table1, "r", {"t", "tab"}, {"t", "tab"}),
#"Removed Other Columns" = Table.SelectColumns(#"Expanded r",{"t"}),
#"Filtered Rows" = Table.SelectRows(#"Removed Other Columns", each ([t] <> null))
in
#"Filtered Rows"
Excel data pasted in Word as Values with transformation steps
Let
Source = Xml.Tables(File.Contents("C:\Users\...\Documents\Data\SampleData\TableInWordDoc\word\document.xml")),
Table0 = Source{0}[Table],
Table1 = Table0{0}[Table],
#"Expanded r" = Table.ExpandTableColumn(Table1, "r", {"t", "tab"}, {"t", "tab"}),
#"Removed Other Columns" = Table.SelectColumns(#"Expanded r",{"t"}),
#"Filtered Rows1" = Table.SelectRows(#"Removed Other Columns", each ([t] <> null)),
#"Added Index" = Table.AddIndexColumn(#"Filtered Rows1", "Index", 0, 1, Int64.Type),
#"Divided Column" = Table.TransformColumns(#"Added Index", {{"Index", each _ / 6, type number}}),
#"Rounded Down" = Table.TransformColumns(#"Divided Column",{{"Index", Number.RoundDown, Int64.Type}}),
#"Renamed Columns" = Table.RenameColumns(#"Rounded Down",{{"Index", "RowID"}}),
#"Added Index1" = Table.AddIndexColumn(#"Renamed Columns", "Index", 0, 1, Int64.Type),
#"Inserted Multiplication" = Table.AddColumn(#"Added Index1", "Multiplication", each [RowID] * 6, type number),
#"Inserted Subtraction" = Table.AddColumn(#"Inserted Multiplication", "Subtraction", each [Index] - [Multiplication], type number),
#"Renamed Columns1" = Table.RenameColumns(#"Inserted Subtraction",{{"Subtraction", "ColumnID"}}),
#"Removed Columns" = Table.RemoveColumns(#"Renamed Columns1",{"Index", "Multiplication"}),
#"Renamed Columns2" = Table.RenameColumns(#"Removed Columns",{{"t", "Values"}}),
#"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU"), List.Distinct(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU")[ColumnID]), "ColumnID", "Values"),
#"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"RowID"}),
#"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns1", [PromoteAllScalars=true])
in
#"Promoted Headers"
Excel data pasted in Word as 'keep source formatting' with transformation steps
let
Source = Xml.Tables(File.Contents("C:\Users\...\Documents\Data\SampleData\TableInWord2\word\document.xml")),
Table = Source{0}[Table],
Table1 = Table{0}[Table],
Table2 = Table1{2}[Table],
#"Expanded tc" = Table.ExpandTableColumn(Table2, "tc", {"tcPr", "p"}, {"tcPr", "p"}),
#"Expanded p" = Table.ExpandTableColumn(#"Expanded tc", "p", {"pPr", "proofErr", "r", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}, {"pPr", "proofErr", "r", "http://schemas.openxmlformats.org/wordprocessingml/2006/main.1"}),
#"Expanded r" = Table.ExpandTableColumn(#"Expanded p", "r", {"rPr", "t", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}, {"rPr", "t", "http://schemas.openxmlformats.org/wordprocessingml/2006/main.2"}),
#"Removed Other Columns" = Table.SelectColumns(#"Expanded r",{"t"}),
#"Filtered Rows1" = Table.SelectRows(#"Removed Other Columns", each ([t] <> null)),
#"Added Index" = Table.AddIndexColumn(#"Filtered Rows1", "Index", 0, 1, Int64.Type),
#"Divided Column" = Table.TransformColumns(#"Added Index", {{"Index", each _ / 6, type number}}),
#"Rounded Down" = Table.TransformColumns(#"Divided Column",{{"Index", Number.RoundDown, Int64.Type}}),
#"Renamed Columns" = Table.RenameColumns(#"Rounded Down",{{"Index", "RowID"}}),
#"Added Index1" = Table.AddIndexColumn(#"Renamed Columns", "Index", 0, 1, Int64.Type),
#"Inserted Multiplication" = Table.AddColumn(#"Added Index1", "Multiplication", each [RowID] * 6, type number),
#"Inserted Subtraction" = Table.AddColumn(#"Inserted Multiplication", "Subtraction", each [Index] - [Multiplication], type number),
#"Renamed Columns1" = Table.RenameColumns(#"Inserted Subtraction",{{"Subtraction", "ColumnID"}}),
#"Removed Columns" = Table.RemoveColumns(#"Renamed Columns1",{"Index", "Multiplication"}),
#"Renamed Columns2" = Table.RenameColumns(#"Removed Columns",{{"t", "Values"}}),
#"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU"), List.Distinct(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU")[ColumnID]), "ColumnID", "Values"),
#"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"RowID"}),
#"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns1", [PromoteAllScalars=true])
in
#"Promoted Headers"
Comments
Post a Comment