How to automate the import of all files in a Google Drive folder to PowerBI, now updated, please read the first paragraphs!! (PowerQuery)
Update!!! This method no longer works although there is a new Google Sheets connector for PowerQuery that is currently in beta. If you're using PowerBI, you'll need to enable the preview features to enable it.
https://docs.microsoft.com/en-us/power-query/connectors/googlesheets
In the interest if demonstrating how it 'did' work, the original post is provided below.
If you attempt to replicate this, you'll quickly realise the website doesn't behave as it used to.
(here is the original article)
Using PowerQuery to access multiple files within the same folder on a local or network drive is a game-changing feature that will almost certainly save many people hours of effort. This functionality is great if your data exists in a place that is easily accessible but what do you do if your data is somewhere less accessible like Sharepoint, OneDrive or even Google Docs?
I have previously connected to data on SharePoint and found it fairly straight forward which raises the question, is the process for accessing multiple files on Google Drive as simple? Is it even possible? It's feasible that Google might deliberately hinder the use of PowerQuery to access the data but would this make it impossible to achieve? I just had to find out.
Unsurprisingly, the initial attempts didn’t quite work as easily as I'd hoped and I was surprised to find a limited amount of guidance after some googling (could I trust the results?!). Ruth from Curbal has some of the most informative information on this topic and how you can create a function to access the web pages (she's done a great video on it here) but I couldn’t find anything detailing the process to automatically obtain a list of the files so the process could be automated to a one-click refresh.
After some perseverance, I was able to find a process that worked. It consists of the following stage:
1. Creating a parameter to parametrise the input (from a single file)
2. Create a custom function to import and transform the data from a single file
3. Identifying the HTML tag to locate the keys for each file
4. Creating a list of keys associated with the files in the folder
5. Extracting the data from each file in the folder
Important note! The process below works with the google drive data shared with anyone who has a link. This means it is publicly available to anyone who either has the link or the appetite to try and access your data. This isn't advisable with sensitive data. If you don’t know how to share your google drive folder, right-click on the folder, select 'share' and then configure the options to 'anyone with the link'.
How to import data from a folder on Google Drive (aka Google Docs)
1. Create a parameter to parameterise an input
a. Open PowerBI and click the 'transform' button to access PowerQuery (or click 'Get Data' if you're using Excel)
b. Create a parameter by clicking on the 'manage parameters' drop down and selecting 'New Parameter'
c. Give the parameter a name, In this example, I've used XLS_key
d. Give the parameter a 'Current Value', you can use a value I made earlier
1s1jRzufCCmcqJ_EyYAHaP3SUr8LjtuRf
e. Click 'ok'
2. Create a custom function to process each file's key
a. Create a new blank query by selecting 'New Source' then 'Blank Query'
b. In the function bar, enter the following function
= Excel.Workbook(Web.Contents("https://docs.google.com/spreadsheets/d/" & XLS_key &"/export?format=xlsx"), null, true)
c. Apply any transformations you wish to apply, in this example, we'll leave the function as-is for simplicity
d. Rename the function to something more meaningful, I've used GdriveXLS in this example
e. Right-click on the query and select 'Create Function'
f. Give the function a meaningful name, in this example, I've used FnGdriveXLS
g. Once this is done the Queries list will look like this
So now we have a query that can transpose the 'key' value associated with any spreadsheet (of XLS format) is stored in Google Drive and transform it to display the contents of the file. That’s the easy bit done!
Next, we need to devise a method of obtaining all the keys for a given folder so that they can be passed to the function we've created. The preferred method for this is to use Xpath/JSpath tags however, after multiple attempts this wasn’t possible, as the 'key' is not visible on the page and therefore inaccessible… I'd love someone to prove this wrong as it would be a much more elegant method!
Before attempting to do that, there is a need to identify a couple of key parts of the code. Fortunately, the new 'Add Table Using Examples' will be able to do the hard work for us.
3. Identifying the HTML tag to locate the keys for each file
a. Click 'New Source' then 'From web'
b. Paste in the URL of the Google Drive folder that has been set to shareable (https://drive.google.com/drive/folders/1eCWS5pSTJiLVdK0p0AQN5wCFo8ZZMyDG) and click 'ok'
c. At the following screen, there should be a table listed under 'Suggested Tables', select it and then click 'ok'
d. The subsequent query will then contain the tag that we can use to identify the beginning of the string that contains the key for the file. Once the query has loaded the formula in the formula bar will display something similar to…
ii. The string we're interested in here is the row selector pmHCK
4. Creating a list of keys associated with the files in the folder
a. Create a new query from blank (as before)
b. In the formula bar type the following and hit enter
= Web.BrowserContents ("https://drive.google.com/drive/folders/1eCWS5pSTJiLVdK0p0AQN5wCFo8ZZMyDG")
ii. The URL above is one I've created and should work if you use the same one. Once this is done you should see a page full of code that looks like the following:
c. In the ribbon, select 'To Table' to convert the string to a table
d. Next, in the ribbon under 'Transform' select 'Split Column', 'By Delimiter'.
e. Select '--custom--' from the first dropdown and then enter the row selector tag obtained previously in the text box (pmHCK)
f. Check 'Split at' is set to 'Each occurrence of the delimiter' and then under 'Advanced options' set 'Split into' to 'Rows' and click 'OK'. Splitting the contents into rows is one way of overcoming the variable nature of the number of files in the location.
g. In the ribbon select 'Transform', 'Extract', 'Text between delimiters'
h. In the 'Start delimiter' enter 'div data-id=' and in the 'End delimiter' field enter a single space then click 'OK'
ii. You should then see a list of keys associated with all the files in the folder you wish to access
5. Extracting the data from each file in folder
a. Next we need to add a custom column to incorporate the function created previously to do this, in the ribbon click 'Add column' then 'Invoke Custom Function'
b. Select 'FnGdriveXLS' as the 'Function query', set 'XLS_key' as 'Column1' and click 'ok'
c. If there is a request to confirm credentials, click 'continue', then tick the 'Ignore privacy levels….' tick box and click 'Save'
d. After a short pause the results of the functions will be displayed, remove errors from the 'custom' column by right-clicking on the title bar and selecting 'Remove errors'
ii. Note this is worth doing even if there aren't any errors as it will ensure the reliability of the function.
e. Expand the Custom column by clicking on the expand icon in the title bar
f. Ensure all columns are selected, untick 'use original column name as prefix' and click 'ok'
g. Remove all but the 'Name' and 'Data' columns by selecting both, right-clicking and selecting 'remove other columns'
h. Expand the data using the expand icon in the title bar and then you should see all the contents of all the files in the folder
i. All that is left is to promote the headers, filter out the subsequent header rows, change the data types to the desired formats, rename the query (Gdrive) and the data is ready to load into the data model
Once the query has loaded once, it can be tested by creating a table to demonstrate the number of rows loaded and then add some more files to the folder on google drive and refresh the query to see the amount of rows increase. This will confirm that the refresh is capable of incorporating more data as it is loaded to Google Drive
Note the number of rows displayed in the bottom left of the following window. Once additional files are added to the folder and the query is refreshed this number will increase accordingly. This will confirm the query has identified the additional files and imported them into the table.
Limitations - In the example I've created, the folder contains a mix of CSV and XLS files. There is no reason why both formats cannot be imported into the data model however, the example above works only with XLS files. Anything that doesn’t fit this format is filtered as an error in the last few steps.
Although parsing the HTML code from the website isn't the most reliable method, it does prove that it is possible to extract the data automatically. If you have a need to do this then the method described above will reduce the time taken to combine the data
Occasionally, the automatic table detection functionality in PowerQuery can return no suggested tables. For some reason, this functionality can give different results if attempted a second or third time. If there isn't a table identified automatically, my advice would be to cancel the attempt and try it again with the same URL.
Disclaimer - as with any automated process, it is beneficial to include some sort of check to ensure the dataset is as concise as possible. A check on the latest date in the dataset is often sufficient to provide reassurance that the dataset is complete.
Great! Does that work if the folder inside dgdrive has subfolders?
ReplyDeleteUnfortunately not, this method will only work for the contents on a single folder.
DeleteThats sad. Anyway, your method helped me out a lot, thanks so much!
ReplyDeleteI don't know if it is a coincidence or not, but when I tried your method, my PQ was not getting the table even with 20 tries so I couldnt learn what the HTML rowselector code was.
So I tried yours, pmHCK, and it worked. Maybe it is a generic code that works for any folder?
Excellent, great to hear that. I suspected the pmHCK element might be the same bu wasn't sure.
DeleteHello there!
ReplyDeleteIt seems google changed a lot of stuff inside page HTML... Stopped working and didn't find a new pattern to get the key. Have you tried it?