Download a file with Headless Chrome, Node.js and Puppeteer
I recently had a go with Headless Chrome and Puppeteer to download
bank account statements.
Browser scripting has never been this easy, this up to date, or this close to a modern development stack.
One thing was harder to pin down though: handling the download of a file and handing it over to Node.js.
This blog post documents how to achieve it.
§Some Context
The content whose download I wanted to automate is tricky to obtain:
- there is no direct or predictable download URL
- it is placed behind a login screen
- the download is bound to a multi-page process
- each page writes something in a server session
The download eventually starts when one has submitted the various forms in the right order.
§Puppeteer Page and Browser API
I found Puppeteer's implementation quite clever: the browser is driven directly from the Node.js app itself thanks to the DevTools Protocol.
I find this move interesting because it provides a better feedback loop to the software.
Our browser scripts are now closer to the headless browser.
Puppeteer has several concepts, but two of them are of interest to us when
automating browser actions:
- Browser API: it’s what happens at a browser level
- Page API: it’s what happens in a browser tab
We can navigate in a page, intercept browser requests before they even reach a page and click on elements.
The Promise-based flow makes it easy to script with async/await.
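To make the two APIs more concrete, here is a minimal sketch of a login step. The URL, the selectors, the environment variable names and the `login` and `logRequests` helpers are all assumptions made up for illustration; they do not come from the bank's website.

```javascript
// Hypothetical selectors: adapt them to the markup of the target website.
const SELECTORS = {
  username: '#username',
  password: '#password',
  submit: 'button[type="submit"]',
};

// Page API: record every request URL before letting the request go through.
async function logRequests(page, seen = []) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    seen.push(request.url());
    request.continue();
  });
  return seen;
}

// Page API: open the login page, fill in the form and submit it.
async function login(page, { url, username, password }) {
  await page.goto(url);
  await page.type(SELECTORS.username, username);
  await page.type(SELECTORS.password, password);
  // Start waiting for the navigation before clicking, so it cannot be missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click(SELECTORS.submit),
  ]);
}

// Browser API: start the browser and open a tab (requires `npm install puppeteer`).
async function main() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const seen = await logRequests(page);
  await login(page, {
    url: 'https://bank.example/login', // placeholder URL
    username: process.env.BANK_USER, // hypothetical variable names
    password: process.env.BANK_PASSWORD,
  });
  console.log(`${seen.length} requests intercepted`);
  await browser.close();
}
```

Calling `main()` runs the whole flow. The `login` and `logRequests` helpers only depend on the `page` object they receive, which keeps them easy to reuse from one step of the process to the next.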
§The Download Issue
One thing seemed quite different though: requesting the bank statement
triggered a file download.
I could not see the download starting by looking at the browser events.
Likewise with page events.
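Watching those events can be sketched as follows. The `traceEvents` helper is hypothetical; the event names are ones Puppeteer emits on its browser and page objects, and the lists are not exhaustive.

```javascript
// A few of the events emitted by Puppeteer's Browser and Page objects.
const BROWSER_EVENTS = ['targetcreated', 'targetchanged', 'targetdestroyed', 'disconnected'];
const PAGE_EVENTS = ['request', 'requestfinished', 'requestfailed', 'response', 'framenavigated', 'load'];

// Hypothetical helper: log the name of every listed event as it fires.
function traceEvents(emitter, label, eventNames, log = console.log) {
  for (const name of eventNames) {
    emitter.on(name, () => log(`[${label}] ${name}`));
  }
}

// Usage, given a `browser` and a `page` created with Puppeteer:
//   traceEvents(browser, 'browser', BROWSER_EVENTS);
//   traceEvents(page, 'page', PAGE_EVENTS);
```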
My script would finish without any error: the download had been triggered but no data was written to disk.
§Fetch Forest, Fetch!
I saw other people reporting the same issue.
The download would not be triggered in headless mode, no matter what was attempted.
This comment led me to think the answer was… to not submit the form per se,
but rather to evaluate code in the page context and use fetch()
to submit the form and pass the resulting response to Node.
So instead of banging my head against the two lines that submitted the form, I had to evaluate a fetch() call in the page context.
It is more verbose than the initial “submit” but it works:
- the FormData API helps capture the form values;
- fetch sends the values and the session cookies;
- the download data are part of the fetch() response.
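Put together, these three points can be sketched as below. This is a minimal illustration under stated assumptions, not the exact code from the repository: the `submitFormWithFetch` and `downloadStatement` names, the `#export-form` selector and the POST method are all hypothetical.

```javascript
// Runs inside the page, where `document`, `FormData` and `fetch` are available.
// It must not reference Node.js variables: Puppeteer serializes the function
// and evaluates it in the browser.
function submitFormWithFetch(formSelector) {
  const form = document.querySelector(formSelector);
  return fetch(form.action, {
    method: 'POST', // assumption: the form is submitted with POST
    body: new FormData(form), // captures the current form values
    credentials: 'include', // send the session cookies along
  }).then((response) => response.text());
}

// Runs in Node.js: evaluate the function above in the page and get the text back.
async function downloadStatement(page, formSelector) {
  return page.evaluate(submitFormWithFetch, formSelector);
}

// Usage, given a Puppeteer `page` sitting on the last step of the process:
//   const csv = await downloadStatement(page, '#export-form'); // hypothetical selector
```

Returning `response.text()` works because the value travels back to Node.js as a serialized string, which suits a text format such as CSV. Note that a `FormData` body is sent as `multipart/form-data`, which may differ from the encoding the form would use when submitted normally.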
The resulting data can be parsed within the same scripting application
with CSV parsing npm modules. Seamlessly.
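As a rough illustration of that last step, the sketch below uses a naive split in place of an npm module. It assumes a comma-separated text with a header row and no quoted fields, and the column names in the usage comment are made up; a dedicated CSV module is the better choice as soon as quoting or escaping is involved.

```javascript
// Naive CSV parsing: assumes a header row, comma separators and no quoted fields.
function parseSimpleCsv(text) {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) return [];
  const headers = lines[0].split(',').map((header) => header.trim());
  return lines.slice(1).map((line) => {
    const values = line.split(',');
    const row = {};
    headers.forEach((header, i) => {
      row[header] = (values[i] || '').trim();
    });
    return row;
  });
}

// Usage with made-up data:
//   parseSimpleCsv('date,amount\n2018-01-31,-42.00\n')
//   // → [{ date: '2018-01-31', amount: '-42.00' }]
```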
The full script to download the bank statements can be found in this repository.