Using Puppeteer to convert HTML documents to PDFs with lower memory and CPU footprint and better layout/styling control.
Leapbit is a company that works in the betting business. In the Balkan region, our partners are obliged to provide a printed sports betting offer. They display it on the board in the betting shops and give it to players.
Printing an offer is not an easy task in the Balkan betting industry, given the following circumstances:
We tried some of the existing HTML-to-PDF tools, but we were not happy with the results.
PHP libraries like TCPDF need to render the whole HTML in memory and then translate it to PDF. They use so much memory that even machines with 64GB of RAM are insufficient. On top of that, they are slow: sometimes the process takes up to 30 minutes.
That solution was obviously not adequate for our needs. For that reason, we created our own implementation using the libHARU C library to generate PDFs. It worked with a lower memory and CPU footprint. But libHARU does not have an HTML-to-PDF option, as it supports only simple commands such as write text, write a line at… So we developed a custom implementation around this library, extended its use, and eventually achieved reasonably good results.
We still had a problem: we lacked the freedom to manage the design of the offer.
After three years of using the previous solution, we decided to develop a new module that would create PDFs from HTML, but faster and with a low memory footprint. We went on to look at different implementations.
Eventually, we noticed that printing HTML from Chrome is actually very fast. To test it, we created a simple 100-page HTML offer. Printing it in Chrome on a regular machine used about 4GB of RAM and rendering took 10 seconds, which we considered pretty awesome.
We decided to go in that direction.
Like any other solution, Chrome does not solve every issue. It even adds new ones.
We listed the pros and cons.
To overcome the cons, we used Puppeteer, a Node library that controls Chrome in headless mode, so no desktop or X server is required.
This is how the Puppeteer GitHub page describes the library:
"Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium."
For simplex/duplex we used pdf-lib.js.
We removed the table of contents as this was not mandatory for our clients. We probably could have done something with pdf-lib.js or even Puppeteer.
After requiring Puppeteer, we create a new browser instance by calling puppeteer.launch. Then we create a page, tell it to go to the address we want to convert to PDF, and ask it to wait until the network is idle. Finally, we call page.pdf to create the PDF.
We use headless mode and no-sandbox. Please look at the official docs for a complete explanation of the arguments.
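The whole flow could be sketched roughly as follows. This is a minimal sketch, not the exact production code: the function names renderOfferToPdf and buildPdfOptions, the URL, and the output path are hypothetical, and 'networkidle0' is an assumed choice for the "network is idle" condition.

```javascript
// Options for page.pdf: standard A4 in portrait mode, written to outputPath.
function buildPdfOptions(outputPath) {
  return { path: outputPath, format: 'A4', landscape: false };
}

// Hypothetical helper: render the page at `url` into a PDF file.
async function renderOfferToPdf(url, outputPath) {
  // Required here rather than at the top of the file, so buildPdfOptions
  // can be used on a machine without Puppeteer installed.
  const puppeteer = require('puppeteer');

  // Headless Chrome with the sandbox disabled.
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox'],
  });
  try {
    const page = await browser.newPage();
    // Assumption: 'networkidle0' waits until there have been no open
    // network connections for at least 500 ms.
    await page.goto(url, { waitUntil: 'networkidle0' });
    await page.pdf(buildPdfOptions(outputPath));
  } finally {
    // Close the browser instance even if rendering fails.
    await browser.close();
  }
}

// Hypothetical usage:
// renderOfferToPdf('http://localhost:8080/offer.html', '/tmp/offer.pdf');
```

Closing the browser in a finally block is a design choice of this sketch: it keeps a failed render from leaving a Chrome process behind.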
In the example above, we used standard page size A4 in portrait mode and told it where to save the PDF file.
We save the PDF file to the server because we need to run through it once more with pdf-lib.js. If you do not need that step, you could send the PDF directly in the HTTP response instead.
Once we have created a PDF, we close the browser instance.
Then we need to mark the PDF as simplex or duplex:
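A minimal sketch of this step, assuming the pdf-lib package and its ViewerPreferences API. The function names setDuplexFlag and duplexModeName are hypothetical, and flipping on the long edge is an assumption; the flip mode actually used is not stated here.

```javascript
const fs = require('fs').promises;

// Map a simplex/duplex choice to the name of a pdf-lib Duplex value.
// Assumption: duplex pages are flipped on the long edge.
function duplexModeName(isDuplex) {
  return isDuplex ? 'DuplexFlipLongEdge' : 'Simplex';
}

// Hypothetical helper: rewrite the PDF at pdfPath with the duplex flag set.
async function setDuplexFlag(pdfPath, isDuplex) {
  // Required here rather than at the top of the file, so duplexModeName
  // can be used without pdf-lib installed.
  const { PDFDocument, Duplex } = require('pdf-lib');

  const pdfDoc = await PDFDocument.load(await fs.readFile(pdfPath));
  // The flag is stored in the document's viewer preferences.
  const prefs = pdfDoc.catalog.getOrCreateViewerPreferences();
  prefs.setDuplex(Duplex[duplexModeName(isDuplex)]);
  await fs.writeFile(pdfPath, await pdfDoc.save());
}

// Hypothetical usage:
// setDuplexFlag('/tmp/offer.pdf', true);
```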
Bear in mind that the duplex option is only a flag inside the PDF file. It takes effect only if the PDF viewer supports it and can set up the printer to print according to the flag.
This blog post showed only one part of the process - just the part that is responsible for actually generating the PDF. There is a lot in the background of the service that renders the HTML. There, our clients can define the look of the PDF including:
The article also did not cover caching. The service caches generated PDFs because more than 2000 clients (shops) connect simultaneously and request the same PDF. That load is a strain on our backend services, as a generated PDF can be over 10MB in size.
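The caching layer itself is not shown in this post. As an illustration only, one way to make many simultaneous requests for the same PDF share a single render is to cache the pending promise rather than the finished bytes. The class name PdfCache, the key format, and the render callback below are all hypothetical.

```javascript
// Hypothetical in-memory cache: at most one render per key, shared by
// every request that arrives while it is running or after it has finished.
class PdfCache {
  constructor(render) {
    this.render = render;      // async (key) => PDF bytes, supplied by the caller
    this.entries = new Map();  // key -> Promise of PDF bytes
  }

  get(key) {
    if (!this.entries.has(key)) {
      const pending = Promise.resolve()
        .then(() => this.render(key))
        .catch((err) => {
          // Do not cache failures; the next request will retry.
          this.entries.delete(key);
          throw err;
        });
      this.entries.set(key, pending);
    }
    return this.entries.get(key);
  }

  // Drop a cached PDF, for example when the offer changes.
  invalidate(key) {
    this.entries.delete(key);
  }
}

// Hypothetical usage:
// const cache = new PdfCache((offerId) => renderOfferBytes(offerId));
// const pdfBytes = await cache.get('offer-2024-05-01');
```

This sketch never evicts entries on its own. With PDFs over 10MB, a real service would also need a size or age limit.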