Generating and printing large PDF documents with Chrome

Author: Dario Filković, CTO

Using Puppeteer to convert HTML documents to PDFs with a lower memory and CPU footprint and better control over layout and styling.

Leapbit is a company that works in the betting business. In the Balkan region, our partners are obliged to provide a printed sports betting offer. They display it on the board in the betting shops and give it to players.

Why is this a problem? Isn't printing a PDF a simple task?

Printing an offer is not an easy task in the betting industry in the Balkans, given the following circumstances:

  • An offer must not be older than 5 minutes (odds for sports matches change every minute, so the offer needs to show data that is as current as possible).
  • All betting shops, 2000 in our case, request an offer within a single minute when they open. After that, changes need to be printed every 5 minutes.
  • Operators want to squeeze as much information as possible onto each sheet of paper. Printing offers in 2000 shops every day is a significant financial burden, given that an offer can be over 300 A4 pages long.
  • Clients want to be able to define how and what is printed.

Current solutions

We tried some of the existing HTML to PDF tools, but we were not happy with the results.

PHP libraries like TCPDF need to render the whole HTML in memory and then translate it to PDF. For our offers, TCPDF used so much memory that even machines with 64GB of RAM were insufficient. On top of that, the library is slow: sometimes the process took up to 30 minutes.

That solution was obviously not adequate for our needs. For that reason, we created our own implementation using the libHARU C library to generate PDFs. It worked with a lower memory and CPU footprint. But libHARU has no HTML to PDF option, as it only supports simple commands such as "write text" or "draw a line at a given position". So we developed a custom implementation on top of this library, extended its use, and eventually achieved reasonably good results.

We still had a problem: we lacked the freedom to manage the design of the offer.

Chrome to the rescue

After three years of using the previous solution, we decided to develop a new module that would create PDFs from HTML at a higher speed and with a low memory footprint. We started looking at different implementations.

Eventually, we noticed that printing HTML from Chrome is actually very fast. To test it, we created a simple 100-page HTML offer. Printing it in Chrome on a regular machine used about 4GB of RAM and rendering took 10 seconds, which we considered pretty awesome.

We decided to go in that direction.

Pros and cons

Like any other solution, Chrome does not solve every issue. It even adds new ones.

We listed the pros and cons.

Pros:

  • Speed
  • Memory footprint
  • HTML to PDF possible

Cons:

  • Needs desktop environment or X server
  • No simplex/duplex support (printing on both sides of the paper)
  • No table of contents

To overcome the first con, we used Puppeteer, a Node library that controls Chrome in headless mode, so no desktop environment or X server is required.

This is how the Puppeteer GitHub page describes the library:

"Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium."

For simplex/duplex, we used pdf-lib.

We dropped the table of contents, as it was not mandatory for our clients. We probably could have built something with pdf-lib or even Puppeteer.

Our solution

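const puppeteer = require('puppeteer');

// await is only valid inside an async function in a CommonJS script.
(async () => {
  // Launch headless Chrome with the sandbox disabled.
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  const page = await browser.newPage();
  // 0 disables the navigation timeout, so a slow page never aborts the load.
  page.setDefaultNavigationTimeout(0);
  // Navigation counts as finished once there have been no network
  // connections for at least 500 ms.
  await page.goto('https://www.google.com', {
    waitUntil: 'networkidle0',
  });
  await page.pdf({
    format: 'A4',
    landscape: false,
    path: ... // path where to save file
  });

  await browser.close();
})();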

After requiring Puppeteer, we create a new browser instance by calling puppeteer.launch. Then we open a page, navigate to the address we want to convert to PDF, and wait until the network is idle. Finally, we call page.pdf to create the PDF.

We launch Chrome in headless mode with the sandbox disabled (--no-sandbox and --disable-setuid-sandbox). Please look at the official docs for a complete explanation of the arguments.

In the example above, we used the standard A4 page size in portrait mode and told Puppeteer where to save the PDF file.

We save the PDF file on the server because we need to process it once more with pdf-lib. If you do not need that extra step, you could send the PDF directly in an HTTP response instead.
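As a rough illustration of that alternative, the sketch below sends PDF bytes straight to the client from a plain Node http handler instead of writing them to disk. It is not the code we run in production. The render argument is a hypothetical stand-in for the Puppeteer step (page.pdf resolves with the PDF bytes when no path is given), and the function names, the file name in the header, and the error message are assumptions made for the example.

```javascript
const http = require('http');

// Writes already-rendered PDF bytes to an HTTP response.
// `res` is a Node http.ServerResponse (or anything with writeHead/end).
function sendPdf(res, pdfBytes) {
  const body = Buffer.from(pdfBytes); // accepts a Buffer or a Uint8Array
  res.writeHead(200, {
    'Content-Type': 'application/pdf',
    'Content-Length': body.length,
    // Hypothetical file name, shown inline in the browser's PDF viewer.
    'Content-Disposition': 'inline; filename="offer.pdf"',
  });
  res.end(body);
}

// Creates (but does not start) a server that renders on every request.
// `render` is a hypothetical async function resolving with the PDF bytes,
// for example a wrapper around the page.pdf() call shown earlier.
function createPdfServer(render) {
  return http.createServer(async (req, res) => {
    try {
      sendPdf(res, await render());
    } catch (err) {
      res.writeHead(500, { 'Content-Type': 'text/plain' });
      res.end('PDF generation failed');
    }
  });
}
```

A caller would start it with something like createPdfServer(renderOffer).listen(3000), where renderOffer is whatever function wraps the Puppeteer code.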

Once we have created a PDF, we close the browser instance.

Then we need to mark the PDF as duplex:

const fs = require('fs');
const { PDFDocument, Duplex } = require('pdf-lib');

(async () => {
  // Load the PDF that Puppeteer saved to disk.
  const pdfDoc = await PDFDocument.load(fs.readFileSync('./file.pdf'));
  // Set the Duplex viewer preference: two-sided, flipped on the long edge.
  const viewerPrefs = pdfDoc.catalog.getOrCreateViewerPreferences();
  viewerPrefs.setDuplex(Duplex.DuplexFlipLongEdge);

  // Write the modified document back to the same file.
  const pdfBytes = await pdfDoc.save();
  fs.writeFileSync('./file.pdf', pdfBytes);
})();

Bear in mind that the duplex option is only a flag inside the PDF file. It takes effect only if the PDF viewer supports it and can set up the printer to print according to the flag.

What’s next

This blog post showed only one part of the process: the part that actually generates the PDF. A lot happens in the background of the service that renders the HTML. There, our clients can define the look of the PDF, including:

  • Fonts
  • Colors
  • Number of columns
  • Layout of elements
  • Border, margins, padding
  • Etc.
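To make the list above more concrete, here is a minimal sketch of how such client settings could be turned into a print stylesheet for the HTML that Chrome renders. It is not our actual implementation: the option names, the default values, and the .offer and .match class names are all assumptions invented for the example.

```javascript
// Hypothetical layout options a client could override; names and defaults
// are assumptions for this sketch, not the real service configuration.
const defaultLayout = {
  fontFamily: 'Arial, sans-serif',
  fontSizePt: 7,
  textColor: '#000000',
  columns: 3,
  pageMarginMm: 8,
  cellPaddingMm: 0.5,
  borderColor: '#999999',
};

// Builds a CSS string from the client options merged over the defaults.
function buildPrintCss(options = {}) {
  const o = { ...defaultLayout, ...options };
  return [
    `@page { size: A4; margin: ${o.pageMarginMm}mm; }`,
    `body { font-family: ${o.fontFamily}; font-size: ${o.fontSizePt}pt; color: ${o.textColor}; }`,
    // Multi-column flow packs more matches onto each sheet.
    `.offer { column-count: ${o.columns}; column-gap: 4mm; }`,
    `.offer td { padding: ${o.cellPaddingMm}mm; border: 0.2mm solid ${o.borderColor}; }`,
    // Keep a single match from being split across columns or pages.
    `.match { break-inside: avoid; }`,
  ].join('\n');
}

// Example: a client that wants four columns and tighter page margins.
const clientCss = buildPrintCss({ columns: 4, pageMarginMm: 5 });
```

The resulting string could be placed in a style element of the generated HTML, or injected with Puppeteer's page.addStyleTag({ content: clientCss }) before calling page.pdf.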

This article also did not cover caching. The service caches generated PDFs because more than 2000 clients (shops) connect simultaneously and request the same PDF. Serving them is a strain on our backend services, as a generated PDF can be over 10MB in size.
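The caching idea can be sketched as follows. This is a simplified in-memory illustration, not our production cache: the createPdfCache name, the render callback, and the cache key are hypothetical, and the 5-minute default lifetime is an assumption derived from the freshness requirement mentioned at the start. The point is that simultaneous requests for the same offer share one render instead of triggering one each.

```javascript
// Returns a getPdf(key) function that caches render results per key.
// `render` is a hypothetical async function producing the PDF bytes.
// `now` is injectable so the expiry logic can be exercised without waiting.
function createPdfCache(render, ttlMs = 5 * 60 * 1000, now = Date.now) {
  const entries = new Map(); // key -> { promise, expiresAt }

  return function getPdf(key) {
    const cached = entries.get(key);
    if (cached && cached.expiresAt > now()) {
      // Fresh, or still being rendered: every caller shares this promise.
      return cached.promise;
    }

    // The lifetime is counted from the moment the render is requested.
    const entry = { promise: null, expiresAt: now() + ttlMs };
    // The async wrapper turns a synchronous throw into a rejected promise.
    entry.promise = (async () => render(key))();
    // Do not keep failed renders, so the next request can retry.
    entry.promise.catch(() => {
      if (entries.get(key) === entry) entries.delete(key);
    });
    entries.set(key, entry);
    return entry.promise;
  };
}
```

With this shape, 2000 shops asking for the same key within the lifetime cause a single call to render, and the first request after expiry triggers a new one.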