Hey everyone! I'm working on a background worker that processes invoice emails. If an email doesn't have a PDF attached, we grab the HTML content, sanitize it with DOMPurify, and convert it to PDF with Puppeteer so our users can view it. We already apply several security measures: JavaScript is disabled in Puppeteer, network requests are intercepted so that only data URLs are allowed, and the HTML is sanitized to remove harmful tags and attributes. I'm considering adding more restrictions, like limiting inline image sizes and explicitly blocking file: URIs. We're also thinking about switching to an API service like DocRaptor or API2PDF to lower operational risk and improve security. How does everyone else handle converting untrusted HTML to PDF? Do you prefer an API or a self-hosted solution? How do you deal with SSRF, inline-image DoS, and other security threats? For those using an API, which ones have you found reliable in terms of security, cost, and overall performance? I'd appreciate any real-world experiences or insights. Thanks!
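For reference, here is a minimal sketch of the kind of request filter described above. It is not our production code: the `InterceptedRequest` interface is a local stand-in for the methods Puppeteer's `HTTPRequest` provides (`url()`, `abort()`, `continue()`), and the allowlist and `MAX_DATA_URL_LENGTH` are hypothetical values chosen for illustration. It also assumes the HTML is loaded with `page.setContent` rather than by navigating to a URL.

```typescript
// Local stand-in for the parts of Puppeteer's HTTPRequest used here.
interface InterceptedRequest {
  url(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Hypothetical cap on the length of a single data URL (about 2 MB of text).
const MAX_DATA_URL_LENGTH = 2 * 1024 * 1024;

// Hypothetical allowlist: raster images only. SVG is left out on purpose,
// since it can carry scripts and external references.
const ALLOWED_PREFIXES = ["data:image/png", "data:image/jpeg", "data:image/gif"];

// True only for data URLs of an allowed image type within the size cap.
// Everything else (http:, https:, file:, other data: types) is rejected.
function isAllowedResourceUrl(url: string): boolean {
  if (url.length > MAX_DATA_URL_LENGTH) return false;
  const head = url.slice(0, 32).toLowerCase();
  return ALLOWED_PREFIXES.some(
    (p) => head.startsWith(p + ";") || head.startsWith(p + ",")
  );
}

// Continue allowed requests and abort the rest.
async function handleRequest(req: InterceptedRequest): Promise<void> {
  if (isAllowedResourceUrl(req.url())) {
    await req.continue();
  } else {
    await req.abort();
  }
}

// Wiring with Puppeteer (not executed here):
//   await page.setJavaScriptEnabled(false);
//   await page.setRequestInterception(true);
//   page.on("request", (req) => void handleRequest(req));
```

Because the filter is an allowlist, file: URIs are already rejected without a separate rule.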
5 Answers
Instead of converting the HTML to PDF directly, consider taking a screenshot of it, converting that to PDF, and then running OCR on it. This way, you can avoid some of the risks associated with loading untrusted HTML in a browser.
Doesn't that come with similar challenges? You still need a method to ensure it's rendered safely.
I recently moved from API2PDF to a self-hosted setup using Gotenberg. It's faster, cheaper, and more reliable. I just send it a URL of the page I want to convert along with an authorization token. Plus, it can handle linked Word and Excel documents, merging them into the final PDF seamlessly.
Running the renderer in an isolated Docker container might sound appealing, but Docker isn't designed as a security sandbox: containers share the host kernel, so a container escape puts the host at risk. Consider stronger isolation techniques if you want to truly harden your setup.
I agree! Docker can be risky for this sort of application if not managed properly.
When it comes to inline images, limit their pixel dimensions, not just their file sizes. I once had a small PNG that expanded to nearly 1 GB when decoded! Also, make sure your worker only has local network access, so that if something does escape the sandbox, it can't reach the internet.
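A minimal sketch of a dimension check, assuming the image bytes have already been decoded from the data URL into a `Uint8Array`. It reads the width and height from the PNG IHDR chunk without decoding any pixel data. `MAX_PIXELS` and the function names are hypothetical, and other formats (JPEG, GIF) would need their own header parsing.

```typescript
// The fixed 8-byte signature that every PNG file starts with.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Hypothetical pixel budget: 16 million pixels is roughly 64 MB at 4 bytes per pixel.
const MAX_PIXELS = 16000000;

interface PngSize {
  width: number;
  height: number;
}

// Reads width and height from the IHDR chunk, which must directly follow
// the signature. Returns null if the bytes do not look like a PNG header.
function readPngSize(bytes: Uint8Array): PngSize | null {
  if (bytes.length < 24) return null;
  for (let i = 0; i < 8; i++) {
    if (bytes[i] !== PNG_SIGNATURE[i]) return null;
  }
  // Bytes 8..11 hold the chunk length, bytes 12..15 the chunk type.
  const type = String.fromCharCode(bytes[12], bytes[13], bytes[14], bytes[15]);
  if (type !== "IHDR") return null;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // PNG stores integers big-endian, which is DataView's default.
  return { width: view.getUint32(16), height: view.getUint32(20) };
}

// Rejects anything that is not a PNG or that exceeds the pixel budget.
function isPngWithinPixelBudget(bytes: Uint8Array): boolean {
  const size = readPngSize(bytes);
  if (size === null) return false;
  if (size.width === 0 || size.height === 0) return false;
  return size.width * size.height <= MAX_PIXELS;
}
```

The check costs a few byte reads per image, so it can run before the HTML ever reaches the renderer. A 16384 x 16384 RGBA image, for example, decodes to 1 GiB no matter how small the compressed file is.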
If I were really concerned about security, I'd avoid directly converting the raw HTML of an email to PDF. Instead, I'd use sanitized text to produce clean HTML. Additionally, using a rendering engine that operates in a controlled JavaScript environment is crucial. I've had good luck with wkhtmltopdf for over a decade, though I'm not sure if there's a Node wrapper for it.
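A minimal sketch of the "sanitized text to clean HTML" idea, assuming the plain text has already been extracted from the email (for example from its text/plain part). The function names and the paragraph-splitting rule are hypothetical choices for illustration, not a standard.

```typescript
// Escapes the five characters that are significant in HTML text and
// attribute contexts, so the input can never introduce markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Builds a small, fixed HTML document from plain text: blank lines
// separate paragraphs and single newlines become <br>.
function textToHtmlDocument(text: string): string {
  const paragraphs = text
    .split(/\r?\n\s*\r?\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((p) => `<p>${escapeHtml(p).replace(/\r?\n/g, "<br>")}</p>`);
  return (
    '<!doctype html><html><head><meta charset="utf-8"></head><body>' +
    paragraphs.join("") +
    "</body></html>"
  );
}
```

The only tags in the output are the ones the template itself emits, so the renderer never sees markup written by the sender. The trade-off is that the email's original layout and images are lost.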

That sounds interesting! But how do you capture the HTML without potentially executing untrusted scripts? I'd be wary of that.