I'm working on a backend service where users upload PDFs, and I need to extract each page as an individual PNG image saved to Google Cloud Storage. For example, a 7-page PDF should be split into 7 separate PNGs. The issue is that this extraction process is really resource-heavy. I'm using pypdfium2, but even for a simple 7-page PDF it requires around 1GB of RAM, and larger files often cause the job to fail with out-of-memory errors, which makes auto-scaling kick in. I initially tried an instance with 8GB of RAM and 4 vCPUs, but jobs kept failing until I switched to a 16GB instance. How do others manage PDF page extraction in a production environment without running into out-of-memory errors? Here's a snippet of my code:
```python
import pypdfium2 as pdfium
from PIL import Image
from io import BytesIO


def extract_pdf_page_to_png(pdf_bytes: bytes, page_number: int, dpi: int = 150) -> bytes:
    """Extract a single PDF page to PNG bytes."""
    scale = dpi / 72.0  # PDFium uses 72 DPI as its base resolution
    pdf = pdfium.PdfDocument(pdf_bytes)
    page = pdf[page_number - 1]  # pages are 0-indexed
    bitmap = page.render(scale=scale)
    pil_image = bitmap.to_pil()
    buffer = BytesIO()
    pil_image.save(buffer, format="PNG", optimize=False)
    page.close()
    pdf.close()
    return buffer.getvalue()
```
6 Answers
Have you thought about using a Java library? I remember using one to split large PDFs before, and it worked well, so it might be worth exploring options outside the Python ecosystem.
In my experience processing a batch of 50 PDFs, each with over 20 pages, I only used about 1GB of memory because I switched to the MuPDF library. If you can, give it a shot; it's been great for me!
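For reference, here's a minimal sketch of the same single-page extraction via PyMuPDF (MuPDF's Python binding, imported as `fitz`); the function name and the 150 DPI default just mirror the question's code:

```python
import fitz  # PyMuPDF, the Python binding for MuPDF


def extract_pdf_page_to_png(pdf_bytes: bytes, page_number: int, dpi: int = 150) -> bytes:
    """Render a single PDF page to PNG bytes with MuPDF."""
    zoom = dpi / 72.0  # MuPDF also treats 72 DPI as the base scale
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    try:
        page = doc.load_page(page_number - 1)  # pages are 0-indexed
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return pix.tobytes("png")  # MuPDF encodes the PNG itself
    finally:
        doc.close()
```

A side benefit is that MuPDF does the PNG encoding directly, so there's no intermediate PIL image to hold in memory.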
You might want to run `gc.collect()` between iterations to clear out any large buffers that might be lingering in memory. It’s not a perfect solution, but it can help with memory management without heavy code changes!
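For example, if the extraction runs in a per-page loop, something like this (the `upload_png` helper here is hypothetical, standing in for your GCS upload):

```python
import gc

for page_number in range(1, page_count + 1):
    png_bytes = extract_pdf_page_to_png(pdf_bytes, page_number)
    upload_png(png_bytes, page_number)  # hypothetical GCS upload helper
    del png_bytes
    gc.collect()  # prompt collection of large render buffers before the next page
```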
It sounds like you might be leaking some resources. While you're closing the `page` and `pdf` correctly, make sure to check if you need to close the `bitmap` and `pil_image` too. Plus, wrapping your cleanup code in a `finally` block would help ensure everything gets released properly, even if exceptions occur. Also, instead of re-parsing the PDF for each page, consider reusing the same `PdfDocument` object for better efficiency.
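Putting those three suggestions together, a sketch of what that could look like (assuming your pypdfium2 version exposes `close()` on the bitmap object, as recent versions do):

```python
import pypdfium2 as pdfium
from io import BytesIO


def extract_all_pages_to_png(pdf_bytes: bytes, dpi: int = 150):
    """Yield PNG bytes for every page, parsing the PDF only once."""
    scale = dpi / 72.0
    pdf = pdfium.PdfDocument(pdf_bytes)  # one parse for the whole document
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            bitmap = None
            pil_image = None
            try:
                bitmap = page.render(scale=scale)
                pil_image = bitmap.to_pil()
                buffer = BytesIO()
                pil_image.save(buffer, format="PNG")
                yield buffer.getvalue()
            finally:
                # release native and PIL buffers even if rendering raises
                if pil_image is not None:
                    pil_image.close()
                if bitmap is not None:
                    bitmap.close()
                page.close()
    finally:
        pdf.close()
```

Yielding pages one at a time also means you can upload each PNG to GCS and drop it before rendering the next, so peak memory stays at roughly one page's bitmap instead of the whole document's.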
Reusing the `PdfDocument` sounds much smarter; I'll definitely try that!
I wonder if Gemini could handle this type of task efficiently?
You could also try writing the PDF to a temporary file and loading it from disk instead of passing the raw bytes to `PdfDocument`. With a bytes input, the entire file sits in memory alongside the rendered pages; with a file path, PDFium should be able to read from disk as needed, which can reduce peak memory usage significantly.
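A minimal sketch of that approach; the `delete=False` handling and cleanup responsibility are choices you'd adapt to your job's lifecycle:

```python
import os
import tempfile

import pypdfium2 as pdfium


def render_from_temp_file(pdf_bytes: bytes, dpi: int = 150):
    """Spool the uploaded bytes to disk, then let PDFium read from the path."""
    scale = dpi / 72.0
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        path = tmp.name
    try:
        pdf = pdfium.PdfDocument(path)  # path input instead of an in-memory buffer
        try:
            for index in range(len(pdf)):
                page = pdf[index]
                bitmap = page.render(scale=scale)
                # ... convert to PNG and upload as before ...
                bitmap.close()
                page.close()
        finally:
            pdf.close()
    finally:
        os.remove(path)  # always clean up the spooled file
```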

Good point about explicitly closing those objects! I'll start using a `finally` block for cleanup.