I'm working with a website that uses an embedded tool to display PDFs, likely pdf.js. The pages are drawn on a canvas, but I can't select any text. I've found I can download the canvas image using the toDataURL() function in the browser console. What I want is to intercept the text before it's drawn on the canvas and render it differently. From my search, the two options seem to be hooking into CanvasRenderingContext2D or altering the source code of the browser itself. Can anyone suggest the best approach to achieve this?
4 Answers
Have you considered just downloading the original PDF? If that's not an option, using OCR on the canvas image might be a simpler approach than hacking your browser.
First find out where the drawing to the canvas happens. If the PDF library exposes a point where you can intercept the drawing calls, your task becomes much easier. If it draws straight to the canvas with no such hook, you might need to route the drawing through an OffscreenCanvas and process it there before it reaches the main canvas.
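One way to find out whether text reaches the canvas as text at all is to wrap the text-drawing methods and log what gets passed in. The snippet below is only a sketch: hookTextDrawing is a made-up helper name, and it assumes the viewer calls fillText or strokeText on a regular 2D context in the main page. If the library draws glyphs as paths, renders in a worker or iframe, or remaps fonts to private-use code points, the log may be empty or unreadable.

```javascript
// Wrap fillText/strokeText on a target object and record every call.
// `target` would be CanvasRenderingContext2D.prototype in a browser;
// `log` is a plain array that receives one record per call.
// hookTextDrawing is a hypothetical helper, not part of any library.
function hookTextDrawing(target, log) {
  const originals = {};
  for (const name of ["fillText", "strokeText"]) {
    const original = target[name];
    if (typeof original !== "function") continue;
    originals[name] = original;
    target[name] = function (text, x, y, ...rest) {
      // x and y are in the context's current transform space,
      // not necessarily page coordinates.
      log.push({ method: name, text: String(text), x, y, font: this.font });
      return original.call(this, text, x, y, ...rest);
    };
  }
  // Return a function that puts the original methods back.
  return function unhook() {
    for (const name of Object.keys(originals)) {
      target[name] = originals[name];
    }
  };
}

// Browser-only usage: install the hook before the viewer starts drawing,
// then inspect `capturedText` in the console after a page has rendered.
if (typeof CanvasRenderingContext2D !== "undefined") {
  globalThis.capturedText = [];
  globalThis.unhookText = hookTextDrawing(
    CanvasRenderingContext2D.prototype,
    globalThis.capturedText
  );
}
```

If capturedText stays empty after a page renders, the text is probably arriving as images or paths, and intercepting these calls won't help.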
The website uses a different method for displaying PDFs, and these files aren't downloaded at all. Everything shown is just images on the canvas without a separate text layer.
What makes you think the text is drawn as an image on the client side? Have you confirmed that the PDF itself is using actual text rather than just displaying images of text?
Why not just locate the code responsible for the drawing and modify it to render the text instead of treating it like an image? Seems more straightforward than manipulating everything else.
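If the drawing code does turn out to handle real strings with positions, rendering them as text is mostly a matter of regrouping them. A rough sketch, assuming you already have an array of records shaped like {text, x, y} in one shared coordinate space with y increasing downward. That shape, the fragmentsToLines name, and the yTolerance default are all made up for illustration, not something the library hands you.

```javascript
// Group positioned text fragments into lines of plain text.
// Assumes hypothetical {text, x, y} records that share one coordinate
// space; fragments whose y differs by at most yTolerance from the first
// fragment of a line are treated as belonging to that line.
function fragmentsToLines(fragments, yTolerance = 2) {
  // Sort top to bottom so each line's fragments end up adjacent.
  const sorted = [...fragments].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines = [];
  for (const frag of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(frag.y - last.y) <= yTolerance) {
      last.items.push(frag);
    } else {
      lines.push({ y: frag.y, items: [frag] });
    }
  }
  // Within a line, order fragments left to right and join their text.
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map((f) => f.text)
      .join("")
  );
}
```

The resulting strings could then go into ordinary DOM elements instead of the canvas. Fragments are joined without separators, so if the viewer never draws space characters you would need to insert spaces based on the gaps in x.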

Unfortunately, the site doesn't let me download PDFs. I've thought about OCR, but it feels too tedious since the text overlaps with other elements. Manipulating the canvas directly might be my best bet.