I'm looking for reliable tools that can accurately scan and redact personally identifiable information (PII) from a large collection of documents stored on a Windows file share. Many of the tools I've tried so far primarily work with text-based files, but our library consists of a lot of scanned PDFs, images, and a mix of document formats that contain sensitive client information, such as IDs and banking details.
We frequently handle documents like Australian driver's licenses and passports, so having effective detection capabilities is critical. I tested the PII-tools software and it seemed promising, but the on-prem version that fits our needs is priced at around $18,000 annually, which is a significant cost, even considering its security benefits.
I'd love to know if anyone has used other solutions that can effectively identify and redact PII in non-text PDFs. Ideally, the tool should have strong OCR capabilities to handle scanned documents. I've seen Redactable mentioned in privacy and legal discussions, but I'd prefer to gather recommendations based on hands-on experience before making a decision.
6 Answers
There’s a growing interest in using AI for this purpose. We’re a small business, but I'm considering setting up an AI-based system since there are open-source solutions capable of the basics.
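For a sense of what "the basics" might look like, here is a minimal sketch of pattern-based redaction over text that has already been extracted by an OCR step (the OCR itself is not shown). The `redact` function and both patterns are my own illustrative assumptions: a BSB written as `XXX-XXX`, and a passport-like ID of one or two capital letters followed by seven digits. Patterns this simple will produce false positives (a phone number fragment can look like a BSB) and will miss anything the OCR misreads, so treat it as a starting point rather than a working tool.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# Real detection would need validation, context checks, and more formats.
PATTERNS = {
    "bsb": re.compile(r"\b\d{3}-\d{3}\b"),              # bank BSB written as XXX-XXX
    "passport_like": re.compile(r"\b[A-Z]{1,2}\d{7}\b"),  # 1-2 letters + 7 digits
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace every pattern match with a placeholder and count hits per label."""
    counts: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        text, hits = pattern.subn(f"[{label.upper()} REDACTED]", text)
        counts[label] = hits
    return text, counts


if __name__ == "__main__":
    sample = "Passport PA1234567, BSB 062-000, account holder J. Smith"
    redacted, counts = redact(sample)
    print(redacted)
    print(counts)
```

Note that this only redacts the extracted text. Redacting the scanned image itself means mapping each match back to its position on the page and blacking out that region, which is where most of the real work is.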
You might want to check out Lightbeam. I've heard good things about its ability to handle PII.
Netwrix Data Classification is a tool that can handle PII redaction well. I've seen it used effectively for similar needs.
It's crucial to evaluate the ROI on any tool you consider. Look at the potential costs of a PII breach versus the tool's price. If the best solution is $18,000 a year and management thinks it’s too much, then you’ve done your job by highlighting the risks involved.
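If it helps to put that comparison in front of management, here is a rough sketch of the arithmetic. Only the $18,000 figure comes from this thread. The breach probability, breach cost, and risk-reduction values are made-up placeholders that you would need to replace with your own estimates, and the function names are just for illustration.

```python
def expected_annual_loss(breach_probability: float, breach_cost: float) -> float:
    """Expected yearly loss: chance of a breach in a year times its cost."""
    return breach_probability * breach_cost


def net_benefit(tool_cost: float, breach_probability: float,
                breach_cost: float, risk_reduction: float) -> float:
    """Loss avoided by the tool minus what the tool costs per year."""
    avoided = expected_annual_loss(breach_probability, breach_cost) * risk_reduction
    return avoided - tool_cost


if __name__ == "__main__":
    # All inputs except tool_cost are hypothetical placeholders.
    result = net_benefit(
        tool_cost=18_000,        # annual price quoted in the question
        breach_probability=0.05,  # assumed 5% chance of a breach per year
        breach_cost=1_000_000,    # assumed total cost of one breach
        risk_reduction=0.5,       # assumed share of that risk the tool removes
    )
    print(f"Estimated net annual benefit: ${result:,.0f}")
```

With those placeholder inputs the avoided loss works out to about $25,000 a year against an $18,000 licence, but the result is only as good as the estimates you feed in.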
Many professionals are in the same boat searching for effective solutions. It’s a balancing act between capability and cost.
$18k a year for a solid data loss prevention (DLP) solution that includes advanced OCR features is actually on the cheaper side. These systems can get really expensive quickly, so be ready for that. If it's a requirement for your work, you'll have to either adjust all your workflows to fit a cheaper tool or pay up for a robust solution that meets your current format needs.

That’s the tool we use, but I think it might be pricier than what you're hoping for.