I'm looking for effective solutions to scan and redact personally identifiable information (PII) from a large Windows file share. The challenge is that many tools only handle text-based files, but we have an extensive collection of scanned PDFs, images, and mixed-format documents containing sensitive information like IDs and banking details. We often deal with Australian driver's licenses and passports as well, so it's crucial that the detection accuracy is spot on.
I recently tried out PII-tools, which seemed promising, but the air-gapped on-premise version we would need costs around $18,000 annually. I get that there's value in security, but that's still a significant financial commitment.
Has anyone here successfully used other tools that can reliably detect and redact PII from non-text PDFs? I'm especially interested in options with strong OCR capabilities for handling scanned documents. I've come across Redactable mentioned in legal and privacy discussions for permanent redaction, but I'd love to get recommendations from real users before making a decision.
5 Answers
I've noticed that many are interested in utilizing AI for this purpose. As a smaller operation, I was thinking of developing my own solution on an AI platform since there are open-source options that cover the basics.
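To give a rough idea of what "the basics" could look like, here is a minimal sketch of the detect-and-redact step once text has already been extracted. It assumes an OCR pass (for example with an open-source engine such as Tesseract) has turned the scanned page into a string; that step is not shown. The function names, the placeholder labels, and the choice of patterns (Australian Tax File Numbers and email addresses) are illustrative assumptions, not taken from any of the tools mentioned in this thread. Real coverage for licences, passports, and bank details would need many more rules and testing against your own documents.

```python
import re

# Weights for the 9-digit Australian Tax File Number (TFN) checksum:
# the weighted digit sum must be divisible by 11.
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)

# Illustrative patterns only; OCR output is noisy and will need looser rules.
TFN_RE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def is_valid_tfn(candidate: str) -> bool:
    """Return True if the digits in candidate pass the TFN checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) != 9:
        return False
    return sum(d * w for d, w in zip(digits, TFN_WEIGHTS)) % 11 == 0


def redact(text: str) -> str:
    """Replace checksum-valid TFNs and email addresses with placeholders."""

    def mask_tfn(match: re.Match) -> str:
        # Only mask 9-digit runs that pass the checksum, to cut false positives.
        found = match.group()
        return "[REDACTED-TFN]" if is_valid_tfn(found) else found

    text = TFN_RE.sub(mask_tfn, text)
    return EMAIL_RE.sub("[REDACTED-EMAIL]", text)


if __name__ == "__main__":
    print(redact("TFN 123 456 782, contact jo@example.com"))
```

Note that this only masks extracted text. Permanently redacting a scanned PDF also means altering the page image itself, which is a large part of what the commercial tools are charging for.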
Make sure to consider the return on investment when weighing your options. It's important to analyze what it costs to secure data versus the potential cost of a PII breach. Sometimes, management will see that a higher-priced solution could save a lot in the long run.
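One rough way to frame that comparison for management is an expected-loss calculation. The sketch below is only an illustration: the $18,000 figure comes from the question, while the breach probabilities and breach cost are made-up placeholders that you would replace with your own estimates. The function names are hypothetical as well.

```python
def expected_annual_loss(breach_probability: float, breach_cost: float) -> float:
    """Expected yearly loss: chance of a breach times what a breach would cost."""
    return breach_probability * breach_cost


def net_annual_benefit(
    tool_cost: float,
    probability_without_tool: float,
    probability_with_tool: float,
    breach_cost: float,
) -> float:
    """Reduction in expected loss minus the yearly cost of the tool."""
    loss_before = expected_annual_loss(probability_without_tool, breach_cost)
    loss_after = expected_annual_loss(probability_with_tool, breach_cost)
    return (loss_before - loss_after) - tool_cost


if __name__ == "__main__":
    # Placeholder inputs: 5% yearly breach chance without the tool, 1% with it,
    # and a hypothetical $500,000 cost per breach. Only the $18,000 tool cost
    # is taken from the thread.
    benefit = net_annual_benefit(18_000, 0.05, 0.01, 500_000)
    print(f"Net annual benefit: ${benefit:,.0f}")
```

A positive result suggests the tool pays for itself under those assumptions, and a negative one suggests it doesn't. The answer is only as good as the probability and cost estimates you feed in.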
Lightbeam is another tool worth considering for redacting PII. It might fit your needs.
Netwrix Data Classification is a solid option for redacting PII. It handles various document formats effectively.
$18k a year for a comprehensive data loss prevention solution with deep inspection capabilities is actually on the low end. You often have to choose between restructuring all your workflows so the data is easier (and cheaper) to manage, or sticking with your current setup and investing in a reliable but costly solution.

We use Netwrix as well, but be aware it might be pricier than the $18k option you're considering.