Need Help Searching PDFs with PSWritePDF for Specific Text

Asked By CuriousCat42 On

Hey everyone! I'm trying to search through PDFs for specific text occurrences. I downloaded a PDF from VMware and I'm looking to find sentences that mention "esxi". I can convert the PDF into an array of objects, but when I use Select-String to pipe in the object, it just prints out the entire content of the PDF instead of the specific matches I'm looking for. I'm also trying to loop through the pages, but that's returning the same result. Here's what I've got:

```powershell
Import-Module PSWritePDF

$myPDF = Convert-PDFToText -FilePath $file

# $matches = $myPDF | Select-String "esxi" -Context 1

$matches = [System.Collections.Generic.List[string]]::new()
$pages = $myPDF.length
for ($i=0; $i -le $pages; $i++) {
    $pageMatches = $myPDF[$i] | Select-String "esxi" -Context 1
    foreach ($pageMatch in $pageMatches) {
        $matches.Add($pageMatch)
    }
}
```

Has anyone tackled something similar? Any tips would be really appreciated!

3 Answers

Answered By PowerScriptWiz On

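You've nearly nailed it with your loop structure! One small thing: the condition should be `-lt` rather than `-le`, otherwise the last iteration indexes one past the end of the array. The main issue is that `Select-String` treats each page in `$myPDF[$i]` as a single string. Consider splitting it into lines first: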

```powershell
$lines = $myPDF[$i] -split "`r?`n"
$pageMatches = $lines | Select-String "esxi" -Context 1
```

This lets you find just the matching lines including context. Give it a shot and see if that helps!

Answered By PDFMasterPro On

I think the issue here is with how the PDF is being converted. The underlying library seems to return one string per page rather than one per line. You might want to split each page into lines manually, or use a tool like Ghostscript to convert the PDF into a text file that you can manipulate in PowerShell.

Answered By DevDude99 On

It sounds like you're running into a common issue. Each item in `$myPDF` is a multi-line string for a full page, so when you use `Select-String`, it considers the whole string as one unit. If "esxi" shows up anywhere in that string, it outputs the entire page.
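If you want to see this for yourself without the PDF, here is a minimal sketch. The `$page` string is a made-up stand-in for one element of `$myPDF` (an assumption about its shape, based on the behavior described above), and `$hits` is just an illustrative variable name.

```powershell
# Hypothetical page text standing in for one element of $myPDF
$page = "Overview`nInstall ESXi on the host.`nFinish setup."

# A multi-line string is still a single input object to Select-String
$hits = @($page | Select-String -Pattern 'esxi')

$hits.Count                # how many MatchInfo objects came back
$hits[0].Line -eq $page    # whether the "matching line" is the whole page
```

The `Line` property of the match holds the full input string, which is why the whole page gets printed.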

To search line-by-line, you can split each page string into individual lines. Try this:

```powershell
$lines = $myPDF[$i] -split '\r?\n'
$pageMatches = $lines | Select-String "esxi" -Context 1
```

This way, you'll get just the matching lines plus their context without dumping the whole page. Just keep in mind that you might lose your page numbers unless you implement a solution to track them as well.
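One way to keep the page numbers is to emit a small object per match that records the page index alongside the line. This is only a sketch: the sample `$myPDF` array is invented so the snippet runs without PSWritePDF, and the `$results` variable and the `Page`/`Line`/`Text`/`Before`/`After` property names are my own choices, not anything the module provides. It also uses `-lt` in the loop condition and avoids the name `$matches`, since `$Matches` is an automatic variable that PowerShell's `-match` operator populates.

```powershell
# Stand-in for: $myPDF = Convert-PDFToText -FilePath $file
# (hypothetical sample pages, one string per page)
$myPDF = @(
    "Introduction`nThis guide covers ESXi hosts.`nSee the next page.",
    "Networking`r`nNo hypervisor mentioned here.",
    "Upgrades`nBack up first.`nThen upgrade esxi.`nReboot."
)

# Collect everything the loop emits into $results
$results = for ($i = 0; $i -lt $myPDF.Count; $i++) {
    # Break the page into lines so each line is its own input object
    $lines = $myPDF[$i] -split '\r?\n'

    $lines | Select-String -Pattern 'esxi' -Context 1 | ForEach-Object {
        [pscustomobject]@{
            Page   = $i + 1                    # 1-based page number
            Line   = $_.LineNumber             # line position within the page
            Text   = $_.Line                   # the matching line itself
            Before = $_.Context.PreContext     # line(s) before the match
            After  = $_.Context.PostContext    # line(s) after the match
        }
    }
}

$results | Format-Table Page, Line, Text -AutoSize
```

Because the results are plain objects, you can filter or export them afterwards, for example with `Where-Object Page -eq 3` or `Export-Csv`.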
