Hey everyone! I'm trying to search through PDFs for specific text occurrences. I downloaded a PDF from VMware and I'm looking to find sentences that mention "esxi". I can convert the PDF into an array of objects, but when I use Select-String to pipe in the object, it just prints out the entire content of the PDF instead of the specific matches I'm looking for. I'm also trying to loop through the pages, but that's returning the same result. Here's what I've got:
```powershell
Import-Module PSWritePDF
$myPDF = Convert-PDFToText -FilePath $file
# $matches = $myPDF | Select-String "esxi" -Context 1
$matches = [System.Collections.Generic.List[string]]::new()
$pages = $myPDF.length
for ($i=0; $i -le $pages; $i++) {
    $pageMatches = $myPDF[$i] | Select-String "esxi" -Context 1
    foreach ($pageMatch in $pageMatches) {
        $matches.Add($pageMatch)
    }
}
```
Has anyone tackled something similar? Any tips would be really appreciated!
3 Answers
You’ve nailed it with your loop structure! The issue is that `Select-String` treats each page in `$myPDF[$i]` as a single multi-line string, so a match anywhere on the page returns the whole page. Split the page into lines first:
```powershell
$lines = $myPDF[$i] -split "`r?`n"
$pageMatches = $lines | Select-String "esxi" -Context 1
```
This returns only the matching lines, plus one line of context before and after each match. Give it a shot and see if that helps!
I think the issue here is with how the PDF is being converted. The underlying library seems to return one string per page rather than one per line. You might want to split each page into lines manually, or use a tool like Ghostscript to convert the PDF into a text file that you can manipulate in PowerShell.
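If you go the Ghostscript route, here's a rough sketch of what that could look like. It assumes Ghostscript is installed and its console executable is on your PATH (`gswin64c` on 64-bit Windows, `gs` elsewhere), and both file paths are placeholders I made up, so swap in your own:

```powershell
# Hypothetical paths; adjust to your environment.
$pdfPath = 'C:\docs\vsphere.pdf'
$txtPath = 'C:\docs\vsphere.txt'

# Assumption: the Ghostscript console executable is on PATH.
$gs = Get-Command gswin64c, gs -CommandType Application -ErrorAction SilentlyContinue |
    Select-Object -First 1

if ($gs -and (Test-Path $pdfPath)) {
    # txtwrite is Ghostscript's text-extraction output device.
    & $gs.Source -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite "-sOutputFile=$txtPath" $pdfPath

    # Get-Content emits one string per line, so Select-String matches line by line.
    Get-Content $txtPath | Select-String 'esxi' -Context 1
}
else {
    Write-Warning 'Ghostscript or the input PDF was not found; nothing to search.'
}
```

Since `Get-Content` already hands you one string per line, there's no manual splitting to do. The trade-off is that the text file no longer tells you which page a line came from.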
It sounds like you're running into a common issue. Each item in `$myPDF` is a multi-line string for a full page, so when you use `Select-String`, it considers the whole string as one unit. If "esxi" shows up anywhere in that string, it outputs the entire page.
To search line-by-line, you can split each page string into individual lines. Try this:
```powershell
$lines = $myPDF[$i] -split '\r?\n'
$pageMatches = $lines | Select-String "esxi" -Context 1
```
This way, you'll get just the matching lines plus their context without dumping the whole page. Just keep in mind that you might lose your page numbers unless you implement a solution to track them as well.
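For the page-number part, one option is to emit a small object per match that carries the page index along with it. This is only a sketch: the `$myPDF` array below is made-up sample text standing in for the output of `Convert-PDFToText`, and the property names (`Page`, `Line`, `Text`, `Before`, `After`) are just ones I picked for illustration.

```powershell
# Stand-in for the output of Convert-PDFToText: one multi-line string per page.
# (Sample text is invented for illustration.)
$myPDF = @(
    "Introduction`nThis guide covers ESXi hosts.`nSee the next page.",
    "Networking`nNo hypervisor mentioned here.",
    "Upgrades`r`nBack up first.`r`nThen upgrade ESXi.`r`nReboot the host."
)

# Avoid the name $matches: it is an automatic variable that -match overwrites.
$results = for ($i = 0; $i -lt $myPDF.Count; $i++) {
    # Split the page into lines so Select-String evaluates each line separately.
    $lines = $myPDF[$i] -split '\r?\n'

    foreach ($hit in ($lines | Select-String -Pattern 'esxi' -Context 1)) {
        [pscustomobject]@{
            Page   = $i + 1             # report pages as 1-based
            Line   = $hit.LineNumber    # line number within the page
            Text   = $hit.Line
            Before = $hit.Context.PreContext -join ' '
            After  = $hit.Context.PostContext -join ' '
        }
    }
}

$results | Format-Table -AutoSize
```

A couple of side notes on the loop in the question: `-le` runs one index past the end of the array, so `-lt` is the safer comparison, and a `List[string]` turns each match object into a plain string, which drops the context. Collecting the loop output directly, as above, keeps everything.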
