I'm looking for a way to copy text from an image and lightly edit it to remove unwanted characters, especially newline characters. I've set up the following command in my i3 config file:
bindsym $mod+Mod1+t exec flameshot gui --raw | tesseract -l eng+ita stdin stdout | sed -r 's/(\n|\r)/\s/g' | xclip -selection clipboard
However, I'm running into a problem where the OCR output still contains newline characters, and it even adds more newlines than the original text. For example:
This is a normal text.
Here I have gone on a newline.
But the OCR produced this:
This is a normal text.

Here | have gone on a newline.
It adds an empty line where there shouldn't be one. How can I adjust my command to get cleaner output? I also want to know if I can further edit the text, like turning all instances of E' into È. Would adding another 's/../.../g' command be enough, or is there more to it?
2 Answers
You might want to simplify your approach a bit. Instead of the complex sed command, consider using `tr -d '\n'` to delete newlines. That could do the trick! As for the misread characters, check how accurate the Tesseract output is: it sometimes confuses similar-looking characters, like 'I' becoming '|'. Stick to simpler commands for as long as they do the job; sed and awk can be overkill for most text processing tasks!
While your method works in removing newlines, it tends to mash words together at the ends of lines. That's why I preferred substituting `\n` with a space. I want the words on either end of the newline to remain separate, not jumbled together.
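To make the difference concrete, here is a small sketch that needs no screenshot. The sample text and the variable names (`sample`, `deleted`, `spaced`) are made up for illustration; only `tr -d`, `tr` and `tr -s` come from the thread.

```shell
# Sample OCR-like text: two lines with a blank line between them.
sample=$(printf 'This is a normal text.\n\nHere I have gone on a newline.\n')

# Deleting newlines glues the neighbouring words together ("text.Here").
deleted=$(printf '%s\n' "$sample" | tr -d '\n')

# Translating newlines to spaces keeps the words apart; tr -s then squeezes
# the run of spaces left behind by the blank line into a single space.
# Note that the final newline also becomes a space, so one trailing space remains.
spaced=$(printf '%s\n' "$sample" | tr '\n' ' ' | tr -s ' ')

printf '[%s]\n[%s]\n' "$deleted" "$spaced"
```

The brackets in the last line are only there so the trailing space in the second result is visible.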
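If you want to make additional changes to the text (like turning E' into È), keep in mind that `tr` only maps single characters to single characters, so it can't replace the two-character sequence E', and GNU `tr` doesn't handle multibyte characters such as È anyway. Add a sed substitution to the pipeline instead:
| sed "s/E'/È/g"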
Also, if you're using sed, yeah, it's totally fine to just add another 's/.../.../g'. Just make sure to separate them with a `;` if you put several substitutions in a single sed expression.
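Here's a rough sketch of what chained substitutions could look like. `fix_ocr` is a hypothetical helper name, and the two rules are just the examples from this thread. Note that sed reads its input one line at a time, so a rule like `s/\n/ /` never sees the newline; it's easier to leave the newline handling to `tr`.

```shell
# fix_ocr (hypothetical name): apply several substitutions in one sed call.
# Without -r/-E, sed uses basic regular expressions, where | is a literal pipe.
fix_ocr() {
    sed -e "s/E'/È/g; s/|/I/g"
}

# Try it on a made-up line instead of real OCR output.
printf "E' vero, | know\n" | fix_ocr
```

Writing each rule as its own `-e '...'` argument is equivalent to the `;` form, if you find that easier to read.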
Thanks for clarifying! I appreciate the tips on chaining commands. I'll try them out!
Just keep an eye out for other formatting issues as you edit! Tesseract can sometimes misinterpret certain characters.
I found a solution that worked well for me. Here's the final command that also handles extra spaces and replaces '|' with 'I':
bindsym $mod+Mod1+t exec flameshot gui --raw | tesseract -l eng+ita stdin stdout | tr '\n' ' ' | tr -s ' ' | tr '|' 'I' | xclip -selection clipboard
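If anyone wants to test the cleanup part without taking a screenshot each time, the text-processing stages can be pulled into a function, roughly like this. `clean_ocr` is a made-up name, and the E' rule and the trailing-space rule are additions I'm assuming you'd want, not part of the command above.

```shell
# clean_ocr (hypothetical name): the cleanup stages from the command above,
# reading OCR text on stdin and writing the cleaned text to stdout.
clean_ocr() {
    tr '\n' ' ' |        # join lines with spaces
    tr -s ' ' |          # squeeze repeated spaces into one
    tr '|' 'I' |         # every pipe becomes I, including any genuine pipes
    sed -e "s/E'/È/g" -e 's/ $//'   # assumed extras: E' to È, drop trailing space
}

# Feed it the OCR output from the question.
printf 'This is a normal text.\n\nHere | have gone on a newline.\n' | clean_ocr
```

In the pipeline the function would sit between `tesseract ... stdout` and `xclip -selection clipboard`. One caveat: the blanket `tr '|' 'I'` will also rewrite pipes that really are pipes, so it's only safe for plain prose.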