How can I clean up newline characters from OCR text?

April 15, 2025

Asked By CuriousCat42 On April 15, 2025

I'm looking for a way to copy text from an image and lightly edit it to remove unwanted characters, especially newline characters. I've set up the following command in my i3 config file:

bindsym $mod+Mod1+t exec flameshot gui --raw | tesseract -l eng+ita stdin stdout | sed -r 's/(\n|\r)/\s/g' | xclip -selection clipboard

However, I'm running into a problem where the OCR output still contains newline characters, and it even adds more newlines than the original text. For example:

This is a normal text.
Here I have gone on a newline.

But the OCR produced this:

This is a normal text.

Here | have gone on a newline.

It adds an empty line where there shouldn't be one. How can I adjust my command to get cleaner output? I also want to know if I can further edit the text, like turning all instances of E' into È. Would adding another 's/../.../g' command be enough, or is there more to it?

2 Answers

Answered By TechyWizard99 On April 16, 2025

You might want to simplify your approach a bit. Instead of the complex sed command, consider using `tr -d '\n'` to delete newlines. That could do the trick! Also, regarding the misread characters, check whether Tesseract's output is accurate in the first place; it sometimes confuses similar-looking characters, such as 'I' becoming '|'. Stick to simple commands until they stop being enough; sed and awk can be overkill for basic character cleanup.
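A likely reason the sed expression in the question has no visible effect: sed reads its input one line at a time and strips the trailing newline before running the script, so a pattern containing `\n` never finds one to match. `tr` works on the raw character stream instead. A minimal sketch of the difference, where `sample` is a hypothetical stand-in for real Tesseract output:

```shell
#!/bin/sh
# Hypothetical sample standing in for Tesseract output (not real OCR data).
sample() {
    printf 'This is a normal text.\n\nHere I have gone on a newline.\n'
}

# sed sees each line without its trailing newline, so this substitution
# never matches and all line breaks pass through unchanged.
count_newlines_after_sed() {
    sample | sed 's/\n/ /g' | wc -l | tr -d ' '
}

# tr operates on individual characters, so every newline is deleted.
count_newlines_after_tr() {
    sample | tr -d '\n' | wc -l | tr -d ' '
}

count_newlines_after_sed; echo   # 3
count_newlines_after_tr; echo    # 0
```

The trailing `tr -d ' '` only strips the padding that some `wc` implementations print around the count.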

ThoughtfulCoder88 - April 18, 2025

I found a solution that worked well for me. Here's the final command that also handles extra spaces and replaces '|' with 'I':

bindsym $mod+Mod1+t exec flameshot gui --raw | tesseract -l eng+ita stdin stdout | tr '\n' ' ' | tr -s ' ' | tr '|' 'I' | xclip -selection clipboard

QuestionAsker101 - April 18, 2025

While your method does remove newlines, it mashes together the words at the ends of lines. That's why I preferred substituting `\n` with a space: I want the words on either side of the newline to stay separate, not jumbled together.
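The difference between deleting newlines and replacing them with spaces can be seen in a small sketch. The two-line `sample` and the function names are hypothetical, chosen only for illustration:

```shell
#!/bin/sh
# Hypothetical two-line sample standing in for Tesseract output.
sample() {
    printf 'first line\nsecond line\n'
}

# Deleting newlines joins the last word of one line to the first of the next.
join_by_delete() {
    sample | tr -d '\n'
}

# Replacing each newline with a space keeps the words apart; tr -s squeezes
# repeated spaces and sed trims the space left behind by the final newline.
join_by_space() {
    sample | tr '\n' ' ' | tr -s ' ' | sed 's/ *$//'
}

join_by_delete; echo   # first linesecond line
join_by_space; echo    # first line second line
```

The final `sed 's/ *$//'` is optional; without it the clipboard text simply ends with one extra space.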

Answered By CodeNinja77 On April 15, 2025

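If you want to make additional single-character changes to the text, you can chain multiple `tr` commands, one per replacement. Just keep adding them like this:

| tr '|' 'I'

Note that `tr` only maps one character to one character, so it cannot turn the two-character sequence E' into È (and GNU `tr` works on bytes, so it does not handle a multibyte character like È reliably either). For that kind of replacement, use sed.

With sed, yeah, it's totally fine to just add another 's/.../.../g'. Just make sure to separate them with a `;` if you chain commands in a single script, or pass each one with its own `-e` option.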
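For multi-character replacements such as E' to È, a sed filter with one `-e` per substitution is one option. This is a minimal sketch; the function name `fix_ocr` is hypothetical, and the two substitutions are just the ones mentioned in this thread:

```shell
#!/bin/sh
# Hypothetical cleanup filter: reads OCR text on stdin, writes to stdout.
fix_ocr() {
    # Each -e is one substitution, applied in order to every line:
    #   1. E followed by an apostrophe becomes È
    #   2. a stray vertical bar becomes a capital I
    sed -e "s/E'/È/g" -e 's/|/I/g'
}

printf "E' vero, | agree.\n" | fix_ocr   # È vero, I agree.
```

In the binding, this stage would sit just before `xclip`. As far as I know, i3 treats `;` and `,` in a config line as separators between i3 commands unless the command is quoted, so separate `-e` options (or moving the whole pipeline into a small script that `exec` calls) may be less fragile than a `;` inside the sed script.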

CuriousCat42 - April 18, 2025

Thanks for clarifying! I appreciate the tips on chaining commands. I'll try them out!

TechyWizard99 - April 18, 2025

Just keep an eye out for other formatting issues as you edit! Tesseract can sometimes misinterpret certain characters.
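One such formatting issue is the extra empty line from the question. If the real line breaks should be kept and only the blank lines dropped, a filter along these lines may help. It is a sketch under two assumptions: the function name `strip_blank_lines` is made up, and the form feed handling assumes Tesseract ends each page of text output with a form feed character, which appears to be its default page separator in recent versions:

```shell
#!/bin/sh
# Hypothetical filter: keep real line breaks, drop blank lines and form feeds.
strip_blank_lines() {
    # tr removes any form feed (assumed Tesseract page separator);
    # sed deletes lines that are empty or contain only whitespace.
    tr -d '\f' | sed '/^[[:space:]]*$/d'
}

printf 'This is a normal text.\n\nHere I have gone on a newline.\n\f' | strip_blank_lines
```

This stage can replace the `tr '\n' ' '` step when the goal is to preserve the original line structure rather than join everything into one line.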
