How can I remove whitespace inside words in a German text file?

Asked By WhimsicalDingo99

Hey everyone! I'm dealing with a German text file that has spaces inserted between the characters of some words, like this: "i n t e r e s t i n g". I'd like to remove those spaces from within the words without affecting the spaces between words. My initial thought was to build a large list of every possible German word in that spaced-out form and replace them one by one, but that seems cumbersome. Does anyone have a more effective or elegant way to clean up this text? Also, I'm curious: why do questions like this tend to get downvoted?

5 Answers

Answered By HonestChipmunk91
Answered By SillyPineapple42

You could use a regex pattern like `([a-z]\s){5,}` in a text editor or script. Just replace the 5 with whatever minimum run of spaced letters you want to treat as a broken word. This should pinpoint the spaced-out letters pretty effectively!
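If you want to preview what that pattern catches before changing anything, something like this will list the hits with their line numbers (a rough sketch: `german.txt` is a placeholder, and it assumes GNU grep with PCRE support):

```bash
# List candidate spaced-out runs with line numbers before touching the file.
# Assumes plain ASCII spaces and lowercase ASCII letters; extend the class
# for umlauts if needed, and tune {5,} to your own threshold.
grep -noP '\b([a-z]\s){5,}[a-z]?\b' german.txt
```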

CuriousBee82 -

Good point! In English, single-letter words like 'I' and 'a' would trip this up, but German doesn't really have single-letter words, so at least those won't mess up the cleaning process!

TechWiz76 -

I tried a similar approach using `egrep` to identify bad words, but manually replacing them was such a hassle. Automating it would be much better!

Answered By PracticalPanda83

First check whether those gaps are ordinary spaces or something like non-breaking spaces (U+00A0). If they're just regular spaces, this should be pretty straightforward to fix.
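A quick way to check (the file name is just a placeholder) is to look for the UTF-8 byte sequence of a non-breaking space:

```bash
# Flag any non-breaking spaces (UTF-8 bytes C2 A0); no output means the
# gaps are ordinary ASCII spaces.
LC_ALL=C grep -n $'\xc2\xa0' german.txt
```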

WittyOtter17 -

Unfortunately, in this case, they're just plain old spaces!

Answered By CodeNinja57

Bash may not be the best route, especially if this isn't a recurring task for you. You could instead run the text through a locally hosted large language model (LLM), which can usually rejoin the words without mangling the rest of the formatting. Just be careful not to sink more time into fiddling with bash scripts than a one-off cleanup is actually worth!

FrustratedCoder45 -

I totally agree that it's easy to sink time into scripting, but an LLM could add unnecessary complexity of its own, especially since it can sometimes generate inaccurate results.

Answered By CleverFox34

Here's a bash script that may help:
```bash
#!/usr/bin/bash

# Build one sed expression that maps every spaced-out word found in the file
# ("i n t e r e s t i n g") onto its joined form ("interesting").
# NOTE: [A-Z]/[a-z] won't match umlauts or ß; extend the classes if you need them.
function gen_sed_cmd {
    # The spaced-out words exactly as they appear in the text...
    mapfile -t bwords < <(grep -oP '\b(([A-Z]\s)?([a-z]\s|-\s)+([a-z])?+)\b' < "${G_FILE}")
    # ...and the same matches with the inner spaces stripped out.
    mapfile -t gwords < <(grep -oP '\b(([A-Z]\s)?([a-z]\s|-\s)+([a-z])?+)\b' < "${G_FILE}" | tr -d ' ')

    local buf=()
    local cwords="${#bwords[@]}"
    for ((i = 0; i < cwords; i++)); do
        buf+=("s/${bwords[i]}/${gwords[i]}/")
    done

    # Join all substitutions into a single semicolon-separated sed expression.
    local IFS=';'
    printf '%s\n' "${buf[*]}"
}

function main {
    readonly G_FILE="${1}"
    shift 1

    if [ -z "${G_FILE}" ] || [ ! -f "${G_FILE}" ]; then
        printf 'usage: fix_words.sh <file> [comp-file]\n'
        exit 1
    fi

    read -r sed_cmd < <(gen_sed_cmd "${G_FILE}")

    if [ -n "${1}" ]; then
        echo "Diff output - file"
        diff --color=always --text <(sed "${sed_cmd}" "${G_FILE}") "${1}"
    else
        echo "Edit in place"
        sed -i "${sed_cmd}" "${G_FILE}"
    fi
}

main "${@}"
```
This should solve most of your issues! Let me know if it works for you.
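To try it out, save it as fix_words.sh (the name its usage message assumes), make it executable, and run it against your file; the second argument is optional and is only used to diff the proposed fixes against a reference copy (file names below are placeholders):

```bash
# Assumes the script above is saved as fix_words.sh in the current directory.
chmod +x fix_words.sh
./fix_words.sh german.txt reference.txt   # diff the cleaned output against a reference file
./fix_words.sh german.txt                 # no second argument: edit german.txt in place
```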
