Hey everyone! I'm dealing with a German text file that has some annoying whitespaces inserted between characters in words, like this: "i n t e r e s t i n g". I'd like to find a way to remove those whitespaces from within the words without affecting the spaces between words. My initial thought was to create a large text file with all possible German words in that spaced format to replace them, but that seems a bit cumbersome. Does anyone have a more effective or elegant solution for cleaning up this text? Also, I'm curious—why do some questions like this tend to get downvoted?
5 Answers
You could use a regex pattern like `([a-z]\s){5,}` in a text editor or script. Replace `5` with the minimum number of spaced letters you want to treat as a broken word. That should pinpoint the spaced-out letters pretty effectively!
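To automate the replacement rather than just find the matches, here is a minimal sketch of the same idea as a bash function. The name `join_spaced_words` and its `min_letters` argument are made up for illustration. It assumes GNU grep (for `\b`), plain single spaces between the letters, and a UTF-8 locale if umlauts should count as letters.

```bash
# Join runs of single letters separated by single spaces.
# Usage: join_spaced_words [min_letters] < input > output
# (hypothetical helper; min_letters defaults to 5)
join_spaced_words() {
    local min="${1:-5}" line match joined
    # A run is "letter + space" at least min-1 times, then a final letter.
    local re="\b([[:alpha:]] ){$((min - 1)),}[[:alpha:]]\b"
    while IFS= read -r line || [ -n "$line" ]; do
        # grep -o prints each run on this line, one per output line.
        while IFS= read -r match; do
            joined="${match// /}"            # drop the inner spaces
            line="${line/"$match"/$joined}"  # replace first literal occurrence
        done < <(grep -oE "$re" <<< "$line")
        printf '%s\n' "$line"
    done
}
```

For example, `join_spaced_words 5 < input.txt > output.txt` would leave runs shorter than five letters alone. Two spaced-out words that sit next to each other still look like one long run, so they would be merged into one word.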
I tried a similar approach using `egrep` to identify bad words, but manually replacing them was such a hassle. Automating it would be much better!
Just check if the whitespaces are actually common spaces or something like non-breaking spaces. If they're just regular spaces, then it should be pretty straightforward to fix.
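A quick way to check is to look for the UTF-8 bytes of a no-break space (U+00A0, `0xC2 0xA0`). This is only a sketch: the function names are invented here, it assumes the file is UTF-8 encoded, and it ignores other Unicode space characters.

```bash
# Print how many input lines contain a no-break space (U+00A0).
count_nbsp_lines() {
    # LC_ALL=C makes grep compare raw bytes; "|| true" hides grep's
    # non-zero exit status when the count is 0.
    LC_ALL=C grep -c $'\xc2\xa0' || true
}

# Replace every no-break space with a regular space.
nbsp_to_space() {
    LC_ALL=C sed $'s/\xc2\xa0/ /g'
}
```

Run `count_nbsp_lines < file.txt` first. If it prints anything other than `0`, pipe the file through `nbsp_to_space` before applying a regex that expects ordinary spaces.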
Unfortunately, in this case, they're just plain old spaces!
Using bash may not be the best route, especially if this isn't a regular task for you. You might consider using a locally hosted large language model (LLM) to parse the text, since it can rejoin the words without mistakenly removing the spaces between them. But be careful: it's easy to waste time fiddling with bash scripts instead of using a more efficient method!
I totally agree! Using an LLM could add unnecessary complexity, especially if it sometimes generates inaccurate results.
Here's a bash script that may help:
```bash
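#!/usr/bin/env bash

# Build one sed script that maps each spaced-out word to its joined form.
function gen_sed_cmd {
    # Optional capital letter, then lowercase letters or hyphens each
    # followed by whitespace, then an optional final lowercase letter.
    local pattern='\b(([A-Z]\s)?([a-z]\s|-\s)+([a-z])?+)\b'
    local bwords gwords buf=() i
    # Spaced-out words exactly as they appear in the file.
    mapfile -t bwords < <(grep -oP "${pattern}" < "${G_FILE}")
    # The same matches, in the same order, with the spaces removed.
    mapfile -t gwords < <(grep -oP "${pattern}" < "${G_FILE}" | tr -d ' ')
    # Use "<", not "<=", so no empty "s///" is appended at the end.
    for ((i = 0; i < ${#bwords[@]}; i++)); do
        buf+=("s/${bwords[i]}/${gwords[i]}/")
    done
    # Join the substitutions with ';' so sed accepts them as one script.
    local IFS=';'
    printf '%s\n' "${buf[*]}"
}

function main {
    readonly G_FILE="${1}"
    # The input file is required and must exist.
    if [ -z "${G_FILE}" ] || [ ! -f "${G_FILE}" ]; then
        printf 'usage: fix_words.sh file [comp-file]\n' >&2
        exit 1
    fi
    shift 1
    local sed_cmd
    read -r sed_cmd < <(gen_sed_cmd)
    if [ -n "${1}" ]; then
        # With a second file: show how the fixed text differs from it.
        echo "Diff output - file"
        diff --color=always --text <(sed "${sed_cmd}" "${G_FILE}") "${1}"
    else
        echo "Edit in place"
        sed -i "${sed_cmd}" "${G_FILE}"
    fi
}

main "${@}"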
```
This should solve most of your issues! Let me know if it works for you.
Good point! English has single-letter words like 'I' and 'a' that can trip up this kind of cleanup. German has its quirks, but at least you don't have to deal with single-letter words messing up the cleaning process!