Hey everyone! I'm new here and trying to get all my source files (.sh, .js, .py, and config files) into pure UTF-8 plain text. By that I mean no hidden artifacts: no BOM, null bytes, non-breaking spaces, zero-width spaces, LRM/RLM characters, or tabs. I'd also like a way to verify that my files really are UTF-8 and not just ASCII, since commands like `file --mime` report ASCII for files without non-ASCII bytes, and `iconv -f` conversions leave such files unchanged. Is there a reliable way to check this? I'm on Ubuntu 22.04, and I'm wondering if this is even achievable. Thanks for any help!
3 Answers
It's tricky because you can't rely on tools to tell you whether a file is "purely" UTF-8: every ASCII file is also valid UTF-8, so detectors like `file --mime` will happily report `us-ascii` for a file you consider UTF-8. If what you actually care about is hidden characters, the practical approach is to decide which code points you want to ban (BOM, NBSP, zero-width characters, bidi marks, tabs, and so on) and search for exactly those. A regex over the decoded text can identify invisible characters reliably, and on a small codebase a manual review on top of that can work wonders!
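For the regex idea, here's a minimal sketch in Python. The character class in `SUSPECT` is just an illustrative blocklist, not an exhaustive one, so adjust it to your own rules:

```python
import re
import unicodedata

# An illustrative blocklist (not exhaustive): BOM/ZWNBSP, NBSP,
# zero-width space/joiners, LRM/RLM, NUL, and tab.
SUSPECT = re.compile("[\ufeff\u00a0\u200b\u200c\u200d\u200e\u200f\u0000\t]")

def find_suspects(text):
    """Return (line, column, name) for every suspect character in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in SUSPECT.finditer(line):
            ch = m.group()
            # Fall back to the raw code point for characters with no name (e.g. tab).
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            hits.append((lineno, m.start() + 1, name))
    return hits
```

Run it over each file's decoded contents; an empty result means none of the listed characters are present.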
It's important to note that ASCII is a subset of UTF-8: any file containing only ASCII bytes is, by design, also a valid UTF-8 file. So "strictly UTF-8 and not ASCII" isn't a meaningful goal for ASCII-only content; what you can enforce is "valid UTF-8 with none of the code points I've banned." Tools like `iconv` can convert between encodings and will fail on invalid byte sequences, but to pinpoint specific unwanted characters you'll want regular expressions or a Unicode library that lets you list exactly the code points to exclude.
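To make the subset point concrete, here's a rough sketch of the classification (the category strings are my own, loosely mirroring what `file --mime` reports):

```python
def classify(data: bytes) -> str:
    """Roughly mirror how encoding detectors categorize text files."""
    try:
        data.decode("utf-8")  # strict decode: rejects invalid byte sequences
    except UnicodeDecodeError:
        return "not valid UTF-8"
    # Every ASCII-only byte sequence is also valid UTF-8 by design, so a
    # file can only be "UTF-8 and not ASCII" if some byte is >= 0x80.
    if all(b < 0x80 for b in data):
        return "ASCII (and therefore also valid UTF-8)"
    return "UTF-8 with non-ASCII characters"
```

This is why no tool can report your ASCII-only files as "UTF-8 but not ASCII": the bytes themselves don't carry that distinction.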
That's a good point about ASCII being a subset! But I get what you're saying about wanting to strip out those extra characters. Maybe checking for non-visible characters using regex patterns could be helpful in your case.
If you're looking for pure UTF-8 without any ASCII, that's a misunderstanding of how the character sets relate: most modern files are UTF-8, and they contain ASCII characters because ASCII is the shared base of both encodings. If you want to enforce your stricter criteria (no BOM, no invisible characters, no tabs), you'll need a custom validation script that decodes each file as UTF-8 and rejects any code point that doesn't meet your standards.
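A sketch of such a validation script, where the set of banned code points is a placeholder for whatever your project's rules turn out to be:

```python
import pathlib

# Placeholder blocklist: BOM/ZWNBSP, NBSP, ZWSP, LRM, RLM, NUL, tab.
BANNED = {0xFEFF, 0x00A0, 0x200B, 0x200E, 0x200F, 0x0000, 0x0009}

def check_file(path: pathlib.Path) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    data = path.read_bytes()
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("UTF-8 BOM at start of file")
    try:
        text = data.decode("utf-8")  # strict: invalid sequences fail here
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]
    found = {ord(c) for c in text} & BANNED
    problems += [f"contains U+{cp:04X}" for cp in sorted(found)]
    return problems
```

Wire it into a loop over your file list (or a pre-commit hook) and fail on any non-empty result.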
Yes! I think writing a script that runs through your files looking for those unwanted characters could really help you clean up your codebase. Just make sure you define clearly which code points to exclude!
Totally agree! Having a reliable regex can really streamline your hunt for those pesky invisible characters. It might also be worth looking at tools designed for code formatting!