Whitespace Normalizer
How it works
Whitespace normalisation is a fundamental text-cleaning step in data processing, ETL pipelines, and document editing. Inconsistent whitespace (multiple spaces, non-breaking spaces, zero-width characters, tabs, mixed line endings) causes failures in string comparison, database lookups, CSV parsing, and display rendering. The Whitespace Normalizer detects and fixes the whitespace anomalies described below.
**Problem categories**
1. **Multiple spaces**: `word    gap` → `word gap`. Common in text copy-pasted from PDFs or Word documents.
2. **Non-breaking spaces** (Unicode U+00A0, HTML `&nbsp;`): visually identical to regular spaces but not treated as word boundaries by many parsers. Common in text copied from HTML pages.
3. **Zero-width characters** (U+200B, U+FEFF, U+200C, U+200D): invisible characters that cause string comparison failures and can hide inside copied text. They are sometimes used to fingerprint documents so that a leaked copy can be traced.
4. **Mixed line endings**: Windows uses CRLF (\r\n), Unix uses LF (\n), classic Mac OS used CR (\r). Mixing them can confuse line-counting tools, diffs, and some parsers.
5. **Trailing whitespace**: spaces or tabs at line ends. They are invisible but create diff noise in version control and can cause problems in whitespace-sensitive formats such as YAML.
6. **Tab/space mixing**: inconsistent indentation that breaks Python and YAML.
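The six categories above can be sketched as one function. This is a minimal illustration, not the tool's actual implementation: the name `normalize_whitespace`, the `tab_size` parameter, and the order of the steps are all assumptions made for the example.

```python
import re

# Zero-width characters named in the list above, mapped to None so that
# str.translate deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def normalize_whitespace(text: str, tab_size: int = 4) -> str:
    """Hypothetical helper: apply the six fixes in a fixed order."""
    # 4. Mixed line endings: convert CRLF and lone CR to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 3. Zero-width characters: delete them outright.
    text = text.translate(ZERO_WIDTH)
    # 2. Non-breaking spaces: turn U+00A0 into a regular space.
    text = text.replace("\u00a0", " ")

    lines = []
    for line in text.split("\n"):
        # 6. Tab/space mixing: expand tabs in the leading indentation only.
        body = line.lstrip(" \t")
        indent = line[: len(line) - len(body)].expandtabs(tab_size)
        # 1. Multiple spaces: collapse runs of spaces after the indentation.
        body = re.sub(r" {2,}", " ", body)
        # 5. Trailing whitespace: strip spaces and tabs at the line end.
        lines.append((indent + body).rstrip(" \t"))
    return "\n".join(lines)


print(repr(normalize_whitespace("word    gap\u00a0here\u200b \r\n\tx  y\t\r")))
# prints 'word gap here\n    x y\n'
```

Indentation is handled separately from the rest of the line so that collapsing repeated spaces does not flatten nested blocks. A real pipeline would probably make each step optional, since not every target format wants all six.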
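**Unicode whitespace** Unicode assigns the White_Space property to 25 characters, well beyond the ASCII space and tab: em space, en space, hair space, thin space, figure space, ideographic space, and more. The normaliser targets the Unicode separator categories (Zs, Zl, Zp) plus control characters (tab, carriage return, line feed, and similar). Zero-width characters such as U+200B belong to the format category (Cf), not to a separator category, which is why they are listed as their own problem category above.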
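To see which characters fall into the Zs, Zl, and Zp categories, Python's standard `unicodedata` module can enumerate them. The sketch below is illustrative only; the function names are made up for the example, and the exact list depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata

SEPARATOR_CATEGORIES = {"Zs", "Zl", "Zp"}


def unicode_space_chars():
    """Return (code point, name) pairs for every character in Zs, Zl, or Zp."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in SEPARATOR_CATEGORIES:
            found.append((f"U+{cp:04X}", unicodedata.name(ch, "<unnamed>")))
    return found


def replace_unicode_spaces(text: str) -> str:
    """Hypothetical helper: map Zs to a plain space and Zl/Zp to a newline."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Zs":
            out.append(" ")
        elif category in ("Zl", "Zp"):
            out.append("\n")
        else:
            out.append(ch)
    return "".join(out)


for code_point, name in unicode_space_chars():
    print(code_point, name)
```

Note that U+200B (zero width space) does not appear in the output, because its category is Cf.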
**Pipeline use** Run whitespace normalisation before inserting text into a database. In many databases a value with trailing spaces is stored as distinct from the same value without them, so a UNIQUE constraint or duplicate check can accept both and miss the match.
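The effect can be reproduced with SQLite from Python's standard library. This is a sketch under assumptions: the `users` table and the `clean` helper are hypothetical, and other databases may compare trailing spaces differently depending on column type and collation.

```python
import sqlite3


def clean(value: str) -> str:
    """Hypothetical pre-insert step for single-line fields.

    str.split() with no arguments splits on any Unicode whitespace
    (including U+00A0), so joining the pieces collapses runs and trims ends.
    """
    return " ".join(value.split())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT UNIQUE)")

# Without normalisation both rows are accepted: the values differ by one space.
conn.execute("INSERT INTO users VALUES (?)", ("John Smith",))
conn.execute("INSERT INTO users VALUES (?)", ("John Smith ",))
raw_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

conn.execute("DELETE FROM users")

# With normalisation the second insert collides with the first.
conn.execute("INSERT INTO users VALUES (?)", (clean("John Smith"),))
try:
    conn.execute("INSERT INTO users VALUES (?)", (clean("John\u00a0Smith "),))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(raw_count, duplicate_rejected)  # prints 2 True
```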
Privacy: all processing runs in the browser. No data is transmitted.
Frequently Asked Questions
- A non-breaking space (Unicode U+00A0, HTML `&nbsp;`) looks identical to a regular space but behaves differently: it prevents a line break at that position in HTML, and parsers treat it as a different character from the regular space (U+0020). When users copy text from web pages into forms, databases, or code, non-breaking spaces come along wherever the HTML source used them to prevent line breaks. This causes string comparison failures (the user typed 'John Smith' with a regular space; the database holds 'John Smith' with a non-breaking space), CSV parsing errors, and pattern matching failures in regular expressions that match a literal space or only ASCII whitespace. Whether `\s` matches U+00A0 depends on the engine: JavaScript and Python 3 string patterns match it, while ASCII-only modes do not.
- Zero-width characters are invisible in most editors and text displays but show up in a hex editor or when you inspect the raw bytes. In JavaScript, detect them with `/[\u200B-\u200D\u2060\uFEFF]/.test(str)`. In Python, check for chr(0x200B), chr(0x200C), chr(0x200D), chr(0x2060), and chr(0xFEFF). In a hex dump, zero-width characters appear as otherwise unexplained byte sequences (e2 80 8b for U+200B in UTF-8). In Notepad++, recent versions can display them as visible marks via View → Show Symbol → Show All Characters.
- LF (\n, byte 0x0A, 'line feed'): the Unix, Linux, and macOS standard. CRLF (\r\n, bytes 0x0D 0x0A, 'carriage return + line feed'): the Windows standard, inherited from teletype and typewriter conventions (carriage return = move the print head to the start of the line; line feed = advance the paper one line). CR (\r, byte 0x0D only): classic Mac OS (version 9 and earlier). In mixed-OS environments, files may have inconsistent endings. Python's universal newlines mode, the default when reading text files, handles all three. Git can normalise line endings to LF when files are committed and convert them on checkout (for example, `* text=auto` in .gitattributes). A CRLF in a file opened with Unix tools appears as ^M at line ends in some editors.
- Python uses indentation (spaces or tabs) to delimit blocks; mixing tabs and spaces inconsistently raises TabError, a subclass of IndentationError, in Python 3. YAML uses indentation for structure: a misaligned key causes a parse error or silently changes the nesting, and tabs are not allowed for indentation. Makefile recipe lines must start with a tab, not spaces. Haskell uses layout rules in which indentation determines scope. Go has a standard format (gofmt) that is enforced by tooling, not by the compiler. In HTML, runs of spaces and newlines in the source render as a single space (collapsed whitespace). The whitespace normaliser lets you choose exactly the normalisation needed for each target format.
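The checks described in the answers above can be combined into a small diagnostic that reports problems without changing the text. This is a sketch only: `inspect_whitespace` and the keys of its result are hypothetical names, and the indentation check flags only lines whose own indentation contains both tabs and spaces.

```python
import re

ZERO_WIDTH = re.compile("[\u200b-\u200d\u2060\ufeff]")
LINE_ENDING = re.compile(r"\r\n|\r|\n")  # CRLF is tried first so it counts once
ENDING_NAMES = {"\r\n": "CRLF", "\r": "CR", "\n": "LF"}


def inspect_whitespace(text: str) -> dict:
    """Hypothetical diagnostic: count the whitespace issues found in text."""
    endings = {"CRLF": 0, "CR": 0, "LF": 0}
    for match in LINE_ENDING.finditer(text):
        endings[ENDING_NAMES[match.group()]] += 1

    mixed_indent = 0
    for line in LINE_ENDING.split(text):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if " " in indent and "\t" in indent:
            mixed_indent += 1

    return {
        "nbsp": text.count("\u00a0"),
        # (character offset, code point) for each zero-width character
        "zero_width": [
            (m.start(), f"U+{ord(m.group()):04X}") for m in ZERO_WIDTH.finditer(text)
        ],
        "line_endings": endings,
        "mixed_indent_lines": mixed_indent,
    }


report = inspect_whitespace("John\u00a0Smith\u200b\r\n \tx = 1\n\ty\r")
print(report)
```

A file with more than one non-zero count under `line_endings` has mixed endings. Reporting before rewriting makes it easier to decide which normalisations a given target format actually needs.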