Duplicate Line Collapser
How it works
Duplicate line removal is one of the most common text processing operations in developer workflows: deduplicating URL lists, email lists, log entries, dictionary files, tag lists, and any line-separated data where uniqueness matters. The Duplicate Line Collapser removes repeated lines, with options for case sensitivity, trimming whitespace, and showing duplicate counts.
**Algorithm** The default mode uses a hash set (ES6 `Set`) and runs in O(n) time: each line is inserted into the set, which discards duplicates while preserving insertion order. The alternative, sorting and then comparing adjacent lines, runs in O(n log n) but loses the original order. The tool offers both modes: preserve-order deduplication (Set-based) and sorted deduplication.
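A minimal sketch of both modes (the function names are illustrative, not the tool's internals):

```javascript
// Preserve-order deduplication: a Set keeps the first occurrence of each line.
function dedupePreserveOrder(lines) {
  return [...new Set(lines)];
}

// Sorted deduplication: sort a copy, then keep lines that differ from the previous one.
function dedupeSorted(lines) {
  return [...lines].sort().filter((line, i, arr) => i === 0 || line !== arr[i - 1]);
}

const input = ["b", "a", "b", "c", "a"];
console.log(dedupePreserveOrder(input)); // ["b", "a", "c"]
console.log(dedupeSorted(input));        // ["a", "b", "c"]
```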
**Variation options**
- **Case-insensitive**: "Apple" and "apple" are treated as duplicates. Useful for email lists (domains are case-insensitive per RFC 5321, and providers treat local parts the same way in practice).
- **Trim whitespace**: " word " matches "word". Essential for CSV-derived data where trailing spaces differ between exports.
- **Empty line handling**: strip all blank lines, or keep one blank line between sections.
- **Show count**: output each unique line followed by "× N" showing how many times it appeared. Useful for word frequency lists or log analysis.
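The options above can be sketched as a single collapse function. The name `collapseLines` and its option flags are illustrative assumptions, not the tool's actual API:

```javascript
// Normalise each line into a comparison key according to the options, count
// occurrences, and keep the first-seen form of each unique line.
function collapseLines(text, { caseInsensitive = false, trim = false, showCount = false } = {}) {
  const counts = new Map(); // key -> { line: first-seen form, count }
  for (const raw of text.split("\n")) {
    let key = trim ? raw.trim() : raw;
    if (caseInsensitive) key = key.toLowerCase();
    const entry = counts.get(key);
    if (entry) entry.count += 1;
    else counts.set(key, { line: trim ? raw.trim() : raw, count: 1 });
  }
  return [...counts.values()]
    .map(({ line, count }) => (showCount ? `${line} × ${count}` : line))
    .join("\n");
}

console.log(collapseLines("Apple\napple\n apple ", { caseInsensitive: true, trim: true, showCount: true }));
// Apple × 3
```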
**Real-world uses** Developers maintaining `.gitignore` files accumulate duplicates over time; SEO professionals deduplicate keyword lists before analysis; data engineers clean import files; log analysts remove repeated error messages to surface the distinct error types. Because all logic runs locally in the browser, the tool processes millions of lines in seconds, limited only by available memory.
**Sort after deduplication** Alphabetical sort after deduplication produces canonical output for dictionary files, configuration lists, and any file where line order should be deterministic for version control diffs.
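As a sketch, sorting after Set-based deduplication is a one-liner (the `canonical` helper name is hypothetical):

```javascript
// Dedupe, then sort: deterministic output for clean version-control diffs.
const canonical = (text) => [...new Set(text.split("\n"))].sort().join("\n");

console.log(canonical("zebra\napple\nzebra\nmango"));
// apple
// mango
// zebra
```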
Privacy: all processing runs in the browser. No list data is transmitted.
Frequently Asked Questions
- **What is the algorithmic complexity of deduplication?** Hash-set deduplication (JavaScript `Set`): O(n) time, O(n) space. For each line, check whether it is already in the Set; if not, add it and keep it; if yes, discard it. This maintains insertion order. Sort-based deduplication: O(n log n) time, O(1) extra space; sort the array, then iterate, keeping only lines that differ from the previous one. This loses the original order. For datasets larger than available memory, use sort-based deduplication with an external merge sort on disk. For billions of records, Bloom filters can pre-filter likely duplicates before the expensive exact comparison.
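The sort-based path reduces to a streaming adjacent-compare, which is what makes external-sort deduplication memory-bounded: only the previous line is held in memory. An illustrative sketch:

```javascript
// Dedupe an already-sorted stream of lines. Works for input of any size because
// it never materialises the full dataset, only the previous line.
function* dedupeSortedStream(sortedLines) {
  let prev;
  for (const line of sortedLines) {
    if (line !== prev) yield line;
    prev = line;
  }
}

console.log([...dedupeSortedStream(["a", "a", "b", "b", "c"])]); // ["a", "b", "c"]
```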
- **Are email addresses case-sensitive for deduplication purposes?** Per RFC 5321 (SMTP), the domain part of an email address (after @) is case-insensitive: EXAMPLE.COM and example.com are the same domain. The local part (before @) is technically case-sensitive per the RFC, but in practice virtually all providers (Gmail, Outlook, Yahoo) treat it as case-insensitive. For email list deduplication, lowercasing the entire address before comparison is therefore the correct approach: JOHN@EXAMPLE.COM, john@example.com, and John@Example.COM all represent the same address at any major provider.
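A sketch of that normalisation, following the pragmatic convention above of lowercasing the whole address:

```javascript
// Normalise an email for deduplication: trim stray whitespace, lowercase everything.
const normaliseEmail = (email) => email.trim().toLowerCase();

const unique = [...new Set(
  ["JOHN@EXAMPLE.COM", "john@example.com", "John@Example.COM"].map(normaliseEmail)
)];
console.log(unique); // ["john@example.com"]
```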
- **How are Unicode text and BOMs handled?** The tool operates on Unicode text (UTF-8/UTF-16, as decoded by the browser). Output preserves all Unicode characters, including emoji, CJK characters, accented letters, and RTL text, and uses the line ending format you select (LF/CRLF/CR). BOM (Byte Order Mark) handling: the tool strips a leading BOM from input so it is not treated as part of the first line, and can optionally re-add it to output. Binary files call for a hex dump rather than text deduplication; this tool is for text content.
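BOM stripping amounts to checking for U+FEFF at position 0 of the decoded text; a minimal sketch:

```javascript
// Remove a leading BOM (U+FEFF) so it is not counted as part of the first line.
// Once the browser has decoded UTF-8 or UTF-16 bytes, any BOM appears as this
// single code unit at index 0.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

console.log(stripBom("\uFEFFfirst\nsecond").split("\n")[0]); // "first"
```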
- **Can I deduplicate CSV data with this tool?** CSV fields can legitimately contain newlines when the field is enclosed in double quotes (per RFC 4180), so naively splitting a CSV on newlines and deduplicating would break multi-line fields. Proper CSV deduplication means parsing with a CSV-aware library that respects quoting rules, hashing each parsed row, and filtering duplicate rows. The Text Line Randomizer and Duplicate Line Collapser are designed for line-delimited text where each line is an independent item; for CSV, use a CSV-specific tool. As a workaround, export to a format without multi-line fields first.
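To show why quote-awareness matters, here is a minimal record splitter sketch. It is illustrative only: it ignores CRLF endings and other RFC 4180 details that a real CSV library handles.

```javascript
// Split CSV text into records, treating newlines inside double-quoted fields as
// part of the field rather than record separators. Records can then be hashed
// and deduplicated whole.
function splitCsvRecords(text) {
  const records = [];
  let current = "";
  let inQuotes = false;
  for (const ch of text) {
    if (ch === '"') inQuotes = !inQuotes; // an escaped "" toggles twice, net effect correct
    if (ch === "\n" && !inQuotes) {
      records.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  if (current) records.push(current);
  return records;
}

const csv = 'id,note\n1,"line one\nline two"\n1,"line one\nline two"';
const uniqueRecords = [...new Set(splitCsvRecords(csv))];
console.log(uniqueRecords.length); // 2 (header plus one deduplicated data record)
```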