Hamming Distance Calculator
Hamming distance
2
How it works
Hamming distance between two strings of equal length is the number of positions at which the symbols differ. For binary strings: 1010 vs. 1001 → positions 2 and 4 differ → Hamming distance = 2.
**Hamming distance in error detection and correction** A code with minimum Hamming distance d can: detect up to d-1 errors (any pattern of ≤d-1 errors changes the received word to a non-codeword). Correct up to ⌊(d-1)/2⌋ errors (the corrupted word is closest to the original codeword). Simple parity has minimum distance 2: detects 1 error, corrects 0. Hamming(7,4) code has minimum distance 3: detects 2 errors, corrects 1.
**Applications in DNA sequencing** Hamming distance measures genetic similarity between DNA sequences. Pairs of sequences with low Hamming distance are more closely related. Sequence alignment algorithms optimize for Hamming distance and related metrics (edit distance, which allows insertions/deletions).
**Applications in cryptography and security** Fuzzy matching: biometric authentication compares fingerprint or iris hash codes using Hamming distance — matches within a threshold are accepted despite minor capture variations. Locality-sensitive hashing uses Hamming distance properties to efficiently find similar items in large datasets.
**Normalized Hamming distance** Divide raw Hamming distance by string length to get a fractional similarity measure (0 = identical, 1 = completely different). Two 100-bit codes with 10 differences have normalized Hamming distance 0.1 — 90% similar. This normalization enables comparison across different string lengths.
Frequently Asked Questions
- A code that can correct t errors must have minimum Hamming distance d_min ≥ 2t + 1. The sphere of radius t around each codeword (all words within t bit flips) must not overlap. To correct 1 error: d_min ≥ 3. To correct 2 errors: d_min ≥ 5. To detect (but not correct) t errors: d_min ≥ t + 1. Hamming(7,4): d_min = 3, corrects 1 error. BCH codes can achieve any desired d_min. Reed-Solomon codes operate on multi-bit symbols and are widely used in CDs, DVDs, QR codes, and satellite communications for burst error correction.
- DNA sequences use a 4-letter alphabet (A, T, G, C). Hamming distance between two sequences of equal length counts positions where the nucleotides differ. Example: ATCG vs. AACG → position 2 differs → Hamming distance = 1. For sequences of different lengths, Levenshtein (edit) distance is used instead (counts substitutions, insertions, deletions). Hamming distance is used in: primer specificity analysis (ensure primers bind only to target sequences), CRISPR off-target prediction (guide RNA vs. genome), and phylogenetic distance computation for closely related sequences.
- The Hamming bound limits how many codewords can exist in a code of length n with minimum distance d: M ≤ 2^n / V(n, t), where V(n,t) = Σ C(n,i) for i=0 to t, is the volume of a Hamming sphere of radius t = ⌊(d-1)/2⌋. Codes that meet this bound with equality are 'perfect codes' — they pack spheres with no wasted space. The only nontrivial perfect binary codes are Hamming codes, the Golay code (length 23, d=7, corrects 3 errors), and repetition codes. Most practical codes are not perfect — there's wasted space in code space.
- K-nearest neighbors on binary features: Hamming distance efficiently finds neighbors by counting bit differences — a popcount CPU instruction computes this in one cycle on modern CPUs. Random hashing with Hamming distance: SimHash and MinHash approximate nearest-neighbor search in high dimensions. Binary neural network activations: Hamming distance between activation patterns measures functional similarity of neural network layers. Genome-wide association studies (GWAS): Hamming distance between SNP (single nucleotide polymorphism) profiles measures genetic distance. In all these cases, Hamming distance is computationally efficient compared to Euclidean distance for high-dimensional binary data.