Stratified Shuffle
How it works
Stratified shuffling randomly reorders a dataset while preserving the proportional representation of each class or group in the output. This is essential in machine learning train/test splits, where a naive random shuffle can concentrate rare-class examples in one partition, leaving the other partition unrepresentative of the true class distribution.
**Why stratification matters** Consider a dataset with 95% class A and 5% class B (1000 rows total, 50 class B rows). A random 80/20 split might put 45 class B rows in training and only 5 in the test set — or vice versa. With stratified splitting, the 80% training set contains exactly 40 class B rows (80% of 50) and the test set has 10 (20% of 50), maintaining the exact 95/5 class ratio in both partitions.
**Implementation** For each unique stratum value: collect all row indices belonging to that stratum, shuffle them independently, then distribute them proportionally to each output partition. Remainder rows (when stratum size × split_ratio is non-integer) can be allocated by rounding rules. Scikit-learn's train_test_split(stratify=y) implements this; this tool provides a browser-based equivalent for CSV data.
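The per-stratum procedure above can be sketched in a few lines of Python. This is a minimal illustration, not this tool's actual implementation; `stratified_split` and its `stratum_of` callback are hypothetical names, and remainder rows are allocated by rounding the training count.

```python
import random
from collections import defaultdict

def stratified_split(rows, stratum_of, train_frac=0.8, seed=42):
    """Split rows into (train, test), preserving each stratum's proportion.

    rows: list of records; stratum_of: maps a row to its stratum key.
    When stratum_size * train_frac is non-integer, the training count
    is determined by rounding.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[stratum_of(row)].append(row)

    train, test = [], []
    for stratum_rows in by_stratum.values():
        rng.shuffle(stratum_rows)  # shuffle within the stratum only
        n_train = round(len(stratum_rows) * train_frac)
        train.extend(stratum_rows[:n_train])
        test.extend(stratum_rows[n_train:])
    return train, test
```

On the 95/5 example above (950 class A rows, 50 class B rows), this yields exactly 40 class B rows in training and 10 in test.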
**Multi-column stratification** Stratifying on multiple columns simultaneously (e.g., class × gender × age_group) requires creating a composite stratum key. With many stratification variables, individual strata may be very small, making proportional splitting infeasible — in that case, combine rare strata before splitting.
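Building the composite key and merging rare strata can be sketched as follows; `composite_strata` is a hypothetical helper (rows are assumed to be dicts), and the 10-row threshold matches the guidance in the FAQ below.

```python
from collections import Counter

def composite_strata(rows, cols, min_size=10):
    """Assign each row a composite stratum key (column values joined
    by '_'), merging strata smaller than min_size into 'other'."""
    keys = ["_".join(str(row[c]) for c in cols) for row in rows]
    counts = Counter(keys)
    return [k if counts[k] >= min_size else "other" for k in keys]
```

The returned labels can then be used as the stratum key in any single-column stratified split.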
Frequently Asked Questions
- For a dataset with 95% class A and 5% class B (1000 rows, 50 class B): a random 80/20 split might put all 50 class B rows in training (leaving 0 in test) or only 3 in test by chance. With stratified split: training has exactly 40 class B rows, test has 10 — both sets have the exact 95/5 ratio. Models trained without stratification may never see minority-class examples in training, or the test set may not reflect the true class distribution, making evaluation metrics misleading.
- sklearn.model_selection.train_test_split(X, y, test_size=0.2, stratify=y, random_state=42). The stratify=y parameter triggers stratification on the label array y. For multi-class problems, each class is split proportionally. For regression tasks (continuous y), stratification does not apply directly, since nearly every value forms its own one-row class: use KFold or a plain random split instead, or bin the target into quantiles and stratify on the bins. stratify=None (the default) disables stratification (plain random split).
- Multi-column stratification creates a composite stratum key by combining values from multiple columns: e.g., class + gender + age_group → 'A_male_25-34'. Each unique combination forms a stratum. The challenge: with many stratification variables, individual strata can be very small (1–2 rows), making proportional splitting infeasible. Solution: combine rare strata (groups with < 10 rows) into an 'other' category before splitting. Target 10+ rows per stratum as a minimum.
- Stratification guarantees proportional class representation but cannot control for all possible confounds. If class A rows happen to be concentrated in a specific date range (seasonality) or geographic region, stratifying only on class leaves the temporal or geographic imbalance unaddressed. For maximum representativeness, stratify on all major known confounders. For causal inference tasks, stratified splitting is insufficient — use blocked randomization designs.
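The quantile-binning workaround for continuous targets mentioned in the FAQ can be sketched with the standard library alone; `quantile_bins` is a hypothetical helper, and the resulting labels would be passed as the stratification key (e.g., stratify=labels in scikit-learn).

```python
def quantile_bins(values, n_bins=4):
    """Label each continuous value with its quantile bin (0..n_bins-1),
    so a continuous target can be stratified via the bin labels."""
    ordered = sorted(values)
    # Bin edges at the 1/n, 2/n, ... quantile positions.
    edges = [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]
    return [sum(v >= e for e in edges) for v in values]
```

With four bins, a stratified split on these labels keeps each quartile of the target distribution proportionally represented in both partitions.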