Full-scale PCA pushes scikit-learn past 128GB
A Reddit discussion in r/MachineLearning highlights a familiar pain point in representation learning: trying to compute a full PCA basis for a roughly 40k × 40k covariance matrix with `sklearn.decomposition.PCA(svd_solver="full")` can fail even on a 128GB machine. The post captures the gap between convenient ML APIs and the brutal memory, runtime, and numerical demands of exact dense eigendecomposition at modern embedding sizes.
This is less a scikit-learn failure than a reality check on how quickly “standard” PCA turns into HPC once you demand the full spectrum. For AI researchers, the interesting part is not the crash itself but where the abstraction boundary breaks and lower-level linear algebra choices start to matter more than the model code.
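The scale of the problem is easy to see with back-of-envelope arithmetic. The figures below are a sketch for the d = 40,000 case described in the post, assuming float64; exact peak usage depends on which LAPACK driver scikit-learn dispatches to and on the shape of the original data matrix:

```python
# Rough memory estimate for dense PCA at d = 40_000 features (float64).
d = 40_000

cov_bytes = d * d * 8    # the covariance matrix alone
basis_bytes = d * d * 8  # a full d x d eigenvector basis, same size again

print(f"covariance: {cov_bytes / 1e9:.1f} GB")  # 12.8 GB
print(f"full basis: {basis_bytes / 1e9:.1f} GB")  # another 12.8 GB
# Exact dense solvers also allocate workspace on the order of several
# additional d x d arrays, and the centered data matrix itself must be
# resident, so peak usage is a multiple of these figures.
```

Each d x d array is "only" ~12.8 GB, but a full decomposition holds several of them simultaneously, which is how a 128 GB machine runs out of headroom.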
- scikit-learn’s docs confirm that `svd_solver="full"` uses exact LAPACK SVD, while `covariance_eigh` explicitly warns that materializing the covariance becomes impractical for large feature counts
- The request for the full PCA basis is the key constraint: top-k PCA has practical randomized and truncated options, but full decomposition removes most of the usual escape hatches
- In practice, this kind of workload often pushes teams toward lower-level `eigh` workflows, GPU solvers, or distributed/out-of-core numerical libraries instead of a high-level sklearn wrapper
- It is a useful signal for representation-learning teams that feature dimensionality alone can become a first-class systems problem long before model training does
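The escape hatches above can be sketched on a toy-sized problem. The solver names (`svd_solver="randomized"`, `scipy.linalg.eigh` with `subset_by_index`) are real scikit-learn/SciPy APIs; the matrix sizes and data are illustrative stand-ins for the 40k-dimensional case, where only a top-k basis is requested rather than the full spectrum:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))  # small stand-in; the real case is d ~ 40k
k = 16

# Option 1: sklearn's randomized solver computes a top-k basis without
# ever running the full dense SVD.
pca = PCA(n_components=k, svd_solver="randomized", random_state=0).fit(X)

# Option 2: drop to LAPACK via scipy -- eigh on the covariance, requesting
# only the leading eigenpairs with subset_by_index.
Xc = X - X.mean(axis=0)
cov = (Xc.T @ Xc) / (len(X) - 1)
d = cov.shape[0]
vals, vecs = eigh(cov, subset_by_index=[d - k, d - 1])  # ascending order

print(vals[::-1][:3])               # leading eigenvalues, largest first
print(pca.explained_variance_[:3])  # should be close to the same values
```

Option 2 still materializes the d x d covariance (~12.8 GB at d = 40k), which is feasible on a 128 GB machine when only k eigenpairs are requested; it is the full eigenbasis plus solver workspace that breaks the budget.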
DISCOVERED: 2026-03-09
PUBLISHED: 2026-03-09
AUTHOR: nat-abhishek