Here is a brief post about our recent paper on using PCA to visualise large pedigrees. This is the result of a collaboration between Hanbin Lee (University of Michigan) and colleagues at the Roslin Institute (University of Edinburgh). Reference and links at the bottom.
Pedigrees indicate the relationships between relatives. When hearing the word pedigree, many think of a family tree listing the ancestors of one or a small group of individuals, as you would see for some royal family or a race horse. Such a tree has one or a few individuals in the present generation and it grows as one goes back in time. A tool for studying the past. But pedigrees are also extremely important forward-looking tools used in animal and plant breeding and in species conservation. Pedigrees of breeding programs can easily contain thousands or even millions of individuals, many of them in the current generation. Harder to picture than a family tree.
Pedigrees can be used to predict the phenotype of an individual if phenotype data exist for that individual’s relatives. This is because a pedigree indicates the expected sharing of genetic material between individuals. If genotype data are available for relatives, it is also possible to make reasonable guesses about an individual’s genotype. This helps breeders decide which parents to choose to produce the next generation of dairy cows or laying hens, and how to minimise inbreeding in endangered species and conserved populations. So how can these large pedigrees be visualised? One way is via principal component analysis (PCA) of a pedigree matrix.
What is PCA and how is it used on pedigrees?
Principal component analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data (a matrix) into a lower-dimensional space while preserving as much variance as possible. One famous example is the PCA plot by Novembre et al. (2008). This PCA plot generated from genotype data of Europeans roughly mirrored the map of Europe, highlighting where people come from (I’ve added a link at the bottom of this post). PCA identifies the directions (principal components) along which a dataset varies the most. The first principal component captures the most variance, the second captures the second most, and so on. By projecting the data onto these principal components, PCA allows one to visualize complex datasets often in two or three dimensions, making it easier to identify patterns, clusters, and outliers.
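To make the idea concrete, here is a minimal sketch of PCA on a small made-up dataset, using NumPy's singular value decomposition. The function name `pca` and the toy data are my own illustration and are not taken from the paper or the package.

```python
import numpy as np

def pca(data, n_components=2):
    """Return (scores, explained_variance_ratio) for a samples-by-features array."""
    X = np.asarray(data, dtype=float)
    Xc = X - X.mean(axis=0)             # centre each column (feature)
    # SVD of the centred data: the rows of Vt are the principal directions
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T   # coordinates of each sample on the PCs
    var = s**2 / (X.shape[0] - 1)       # variance captured by each component
    return scores, var[:n_components] / var.sum()

# Toy data (an assumption for illustration): 200 samples, 5 features,
# with most of the variation lying along a single direction plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
loadings = np.array([[3.0, 2.0, 1.0, 0.5, 0.1]])
toy = latent @ loadings + 0.3 * rng.normal(size=(200, 5))

scores, ratio = pca(toy, n_components=2)
print(scores.shape)   # (200, 2)
print(ratio)          # share of the total variance captured by PC1 and PC2
```

Plotting the two columns of `scores` against each other gives the familiar two-dimensional PCA plot.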
Any pedigree can be represented as a matrix, such as the additive relationship matrix (aka the numerator relationship matrix). This matrix quantifies the genetic relationships between individuals based on their pedigree information. Each off-diagonal entry represents the expected proportion of genetic material shared by a pair of individuals, while each diagonal entry is one plus that individual's inbreeding coefficient. For example, in the absence of inbreeding, full siblings are expected to share 50% of their genetic material and half-siblings 25%. By constructing this matrix, one can capture the genetic structure of a population, which can then be analysed using PCA to visualise relationships and patterns among individuals.
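For a small pedigree, the additive relationship matrix can be built with the classic tabular method. The sketch below is an illustration only: the function name and the seven-individual pedigree are made up, and it assumes individuals are numbered so that parents always come before their offspring.

```python
def additive_relationship_matrix(sires, dams):
    """Tabular method. Individuals are 0..n-1, ordered so that parents
    precede offspring; an unknown parent is given as None."""
    n = len(sires)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        s, d = sires[i], dams[i]
        # Off-diagonal: average of j's relationships with i's two parents
        for j in range(i):
            a = 0.0
            if s is not None:
                a += 0.5 * A[j][s]
            if d is not None:
                a += 0.5 * A[j][d]
            A[i][j] = A[j][i] = a
        # Diagonal: 1 + inbreeding coefficient (half the parents' relationship)
        f = 0.5 * A[s][d] if s is not None and d is not None else 0.0
        A[i][i] = 1.0 + f
    return A

# Hypothetical pedigree: 0, 1, 2 are founders; 3 and 4 are full sibs (0 x 1);
# 5 is a half sib of both (0 x 2); 6 comes from mating the full sibs 3 and 4
sires = [None, None, None, 0, 0, 0, 3]
dams = [None, None, None, 1, 1, 2, 4]
A = additive_relationship_matrix(sires, dams)

print(A[3][4])  # full sibs: 0.5
print(A[3][5])  # half sibs: 0.25
print(A[6][6])  # inbred offspring of full sibs: 1.25
```

This builds the full n-by-n matrix, so it is only practical for small pedigrees.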
About our paper
The paper accompanies our user-friendly R package randPedPCA, which implements this approach. The package is designed to handle large pedigrees efficiently by leveraging sparse matrix representations and optimized algorithms for matrix-vector products. This allows users to perform PCA on pedigrees with millions of individuals without the need to compute the full additive relationship matrix, which can be computationally prohibitive. The paper was published open-access in Genetics Selection Evolution (links at the bottom).
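To give a feel for why the full matrix is not needed, here is a rough sketch of multiplying the additive relationship matrix A by a vector using only the pedigree itself. It relies on the standard factorisation A = T D T', where T = (I - P)^-1, P holds 0.5 at each individual's parent positions, and D is the diagonal matrix of Mendelian sampling variances. This is my own illustration and not the package's code: randPedPCA's internals may differ, the function name is hypothetical, and the sketch assumes parents are numbered before their offspring and that inbreeding coefficients are supplied (or are all zero).

```python
def relationship_matvec(sires, dams, v, inbreeding=None):
    """Compute A @ v without forming A, via A = T D T' with T = (I - P)^-1.
    Individuals are 0..n-1, parents before offspring, unknown parent = None.
    `inbreeding` lists each individual's inbreeding coefficient; if omitted,
    all are assumed to be zero (an assumption of this sketch)."""
    n = len(sires)
    F = inbreeding if inbreeding is not None else [0.0] * n
    x = list(v)
    # Step 1: x = T' v. Pass values from offspring up to parents, youngest first
    for i in range(n - 1, -1, -1):
        for p in (sires[i], dams[i]):
            if p is not None:
                x[p] += 0.5 * x[i]
    # Step 2: x = D x. Scale by each individual's Mendelian sampling variance
    for i in range(n):
        d = 1.0
        for p in (sires[i], dams[i]):
            if p is not None:
                d -= 0.25 * (1.0 + F[p])
        x[i] *= d
    # Step 3: x = T x. Pass values from parents down to offspring, oldest first
    for i in range(n):
        for p in (sires[i], dams[i]):
            if p is not None:
                x[i] += 0.5 * x[p]
    return x

# Hypothetical pedigree: 0, 1, 2 founders; 3 and 4 full sibs (0 x 1);
# 5 a half sib (0 x 2); 6 the offspring of full sibs 3 and 4
sires = [None, None, None, 0, 0, 0, 3]
dams = [None, None, None, 1, 1, 2, 4]
F = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

# Multiplying by a unit vector returns one column of A (here: individual 3)
e3 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
col3 = relationship_matvec(sires, dams, e3, inbreeding=F)
print(col3)  # [0.5, 0.5, 0.0, 1.0, 0.5, 0.25, 0.75]
```

Each product takes three passes over the pedigree, so its cost grows linearly with the number of individuals. Randomised and iterative methods for finding principal components need only products of this kind, which is why they scale to very large pedigrees.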
In the paper, we also compare PCA plots generated from pedigree-based relationship matrices to plots generated from genomic relationship matrices (calculated from genotype data), and we discuss whether and how pedigree PCAs should be centred (see plot below). The paper further contains a PCA plot of a pedigree with almost 1.5 million individuals of the UK’s favourite dog breed, the Labrador Retriever.

Feel free to reach out if you have any questions or comments about the paper or the package!
Further reading
- Novembre et al.’s 2008 paper on PCA of genotype data from people of European descent, available here
- Our paper: Lee H, Craddock RF, Gorjanc G, & Becher H (2025). randPedPCA: Rapid approximation of principal components from large pedigrees. Genetics Selection Evolution, 57(1), 46
- CRAN: https://cran.r-project.org/package=randPedPCA
- GitHub: https://github.com/HighlanderLab/RandPedPCA