Introduction
Livestock genomics has transformed agriculture by providing powerful tools to improve traits such as production efficiency, disease resistance, feed utilization, and adaptability. Among livestock, cattle are central to global food security and economic growth, with their genetic diversity shaped by domestication, selective breeding, and adaptation to diverse environments. Cattle, including Bos taurus and Bos indicus, exhibit extensive variation affecting milk production, meat quality, and resilience to disease. Unlocking the genetic basis of this diversity is essential for sustainable breeding programs, which require a full understanding of genetic variation. Genetic studies have made significant progress in identifying single-nucleotide polymorphisms (SNPs) associated with cattle production and health traits [1]. However, more complex forms of variation—structural variants (SVs ≥ 50 bp), including insertions, deletions, inversions, and translocations—cover larger genomic regions (Fig. 1A) and often exert stronger functional effects, such as altering gene dosage, modifying regulatory elements, or unmasking recessive alleles [2,3,4,5]. While SNP-based genome-wide association studies (GWAS) have identified thousands of variants linked to complex traits, they typically explain only a fraction of genetic variance, leaving much of the so-called “missing heritability” unresolved [6, 7]. Integrating SVs alongside SNPs offers a more complete understanding of the genomic architecture underlying complex traits [8]. In humans, SVs explain a substantial share of gene expression differences and are enriched at GWAS loci, particularly when larger than 20 kb [9,10,11]. Yet many SVs lie in repetitive regions, making them difficult to detect with short-read sequencing or SNP arrays [12].

Fig. 1 Investigating structural variation in cattle with long reads and pangenome graphs. A Types of structural variants (SVs) identified in cattle, including deletions, insertions, inversions, and translocations, highlighting the complexity of genomic variation. B Comparison of long-read versus short-read sequencing, illustrating the superior ability of long reads to detect and resolve SVs across the genome. C Conceptual representation of bovine species, subspecies, and breed relationships within a pangenome framework. Single-nucleotide variations and structural variations are modeled as alternative paths, capturing diversity across multiple lineages (modified from Smith et al. [13]). D Icon representing the Ruminant T2T Consortium, emphasizing the collaborative effort in generating near-complete ruminant genome assemblies.
Early cattle studies used microarrays and short-read sequencing to identify copy number variants (CNVs) and other SVs [14,15,16,17,18,19,20,21]. Our studies revealed that ~ 3.1% of the bovine genome comprises recent duplications (≥ 1 kb, ≥ 90% identity), often clustered in tandem arrays [22,23,24]. Genome-wide surveys across diverse breeds, supported by orthogonal validation methods like FISH and qPCR, linked SVs to parasite resistance, feed efficiency, and milk traits [25,26,27]. Roughly 75% of deletions and duplications were found in linkage disequilibrium with SNPs, but ~ 25% were not captured by SNP arrays, underscoring the need for SV-aware genotyping [19]. However, short-read sequencing approaches were limited: they detected only 30%–70% of SVs and often misclassified variants in repetitive regions, with false discovery rates as high as 85%. Standard approaches such as read-pair, read-depth, and split-read analyses could not resolve repetitive or structurally complex regions, which are precisely where SVs are enriched. Moreover, most SV callers did not assign variants to haplotypes, restricting downstream association with complex traits.
Recognition of the limits of short-read sequencing in complex regions, as emphasized by the Genome in a Bottle Project [28,29,30], set the stage for long-read sequencing. Using Pacific Biosciences (PacBio) HiFi, Oxford Nanopore Technologies (ONT), and Hi-C technologies, the Telomere-to-Telomere (T2T) consortium has delivered the first human gapless genomes, resolving repetitive and structurally complex regions with unprecedented clarity [31,32,33,34]. The recent proliferation of high-quality, chromosome-level cattle assemblies mirrors trends in human genomics and provides new opportunities to close gaps and capture population diversity, opening the door to a deeper understanding of structural variation, adaptation, and trait biology. The pangenome concept, first introduced in microbial genomics, offers a powerful framework [35]. A pangenome captures both core sequences shared across individuals and variable sequences found in subsets of populations. Built from phased, haplotype-resolved assemblies, pangenomes improve SV discovery and support comparative genomics, evolutionary studies, and functional analyses such as epigenomics and metagenomics [10, 36,37,38]. While cattle CNVs and SVs have previously been reviewed using both short- and long-read data, this article focuses on cattle-specific advances while also synthesizing lessons from human genomics and placing cattle research within a broader comparative genomics framework that highlights recent pangenome and T2T developments.
Long-read sequencing
The field of genome research has been transformed by long-read sequencing and complementary long-range mapping approaches, which together now deliver nearly complete genome assemblies and unprecedented SV resolution (Fig. 1B). PacBio HiFi sequencing, with error rates near 0.1% through circular consensus sequencing (CCS), has become a standard for variant discovery, assembly, and epigenomics [39, 40]. The recent PacBio Revio platform further expanded throughput (360 Gb/d), enabling sequencing of ~ 1,300 genomes per year at ~ $1,000 each, and incorporates DeepConsensus to improve accuracy [41]. ONT continues to complement this with ultra-long reads, often hundreds of kilobases in length, while Hi-C provides long-range chromatin contacts that aid phasing and scaffolding [42].
Two main strategies have emerged for SV detection using long reads: read-based and assembly-based approaches. Read-based methods map long reads to a reference genome using aligners such as Minimap2, NGMLR, or lra [43,44,45], followed by SV calling with programs like cuteSV [46], SVision [47], Sniffles2 [48], or pbsv. These approaches perform well at low coverage (~ 5 × HiFi) and handle heterozygous SVs and duplications, but are constrained by reference bias. In contrast, assembly-based approaches first build de novo genome assemblies with assemblers such as Hifiasm [49] and then call SVs from whole-genome alignments with tools such as SVIM-asm [50] and PAV [51]. These methods excel at discovering large insertions and novel sequences but require higher coverage (~ 20 ×) and more computational resources. Two recent benchmarking papers [52, 53] have outlined the advantages, limitations, input requirements, and example applications of both strategies. They show that read-based methods achieve high recall at low coverage, while assembly-based methods detect a broader range of variant classes and are more stable across datasets. Both papers emphasized that integrating read- and assembly-based calls is essential for comprehensive SV discovery.
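The integration of read-based and assembly-based calls mentioned above can be illustrated with a minimal Python sketch. It is not the procedure used in the cited benchmarks: the `SV` record, the `same_event` and `merge_calls` helpers, and the matching thresholds (500 bp breakpoint distance, 0.7 size ratio) are all assumptions chosen for illustration, and production merging tools apply more elaborate rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int      # 1-based start position
    svtype: str   # e.g. "DEL", "INS", "DUP", "INV"
    length: int   # absolute SV length in bp
    source: str   # call set that reported the variant


def same_event(a: SV, b: SV, max_dist: int = 500, min_size_ratio: float = 0.7) -> bool:
    """Treat two calls as one event if chromosome and type match, the start
    positions are close, and the lengths are similar (illustrative thresholds)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    small, large = sorted((a.length, b.length))
    return large > 0 and small / large >= min_size_ratio


def merge_calls(read_based: list[SV], assembly_based: list[SV]) -> list[tuple[SV, set[str]]]:
    """Greedy union of two call sets. Each merged record keeps the first call
    seen as representative, plus the set of sources that support it."""
    merged: list[tuple[SV, set[str]]] = []
    for call in sorted(read_based + assembly_based, key=lambda s: (s.chrom, s.pos)):
        for representative, sources in merged:
            if same_event(representative, call):
                sources.add(call.source)
                break
        else:
            merged.append((call, {call.source}))
    return merged


# Hypothetical calls, invented for this example
read_calls = [
    SV("chr1", 10_000, "DEL", 320, "read"),
    SV("chr1", 55_000, "DUP", 1_200, "read"),
]
assembly_calls = [
    SV("chr1", 10_040, "DEL", 300, "assembly"),
    SV("chr2", 7_500, "INS", 2_400, "assembly"),
]

for sv, sources in merge_calls(read_calls, assembly_calls):
    print(sv.chrom, sv.pos, sv.svtype, sv.length, sorted(sources))
```

In this toy input the two deletion calls near chr1:10,000 collapse into one event supported by both strategies, while the duplication and the insertion are each retained from a single strategy, which is the complementary behavior the benchmarks describe.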
The same technologies are now being applied in livestock, enabling highly contiguous and accurate assemblies that improve SV detection and functional annotation. In cattle, such methods will advance the characterization of breed-specific variation, illuminate adaptive responses, and strengthen genomic prediction by capturing variants previously inaccessible with short-read sequencing. Taken together, these developments make long-read sequencing a major shift toward comprehensive catalogs of genomic diversity across livestock species.
Pangenome graphs
The traditional reliance on single reference genomes, often derived from specific breeds, has introduced significant biases in livestock genomics, excluding breed-specific or rare SVs and limiting our understanding of within-species diversity. This limitation, together with the analytical challenges of high-resolution long-read data (PacBio, ONT), has driven a paradigm shift toward pangenomic approaches.
A pangenome captures the collective genomic diversity of a species, comprising both a core genome shared across all individuals and a variable genome present only in some (Fig. 1C). Unlike a single reference genome, which provides only a partial view, a pangenome provides a richer representation of structural, single-nucleotide, and insertion–deletion variation. When represented as a graph, shared sequences form nodes and alternative haplotypes form paths, with “bubbles” reflecting structural variants. Such frameworks improve both variant discovery and genotyping accuracy, particularly in repetitive and structurally complex regions.
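The graph representation described above can be made concrete with a toy example in Python. This is a sketch under simplifying assumptions, not the data model of any particular toolkit: the graph is acyclic, each haplotype visits a node at most once, shared nodes occur in the same order in every haplotype, and the node sequences and haplotype names are invented and kept far shorter than a real SV.

```python
# Toy sequence graph: node id -> sequence (invented, shortened for readability)
nodes = {1: "ACGT", 2: "G", 3: "GATTACAGATTACA", 4: "TTGC"}

# Each haplotype is a walk (path) through the graph; names are illustrative
paths = {
    "hapA": [1, 2, 4],  # carries the short allele in node 2
    "hapB": [1, 3, 4],  # carries the longer alternative allele in node 3
    "hapC": [1, 4],     # skips the variable region entirely (deletion-like)
}


def spell(path):
    """Concatenate node sequences along a path to recover the haplotype."""
    return "".join(nodes[n] for n in path)


def find_bubbles(paths):
    """Report regions between consecutive shared nodes where haplotypes diverge.

    Returns a list of (source, sink, alleles), where alleles maps each inner
    walk (a tuple of node ids) to the haplotypes that take it.
    """
    walks = list(paths.values())
    # Nodes visited by every haplotype, in the order of the first walk
    core = [n for n in walks[0] if all(n in w for w in walks)]
    bubbles = []
    for source, sink in zip(core, core[1:]):
        alleles = {}
        for name, walk in paths.items():
            inner = tuple(walk[walk.index(source) + 1 : walk.index(sink)])
            alleles.setdefault(inner, []).append(name)
        if len(alleles) > 1:  # more than one route between source and sink
            bubbles.append((source, sink, alleles))
    return bubbles


for name, path in paths.items():
    print(name, spell(path))

for source, sink, alleles in find_bubbles(paths):
    print(f"bubble between node {source} and node {sink}:")
    for inner, haplotypes in alleles.items():
        allele_seq = "".join(nodes[n] for n in inner) or "(empty)"
        print("  ", allele_seq, haplotypes)
```

Here nodes 1 and 4 are shared by all three haplotypes, and the single bubble between them holds three alleles, one of which is the empty walk. Real pangenome graphs contain cycles, nested bubbles, and inversions that this sketch does not handle.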
The past few years have seen rapid progress in graph-based methodologies. Inspired by initiatives like the Human Pangenome Reference Consortium (HPRC), which aims to generate a diverse set of high-quality assemblies [34, 54], new computational frameworks have been developed, including vg (the Variation Graph toolkit) [55], Minigraph [56], Minigraph-Cactus (MC) [57], and the PanGenome Graph Builder (PGGB) [58]. These tools enable the construction of pangenome graphs with increasing sensitivity and scalability [59, 60]. Variation-aware genotyping tools such as PanGenie [51] and Giraffe [61] further extend these resources to short-read datasets, allowing efficient SV genotyping at a population scale [62]. They genotype SNPs and SVs against large graph references, achieving higher accuracy than linear mappers and enabling imputation panels derived from long-read-based catalogs. For long reads, emerging frameworks such as the Sequence Alignment Graph Algorithm (SAGA) [63] and graph-aware pipelines in Dynamic Read Analysis for GENomics (DRAGEN) [64] support SV calling, annotation, and genotyping within graph-based genomes, including the ability to augment graphs with novel alleles. These approaches, still maturing, promise to unify read-based and assembly-based methods in a graph framework that reduces reference bias and improves SV classification, especially in repetitive or complex genomic regions.
A recent graph-based human SV study used ONT long reads to sequence 1,019 genomes from 26 populations in the 1000 Genomes Project, identifying 167,291 sequence-resolved SVs and revealing mechanisms like LINE-1 and SVA transductions [63, 65]. The study provides crucial insights into SV formation, especially involving repeat sequences and homology-mediated rearrangements, demonstrating the impact of long-read sequencing on understanding genomic architecture and aiding disease research. A companion paper reported a multi-ancestry SV imputation panel built from 888 of the 1,019 samples [66]. The authors integrated their SVs with ~ 45 M variants from the 1000 Genomes Project Phase 3 and evaluated imputation accuracy using the UK Biobank. Metrics varied with minor allele frequency, Genome in a Bottle (GIAB) genomic region type (confident vs. difficult), and variant type: simple insertions and deletions in confident regions showed high imputation quality (mean concordance of 0.718 and 0.721; mean imputation r² of 0.921 and 0.924), higher than that of complex SVs in difficult regions. Although SVs had slightly lower imputation quality than SNVs, the difference was minimal. The SV reference panel provides a strong foundation for SV imputation and GWAS, identifying hundreds of independent SV associations and yielding novel insights [66]. These results demonstrate the value of using the imputation panel to incorporate SV analyses into association workflows.
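For readers unfamiliar with the two kinds of imputation metric quoted above, the following Python sketch shows one common way such quantities are computed for a single variant. The exact definitions used in the companion paper may differ; here concordance is assumed to be the fraction of samples whose best-guess imputed genotype matches the true genotype, and r² is assumed to be the squared Pearson correlation between imputed dosage and true genotype. The function names and all genotype values are invented for illustration.

```python
def genotype_concordance(true_gt, imputed_dosage):
    """Fraction of samples whose best-guess genotype (dosage rounded to the
    nearest of 0, 1, 2) equals the true alternate-allele count."""
    best_guess = [min(2, max(0, int(d + 0.5))) for d in imputed_dosage]
    matches = sum(t == g for t, g in zip(true_gt, best_guess))
    return matches / len(true_gt)


def dosage_r2(true_gt, imputed_dosage):
    """Squared Pearson correlation between true genotypes and imputed dosages.
    Returns NaN when either vector has no variance (monomorphic site)."""
    n = len(true_gt)
    mean_t = sum(true_gt) / n
    mean_d = sum(imputed_dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(true_gt, imputed_dosage))
    var_t = sum((t - mean_t) ** 2 for t in true_gt)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosage)
    if var_t == 0 or var_d == 0:
        return float("nan")
    return cov * cov / (var_t * var_d)


# Hypothetical data for one SV across eight samples
true_genotypes = [0, 1, 2, 0, 1, 0, 2, 1]
imputed = [0.1, 0.9, 1.8, 0.2, 1.4, 0.6, 2.0, 0.3]

print("concordance:", genotype_concordance(true_genotypes, imputed))
print("dosage r2:", round(dosage_r2(true_genotypes, imputed), 3))
```

The two measures answer different questions: concordance scores hard genotype calls, while the dosage correlation retains imputation uncertainty. A variant can therefore score differently on each, so the two are best reported together.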
Parallel progress is being made in cattle and other livestock. For example, the Pausch Lab used the Variation Graph toolkit to develop breed-specific and pangenome reference graphs in cattle, showing their superior accuracy over traditional linear references and uncovering 70 Mb of novel sequences [67,68,69]. Leonard et al. [70, 71] showed that SV-based pangenomes built from haplotype-resolved assemblies were highly consistent across platforms and algorithms, and created multi-species super-pangenomes with good consensus. They also constructed a pangenome from 16 PacBio HiFi cattle assemblies to identify SNPs and SVs [72]. After genotyping SVs from short reads with PanGenie, the authors conducted molecular QTL (molQTL) mapping with testis transcriptome data, identifying 92 potential causal SV candidates. Collectively, these studies demonstrate that variation-aware, graph-based approaches map genetic variants in cattle more accurately and comprehensively than traditional linear references. They also point to the potential value of integrating pangenomic data into breeding programs, where accounting for SVs associated with desirable traits could enhance marker-assisted selection and genomic prediction models. Applications extend beyond trait discovery. By incorporating data from diverse breeds, cattle pangenomes reveal population-specific variation underlying environmental adaptation, such as heat tolerance in tropical breeds or cold resistance in temperate populations [73, 74]. Conservation also benefits: the Prendergast Lab integrated 116 Mb of novel African cattle sequences into reference assemblies, improving read mapping and SV detection, and helping preserve diversity in indigenous breeds [75]. Together, these advances show that graph-based pangenomes are transformative for cattle genomics, offering more complete and accurate variant catalogs than linear references.
Advances in genome assemblies
Long-read sequencing has greatly advanced genome assembly quality, enabling highly accurate de novo assemblies across species. When combined with complementary methods such as Hi-C, which provides long-range scaffolding for detecting large SVs, these platforms deliver near-complete genomes with unprecedented contiguity and accuracy. Long reads bridge repetitive regions, allowing reconstruction of complex SVs such as tandem duplications and inversions that were previously unresolved. As a result, SV detection has markedly improved: high-quality assemblies support unbiased comparisons across individuals and breeds, capturing variants that short reads or single linear references often miss. Nearly complete assemblies now resolve centromeres, telomeres, and segmental duplications, uncovering SVs with important functional roles. Population-specific assemblies reveal adaptations such as disease resistance, while hybrid strategies combining HiFi, ONT, short reads, and Hi-C balance accuracy with cost-effectiveness. Haplotype phasing has advanced in parallel. Long reads and Hi-C enable phasing of both parental haplotypes across entire genomes, resolving heterozygous SNPs and SVs into contiguous haplotype blocks. Tools like HapCut2 [76], together with long, accurate HiFi reads, have increased the median length of phased blocks, while Hi-C extends them even further [33, 77]. The result is fully phased variation panels that improve detection of heterozygous SVs and enhance interpretation of complex traits, with direct applications to cattle breeding.
The landmark human T2T assemblies of CHM13 and HG002 opened previously inaccessible genomic regions to SV discovery [32, 33, 78, 79]. In livestock, assemblies like goat ARS1 and cattle ARS-UCD2.0 achieved contig N50 sizes of ~ 20 Mb with near-complete fidelity [80, 81], setting benchmarks for animal genomics. Dozens of chromosome-level cattle assemblies have been released, and T2T or near-complete genomes for Holstein cattle, sheep, and goat have filled reference gaps, especially in immunogenomic regions [82,83,84,85]. Pangenome efforts have expanded to sheep, Bos indicus, and yaks [86,87,88], reflecting a broader shift toward T2T and pangenomic frameworks. In cattle, three initiatives are spearheading progress:
1) Ruminant T2T (RT2T) Project – Led by Tim Smith, this project is generating complete diploid assemblies across ruminants, including cattle and sheep Y chromosomes, and nearly finished assemblies for multiple cattle breeds and relatives such as bison and river buffalo (Fig. 1D) [83, 89].
2) Bovine Pangenome Consortium (BPC) – Initiated by Ben Rosen, BPC is building a comprehensive bovine pangenome using ~ 15 breed-specific assemblies to improve SV and SNP detection at the genus level (Fig. 1C) [13].
3) Bovine Long Read Consortium (BLRC) – Led by Amanda Chamberlain, Ben Hayes, and colleagues, BLRC is extending the 1000 Bull Genomes Project into the long-read era to generate population-scale SV and SNP catalogs for genomic selection [4].
Similarly, in our recent pangenome study of 20 Holsteins and 10 Jerseys sequenced at 20 × HiFi coverage, we applied both read-based (cuteSV, SVision, Sniffles2, SVIM, pbsv) and assembly-based (SVIM-asm) approaches [90]. After filtering, we identified an average of ~ 28,500 high-confidence SVs per sample, predominantly insertions and deletions, with smaller numbers of duplications and inversions. This was a remarkable increase from short-read approaches, which usually detect 5,000 to 10,000 SVs per sample. Coverage experiments showed that 10 × HiFi achieves ~ 90% recall with a false positive rate of ~ 9%, balancing cost and accuracy. Cross-validation with orthogonal short-read SV calls supported ~ 74% of events [21]. Importantly, the inclusion of the Jersey genomes disproportionately increased the number of unique SVs, demonstrating the value of multi-breed sampling and the presence of breed-specific variation. These results highlight two key points: (1) population-scale SV catalogs require sequencing dozens of individuals per breed, not just a handful, to avoid missing substantial variation; and (2) long-read sequencing provides stable, high-confidence SV discovery, positioning cattle genomics to build resources comparable to those available in human genomics.
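The relationship between the recall and false-positive figures reported for the coverage experiment and the underlying call counts can be sketched in Python as follows. The sketch assumes that "false positive rate" refers to the fraction of reported calls that are absent from the truth set (that is, a false discovery rate), which the text does not state explicitly. The `sv_benchmark` function and the counts passed to it are hypothetical, chosen only so that the resulting rates resemble the approximate values quoted above.

```python
def sv_benchmark(true_pos: int, false_pos: int, false_neg: int) -> dict[str, float]:
    """Summarize an SV call set against a truth set.

    true_pos  : calls that match a truth-set SV
    false_pos : calls with no match in the truth set
    false_neg : truth-set SVs that were not called
    """
    called = true_pos + false_pos   # everything the caller reported
    truth = true_pos + false_neg    # everything that should have been found
    recall = true_pos / truth if truth else 0.0
    precision = true_pos / called if called else 0.0
    fdr = false_pos / called if called else 0.0  # assumed meaning of "false positive rate"
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "fdr": fdr, "f1": f1}


# Hypothetical counts for one sample at reduced coverage
metrics = sv_benchmark(true_pos=900, false_pos=90, false_neg=100)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts, recall is 0.9 and about 9% of the reported calls are false. Repeating the calculation on call sets from downsampled data is one way to trace how both rates change with coverage.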
Future perspectives and challenges
Recent advances in long-read sequencing, haplotype-resolved assemblies, and pangenome construction have fundamentally expanded our ability to characterize SVs in cattle, moving beyond the limitations of short-read data. The Holstein- and Jersey-specific SV catalogs provide a strong foundation for exploring breed-specific variation [90]. Building on these resources, breed-specific phased pangenome graphs and large-scale SV imputation panels are poised to transform downstream applications, from more accurate variant genotyping to robust association studies across thousands of individuals. Looking ahead, integrating SV datasets into artificial intelligence (AI)-driven models could further improve predictive accuracy for complex traits, while cross-species pangenomes may uncover conserved and lineage-specific genetic variation important for adaptation and productivity. Coupled with functional genomics tools such as transcriptomics, epigenomics, and single-cell profiling, these strategies are expected to identify novel functional variants, refine trait mapping, and accelerate genomic selection, thereby enhancing genetic improvement and livestock management strategies.
Despite these advances, several challenges remain. Constructing and maintaining high-quality, breed-specific resources requires substantial sequencing and computational investment, which may restrict their broad adoption across diverse cattle populations. Unlike SNPs, which have vast public catalogs, SVs still lack comprehensive validated databases. Future studies must therefore develop shared SV resources to identify variant commonality across populations and to enable functional annotation. These resources will fill major knowledge gaps while providing direct applications in conservation and breeding, such as linking SVs to disease resistance, feed efficiency, or local adaptation. SV imputation and graph-based genotyping approaches, while powerful, must be further optimized to ensure accuracy across populations with varying ancestry and to integrate seamlessly with existing SNP-based genomic selection pipelines. Moreover, functional validation of SV–trait associations remains a bottleneck, demanding integration of multi-omics datasets, experimental models, and crossbreed comparisons. Overcoming these challenges will be essential for translating SV discoveries into practical breeding tools. Future efforts will also benefit from advances in AI approaches, as well as careful attention to ethical, regulatory, and data-sharing challenges.
Conclusions
SVs are a critical driver of genetic diversity in cattle, influencing health, productivity, and adaptability. Advances in long-read sequencing, pangenome technology, and genome assemblies have revolutionized the study of SVs, enabling precise insights into genetic variations. These findings underscore the transformative potential of genomic research for improving cattle breeding and management strategies. Integrating SV studies into breeding programs and conservation efforts promises to address challenges like disease resistance and sustainability. Recent breakthroughs in sequencing and computational tools are bridging the gap between research and practical applications, paving the way for targeted genetic interventions. However, sustainable practices must guide these advancements to balance production goals with biodiversity conservation. The future of cattle genomics lies in comprehensive, collaborative, and innovative efforts. By harnessing multi-omics approaches, AI-driven analytics, and genome editing technologies, researchers can drive sustainable and resilient improvements in livestock populations, safeguarding genetic heritage and meeting the evolving needs of agriculture.