Background
Under long-term adaptation to diverse environments and artificial selection, goats have evolved various phenotypes and specific genomic patterns compared with their wild ancestors [1]. In addition, domestic goats have incorporated genetic elements from multiple wild species, such as Caucasian tur (Capra caucasica) and Bezoar (Capra aegagrus) during the domestication and dispersal process [2]. Thus, the genomes of goats exhibit complex genetic characteristics [3]. Unveiling genetic diversity features and understanding the genetic mechanisms of diverse phenotypes of goats are pivotal in facilitating the preservation and utilization of these genetic resources.
Over the last decade, reference genomes have provided a roadmap and a fundamental framework for discovering genomic features. By aligning short reads to a reference genome, the genomic variations, such as single nucleotide polymorphisms (SNPs) and small insertions and deletions (INDELs), were identified, which elucidated the genetic mechanisms of the diverse phenotypes in humans [4], animals [5], and plants [6]. The current goat reference genome was generated from one individual and used in most goat genomic studies [7, 8]. However, the genetic diversity and genomic patterns within a species can’t be fully captured by the reference genome from a single individual [9]. The pan-genome is the nonredundant set of all DNA sequences in a specific species [10]. Pan-genome research in humans [10], animals [11], and plants [12] has provided novel scientific insights into the genetic diversity features and genetic mechanisms of diverse phenotypes. Compared with a reference genome from a single individual, a pan-genome could be applied to detect structural variations (SVs), such as copy number variations (CNVs) and presence/absence variations (PAVs) [13, 14]. These variations may significantly impact phenotype more than SNPs by altering gene dosage, interrupting coding sequence (CDS), and affecting long-range gene regulation [15,16,17]. The first goat pan-genome was assembled by Li et al. [18] using 10 individuals, including 5 samples from Capra genus and 5 individuals from sibling genus. However, the limited sample size can’t fully represent goats’ genetic and phenotypic diversity and constrain the detection of SVs, such as PAVs.
To obtain the comprehensive genome sequence and improve our understanding of the genomic characteristics of goats, the map-to-pan strategy was used to construct a goat pan-genome based on 723 domestic goats and 90 samples from their wild relatives, covering genetically, phylogenetically and geographically diverse samples (Additional file 1: Table S1). To identify the previously unknown SNPs, we aligned the reads that couldn’t be mapped to ARS1.2 to the nonreference sequences. PAVs were detected among the wild, native, and improved populations to reveal genetic changes through domestication and breeding history. This research will enhance our understanding of the changes in the genetic diversity and genomic architecture during domestication and improvement in goats.
Methods
Genome data collection, processing, and pan-genome construction
A total of 723 whole genome sequencing data of domestic goats from 85 populations and 90 samples from 6 wild goat species of Capra genus were downloaded from the National Center for Biotechnology Information Sequence Read Archive database (NCBI, https://www.ncbi.nlm.nih.gov) (Fig. 1, Table 1, Additional file 1: Table S1). The map-to-pan strategy was used to construct a goat pan-genome [19]. More specifically, raw Illumina reads were filtered to remove adapters and low-quality sequences using Trimmomatic (v0.39) [20] with parameters “SLIDINGWINDOW:4:15 MINLEN:50”. The high-quality reads of each individual were mapped to ARS1.2 using BWA (v0.7.12) [21] with the default parameters. The reads unmapped to ARS1.2 were extracted using SAMtools (v1.9) [22] with parameters "-b -f 4", "-b -f 68 -F 8" and "-b -f 132 -F 8", respectively. Subsequently, the unmapped reads were merged into a unified BAM file by SAMtools (v1.9) [22], followed by conversion of the BAM file to FASTQ format using bamtools (v2.5.2) (https://github.com/pezmaster31/bamtools). The FASTQ file was divided into paired FASTQ files corresponding to the forward and reverse strands utilizing an in-house Python script. Then, the paired unmapped reads were assembled using MaSuRCA (v4.1.0) [23] with parameters “USE_LINKING_MATES = 1, JF_SIZE = 5,000,000,000, FLYE_ASSEMBLY = 0”. Contigs assembled with a length exceeding 500 bp were retained for further analysis. We polished the contigs following 4 steps. Firstly, the Mummer (v4.0.0-beta2) [24] was employed to filter the initial contigs by aligning them to the ARS1.2. Contigs exhibiting an identity of ≥ 90% and a coverage of ≥ 80% were subsequently excluded. Secondly, the CD-HIT (v4.8.1) [25] was applied for filtering, where the assembly contigs were aligned against themselves, and contigs displaying an identity ≥ 90% and coverage ≥ 90% were removed. Thirdly, all contigs were carried out a search within the NCBI-NT database using BLASTN (v 2.14.0) [26] to discern sequences associated with archaea, viruses, bacteria, fungi, and viridiplantae, and identified sequences from these categories were eliminated. Finally, the Kraken2 (v2.0.9-beta) [27] was employed to classify and filter the remaining contigs using the kraken2-microbial database, encompassing sequences of archaea, bacteria, fungi, protozoa, viruses, and humans. The unclassified contigs were retained and integrated with ARS1.2 to assemble a goat pan-genome.

Geographical distribution of samples used for pan-genome construction
Annotation of the nonreference sequences
The nonreference sequences were annotated using a pipeline that integrated three distinct methodologies: ab initio gene prediction, RNA-Seq, and protein homology [28]. Firstly, the repetitive sequences were identified by scanning a custom repeat library and Bovidae repeat library in RepeatMasker (v4.1.2-p1) [29] and masked in subsequent annotation analysis, in which a custom repeat library was built by screening the nonreference sequence using RepeatModeler (v2.0.3) [30]. Secondly, ab initio gene prediction was conducted using Augustus (v3.1.0) [31] under the Bovidae model with default parameters, and SNAP (2006-07-28) [32] with the guidance of MAKER2 (v2.31.8) [28]. Thirdly, we collected 37 RNA-Seq data from 28 tissues as transcript evidence (Additional file 1: Table S2). The RNA-Seq reads were filtered to remove the adapter and low-quality sequences using Trimmomatic (v0.39) [20]. The clean reads were aligned to the nonreference sequences using Hisat2 (v2.2.1) [33], and the gene models were subsequently constructed from the result of alignments using StringTie (v2.1.7) [34]. We further downloaded protein sequences of Homo sapiens, Mus musculus, Canis lupus familiaris, Bos taurus, Sus scrofa, Ovis aries, and Capra hircus from RefSeq, and the sequences were aligned to the nonreference sequences using Spaln (v3.0.3) [35]. Finally, we integrated gene predictions based on evidence of ab initio gene prediction, RNA-Seq, and protein homology following the MAKER2 pipeline [28], and a set of high-confidence gene models were obtained using MAKER2 (v2.31.8) [28]. We further investigated gene description by aligning the protein sequences to the UniProt database (https://ftp.uniprot.org/pub/databases) using BLASTP (v2.11.0) [26].
Novel SNP calling and population genetic analysis
The reads of each sample that couldn't be mapped to ARS1.2 were mapped to the nonreference sequence using BWA-MEM (v0.7.17-r1188) [21] with the default parameters. The duplicated mapped reads were filtered using MarkDuplicates module in GATK (v4.1.9.0) [36]. Short variations (SNPs and INDELs) were identified by HaplotypeCaller module in GATK (v4.1.9.0) [36]. GenomicsDBImport module in GATK (v4.1.9.0) [36] was used to generate a database for GVCF files of all samples. And then, genotypes of variations were detected using the GenotypeGVCFs module in GATK (v4.1.9.0) [36]. SNPs were selected by the SelectVariants module in GATK (v4.1.9.0) [36] and filtered out potential false-positive loci using the VariantFiltration module of GATK (v4.1.9.0) [36] under the parameters “QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < −12.5 || ReadPosRankSum < −8.0”. Biallelic SNPs were extracted using VCFtools (v0.1.15) [37] with parameters “--minDP 2 --min-alleles 2 --max-alleles 2”. We further filtered out SNPs with the following criteria using PLINK (v1.90b6.26) [38]: (1) minor allele frequency (MAF) < 0.01; (2) SNP call rate < 0.5; (3) individual call rate < 0.5. After the quality control, 3,190 novel SNPs were retained for subsequent analysis.
To investigate features of these novel SNPs, we performed population genetic analysis using novel SNP datasets, such as principal component analysis (PCA), an approximately maximum likelihood phylogenetic tree, and model-based clustering. We performed the above analysis based on two datasets: one comprising 723 domestic goats and the other including all the 813 individuals (Additional file 1: Table S1). PCA was performed using PLINK (v1.90b6.26) [38]. An approximately maximum likelihood phylogenetic tree was constructed using the FastTree (v2.1.11) [39] with default parameters. The final tree topology was visualized using iTOL tool [40]. We performed model-based clustering to estimate the ancestry of each individual using the ADMIXTURE (v1.3.0) [41] with the number of ancestry kinships (K) set to 2–8.
PAV calling and analysis
To investigate the characteristics of PAV in different populations, individuals with a minimum average sequencing depth of 10 × were retained for PAV calling [9, 13]. A total of 277 individuals were retained for PAV calling, including 85 wild, 121 native, and 71 improved goats (Additional file 1: Table S3). The presence or absence of each gene in each individual was determined using SGSGeneloss-based method [42]. We retrieved the longest transcript for each gene, considering it as the gene body, and subsequently calculated coverage based on this gene body [13]. If a minimum of two reads covered less than 5% of the exon regions, the gene was defined as absent in this individual; otherwise, it was classified as present [43].
We further investigated effects of artificial and natural selection, which were among the most important factors that reshaped the genome's architecture [44]. To identify genes under selection during domestication and improvement, we implemented PAV selection analysis between Bezoar (n = 24) and native goats (n = 121) to identify genes associated with domestication (Table 2). We also performed PAV selection analysis between native goats (n = 121) and improved dairy goats (n = 57) to identify genes related to improvement process (Table 2). Fisher’s exact test was conducted to determine the significance of the difference in the presence frequencies for each gene. A gene with a P value less than 0.005 was chosen as the putative gene.
Results
Pan-genome of Capra genus
A total of 813 individuals were collected, including 723 domestic goats and 90 wild goats from Europe, Africa, and Asia (Fig. 1, Additional file 1: Table S1), and were used to assemble a goat pan-genome. The average sequence coverage of all the individuals was 16.40× (1.33×–48.64×). After removing contaminants and redundancies, 146 Mb sequences absent from ARS1.2 were retained. These nonreference sequences contained 133,959 contigs (501 bp ≤ length of contig ≤ 113,829 bp) with an average length of 1,091 bp. In addition, 974 protein-coding genes were predicted in the nonreference genome (Additional file 1: Table S4). Ultimately, the goat pan-genome, comprising the ARS1.2 and the nonreference sequences, harbored 25,743 genes (21,546 protein-coding genes and 4,197 other types of genes) covering DNA length up to 3,068,830,139 bp. We also compared the size and number of protein-coding genes of the goat pan-genome with other published goat reference genomes, such as CHIR_1.0 (GCF_000317765.1) and CVASU_BBG_1.0 (GCA_004361675.1). The genome size and number of protein-coding genes in the goat pan-genome were higher than the two reference genomes (Additional file 1: Table S7).
Novel SNP calling and population genetic analysis
To identify missing SNPs in previous studies using a single reference genome, reads that couldn't be aligned to ARS1.2 were mapped to the nonreference sequences, and SNP calling was performed. We detected 3,190 high-quality novel SNPs in 813 samples. We conducted PCA, approximately-maximum-likelihood phylogenetic tree, and model-based clustering analysis using the novel SNP dataset. The results showed that samples from the same species tended to cluster (Fig. 2A, Additional files 2 and 3: Fig. S1 and S2). Most domestic goats were clustered based on their breeds (Additional file 4: Fig. S3), and domestic goats were split into 3 geographically structured groups (Asian, African, and European populations) (Fig. 2B–D, Additional files 2 and 3: Fig. S1 and S2). Population genetic analysis showed a close genetic relationship between Bezoar Ibex/Markhor and domestic goats (Fig. 2A, Additional files 2 and 3: Fig. S1 and S2). The pattern was consistent with previous studies using whole genome sequence datasets [1].

Population genetic analysis based on novel SNPs. A Maximum-likelihood tree based on novel SNPs for domestic goats and close wild relatives. B Maximum-likelihood tree based on novel SNPs for domestic goats. C Principal component analysis based on novel SNPs for domestic goats. D Model-based clustering of domestic goats with different numbers of ancestral kinships (K = 3, 4, and 5)
Characterization of gene PAVs
A total of 277 individuals with an average sequence depth of more than 10× (Additional file 1: Table S3) were used to perform PAV calling. The presence frequency of each gene in all 277 individuals was calculated. In accordance with previous pan-genome studies [9, 13, 45], the genes were grouped into four categories (core, softcore, shell, cloud) according to their presence frequencies: core gene presenting in all individuals (100%); softcore gene presenting in more than 99% of individuals but less than 100% of individuals; shell gene presenting in 1%–99% individuals and cloud gene presenting in less than 1% individuals. Finally, 23,098 genes (89.73%) were found in all 277 individuals and were classified as core genes. The remaining 2,645 genes were classified as dispensable genes, including 914 softcore genes (3.55%), 867 shell genes (3.37%), and 864 cloud genes (3.36%) (Fig. 3A). The distribution pattern of gene PAVs was consistent with pan-genome studies of other species, such as chicken [13], Brassica oleracea [46], and arabidopsis [47]. It showed moderately high conserved genes (core and softcore genes) (Fig. 3A). The model of the pan-genome size by 200 iteratively randomly sampling individuals from 1 to 277 indicated an open pan-genome with an estimated total of 25,394 genes (Fig. 3B). The result suggested that the goat pan-genome assembled in this study didn’t reach saturation and could not include all or nearly all Capra genus gene contents. We also noted a different distribution of gene PAV across various species or populations (Fig. 3C). A striking feature was the decrease in gene presence from wild species to native populations and further from native populations to improved populations (Fig. 3C).

Characterization of gene PAVs and pan-genome model. A Pan-genome gene classification. B Pan-genome modeling. The upper and lower lines represent the pan and core-genome numbers, respectively. C The heatmap shows the PAVs of variable genes within wild, native, and improved dairy populations
Gene PAVs associated with domestication and improvement of goats
We compared gene presence between Bezoar and native goat populations (Table 2, Additional file 1: Table S2). The genes with higher frequencies in native goat populations than Bezoar population were considered as possible favorable genes, and genes with lower frequencies in native goat populations than Bezoar population were considered as possible unfavorable genes [9]. We identified 89 genes with significantly altered frequencies (Fig. 4A), comprising 15 favorable genes and 74 unfavorable genes (Fig. 4B). Annotation analysis indicated that all 89 genes were in ARS1.2 (Additional file 1: Table S5). Among the 15 favorable genes, CLEC2D (LOC102184901) was associated with immune and inflammatory responses (Table 2) and has been fixed in native goats (frequency = 99.17%) (Additional file 1: Table S5). We also detected genes related to immune and inflammatory responses in unfavorable genes, such as FAM26F (LOC102181592). FAM26F that encoding tetraspanin-like membrane glycoprote [48] was detected in all Bezoar populations and 64.46% of native populations (Additional file 1: Table S5). We performed a comparative analysis of gene presence between native goats and dairy goats. Of note, 35 favorable genes (with a higher frequency in improved goat populations compared with native goat populations) and 72 unfavorable genes (with a lower frequency in improved goat populations compared with native goat populations) were identified (Fig. 4C and D, Additional file 1: Table S6). Genes associated with immune and inflammatory responses were also found in unfavorable genes, such as GIMAP6 (LOC108635866) and CLECL1 (LOC108636138). GIMAP6 was detected in 96.69% of native goats and 54.39% of dairy goats (Additional file 1: Table S6) and has been reported to be associated with the regulation of autophagy, immune competence, and inflammation in mammals [49].

Gene PAVs associated with domestication and improvement of goats. A Scatter plots showing gene occurrence frequencies in wild (Bezoar) and native populations. B Occurrence frequency patterns of putative selected genes during domestication. C Scatter plots showing gene occurrence frequencies in native and improved populations. D Occurrence frequency patterns of putative selected genes during improvement. E Violin plots showing the number of detected genes in each individual within wild, native, and improved dairy populations. The significant difference of Mann–Whitney U test: ***P < 0.001
Notably, a prominent feature was the loss of genes during domestication and improvement (Fig. 4B and D). Specifically, 75 (83.15%) and 72 genes (67.29%) were identified as unfavorable genes, respectively (Additional file 1: Table S5 and S6). We further compared the gene count within each individual of the Bezoar population and the native goat and dairy goat populations. An average of 24,664, 24,623, and 24,615 genes were detected, showing a decreasing trend in gene number from wild species to native populations and from native populations to dairy populations (Fig. 4E).
Discussion
This study used the map-to-pan strategy to construct a goat pan-genome. Taking into account the potential impact of sample size and representativeness on the pan-genome in humans and other organisms [13, 50, 51], we collected a diverse range of samples, covering almost all goats and their wild relatives across various geographic regions (Fig. 1, Additional file 1: Table S1), surpassing the scope of the previous study [18]. A total of 146 Mb nonreference sequences and 974 additional genes were identified as absent from ARS1.2. Notably, both indices were higher than those in the previous goat pan-genome study [18]. Utilizing the goat pan-genome, we identified 3,190 novel SNPs, indicating that the pan-genome can capture a greater diversity of genetic variations than a single reference genome.
Modeling of the goat pan-genome size revealed an open pan-genome (Fig. 3B). The result suggested that 277 samples in our PAV analysis were insufficient to capture the full spectrum of genetic diversity within goats. One potential reason was the limited sample size. Our study’s sample size was moderate compared with other animals, such as 268 individuals in chicken [13] and 250 individuals in pigs [11]. Nevertheless, the mosaic domestication of goats in the Fertile Crescent has led to genetically and geographically distinct goat populations since the Neolithic period, and continues to influence their diversity today [3, 52]. In addition, a high level of interspecies sequence variation has been observed compared with intraspecies in Capra genus [53], a high proportion of wild goats (31%, Additional file 1: Table S3) in our PAV analysis could contribute to the heterogeneity observed in genomic sequences. Therefore, by incorporating additional samples, particularly from improved and native breeds outside of China, PAV analysis will encompass a broader spectrum of goat gene contents.
Compared with other species, the goat pan-genome exhibited a higher proportion of core genes (89.73%), such as in pigs (75%) [54], chickens (76.32%) [13] and goose (75.86%) [55]. Although the proportion of core genes may decrease with an increase in sample size, 11% of variable genes within the goat pan-genome indicate substantial genomic potential for goat breeding. The PAV analysis unveiled gene loss events during domestication and improvement (Fig. 4). Comparable patterns have been observed in domesticated plants, such as tomato and cotton [9, 56]. The annotation of PAVs suggested that domestication and improvement may have influenced genomic features associated with immune and inflammatory responses [9, 56]. We also identified several immune and inflammation-related genes, including CLEC2D, FAM26F, GIMAP6, and CLECL1 (Table 2), that may be related to the domestication and improvement of goats. CLEC2D was found to be fixed (frequency = 0.99) within native goat populations. It plays a role in regulating immune and inflammatory responses as a ligand of CD161 receptor, and inhibiting the function of NK cells and T cells by its expression [57, 58]. FAM26F was observed to be fixed (frequency = 1.00) within Bezoar population. The gene could be activated by various infections, such as bacteria, viruses, and parasites, and expressed in various immune cells [48]. This observation is consistent with variable environments encountered by Bezoar. Bezoars are primarily distributed in a discontinuous range across Central Asia and the Caucasus, extending to southwestern Turkey [59]. Their natural habitats are characterized by harsh environmental conditions, including arid deserts and high-altitude rocky mountains [60]. GIMAP6 and CLECL1 exhibited a higher frequency in native goats than dairy goats (Additional file 1: Table S6) and are associated with immune responses [49, 61, 62]. This might also reflect the diverse environments encountered by native goats [53].
Conclusion
We constructed a goat pan-genome based on 813 individuals, including 723 domestic goats and 90 samples from their wild relatives, which presented a broad genetic and geographical representativeness. A total of 146 Mb sequences and 974 protein-coding genes were identified as absent from ARS1.2. Additionally, 3,190 novel SNPs were identified from the nonreference sequences, which suggested that the goat pan-genome could capture more genetic variations. PAV analysis revealed evidence of gene loss during domestication and subsequent improvement processes. PAV selection analysis identified several genes related to immune regulation and inflammatory responses. This research enhances our understanding of the genomic changes throughout the history of goat breeding and highlight the importance of pan-genome in goat genomic studies.