Documentation

General

The rapid advancement of next-generation sequencing technology has yielded a deluge of world-wide pig genomic data for the characterization of population genetic diversity and for genomic selection. However, efficient storage, querying and visualization of such large datasets remain challenging. Here, we developed a comprehensive Pig Genome Variation Database (PGVD) that provides six main functionalities: Gene Quick Search, Variation Search, Genomic Signature Search, Genome Browser, Alignment Search Tools (BLAT/BLAST) and Genome Coordinate Conversion Tool (LiftOver). The PGVD provides genomic variations comprising ~60.44 M SNPs, ~6.86 M indels, 76,633 CNV regions and signatures of selective sweeps in 432 modern pigs sampled world-wide. Users can quickly retrieve the variation distribution patterns of 54 globally representative pig breeds for a given gene symbol or genomic region in three versions of the pig genome (ARS-UCD1.2, UMD3.1.1 and Sscrofa11.1). The signals of selection are displayed as Manhattan plots and Genome Browser tracks. To further investigate the relationship between variants and signatures of selection, the Genome Browser integrates all variation and selection data with resources from NCBI, the UCSC Genome Browser and AnimalQTLdb for convenient visualization. Collectively, these features make the PGVD a useful archive for in-depth analysis in pig biology and pig breeding.

1. Chen, N., Cai, Y., Chen, Q., Li, R., Wang, K., Huang, Y. et al. Whole-genome resequencing reveals world-wide ancestry and adaptive introgression events of domesticated Pig in East Asia. Nature Communications, 2018, 9, 2337.
2. Wang, X., Zheng, Z., Cai, Y., Chen, T., Li, C., Fu, W., Jiang, Y. CNVcaller: Highly Efficient and Widely Applicable Software for Detecting Copy Number Variations in Large Populations. GigaScience, 2017, 6(12), 1-12.

Methods

I. Pipeline of SNPs and indels calling

All raw sequence data were obtained from the NCBI Sequence Read Archive (SRA). We collected a total of 432 samples representing 54 breeds. The genome resequencing achieved an average depth of ~13X.
All cleaned reads were mapped to the pig reference assembly Btau_5.0.1 (GCF_000003205.7) using BWA-MEM (0.7.13-r1126) with default parameters. Duplicate reads were removed using Picard Tools (http://broadinstitute.github.io/picard/). Then, the Genome Analysis Toolkit (GATK, version 3.6-0-g89b7209) was used to detect single nucleotide polymorphisms (SNPs). The following criteria were applied to all SNPs: (1) SNPs with a mean sequencing depth (over all included individuals) < 1/3X or > 3X were filtered; (2) SNPs with Variant Confidence/Quality by Depth (QD) < 2 were filtered; (3) SNPs with RMS Mapping Quality (MQ) < 40.0 were filtered; (4) SNPs with a Phred-scaled P-value from Fisher's exact test for strand bias (FS) > 60 were filtered; (5) SNPs with a Z-score from the Wilcoxon rank sum test of Alt vs. Ref read mapping qualities (MQRankSum) < -12.5 were filtered; (6) SNPs with a Z-score from the Wilcoxon rank sum test of Alt vs. Ref read position bias (ReadPosRankSum) < -8 were filtered; (7) only SNPs with a missing rate < 0.1 were retained; and (8) only SNPs with exactly two alleles were retained.
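As an illustration only, criteria (2) to (8) above can be expressed as a single predicate. This is a minimal sketch and not the pipeline's actual code: the function name and argument names are hypothetical, criterion (1) is omitted because it depends on the cohort-wide depth, and treating a missing rank-sum annotation as "not failing" is an assumption.

```python
from typing import Optional


def passes_snp_filters(
    qd: float,
    mq: float,
    fs: float,
    mq_rank_sum: Optional[float],
    read_pos_rank_sum: Optional[float],
    missing_rate: float,
    n_alleles: int,
) -> bool:
    """Return True if a SNP survives criteria (2)-(8) listed above."""
    if qd < 2.0:          # (2) Quality by Depth
        return False
    if mq < 40.0:         # (3) RMS Mapping Quality
        return False
    if fs > 60.0:         # (4) Fisher strand bias
        return False
    # (5) and (6): rank-sum annotations may be absent for a site;
    # here a missing value is assumed not to fail the filter.
    if mq_rank_sum is not None and mq_rank_sum < -12.5:
        return False
    if read_pos_rank_sum is not None and read_pos_rank_sum < -8.0:
        return False
    if missing_rate >= 0.1:   # (7) keep only missing rate < 0.1
        return False
    return n_alleles == 2     # (8) keep only biallelic SNPs
```

For example, `passes_snp_filters(10.0, 60.0, 1.0, 0.0, 0.0, 0.02, 2)` keeps the site, while lowering QD to 1.5 rejects it.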
For indel calling, we first sifted structural variations for each sample by GATK with the SelectVariants-based method. Then, we applied the hard filter command "VariantFiltration" to exclude potential false-positive variant calls with the parameter --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -8.0 || InbreedingCoeff < -0.8". Finally, we only retained the 1-30 bp indels for downstream analysis.
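The indel hard filter and the length restriction can be sketched in a few lines. The helper names below are hypothetical, the indel length is assumed to be the difference between the REF and ALT allele lengths, and annotations that are missing for a site are assumed not to trigger the filter.

```python
from typing import Optional


def indel_length(ref: str, alt: str) -> int:
    """Length of an indel, taken as the REF/ALT length difference."""
    return abs(len(ref) - len(alt))


def keep_indel(
    ref: str,
    alt: str,
    qd: float,
    fs: float,
    read_pos_rank_sum: Optional[float],
    inbreeding_coeff: Optional[float],
) -> bool:
    """Apply the hard-filter expression, then the 1-30 bp length window."""
    fails_hard_filter = (
        qd < 2.0
        or fs > 200.0
        or (read_pos_rank_sum is not None and read_pos_rank_sum < -8.0)
        or (inbreeding_coeff is not None and inbreeding_coeff < -0.8)
    )
    return (not fails_hard_filter) and 1 <= indel_length(ref, alt) <= 30
```

A 2 bp insertion such as REF `A`, ALT `ATG` with good annotations is kept, whereas a 31 bp insertion is dropped by the length window.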
A total of ~60.4 million autosomal SNPs and ~6.8 million autosomal indels were identified.
Beagle was used to phase the identified SNPs.
Annotation of SNPs and indels was carried out using snpEff.
Minor allele frequencies (MAF) across all pigs, and allele frequencies for each breed and each "core" pig group, were calculated with PLINK.
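For readers unfamiliar with the quantity, the arithmetic behind a minor allele frequency at a biallelic site in diploid samples is shown below. This only illustrates the definition; it is not how PLINK is invoked, and the function name is hypothetical.

```python
def minor_allele_frequency(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """MAF at a biallelic site from diploid genotype counts.

    Samples with missing genotypes are assumed to be excluded
    from the counts before calling this function.
    """
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if total_alleles == 0:
        raise ValueError("no called genotypes at this site")
    alt_freq = (n_het + 2 * n_hom_alt) / total_alleles
    # The minor allele is whichever of REF/ALT is rarer.
    return min(alt_freq, 1.0 - alt_freq)
```

With 70 homozygous-reference, 20 heterozygous and 10 homozygous-alternative animals, the alternative allele count is 40 out of 200 alleles, giving a MAF of 0.2. The same function applied to the animals of one breed gives that breed's frequency.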

II. Pipeline of CNVs calling

CNVcaller was used to discover the haploid copy number in 432 pig genomes.
Filtering criteria: the copy number deviates from the normal copy number by about two standard deviations of the sample; and the alternative allele frequency is > 0.05 or at least 3 individuals in the population are homozygous.
After this screening, adjacent correlated candidate CNV windows were merged into a continuous CNV region. Within each CNV region, the copy numbers of all samples were clustered using the mean-shift algorithm.
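To make the clustering step concrete, the following is a generic one-dimensional mean-shift with a flat kernel, written for illustration. It is not CNVcaller's implementation; the function name, the bandwidth value and the example copy numbers are all assumptions.

```python
from typing import List, Tuple


def mean_shift_1d(
    values: List[float],
    bandwidth: float,
    max_iter: int = 100,
    tol: float = 1e-6,
) -> Tuple[List[int], List[float]]:
    """Cluster 1-D values; return (label per value, cluster centres)."""
    modes = []
    for v in values:
        x = v
        for _ in range(max_iter):
            # Flat kernel: average every point within one bandwidth of x.
            neighbours = [u for u in values if abs(u - x) <= bandwidth]
            if not neighbours:
                break
            new_x = sum(neighbours) / len(neighbours)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)

    # Points whose modes lie within one bandwidth share a cluster.
    centres: List[float] = []
    labels: List[int] = []
    for m in modes:
        for i, c in enumerate(centres):
            if abs(m - c) <= bandwidth:
                labels.append(i)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels, centres


# Hypothetical haploid copy numbers for eight samples in one CNV region.
copy_numbers = [0.0, 0.1, 0.05, 1.0, 0.95, 1.05, 2.0, 2.1]
labels, centres = mean_shift_1d(copy_numbers, bandwidth=0.3)
```

In this toy example the samples fall into three groups, centred near 0, 1 and 2 copies, which is the kind of genotype grouping the clustering step is meant to recover.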
The CNVs were annotated using Annovar.

III. Population structure

The PCA of the SNPs was performed using the smartpca programme in EIGENSOFT v5.0. The Tracy-Widom test was used to determine the significance level of the eigenvectors. ADMIXTURE version 1.3.0 was used to quantify the genome-wide admixtures among modern pig populations. ADMIXTURE was run for each possible group number (K = 2 to 8) with 200 bootstrap replicates. For autosomal genome data, a neighbour-joining tree was constructed with PLINK (version 1.9) using the matrix of pairwise genetic distances.
Combined with our previous results, six geographically distributed ancestral components can be roughly ascribed to: African taurine, European taurine, Eurasian taurine, East Asian taurine, Chinese indicine, and Indian indicine.

IV. Selection evaluation

PGVD provides nucleotide diversity (Pi), heterozygosity (H_p), integrated haplotype score (iHS), Cockerham and Weir Fst (F_ST), cross-population extended haplotype homozygosity (XP-EHH), and cross-population composite likelihood ratio (XP-CLR) for eight pig groups. To facilitate the identification of true selective signatures, we set a cutoff corresponding to a Z-test P < 0.005.
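The cutoff can be illustrated as follows: window scores are Z-transformed and compared against the normal quantile for P = 0.005. This is a sketch under stated assumptions, not the database's code: a one-tailed upper test is assumed (for statistics such as Pi, where low values are of interest, the lower tail would be used instead), the population standard deviation is used, and all function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev
from typing import List


def z_transform(scores: List[float]) -> List[float]:
    """Standardize scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("scores have zero variance")
    return [(s - mu) / sigma for s in scores]


def one_tailed_z_cutoff(p: float = 0.005) -> float:
    """Z value whose upper-tail probability equals p."""
    return NormalDist().inv_cdf(1.0 - p)


def significant_windows(scores: List[float], p: float = 0.005) -> List[int]:
    """Indices of windows whose Z score exceeds the cutoff."""
    cutoff = one_tailed_z_cutoff(p)
    return [i for i, z in enumerate(z_transform(scores)) if z > cutoff]
```

Under the one-tailed assumption, P < 0.005 corresponds to a Z score above roughly 2.58.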

V. liftOver chain file

We aligned the UMD3.1.1 genome and the newly published ARS-UCD1.2 genome to the Sscrofa11.1 genome using LAST v88555 to produce pairwise alignments.
Three utilities, maf-convert, axtChain and chainMergeSort, were then used to produce two liftOver chain files: Btau5.0.1ToUMD3.1.1.chain.gz and Btau5.0.1ToARS-UCD1.2.chain.gz.
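To show what a chain file encodes, the sketch below maps a single 0-based position through ungapped chain blocks. It is a simplified illustration, not a replacement for liftOver: only chains with a "+" query strand are handled, interval splitting and the minimum-match ratio are ignored, and the toy chain and function name are invented for the example.

```python
from typing import Optional, Tuple

# Hypothetical chain: header fields are
# chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
# followed by "size dt dq" block lines and a final "size" line.
TOY_CHAIN = """\
chain 1000 chr1 1000 + 100 300 chr1 1200 + 150 360 1
50 10 20
140
"""


def lift_position(chain_text: str, chrom: str, pos: int) -> Optional[Tuple[str, int]]:
    """Map a 0-based position on the source assembly to the destination."""
    lines = [ln.split() for ln in chain_text.strip().splitlines() if ln.strip()]
    i = 0
    while i < len(lines):
        fields = lines[i]
        if fields[0] != "chain":
            i += 1
            continue
        t_name, t = fields[2], int(fields[5])
        q_name, q_strand, q = fields[7], fields[9], int(fields[10])
        i += 1
        while i < len(lines) and lines[i][0] != "chain":
            block = [int(x) for x in lines[i]]
            size = block[0]
            # Inside an ungapped block the offset is preserved.
            if t_name == chrom and q_strand == "+" and t <= pos < t + size:
                return q_name, q + (pos - t)
            t += size
            q += size
            if len(block) == 3:  # skip the gaps before the next block
                t += block[1]
                q += block[2]
            i += 1
    return None  # position falls in a gap or is not covered by any chain
```

With the toy chain, position 120 maps to 170, while position 155 falls in an alignment gap and is not converted.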

VI. Database implementation

High-quality SNPs, indels, CNVs and selection scores, together with their corresponding annotations, classifications and threshold values, were processed with Perl scripts and stored in a MySQL database.
We used PHP Server Pages, HTML5 and JavaScript to implement search, data visualization and download.

Manual

I. Samples and population structure

Pigs have had a central role in the evolution of human cultures and are the most economically important of the domesticated animal species. Here, we developed a comprehensive Pig Genome Variation Database (PGVD) that provides six main functionalities: Gene Quick Search, Variation Search, Genomic Signature Search, Genome Browser, Alignment Search Tools (BLAT/BLAST) and Genome Coordinate Conversion Tool (LiftOver). In the current version, PGVD contains 74,283,444 SNPs and 10,500,671 indels derived from 448 animals. Selective signatures were evaluated for six pig groups using four methods (Pi, iHS, FST and XP-EHH). Many external databases, such as NCBI, the UCSC Genome Browser, AnimalQTLdb, AmiGO 2 and KEGG, were integrated into our browser. PGVD will be a useful archive for in-depth analysis in pig biology and pig breeding.

Fig 1. Geographic distribution and population genetics analyses of 432 pig individuals.

II. Gene quick search

Type a gene symbol into the "search term" box, then press "search" to obtain basic gene information (e.g., genomic location, transcript sequence, protein sequence, GO ID and GO terms, and relevant KEGG pathways), gene variation information (e.g., SNPs and indels), and gene selective signatures (e.g., FST, XP-CLR, XP-EHH, Pi, Hp, and iHS).

III. Variation search

The PGVD allows users to obtain information on SNPs and indels by searching for a specific gene or a genomic region in a pig genome (Sscrofa_11.1). Users can further filter SNPs and indels with "Advanced Search", in which parameters such as minor allele frequency and consequence type can be set; this option enables users to narrow down the items of interest in an efficient and intuitive manner. The results are presented in an interactive table and graph. For SNPs and indels, users can obtain related details including variant position, alleles, minor allele frequency, variant effect, rs id and the allele frequency distribution pattern in 54 world-wide pig breeds or six "core" pig groups.
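A region search with the "Advanced Search" filters amounts to a range query with optional conditions. The sketch below uses Python's built-in sqlite3 as a stand-in for the MySQL backend; the table name, column names and demo rows are all hypothetical and do not reflect the actual PGVD schema or data.

```python
import sqlite3
from typing import List, Optional, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a few made-up variants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variant (chrom TEXT, pos INTEGER, ref TEXT, "
        "alt TEXT, maf REAL, consequence TEXT)"
    )
    # A (chrom, pos) index keeps region lookups fast on large tables.
    conn.execute("CREATE INDEX idx_variant_region ON variant (chrom, pos)")
    rows = [
        ("1", 1000, "A", "G", 0.25, "missense_variant"),
        ("1", 1500, "C", "T", 0.01, "intron_variant"),
        ("1", 2500, "G", "GA", 0.10, "intron_variant"),
        ("2", 1200, "T", "C", 0.40, "synonymous_variant"),
    ]
    conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def search_region(
    conn: sqlite3.Connection,
    chrom: str,
    start: int,
    end: int,
    min_maf: float = 0.0,
    consequence: Optional[str] = None,
) -> List[Tuple]:
    """Return variants in [start, end], optionally filtered."""
    sql = (
        "SELECT chrom, pos, ref, alt, maf, consequence FROM variant "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? AND maf >= ?"
    )
    params: list = [chrom, start, end, min_maf]
    if consequence is not None:
        sql += " AND consequence = ?"
        params.append(consequence)
    sql += " ORDER BY pos"
    # Placeholders keep user-supplied values out of the SQL text.
    return conn.execute(sql, params).fetchall()
```

Calling `search_region(build_demo_db(), "1", 900, 2000, min_maf=0.05)` returns only the demo variant at position 1000, since the one at 1500 falls below the frequency threshold.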

SNPs or indels Search

IV. Signature search

Users can select a specific gene symbol or genomic region, one of the statistical methods (Pi, H_p, iHS, F_ST, XP-CLR, XP-EHH), and a specific "core" pig group to view the selection scores. In our database, the selection scores are pre-processed with transformations (Z-transform, logarithm) that are commonly used in published papers. The results are returned in a tabular format. When users click the "show" button in the table, selective signals are displayed in Manhattan plots or common graphics, where the target region or gene is highlighted in red/blue colour.

V. PGVD tools

Alignment search tools (BLAT/BLAST)

We introduced two sequence alignment tools, webBlat and NCBI wwwBLAST. The webBlat can be used to quickly search for homologous regions of a DNA or mRNA sequence, which can then be displayed in the browser. BLAST can find regions of local similarity between sequences, which can be used to infer functional and evolutionary relationships between sequences.

Project organizers

Yu Jiang

Northwest A&F University, Yangling, Shaanxi, China

Email: yu.jiang@nwafu.edu.cn

Chuzhao Lei

Northwest A&F University, Yangling, Shaanxi, China

Email: leichuzhao1118@nwafu.edu.cn