Title

Documentation

General

The rapid advancement of next-generation sequencing technology yielded a deluge of world-wide cattle genomic data for characterization of population genetic diversity and genomic selection. However, efficient storage, querying and visualization of such huge datasets remain challenging. Here, we developed a comprehensive Bovine Genome Variation Database (BGVD) that provides six main functionalities: Gene Quick Search, Variation Search, Genomic Signature Search, Genome Browser, Alignment Search Tools (BLAT/BLAST) and Genome Coordinate Conversion Tool (LiftOver). The BGVD provides genomic variations comprising ~60.44 M SNPs, ~6.86 M indels, 76,633 CNV regions and signatures of selective sweeps in 432 world-wide modern cattle. Users can quickly retrieve variations distribution patterns of 54 globally representative cattle breeds of a given gene symbol or genomic region through three versions of the bovine genome (ARS-UCD1.2, UMD3.1.1 and Btau 5.0.1). The signals of selection are displayed in the common formats of Manhattan plots and Genome Browser tracks. To further investigate the relationship between variants and signatures of selection, the Genome Browser integrate all variations and selection data coupled with resources from NCBI, the UCSC Genome Browser and AnimalQTLdb for convenient visualization. Collectively, all these features make the BGVD a useful archive for in-depth analysis in cattle biology and cattle breeding.

Related articles:

1. Chen, N., Cai, Y., Chen, Q., Li, R., Wang, K., Huang, Y. et al. Whole-genome resequencing reveals world-wide ancestry and adaptive introgression events of domesticated cattle in East Asia. Nature Communications, 2018, 9, 2337.
2. Wang X, Zheng Z, Cai Y, Chen T, Li C, Fu W, Jiang Y*. CNVcaller: Highly Efficient and Widely Applicable Software for Detecting Copy Number Variations in Large Populations. GigaScience, 2017, 6(12):1-12.

Methods

I. Pipeline of SNPs and indels calling
  1. All raw sequence data were obtained from Sequence Read Archive (SRA) of NCBI. We collected a total of 432 samples representing 54 breeds. The genome resequencing achieved an average depth of ~13X.
  2. All cleaned reads were mapped to the cattle reference assembly Btau_5.0.1 (GCF_000003205.7) using BWA-MEM (0.7.13-r1126) with default parameters. Duplicate reads were removed using Picard Tools (http://broadinstitute.github.io/picard/). Then, the Genome Analysis Toolkit (GATK, version 3.6-0-g89b7209) was used to detect single nucleotide polymorphisms (SNPs). The following criteria were applied to all SNPs: (1) SNPs mean sequencing depth (over all included individuals) < 1/3X and > 3X were filtered; (2) SNPs with Variant Confidence/Quality by Depth (QD) < 2 were filtered; (3) SNPs with RMS Mapping Quality (MQ) < 40.0 were filtered; (4) SNPs with Phred-scaled P-value using Fisher’s exact test to detect strand bias (FS) > 60 were filtered; (5) SNPs with Z-score from the Wilcoxon rank sum test of Alt vs. Ref read mapping qualities (MQRankSum) < -12.5 were filtered; (6) SNPs with Z-score according to the Wilcoxon rank sum test of Alt vs. Ref read position bias (ReadPosRankSum) < -8 were filtered; (7) SNPs with maximum missing rate < 0.1; and (8) SNPs with only two alleles.
  3. For indel calling, we first sifted structural variations for each sample by GATK with the SelectVariants-based method. Then, we applied the hard filter command “VariantFiltration” to exclude potential false-positive variant calls with the parameter –filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < −8.0 || InbreedingCoeff < −0.8". Finally, we only retained the 1–30 bp indels for downstream analysis.
  4. A total of ~60.4 million autosomal SNPs and ~6.8 million autosomal indels were identified.
  5. Beagle software was used to phase the identified SNPs in cattle.
  6. Annotation of SNPs and indels was carried out by using snpEff.
  7. Minor allele frequencies (MAF) for all cattle, and allele frequencies for each breed and the "core" cattle group were calculated with PLINK.
II. Pipeline of CNVs calling
  1. The CNVcaller was used to discover haploid copy number in 432 cattle genomes.
  2. Filtering criteria: the copy number diverges from normal around two times standard deviation of this sample; Alternative allele frequency > 0.05 or have at least 3 homozygous individuals in the population.
  3. After the above screen, the adjacent correlated candidate CNV windows were merged in to a continuous CNV region. In each CNV region, the copy number of all samples was clustered by mean-shift algorithm.
  4. The CNVs were annotated using Annovar.
III. Population structure
  1. The PCA of the SNPs was performed using the smartpca programme in EIGENSOFT v5.0. The Tracy-Widom test was used to determine the significance level of the eigenvectors. ADMIXTURE version 1.3.0 was used to quantify the genome-wide admixtures among modern cattle populations. ADMIXTURE was run for each possible group number (K = 2 to 8) with 200 bootstrap replicates. For autosomal genome data, a neighbour-joining tree was constructed with PLINK (version 1.9) using the matrix of pairwise genetic distances.
  2. Combining our previous result, six geographically distributed ancestral components can be roughly ascribed to: African taurine, European taurine, Eurasian taurine, East Asian taurine, Chinese indicine, and Indian indicine.
IV. Selection evaluation
  1. BGVD provides nucleotide diversity (Pi), heterozygosity (Hp), integrated haplotype score (iHS), Cockerham and Weir Fst (FST), cross-population extended haplotype homozygosity (XP-EHH), and cross-population composite likelihood ratio (XP-CLR) for eight cattle groups. To facilitate the identification of true selective signatures, we set a cutoff corresponding to Z test P < 0.005.
V. liftOver chain file
  1. We aligned UMD3.1.1 and newly published ARS-UCD1.2 genome to Btau 5.0.1 genome producing pairwise alignments by LAST v88555.
  2. Three utilities, maf-convert, axtChain and chainMergeSort, were used to produce two liftOver chain files including Btau5.0.1ToUMD3.1.1.chain.gz and Btau5.0.1ToARS-UCD1.2.chain.gz.
VI. Database implementation
  1. High-quality SNPs, indels, CNVs, selection scores and their corresponding annotations, classification and threshold value, were processed with Perl scripts and stored in the MySQL database.
  2. We use PHP Server Pages, HTML5 and JavaScript to implement search, data visualization and download.

Manual

I. Samples and population structure

Our database integrates resequencing data from published cattle genetic works, giving a total of 432 sample set representing 54 breeds. The set contains 10 geographic groups: 108 West European cattle, 83 Central South European cattle, 9 Middle East cattle, 9 Tibetan cattle, 28 Northeast Asian cattle, 47 North and Central Chinese cattle, 33 South Chinese cattle, 24 Indo-Pakistani cattle and 70 African cattle. Principal component analysis (PCA) and ADMIXTURE analysis demonstrated a clear genetic structure with samples from each geographical region clustering together. Six geographically distributed ancestral components can be roughly ascribed to: African taurine, European taurine, Eurasian taurine, East Asian taurine, Chinese indicine, and Indian indicine.

 

Fig 1. Geographic distribution and population genetics analyses of 432 cattle individuals.

 
II. Gene quick search

We integrated information from NCBI, AmiGO 2 and KEGG. Users can input a gene symbol to view basic gene information (e.g., genomic location, transcript and protein sequence, GO ID and GO terms, and relevant KEGG pathways), gene variation information (e.g., SNPs, Indels, and CNVs), and gene selective signatures (e.g., FST, XP-CLR, XP-EHH, Pi, Hp, iHS). We also provide links to Gbrowse and external databases (NCBI, AmiGO 2, and KEGG) to help the user obtain more information, such as gene/mRNA/protein sequence, KEGG Orthology (KO), and motif.

III. Variation search

The BGVD allows users to obtain information of SNPs, indels and CNVs by searching for a specific gene or a genomic region in three versions of the bovine genome (Btau 5.0.1, UMD3.1.1 and newly published ARS-UCD1.2). Users can filter SNPs and indels further by "Advanced Search", in which some parameters, such as minor allele frequency and consequence type, can be set; this option enables users to narrow down the items of interest in an efficient and intuitive manner. The results are presented in an interactive table and graph. For SNPs and indels, users can obtain related details including variant position, alleles, minor allele frequency, variant effect, rs id and the allele frequency distribution pattern in 54 world-wide cattle breeds or six "core" cattle groups. For CNVs, users can obtain information about CNV region, such as intersected genomic region, CNV length, the closest gene, consequence type and copy number distribution in 432 individuals representing 49 cattle populations.

1. SNPs or indels Search

2. CNVs Search

 
IV. Signature search

Users can select a specific gene symbol or genomic region, one of the statistical methods (Pi, Hp, iHS, FST, XP-CLR, XP-EHH), and a specific "core" cattle group to view the selection scores. In our database, the selection scores are pre-processed by several algorithms (Z-transform, logarithm) which are commonly used in published papers. The results are retrieved in a tabular format. When users click the "show" button on the table, selective signals are displayed in Manhattan plots or common graphics, where the target region or gene is highlighted in red/blue colour.

 
V. BGVD tools

1. Local UCSC genome browser

Users can search with a gene symbol, or a transcript name, or a genomic region to view SNPs, indels, CNVs, genomic signature, QTL and conserved elements in the global view. Currently, 57 tracks have been released for the Btau 5.0.1 assembly. The "PDF/PS" item under the "View" menu of navigation bar was used to generate a high quality image in PostScript or PDF formats.

2. Alignment search tools (BLAT/BLAST)

We introduced two sequence alignment tools, webBlat and NCBI wwwBLAST. The webBlat can be used to quickly search for homologous regions of a DNA or mRNA sequence, which can then be displayed in the browser. BLAST can find regions of local similarity between sequences, which can be used to infer functional and evolutionary relationships between sequences.

3. Genome coordinate conversion tool (liftOver)

We also introduced a genome coordinate conversion Tool, liftOver. The liftOver tool is used to translate genomic coordinates from one assembly version into another and also retrieves putative orthologous regions in other species. Our database produces two liftOver chain files (Btau5.0.1ToUMD3.1.1.chain.gz and Btau5.0.1ToARS-UCD1.2.chain.gz) and provides an online lift from Btau_5.0.1 to UMD_3.1.1 and from Btau_5.0.1 to ARS-UCD1.2.

Project organizers

Yu Jiang

Northwest A&F University, Yangling, Shaanxi, China

Email: yu.jiang@nwafu.edu.cn

Chuzhao Lei

Northwest A&F University, Yangling, Shaanxi, China

Email: leichuzhao1118@ nwafu.edu.cn