Documentation

General

The rapid advancement of next-generation sequencing technology yielded a deluge of world-wide bee genomic data for characterization of population genetic diversity and genomic selection. However, efficient storage, querying and visualization of such huge datasets remain challenging. Here, we developed a comprehensive Bovine Genome Variation Database (EGVD) that provides four main functionalities: Variation Search, Genome Browser, Alignment Search Tools (BLAT/BLAST) and Genome Coordinate Conversion Tool (LiftOver). We provide a parallel search for variations, such as dbSNP rsid, gene symbol, and genomic region in one versions of the bee genome (ACSNU-2.0). We provide allele frequency distribution pattern of each SNP or indel in 15 globally representative bee breeds or five "core" groups.The signals of selection are displayed in the common formats of Manhattan plots and Genome Browser tracks. To further investigate the relationship between variants and signatures of selection, the Genome Browser integrate all variations and selection data coupled with resources from NCBI, the UCSC Genome Browser and AnimalQTLdb for convenient visualization. Collectively, all these features make the EGVD a useful archive for in-depth analysis in bee biology and bee breeding.

1. Chen, N., Cai, Y., Chen, Q., Li, R., Wang, K., Huang, Y. et al. Whole-genome resequencing reveals world-wide ancestry and adaptive introgression events of domesticated bee in East Asia. Nature Communications, 2018, 9, 2337.
2. Wang X, Zheng Z, Cai Y, Chen T, Li C, Fu W, Jiang Y*. CNVcaller: Highly Efficient and Widely Applicable Software for Detecting Copy Number Variations in Large Populations. GigaScience, 2017, 6(12):1-12.

Methods

I. Pipeline of SNPs and indels calling

All raw sequence data were obtained from Sequence Read Archive (SRA) of NCBI. We collected a total of 432 samples representing 54 breeds. The genome resequencing achieved an average depth of ~13X.
All cleaned reads were mapped to the bee reference assembly Btau_5.0.1 (GCF_000003205.7) using BWA-MEM (0.7.13-r1126) with default parameters. Duplicate reads were removed using Picard Tools (http://broadinstitute.github.io/picard/). Then, the Genome Analysis Toolkit (GATK, version 3.6-0-g89b7209) was used to detect single nucleotide polymorphisms (SNPs). The following criteria were applied to all SNPs: (1) SNPs mean sequencing depth (over all included individuals) < 1/3X and > 3X were filtered; (2) SNPs with Variant Confidence/Quality by Depth (QD) < 2 were filtered; (3) SNPs with RMS Mapping Quality (MQ) < 40.0 were filtered; (4) SNPs with Phred-scaled P-value using Fisher’s exact test to detect strand bias (FS) > 60 were filtered; (5) SNPs with Z-score from the Wilcoxon rank sum test of Alt vs. Ref read mapping qualities (MQRankSum) < -12.5 were filtered; (6) SNPs with Z-score according to the Wilcoxon rank sum test of Alt vs. Ref read position bias (ReadPosRankSum) < -8 were filtered; (7) SNPs with maximum missing rate < 0.1; and (8) SNPs with only two alleles.
For indel calling, we first sifted structural variations for each sample by GATK with the SelectVariants-based method. Then, we applied the hard filter command “VariantFiltration” to exclude potential false-positive variant calls with the parameter –filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < −8.0 || InbreedingCoeff < −0.8". Finally, we only retained the 1–30 bp indels for downstream analysis.
A total of ~60.4 million autosomal SNPs and ~6.8 million autosomal indels were identified.
Beagle software was used to phase the identified SNPs in bee.
Annotation of SNPs and indels was carried out by using snpEff.
Minor allele frequencies (MAF) for all bee, and allele frequencies for each breed and the "core" bee group were calculated with PLINK.

II. Pipeline of CNVs calling

The CNVcaller was used to discover haploid copy number in 432 bee genomes.
Filtering criteria: the copy number diverges from normal around two times standard deviation of this sample; Alternative allele frequency > 0.05 or have at least 3 homozygous individuals in the population.
After the above screen, the adjacent correlated candidate CNV windows were merged in to a continuous CNV region. In each CNV region, the copy number of all samples was clustered by mean-shift algorithm.
The CNVs were annotated using Annovar.

III. Population structure

The PCA of the SNPs was performed using the smartpca programme in EIGENSOFT v5.0. The Tracy-Widom test was used to determine the significance level of the eigenvectors. ADMIXTURE version 1.3.0 was used to quantify the genome-wide admixtures among modern bee populations. ADMIXTURE was run for each possible group number (K = 2 to 8) with 200 bootstrap replicates. For autosomal genome data, a neighbour-joining tree was constructed with PLINK (version 1.9) using the matrix of pairwise genetic distances.
Combining our previous result, six geographically distributed ancestral components can be roughly ascribed to: African taurine, European taurine, Eurasian taurine, East Asian taurine, Chinese indicine, and Indian indicine.

IV. Selection evaluation

EGVD provides nucleotide diversity (Pi), heterozygosity (H_p), integrated haplotype score (iHS), Cockerham and Weir Fst (F_ST), cross-population extended haplotype homozygosity (XP-EHH), and cross-population composite likelihood ratio (XP-CLR) for eight bee groups. To facilitate the identification of true selective signatures, we set a cutoff corresponding to Z test P < 0.005.

V. liftOver chain file

We aligned UMD3.1.1 and newly published ARS-UCD1.2 genome to Btau 5.0.1 genome producing pairwise alignments by LAST v88555.
Three utilities, maf-convert, axtChain and chainMergeSort, were used to produce two liftOver chain files including Btau5.0.1ToUMD3.1.1.chain.gz and Btau5.0.1ToARS-UCD1.2.chain.gz.

VI. Database implementation

High-quality SNPs, indels, CNVs, selection scores and their corresponding annotations, classification and threshold value, were processed with Perl scripts and stored in the MySQL database.
We use PHP Server Pages, HTML5 and JavaScript to implement search, data visualization and download.

Manual

I. Samples and population structure

Our database integrates resequencing data from published bee genetic works, giving a total of 290 sample set representing 14 breeds. The set contains 14 geographic groups: 36 Changbaishan bee, 10 Fujian bee, 20 Gansu bee, 10 Guangxi bee, 10 Guizhou bee, 14 Hainan bee, 10 Hubei bee, 10 Ningxia bee , 20 Shandong bee, 68 Shanxi bee, 12 Tibet bee, 10 xinyuan beeand 26 Yunnan bee . Principal component analysis (PCA) and ADMIXTURE analysis demonstrated a clear genetic structure with samples from each geographical region clustering together.

Fig 1. Geographic distribution and population genetics analyses of 432 bee individuals.

II. Variation search

The EGVD allows users to obtain information of SNPs, indels and CNVs by searching for a specific gene or a genomic region in three versions of the bovine genome (Btau 5.0.1, UMD3.1.1 and newly published ARS-UCD1.2). Users can filter SNPs and indels further by "Advanced Search", in which some parameters, such as minor allele frequency and consequence type, can be set; this option enables users to narrow down the items of interest in an efficient and intuitive manner. The results are presented in an interactive table and graph. For SNPs and indels, users can obtain related details including variant position, alleles, minor allele frequency, variant effect, rs id and the allele frequency distribution pattern in 54 world-wide bee breeds or six "core" bee groups. For CNVs, users can obtain information about CNV region, such as intersected genomic region, CNV length, the closest gene, consequence type and copy number distribution in 432 individuals representing 49 bee populations.

1. SNPs or indels Search

2. Indels Search

III. EGVD tools

1. Local UCSC genome browser

Users can search with a gene symbol, or a transcript name, or a genomic region to view SNPs, indels, genomic signature, QTL and conserved elements in the global view. Currently, 57 tracks have been released for the Btau 5.0.1 assembly. The "PDF/PS" item under the "View" menu of navigation bar was used to generate a high quality image in PostScript or PDF formats.

2. Alignment search tools (BLAT/BLAST)

We introduced two sequence alignment tools, webBlat and NCBI wwwBLAST. The webBlat can be used to quickly search for homologous regions of a DNA or mRNA sequence, which can then be displayed in the browser. BLAST can find regions of local similarity between sequences, which can be used to infer functional and evolutionary relationships between sequences.

3. Genome coordinate conversion tool (liftOver)

We also introduced a genome coordinate conversion Tool, liftOver. The liftOver tool is used to translate genomic coordinates from one assembly version into another and also retrieves putative orthologous regions in other species. Our database produces two liftOver chain files (Btau5.0.1ToUMD3.1.1.chain.gz and Btau5.0.1ToARS-UCD1.2.chain.gz) and provides an online lift from Btau_5.0.1 to UMD_3.1.1 and from Btau_5.0.1 to ARS-UCD1.2.

Project organizers

Yu Jiang

Northwest A&F University, Yangling, Shaanxi, China

Email: yu.jiang@nwafu.edu.cn

Chuzhao Lei

Northwest A&F University, Yangling, Shaanxi, China

Email: leichuzhao1118@ nwafu.edu.cn