Title

1  About SheepVar

SheepVar is a comprehensive public repository of sheep genomic variation that integrates single nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels) derived from large-scale genomic datasets. The database incorporates whole-genome resequencing data from 5,603 modern sheep samples, comprising 53.7M SNPs and 3.7M InDels, as well as genotype data from 7,013 individuals profiled using the Illumina Ovine 50k BeadChip, covering approximately 41k loci. In addition, SheepVar includes 75 ChIP-seq and 75 ATAC-seq datasets, together with paleogenomic data from 100 ancient sheep specimens. Collectively, these datasets represent 135 domestic sheep breeds and seven wild relatives. SheepVar provides extensive variant annotations, including genomic context, evolutionary conservation, inferred ancestral states, population-specific allele frequencies, molecular quantitative trait loci (QTLs), epigenetic features, and signatures of selection. Selection signals are characterized using five complementary statistical methods: nucleotide diversity (π), Tajima’s D, integrated haplotype score (iHS), composite likelihood ratio (CLR), and fixation index (FST).

The platform supports multiple analytical functions, including flexible variant querying, multi-omics data integration across whole-genome sequencing (WGS), RNA-seq, ChIP-seq, and ATAC-seq datasets, and online genotype imputation services for both SNP array data and low-coverage whole-genome sequencing data.

Overall, SheepVar provides a curated and standardized variant resource that serves as a foundational reference for ovine genomics research and facilitates studies in sheep genetics, evolutionary biology, and precision breeding.

2  Pipeline of SNPs and InDels calling

Raw SRA data were first converted into FASTQ files using the SRA Toolkit (v.3.2.0). The reads were then trimmed with fastp (v.0.12.4) to remove adapter sequences and low-quality bases. The cleaned reads were aligned to the reference genomes using BWA-MEM, and the aligned reads for each sample were merged into a single BAM file and sorted using SAMtools.

Variant calling and genotyping for SNPs and InDels were performed using the “HaplotypeCaller,” “GenotypeGVCFs,” and “SelectVariants” tools within the Genome Analysis Toolkit (GATK, v.4.1.8.1). To ensure high accuracy, the following hard-filtering expressions were applied: for SNPs, “QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0”; for InDels, “QD < 2.0 || FS > 200.0 || SOR > 10.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0”. Only InDels with lengths between 1 and 30 bp were retained.
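The hard-filtering expressions above flag a site when any one of the listed conditions is true. A minimal Python sketch of that logic follows. It assumes the annotations of a site are available as a dictionary keyed by annotation name. The function names `fails_hard_filter` and `indel_length_ok` are hypothetical, skipping missing annotations is an assumption of this sketch, and InDel length is taken here as the difference between the REF and ALT allele lengths. In the actual pipeline this step is carried out with GATK.

```python
import operator

# Thresholds copied from the hard-filtering expressions in the text.
# A site is flagged when ANY listed condition is true.
SNP_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
    ("SOR", operator.gt, 3.0),
]
INDEL_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 200.0),
    ("SOR", operator.gt, 10.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]


def fails_hard_filter(info, is_indel=False):
    """Return True if the site meets any hard-filter condition."""
    rules = INDEL_FILTERS if is_indel else SNP_FILTERS
    for key, compare, threshold in rules:
        value = info.get(key)
        # Assumption of this sketch: a missing annotation does not fail the site.
        if value is not None and compare(value, threshold):
            return True
    return False


def indel_length_ok(ref, alt, min_len=1, max_len=30):
    """Keep InDels whose length (REF/ALT length difference) is 1 to 30 bp."""
    return min_len <= abs(len(ref) - len(alt)) <= max_len
```

For example, a site with FS = 100 would be flagged as a SNP (threshold 60) but retained as an InDel (threshold 200).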

Subsequently, only biallelic variants were kept for stricter quality control using PLINK (v.1.90b4.6) with the following criteria: SNP call rate > 90% and sample call rate > 90%. After removing duplicate or closely related individuals (pi-hat ≥ 0.5), the remaining SNPs and samples were used for genotype imputation with Beagle (v.5.4). A total of 5,603 samples were retained for imputation. Finally, variants with a minor allele frequency (MAF) ≥ 0.1% were kept, resulting in a final dataset comprising 53.7M SNPs and 3.7M InDels.
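The call-rate and MAF thresholds above can be illustrated with a short Python sketch. It assumes biallelic genotypes coded as alternate-allele counts (0, 1, or 2) with `None` for a missing call. This coding and all function names are illustrative only. In the actual pipeline these filters are applied with PLINK, the call-rate filter precedes imputation, and the MAF filter follows it. The relatedness filter and the imputation step are not shown.

```python
def call_rate(genotypes):
    """Fraction of non-missing calls; missing calls are None."""
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from biallelic genotypes coded as alt-allele counts 0/1/2."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_call_rate(genotypes, threshold=0.90):
    """Call rate must be strictly greater than 90%."""
    return call_rate(genotypes) > threshold


def passes_maf(genotypes, threshold=0.001):
    """MAF must be at least 0.1%."""
    return minor_allele_frequency(genotypes) >= threshold
```

The same `call_rate` function applies to a SNP (genotypes across samples) and to a sample (genotypes across SNPs), since both criteria use a 90% threshold.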

3  Overview of the Home page

1. Main navigation of the database.

2. A brief introduction to the database.

3. Fast Retrieval of Database.

4. Statistics of the data in the database.

5. Updates & Related Resources.

6. Global Visitors.

4.1 General search

This module allows users to search genome-wide information on SNPs and InDels. Searches are organized into two categories: chip-based genotyping data and whole-genome sequencing (WGS) data. Users can retrieve variants by rsID, gene symbol, or the chromosomal coordinates of a genomic region of interest. The Advanced Search function provides additional filters, such as minor allele frequency (MAF) and variant consequence type, so that results can be refined before the query is run.

The results page lists the matching variants. On the SNPs Found page, general information is provided for each variant, including chromosomal position, dbSNP identifier, associated gene and genomic region, minor allele frequency (MAF), and predicted functional effect. A filtering panel at the top of the page allows users to narrow the results further and focus on variants of interest. For more detail, users can click the arrow icon next to an entry to open the Variant Details page.

4.2 Variant Details

The Variant Details page presents the following information and interactive features, organized into the sections described below.

Variant Information: This section serves as the primary entry point for a genetic variant and presents its essential genomic attributes. It displays the chromosomal location (chromosome and coordinate), unique variant identifier, and the observed allelic change, including reference and alternate alleles. In addition, the inferred ancestral allele, derived from multi-species genome alignments, is provided. This information enables users to determine whether a variant represents a derived mutation or an ancestral state, offering immediate evolutionary context.
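The way an ancestral allele gives a variant its evolutionary direction can be shown with a small Python sketch. The function name `derived_allele` is hypothetical, and the rule used here (the derived allele is whichever of the two alleles does not match the ancestral one) is a simplified illustration for biallelic sites, not a description of how SheepVar infers ancestral states.

```python
def derived_allele(ref, alt, ancestral):
    """Return the derived allele of a biallelic site, or None if undetermined.

    The site cannot be polarized when the ancestral allele is unknown or
    matches neither the reference nor the alternate allele.
    """
    if ancestral is None:
        return None
    anc = ancestral.upper()
    if anc == ref.upper():
        return alt  # the reference allele is ancestral, so ALT is derived
    if anc == alt.upper():
        return ref  # the alternate allele is ancestral, so REF is derived
    return None
```

For a site with reference A and alternate G, an ancestral allele of A marks G as the derived mutation, whereas an ancestral allele of G means the reference allele itself is derived.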

Quality Metrics: This module evaluates the technical reliability of variant calls, helping distinguish true genetic variants from potential sequencing or alignment artifacts. It reports key quality-related metrics, including read depth (site coverage) and multiple variant quality scores such as QUAL, QD, FS, SOR, MQ, MQRankSum, and ReadPosRankSum. Together, these metrics describe call confidence, strand bias, and mapping quality, allowing users to judge the reliability of each variant call.

Frequency Information: This section provides population-level insights into the distribution of the variant. It displays allele frequencies across diverse modern sheep breeds, indicating whether a variant is breed-specific, common, or rare. In addition, ancient DNA data are integrated to show inferred genotypes at the same genomic position in historical sheep samples, visualized on an interactive timeline and geographic map. This combined information facilitates the investigation of allele frequency changes over time and the identification of potential signatures of natural or artificial selection.
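Population-specific allele frequencies of the kind shown in this section can be computed as in the following Python sketch. It assumes diploid genotypes coded as alternate-allele counts (0, 1, or 2) with `None` for missing calls. The function name and the breed label "BreedB" are hypothetical, and the sketch does not reflect SheepVar's internal implementation.

```python
from collections import defaultdict


def allele_frequency_by_breed(samples):
    """Alternate-allele frequency per breed.

    samples: iterable of (breed, alt_allele_count) pairs, where the count
    is 0, 1, or 2, or None for a missing genotype.
    """
    alt = defaultdict(int)
    total = defaultdict(int)
    for breed, genotype in samples:
        if genotype is None:
            continue  # missing calls contribute no alleles
        alt[breed] += genotype
        total[breed] += 2  # each called diploid genotype carries two alleles
    return {breed: alt[breed] / total[breed] for breed in total}


# Illustrative data only; "BreedB" is a placeholder label.
example = [("Hu", 0), ("Hu", 1), ("Hu", 2), ("BreedB", 0), ("BreedB", None), ("BreedB", 0)]
frequencies = allele_frequency_by_breed(example)
```

A variant with a high frequency in one breed and a frequency near zero in all others would appear as breed-specific in this view.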

Conservation Scores: To assess the evolutionary and functional importance of the genomic region containing the variant, this module presents conservation scores such as phyloP and phastCons. These scores are calculated from multi-species sequence alignments and reflect the degree of evolutionary constraint at a given site. Variants located in highly conserved regions are more likely to affect function and may have deleterious or regulatory consequences.

ATAC-seq & ChIP-seq: These modules provide epigenomic context for variant interpretation. The ATAC-seq and ChIP-seq tracks indicate whether a variant overlaps regions of open chromatin or regulatory binding sites, such as promoters, enhancers, transcription factor binding sites, or histone modification marks (e.g., H3K27ac and H3K4me3). When a variant falls within an ATAC-seq or ChIP-seq peak, the corresponding peak information is displayed in this section.

QTL Mapping: This section links genetic variants to molecular phenotypes by displaying their associations with cis-acting quantitative trait loci (QTLs), covering seven distinct types of molecular traits, including expression QTLs (eQTLs) and splicing QTLs (sQTLs), across up to 40 sheep tissues or cell types. Users can select specific tissue–gene pairs to view interactive violin plots that compare the distributions of molecular traits (such as gene expression levels) among different genotypes at the variant site, providing an intuitive visualization of variant effects.

QTL Information: This module connects variants to previously reported, large-effect QTLs associated with economically important traits in sheep, curated from Animal QTLdb. It lists overlapping QTLs related to traits including growth, carcass characteristics, wool quality, reproduction, and disease resistance. This information helps bridge molecular-level functional evidence with phenotypic variation relevant to breeding and production.

5  Genomic signature

This module allows users to select regions and populations of interest and examine selection signals, separately for chip data and WGS data. For WGS data, five metrics (π, Tajima’s D, iHS, CLR, and FST) are displayed in non-overlapping 30 kb windows. For chip data, the same selection signal metrics are presented; however, because chip-based sites are sparse, they are displayed in non-overlapping 150 kb windows.
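The non-overlapping windowing described above can be sketched in Python as follows. The window coordinate convention (1-based, inclusive, first window starting at position 1) and both function names are assumptions made for illustration. Averaging a per-site value stands in for the actual selection statistics, each of which has its own estimator.

```python
from collections import defaultdict


def window_of(pos, size=30_000):
    """Return the 1-based inclusive (start, end) of the window containing pos."""
    index = (pos - 1) // size
    start = index * size + 1
    return start, start + size - 1


def windowed_mean(sites, size=30_000):
    """Average a per-site value within each non-overlapping window.

    sites: iterable of (position, value) pairs on one chromosome.
    Returns a dict mapping (start, end) to the mean value in that window.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, value in sites:
        window = window_of(pos, size)
        sums[window] += value
        counts[window] += 1
    return {window: sums[window] / counts[window] for window in sums}
```

Calling `window_of(pos, size=150_000)` gives the wider windows used for chip data, where fewer sites fall within any 30 kb interval.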

6  Imputation

This module provides genotype imputation services for both chip-based data and whole-genome sequencing (WGS) data. For WGS data, imputation is performed using GLIMPSE2. A global sheep reference panel is provided to support broad population coverage. In addition, to facilitate user access and improve computational efficiency, several population-specific reference panels are available, such as a Hu sheep–specific reference panel. Users may upload whole-genome BAM files or single-chromosome BAM files, enabling chromosome-by-chromosome imputation. Imputed genotypes are returned as chromosome-separated VCF files. For chip-based data, genotype imputation is conducted using IMPUTE2. Users can upload input files in PED or VCF format, and the imputed results are generated accordingly.

7  Tools

7.1 Local UCSC genome browser

Users can search by gene symbol, transcript name, or genomic region to view SNPs, InDels, genomic signatures, genotype patterns, and conserved elements in a global view. Currently, 90 tracks have been released for the sheep ARS-UI_Ramb_v2.0 assembly. The "PDF/PS" item under the "View" menu of the navigation bar can be used to generate a high-quality image in PostScript or PDF format.

7.2 Alignment search tools (BLAT/BLAST)

We provide two sequence alignment tools, webBlat and ViroBLAST. The webBlat tool can be used to quickly search for regions homologous to a DNA or mRNA sequence, which can then be displayed in the genome browser. ViroBLAST finds regions of local similarity between sequences, which can be used to infer functional and evolutionary relationships between them.

7.3 Genome coordinate conversion tool (liftOver)

We also provide a genome coordinate conversion tool, liftOver. The liftOver tool translates genomic coordinates from one assembly version to another and can also retrieve putative orthologous regions in other species. Our database provides six liftOver chain files.
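The idea behind chain-based coordinate conversion can be illustrated with a simplified Python sketch. It assumes a single forward-strand chain whose ungapped alignment blocks are given as (size, dt, dq) triples, as in the UCSC chain format, with 0-based positions. The function name `lift_position` is hypothetical. This is a sketch of the principle only, not the liftOver implementation, which also handles multiple chains, reverse-strand alignments, and interval inputs.

```python
def lift_position(pos, t_start, q_start, blocks):
    """Map a 0-based position from the source assembly to the destination.

    t_start, q_start: start of the chain on the source and destination.
    blocks: list of (size, dt, dq), where size is the length of an ungapped
            aligned block and dt / dq are the gaps that follow it on the
            source and destination assemblies.
    Returns the lifted position, or None if pos falls outside every block.
    """
    t, q = t_start, q_start
    for size, dt, dq in blocks:
        if t <= pos < t + size:
            return q + (pos - t)  # same offset within the aligned block
        t += size + dt
        q += size + dq
    return None
```

A position that falls in a gap between blocks has no counterpart in the destination assembly, which is why some coordinates cannot be converted.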

8  About us

Project Organizers

Yu Jiang

Northwest A&F University, Yangling, Shaanxi, China

Email: yu.jiang@nwafu.edu.cn