
1  About SheepVar

SheepVar is a comprehensive, variant-centered resource for sheep genomic variation. It integrates high-quality single-nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels) derived from large-scale genomic datasets. The current release includes whole-genome resequencing data from 5,603 modern sheep samples, from which approximately 62.71 million SNPs and 3.72 million small InDels were retained after quality control. SheepVar also incorporates genotype data from 7,013 individuals generated using the Illumina Ovine 50K BeadChip, covering approximately 41K loci, together with 123 sheep epigenomic datasets including ChIP-seq and ATAC-seq profiles, and paleogenomic data from 100 ancient sheep specimens. Collectively, these datasets provide broad representation of domestic sheep populations and wild relatives worldwide.

SheepVar provides multidimensional annotations for individual variants, including genomic context, evolutionary conservation, inferred ancestral states, population allele frequencies, molecular quantitative trait locus (QTL) associations, epigenomic features and selection signals. Selection signals are summarized using five complementary statistics: nucleotide diversity (Pi), Tajima’s D, integrated haplotype score (iHS), composite likelihood ratio (CLR) and fixation index (FST).

The platform supports flexible variant queries, integrated visualization of multi-omics data, exploration of selection signals and online genotype imputation. Users can search variants by genomic coordinates, rsIDs, gene symbols or SNP array marker IDs, and can access detailed annotations through the Variant Details page. SheepVar also provides tools for genome browsing, sequence alignment, coordinate conversion and data download.

Overall, SheepVar provides a curated and standardized resource for browsing, interpreting and prioritizing sheep genomic variants, supporting studies in sheep population genomics, functional genomics, evolutionary biology and molecular breeding.

2  SNP and small InDel calling pipeline

Raw SRA files were converted into FASTQ format using SRA Toolkit v3.2.0. Adapter sequences and low-quality bases were removed using fastp v0.12.4. The cleaned reads were then aligned to the sheep reference genome using BWA-MEM. The resulting alignments were sorted, and BAM files from the same sample were merged using SAMtools.

SNP and small InDel calling was performed using GATK v4.1.8.1. Variant discovery and genotyping were conducted with the HaplotypeCaller, GenotypeGVCFs and SelectVariants tools. To obtain a high-confidence variant set, hard-filtering criteria were applied separately for SNPs and InDels. SNPs were filtered using the following expression: QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0. InDels were filtered using: QD < 2.0 || FS > 200.0 || SOR > 10.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0. Only small InDels with lengths between 1 and 30 bp were retained.
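The hard-filtering logic above can be sketched in Python to make the thresholds easier to read. This is an illustration only, not the pipeline's actual code: the function names and the dictionary-based annotation input are hypothetical. Treating a missing annotation as "condition not triggered" is an assumption of this sketch and may not match how GATK evaluates compound expressions. InDel length is taken here as the absolute difference between the REF and ALT allele lengths, which is also an assumption.

```python
# Thresholds copied from the filter expressions in the text.
# A variant is removed if ANY rule is triggered (logical OR).
SNP_RULES = [
    ("QD", "<", 2.0),
    ("FS", ">", 60.0),
    ("MQ", "<", 40.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
    ("SOR", ">", 3.0),
]
INDEL_RULES = [
    ("QD", "<", 2.0),
    ("FS", ">", 200.0),
    ("SOR", ">", 10.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]


def is_filtered(annotations, rules):
    """Return True if any rule is triggered for this variant.

    `annotations` maps annotation names (e.g. "QD") to floats.
    Missing annotations are skipped (assumption of this sketch).
    """
    for key, op, threshold in rules:
        value = annotations.get(key)
        if value is None:
            continue
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False


def indel_length_ok(ref, alt, min_len=1, max_len=30):
    """Keep InDels whose length (|len(ref) - len(alt)|) is 1 to 30 bp."""
    return min_len <= abs(len(ref) - len(alt)) <= max_len
```

Note that the same strand-bias value can pass one rule set and fail the other: FS = 100.0 triggers the SNP rule (FS > 60.0) but not the InDel rule (FS > 200.0).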

For further quality control, only biallelic variants were retained. PLINK v1.90 was then used to retain only variants and samples with a genotyping rate greater than 90%. After quality control, 5,603 samples were retained, and genotype imputation was performed using Beagle v5.4. Finally, variants with a minor allele frequency (MAF) ≥ 0.1% were retained, resulting in a final high-confidence dataset of approximately 62.71 million SNPs and 3.72 million small InDels.
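The two quantities used in this step, genotyping rate and MAF, can be illustrated with a small Python sketch. It is not the PLINK implementation. It assumes diploid, biallelic genotypes encoded as alternate-allele counts (0, 1 or 2), with None marking a missing call; the function names are hypothetical.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls (None = missing)."""
    if not genotypes:
        return 0.0
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF for a biallelic site from alternate-allele counts (0/1/2)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever allele is rarer.
    return min(alt_freq, 1 - alt_freq)


# Thresholds taken from the text: genotyping rate > 90%, MAF >= 0.1%.
MIN_CALL_RATE = 0.90
MIN_MAF = 0.001

example = [0, 0, 1, 0, 2, 0, 0, 0, 0, None]
keep_by_call_rate = call_rate(example) > MIN_CALL_RATE
keep_by_maf = minor_allele_frequency(example) >= MIN_MAF
```

In the pipeline described above, the genotyping-rate filter is applied before imputation and the MAF filter after it, so the two checks operate on different versions of the genotype data.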

3  Overview of the Home page

1. Main navigation of the database.

2. A brief introduction to the database.

3. Fast retrieval from the database.

4. Statistics of the data in the database.

5. Updates & Related Resources.

6. Global Visitors.

4.1 General Search

The General Search module allows users to query genome-wide SNPs and small InDels in SheepVar. Variant records are organized into two data sources: WGS data and chip-based genotyping data. Users can retrieve variants by specifying genomic coordinates, rsIDs, gene symbols or SNP array marker IDs. The Advanced Search function provides additional filtering options, allowing users to refine queries in advance based on criteria such as MAF and predicted consequence type.

After submission, matching variants are displayed on the SNPs Found page. For each variant, summary information is provided, including chromosomal position, dbSNP identifier, associated gene, genomic region, MAF and predicted functional consequence. A filtering panel at the top of the results page allows users to further narrow the results and focus on variants of interest. Detailed annotations for individual variants can be accessed by clicking the arrow icon next to each entry, which directs users to the Variant Details page.


4.2 Variant Details

The Variant Details page provides an integrated view of each genetic variant in SheepVar. It brings together basic variant information, quality metrics, population allele frequencies, evolutionary annotations, regulatory evidence, molecular QTL associations and trait-related QTL information. Users can navigate through the sections below to evaluate the reliability, population distribution, evolutionary context and potential functional relevance of a variant.

Variant Information

The Variant Information section provides basic information and gene-level annotation for the selected variant. Each variant is represented by a standardized identifier in the format Chr:Pos_Ref_Alt. When available, the corresponding dbSNP rsID is also shown.
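For users who process SheepVar identifiers in scripts, the Chr:Pos_Ref_Alt format can be split into its fields as sketched below. The exact chromosome naming, the allowed allele characters and the example identifier are assumptions made for illustration; they are not taken from the SheepVar specification.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Assumed pattern: chromosome name, colon, integer position,
# then REF and ALT alleles separated by underscores.
_VARIANT_ID = re.compile(
    r"^(?P<chrom>[^:\s]+):(?P<pos>\d+)_(?P<ref>[ACGTN]+)_(?P<alt>[ACGTN]+)$"
)


def parse_variant_id(variant_id):
    """Split a 'Chr:Pos_Ref_Alt' string into a Variant tuple."""
    match = _VARIANT_ID.match(variant_id.strip())
    if match is None:
        raise ValueError(f"Not a Chr:Pos_Ref_Alt identifier: {variant_id!r}")
    return Variant(match["chrom"], int(match["pos"]), match["ref"], match["alt"])


# Hypothetical identifier, used only to show the call.
variant = parse_variant_id("1:12345_A_G")
```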

This section includes the reference and alternate alleles, minor allele, minor allele frequency, and inferred ancestral allele. The ancestral allele was inferred from multi-species genome alignments and helps users distinguish ancestral and derived allele states.

Gene and consequence annotations generated by SnpEff are also provided, including consequence type, putative impact, affected gene and transcript-level annotation. These annotations help users determine whether a variant is located in a coding region, intron, UTR, upstream or downstream region, or other genomic feature.

An embedded genome viewer is provided to display the local genomic context of the variant, including nearby genes and surrounding genomic features.


Quality Metrics

The Quality Metrics section reports variant-level quality information to help users assess the technical reliability of each variant call. The displayed metrics include read depth and standard variant-calling quality metrics, such as QUAL, QD, FS, SOR, MQ, MQRankSum and ReadPosRankSum.

These metrics provide information on read support, call confidence, strand bias, mapping quality and positional bias. Users can use this information to evaluate whether a variant is well supported by sequencing data or may require additional caution in downstream analyses.


Frequency Information

The Frequency Information section displays allele frequency patterns across sheep populations. It allows users to examine whether a variant is common, rare or enriched in specific populations.

For modern sheep populations, allele frequencies are shown across population groups. To improve robustness, frequency information is displayed only for populations with sufficient sample size, except for wild sheep populations, which are retained as comparative references.
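The display rule described above can be sketched as follows. The minimum sample size used by SheepVar is not stated here, so `min_samples=10` is a placeholder, and the population names are hypothetical. Genotypes are assumed to be alternate-allele counts (0, 1 or 2), with None for missing calls.

```python
def population_frequencies(genotypes_by_pop, wild_pops=frozenset(), min_samples=10):
    """Alternate-allele frequency per population.

    Populations with fewer than `min_samples` called genotypes are
    omitted, unless they are listed in `wild_pops`.
    """
    frequencies = {}
    for pop, genotypes in genotypes_by_pop.items():
        called = [g for g in genotypes if g is not None]
        if not called:
            continue  # nothing to report for this population
        if len(called) < min_samples and pop not in wild_pops:
            continue  # too few samples for a robust estimate
        frequencies[pop] = sum(called) / (2 * len(called))
    return frequencies


# Hypothetical data for illustration.
data = {
    "pop_a": [0, 1, 2] * 4,      # 12 samples, reported
    "pop_small": [0, 1],         # 2 samples, omitted
    "wild_a": [2, None, 2],      # wild population, always reported
}
freqs = population_frequencies(data, wild_pops={"wild_a"})
```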

Ancient sheep data are also integrated when available. Genotype information from ancient samples is displayed together with temporal and geographic information, allowing users to explore allele distribution across historical periods and geographic regions. By combining modern and ancient frequency information, users can examine population differentiation and temporal allele frequency changes at a given variant site.


Conservation Scores

The Conservation Scores section provides evolutionary conservation annotations based on multi-species sequence alignments. Two conservation metrics are displayed: phyloP and phastCons.

phyloP measures site-specific evolutionary conservation or acceleration. Positive phyloP scores indicate evolutionary conservation, whereas negative scores indicate accelerated evolution. phastCons estimates the probability that a site lies within a conserved genomic element, with scores ranging from 0 to 1.

Together, these scores provide evolutionary context for variant interpretation. Variants located at conserved sites or within conserved elements may be more likely to affect functionally constrained genomic regions.


ATAC-seq

The ATAC-seq section shows whether the selected variant overlaps an open chromatin region. ATAC-seq peaks indicate regions of accessible chromatin, where transcription factors and other regulatory proteins may bind. When a variant overlaps an ATAC-seq peak, the corresponding peak information is displayed, including genomic coordinates, tissue or cell type, and peak-related annotation. This information helps users evaluate whether the variant may lie within a potential regulatory element.
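The overlap test behind this section amounts to a point-in-interval check, sketched below. The sketch assumes that peaks are stored in BED-style coordinates (0-based start, exclusive end) and that variant positions are 1-based. The `Peak` type and its fields are hypothetical.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (BED convention, assumed)
    end: int     # 0-based, exclusive
    tissue: str


def overlapping_peaks(chrom, pos, peaks):
    """Return peaks that contain the 1-based variant position `pos`."""
    # A 1-based position p corresponds to the 0-based base p - 1,
    # so the peak contains it when start <= p - 1 < end.
    return [p for p in peaks if p.chrom == chrom and p.start < pos <= p.end]


# Hypothetical peaks for illustration.
peaks = [
    Peak("1", 100, 200, "liver"),
    Peak("1", 150, 300, "muscle"),
    Peak("2", 100, 200, "liver"),
]
hits = overlapping_peaks("1", 180, peaks)
```

A linear scan is enough for a single variant; for genome-wide queries, an interval index or sorted peaks with binary search would be a more suitable design.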


ChIP-seq

The ChIP-seq section shows whether the selected variant overlaps ChIP-seq peaks. ChIP-seq data provide regulatory context by identifying transcription factor binding sites or histone modification marks. When overlap is detected, SheepVar displays the relevant ChIP-seq peak information, including tissue or experimental context. The histone modification type, such as H3K4me1, H3K27ac or H3K4me3, can help users infer the possible regulatory state of the region containing the variant.


Molecular QTL

The Molecular QTL section links genetic variants to molecular phenotypes using cis-molQTL associations from SheepGTEx. This section covers seven classes of molecular QTLs: expression QTLs (eQTLs), splicing QTLs (sQTLs), exon expression QTLs (eeQTLs), RNA stability QTLs (stQTLs), isoform expression QTLs (isoQTLs), 3′ UTR alternative polyadenylation QTLs (3′aQTLs) and enhancer expression QTLs (enQTLs), across up to 40 sheep tissues or cell types.

For each associated variant, information such as tissue, gene, molecular trait type, effect direction and significance is provided when available. Users can select specific tissue–gene pairs to view interactive violin plots, which show differences in molecular phenotypes among genotypes at the variant site.

This section helps users evaluate whether a variant may influence gene expression, splicing, transcript usage or other molecular phenotypes.


QTL Information

The QTL Information section connects variants with trait-associated QTLs curated from Animal QTLdb. It reports whether the selected variant overlaps previously reported QTL regions associated with economically important traits in sheep.

Displayed traits may include growth, carcass characteristics, wool quality, reproduction, disease resistance and other production-related phenotypes. This information links molecular and regulatory evidence with organism-level traits, helping users evaluate the potential relevance of a variant for sheep breeding and production.

5  Genomic signature

The Genomic signature module allows users to select populations and genomic regions of interest to examine signals of selection. Results are provided separately for WGS and SNP chip datasets. For WGS data, SheepVar displays Pi, Tajima’s D, CLR, iHS and FST in non-overlapping 30 kb windows. For SNP chip data, SheepVar displays Pi, Tajima’s D, iHS and FST in non-overlapping 150 kb windows because of the lower marker density of chip-based data; CLR analysis is not provided for SNP chip datasets.
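To illustrate how a statistic is summarized in non-overlapping windows, the sketch below bins sites on one chromosome into fixed-size windows and computes a simple windowed Pi. It is not the software used to produce the SheepVar tracks. It assumes biallelic sites, 1-based positions and that every base in a window is callable, so the summed per-site diversity is divided by the full window size.

```python
from collections import defaultdict


def site_diversity(alt_count, n_chrom):
    """Per-site nucleotide diversity for a biallelic site.

    Uses 2 * p * (1 - p) * n / (n - 1), where p is the alternate
    allele frequency and n the number of sampled chromosomes.
    """
    if n_chrom < 2:
        return 0.0
    p = alt_count / n_chrom
    return 2 * p * (1 - p) * n_chrom / (n_chrom - 1)


def windowed_pi(sites, window_size=30_000):
    """Sum per-site diversity in non-overlapping windows.

    `sites` is an iterable of (pos, alt_count, n_chrom) tuples for a
    single chromosome. Returns {window_start: pi}, with 1-based
    window starts (1, 30001, 60001, ... for 30 kb windows).
    """
    totals = defaultdict(float)
    for pos, alt_count, n_chrom in sites:
        start = ((pos - 1) // window_size) * window_size + 1
        totals[start] += site_diversity(alt_count, n_chrom)
    return {start: total / window_size for start, total in sorted(totals.items())}


# Hypothetical sites: (position, alternate allele count, chromosomes sampled).
pi_by_window = windowed_pi([(1, 5, 10), (30_000, 5, 10), (30_001, 1, 10)])
```

For SNP chip data, the same function could be called with `window_size=150_000` to match the wider windows described above.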

6  Imputation

This module provides genotype imputation services for chip-based data. Imputation is performed using Beagle v5.4 with a global sheep reference panel comprising 3,125 individuals. To improve usability and computational efficiency, SheepVar provides reference panels aligned to two sheep genome assemblies, ARS-UI_Ramb_v2.0 and ARS-UI_Ramb_v4.0, as well as population-specific sub-panels, such as those for Asian or African sheep. Users can customize imputation parameters and choose chromosome-wise or whole-genome imputation according to their needs.

The online imputation service accepts genotype data in standard VCF format, as well as PLINK-compatible PED and BED formats generated by PLINK v1.90. After submission, users can monitor job status and retrieve imputation results through the Job List page. For data privacy, each user can access only the jobs and results submitted under their own account. Once the imputation process is completed, an email notification is sent to inform users that the results are available. The server supports concurrent submissions from multiple users while restricting repeated submissions from the same account.

7  Tools

7.1 Genome Browser

The Genome Browser allows users to visualize genomic features and annotation tracks based on the sheep reference genome assembly ARS-UI_Ramb_v2.0. Currently, 558 tracks are available, including SNPs, small InDels, selection signals, genotype patterns, QTLs, conservation scores, conserved elements and epigenomic regulatory tracks.

Users can search the browser using a gene symbol, transcript name or genomic region to explore genomic information in a genome-wide context. Search results and candidate regions from SheepVar can be directly linked to the Genome Browser, allowing users to examine candidate genes or loci together with multiple annotation tracks. High-quality images can be exported in PDF or PostScript format using the PDF/PS option under the View menu.

7.2 Alignment search tools: BLAT and BLAST

SheepVar provides two sequence alignment tools, webBLAT and ViroBLAST, to support sequence-based searches. webBLAT enables rapid alignment of DNA or mRNA sequences to the sheep reference genome, and matched regions can be directly displayed in the Genome Browser. ViroBLAST is used to identify local sequence similarities, helping users explore potential functional or evolutionary relationships among sequences.

7.3 Genome coordinate conversion tool: LiftOver

SheepVar also provides LiftOver for genome coordinate conversion. This tool converts genomic coordinates between different sheep genome assembly versions and can also be used to retrieve putative orthologous regions in other species. Seven LiftOver chain files are currently available in SheepVar, allowing users to perform coordinate conversion and cross-assembly comparison conveniently.
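The arithmetic behind coordinate conversion can be sketched as a lookup in a set of aligned, ungapped blocks. This is a simplified illustration, not the LiftOver algorithm: it handles only same-strand blocks, does not parse chain files, and uses a hypothetical `Block` type with 0-based, half-open coordinates.

```python
from typing import NamedTuple, Optional, Tuple


class Block(NamedTuple):
    src_chrom: str
    src_start: int   # 0-based, inclusive
    src_end: int     # 0-based, exclusive
    dst_chrom: str
    dst_start: int   # 0-based start of the matching block in the target


def lift_position(chrom, pos0, blocks) -> Optional[Tuple[str, int]]:
    """Map a 0-based source position into the target assembly.

    Returns (target_chrom, target_pos0), or None when the position
    falls in a gap between blocks and cannot be converted.
    """
    for block in blocks:
        if block.src_chrom == chrom and block.src_start <= pos0 < block.src_end:
            offset = pos0 - block.src_start
            return (block.dst_chrom, block.dst_start + offset)
    return None


# Hypothetical blocks for illustration.
blocks = [
    Block("1", 0, 1000, "1", 50),
    Block("1", 1200, 2000, "1", 1100),
]
lifted = lift_position("1", 1250, blocks)
unmapped = lift_position("1", 1100, blocks)
```

Positions that fall between blocks return None, which corresponds to coordinates that have no counterpart in the target assembly.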

8  About us

Project Organizers

Yu Jiang

Northwest A&F University, Yangling, Shaanxi, China

Email: yu.jiang@nwafu.edu.cn

Ran Li

Northwest A&F University, Yangling, Shaanxi, China

Email: ran.li@nwsuaf.edu.cn

QuanZhong Liu

Northwest A&F University, Yangling, Shaanxi, China

Email: liuqzhong@nwsuaf.edu.cn