
SVLearn
Update (0.0.5)
Please cite
https://doi.org/10.1038/s41467-025-57756-z
Downloads
Requirements
2.pysam=0.22.0
3.polars=0.20.15
4.pandas=2.2.1
5.scikit-learn=1.3.0
6.pyfaidx
7.pyarrow
8.pybedtools
9.intervaltree
11.samtools>=1.17
12.sambamba>=1.0.1
2.trf=4.09
3.GenMap=1.3.0
4.BISER=1.4
Trained Models
Please select the model that matches your sequencing coverage to achieve the best genotyping results.
Cattle_24feature_RandomForest_model.joblib (2.7G)
Human_30x_24feature_RandomForest_model.joblib (1.8G)
Human_20x_24feature_RandomForest_model.joblib (1.6G)
Human_20x_18feature_RandomForest_model.joblib (1.3G)
Human_10x_18feature_RandomForest_model.joblib (1.4G)
Human_10x_24feature_RandomForest_model.joblib (627.4MB)
Human_5x_18feature_RandomForest_model.joblib (557.3MB)
Human_5x_24feature_RandomForest_model.joblib (1.2G)
Sheep_24feature_RandomForest_model.joblib (3.2G)
Demonstration
We provide one sample dataset for each of the three species (human, cattle, and sheep), which can be used for demonstration and validation.
01.download-data.zip (3.9KB)
02.SV-set_Genomes.zip (5.6GB)
03.training-dataset.zip (342.6 MB)
04.validation-dataset.zip (79.6 MB)
Documents

The CNVcaller pipeline has three main steps. First, because population sequencing data may come from different platforms, the read depth (RD) of each sample is counted and corrected individually. An original absolute copy number correction is used to convert the standard read alignments generated by BWA into multi-hit alignments, similar to the mrsFAST format. After correction and normalization, the comparable RDs of each sample are condensed into an intermediate file of about 100 Mb. This design avoids recalculating the same individual when it is used in different populations.
In the second step, CNVR detection, the RD files of all samples are piled up into a two-dimensional population RD file. Multiple criteria are applied to remove the large proportion of noise caused by low sequencing quality or assembly bias. At the individual level, the RD of a candidate CNV window should deviate significantly from the average. The piled-up candidate windows should also meet two population-level criteria: the CNV allele frequency is > 5%, and the multi-sample RDs of adjacent windows are significantly correlated.
After the candidate CNV windows are merged into CNVRs, the RDs of all samples in each CNVR are clustered with a Gaussian mixture model to deduce the integer copy number of each individual. This step is called genotyping, as in SNP detection. The final output is compatible with most SNP-based population genetic algorithms.
- Step 1: Indexing Reference Genome
The reference genome is segmented into overlapping sliding windows. The windows are indexed to form a reference database used for all samples. This command creates the file referenceDB.windowsize in the current directory by default.
$ perl CNVReferenceDB.pl <ref>
Required arguments
  <ref>  Reference sequence
Optional arguments
  -w  the window size (bp) for all samples [default=800]
  -l  the lower limit of GC content [default=0.2]
  -u  the upper limit of GC content [default=0.7]
  -g  the upper limit of gap content [default=0.5]
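As a rough illustration of the windowing idea, the following Python sketch splits a sequence into overlapping windows and keeps those within the GC and gap limits. It is not the CNVReferenceDB.pl implementation: the function name, the step of half a window, and the choice to compute GC content over non-gap bases only are assumptions made for this example. Only the default values (800, 0.2, 0.7, 0.5) come from the options listed above.

```python
def sliding_windows(seq, window=800, step=400,
                    gc_min=0.2, gc_max=0.7, gap_max=0.5):
    """Yield (start, end, gc, gap) for each window that passes the filters.

    Assumptions (not taken from CNVcaller): the step is half the window,
    gaps are 'N' bases, and GC content is computed over non-gap bases.
    """
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window].upper()
        n_gap = chunk.count("N")
        gap = n_gap / window
        called = window - n_gap
        gc = (chunk.count("G") + chunk.count("C")) / called if called else 0.0
        if gap <= gap_max and gc_min <= gc <= gc_max:
            yield start, start + window, gc, gap


if __name__ == "__main__":
    # Toy sequence: 800 gap bases followed by 800 bp of balanced sequence.
    toy = "N" * 800 + "ACGT" * 200
    for start, end, gc, gap in sliding_windows(toy):
        print(start, end, round(gc, 2), round(gap, 2))
```

In this toy run the first window is all gap and is dropped, while the windows starting at 400 and 800 are kept.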
Argument details:
-w
We recommend a 400-1000 bp window size for sequencing data with >10X coverage and a 1000-2000 bp window size for data with <10X coverage. Increasing the window size reduces noise at the cost of sensitivity.
- Step 2: Individual RD processing
Count the reads in each window across the genome from the BAM file and generate a comparable read depth (RD) file for each individual. The file referenceDB.windowsize must be placed in the current directory.
Three default directories, RD_raw, RD_absolute and RD_normalized, will be created in the current directory in that order. They contain the raw read depth, the read depth after absolute copy number correction and the final GC-corrected normalized read depth of each sample. The name of the normalized RD file indicates the average RD (mean), the standard deviation of the RD (SD) and the gender (1=XX/ZZ, 2=XY/ZW) of the sample. The final read depths are normalized to one.
This step consumes about 500 MB for each individual, and multiple tasks can be run in parallel. The shell script Individual.Process.sh is provided to complete these procedures.
$ bash Individual.Process.sh -b <bam> -h <header> -d <dup> -s <sex_chromosome>
Required arguments
  -b|--bam     alignment file in BAM format
  -h|--header  header of the BAM file, used as the prefix of the output file
  -d|--dup     duplicated window record file used for absolute copy number correction
  -s|--sex     the name of the sex chromosome
Argument details
-d
The duplicated window record file. We provide duplicated window record files for different species, such as human, goat, sheep, pig, cattle, chicken, maize, wheat, and soybean. If you work with other organisms, you will need to create a duplicated window record file in order to use the absolute copy number correction function of CNVcaller. Follow the instructions.
-s
The gender of the individual is determined by the ratio of the RD of the given sex chromosome to the RD of the autosomes. Give the name of the X or Z chromosome for XY or ZW genomes.
Example: convert ERR340328.bam to normalized copy number using a 1000 bp window size.
bash Individual.Process.sh -b ERR340328.bam -h ERR340328 -d link -s X
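Because the normalized RD file name carries the mean, SD and sex code, a downstream script can recover them without opening the file. The following is a minimal sketch, assuming names follow the pattern seen in the Step 3 example list (for instance ERR340328_mean_70.81_SD_10.84_sex_1). The function name and the layout of the returned dictionary are hypothetical and not part of CNVcaller.

```python
import os
import re

# Pattern assumed from the example file names, e.g.
# RD_normalized/ERR340328_mean_70.81_SD_10.84_sex_1
_RD_NAME = re.compile(
    r"^(?P<sample>.+)_mean_(?P<mean>\d+(?:\.\d+)?)"
    r"_SD_(?P<sd>\d+(?:\.\d+)?)_sex_(?P<sex>[12])$"
)


def parse_rd_name(path):
    """Split a normalized RD file name into sample, mean, SD and sex code."""
    match = _RD_NAME.match(os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected normalized RD file name: {path}")
    return {
        "sample": match["sample"],
        "mean": float(match["mean"]),
        "sd": float(match["sd"]),
        "sex": int(match["sex"]),  # 1=XX/ZZ, 2=XY/ZW
    }


if __name__ == "__main__":
    print(parse_rd_name("RD_normalized/ERR340328_mean_70.81_SD_10.84_sex_1"))
```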
- Step 3: CNVR detection
The normalized RD files of all samples are piled up into a two-dimensional population RD file. The integrated CNVRs are detected by scanning the population RD file for windows with aberrant RD, sufficient CNV allele frequency and significant correlation with adjacent windows. Adjacent candidate windows showing high correlation are further merged.
$ bash CNV.Discovery.sh -l <RDFileList> -e <excludedFileList> -f <frequency> -h <homozygous> -r <pearsonCorrelation> -p <primaryCNVR> -m <mergedCNVR>
Required arguments
  -l|--RDFileList          individual normalized read depth file list
  -e|--excludedFileList    list of samples excluded from CNVR detection
  -f|--frequency           minimum frequency of gain/loss individuals for candidate CNV window definition [recommend 0.1]
  -h|--homozygous          number of homozygous gain/loss individuals for candidate CNV window definition [recommend 3]
  -r|--pearsonCorrelation  minimum Pearson’s correlation coefficient between two adjacent non-overlapping windows
                             0.5 for sample size (0, 30]
                             0.4 for sample size (30, 50]
                             0.3 for sample size (50, 100]
                             0.2 for sample size (100, 200]
                             0.15 for sample size (200, 500]
                             0.1 for sample size (500,+∞)
  -p|--primaryCNVR         primary CNVR result
  -m|--mergedCNVR          merged CNVR result
Argument details
-e
The samples in this list will be excluded from CNVR detection, and their copy numbers are deduced from the CNVR boundaries defined by the other samples. This option is suited to outgroups or to precious samples of poor quality. An empty file means all individuals are included in CNVR detection.
-f/-h
Windows satisfying either of these two conditions are selected as candidate CNV windows.
-r
Adjacent windows with significant correlation are merged into one call. The recommended value is significant at the p=0.01 level. Raising this value increases detection accuracy at the cost of sensitivity.
Example: run CNV.Discovery.sh on all your individual normalized RD files to discover CNVs. An example of a normalized read depth file list (-l):
RD_normalized/ERR340328_mean_70.81_SD_10.84_sex_1
RD_normalized/ERR340329_mean_62.00_SD_10.52_sex_1
RD_normalized/ERR340330_mean_135.66_SD_13.96_sex_1
RD_normalized/ERR340331_mean_128.76_SD_15.27_sex_1
RD_normalized/ERR340333_mean_69.30_SD_10.19_sex_1
RD_normalized/ERR340334_mean_132.30_SD_14.59_sex_1
RD_normalized/ERR340335_mean_73.50_SD_10.16_sex_1
RD_normalized/ERR340336_mean_72.52_SD_10.03_sex_1
RD_normalized/ERR340338_mean_124.12_SD_13.24_sex_1
RD_normalized/ERR340340_mean_131.00_SD_14.74_sex_1
bash CNV.Discovery.sh -l list -e exclude_list -f 0.1 -h 3 -r 0.5 -p primaryCNVR -m mergeCNVR
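The recommended -r value depends on the sample size, so it can be convenient to look it up rather than read it off the table each time. The sketch below encodes the sample-size ranges listed for -r|--pearsonCorrelation. The function name is hypothetical and the helper is not shipped with CNVcaller.

```python
def recommended_r(n_samples):
    """Return the recommended minimum Pearson correlation for a sample size.

    The thresholds are copied from the -r|--pearsonCorrelation table;
    each interval is open on the left and closed on the right.
    """
    if n_samples <= 0:
        raise ValueError("sample size must be positive")
    table = [(30, 0.5), (50, 0.4), (100, 0.3), (200, 0.2), (500, 0.15)]
    for upper, r in table:
        if n_samples <= upper:
            return r
    return 0.1  # more than 500 samples


if __name__ == "__main__":
    # The example list above has 10 samples, which matches "-r 0.5".
    print(recommended_r(10))
```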
- Step 4: Genotyping
Cluster the input samples into genotypes using Gaussian mixture models. The output contains a genotype VCF, a VCF-format file containing the input site descriptions, additional site-specific information and a called genotype for each input sample.
$ python Genotype.py --cnvfile <input> --outprefix <outfile prefix>
Required arguments:
  --cnvfile    merged CNVR file
  --outprefix  prefix of the output files
Optional arguments:
  --nproc      number of processes to use [default=1]
Example
python Genotype.py --cnvfile mergeCNVR --outprefix Genotype
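To give a feel for how a Gaussian mixture can turn normalized read depths into integer copy numbers, here is a deliberately simplified sketch. It is not the algorithm in Genotype.py. It assumes that a normalized depth of 1.0 corresponds to two copies, fixes the component means at multiples of 0.5, shares one standard deviation across components, and fits only the mixing weights and that standard deviation by EM. The function name and all parameters are hypothetical.

```python
import math


def genotype_cnvr(depths, max_copy=4, n_iter=30, min_sd=0.05):
    """Assign an integer copy number to each normalized read depth.

    Assumption: component k has its mean fixed at k / 2, so a
    normalized depth of 1.0 is read as two copies.
    """
    if not depths:
        return []
    means = [k / 2 for k in range(max_copy + 1)]
    comps = range(len(means))
    weights = [1.0 / len(means)] * len(means)
    sd = 0.1  # shared standard deviation, starting guess
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in depths:
            dens = [w * math.exp(-0.5 * ((x - m) / sd) ** 2)
                    for w, m in zip(weights, means)]
            total = sum(dens)
            if total == 0.0:
                # Far from every mean: fall back to the nearest component.
                nearest = min(comps, key=lambda k: abs(x - means[k]))
                resp.append([1.0 if k == nearest else 0.0 for k in comps])
            else:
                resp.append([d / total for d in dens])
        # M-step: update the mixing weights and the shared variance.
        n = len(depths)
        weights = [sum(r[k] for r in resp) / n for k in comps]
        var = sum(r[k] * (x - means[k]) ** 2
                  for r, x in zip(resp, depths) for k in comps) / n
        sd = max(math.sqrt(var), min_sd)
    # Hard call: the component with the highest responsibility.
    return [max(comps, key=lambda k: r[k]) for r in resp]


if __name__ == "__main__":
    toy = [0.48, 0.52, 0.50, 0.98, 1.02, 1.00, 1.05, 0.95, 1.50, 1.47]
    print(genotype_cnvr(toy))
```

Fixing the means keeps the component labels tied to copy numbers, which a fully free mixture does not guarantee. A real genotyper would also need to handle sex chromosomes and noisy clusters.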
Questions, bug reports and suggestions can be sent by email to:
yu.jiang@nwafu.edu.cn
yangqimeng99@163.com