RGD

01. Whole Genome Synteny
02. Gene Annotation Improvement
03. GO/Pathway Annotation
04. Conservation Scores
05. RNA-seq Analysis Pipeline
06. Epigenome Analysis Pipeline

01. Whole Genome Synteny

We built whole-genome alignments of 110 species (78 ruminants and 32 outgroup species) with LAST v1061 and MULTIZ, using the assembly of cattle (ARS-UCD1.2_Btau5.0.1Y), sheep (Oar_rambouillet_v1_0_addY), and goat (ARS1) as the reference respectively.

step 1: lastdb -P0 -uMAM4 -R00 ref_db/ref ref.fa [In mammalian scale]
        lastdb -P0 -uMAM8 -R00 ref_db/ref ref.fa [In vertebrate scale]
step 2: last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 ref_db/ref query.fa > query.mat
        lastal -m100 -E0.05 -C2 -P20 -p query.mat ref_db/ref query.fa | last-split -m1 > query.maf
step 3: maf-swap query.maf | awk '/^s/ {$2 = (++s % 2 ? "query." : "ref.") $2} 1' | last-split -m1 | maf-swap | grep -v "^p" > query.one.maf
step 4: perl -e 'open(IN,"$ARGV[0]"); my $Sum_len_ref; my $Sum_len_query; while(<IN>){chomp; if($_=~/^a score/){ my $first = <IN>;
        my $second = <IN>; my $len_ref=(split /\s+/,$first)[3]; my $len_query=(split /\s+/,$second)[3]; $Sum_len_ref+=$len_ref;
        $Sum_len_query+=$len_query; } }  print "Target_len:$Sum_len_ref\nQuery_len:$Sum_len_query\n";' query.one.maf > Target.Query.alignLen
        faSize query.fa > query.fa.size
        perl -e 'open(IN1,"$ARGV[0]"); open(IN2,"$ARGV[1]"); <IN1>; my $queryLen; while(<IN1>){chomp; $queryLen=(split/:/, $_)[1]; last;}
        my $genomeSize; while(<IN2>){chomp; $genomeSize=(split/\s+/,$_)[0]; last;} print "queryLen:",$queryLen,"\n", "genomeSize:",$genomeSize,"\n",
        "Ratio:",$queryLen/$genomeSize,"\n";' Target.Query.alignLen query.fa.size > query.stat
step 5: multiz query1.one.maf query2.one.maf 0 all > query1.query2.mutiz.maf

02. Gene Annotation Improvement

We aligned goat whole genome (ARS1) to the sheep, cattle, and human genome respectively to generate pairwise alignments. Then gene coordinates of these three species were converted to the goat genome using LiftOver tool to improve the goat gene annotation.

step 1: lastdb -P0 -uMAM4 -R00 ref_db/ref ref.fa
step 2: last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 ref_db/ref query.fa > query.mat
        lastal -m100 -E0.05 -C2 -P20 -p query.mat ref_db/ref query.fa | last-split -m1 > query.maf
step 3: maf-swap query.maf | last-split -m1 | maf-swap | last-split | maf-sort > query.one.maf
step 4: maf-convert psl query.LAST.maf > query.psl
step 5: axtChain -linearGap=loose -psl query.psl ref.2bit query.2bit refToQuery.chain [The chain file has the old genome as the target and the new genome as the query]
step 6: liftOver -minMatch=0.9 -genePred ref.gene.genePred refToQuery.chain query.gene.map query.gene.unMapped (for sheep and cattle)
        liftOver -minMatch=0.2 -minBlocks=0.5 -genePred ref.gene.genePred refToQuery.chain query.gene.map query.gene.unMapped (for human)

03. Orthologous Gene GO/Pathway Annotation

A synteny alignment approach was applied to identify the orthologous genes of each ruminant genome. The consensus mRNA and coding sequences of each species were extracted based on the NCBI gene annotation file (requiring Mysql database). Then Gene Ontology (GO), KEGG pathways and wikiPathways of cattle (with well-annotated gene annotation) were reciprocal mapping back to the orthologous genes to obtain ruminant orthologous GO and pathway annotation.

step 1: mafGene Cattle 110speciesMultiz -useFile genePred species.list 110species.orthologous.mRNA.output -noTrans
step 2: mafGene Cattle 110speciesMultiz -useFile genePred species.list 110species.orthologous.protein.output

04. Conservation Scores

Both phastCons and phyloP conservation scores have been computed separately for three groups of organisms: Bovidae (39 species), Pecora (54 species) and Ruminatia (55 species). And conserved elements and conservation scores of mammalian and vertebrate scales were converted from human data of the UCSC Genome Browser website.

[ Generate non-conservative and conservative models ]

step 1: phyloFit --tree tree.nwk --msa-format FASTA --out-root nonconserved-4d.mod 4DTV.fasta
step 2: msa_split chr.maf --in-format MAF --windows 100000,0 --out-root split_maf/chr --out-format SS --min-informative 1000 --between-blocks 5000
step 3: phastCons --estimate-rho window.ss --no-post-probs window.ss nonconserved-4d.mod.mod
step 4: phyloBoot --read-mods window.cons.list --output-average Average.chr.cons.mod
        window.cons.list: 6.10000001-10100000.ss.cons.mod,6.1000001-1100000.ss.cons.mod,6.100001-200000.ss.cons.mod,6.100099918-100200000.ss.cons.mod,6.100200001-100300000.ss.cons.mod ...

[ phastCons ]

phastCons --most-conserved phastCons.chr.bed --score chr.maf Average.chr.cons.mod,nonconserved-4d.mod.mod > phastCons.chr.wig

[ phyloP ]

phyloP --wig-scores --mode CONACC --method LRT nonconserved-4d.mod.mod chr.maf > phyloP.chr.wig

[ Convert human conservation scores and conserved elements to ruminant genomes ]

[Convert human conservation scores]
$1="hg38.phastCons100way" or "hg38.phastCons30way" or "hg38.phyloP100way" or "hg38.phyloP30way"
bigWigToBedGraph $1.bw $1.bedGraph
liftOver -minMatch=0.2 $1.bedGraph Hg38_To_ruminant.over.chain.gz $1.map.bedGraph $1.unMap.bedGraph

[Convert human conserved elements]
$1="phastConsElements100way" or "phastConsElements30way"
liftOver -minMatch=0.2 $1.bed Hg38_To_ruminant.over.chain.gz $1.map.bed $1.unMap.bed

05. RNA-seq Analysis Pipeline

Pipeline: RNA-seq data with a reference genome

step 1: [Quality_control]
        Single-read: java -Xmx30g -jar trimmomatic-0.36.jar SE -phred33 -threads 10 sample.fq.gz sample.clean.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40 TOPHRED33 > sample.log
        Paired-end: java -Xmx30g -jar trimmomatic-0.36.jar PE -phred33 -threads 10 sample_1.fq.gz sample_2.fq.gz sample_1.clean.fq.gz sample_1.unpaired.fq.gz sample_2.clean.fq.gz sample_2.unpaired.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40 TOPHRED33 > sample.log
step 2: [STAR_index]
        STAR --runMode genomeGenerate --genomeDir index --genomeFastaFiles ref.fa --runThreadN 24 --sjdbGTFfile ref.gtf --sjdbOverhang 149
step 3: [Hisat2_index]
        hisat2-build ref.fa ref.hisat2.index
step 4: [Splice_sites]
        python3.5 hisat2_extract_splice_sites.py ref.gtf > ref.hisat.splicing.site
step 5: [Mapping]
        [We used an improved pipeline. The high-quality reads were first aligned to the reference genome with STAR, 
        then unmapped reads were extracted for further mapping by HISAT2 to improve reads utilization.]
        STAR --genomeDir $INDEX --readFilesIn $IN/${i}_1.clean.fq.gz $IN/${i}_2.clean.fq.gz --readFilesCommand zcat --runThreadN 8 --outFilterMultimapNmax 1 --outFilterIntronMotifs RemoveNoncanonical Unannotated --outFilterMismatchNmax 10 --outSAMstrandField intronMotif --outSJfilterReads Unique --outSAMtype BAM Unsorted --outReadsUnmapped Fastx --outFileNamePrefix $OUT/${i}
        hisat2 --dta-cufflinks --no-mixed --no-discordant -p 10 --known-splicesite-infile ref.hisat.splicing.site -x ref.hisat2.index -1 $OUT/${i}Unmapped.out.mate1 -2 $OUT/${i}Unmapped.out.mate2 -S $OUT/${i}.sam
        java -Djava.io.tmpdir=$TMP -jar picard.jar CleanSam I=$OUT/${i}Aligned.out.bam O=$OUT/${i}.STAR.bam
        java -Djava.io.tmpdir=$TMP -jar picard.jar CleanSam I=$OUT/${i}.sam O=$OUT/${i}.hisat.bam
        java -Djava.io.tmpdir=$TMP -jar picard.jar MergeSamFiles I=$OUT/${i}.STAR.bam I=$OUT/${i}.hisat.bam SORT_ORDER=coordinate O=$BAM/${i}.bam
step 6: [Gene_quantification]
        stringtie -e -B -p 8 -G ref.gtf -o ./sample/sample.gtf -A ./sample/sample.gene_abund sample.bam
        bash 00.gene.all.sh
        bash 01.tissueMerge.FPKM.sh
step 7: [Calculate tissue specificity index(tau)]
        perl spec.pl tissueType exprTable output_Expr.tau

Get scripts: 00.gene.all.sh 01.tissueMerge.FPKM.sh spec.pl

Get files: tissueType [two columns: sampleID, tissueType] exprTable

06. Epigenome Analysis Pipeline

Pipeline 1: ChIP-seq data processing

step 1: [Mappping]
        Index: bwa index ref.fa
        Single-read: bwa mem ref.fa -t 8 -M -R '@RG\tID:sample\tLB:sample\tPL:ILLUMINA\tSM:sample' sample.fastq | samtools view -bS -@ 8 > sample.bam
        Paired-end: bwa mem ref.fa -t 8 -M -R '@RG\tID:sample\tLB:sample\tPL:ILLUMINA\tSM:sample' sample_1.fastq sample_2.fastq | samtools view -bS -@ 8 > sample.bam
step 2: [Merge, sort, and filter bam]
        java -Djava.io.tmpdir=/tmp/ -jar picard.jar MergeSamFiles I=sample_line1.bam I=sample_line2.bam I=sample_line3.bam SORT_ORDER=coordinate O=sample.bam
        samtools sort -@ 8 -o sample.sorted.bam -T sample.sorted.tmp -O bam sample.bam
        samtools view -q 20 sample.sorted.bam -b -o sample.sorted.unique.bam
        samtools index sample.sorted.unique.bam
step 3: [Predict fragment size]
        python2.7 macs2 predictd -i sample.sorted.unique.bam -f BAM -g genomeSize --outdir sample/
step 4: [Call peak]
        python2.7 macs2 callpeak -t sample.sorted.unique.bam -c control.sorted.unique.bam --extsize fragmentSize --nomodel -f BAM -g genomeSize -n sample -p 1e-5 -B --outdir sample_pe-5/

Pipeline 2: ATAC-seq data processing

step 1: [Mapping]
        Index: bwa index ref.fa
        Single-read: bwa mem ref.fa -t 8 -M -R '@RG\tID:sample\tLB:sample\tPL:ILLUMINA\tSM:sample' sample.fastq | samtools view -bS -@ 8 > sample.bam
        Paired-end: bwa mem ref.fa -t 8 -M -R '@RG\tID:sample\tLB:sample\tPL:ILLUMINA\tSM:sample' sample_1.fastq sample_2.fastq | samtools view -bS -@ 8 > sample.bam
step 2: [Merge, sort, filter, and remove duplicates and mitochondria from bam]
        java -Djava.io.tmpdir=/tmp/ -jar picard.jar MergeSamFiles I=sample_line1.bam I=sample_line2.bam I=sample_line3.bam SORT_ORDER=coordinate O=sample.bam
        samtools sort -@ 8 -o sample.sorted.bam -O bam sample.bam
        samtools view -q 20 sample.sorted.bam -b -o sample.sorted.unique.bam
        java -Xmx50g -Djava.io.tmpdir=tmp -jar picard.jar MarkDuplicates INPUT=sample.sorted.unique.bam OUTPUT=sample.sorted.unique.dedup.bam METRICS_FILE=sample_dedup REMOVE_DUPLICATES=true CREATE_INDEX=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT MAX_FILE_HANDLES=2000
        samtools view -h sample.sorted.unique.dedup.bam |awk '{if($3 != "Mt_name"){print $0}}' |samtools view -Sb - > sample.sorted.unique.dedup.noMT.bam
step 3: [Call peak]
        python2.7 macs2 callpeak -t sample.sorted.unique.dedup.noMT.bam --nomodel --shift -37 --extsize 73 -f BAM -g genomeSize -n sample -p 1e-5 --outdir sample_pe-5/ --broad

Pipeline 3: Predict chromatin states

step 1: [Prepare for building database]
        We download the chrom.sizes file, CpG Island data and ref.genePred file from UCSC Genome Browser. Next,
        we add chrom.sizes file to CHROMSIZES directory and put CpG Island data under the ref directory of COORDS
        directory. Then, we convert gene annotations(ref.genePred) into coordinate and anchor files for enrichment
        analyses.
step 2: [Build database]
        java -mx8000M -jar ChromHMM.jar ConvertGeneTable -gzip -l chrom.sizes ref.genePred.table prefix ref
step 3: [Bam to bed]
        bedtools bamtobed -i sample.bam > sample.bed
step 4: [To BinarizeBed]
        java -mx8000M -jar ChromHMM.jar BinarizeBed ref.chrom.sizes inputDir/ sample.bed sample.out
step 5: [Chromatin state annotations]
        java -mx8000M -jar ChromHMM.jar LearnModel inputDir/ stateNum_output number ref
step 6: [To genome Browser]
        java -mx8000M -jar ChromHMM.jar MakeBrowserFiles -gzip -c colormappingfile -m labelmappingfile sample_state(n)_segments.bed.gz sample sample_state(n)

Get the ref.genePred.table: ARS-UCD1.2_Btau5.0.1Y.genePred.table

Get the color profile: colormappingfile

Get the profile of chromatin states: labelmappingfile

Pipeline 4: Convert human regulatory data to ruminant genomes

step 1: [Convert human regulation data]
        liftOver -minMatch=0.2 bed HumanToRuminant.chain bedForward unmappedForward
step 2: [Convert Reciprocally]
        liftOver -minMatch=0.2 bedForward RuminantToHuman.chain bedReciprocal unmappedReciprocal
step 3: [Exactly match]
        python exact_match.py bed bedReciprocal bedForward exactBed
step 4: [Extract]
        awk -F '\t' '{print $1"\t"$2"\t"$3"\t"$9}' exactBed >> exactBed4
step 5: [Merge data]
        python mergeBed_max.py --bedfile exactBed4 --outfile exactBedMerge --hardcutoff 10000 --zerobaed

Get the python script: exact_match.py

Get the python script: mergeBed_max.py

07. GWAS Data Coordinate Conversion

liftOver -minMatch=0.95 GWAS.bed AssemblyV1_To_AssemblyV2.over.chain.gz GWAS.map.bed GWAS.unMap.bed