玉米RNA-seq测序数据差异基因分析

2022-6-3 萌小白

huanying今天给大家分享一个非常棒的玉米转录组的流程分析。原文作者是cxge，首发于omicshare论坛，阅读原文可跳转至本文的帖子哦~

软件及参考基因组

BWA, Samtools, Hisat2, HTseq, gffcompare, Stringtie, Ballgown，R

B73-V3, B73-V4(Enzampl database)

流程

protocol首先从原始RAN-seq数据入手，先经过质控fastqc，之后检测rRNA占比，去除杂的reads之后进行数据处理；使用HISAT2将读段匹配到参考基因组上,可提供注释文件；StringTie进行转录本组装，估算每个基因及isoform的表达水平；所有的转录本merge的数据再一次被呈递给StringTie，重新估算转录本的丰度，提供转录本reads数量的数据给下一步的ballgown；Ballgown从上一步获得所有转录本及其丰度，根据实验条件进行分类统计。

具体操作

质控检测

fastqc *.gz

Fail选项：

（1） Per Base Sequence Content（对所有reads的每一个位置，统计ATCG四种碱基正常情况的分布）：当任一位置的A/T比例与G/C比例相差超过20%，报"FAIL"。

（2）Per Sequence GC Content：偏离理论分布的reads超过30%时，报"FAIL"。

（3）Duplicate
Sequences统计序列完全一样的reads的频率。测序深度越高，越容易产生一定程度的duplication，这是正常的现象，但如果duplication的程度很高，就提示我们可能有bias的存在（如建库过程中的PCR
duplication）：当非unique的reads占总数的比例大于50%时，报"FAIL"

rRNA检测

NCBI网站下载Zea mays Eukaryotic 5S ribosomal RNA；Zea mays Eukaryotic 18S ribosomal RNA；Zea mays Eukaryotic 28S ribosomal RNA

# bwa index rRNA-maize.fa

# bwa mem -t 8 ../ref/rRNA-maize.fa NCSU_Wang_120430HiSeqRun_B73-WT- 22.fastq.gz | samtools view -bS - > B73-WT.bam

# samtools sort B73-WT.bam -o B73-WT.sort.bam

# samtools flagstat B73-WT.bam

玉米参考基因组及注释下载

# wget ftp://ftp.ensemblgenomes.org/pub/release-38/plants/fasta/ zea_mays/dna/Zea_mays.AGPv4.dna.toplevel.fa.gz

# wget ftp://ftp.ensemblgenomes.org/pub/release-31/plants/fasta/ zea_mays/dna/Zea_mays.AGPv3.31.dna.genome.fa.gz

# wget ftp://ftp.ensemblgenomes.org/pub/release-38/plants/gtf/ zea_mays/Zea_mays.AGPv4.38.gtf.gz

# wget ftp://ftp.ensemblgenomes.org/pub/release-31/plants/gtf/ zea_mays/Zea_mays.AGPv3.31.gtf.gz

从注释文件里面抽取出剪切位点和外显子信息,建立Index:

# extract_splice_sites.py Zea_mays.AGPv4.38.gtf > maize-v4.ss

#extract_exons.py Zea_mays.AGPv4.38.gtf > maize-v4.exon

# hisat2-build --ss maize-v4.ss --exon maize-v4.exon Zea_mays.AGPv4.dna.toplevel.fa maize-v4_tran 速度慢，限速

hisat2比对

#hisat2 -p 16 -x ./maize/Zea_mays_tran -U ./48/NCSU_Wang_120430HiSeqRun_B73-mt-48h.fastq.gz -S ./result/B73.sam

###多组数据用逗号隔开，双末端数据用-1，-2，单末端数据用-U，可以两种参数同时使用，reads长度可以不一致

# python -m HTSeq.s.count 统计reads number

组装转录本并定量表达基因

# stringtie B73.bam -p 8 -G

多个样品：

for i in *.sam;

i=${i%.sam*}; nohup samtools sort -@ 8 -o ${i}.bam ${i}.sam &

done

差异基因分析

>install.packages("devtools",repos="http://cran.us.r-project.org")

>source("http://www.bioconductor.org/biocLite.R")

>biocLite("ballgown", "RSkittleBrewer", "genefilter","dplyr","devtools")

>library(ballgown)

>library(RSkittleBrewer)

>library(genefilter)

>library(dplyr)

>library(devtools)

# vim phenodata.csv

ids, phenotype

sample-s1B73mt-22,mt

sample-s1B73WT-22,WT

sample-s2B73mt-22,mt

sample-s2B73WT-22,WT

sampleB73mt-22,mt

sampleB73WT-22,WT

# vim mergelistb73-22.txt

B73WT-22.gtf

B73mt-22.gtf

B73-WT-22.gtf

B73-mt-22.gtf

B73-WT-22.gtf

B73-mt-22.gtf

# stringtie --merge -p 8 –G Zea_mays.AGPv3.31.gtf -o b73_22.gtf mergelist_v3.txt

stringtie -e -B -p 8 -G b73_22.gtf –o sampleB73WT-22.gtf sampleB73WT-22.bam

stringtie -e -B -p 8 -G b73_22.gtf –o sampleB73mt-22.gtf sampleB73mt-22.bam

stringtie -e -B -p 8 -G b73_22.gtf –o sample-s1B73WT-22.gtf sample-s1B73WT-22.bam

stringtie -e -B -p 8 -G b73_22.gtf –o sample-s1B73mt-22.gtf sample-s1B73mt-22.bam

stringtie -e -B -p 8 -G b73_22.gtf –o sample-s2B73WT-22.gtf sample-s2B73WT-22.bam

stringtie -e -B -p 8 -G b73_22.gtf –o sample-s2B73mt-22.gtf sample-s2B73mt-22.bam

差异基因作图

7.1 导入数据

>pheno_data = read.csv("phenodata.csv")

Note: The
sample names you have stored in your phenotype file do not match the
file names of the samples you ran with StringTie!!!

> all(pheno_data$ids == list.files("ballgown"))

检查ballgown的list顺序：

list.files(“ballgown”)

bg = ballgown(dataDir = "ballgown",samplePattern = "sample",pData=pheno_data)

bg = ballgown(dataDir = "ballgown", samplePattern = "sample", meas = "all")

7.2滤掉低丰度的基因

>bg_filt = subset(bg,"rowVars(texpr(bg))>1",genomesubset=TRUE)

7.3确认组间有差异的转录本

>result_trans = stattest(bg_filt,feature="tran",covariate="phenotype",getFC=TRUE,meas="FPKM")

> result_trans

7.4确认组间有差异的基因

>result_gene = stattest(bg_filt,feature="gene",covariate="phenotype",getFC=TRUE,meas="FPKM")

> result_gene

7.5 对结果 result_trans增加基因名

>result_trans =data.frame(geneNames=ballgown::geneNames(bg_filt),geneIDs=ballgown::geneIDs(bg_filt), result_trans)

7.6按照P值排序（从小到大）

> result_trans=arrange(result_trans,pval)

> result_gene=arrange(result_gene,pval)

7.7把结果写入csv文件

> write.csv(result_trans, "tran_results.csv",row.names=FALSE)

> write.csv(result_gene, "gene_results.csv",row.names=FALSE)

7.8筛选出q值小于0.05的trans和genes

>subset(result_trans,result_trans$qval<0.05)

>subset(result_gene,result_gene$qval<0.05)

> write.csv(bg, "bg",row.names=FALSE)

数据可视化之颜色设定

# 赋予调色板五个指定颜色 > tropical= c('darkorange', 'dodgerblue','hotpink', 'limegreen', 'yellow') > palette(tropical)

# 当然rainbow()函数也可以完成这个任务 > palette(rainbow(5))

8.1以FPKM为参考值作图，以time作为区分条件

# 提取FPKM值

> fpkm = texpr(bg,meas="FPKM")

#方便作图将其log转换，+1是为了避免出现log2(0)的情况 > fpkm = log2(fpkm+1)

# 作图

>boxplot(fpkm,col=as.numeric(pheno_data$sex),las=2,ylab='log2(FPKM+1)')

图1

8.2就单个转录本的查看在样品中的分布

> ballgown::tranNames(bg)[12]

12 "NR_027232"

> ballgown::geneNames(bg)[12]

12 "LINC00685"

#绘制箱线图

> plot(fpkm[12,] ~ pheno_data$phenotype, border=c(1,2),

+ main=paste(ballgown::geneNames(bg)[12],' : ',

+ ballgown::tranNames(bg)[12]),pch=19, xlab="phenotype",

+ ylab='log2(FPKM+1)')

## (左图)

> points(fpkm[12,] ~ jitter(as.numeric(pheno_data$phenotype)),

+ col=as.numeric(pheno_data$phenotype))

## (右图)

8.3查看某一基因位置上所有的转录本

>plotTrans(ballgown::geneIDs(bg)[442], bg, main=c('Gene MSTRG.168 in sample sampleB73WT-22'), sample=c('sampleB73WT-22'))

有空格就出错图2

# plotTrans函数可以根据指定基因的id画出在特定区段的转录本

#可以通过sample函数指定看在某个样本中的表达情况，这里选用id=442, sample= sampleB73WT-22

> plotMeans('MSTRG.168',bg_filt,groupvar="phenotype",legend=FALSE)

图3

发表评论：