WGCNA基本概念
加权基因共表达网络分析 (WGCNA, Weighted correlation networkanalysis)是用来描述不同样品之间基因关联模式的系统生物学方法,可以用来鉴定高度协同变化的基因集,并根据基因集的内连性和基因集与表型之间的关联鉴定候补生物标记基因或治疗靶点。
相比于只关注差异表达的基因,WGCNA利用数千或近万个变化最大的基因或全部基因的信息识别感兴趣的基因集,并与表型进行显著性关联分析。一是充分利用了信息,二是把数千个基因与表型的关联转换为数个基因集与表型的关联,免去了多重假设检验校正的问题。
理解WGCNA,需要先理解下面几个术语和它们在WGCNA中的定义。
基本分析流程
WGCNA包实战
R包WGCNA是用于计算各种加权关联分析的功能集合,可用于网络构建,基因筛选,基因簇鉴定,拓扑特征计算,数据模拟和可视化等。
输入数据和参数选择
安装WGCNA
WGCNA依赖的包比较多,bioconductor上的包需要自己安装,cran上依赖的包可以自动安装。通常在R中运行下面4条语句就可以完成WGCNA的安装。
建议在编译安装R时增加--with-blas --with-lapack提高矩阵运算的速度,具体见R和Rstudio安装。
WGCNA实战
实战采用的是官方提供的清理后的矩阵,原矩阵信息太多,容易给人误导,后台回复WGCNA 获取数据。
数据读入
library(WGCNA)
Loading required package: dynamicTreeCut Loading required package:
fastcluster Attaching package: ‘fastcluster’ The following object is
masked from ‘package:stats’: hclust
==========================================================================
* * Package WGCNA 1.63 loaded. * * Important note: It appears that your
system supports multi-threading, * but it is not enabled within WGCNA
in R. * To allow multi-threading within WGCNA with all available cores,
use * * allowWGCNAThreads() * * within R. Use disableWGCNAThreads() to
disable threading if necessary. * Alternatively, set the following
environment variable on your system: * * ALLOW_WGCNA_THREADS= * * for
example * * ALLOW_WGCNA_THREADS=48 * * To set the environment variable
in linux bash shell, type * * export ALLOW_WGCNA_THREADS=48 * * before
running R. Other operating systems or shells will * have a similar
command to achieve the same aim. *
==========================================================================
Attaching package: ‘WGCNA’ The following object is masked from
‘package:stats’: cor
library(reshape2) library(stringr)
options(stringsAsFactors = FALSE)
enableWGCNAThreads()
Allowing parallel execution with up to 47 working processes.
exprMat <- “WGCNA/LiverFemaleClean.txt”
type = “unsigned”
corType = “pearson”
corFnc = ifelse(corType==”pearson”, cor, bicor)
maxPOutliers = ifelse(corType==”pearson”,1,0.05)
robustY = ifelse(corType==”pearson”,T,F)
导入数据
dataExpr <- read.table(exprMat, sep=’\t’, row.names=1, header=T, quote=””, comment=””, check.names=F)
dim(dataExpr)
[1] 3600 134
head(dataExpr)[,1:8]
F2_2 F2_3 F2_14 F2_15 F2_19 F2_20 MMT00000044 -0.01810 0.0642
6.44e-05 -0.05800 0.04830 -0.15197410 MMT00000046 -0.07730 -0.0297
1.12e-01 -0.05890 0.04430 -0.09380000 MMT00000051 -0.02260 0.0617
-1.29e-01 0.08710 -0.11500 -0.06502607 MMT00000076 -0.00924 -0.1450
2.87e-02 -0.04390 0.00425 -0.23610000 MMT00000080 -0.04870 0.0582
-4.83e-02 -0.03710 0.02510 0.08504274 MMT00000102 0.17600 -0.1890
-6.50e-02 -0.00846 -0.00574 -0.01807182 F2_23 F2_24 MMT00000044 -0.00129
-0.23600 MMT00000046 0.09340 0.02690 MMT00000051 0.00249 -0.10200
MMT00000076 -0.06900 0.01440 MMT00000080 0.04450 0.00167 MMT00000102
-0.12500 -0.06820
数据筛选
筛选中位绝对偏差前75%的基因,至少MAD大于0.01 筛选后会降低运算量,也会失去部分信息 也可不做筛选,使MAD大于0即可
m.mad <- apply(dataExpr,1,mad) dataExprVar <- m.mad=""> max(quantile(m.mad, probs=seq(0, 1, 0.25))[2],0.01)),]
转换为样品在行,基因在列的矩阵
dataExpr <- as.data.frame(t(dataExprVar))
检测缺失值
gsg = goodSamplesGenes(dataExpr, verbose = 3)
Flagging genes and samples with too many missing values… ..step 1
if (!gsg$allOK){
# Optionally, print the gene and sample names that were removed: if
(sum(!gsg$goodGenes)>0) printFlush(paste("Removing genes:",
paste(names(dataExpr)[!gsg$goodGenes], collapse = ","))); if
(sum(!gsg$goodSamples)>0) printFlush(paste("Removing samples:",
paste(rownames(dataExpr)[!gsg$goodSamples], collapse = ","))); # Remove
the offending genes and samples from the data: dataExpr =
dataExpr[gsg$goodSamples, gsg$goodGenes]
}
nGenes = ncol(dataExpr) nSamples = nrow(dataExpr)
dim(dataExpr)
[1] 134 2697
head(dataExpr)[,1:8]
MMT00000051 MMT00000080 MMT00000102 MMT00000149 MMT00000159 F2_2
-0.02260000 -0.04870000 0.17600000 0.07680000 -0.14800000 F2_3
0.06170000 0.05820000 -0.18900000 0.18600000 0.17700000 F2_14
-0.12900000 -0.04830000 -0.06500000 0.21400000 -0.13200000 F2_15
0.08710000 -0.03710000 -0.00846000 0.12000000 0.10700000 F2_19
-0.11500000 0.02510000 -0.00574000 0.02100000 -0.11900000 F2_20
-0.06502607 0.08504274 -0.01807182 0.06222751 -0.05497686 MMT00000207
MMT00000212 MMT00000241 F2_2 0.06870000 0.06090000 -0.01770000 F2_3
0.10100000 0.05570000 -0.03690000 F2_14 0.10900000 0.19100000
-0.15700000 F2_15 -0.00858000 -0.12100000 0.06290000 F2_19 0.10500000
0.05410000 -0.17300000 F2_20 -0.02441415 0.06343181 0.06627665 软阈值筛选
查看是否有离群样品
sampleTree = hclust(dist(dataExpr), method = “average”)
plot(sampleTree, main = “Sample clustering to detect outliers”, sub=””,
xlab=””)
软阈值的筛选原则是使构建的网络更符合无标度网络特征。
powers = c(c(1:10), seq(from = 12, to=30, by=2)) sft =
pickSoftThreshold(dataExpr, powerVector=powers, networkType=type,
verbose=5)
pickSoftThreshold: will use block size 2697. pickSoftThreshold:
calculating connectivity for given powers… ..working on genes 1 through
2697 of 2697 Power SFT.R.sq slope truncated.R.sq mean.k. median.k.
max.k. 1 1 0.1370 0.825 0.412 587.000 5.95e+02 922.0 2 2 0.0416 -0.332
0.630 206.000 2.02e+02 443.0 3 3 0.2280 -0.747 0.920 91.500 8.43e+01
247.0 4 4 0.3910 -1.120 0.908 47.400 4.02e+01 154.0 5 5 0.7320 -1.230
0.958 27.400 2.14e+01 102.0 6 6 0.8810 -1.490 0.916 17.200 1.22e+01 83.7
7 7 0.8940 -1.640 0.869 11.600 7.29e+00 75.4 8 8 0.8620 -1.660 0.827
8.250 4.56e+00 69.2 9 9 0.8200 -1.600 0.810 6.160 2.97e+00 64.2 10 10
0.8390 -1.560 0.855 4.780 2.01e+00 60.1 11 12 0.8020 -1.410 0.866 3.160
9.61e-01 53.2 12 14 0.8470 -1.340 0.909 2.280 4.84e-01 47.7 13 16 0.8850
-1.250 0.932 1.750 2.64e-01 43.1 14 18 0.8830 -1.210 0.922 1.400
1.46e-01 39.1 15 20 0.9110 -1.180 0.926 1.150 8.35e-02 35.6 16 22 0.9160
-1.140 0.927 0.968 5.02e-02 32.6 17 24 0.9520 -1.120 0.961 0.828
2.89e-02 29.9 18 26 0.9520 -1.120 0.944 0.716 1.77e-02 27.5 19 28 0.9380
-1.120 0.922 0.626 1.08e-02 25.4 20 30 0.9620 -1.110 0.951 0.551
6.49e-03 23.5
par(mfrow = c(1,2)) cex1 = 0.9
plot(sftfitIndices[,3])sftfitIndices[,1], -sign(sft$fitIndices[,3])sft$fitIndices[,2], labels=powers,cex=cex1,col=”red”)
abline(h=0.85,col=”red”)
plot(sftfitIndices[,5], xlab=”Soft Threshold (power)”,ylab=”Mean
Connectivity”, type=”n”, main = paste(“Mean connectivity”))
text(sftfitIndices[,5], labels=powers, cex=cex1, col=”red”)
power = sft$powerEstimate power
[1] 6 经验power (无满足条件的power时选用)
if (is.na(power)){ power = ifelse(nSamples<20, ifelse(type ==
“unsigned”, 9, 18), ifelse(nSamples<30, ifelse(type == “unsigned”, 8,
16), ifelse(nSamples<40, ifelse(type == “unsigned”, 7, 14),
ifelse(type == “unsigned”, 6, 12)) ) ) }
网络构建 一步法网络构建:One-step network construction and module detection
net = blockwiseModules(dataExpr, power = power, maxBlockSize =
nGenes, TOMType = type, minModuleSize = 30, reassignThreshold = 0,
mergeCutHeight = 0.25, numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs=TRUE, corType = corType, maxPOutliers=maxPOutliers,
loadTOMs=TRUE, saveTOMFileBase = paste0(exprMat, “.tom”), verbose = 3)
Calculating module eigengenes block-wise from all genes Flagging
genes and samples with too many missing values… ..step 1 ..Working on
block 1 . TOM calculation: adjacency.. ..will use 47 parallel threads.
Fraction of slow calculations: 0.000000 ..connectivity.. ..matrix
multiplication (system BLAS).. ..normalization.. ..done. ..saving TOM
for block 1 into file WGCNA/LiverFemaleClean.txt.tom-block.1.RData
….clustering.. ….detecting modules.. ….calculating module eigengenes..
….checking kME in modules.. ..removing 3 genes from module 1 because
their KME is too low. ..removing 5 genes from module 12 because their
KME is too low. ..removing 1 genes from module 14 because their KME is
too low. ..merging modules that are too close.. mergeCloseModules:
Merging modules whose distance is less than 0.25 Calculating new MEs…
table(net$colors)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 135 472 356 333 307 303 177 158 102 94 69 66 63 62 层级聚类树展示各个模块 灰色的为未分类到模块的基因。
moduleLabels = net$colors moduleColors = labels2colors(moduleLabels)
plotDendroAndColors(netblockGenes[[1]]], “Module colors”, dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05)
绘制模块之间相关性图
MEs = net$MEs
不需要重新计算,改下列名字就好 官方教程是重新计算的,起始可以不用这么麻烦
MEs_col = MEs colnames(MEs_col) = paste0(“ME”, labels2colors(
as.numeric(str_replace_all(colnames(MEs),”ME”,””)))) MEs_col =
orderMEs(MEs_col)
plotEigengeneNetworks(MEs_col, “Eigengene adjacency heatmap”,
marDendro = c(3,3,2,4), marHeatmap = c(3,4,2,2), plotDendrograms = T,
xLabelsAngle = 90)
如果有表型数据,也可以跟ME数据放一起,一起出图
可视化基因网络 (TOM plot)
load(net$TOMFiles[1], verbose=T)
Loading objects: TOM
TOM <- as.matrix(TOM)
dissTOM = 1-TOM
plotTOM = dissTOM^7
diag(plotTOM) = NA
TOMplot(plotTOM, net$dendrograms, moduleColors, main = “Network heatmap plot, all genes”)
导出网络用于Cytoscape
Cytoscape绘制网络图见我们更新版的视频教程或https://bioinfo.ke.qq.com/。
probes = colnames(dataExpr) dimnames(TOM) <- list(probes, probes)
cyt = exportNetworkToCytoscape(TOM, edgeFile = paste(exprMat,
“.edges.txt”, sep=””), nodeFile = paste(exprMat, “.nodes.txt”, sep=””),
weighted = TRUE, threshold = 0, nodeNames = probes, nodeAttr =
moduleColors)
关联表型数据
trait <- “WGCNA/TraitsClean.txt”
if(trait != “”) { traitData <- read.table(file=trait, sep=’\t’,
header=T, row.names=1, check.names=FALSE, comment=’’,quote=””)
sampleName = rownames(dataExpr) traitData = traitData[match(sampleName,
rownames(traitData)), ] }
模块与表型数据关联
if (corType==”pearsoon”) { modTraitCor = cor(MEs_col, traitData, use =
“p”) modTraitP = corPvalueStudent(modTraitCor, nSamples) } else {
modTraitCorP = bicorAndPvalue(MEs_col, traitData, robustY=robustY)
modTraitCor = modTraitCorPp }
Warning in bicor(x, y, use = use, …): bicor: zero MAD in variable
‘y’. Pearson correlation was used for individual columns with zero (or
missing) MAD.
textMatrix = paste(signif(modTraitCor, 2), “\n(“, signif(modTraitP,
1), “)”, sep = “”) dim(textMatrix) = dim(modTraitCor)
labeledHeatmap(Matrix = modTraitCor, xLabels = colnames(traitData),
yLabels = colnames(MEs_col), cex.lab = 0.5, ySymbols =
colnames(MEs_col), colorLabels = FALSE, colors = blueWhiteRed(50),
textMatrix = textMatrix, setStdMargins = FALSE, cex.text = 0.5, zlim =
c(-1,1), main = paste(“Module-trait relationships”))
模块内基因与表型数据关联, 从上图可以看到MEmagenta与Insulin_ug_l相关,选取这两部分进行分析。
从上图可以看到MEmagenta与Insulin_ug_l相关 模块内基因与表型数据关联
计算模块与基因的相关性矩阵
if (corType==”pearsoon”) { geneModuleMembership =
as.data.frame(cor(dataExpr, MEs_col, use = “p”)) MMPvalue =
as.data.frame(corPvalueStudent( as.matrix(geneModuleMembership),
nSamples)) } else { geneModuleMembershipA = bicorAndPvalue(dataExpr,
MEs_col, robustY=robustY) geneModuleMembership = geneModuleMembershipAp }
只有连续型性状才能进行计算,如果是离散变量,在构建样品表时就转为0-1矩阵。
if (corType==”pearsoon”) { geneTraitCor = as.data.frame(cor(dataExpr,
traitData, use = “p”)) geneTraitP = as.data.frame(corPvalueStudent(
as.matrix(geneTraitCor), nSamples)) } else { geneTraitCorA =
bicorAndPvalue(dataExpr, traitData, robustY=robustY) geneTraitCor =
as.data.frame(geneTraitCorAp) }
Warning in bicor(x, y, use = use, …): bicor: zero MAD in variable
‘y’. Pearson correlation was used for individual columns with zero (or
missing) MAD.
module = “magenta” pheno = “Insulin_ug_l” modNames = substring(colnames(MEs_col), 3)
module_column = match(module, modNames) pheno_column = match(pheno,colnames(traitData))
moduleGenes = moduleColors == module
sizeGrWindow(7, 7) par(mfrow = c(1,1))
verboseScatterplot(abs(geneModuleMembership[moduleGenes,
module_column]), abs(geneTraitCor[moduleGenes, pheno_column]), xlab =
paste(“Module Membership in”, module, “module”), ylab = paste(“Gene
significance for”, pheno), main = paste(“Module membership vs. gene
significance\n”), cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col =
module)
分步法展示每一步都做了什么 计算邻接矩阵
adjacency = adjacency(dataExpr, power = power)
把邻接矩阵转换为拓扑重叠矩阵,以降低噪音和假相关,获得距离矩阵。
TOM = TOMsimilarity(adjacency) dissTOM = 1-TOM
层级聚类计算基因之间的距离树
geneTree = hclust(as.dist(dissTOM), method = “average”)
模块合并
minModuleSize = 30
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE, minClusterSize =
minModuleSize)
dynamicColors = labels2colors(dynamicMods)
通过计算模块的代表性模式和模块之间的定量相似性评估,合并表达图谱相似
的模块 MEList = moduleEigengenes(datExpr, colors = dynamicColors) MEs = MEList$eigengenes
MEDiss = 1-cor(MEs)
METree = hclust(as.dist(MEDiss), method = “average”) MEDissThres = 0.25
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
mergedColors = merge$colors;
分步法完结 Reference: