Automated Optimal Parameters for T-Distributed Stochastic Neighbor Embedding Improve Visualization and Allow Analysis of Large Datasets

PhenoGraph提供了与UMAP类似的算法过程进行单细胞组学数据的细胞分型处理操作。与UMAP方法相比,PhenoGraph并不会产生数据降维效果,仅仅产生数据点Cluster信息。如果需要将数据进行可视化,还需要借助于t-SNE算法将PhenoGraph的分型结果数据投影到一个二维平面上完成。

相关的知识点:

PhenoGraph算法流程

在PhenoGraph算法之中,大致上可以分为4个计算步骤:

  • ANN(Approximate Near Neighbor)查找每一个样本的最邻近点
  • 计算杰卡德相关性
  • 对相关性矩阵建立jaccard graph
  • 基于louvain算法对jaccard graph做节点集群发现(得到的每一个节点集群都是一种细胞分型)

Levine JH, Simonds EF, Bendall SC, Davis KL, Amir ED, Tadmor MD, Litvin O, Fienberg HG, Jager A, Zunder ER, Finck R, Gedman AL, Radtke I, Downing JR, Pe'er D, Nolan GP. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis, Cell 2015

在上面的4个步骤之中,前三个步骤与我们常用的UMAP数据降维方法类似:在UMAP方法之中,会首先通过KNN进行最临近点的计算,接着基于KNN的计算结果构建出一个稀疏矩阵网络图,UMAP就是基于这个稀疏矩阵网络图进行数据降维。

而在PhenoGraph算法之中,则不是对得到的稀疏网络图做降维。而是通过网络节点的聚类分析来实现分型效果。对于UMAP算法而言,得到的数据降维结果一般需要结合kmeans或者DBSCAN聚类才可以实现分型的效果。

  • Schematic overview of the nucleolus and its substructures: fibrillar center (FC), dense fibrillar component (DFC), and granular component (GC).
  • UTP6 in U-2 OS cells exemplify proteins localized to whole nucleoli (HPA025936).
  • Fibrillar center localization shown by a UBTF IF staining in U-2 OS cells (HPA006385).
  • UMAP visualization of the IF images generated in the HPA Cell Atlas (also shown in Fig 3A). The images from singularly localizing nuclear proteins are highlighted in purple (nucleoli), blue (fibrillar center), brown (nuclear speckles), and orange (nuclear bodies). Multilocalizing nucleolar proteins are highlighted in yellow.
  • Dual localization of LEO1 to both the fibrillar center and nucleoplasm in GFP-tagged HeLa cells (magenta), also supported by IF antibody staining using HPA040741 (green).
  • The multilocalizing ribosomal protein RPL13 detected in the nucleoli, cytosol, and ER in MCF-7 cells (HPA051702).

Data information: Protein of interest is shown in green, nuclear marker DAPI in blue, and the reference marker of microtubules in red. Scale bar 10 μm.
Mapping the nucleolar proteome reveals a spatiotemporal organization related to intrinsic protein disorder. Mol Syst Biol (2020)16:e9469. https://doi.org/10.15252/msb.20209469

杰卡德相关度

在我之前的博客文章之中提到,因为对于高维度数据而言,会存在维度灾难,即欧几里得距离会失效。所以我们一般不可以直接使用基于欧几里得距离的kmeans或者DBSCAN方法进行高维度数据聚类。解决的方法有两条路:进行数据降维或者换一种方式计算距离。

在这里PhenoGraph算法选择了第二种数据处理方式:在PhenoGraph方法之中,我们会首先通过KD树对每一个高维度的样本点数据进行k个以内的临近点的搜索计算。之后呢,每一个样本点都会有一个自己的索引向量,每一个索引向量都是和目标样本点通过KD树搜索出来的距离很近的点。已经提示到集合了,那么我们可以很自然而然的就想到使用杰卡德这种通过集合计算的方式来计算相似度的方法了。没错,在PhenoGraph方法之中,第二步骤就是通过杰卡德相似度基于每一个样本点的最邻近点的集合计算出任意两个样本点之间的相似度。

两个集合A和B交集元素的个数在A、B并集中所占的比例,称为这两个集合的杰卡德系数,用符号 J(A,B) 表示。

#include <Rcpp.h>
using namespace Rcpp;

// Compute jaccard coefficient between nearest-neighbor sets
//
// Weights of both i->j and j->i are recorded if they have intersection. In this case
// w(i->j) should be equal to w(j->i). In some case i->j has weights while j<-i has no
// intersections, only w(i->j) is recorded. This is determinded in code `if(u>0)`. 
// In this way, the undirected graph is symmetrized by halfing the weight 
// in code `weights(r, 2) = u/(2.0*ncol - u)/2`.
//
// Author: Chen Hao, Date: 25/09/2015

// [[Rcpp::export]]
NumericMatrix jaccard_coeff(NumericMatrix idx) {
    int nrow = idx.nrow(), ncol = idx.ncol();
    NumericMatrix weights(nrow*ncol, 3);
    int r = 0;
    for (int i = 0; i < nrow; i++) {
        for (int j = 0; j < ncol; j++) {
            int k = idx(i,j)-1;
            NumericVector nodei = idx(i,_);
            NumericVector nodej = idx(k,_);
            int u = intersect(nodei, nodej).size();  // count intersection number
            if(u>0){ 
                weights(r, 0) = i+1;
                weights(r, 1) = k+1;
                weights(r, 2) = u/(2.0*ncol - u)/2;  // symmetrize the graph
                r++;
            }
        }
    }

    return weights;
}

这样子,我们就可以得到一个样本点之间的相似度矩阵了。基于一定的阈值作为cutoff,我们就可以通过这个相似度矩阵构建出一个相似度网络图。最后呢,我们基于网络图聚类算法就可以对相似度网络图中的节点进行聚类。网络节点聚类得到的分类结果就是我们的细胞分型结果了。

关于通过使用Louvain算法做网络图中的节点的集群聚类分析,大家可以阅读我的上一篇博客文章:【基于Louvain算法的网络集群发现】

PhenoGraph的R代码实现

https://github.com/JinmiaoChenLab/Rphenograph

我在Github上找到了一个非常简洁明了的实现PhenoGraph算法的R程序包,下面是一段在这个R程序包之中对PhenoGraph方法的R代码实现:

message("Run Rphenograph starts:","\n", 
        "  -Input data of ", nrow(data)," rows and ", ncol(data), " columns","\n",
        "  -k is set to ", k)

cat("  Finding nearest neighbors...")
t1 <- system.time(neighborMatrix <- find_neighbors(data, k=k+1)[,-1])
cat("DONE ~",t1[3],"s\n", " Compute jaccard coefficient between nearest-neighbor sets...")

t2 <- system.time(links <- jaccard_coeff(neighborMatrix))
cat("DONE ~",t2[3],"s\n", " Build undirected graph from the weighted links...")
links <- links[links[,1]>0, ]
relations <- as.data.frame(links)
colnames(relations)<- c("from","to","weight")
t3 <- system.time(g <- graph.data.frame(relations, directed=FALSE))

# Other community detection algorithms: 
#    cluster_walktrap, cluster_spinglass, 
#    cluster_leading_eigen, cluster_edge_betweenness, 
#    cluster_fast_greedy, cluster_label_prop  
cat("DONE ~",t3[3],"s\n", " Run louvain clustering on the graph ...")
t4 <- system.time(community <- cluster_louvain(g))
cat("DONE ~",t4[3],"s\n")

message("Run Rphenograph DONE, totally takes ", sum(c(t1[3],t2[3],t3[3],t4[3])), "s.")
cat("  Return a community class\n  -Modularity value:", modularity(community),"\n")
cat("  -Number of clusters:", length(unique(membership(community))))

Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis

Latest posts by xie guigang (see all)

Attachments

2 Responses

  1. […] 在这篇博客文章之中,我主要是来详细介绍一下是如何从头开始实现Phenograph单细胞分型算法的。在之前的一篇博客文章《【单细胞组学】PhenoGraph单细胞分型》之中,我们介绍了Phenograph算法的简单原理,以及一个在R语言之中所实现的Phenograph算法的程序包Rphenograph。在这里我主要是详细介绍在GCModeller软件之中所实现的VisualBasic语言版本的Phenograph单细胞分型算法。 […]

Leave a Reply

Your email address will not be published.