估计阅读时长: 10 分钟

https://github.com/xieguigang/ms-imaging


TrIQ 算法流程

TrIQ算法的原理就是主要在整体信号的基础上寻找到一个可以将整体信号强度离散化分布程度最大化的一个阈值,基于这个信号阈值,可以将质谱成像图像整体的对比度进行提升。在TrIQ算法流程之中,具有以下的几个大致步骤:

  1. 获取得到所有信号强度值
  2. 对信号值进行等宽分bin,现在每一个bin之中我们可以得到处于每一个bin之中的信号值的数量
  3. 从得到的信号bin列表的下限循环至信号bin列表的上限,计算当前信号分bin区域的频率的CDF累加值
  4. 给定一个概率期望值,查找到CDF离期望值最近的信号bin,当前的信号bin的上限或者下限即为我们所需要的信号阈值

下面我们来讲解一下TrIQ算法的具体代码实现,下文所示的所有的代码,大家可以阅读源代码文件【Drawing2D/Colors/Scaler/TrIQ.vb】。对于等宽分bin方法,我们可以直接通过sciBASIC框架中的分bin函数进行处理,之后所有的计算处理都是基于等宽分bin的结果基础上来完成:

<Extension>
Public Function FindThreshold(data As IEnumerable(Of Double), q As Double,
                              Optional N As Integer = 100,
                              Optional eps As Double = 0.1) As Double

    Return CutBins _
        .FixedWidthBins(data, N, Function(x) x) _
        .FindThreshold(q, eps)
End Function

options(strict = FALSE);

# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1);

bitmap(file = `${@dir}/CDF.png`) {
    ecdf = CDF(x -> dnorm(x, mean = 2.5, sd = 2), [min(x), max(x)], resolution = 120);
    str(ecdf);

    plot(ecdf$x, ecdf$y, 
       grid.fill = "white", 
       color = "steelblue", 
       point_size = 16
    );
}

累积分布函数(Cumulative Distribution Function),又叫分布函数,是概率密度函数的积分,能完整描述一个实随机变量X的概率分布。在之后的循环过程中呢,我们就可以对分bin区域进行CDF计算:大致的粗略计算就是将所有的频数加起来,然后除以总样本数,就可以得到一个概率累加结果的近似值了。

<Extension>
Public Function CDF(bin As IEnumerable(Of DataBinBox(Of Double)), N As Integer) As Double
    Dim sumHk As Double = Aggregate hi As DataBinBox(Of Double) In bin Into Sum(hi.Count)
    Dim p As Double = sumHk / N

    Return p
End Function

最后,我们在For循环之中查找到一个离所期望的分布概率值最接近的分bin结果即可。对于拿到的分bin簇,我们取出bin簇的上限或者下限即为所得到的目标阈值:

Public Module TrIQ

    <Extension>
    Public Function FindThreshold(data As IEnumerable(Of DataBinBox(Of Double)),
                                  q As Double,
                                  Optional eps As Double = 0.1) As Double
        Dim sample As DataBinBox(Of Double)() = data _
            .OrderBy(Function(b) b.Boundary.Min) _
            .ToArray
        Dim N As Integer = Aggregate point As DataBinBox(Of Double)
                           In sample
                           Into Sum(point.Count)
        Dim minK As Integer = 1
        Dim minD As Double = Double.MaxValue

        For k As Integer = 1 To sample.Length - 1
            Dim cdf As Double = sample.Take(k).CDF(N)

            If cdf > q Then
                Exit For
            End If

            Dim d As Double = stdNum.Abs(cdf - q)

            If d <= eps Then
                Return sample(k).Boundary.Min
            ElseIf d < minD Then
                minD = d
                minK = k
            End If
        Next

        Return sample(minK - 1).Boundary.Max
    End Function
End Module

TrIQ 信号处理算法应用于质谱成像可视化

质谱成像信号分布可视化

首先我们来查看质谱成像的原始信号值的分布一般是怎样的。我们先通过mzkit程序包所提供的质谱成像相关的函数进行响应度值的获取,接着呢,就可以通过直方图来简单的查看信号值的分布情况:

require(mzkit);

imports "MsImaging" from "mzplot";

options(strict = FALSE);
options(memory.load = "max");

data = "CleanSample.mzPack"
|> open.mzpack
|> viewer
|> MSIlayer(mz = 153.14)
|> intensity
;

data = data.frame(intensity = data);

bitmap(file = `${@dir}/intensity_hist.png`, size = [1800, 1400]) {
    ggplot(data, aes(x = "intensity"), padding = "padding:200px 400px 200px 200px;")
      + geom_histogram(bins = 100,  color = "steelblue")
      + ggtitle("Frequency of intensity")
      + scale_x_continuous(labels = "G2")
      + scale_y_continuous(labels = "F0")
      + theme_default()
      + theme(plot.title = element_text(size = 16), axis.text=element_text(size=8))
      ;
}

可以看得见,最原始的质谱成像信号的分布非常的异常,最大的极值信号的像素点数量非常少,但是信号值异常的高。导致我们直接在原始信号响应值的基础上使用线性映射,将无法产生颜色梯度。因为大部分的像素点的颜色都被几种在相同的颜色等级上了。现在我们再在原始信号值的基础上,取一下log变换,看看响应值的分布情况是否会好一些:

bitmap(file = `${@dir}/logIntensity_hist.png`, size = [1800, 1400]) {
    ggplot(data.frame(log_intensity = log(data[, "intensity"])), aes(x = "log_intensity"), padding = "padding:200px 400px 200px 200px;")
      + geom_histogram(bins = 100,  color = "steelblue")
      + ggtitle("Frequency of intensity")
      + scale_x_continuous(labels = "G2")
      + scale_y_continuous(labels = "F0")
      + theme_default()
      + theme(plot.title = element_text(size = 16), axis.text=element_text(size=8))
      ;
}

情况稍微好了一些,但是极值任然比较高,数量较少。并且大部分的像素点都集中在中间的部分。导致任然会出现很大的一部分像素点的颜色等级相同,整个成像图的对比度偏低。那现在我们测试一下通过TrIQ算法直接处理原始信号看看情况又会怎么样。

data = data[, "intensity"];

qcut = MsImaging::TrIQ(data, q = 0.6);
qcut = max(data) * max(qcut);
data[data > qcut] = qcut;

bitmap(file = `${@dir}/TrIQ_intensity_hist.png`, size = [2400, 1400]) {
    ggplot(data.frame(TrIQ_intensity = data), aes(x = "TrIQ_intensity"), padding = "padding:200px 500px 200px 200px;")
      + geom_histogram(bins = 100,  color = "steelblue")
      + ggtitle("Frequency of intensity")
      + scale_x_continuous(labels = "G2")
      + scale_y_continuous(labels = "F0")
      + theme_default()
      + theme(plot.title = element_text(size = 16), axis.text = element_text(size = 8))
      ;
}

可以看得见信号的整体分布大致比较平衡了。将TrIQ算法应用到质谱成像的渲染引擎之中看看效果:


下载论文原文:《Contrast optimization of mass spectrometry imaging (MSI) data visualization by threshold intensity quantization (TrIQ)

高级数据科学家 at 苏州帕诺米克
Working on Engineered bacteria CAD design on its genome from scratch. Writing scientific computing software for Tianhe & Sunway TaihuLight supercomputer. Do scientific computing programming in R/R# language, he is also the programming language designer of the R# language on the .NET runtime.
谢桂纲
Latest posts by 谢桂纲 (see all)

Order by Date Name
Attachments

2 Responses

Leave a Reply

Your email address will not be published. Required fields are marked *

博客文章
July 2024
S M T W T F S
 123456
78910111213
14151617181920
21222324252627
28293031  
  1. 针对图对象进行向量化表示嵌入: 首先,通过node2vec方法,将node表示为向量 第二步,针对node向量矩阵,进行umap降维计算,对node进行排序,生成node排序序列 第三步,针对node排序序列进行SGT序列图嵌入,实现将网络图对象嵌入为一维向量