KEGG的BHR评分注释直系同源

pxocgx01_blastx against multiple related xanthomonas species

文章阅读目录大纲

估计阅读时长: 8 分钟

https://gcmodeller.org/

KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from genomic and molecular-level information. It is a computer representation of the biological system, consisting of molecular building blocks of genes and proteins (genomic information) and chemical substances (chemical information) that are integrated with the knowledge on molecular wiring diagrams of interaction, reaction and relation networks (systems information). It also contains disease and drug information (health information) as perturbations to the biological system.

https://www.kegg.jp/kegg/kegg1a.html

在之前所记录的《记一次浅尝辄止的FBA代谢网络分析》工作之中：对于FBA模型所需要的代谢网络的重建，我们会需要将目标基因组中的蛋白序列集合与从UniProt序列库中抽取出来的具有KO编号的参考序列库做比较。基于KO编号的注释结果得到对应的KEGG代谢反应结果数据的注释，从而重建出整个代谢网络。

关于进行非参考基因组的KEGG代谢途径功能注释的操作，大家也可以观看之前的课程视频《【GCModeller教程】KEGG代谢途径注释原理》

对于酶来说，40-70%的序列相似性对于功能的预测有90%的准确性（Tian,W）。直系同源基于是来自于相同的祖先的基因分化，保存在不同的物种中的功能基因。在实际操作中，他们能够通过BBH（bi-directional best hit）来推测出来。因此，对在许多物种中的直系同源基因的鉴定是对新测序的基因功能预测的最便捷的途径。

蛋白质同源性注释的两个等级

大家在使用KEGG数据库之中的KAAS系统进行蛋白序列的KO编号的注释分析的时候，KAAS系统提供了两种等级的注释方法供大家进行选择：

single-directional best hit(SBH method)
bi-directional best hit(BBH method)

简单的说起来，上面的两个注释等级第一个就是只做query vs subject单向注释，取出blastp结果中的每一个query的第一条同源序列作为结果；第二个等级BBH要严格一些：就是除了forward方向的比较，还存在有subject vs query的reverse方向的比较，然后都同时取query结果中的两个方向的第一条序列结果的交集作为注释结果。

上面提到的两个方法非常的简单，naive。但是也会存在一些问题，例如：

SBH方法会得到很多重复的注释结果
而BBH方法则可能会出现注释结果偏少的问题，因为在BBH等级中，序列必须要符合不管是正向比对还是反向比对，双方都必须要处于对方的得分最高的第一条的位置上

BHR评分注释直系同源

BHR评分是在BBH的方法上改进而来的，这个评分目前广泛的应用于KO编号的注释之上。

# Bi-directional hit rate
BHR = Rf * Rr

理解BHR评分算法其实很容易，就是因为BBH方法因为是取Top1的交集，所以结果会很少。而BHR方法则是认为直系同源的注释的结果不一定为第一条，所以在这里通过一个评分来做阈值筛选完成可能不取第一条的双向比对注释，提升注释结果数量。

Database: ./UniProt_KO.fasta
           8,771,660 sequences; 3,406,738,582 total letters

Query= Synpcc7942_0001 DNA polymerase III, beta subunit
Length=390
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

K02338 Beta sliding clamp                                             786     0.0
K02338 Beta sliding clamp                                             751     0.0
K02338 Beta sliding clamp                                             485     1e-169
K02338 Beta sliding clamp                                             482     1e-168
K02338 Beta sliding clamp                                             480     7e-168
K02338 Beta sliding clamp                                             476     4e-166

现在我们遵从一个比对的顺序定义，就是forward方向为query vs subject，此为Rf；而reverse方向则是为subject vs query，此为Rr。那在这个比对方向的前提基础上，我们可以得到：

Rf = Bits_score[q_s] / MaxBits_score[q_s]
Rr = Bits_score[s_q] / MaxBits_score[s_q]
Blast bits score > 60
BHR(Rf*Rr) > 0.95

Blast Bits Score 是在 Blast raw score 换算过来的。上面的两个公式中，分子都是 query 和 subject 中的一个基因的Bit_score，分母是某一个方向上的一个query比对结果中最大的 bit_score。对的，BHR是不是非常简单？

Yuki Moriya, Masumi Itoh, Shujiro Okuda, Akiyasu C. Yoshizawa, Minoru Kanehisa, KAAS: an automatic genome annotation and pathway reconstruction server, Nucleic Acids Research, Volume 35, Issue suppl_2, 1 July 2007, Pages W182–W185, https://doi.org/10.1093/nar/gkm321

gkm321.PDF

BHR的代码实现

实现上面所描述的BHR评分的计算代码，大家可以阅读GCModeller之中的这个源代码文件：BBH/Algorithm/BHR.vb。我们在已经解析出blastp的输出结果的基础上，针对某一个方向的blastp比对得到的query基础结果之上，有：

''' <summary>
''' Calculate R score for hit
''' 
''' ```
''' R = bits / max(bits)
''' ```
''' </summary>
''' <returns></returns>
<Extension> Public Function HitRate(query As Query) As IEnumerable(Of NamedValue(Of Double))
    Dim bits = query.SubjectHits _
        .Select(Function(hit)
                    Return New NamedValue(Of Double) With {
                        .Name = hit.Name,
                        .Value = hit.Score.Score
                    }
                End Function) _
        .ToArray
    Dim maxBits As Double = bits.Max(Function(hit) hit.Value)

    Return bits _
        .Select(Function(hit)
                    Return New NamedValue(Of Double) With {
                        .Name = hit.Name,
                        .Value = hit.Value / maxBits
                    }
                End Function)
End Function

从上面的代码中也可以看得出来，计算Rf或者Rr得分其实就是对query中的每一个hits的得分做一次[0,1]正则化，将数据缩放到0到1之间。那再计算出Rf以及Rr之后呢，将两个乘起来，取得分最高的结果即可，下面的代码会将最高得分的匹配结果取出来：

Private Function getTopBHR(q As NamedCollection(Of NamedValue(Of Double)), r As Dictionary(Of String, Double)) As Map(Of (q$, r$), Double)
    Return q.Where(Function(hit) r.ContainsKey(hit.Name)) _
        .Select(Function(hit)
                    Dim bhr_score = hit.Value * r(hit.Name)
                    Dim align = (q.name, hit.Name)

                    Return New Map(Of (String, String), Double) With {
                        .Key = align,
                        .Maps = bhr_score
                    }
                End Function) _
        .OrderByDescending(Function(bhr) bhr.Maps) _
        .FirstOrDefault
End Function

Author
Recent Posts

谢桂纲

高级数据科学家 at 苏州帕诺米克

Working on Engineered bacteria CAD design on its genome from scratch. Writing scientific computing software for Tianhe & Sunway TaihuLight supercomputer. Do scientific computing programming in R/R# language, he is also the programming language designer of the R# language on the .NET runtime.

Attachments

kegg_overview • 313 kB • 248 click
2021年9月12日
kegg_variation • 110 kB • 298 click
2021年9月12日
RdUzbCF • 24 kB • 280 click
2021年9月12日
gkm321 • 388 kB • 415 click
2021年9月12日

KAAS: an automatic genome annotation and pathway reconstruction server

打赏赞(6)

bbh bhr blastp clustering GCModeller kegg localblast pathway sbh uniprot VisualBasic

7 Responses

Elisabeth Bable says:

2023年9月15日 at 7:52 PM

I would like to know where you got your website theme or design. Could you please share that information with me?

来自美国

Reply
- 谢桂纲 says:
  
  2023年9月20日 at 8:16 PM
  
  Hi, you can search theme ‘Teluro’ from WP theme store
  
  来自江苏
  
  Reply
【基因组学】全基因组调控网络的构建实例 – この中二病に爆焔を！ says:

2022年6月27日 at 10:18 PM

[…] 在转录调控因子注释方面，如果我们需要进行本地注释的话，可以基于双向最佳blastp的方法做转录因子蛋白的搜索注释。如果想在线做相应的转录因子注释，那么我们可以基于KEGG网站或者NCBI网站上的相应的blastp注释工具来完成。进行蛋白搜索注释之后，我们可以得到我们的目标基因组内同源的基因编号，将得到的基因编号列表与RegPrecise数据库之中所记录的基因编号列表取交集，我们就可以得到目标基因组内的转录因子结果了。 […]

来自中国

Reply
Finansowe Blogi says:

2022年5月11日 at 2:06 PM

Thanks for taking turns that good written content on your site. I discovered it on the internet. I will check to come back after you post extra aricles.

来自法国

Reply
- xie guigang says:
  
  2022年5月11日 at 8:59 PM
  
  Hi, thx for the comments
  
  来自江苏
  
  Reply
JXozlGpfUaMW says:

2021年10月22日 at 12:34 PM

TIGlsWeHw

来自泰国

Reply
记一次浅尝辄止的FBA代谢网络分析 – この中二病に爆焔を！ says:

2021年9月12日 at 2:42 PM

[…] 在这里我们使用的是KEGG的KAAS自动化注释系统中的BHR评分算法来进行直系同源的注释分析。关于BHR评分算法进行直系同源注释的更加详细的讲解，大家可以阅读我的这篇博客文章：《KEGG的BHR评分注释直系同源》，在这里不做过多的赘述。 […]

来自中国

Reply

https://gcmodeller.org/

蛋白质同源性注释的两个等级

BHR评分注释直系同源

BHR的代码实现

Attachments

Related

7 Responses

Leave a Reply Cancel reply

Recent Posts

Archives

博客文章

KEGG的BHR评分注释直系同源

https://gcmodeller.org/

蛋白质同源性注释的两个等级

BHR评分注释直系同源

BHR的代码实现

Order by Date Name Attachments

Related

7 Responses

Leave a Reply Cancel reply

Recent Posts

Archives

博客文章

Tags

Attachments