Converting Gene Symbol Aliases to the Current Symbol for All Species

Published: 2024-05-13

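Motivation

In data analysis, a gene of interest is often missing from the expression matrix. The two likely causes are that the gene was not detected, or that one side uses an outdated Symbol. To rule out the second cause, we need a mapping between old Symbols (aliases) and the current Symbol.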
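As a quick illustration of the problem, the sketch below looks up a list of genes in a toy matrix. The matrix values and the variable names (`expr`, `genes_of_interest`) are made up for this example; the HIST1H2BA/H2BC1 pair is taken from the conversion output later in this post.

```python
import pandas as pd

# Toy expression matrix indexed by gene Symbol (values are invented).
expr = pd.DataFrame(
    {"sample1": [1.0, 2.0, 3.0], "sample2": [4.0, 5.0, 6.0]},
    index=["H2BC1", "UBB", "PYGB"],
)

# Genes we want to look up; HIST1H2BA is the older name of H2BC1.
genes_of_interest = ["HIST1H2BA", "UBB"]

# Symbols that cannot be found in the matrix index as written.
missing = [gene for gene in genes_of_interest if gene not in expr.index]
print(missing)  # ['HIST1H2BA']
```

The gene is present in the matrix, but under its current Symbol, so a plain lookup reports it as missing.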

 

bq.tl.current_symbol converts the Symbols of an (expression) matrix to their current versions.

 

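First argument (frame): a DataFrame whose index holds the gene Symbols.

Second argument (reference): path to the Symbol-to-Alias mapping file.

Third argument (tax_id): the species tax_id, e.g. 9606 for human.

To obtain SymbolAlias_20230317.feather, send an email to victor@bioquest.cn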

 

Download the latest gene information from NCBI: https://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
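The full file covers every species, and the exploded mapping built from it in this post runs to tens of millions of rows. If only one species is needed, a chunked read keeps memory use down. The following is a minimal sketch, assuming the standard gene_info column names; `read_gene_info` and its `chunksize` default are hypothetical and not part of bioquest.

```python
import pandas as pd


def read_gene_info(path, tax_id, chunksize=1_000_000):
    """Read only the rows of one species from a gene_info file.

    Hypothetical helper for illustration; compression is inferred by
    pandas from the file extension (e.g. ".gz").
    """
    cols = ["#tax_id", "GeneID", "Symbol", "Synonyms"]
    kept = []
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(path, sep="\t", usecols=cols, chunksize=chunksize):
        kept.append(chunk[chunk["#tax_id"] == tax_id])
    return pd.concat(kept, ignore_index=True)


# Example call (requires the downloaded file):
# human = read_gene_info("gene_info.gz", tax_id=9606)
```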

 

import numpy as np

import pandas as pd

 

import bioquest as bq

Build the Symbol-to-Alias mapping

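# Read only the needed columns from the tab-separated NCBI file (here a dated copy of gene_info.gz).
g = pd.read_csv("gene_info_20230317.gz", sep='\t', usecols=['#tax_id', 'GeneID', 'Symbol', 'Synonyms'])
# Drop the leading '#' from the first column name.
g.rename(columns={"#tax_id": "tax_id"}, inplace=True)
# Synonyms are '|'-separated; split them into lists, then emit one row per alias.
g.loc[:, "Alias"] = g.Synonyms.str.split('|')
g = g.explode("Alias")
# Keep only the four columns of the mapping.
g = bq.tl.select(g, columns=["tax_id", "GeneID", "Symbol", "Alias"])
# explode() repeats index labels; reset to a default index, which to_feather() requires.
g.reset_index(drop=True, inplace=True)
# NCBI writes '-' when a gene has no synonyms; store that as an empty string.
g.replace({'Alias': {'-': ''}}, inplace=True)
g.to_feather("SymbolAlias_20230317.feather", compression='zstd', compression_level=1)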

          tax_id   GeneID    Symbol       Alias
0         7        5692769   NEWENTRY
1         9        2827857   NEWENTRY
2         11       10823747  NEWENTRY
3         14       6951813   NEWENTRY
4         19       3758873   NEWENTRY
...       ...      ...       ...          ...
44205723  3032134  60460443  ND6
44205724  3032134  60460444  ND1
44205725  3032134  60460445  I9997_mgr02
44205726  3032134  60460446  I9997_mgt22
44205727  3032134  60460447  I9997_mgr01

[44205728 rows x 4 columns]
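To make the split/explode step concrete, here is a toy version of the same transformation in plain pandas, followed by a check that is worth running on any alias table: an alias is not guaranteed to be unique, so one old name can point to more than one current Symbol. All gene names and IDs below are invented, and the ambiguity check is an addition for illustration, not something the original pipeline performs.

```python
import pandas as pd

# Hypothetical records in the gene_info layout; all names are made up.
toy = pd.DataFrame(
    {
        "tax_id": [9606, 9606, 9606],
        "GeneID": [1, 2, 3],
        "Symbol": ["GENE_A", "GENE_B", "GENE_C"],
        "Synonyms": ["OLD_1|OLD_2", "OLD_2", "-"],
    }
)

# One row per (Symbol, Alias) pair: split on '|' and explode the lists.
toy["Alias"] = toy["Synonyms"].str.split("|")
toy = toy.explode("Alias").reset_index(drop=True)
toy["Alias"] = toy["Alias"].replace("-", "")
toy = toy[["tax_id", "GeneID", "Symbol", "Alias"]]

# Aliases shared by more than one current Symbol within a species.
has_alias = toy[toy["Alias"] != ""]
n_symbols = has_alias.groupby(["tax_id", "Alias"])["Symbol"].nunique()
ambiguous = n_symbols[n_symbols > 1].reset_index()["Alias"].tolist()

print(toy)
print(ambiguous)  # ['OLD_2']
```

GENE_A contributes two rows, GENE_C keeps a single row with an empty alias, and OLD_2 is flagged because both GENE_A and GENE_B list it.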

Usage example

Sample data

df = pd.read_csv("BLCA.csv",index_col="Gene Symbol")

#              Gene Name                                          Species
# Gene Symbol
# ATP2B1       ATPase, Ca++ transporting, plasma membrane 1       Homo sapiens
# MYL6         myosin, light chain 6, alkali, smooth muscle a...  Homo sapiens
# RPS16        ribosomal protein S16                              Homo sapiens
# HIST1H2BA    histone cluster 1, H2ba                            Homo sapiens
# H2AFY2       H2A histone family, member Y2                      Homo sapiens
# ...          ...                                                ...
# UBB          ubiquitin B                                        Homo sapiens
# PYGB         phosphorylase, glycogen; brain                     Homo sapiens
# HLA-A        major histocompatibility complex, class I, A       Homo sapiens
# HSPA1A       heat shock 70kDa protein 1A                        Homo sapiens
# HSP90AB1     heat shock protein 90kDa alpha (cytosolic), cl...  Homo sapiens

 

Conversion

bq.tl.current_symbol(frame=df,reference="SymbolAlias_20230317.feather", tax_id=9606)

#              Gene Name                                          Species       Alias
# H2BC1        histone cluster 1, H2ba                            Homo sapiens  HIST1H2BA
# MACROH2A2    H2A histone family, member Y2                      Homo sapiens  H2AFY2
# H3-3B        H3 histone, family 3B (H3.3B)                      Homo sapiens  H3F3B
# H1-5         histone cluster 1, H1b                             Homo sapiens  HIST1H1B
# DARS1        aspartyl-tRNA synthetase                           Homo sapiens  DARS
# ...          ...                                                ...           ...
# UBB          ubiquitin B                                        Homo sapiens  NaN
# PYGB         phosphorylase, glycogen; brain                     Homo sapiens  NaN
# HLA-A        major histocompatibility complex, class I, A       Homo sapiens  NaN
# HSPA1A       heat shock 70kDa protein 1A                        Homo sapiens  NaN
# HSP90AB1     heat shock protein 90kDa alpha (cytosolic), cl...  Homo sapiens  NaN

# [378 rows x 3 columns]
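For readers without bioquest installed, the idea behind the conversion can be sketched in plain pandas. This is not the bioquest implementation: `to_current_symbol` is a hypothetical helper, and its handling of edge cases (it skips aliases that map to several Symbols, and aliases that are themselves current Symbols) is an assumption made for this example. It reproduces the output layout shown above, where the Alias column holds the old name for renamed rows and NaN otherwise.

```python
import pandas as pd


def to_current_symbol(frame, mapping, tax_id):
    """Rename alias entries in frame.index to current Symbols.

    Hypothetical helper for illustration; not the bioquest implementation.
    """
    species = mapping[mapping["tax_id"] == tax_id]
    current = set(species["Symbol"])

    ref = species[species["Alias"] != ""].drop_duplicates(["Alias", "Symbol"])
    # Skip aliases that point to several Symbols or that are current Symbols.
    ref = ref[~ref["Alias"].duplicated(keep=False)]
    ref = ref[~ref["Alias"].isin(current)]
    alias_to_symbol = dict(zip(ref["Alias"], ref["Symbol"]))

    old = pd.Series(frame.index, index=frame.index)
    new = old.map(alias_to_symbol)  # NaN where the name is not a known alias
    changed = new.notna()

    out = frame.copy()
    out["Alias"] = old.where(changed).to_numpy()  # old name, NaN if unchanged
    out.index = new.where(changed, old).to_numpy()
    return out


# Toy mapping; the Symbol/Alias pairs come from the output above,
# the GeneID values are placeholders.
mapping = pd.DataFrame(
    {
        "tax_id": [9606, 9606, 9606],
        "GeneID": [1, 2, 3],
        "Symbol": ["H2BC1", "MACROH2A2", "UBB"],
        "Alias": ["HIST1H2BA", "H2AFY2", ""],
    }
)
frame = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]}, index=["HIST1H2BA", "H2AFY2", "UBB"]
)
converted = to_current_symbol(frame, mapping, tax_id=9606)
print(converted)
```

Skipping ambiguous aliases leaves those rows under their original names, which seems safer than silently picking one of several candidate Symbols.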

