26年更新:使用Diamond比对NR数据库获取物种注释 之前用Kraken2注释宏基因组的contig发现只有30%左右可以被Kraken2注释Kraken2Bracken宏基因组物种注释-CSDN博客不信邪再用NR库试试参考将NR数据库diamond比对结果做物种注释_diamond 物种注释-CSDN博客NR下载nohup wget -t 0 -c -b https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz wget -c https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz.md5 #检查下载的完整性一样的话就是完整了 md5sum nr.gz ### 899ac219cac213c60fede9c3d9ef8f7b nr.gz cat nr.gz.md5 ### 899ac219cac213c60fede9c3d9ef8f7b nr.gz nohup gunzip nr.gz mv nr nr.fa ###下载NR物种相关信息和taxid信息 wget -t 0 -c -b https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz wget -t 0 -c -b https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz.md5 wget -t 0 -c -b https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz wget -t 0 -c -b https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5 #检查完整性 md5sum prot.accession2taxid.FULL.gz cat prot.accession2taxid.FULL.gz.md5 md5sum taxdump.tar.gz cat taxdump.tar.gz.md5 #解压 tar -zxvf taxdump.tar.gz gunzip prot.accession2taxid.gz #移动需要的文件 cd ~ mkdir .taxonkit cp *.dmp ~/.taxonkit更新2026.6.52024年后NR数据库没更新好像改为NR_cluster了for i in $(seq -w 0 159)dowget -c https://ftp.ncbi.nlm.nih.gov/blast/db/experimental/nr_cluster_seq.${i}.tar.gzwget -c https://ftp.ncbi.nlm.nih.gov/blast/db/experimental/nr_cluster_seq.${i}.tar.gz.md5done必要软件下载conda create -n NR_database_search conda activate NR_database_search conda install -c bioconda taxonkit0.15.0 conda install -c bioconda diamond2.1.8 conda install -c bioconda csvtk0.28.0diamond建库nohup diamond makedb --threads 180 --in nr.fa --db NR_2023_07_23 diamond比对diamond blastx --db NR_2023_07_23.dmnd --query nucleic_reads.fna\ -o nucleic_matches_fmt6.txt --threads 180 --evalue 0.00001 \ --max-target-seqs 5 --outfmt 6 diamond blastp --db NR_2023_07_23.dmnd --query protein_reads.fna\ -o protein_matches_fmt6.txt --threads 180 --evalue 0.00001 \ --max-target-seqs 10000 --outfmt 6 ## --outfmt 6 最好别改变这些参数可以调整diamond比对的速度or准确性这几个参数可以调整比对的coverageidentityscore结果如下 这个表头后面python会加taxonkit获得物种分类信息表感谢大佬一文完成nt库序列快速下载及blast结果注释物种 (qq.com)得到seqid注释之后可以搜索注释## 一些主要的物种编号(不含藻类和高级生物) # 2 bacteria # 2157 archaea # 4751 fungi # 10239 virus # 554915 Amoebozoa # 2686027 Ancyromonadida # 554296 Apusozoa # 1401294 Breviatea # 2608240 CRuMs # 2611352 Discoba # 2489521 Hemimastigophora # 2611341 Metamonada # 2795258 Malawimonadida # 3004206 Provora # 2698737 SAR #看taxnokit安好了么 taxonkit -h #创建目录 cd /home/zhongpei/database/NR_2023_07_23 mkdir /home/zhongpei/database/NR_2023_07_23/NCBI_Main_tax cd NCBI_Main_tax #开始构建不同的子库 taxonkit list -j 4 --ids 2 --indent NCBI_Main.taxid.bacteria.txt taxonkit list -j 4 --ids 2157 --indent NCBI_Main.taxid.archaea.txt taxonkit list -j 4 --ids 4751 --indent NCBI_Main.taxid.fungi.txt taxonkit list -j 4 --ids 10239 --indent NCBI_Main.taxid.virus.txt taxonkit list -j 4 --ids 554915,2686027,554296,1401294,2608240,2611352,2489521,2611341,2795258,3004206,2698737 --indent NCBI_Main.taxid.Eukaryotic.txt # -j 是线程软件说4个够了--ids 是需要的物种编号用逗号分隔 # 提取taxid和taxonomy界门纲目科属种的对应信息到NCBI_Main.taxid.txt less NCBI_Main.taxid.bacteria.txt | taxonkit reformat -I 1 -r Unassigned -f {k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}| sed 1i\Taxid\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\tStrain NCBI_Main.taxid.bacteria_new.txt less NCBI_Main.taxid.archaea.txt | taxonkit reformat -I 1 -r Unassigned -f {k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}| sed 1i\Taxid\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\tStrain NCBI_Main.taxid.archaea_new.txt less NCBI_Main.taxid.fungi.txt | taxonkit reformat -I 1 -r Unassigned -f {k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}| sed 1i\Taxid\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\tStrain NCBI_Main.taxid.fungi_new.txt less NCBI_Main.taxid.virus.txt | taxonkit reformat -I 1 -r Unassigned -f {k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}| sed 1i\Taxid\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\tStrain NCBI_Main.taxid.virus_new.txt less NCBI_Main.taxid.Eukaryotic.txt | taxonkit reformat -I 1 -r Unassigned -f {k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}| sed 1i\Taxid\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\tStrain NCBI_Main.taxid.Eukaryotic_new.txt # -I 1 一个制表符分隔-r 没有找到的用什么字符去填充这里用的“Unassigned” # -f 输出的格式1i表示在第行之前插入文本(sed用法不太会)完成vim NCBI_Main.taxid_new.txt把第一个Taxid改成小写seqid和taxid的对应还记不记得第一步下载过一个 prot.accession2taxid 现在要派上用场了其实python也能做csvtk太不熟悉了先来学习一下吧感觉还挺方便这一步比较慢cat prot.accession2taxid.FULL | csvtk -t grep -f taxid -P NCBI_Main.taxid.txt | csvtk -t cut -f accession.version,taxid NCBI_seqid_taxid.txt # cat prot.accession2taxid | 把 prot.accession2taxid 的内容到下面的 csvtk # -t 输入内容是制表符分隔grep 这是csvtk的1个子命令用于在文件中搜索匹配的行 # -P 搜索那些taxid字段的值出现在NCBI_Main.taxid.txt文件中的行 # cut 这是csvtk的1个子命令用于从输入中选择特定的字段 # 如果细菌的库太大了可以分割 nohup split -l 2000000000 NCBI_seqid_taxid_bacteria.txt NCBI_seqid_taxid_bacteria_NCBI_seqid_taxid.txt 就是目标文件diamond seqid和taxid对应再和界门纲目科属种对应筛选diamond结果里面得分最高的第一个出现的#!/usr/bin/env python import argparse parser argparse.ArgumentParser(descriptionFilter input file to retain only the first occurrence of each qseqid.) parser.add_argument(--input, -i, requiredTrue, helpPath to the input file.) parser.add_argument(--output, -o, requiredTrue, helpPath to the output file.) args parser.parse_args() input_file args.input output_file args.output seen_qseqids set() with open(input_file, r) as infile, open(output_file, w) as outfile: first_line infile.readline().strip() if qseqid in first_line.lower(): outfile.write(first_line \n) else: parts first_line.split(\t) qseqid parts[0] if qseqid not in seen_qseqids: outfile.write(first_line \n) seen_qseqids.add(qseqid) for line in infile: parts line.strip().split(\t) qseqid parts[0] if qseqid not in seen_qseqids: outfile.write(line) seen_qseqids.add(qseqid) print(fFiltered file saved as {output_file}.)把diamond结果文件与NCBI_seqid_taxid.txt对应#!/usr/bin/env python # ########################################################## # match diamond blast NR result.txt and NCBI_seqid_taxid.txt # written by PeiZhong in IFR of CAAS import argparse import os parser argparse.ArgumentParser(descriptionMatch Diamond BLAST NR result.txt and NCBI_seqid_taxid.txt. !!! All files should end with .txt !!!) parser.add_argument(--file_maker, -m, nargs, requiredTrue, helpKeywords to filter input files (e.g., diamond txt)) parser.add_argument(--input_dir, -i, requiredTrue, helpInput directory containing result files) parser.add_argument(--database, -db, requiredTrue, helpFile NCBI_seqid_taxid.txt) parser.add_argument(--out_name, -o, requiredTrue, helpOutput name for processed result files) args parser.parse_args() file_makers args.file_maker input_dir args.input_dir database args.database out_name args.out_name db {} # Step 1: Read and build the database print(Start reading database...) with open(database, r) as f_table: for line in f_table: line line.split(\t) acc_num line[0].strip() tax_num line[1].strip() db[acc_num] tax_num print(Finished building database.) # Step 2: Find files in input directory matching all makers print(fScanning input directory: {input_dir}) all_files os.listdir(input_dir) filtered_files [ os.path.join(input_dir, file) for file in all_files if all(maker in file for maker in file_makers) ] if not filtered_files: print(fNo files matched the makers: { .join(file_makers)} in directory {input_dir}) exit(1) print(fFound {len(filtered_files)} file(s) matching the criteria.) # Step 3: Process each filtered file for result in filtered_files: print(fProcessing file: {result}) # Add header to output file header qseqid\taccession.version\ttaxid with open(result, r) as f_result, open(out_name, w) as f_out: f_out.write(header \n) # Match each line in the result file with the database for line in f_result: if qseqid in line: # Skip header continue line line.strip().split(\t) query_num line[0].strip() acc_q_num line[1].strip() # Check if the accession number exists in the database if acc_q_num in db: tax_id db[acc_q_num] else: continue f_out.write(f{query_num}\t{acc_q_num}\t{tax_id}\n) print(fFinished processing file: {result}. Output saved to {out_name}) print(All files processed.)chmod x diamond_NR_tax.py(完整地址) diamond_NR_tax.py(完整地址) -h nohup /home/zhongpei/hard_disk_sda2/zhongpei/Software/my_script/diamond_NR_tax.py --file_maker diamond --input_dir /home/zhongpei/diarrhoea/fan/metagenomic/bin/CAT_bins --database NCBI_seqid_taxid.txt --outdir . # 可以写个shell代码批量比对所有seqid_taxid文件 for i in NCBI_seqid_taxid_*.txt do num${i%%.txt} /home/zhongpei/hard_disk_sda2/zhongpei/Software/my_script/diamond_NR_tax.py -m diamond txt -i /home/zhongpei/diarrhoea/fan/metagenomic/assembly_result -db ${num}.txt -o ${num}_result.txt done做完结果是这样的接下来再写个python代码根据taxid把这个文件和界门纲目科属种联系起来就行(不好意思只会python我检讨。。。。但是python简单呀)#!/usr/bin/env python # ########################################################## # Match Diamond BLAST NR taxid results and NCBI_Main.taxid_new.txt # Written by PeiZhong in IFR of CAAS import os import argparse # Step 1: Set up argument parser parser argparse.ArgumentParser(descriptionMatch Diamond BLAST NR taxid result.txt and NCBI_Main.taxid_new.txt. !!! All files should end with .txt !!!) parser.add_argument(--input_txt, -i, requiredTrue, helppath of the folder containing your seqid_taxid files) parser.add_argument(--ncbi_db, -db, requiredTrue, helppath to NCBI_Main.taxid_new.txt file) parser.add_argument(--out, -o, requiredTrue, helpOutput directory to save results) args parser.parse_args() input_txt args.input_txt ncbi_db args.ncbi_db out args.out # Step 1: Build the taxid database print(Reading NCBI database...) db {} with open(ncbi_db, r) as f_db: for line in f_db: line line.strip().split(\t) taxid line[0].strip() tax_info line[1:8] # Extract additional taxonomic information db[taxid] tax_info print(fDatabase loaded: {len(db)} entries.) # Step 2: Process result file print(fProcessing file: {input_txt}) with open(input_txt, r) as f_result, open(out, w) as f_out: # Add header to output file header qseqid\taccession.version\ttaxid\tkingdom\tphylum\tclass\torder\tfamily\tgenus\tspecies f_out.write(header \n) for line in f_result: line line.strip().split(\t) if len(line) 3: # Skip lines with insufficient columns continue query_num line[0].strip() acc_q_num line[1].strip() taxid_1 line[2].strip() if taxid_1 in db: tax_info \t.join(db[taxid_1]) # Retrieve taxonomic information f_out.write(f{query_num}\t{acc_q_num}\t{taxid_1}\t{tax_info}\n) print(All files processed.)和上面一样也需要给权限chmod 755 NR_taxid_tax.py # 批量 for i in NCBI_seqid_taxid_bacteria_*_result.txt do num${i%%_result.txt} /home/zhongpei/hard_disk_sda2/zhongpei/Software/my_script/NR_taxid_tax.py -i ${i} -db NCBI_Main.taxid.bacteria_new.txt -o ${num}_tax.txt done /home/zhongpei/hard_disk_sda2/zhongpei/Software/my_script/NR_taxid_tax.py -i NCBI_seqid_taxid_virus_result.txt -db NCBI_Main.taxid.virus_new.txt -o NCBI_seqid_taxid_virus_tax.txt /home/zhongpei/hard_disk_sda2/zhongpei/Software/my_script/NR_taxid_tax.py -i NCBI_seqid_taxid_fungi_result.txt -db NCBI_Main.taxid.fungi_new.txt -o NCBI_seqid_taxid_fungi_tax.txt /home/zhongpei/hard_disk_sda2/zhongpei/Software/my_script/NR_taxid_tax.py -i NCBI_seqid_taxid_archaea_result.txt -db NCBI_Main.taxid.archaea_new.txt -o NCBI_seqid_taxid_archaea_tax.txt /home/zhongpei/hard_disk_sda2/zhongpei/Software/my_script/NR_taxid_tax.py -i NCBI_seqid_taxid_Eukaryotic_result.txt -db NCBI_Main.taxid.Eukaryotic_new.txt -o NCBI_seqid_taxid_Eukaryotic_tax.txt好了可以交差了 我宣布python是我们初学者的yyds根据物种注释计算宏基因组物种TPM相对丰度这一步我也是想了一会是按taxid进行group之后求和还是按照物种注释的各种界门纲目科属聚类呢最后还是决定使用界门纲目科属种这样比较直觉很多公司也是给的这个结果应该。首先我们需要一个TPM表格是基于所有基因的。这个可以参考之前的BWA-MEM2, Samtools: TPM定量-CSDN博客其次我们需要NR_taxid_tax.py这一步的结果并且合并在一起cat *_tax.txt ALL_tax.txt编写个代码按照物种去求和TPM同时可以处理界门纲目科属种水平#! /usr/bin/env python ######################################################### # TPM add up according to taxonomy file # written by PeiZhong in IFR of CAAS import argparse import pandas as pd parser argparse.ArgumentParser(descriptionTPM count file combine) parser.add_argument(--tpm, helpinput tpm txt name) parser.add_argument(--tax, helpinput tax file name) parser.add_argument(--out_prefix, helpoutput file prefix) parser.add_argument(--level, -l, typestr, helpKingdom, Phylum, Class, Order, Family, Genus, Species, only use one in a time) args parser.parse_args() tpm args.tpm tax args.tax out_prefix args.out_prefix level args.level ls_level [Kingdom,Phylum,Class,Order,Family,Genus,Species] if level not in ls_level: raise ValueError(Invalid level specified. Use Kingdom, Phylum, Class, Order, Family, Genus, Species.) # read TPM table df pd.read_csv(tpm, sep \t) # read tax file df2 pd.read_csv(tax, sep\t, header0, dtypestr) # delete some columns df2.drop(columns[accession.version, taxid], inplaceTrue) # delete some headers df2 df2[~df2[qseqid].str.contains(qseqid, naFalse)] # add new header df2.columns [GeneID, Kingdom, Phylum, Class, Order, Family, Genus, Species] # get level if level Kingdom: df2 df2[[GeneID, Kingdom]] else: df2 df2[[GeneID, Kingdom, level]] # merge df and df2 by column merge_df pd.merge(df, df2, onGeneID, howleft) # merge df and df2 according to column1 (as key) merge_df[level] merge_df[level].fillna(Unassigned) # dont forget genes unmap by NR # add TPM together according to bin_contig kingdom_groups merge_df.groupby(Kingdom) if level Kingdom: result_df kingdom_groups.sum() output_file f{out_prefix}_Kingdom.txt result_df.to_csv(output_file, sep\t, float_format%.4f) print(fSaved data to {output_file}) else: for kingdom, group in kingdom_groups: group[level] group[level].apply( lambda x: fUnassigned_{kingdom} if x Unassigned else x ) result_df group.groupby(level).sum() output_file f{out_prefix}_{kingdom}_{level}.txt result_df.to_csv(output_file, sep\t, float_format%.4f) print(fSaved {kingdom} data to {output_file})chmod 755 TPM_together_by_taxonomy.py # Kingdom TPM_together_by_taxonomy.py --tpm /home/zhongpei/diarrhoea/fan/metagenomic/assembly_result/goat_metagenomic_TPM.txt --tax ALL_tax.txt --out_prefix fan_all_cds_taxonomy -l Kingdom # 门水平 TPM_together_by_taxonomy.py --tpm /home/zhongpei/diarrhoea/fan/metagenomic/assembly_result/goat_metagenomic_TPM.txt --tax ALL_tax.txt --out_prefix fan_all_cds_taxonomy -l Phylum得出结果大功告成