千家信息网

perl如何分离NR NT库

发表于:2025-02-01 作者:千家信息网编辑
千家信息网最后更新 2025年02月01日,这篇文章主要为大家展示了"perl如何分离NR NT库",内容简而易懂,条理清晰,希望能够帮助大家解决疑惑,下面让小编带领大家一起研究并学习一下"perl如何分离NR NT库"这篇文章吧。分离NR N
千家信息网最后更新 2025年02月01日perl如何分离NR NT库

这篇文章主要为大家展示了"perl如何分离NR NT库",内容简而易懂,条理清晰,希望能够帮助大家解决疑惑,下面让小编带领大家一起研究并学习一下"perl如何分离NR NT库"这篇文章吧。

分离NR NT库,快速blast本地比对同源注释基因

我们需要对自己的基因做注释,需要blast同源比对NCBI当中的NR NT库;通常做无参转录组,会组织出10几万的unigene ,如果比对全库的话,就太浪费时间了,我们可以根据NCBI的分类数据库将数据库分开,可下载如下文件,然后利用下面的perl脚本就可以把NR或者NT库分开成小库:

wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gzwget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gzwget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gzwget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gzwget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz

perl脚本如下,由于脚本会把分类文件全部读入内存,所有脚本内存消耗较大,建议确保100G以上的内存空间:

#The gi_taxid_nucl.dmp is about 160 MB and contains  two columns: the nucleotide's gi and taxid.#The gi_taxid_prot.dmp is about 17 MB and contains two columns:  the protein's gi  and taxid.#Divisions file (division.dmp):#       division id                                -- taxonomy database division id#       division cde                               -- GenBank division code (three characters)#       division name                              -- e.g. BCT, PLN, VRT, MAM, PRI...#       comments#0       |       BCT     |       Bacteria        |               |#1       |       INV     |       Invertebrates   |               |#2       |       MAM     |       Mammals |               |#3       |       PHG     |       Phages  |               |#4       |       PLN     |       Plants and Fungi        |               |#5       |       PRI     |       Primates        |               |#6       |       ROD     |       Rodents |               |#7       |       SYN     |       Synthetic and Chimeric  |               |#8       |       UNA     |       Unassigned      |       No species nodes should inherit this division assignment        |#9       |       VRL     |       Viruses |               |#10      |       VRT     |       Vertebrates     |               |#11      |       ENV     |       Environmental samples   |       Anonymous sequences cloned directly from the environment        |#nodes.dmp file consists of taxonomy nodes. The description for each node includes the following#fields:#        tax_id                                  -- node id in GenBank taxonomy database#        parent tax_id                           -- parent node id in GenBank taxonomy database#        rank                                    -- rank of this node (superkingdom, kingdom, ...)#        embl code                               -- locus-name prefix; not unique#        division id                             -- see division.dmp file#        inherited div flag  (1 or 0)            -- 1 if node inherits division from parent#        genetic code id                         -- see gencode.dmp file#        inherited GC  flag  (1 or 0)            -- 1 if node inherits genetic code from parent#        mitochondrial genetic code id           -- see gencode.dmp file#        inherited MGC flag  (1 or 0)            -- 1 if node inherits mitochondrial gencode from parent#        GenBank hidden flag (1 or 0)            -- 1 if name is suppressed in GenBank entry lineage#        hidden subtree root flag (1 or 0)       -- 1 if this subtree has no sequence data yet#        comments                                -- free-text comments and citationsdie "perl $0     " unless(@ARGV==5);use Math::BigFloat;use Bio::SeqIO;use Bio::Seq;use Data::Dumper;use PerlIO::gzip;use FileHandle;use Cwd qw(abs_path getcwd);if ($ARGV[3]=~/gz$/){        open $Fa, "<:gzip", "$ARGV[3]" or die "$!";}else{        open $Fa, "$ARGV[3]" or die "$!";}my$od=$ARGV[4];$od||=getcwd;$od=abs_path($od);unless(-d $od){    mkdir $od;}my $in  = Bio::SeqIO->new(-fh => $Fa ,                               -format => 'Fasta');my %division2name=();open IN,"$ARGV[0]" or die "$!";while(){        chomp;        my($taxid,$name)=split(/\s+\|\s+/);        $division2name{$taxid}=$name;}close(IN);my%fout=();for my$i (keys %division2name){                                my$FO=FileHandle->new("> $od/$division2name{$i}.fa");                my $out = Bio::SeqIO->new(-fh => $FO ,  -format => 'Fasta');        $fout{$i}=$out;}print "$ARGV[0] readed\n";#print Dumper(\%fout);#print Dumper(\%division2name);open IN,"$ARGV[1]" or die "$!";my%taxid2division=();while(){        chomp;        my@tmp=split(/\s+\|\s+/);        $taxid2division{$tmp[0]}=$tmp[4];        #last if $.>100;}close(IN);print "$ARGV[1] readed\n";my %gi2taxid=();open IN,"$ARGV[2]" or die "$!";while(){        chomp;        my@tmp=split(/\s+/);        $gi2taxid{$tmp[0]}=$tmp[1];        #last if $.>100;}close(IN);print "$ARGV[2] readed\n";#print Dumper(\%gi2taxid);while( my $seq = $in->next_seq()){                my $id=$seq->id;        my ($gi)=($id=~/gi\|(\d+)\|ref\|/);        if(exists($gi2taxid{$gi}) and exists($taxid2division{$gi2taxid{$gi}})){                $fout{$taxid2division{$gi2taxid{$gi}}}->write_seq($seq);        }else{                print "unknown gi:$gi\n";        }}for my $i (keys %fout){                $fout{$i}->close();                }

以上是"perl如何分离NR NT库"这篇文章的所有内容,感谢各位的阅读!相信大家都有了一定的了解,希望分享的内容对大家有所帮助,如果还想学习更多知识,欢迎关注行业资讯频道!

0