sort | awk |

 

1. perl 截取字符串

#   要求,在下面字符串中,得到第三个‘-’的位置。

#!/usr/bin/perl
#use strict;
use warnings;

my $text='asdf-asdfasdf/asdfasdf-asdfasdfasd,werdfsadf-asdfasdf-asd';
my $len=length($text);
#print($len);
$sum=0;
for($i=0;$i<3;$i++)
{
$s=index($text,'-');
$sum+=$s+1;
$text=substr($text,$s+1,$len);
}
print($sum);

2. perl index和rindex的用法

perl中,如果我们要查找一个字符子串在字符串中的位置,perl给我们提供了两个函数,一个为index,另外一个rindex。从两个函数上我们其实就可以看出,index为从前往后查找(从左到右),而rindex则是从后往前(从右到左)查找。下面我们就通过一个例子来看看index和rindex是怎么工作的。

  1. #! /usr/bin/perl
  2. use strict;
  3. use warnings;
  4.  
  5. my $string = "I can learn much form perlcn.com";
  6. my $loc = index($string,"perlcn");
  7. print "$loc\n";
  8. $loc = index($string,"I");
  9. print "$loc\n";
  10. $loc = index($string,"Perlcn");
  11. print "$loc\n";
  12. $loc = rindex($string,"perlcn");
  13. print "$loc\n";

3. 截取第一个空格前的字符串

#!/usr/bin/perl
#use strict;
use warnings;

my $text='comp0_c0_seq1 N/A N/A N/A';

$s=index($text,' ');

$text=substr($text,0,$s);

print $text;

 

4.截取第二个空格前的字符串

#!/usr/bin/perl
#use strict;
use warnings;

my $text='comp0_c0_seq1 N/A N/A N/A';
my $len=length($text);
$text1=$text;
#print($len);
$sum=0;
for($i=0;$i<2;$i++)
{
$s=index($text,' ');
$sum+=$s+1;
$text=substr($text,$s+1,$len);
}
$text2=substr($text1,0,$sum);
print $text2;

 

5. 截取第一列ID号,原数据用tab-delimited间隔。

#!/usr/bin/env perl

#comp0_c0_seq1 N/A N/A N/A
#comp2_c0_seq1 N/A N/A N/A
#comp4_c0_seq1 N/A N/A N/A
#comp5_c0_seq1 N/A N/A N/A

open $fd, '<', 'test_duan.txt' or die $!;
while (<$fd>) {
($ID) = split;
print "$ID\n";
}
close $fd;

 

6. index函数详解

The index() function is used to determine the position of a letter or a substring in a string. For example, in the word "frog" the letter "f" is in position 0, the "r" in position 1, the "o" in 2 and the "g" in 3. The substring "ro" is in position 1. Depending on what you are trying to achieve, the index() function may be faster or more easy to understand than using a regular expression or the split() function.

Example 1a. Where in a string is a certain letter?

Let's say you are trying to determine if a word has a particular letter in it. Let's look for the letter "l" in the string "perlmeme.org".

#!/usr/bin/perl
use strict;
use warnings;

my $string = 'perlmeme.org';
my $char = 'l';

my $result = index($string, $char);

print "Result: $resultn";

This program gives you:

Result: 3  

Example 1b. The string doesn't contain the letter?

If the string doesn't contain the letter, index() will return a -1. For example, we can look for the letter "L" in the string "perlmeme.org":

#!/usr/bin/perl
use strict;
use warnings;

my $string = 'perlmeme.org';
my $char = 'L';

my $result = index($string, $char);

print "Result: $resultn";

The program outputs:

Result: -1  

Example 1c. The string contains more than one of the letter?

If the letter we're searching for appears more than once in the string, index() return the index of the first occurrence of the letter.

#!/usr/bin/perl
use strict;
use warnings;

my $string = 'perlmeme.org';
my $char = 'e';

my $result = index($string, $char);

print "Result: $resultn";

This program gives you:

Result: 1  

Looking for a substring (rather than a single character) works in exactly the same way as looking for a single character:

#!/usr/bin/perl
use strict;
use warnings;

my $string = 'perlmeme.org';
my $substr = 'me';

my $result = index($string, $substr);

print "Result: $resultn";

This program gives you:

Result: 4  

The index() function let's you specify an offset. This tells index() where to start looking in the string. In our example, we found the first occurrence of the letter 'e' at position 1. Let's try to find the second:

#!/usr/bin/perl

use strict;
use warnings;

my $string = 'perlmeme.org';
my $char = 'e';

# The first 'e' was at position 1, so let's start
# looking again at position 2
my $offset = 2;

my $result = index($string, $char, $offset);

print "Result: $resultn";

The program outputs:

Result: 5  

Example 3b. How to find every occurrence

To find (and do something with) every occurrence of a character in a string, you could use index() in a loop, incrementing the offset each time:

#!/usr/bin/perl
use strict;
use warnings;

my $string = 'perlmeme.org';
my $char = 'e';
my $offset = 0;

my $result = index($string, $char, $offset);

while ($result != -1) {

print "Found $char at $resultn";

$offset = $result + 1;
$result = index($string, $char, $offset);

}

When we run this program, we get the following output:

Found e at 1
Found e at 5
Found e at 7

Example 4. How to find the last occurrence

Instead of looping through every occurrence of a letter in a string to find the last one, you can use the rindex() function. This works exactly the same as index() but it starts at the end of the string. The index value returned is string from the start of the string though.

#!/usr/bin/perl
use strict;
use warnings;

my $string = 'perlmeme.org';
my $char = 'e';

my $result = rindex($string, $char);

print "Result: $resultn";

This would produce:

Result: 7  

7. 合并ID相同的列

 

 

file1.txt

chr:123 a  
chr:123 b  
chr:456 a

file2.txt

chr:123 aa  
chr:456 bb

output should look like:

chr:123 a aa  
chr:123 b aa  
chr:456 a bb

code:

use Modern::Perl;  
  
my %file1Hash;  
  
open my $file1, "<file1.txt" or die $!;  
do { my ( $key, $value ) = split; push @{ $file1Hash{$key} }, $value }    
    for <$file1>;  
close $file1;    

open my $file2, "<file2.txt" or die $!;  
do {  
    my ( $key, $value ) = split;      
    do { say "$key $_ $value" for @{ $file1Hash{$key} } } if $file1Hash{$key};    
    }    
    for <$file2>;  
close $file2;

Output:

chr:123 a aa  
chr:123 b aa  
chr:456 a bb


8.用B文件数据更新A文件数据:

#!/usr/bin/perl
use strict;
use warnings;
my $data = +{};
open my $fh_b ,"<","B.txt";
while (<$fh_b>) {
chomp;
s/\t/ /g;
my ($t,@c) = split /\s\s+/, $_;
$data->{$t} = \@c;
}
undef $fh_b;
open my $fh_a, "<","A.txt";
open my $fh_c, ">","C.txt";
while (<$fh_a>) {
chomp;
next if m/^\s+$/;
my @tmp = split;
local $" = "\t";
print {$fh_c} "@tmp\n" unless defined $data->{$tmp[0]};
print {$fh_c} "$tmp[0]\t@{$data->{$tmp[0]}}\n" if exists $data->{$tmp[0]};
}
undef $fh_c;
undef $fh_a;

9. 合并两个文件的列

(1)在linux/Unix下做:

paste file1 file2 > file3

就这么简单。

(2) 用perl

#!/usr/bin/perl
open my $fh_a,"<","A.txt" or die;
open my $fh_b,"<","B.txt" or die;
open my $fh_c,">","C.txt" or die;
my @A = <$fh_a>;
my @B = <$fh_b>;
chomp $A[$_],print {$fh_c} "$A[$_]\t$B[$_]" for 0 .. $#A;

也很简单。

合并三个以上文件可以类推:

#!/usr/bin/perl

open my $fh_a,"<","A.txt" or die;
open my $fh_b,"<","B.txt" or die;
open my $fh_c,"<","C.txt" or die;

open my $fh_d,">","D.txt" or die;
my @A = <$fh_a>;
my @B = <$fh_b>;
my @C = <$fh_c>;

chomp $A[$_],chomp $B[$_],print {$fh_d} "$A[$_]\t$B[$_]\t$C[$_]" for 0 .. $#A;

10. ID相同的数据行合并问题。











A1 a 1 3

A1 a|b 1 15
A1 b 0 12

A2 c|d|e 5 18
A2 c 2 16

A3 f 3 3
A2 d 1 0

A4 g 6 9
A2 e 2 2

H2 h|i 10 25
A3 f 3 3

D4 j 12 4
A4 g 6 9

S3 k|l 18 22
H2 h 2 22





H2 i 8 3





D4 j 12 4





S3 k 1 7





S3 l 17 15


#!/usr/bin/perl

use warnings;

use strict;

 

my $data = +{};

 

while (<>) {

my @tmp = split /\s+/, $_; # you can change this line if the separator is not '\s+'

if (defined $data->{$tmp[0]}) {

$data->{$tmp[0]} = [($data->{$tmp[0]}->[0] . "|" . $tmp[1]), ($data->{$tmp[0]}->[1] + $tmp[2]), ($data->{$tmp[0]}->[2] + $tmp[3]) ];

} else {

$data->{$tmp[0]} = [$tmp[1], $tmp[2],$tmp[3]];

}

}

 

foreach my $key ( sort keys %{$data} ) {

print "$key\t$data->{$key}->[0]\t$data->{$key}->[1]\t$data->{$key}->[2]\n";

}

用法: perl hebing.pl "input" > "output"

 

11. 求两个字符集合的交集

 

 

12. fasta格式截取某种长度范围内的序列

#!/usr/bin/perl

 

my $line_info;

 

while (<>) {

chomp;

if ($. % 2) {

$line_info .= "$_\n";

} elsif( length $_ <=15 && length $_ >= 10 ) {

$line_info .= $_,"\n";

print $line_info,"\n";

$line_info = '';

} else {

$line_info = '';

}

}

13. 截取长度不大于某一长度的序列

#!/usr/bin/perl

 

my $line_info;

 

while (<>) {

chomp;

if ($. % 2) {

$line_info .= "$_\n";

} elsif( length $_ <=15 ) {

$line_info .= $_,"\n";

print $line_info,"\n";

$line_info = '';

} else {

$line_info = '';

}

}

14. 截取长度大于某一长度的序列

#!/usr/bin/perl

 

my $line_info;

 

while (<>) {

chomp;

if ($. % 2) {

$line_info .= "$_\n";

} elsif( length $_ >=10 ) {

$line_info .= $_,"\n";

print $line_info,"\n";

$line_info = '';

} else {

$line_info = '';

}

}

15. 截取序列中含有某段序列(TGAC)的序列

#!/usr/bin/perl

 

my $line_info;

 

while (<>) {

chomp;

if ($. % 2) {

$line_info .= "$_\n";

} elsif( /TGAC/ ) {

$line_info .= $_,"\n";

print $line_info,"\n";

$line_info = '';

} else {

$line_info = '';

}

}

16. 截取序列中含有某段序列(TGA*C,或者TGAC)的序列

#!/usr/bin/perl

 

my $line_info;

 

while (<>) {

chomp;

if ($. % 2) {

$line_info .= "$_\n";

} elsif( /TGA.C|TGAC/ ) {

$line_info .= $_,"\n";

print $line_info,"\n";

$line_info = '';

} else {

$line_info = '';

}

}

17.文件行计数

#!/usr/bin/perl

open TFILE, "CH_0915-29-30.txt";
while(<TFILE>){
$cnt++;
}
print $cnt;
close (TFILE);

18.拆分文件

#!/usr/bin/perl

open (IN, "$ARGV[0]" ) || do {
print "\n没有辦法打開文件$ARGV[0]。\n\n使用方法:\nsplitfile 文件名 拆分行數\n";
exit 0; };
@char = <IN>;
close IN;

!$ARGV[1] && do {
print "\n請指定拆分行數。\n\n使用方法:\nsplitfile 文件名拆分行數\n";
exit 0; };
@char<=$ARGV[1] && do {
print "\n指定嘅拆分行數大過文件總行數,$ARGV[0]没有必要拆分。\n";
exit 0; };
($j = @char/$ARGV[1]) =~ s/\..*//;

@char%$ARGV[1] && $j++;

for ($i=1, $f=A; $i<=$j; $i++, $f++) {
open ($f, ">分割$i.txt" ) || do {
print "\n創建文件「分割$i.txt」失敗!\n";
exit 0; };
&RW($f);
close ($f);
}

sub RW {
my ($a) = @_;
@$a = splice (@char, 0, $ARGV[1]);
select $a;
print @$a;
}

用法: splitfile abc.txt 250000

19.统计匹配次数

grep -o "aaa" file | wc -l

20.计算fasta文件序列长度:

For example (this would the current fasta file)
>gi|546265522| SOX6
acgtcaatccag
cgattagcaaat
gtcctgattttgg

>gi|678457845| CMYC
gttaccatgcgatg
caatttgggacacc

I want (notice the ">" is removed:
gi|546265522| SOX6
Seq length: 36
gi|678457845| CMYC
Seq length: 28

代码:

#!/usr/bin/perl

use strict;
use warnings;

my $inFile = shift;
open (IN, "$inFile");

# By default Perl pulls in chunks of text up to a newline (\n) character; newline is
# the default Input Record Separator. You can change the Input Record Separator by
# using the special variable "$/". When dealing with FASTA files I normally change the
# Input Record Separator to ">" which allows your script to take in a full, multiline
# FASTA record at once.

$/ = ">";

# At each input your script will now read text up to and including the first ">" it encounters.
# This means you have to deal with the first ">" at the begining of the file as a special case.

my $junk = <IN>; # Discard the ">" at the begining of the file

# Now read through your input file one sequence record at a time. Each input record will be a
# multiline FASTA entry.

while ( my $record = <IN> ) {
chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record

# Now split up your record into its definition line and sequence lines using split at each newline.
# The definition will be stored in a scalar variable and each sequence line as an
# element of an array.

my ($defLine, @seqLines) = split /\n/, $record;

# Join the individual sequence lines into one single sequence and store in a scalar variable.

my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.

print "$defLine\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.

print "Seq Length: ", length($sequence), "\n"; # Print the sequence length and a newline

}

21. 计算fasta序列的行数

#!/usr/bin/perl

use strict;
use warnings;

my $inFile = shift;
open (IN, "$inFile");

# By default Perl pulls in chunks of text up to a newline (\n) character; newline is
# the default Input Record Separator. You can change the Input Record Separator by
# using the special variable "$/". When dealing with FASTA files I normally change the
# Input Record Separator to ">" which allows your script to take in a full, multiline
# FASTA record at once.

$/ = ">";

# At each input your script will now read text up to and including the first ">" it encounters.
# This means you have to deal with the first ">" at the begining of the file as a special case.

my $junk = <IN>; # Discard the ">" at the begining of the file

# Now read through your input file one sequence record at a time. Each input record will be a
# multiline FASTA entry.

while ( my $record = <IN> ) {
chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record

# Now split up your record into its definition line and sequence lines using split at each newline.
# The definition will be stored in a scalar variable and each sequence line as an
# element of an array.

my ($defLine, @seqLines) = split /\n/, $record;

# Join the individual sequence lines into one single sequence and store in a scalar variable.

# my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.
my $num=@seqLines;

print "$defLine", "\t", $num, "\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.

# print "Seq Length: ", $num, "\n"; # Print the sequence length and a newline

}

22. 计算序列的有关数据

数据:

>Cluster 904279
0 25nt, >HWI-1115:5559:2049... at +/100.00%
1 27nt, >HWI-1203:17149:5352... *
2 25nt, >HWI-1212:9354:38923... at +/96.00%
3 25nt, >HWI-1312:12753:2639... at +/100.00%
4 26nt, >HWI-1314:8582:10117... at +/100.00%
5 22nt, >HWI-2111:17761:8639... at +/100.00%
6 26nt, >HWI-2113:10976:2591... at +/100.00%
7 23nt, >HWI-2115:18474:4611... at +/100.00%
8 27nt, >HWI-2115:15416:5988... at +/96.30%
9 26nt, >HWI-2315:4233:12495... at +/100.00%
>Cluster 904280
0 27nt, >HWI-1203:17870:5352... *
>Cluster 904281
0 27nt, >HWI-1203:19923:5354... *
>Cluster 904282
0 25nt, >HWI-1105:13237:9766... at +/96.00%
1 27nt, >HWI-1203:20605:5351... *
2 25nt, >HWI-1209:5911:48690... at +/100.00%
>Cluster 904283
0 27nt, >HWI-1203:20901:5363... *
>Cluster 904284
0 24nt, >HWI-1201:21092:7836... at +/95.83%
1 27nt, >HWI-1203:1213:53789... *
2 25nt, >HWI-2101:17011:8320... at +/96.00%
>Cluster 904285
0 27nt, >HWI-1203:2188:53964... *
1 26nt, >HWI-2206:9782:92952... at -/96.15%
>Cluster 904286
0 27nt, >HWI-1203:2645:53941... *
>Cluster 904287
0 27nt, >HWI-1203:3160:53790... *
1 27nt, >HWI-1213:21360:3379... at +/96.30%
2 27nt, >HWI-1310:18182:9638... at +/96.30%
>Cluster 904288
0 27nt, >HWI-1203:4123:53893... *
>Cluster 904289
0 27nt, >HWI-1203:16145:5396... *
1 26nt, >HWI-2101:11750:6949... at +/96.15%
>Cluster 904290
0 27nt, >HWI-1203:16384:5379... *
>Cluster 904291
0 27nt, >HWI-1203:16896:5391... *
1 18nt, >HWI-2111:18752:5142... at +/100.00%
>Cluster 904292
0 27nt, >HWI-1203:18938:5380... *
1 27nt, >HWI-1207:15121:5399... at +/96.30%
>Cluster 904293
0 27nt, >HWI-1203:18800:5388... *
>Cluster 904294
0 27nt, >HWI-1203:20296:5378... *
>Cluster 904295
0 27nt, >HWI-1203:20890:5379... *
>Cluster 904296
0 27nt, >HWI-1203:5127:54010... *
>Cluster 904297
0 27nt, >HWI-1203:5004:54015... *
>Cluster 904298
0 27nt, >HWI-1203:6148:54093... *
1 24nt, >HWI-1208:5271:45926... at +/100.00%
2 27nt, >HWI-1303:8218:15416... at +/100.00%
3 27nt, >HWI-2113:20501:4152... at +/100.00%
4 27nt, >HWI-2209:11684:2284... at +/100.00%
5 26nt, >HWI-2311:17890:5590... at +/100.00%
>Cluster 904299
0 27nt, >HWI-1203:11343:5414... *
1 26nt, >HWI-1213:12148:9876... at +/96.15%
2 25nt, >HWI-1301:20814:5731... at +/100.00%
3 26nt, >HWI-1301:20796:7352... at +/96.15%
4 26nt, >HWI-2108:12828:4589... at +/96.15%
5 26nt, >HWI-2201:2810:16897... at +/96.15%
6 26nt, >HWI-2208:6779:41808... at +/96.15%

期望结果:

cluster 904279 1 27nt, >HWI-1203:17149:5352... * 10
cluster 904280 0 27nt, >HWI-1203:17870:5352... * 1
cluster 904281 0 27nt, >HWI-1203:19923:5354... * 1
cluster 904282 1 27nt, >HWI-1203:20605:5351... * 3
cluster 904283 0 27nt, >HWI-1203:20901:5363... * 1
cluster 904284 1 27nt, >HWI-1203:1213:53789... * 3
cluster 904285 0 27nt, >HWI-1203:2188:53964... * 2
cluster 904286 0 27nt, >HWI-1203:2645:53941... * 1
cluster 904287 0 27nt, >HWI-1203:3160:53790... * 3
cluster 904288 0 27nt, >HWI-1203:4123:53893... * 1
cluster 904289 0 27nt, >HWI-1203:16145:5396... * 2
cluster 904290 0 27nt, >HWI-1203:16384:5379... * 1
cluster 904291 0 27nt, >HWI-1203:16896:5391... * 2
cluster 904292 0 27nt, >HWI-1203:18938:5380... * 2
cluster 904293 0 27nt, >HWI-1203:18800:5388... * 1
cluster 904294 0 27nt, >HWI-1203:20296:5378... * 1
cluster 904295 0 27nt, >HWI-1203:20890:5379... * 1
cluster 904296 0 27nt, >HWI-1203:5127:54010... * 1
cluster 904297 0 27nt, >HWI-1203:5004:54015... * 1
cluster 904298 0 27nt, >HWI-1203:6148:54093... * 6
cluster 904299 0 27nt, >HWI-1203:11343:5414... * 7

代码:

#!/usr/bin/perl

use strict;
use warnings;

my $inFile = shift;
open (IN, "$inFile");
open (OUT, ">>outFile");

$/ = ">Cluster";

 

my $junk = <IN>; # Discard the ">" at the begining of the file

 

while ( my $record = <IN> ) {
chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record

 

my ($defLine, @seqLines) = split /\n/, $record;

my $num=@seqLines;

for (my $i=0;$i<=$num-1;$i++){

print "cluster","$defLine", "\t","$seqLines[$i]","\t", "$num\n" if ( $seqLines[$i]=~ />HWI-.*\*$/) ;

}

}

close (IN);
close (OUT);

 

23. 上面的数据中$num>10的留下来(count_seq-row.pl)。

#!/usr/bin/perl

use strict;
use warnings;

my $inFile = shift;
open (IN, "$inFile");
open (OUT, ">>outFile");

$/ = ">Cluster";

 

my $junk = <IN>; # Discard the ">" at the begining of the file

 

while ( my $record = <IN> ) {
chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record

 

my ($defLine, @seqLines) = split /\n/, $record;

my $num=@seqLines;
if ($num >=10){
for (my $i=0;$i<=$num-1;$i++){

print "cluster","$defLine", "\t","$seqLines[$i]","\t", "$num\n" if ( $seqLines[$i]=~ />HWI-.*\*$/) ;

}
}

}

close (IN);
close (OUT);

 

24. N50,N90计算

#/usr/bin/perl -w
use strict;
my ($len,$total)=(0,0);
my @x;
while(<>){
if(/^[\>\@]/){
if($len>0){
$total+=$len;
push @x,$len;
}
$len=0;
}
else{
s/\s//g;
$len+=length($_);
}
}
if ($len>0){
$total+=$len;
push @x,$len;
}
@x=sort{$b<=>$a} @x;
my ($count,$half)=(0,0);
for (my $j=0;$j<@x;$j++){
$count+=$x[$j];
if (($count>=$total/2)&&($half==0)){
print "N50: $x[$j]\n";
$half=$x[$j]
}elsif ($count>=$total*0.9){
print "N90: $x[$j]\n";
exit;
}
}

 

25. ~ /^\"(.*)\"\s+\<(\d+)-?(\d*)\>\s*$/ perl模式匹配中这个表示什么意思?

开头+双引号+0个或多个任意字符+双引号+一个以上空白符+一个小于号+一个以上数字+一个或者0个减号+任意长度的数字+大于号+0个或多个空白+结束

 

26. \S 匹配什么,\s匹配什么?

\S 表示匹配非空白字符,范围可广了,只要不是空格、换行符、制表符、换页符即可

\s匹配空白符

s/\s+// 表示只删除字符串中出现的第一处连续空白( " ", "\n ", "\r ", "\t " 等), s/\s+//g表示删除字符串中所有的连续空白.

 

27.如何从fasta文件中提取特定序列?

If you have a large number of sequences that you want to extract, then you most likely have the sequence identifiers in a separate file. Assuming that you have one sequence identifier per line in the file ids.file, then you can use this one line:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file > out.txt

==========

The first one liner is useful if you only want to extract a few sequences by their identifier from a FASTA file.

perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(id1 id2)}print if $c' fasta.file

This will extract the two sequences with the sequence idenfiers id1 and id2. You only have to change the identifiers within the parentheses and separate them by space to extract the sequences you need

 

28. 如何计算序列长度等信息?

#!/usr/bin/perl

use warnings;
use strict;
use Statistics::Descriptive;

my $stat = Statistics::Descriptive::Full->new();
my (%distrib);
my @bins = qw/18 19 20 21 22 23 24 25 26 27 28/;

my $fastaFile = shift;
open (FASTA, "d:/perl/psh0306.fasta");
$/ = ">";

my $junkFirstOne = <FASTA>;

while (<FASTA>) {
chomp;
my ($def,@seqlines) = split /\n/, $_;
my $seq = join '', @seqlines;
$stat->add_data(length($seq));
}

%distrib = $stat->frequency_distribution(\@bins);

print "Total reads:\t" . $stat->count() . "\n";
print "Total nt:\t" . $stat->sum() . "\n";
print "Mean length:\t" . $stat->mean() . "\n";
print "Median length:\t" . $stat->median() . "\n";
print "Mode length:\t" . $stat->mode() . "\n";
print "Max length:\t" . $stat->max() . "\n";
print "Min length:\t" . $stat->min() . "\n";
print "Length\t# Seqs\n";
foreach (sort {$a <=> $b} keys %distrib) {
print "$_\t$distrib{$_}\n";
}

29. 如何解析blast结果文件?

#!/usr/bin/perl

print "Content-type: text/html\n\n";
print "<html><h1>Hello!</h1></html>\n";

#!/bin/perl -w
use warnings;

use strict;
use Bio::SearchIO;

#my ($infile,$numHits,$outfile)=@ARGV;

my $infile="DM0313_new.txt";

my $numHits=1;

my $outfile="out_blast0314-3.txt";

print "Parse the blast result!\n";
my $in=Bio::SearchIO->new(-format=>'blast',-file=>$infile);

open OUT,">$outfile" or die "Cannot open $outfile:$!"; ###

#print OUT "query_name\taccession_number\tdescription\t";
#print OUT "query_end\thit_start\thit_end\tpositives\tidentical\n";

while (my $result=$in->next_result){

# output "no hits found if there is no hits
if ($result->num_hits==0){
print OUT "\tNo hits found\n";
} else {

my $count=0;
# process each hit recursively
while (my $hit=$result->next_hit){
print OUT $result->query_name . "\t";

# the length of thr query sequence
#print OUT $result->query_length, "\t"; ##
print OUT "\t" if ($count>0);
# get te accession numbers of the hits
print OUT "\t" . $hit->accession . "\t";
# get the length of the hit seqence
#print OUT $hit->length . "\t";
# get the description of the hit
print OUT $hit->description . "\t";
# get the E value
#print OUT $hit->significance . "\t";

# 50 get the hit score of the hit
#print OUT $hit->bits . "\t";
print OUT "\n";

$count++;
last if ($count==$numHits);
}
}
}

close OUT;
print "DONE!!!\n";

 

30. 如何检测与安装perl模块?

检测PERL模块是否安装
#查看所有的
perldoc perllocal
#查看 Getopt::Long.pm是否安装
perl -MGetopt::Long -e "print\"module installed\n\""
module installed

如何安装perl模块


运行perl脚本时,经常会发现如下类似的错误:
forrest@forrest-desktop:~/study/perl/log4perl$ ./logToScreen.pl 
Can't locate Log/Log4perl.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.10.0 /usr/local/share/perl/5.10.0 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl .) at ./logToScreen.pl line 3.
BEGIN failed--compilation aborted at ./logToScreen.pl line 3.
这个是因为
use Log::Log4perl;
Log::Log4perl模块没有安装。

forrest@ubuntu:~/study/perl$ ./memcached.pl 
Can't locate Cache/Memcached.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.10.1 /usr/local/share/perl/5.10.1 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl .) at ./memcached.pl line 5.
BEGIN failed--compilation aborted at ./memcached.pl line 5.
这个是因为
use Cache::Memcached;
Cache::Memcached模块没有安装。

解决方法:
网上搜索了一下:发现有一篇文章写的非常好:Perl (Pete's notes) [http://www.cisl.ucar.edu/nets/intro/staff/siemsen/tools/perl.html#log4perl]
Installing Perl modules with CPAN.pm (best way)

Use the CPAN.pm module. To read about it, do "perldoc CPAN", or in XEmacs use the Perldoc pull-down when you're editing a Perl file.

The first time you use CPAN.pm, it will ask a long series of questions, the answers for which can be found below. Don't answer them until you've installed ncftp on the local machine.

If you've already installed CPAN and just want to use it, do like
(as root)
(sudo) perl -MCPAN -e shell
install Log::Log4perl
install HTML::TokeParser::Simple
h
q

The above will install Log4perl in /usr/lib/perl5/site_perl/5.6.1/Log/Log4perl.
需要注意的是必须使用root权限才能安装成功。


今天翻看了一下《Learn Perl, 5th》,第十一章是Perl Modules,一开始就是介绍怎么安装perl模块的。感觉总结的非常好。
1. Fining Modules:
Modules come in two types: those that come with Perl and that you should have available to you, and those that you can get from CPAN to install yourself. 

To find modules that don’t come with Perl, start at either CPAN Search (http://search.cpan.org) or Kobes’ Search (http://kobesearch.cpan.org/).* You can browse through the categories or search directly.

tips: 如何检查一个perl模块是否已经安装了?
可以使用perldoc moduleName检查。
不过首先要现安装perl-doc
$sudo apt-get install perl-doc
$ perldoc CGI
Try it with a module that does not exist and you’ll get an error message.
$ perldoc Llamas
$ No documentation found for "Llamas".

最佳实践:使用MCPAN的m moduleName命令
cpan[1]> m DBI
CPAN: Storable loaded ok (v2.20)
Going to read '/home/forrest/.cpan/Metadata'
  Database was generated on Wed, 21 Jul 2010 07:35:04 GMT
Module id = DBI
    DESCRIPTION  Generic Database Interface (see DBD modules)
    CPAN_USERID  TIMB (Tim Bunce <Tim.Bunce@pobox.com>)
    CPAN_VERSION 1.612
    CPAN_FILE    T/TI/TIMB/DBI-1.612.tar.gz
    UPLOAD_DATE  2010-07-16
    DSLIP_STATUS MmcOp (mature,mailing-list,C,object-oriented,Standard-Perl)
    MANPAGE      DBI - Database independent interface for Perl
    INST_FILE    /usr/local/lib/perl/5.10.1/DBI.pm
    INST_VERSION 1.612


cpan[2]> m DBD::Oracle
Module id = DBD::Oracle
    DESCRIPTION  Oracle Driver for DBI
    CPAN_USERID  DBIML (DBI Mailing Lists <dbi-users@perl.org>)
    CPAN_VERSION 1.24
    CPAN_FILE    P/PY/PYTHIAN/DBD-Oracle-1.24b.tar.gz
    DSLIP_STATUS MmcO? (mature,mailing-list,C,object-oriented,)
    INST_FILE    (not installed)

2. Installing Modules
三种方法:
法一:下载安装包手动安装
you can download the distribution, unpack it, and run a series of commands from the shell. Check for a README or INSTALL file that gives you more information. If the module uses MakeMaker,? the sequence will be something like this:
    $ perl Makefile.PL
    $ make install
If you can’t install modules in the system-wide directories, you can specify another directory with a PREFIX argument to Makefile.PL:
    $ perl Makefile.PL PREFIX=/Users/fred/lib
Some Perl module authors use another module, Module::Build, to build and install their creations. That sequence will be something like this:
    $ perl Build.PL
    $ ./Build install
缺点:无法自动安装依赖的包(Some modules depend on other modules though, and they won’t work unless you install yet more modules.)

法二:使用Perl自带的模块——CPAN.pm模块
$ perl -MCPAN -e shell
就是我们前面介绍的方法,这里就不赘述了。

法三:使用Perl自带的一个perl脚本——cpan脚本
$ cpan Module::CoreList LWP CGI::Prototype

链接: http://fhqdddddd.blog.163.com/blog/static/1869915420120189647801/

 

31. perl模块安装

 

32.perl提取特定文本行的下一行如何实现?

open F, "d:/perl/output-0621.txt";
open OUT,">d:/perl/parse_out-0622.txt";
@array=<F>;
$count=-1;
foreach (@array){
$count++;
if(/Seq1,Seq2,Tot Score,Tot Energy,Max Score,Max Energy,Strand,Len1,Len2,Positions/){$start=$count+1; print OUT"$array[$start]\n";}}

 

33. Linux中fasta文件的拆分与合并

FASTA文件的拆分:
(1)如果从一个文件a提取第10至20个序列存到另一个文件b:

awk -v RS='>' 'NR>1{i++}i>=10&&i<=20{print ">"$0}' a.fasta|sed '/^$/d'>b.fasta

(2)将某一文件a中每一条序列保存到一个文件中:

awk '/^>/{f=++d".fasta"} {print > f}' input.fasta

FASTA文件合并:

cat *.fasta > output.fasta

34. 计算序列长度

如果想直接显示每条序列的长度,可以运行如下命令:

awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }'  contig.fa |awk '{print $1"\t"length($3)}'

35. 如何从GFF文件中提取CDS序列

Here is the perl script to get the cds sequences from gff3 file.

If you have any questions about the script or parsing gff3 file, pls link to:http://bioops.info/2011/03/parse-gff3-file/


#!/usr/bin/perl
# parse_gff3.pl
# get the cds sequences from gff3 file.
# usage: perl parse_gff3.pl gff3_input_file genome_sequence_directory >output.txt
 
use strict;
use warnings;
 
use Bio::Tools::GFF;
use Bio::DB::Fasta;
 
my $gff3_file=$ARGV[0];  #gff3 input file
my $genome=$ARGV[1]; #genome sequence directory
my $gffio = Bio::Tools::GFF -> new(-file =>$gff3_file , -gff_version => 3);
my $db    = Bio::DB::Fasta  -> new($genome);
 
my $i=0;
my $first=0;
my $strand=0;
my $seq='';
while(my $feature = $gffio->next_feature()) {
# Sometimes, the gff3 file format is a little different with the standard format,
# So that the keys of the hashes are different.
# Use the following lines of masked code to see the keys of the hashes. Some Change may be needed.
#  $i++;
#  if ($i==2){
#    my @a=keys %{$feature};
#    print "@a\n";
#    my @b=keys %{$feature->{_location}};
#    print "@b\n";
#    print "$feature->{_primary_tag}\n";
#  }
  if($feature->{_primary_tag}=~/mrna/i){
    $first++;
    if($first==1){
      $seq='';
    }else{
      print_sequence($seq,80,$strand);
      $seq='';
    }
    print ">$feature->{_gsf_tag_hash}->{Name}->[0]|$feature->{_gsf_tag_hash}->{ID}->[0]\n";
  }elsif($feature->{_primary_tag}=~/cds/i){
    my $seq_temp=$db->seq($feature->{_gsf_seq_id}, $feature->{_location}->{_start}=>$feature->{_location}->{_end});
    if($feature->{_location}->{_strand}=~/-/){  # to know the strand
      $seq=$seq_temp.$seq;
      $strand=0;
    }else{
      $seq.=$seq_temp;
      $strand=1;
    }
  }
}
print_sequence($seq,80,$strand);
 
$gffio->close();
 
sub print_sequence {
  my($sequence, $length,$strand) = @_;
  if($strand==0){  # if the sequence is on the minus strand, get its reverse-complement counterpart
    $sequence=~tr/ACGTacgt/TGCAtgca/;
    $sequence=reverse $sequence;
  }
  for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
    print substr($sequence, $pos, $length),"\n";
  }
}

36. 如何得到scafftig

#!/usr/bin/perl -w
#
#Author: Ruan Jue <ruanjue@genomics.org.cn>
#
use warnings;
use strict;
my $min_length = 0;
my $name = '';
my $seq = '';
while(<>){
if(/^>(\S+)/){
&print_scafftig($name, $seq) if($seq);
$name = $1;
$seq = '';
} else {
chomp;
$seq .= $_;
}
}
&print_scafftig($name, $seq) if($seq);
1;
sub print_scafftig {
my $name = shift;
my $seq = shift;
my $id = 1;
while($seq=~/([ATGCatgc]+)/g){
my $s = $1;
next if(length($s) < $min_length);
print ">$name\_$id length=".length($s)."\n";
while($s=~/(.{1,60})/g){
print "$1\n";
}
$id ++;
}
}

 

应用:perl scafftig.pl ***.fasta > out.fasta

37 提取特定文本行的下一行

#!/usr/bin/perl -w

open F, "in.txt"; #输入文件

open OUT,"out.txt";#把结果写入F:盘下的out.txt文件中
@array=<F>;
$count=-1;
foreach (@array){
    $count++; #自己可以调节$count+1,+2,+3....来调节固定文本下的任意一行
    if(/固定文本/){$start=$count+1; print OUT"$array[$start]\n";}}

38 把fasta格式文件固定长度行变成1行

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n } ' test.fasta