Part 2 の練習問題解答

ファイル中の行数と語数

from Scientific.IO.TextFile import TextFile
from string import split

lines = 0
words = 0
for line in TextFile('some_file'):
    lines = lines + 1
    words = words + len(split(line))

print lines, " lines, ", words, " words."

UNIX コマンドの wc の結果と比較することによって、このプログラムをテストすることができます。

PDB ファイル中の炭素原子数

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from string import strip

generic_format = FortranFormat('A6')

atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' +
                            '3X,3F8.3,2F6.2,7X,A4,2A2')

carbons = 0
for line in TextFile('protein.pdb'):
    record_type = FortranLine(line, generic_format)[0]
    if record_type == 'ATOM  ' or record_type == 'HETATM':
        data = FortranLine(line, atom_format)
        atom_name = strip(data[2])
	if atom_name[0] == 'C':
	    carbons = carbons + 1

print carbons, " carbon atoms"

原子名の前後の空白を取り除くために strip を使っていることに注意。正しくフォーマットされた PDB ファイルには、このような空白があります。（しかし、大抵のプログラムは、これを含まない PDB ファイルを生成します。）

このプログラムをテストするのは、前ほど一筋縄ではありません。一つの方法は、α炭素だけを数える様に少し変更を加えることです。そうすれば、ファイル中のアミノ酸残基数と同じになるはずです。

PDB から XYZ への変換

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from string import join, strip

generic_format = FortranFormat('A6')

atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' +
                            '3X,3F8.3,2F6.2,7X,A4,2A2')

atom_lines = []

for line in TextFile('protein.pdb'):
    record_type = FortranLine(line, generic_format)[0]
    if record_type == 'ATOM  ' or record_type == 'HETATM':
        data = FortranLine(line, atom_format)
        atom_name = strip(data[2])
	element = atom_name[0]
	atom_lines.append(join([element, str(data[8]),
				str(data[9]), str(data[10])]) + '\n')

output_file = TextFile('protein.xyz', 'w')
output_file.write(`len(atom_lines)` + '\n\n')
output_file.write(join(atom_lines, ''))
output_file.close()

原子数を得るためにファイルを２度読むことを避ける方法は幾つかあります。しかし、いずれの主旨も、情報を一つのリストに収集しておき、その後の段階で書き出すことです。上の例では、出力ファイルのテキスト行からなるリストが該当するものですが、それはまた、入力行を格納したり、あるいは何らかの中間データとして用いることもできます。

出力行からなるリストを用いる利点は、出力ファイルを生成するために再度ループを使う必要がなくなることです；すなわち、書き出し操作と join の呼び出しを一回ずつで十分となります。実は、この組合せは頻発するので、さらに効率の良い記法が導入されています：output_file.write(join(atom_lines, '')) の代りに、output_file.writelines(atom_lines) としても良いのです。