Solutions to the exercises of part 2 / Part 2 の練習問題解答

Counting the lines and words in a file / ファイル中の行数と語数

from Scientific.IO.TextFile import TextFile
from string import split

lines = 0
words = 0
for line in TextFile('some_file'):
    lines = lines + 1
    words = words + len(split(line))

print lines, " lines, ", words, " words."

This program can be tested by running the Unix utility wc for comparison.

UNIX コマンドの wc の結果と比較することによって、このプログラムをテストすることができます。

Counting the carbon atoms in a PDB file / PDB ファイル中の炭素原子数

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from string import strip

generic_format = FortranFormat('A6')

atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' +
                            '3X,3F8.3,2F6.2,7X,A4,2A2')

carbons = 0
for line in TextFile('protein.pdb'):
    record_type = FortranLine(line, generic_format)[0]
    if record_type == 'ATOM  ' or record_type == 'HETATM':
        data = FortranLine(line, atom_format)
        atom_name = strip(data[2])
	if atom_name[0] == 'C':
	    carbons = carbons + 1

print carbons, " carbon atoms"

Note the use of strip to remove the spaces around the atom name that are present in a properly formatted PDB file (but not in the PDB files produced by most programs).

原子名の前後の空白を取り除くために strip を使っていることに注意。正しくフォーマットされた PDB ファイルには、このような空白があります。（しかし、大抵のプログラムは、これを含まない PDB ファイルを生成します。）

Testing this program is less straightforward. One possibility is to apply a slight change to make it count only C-alpha atoms. That must give the number of residues in the file.

このプログラムをテストするのは、前ほど一筋縄ではありません。一つの方法は、α炭素だけを数える様に少し変更を加えることです。そうすれば、ファイル中のアミノ酸残基数と同じになるはずです。

Conversion from PDB to XYZ / PDB から XYZ への変換

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from string import join, strip

generic_format = FortranFormat('A6')

atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' +
                            '3X,3F8.3,2F6.2,7X,A4,2A2')

atom_lines = []

for line in TextFile('protein.pdb'):
    record_type = FortranLine(line, generic_format)[0]
    if record_type == 'ATOM  ' or record_type == 'HETATM':
        data = FortranLine(line, atom_format)
        atom_name = strip(data[2])
	element = atom_name[0]
	atom_lines.append(join([element, str(data[8]),
				str(data[9]), str(data[10])]) + '\n')

output_file = TextFile('protein.xyz', 'w')
output_file.write(`len(atom_lines)` + '\n\n')
output_file.write(join(atom_lines, ''))
output_file.close()

There are several ways to avoid reading the file twice for obtaining the number of atom, but the general idea is to collect information into a list and then writing it out in a separate step. In the example above the list contains the text lines of the output files, but it could also be used to store the input lines or some intermediate data representation.

原子数を得るためにファイルを２度読むことを避ける方法は幾つかあります。しかし、いずれの主旨も、情報を一つのリストに収集しておき、その後の段階で書き出すことです。上の例では、出力ファイルのテキスト行からなるリストが該当するものですが、それはまた、入力行を格納したり、あるいは何らかの中間データとして用いることもできます。

Using the output lines in the list has the advantage of not requiring a second loop to generate the output file; a single write operation plus a call to join is sufficient. In fact, this combination occurs so frequently that a shorthand was introduced which is even more efficient: instead of output_file.write(join(atom_lines, '')) you can write output_file.writelines(atom_lines).

出力行からなるリストを用いる利点は、出力ファイルを生成するために再度ループを使う必要がなくなることです；すなわち、書き出し操作と join の呼び出しを一回ずつで十分となります。実は、この組合せは頻発するので、さらに効率の良い記法が導入されています：output_file.write(join(atom_lines, '')) の代りに、output_file.writelines(atom_lines) としても良いのです。

Table of Contents / 目次