Part 2: Dealing with text files / テキストファイルの取り扱い

Sequence objects / 列オブジェクト

Sequence objects are objects that contain elements which are referred to by indices: lists, arrays, text strings, etc. The elements of any sequence object seq can be obtained by indexing: seq[0], seq[1]. It is also possible to index from the end: seq[-1] is the last element of a sequence, seq[-2] the next-to-last etc.

列オブジェクトは、添字で参照される要素から成るオブジェクトです：リスト (list)、配列 (array)、テキスト文字列、などがあります。列オブジェクト seq の要素は、添字により得られます： seq[0], seq[1] などです。また、終端の方から添字付けすることも可能です：すなわち、seq[-1] は列の最後の要素、seq[-2] は最後から２番目の要素、といった具合です。

Example / 例：

text = 'abc'
print text[1]
print text[-1]

Subsequences can be extracted by slicing: seq[0:5] is a subsequence containing the first five elements, numbered 0, 1, 2, 3, and 4, but not element number 5. Negative indices are allowed: seq[1:-1] strips off the first and last element of a sequence.

部分列は、スライス (slicing) によって取り出せます： seq[0:5] は、最初の五つ（添字 0, 1, 2, 3, 4）の要素から成る部分列であり、添字 5 の要素は含みません。負の添字も使えます：seq[1:-1] は、列の最初と最後の要素を取り除いたものです。

Example / 例：

text = 'A somewhat longer string.'
print text[2:10]
print text[-7:-1]

The length of a sequence can be determined by len(seq).

列の長さは len(seq) によって得られます。
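
The following is a minimal sketch of how len relates to indices and slices. It assumes present-day Python 3 syntax, where print is a function, unlike in the examples of this text; the variable names are chosen for illustration only.

```python
text = 'A somewhat longer string.'
n = len(text)                      # number of characters in the string

last = text[n - 1]                 # the same element as text[-1]
first_five = text[0:5]             # elements 0, 1, 2, 3, 4
stripped = text[1:-1]              # without the first and last element

print(n)
print(last == text[-1])
print(len(first_five), len(stripped) == n - 2)
```

Valid indices run from 0 to len(seq)-1, which is why seq[len(seq)] is an error while seq[-1] is not.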

Lists / リスト

Lists are sequences that can contain arbitrary objects (numbers, strings, vectors, other lists, etc.):

リストは、任意のオブジェクト（数値、文字列、ベクトル、他のリストなど）を要素とすることのできる列です：

some_prime_numbers = [2, 3, 5, 7, 11, 13]
names = ['Smith', 'Jones']
a_mixed_list = [3, [2, 'b'], Vector(1, 1, 0)]

Elements and subsequences of a list can be changed by assigning a new value:

リストの要素や部分列は、新たに値を代入することによって変更可能です：

names[1] = 'Python'
some_prime_numbers[3:] = [17, 19, 23, 29]

A new element can be appended at the end with list.append(new_element). A list can be reversed with list.reverse() and sorted with list.sort().

list.append(new_element) によって、新しい要素をリストの最後に追加することができます。リストは、list.reverse() によって反転したり、list.sort() によってソートしたりできます。

Two lists can be concatenated like text strings: [0, 1] + ['a', 'b'] gives [0, 1, 'a', 'b'].

テキスト文字列の場合と同様に、二つのリストを結合できます：[0, 1] + ['a', 'b'] は [0, 1, 'a', 'b'] となります。

Lists can also be repeated like text strings: 3*[0] gives [0, 0, 0].

リストの反復も、テキスト文字列の場合と同様です：3*[0] は [0, 0, 0] を与えます。
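
The list operations described above can be tried together in a short sketch. It is written in Python 3 syntax (print as a function), which is an assumption that differs from the older syntax used in this text; the operations themselves are the ones named above.

```python
some_prime_numbers = [2, 3, 5, 7, 11, 13]
some_prime_numbers[3:] = [17, 19, 23, 29]   # replace the tail of the list
some_prime_numbers.append(31)               # add one element at the end
some_prime_numbers.reverse()                # reverse the list in place
print(some_prime_numbers)
some_prime_numbers.sort()                   # sort the list in place (ascending)
print(some_prime_numbers)

joined = [0, 1] + ['a', 'b']                # concatenation gives a new list
zeros = 3*[0]                               # repetition gives a new list
print(joined, zeros)
```

Note that reverse() and sort() change the list itself and return nothing, whereas + and * leave their operands untouched and produce a new list.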

Tuples / タプル

Tuples are much like lists, except that they cannot be changed in any way. Once created, a tuple will always have the same elements. They can therefore be used in situations where a modifiable sequence does not make sense (for example, as a key in a database).

タプルはリストと非常に良く似ていますが、変更ができないという点が異なります。タプルは一度生成されたら同じ要素を保ちます。したがって、変更可能な列を使っても意味のない場合に有用です（例えば、データベースのキー (key) として）。

Example / 例：

tuple_1 = (1, 2)
tuple_2 = ('a', 'b')
combined_tuple = tuple_1 + 2*tuple_2
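
A small sketch, assuming Python 3 syntax, of what "cannot be changed" means in practice and of a tuple used as a key. The dictionary grid_values and its contents are made up for this illustration; a dictionary plays the role of a very simple in-memory database here.

```python
tuple_1 = (1, 2)
tuple_2 = ('a', 'b')
combined_tuple = tuple_1 + 2*tuple_2
print(combined_tuple)

# Assigning to an element of a tuple raises an error.
try:
    combined_tuple[0] = 99
    changed = True
except TypeError:
    changed = False
print(changed)

# A tuple can serve as a key in a dictionary (hypothetical data).
grid_values = {(0, 0): 1.5, (0, 1): 2.5}
print(grid_values[(0, 1)])
```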

Loops over sequences / 列に関するループ

It is often necessary to repeat some operation for each element of a sequence. This is called a loop over the sequence.

列の各要素について、ある演算を繰り返す必要がしばしば生じます。これを、列に関するループと呼びます。

for prime_number in [2, 3, 5, 7]:
    square = prime_number**2
    print square

Note: The operations that are part of the loop are indicated by indentation.

注意：字下げ (indentation) によって、一連の演算行がそのループ内にあることが示されます。

Loops work with any sequence. Here is an example with text strings:

ループは任意の列に対して使えます。これは、テキスト文字列を用いた例です：

for vowel in 'aeiou':
    print 10*vowel

Loops over a range of numbers are just a special case:

ある数範囲に関するループは、一つの特別な場合に過ぎません：

from Numeric import sqrt

for i in range(10):
    print i, sqrt(i)

The function range(n) returns a list containing the first n integers, i.e. from 0 to n-1. The form range(i, j) returns all integers from i to j-1, and the form range(i, j, k) returns i, i+k, i+2*k, etc., stopping before j is reached.

range(n) という関数は、初めの n 個（すなわち 0 から n-1）の整数を要素とするリストを返します。range(i, j) は i から j-1 までの全ての整数を、range(i, j, k) は i, i+k, i+2*k, などを j に達する手前まで生成して返します。
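
The three forms of range can be compared side by side. This sketch assumes Python 3, where range produces its numbers on demand instead of returning a list, so list() is used to make the values visible; the variable names are illustrative.

```python
first_five = list(range(5))          # 0 up to 4
from_two = list(range(2, 6))         # 2 up to 5
by_threes = list(range(1, 10, 3))    # 1, 4, 7: steps of 3, stopping before 10
downwards = list(range(10, 0, -2))   # a negative step counts downwards

print(first_five)
print(from_two)
print(by_threes)
print(downwards)
```

In every form the end value itself is never included, which matches the behaviour of slices.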

Testing conditions / 条件判断

The most frequent conditions that are tested are equality and order:

最も多く用いられる条件判断は、一致と大小関係です：

equal a == b

not equal a != b

greater than a > b

less than a < b

greater than or equal a >= b

less than or equal a <= b

Several conditions can be combined with and and or, and negations are formed with not. The result of a condition is 1 for "true" and 0 for "false".

複数の条件を and と or で結合したり、not で否定したりできます。条件判断の結果は、"真" の場合は 1、"偽" の場合は 0 を返します。
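
A minimal sketch of combined conditions, assuming Python 3 syntax and made-up values for a and b. Recent Python versions display the results as True and False rather than 1 and 0; the two values still compare equal to 1 and 0.

```python
a = 3
b = 7

in_order = a < b and b < 10        # both conditions must hold
outside = a < 0 or a > 5           # at least one condition must hold
not_equal = not (a == b)           # same result as a != b

print(in_order, outside, not_equal)
print(in_order == 1, outside == 0)
```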

Conditions are most frequently used for decisions:

条件判断は、分岐において最も多く使われます：

if a == b:
    print "equal"
elif a > b:
    print "greater"
else:
    print "less"

There can be any number of elif branches (including none), and the else branch is optional.

elif 分岐は幾つあっても（無くても）良いし、else 分岐はあっても無くても構いません。

Conditions can also be used to control a loop:

条件判断は、ループの制御にも使えます：

x = 1.
while x > 1.e-2:
    print x
    x = x/2

Text files / テキストファイル

Text file objects are defined in the module Scientific.IO.TextFile.

テキストファイルは、Scientific.IO.TextFile モジュール内でオブジェクトとして定義されます。

Reading / 読み込み

Text files can be treated as sequences of lines, with the limitation that the lines must be read in sequence. The following program will print all lines of a file:

テキストファイルは、行 (lines) の列 (sequences) として扱うことができます。ただし、１行ずつ順に読まなくてはならないという制限があります。次のプログラムは、一つのファイル中の全行を印字します：

from Scientific.IO.TextFile import TextFile

for line in TextFile('some_file'):
    print line[:-1]

Why line[:-1]? The last character of each line is the new-line character, which we don't want to print (it would create blank lines between the lines of the file).

何故 line[:-1] なのでしょう？各行の最終文字は改行 (new-line) 文字なので、印字したくないからです（それを含めると、各行の間に空行が余分に入ります）。
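
One caveat is worth a sketch: if the last line of a file does not end with a new-line character, line[:-1] removes a real character. The example below assumes Python 3 and uses io.StringIO from the standard library as a stand-in for a file, so that it is self-contained; it does not use TextFile.

```python
import io

# io.StringIO stands in for a real text file; the last line has no new-line.
fake_file = io.StringIO('first line\nsecond line\nno newline at end')

sliced = []
stripped = []
for line in fake_file:
    sliced.append(line[:-1])             # always drops the last character
    stripped.append(line.rstrip('\n'))   # drops only a trailing new-line

print(sliced)
print(stripped)
```

For files whose last line is properly terminated, both variants give the same result.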

Text file objects can also deal with compressed files. Any file name ending in ".Z" or ".gz" will be assumed to refer to a compressed file. Programs will of course receive the uncompressed version of the data. You can even use a URL (Uniform Resource Locator, familiar from Web addresses) instead of a filename and thus read data directly via the internet!

テキストファイルオブジェクトは、圧縮ファイルを扱うこともできます。ファイル名が ".Z" または ".gz" で終るものは全て圧縮ファイルと見なされます。もちろん、プログラムが受け取るのは解凍されたデータです。ファイル名の代りに、URL (ウエブアドレスで良く知られている Uniform Resource Locator) を用いる、つまり、インターネット経由でデータを直接読み込むことさえも出来ます。

Writing / 書き出し

Text files can be opened for writing instead of reading:

テキストファイルは、読み込みだけでなく、書き出しのために開くこともできます：

from Scientific.IO.TextFile import TextFile

file = TextFile('a_compressed_file.gz', 'w')
file.write('The first line\n')
file.write('And the')
file.write(' second li')
file.write('ne')
file.write('\n')
file.close()

Files opened for writing should be closed at the end to make sure that all data is actually written to the file. At the end of a program, all open files will be closed automatically, but it is better not to rely on this.

書き出しに開いたファイルは、最後に close で閉じなくてはなりません。これによって、全てのデータがファイルに書き出されたことが確認されます。プログラム終了時には、開かれた全てのファイルが自動的に close されるのですが、これをあてにしない方が良いでしょう。

Note that automatic compression is available for writing too, but URLs are not, because most servers on the internet, for good reasons, do not permit write access!

書き出しにおいても、圧縮が自動的になされることに注意して下さい。ただし、URL の場合は駄目です。当然ながら、インターネット上の殆どのサーバーは、書き込みアクセスを許可しないからです！
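
Where the Scientific package is not available, a similar effect can be sketched with the gzip module of the standard library. This is not the TextFile implementation, only an illustration under the assumption of Python 3; the with statement used here closes the file automatically when the block ends, and the temporary directory is there only to keep the sketch from touching real files.

```python
import gzip
import os
import tempfile

# A temporary directory holds the example file.
directory = tempfile.mkdtemp()
path = os.path.join(directory, 'a_compressed_file.gz')

# 'wt' opens the compressed file for writing text.
with gzip.open(path, 'wt') as output:
    output.write('The first line\n')
    output.write('And the')
    output.write(' second li')
    output.write('ne')
    output.write('\n')

# 'rt' reads the text back; the program sees the uncompressed data.
with gzip.open(path, 'rt') as source:
    lines = [line.rstrip('\n') for line in source]

print(lines)
```

As in the example above, the pieces passed to write are simply appended; a new line starts only where a new-line character is written.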

Some useful string operations / いくつかの便利な文字列操作

The module string contains common string operations that are particularly useful for reading and writing text files. Only the most important ones will be described here; see the Python Library Reference for a complete list.

string モジュールには、よく使われる文字列操作が含まれており、テキストファイルを読み書きする際に特に便利です。ここでは、最も重要なものだけいくつか取り上げます；Python Library Reference で全体を一覧できますので、そちらを参照して下さい。

Getting data out of a string / 文字列からデータを取り出す

The function strip(string) removes leading and trailing white space from a string. The function split(string) returns a list of the words in the string, with "words" being anything between spaces. The word separator can be changed to any arbitrary string by using split(string, separator).

strip(string) 関数は、文字列から先頭と終末の空白を取り除きます。 split(string) 関数は、文字列中の単語をリストにして返します。ここで、"単語" とは空白の間にあるものを指します。split(string, separator) とすれば、任意の文字列を単語の区切り (separator) に用いることができます。

To extract numbers from strings, use the functions atoi(string) (returns an integer) and atof(string) (returns a real number).

文字列から数値を取り出すには、整数を返す atoi(string) と、実数を返す atof(string) を使います。

To find a specific text in a string, use find(string, text). It returns the first index at which text occurs in string, or -1 if it doesn't occur at all.

文字列中からある特定の文を探すには、find(string, text) を使います。これは、string 中で text が見つかった最初の添字を返します。見つからなかった場合の値は -1 です。
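
The same operations can be sketched in Python 3, under the assumption that the functions of the string module are replaced by methods of the string itself and that int() and float() play the roles of atoi and atof. The sample line is made up for the illustration.

```python
line = '  ATOM   17  12.5  \n'

cleaned = line.strip()             # no leading or trailing white space
words = cleaned.split()            # the words between runs of white space
serial = int(words[1])             # plays the role of atoi
value = float(words[2])            # plays the role of atof

fields = 'root:x:0:0'.split(':')   # an explicit separator
where = cleaned.find('17')         # first index at which '17' occurs
missing = cleaned.find('xyz')      # -1, since 'xyz' does not occur

print(words, serial, value)
print(fields)
print(where, missing)
```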

Example: The following program reads a file and prints the sum of all numbers in the second column.

例：次のプログラムは、一つのファイルを読み込み、その第２カラムにある全ての数値の和を印字します。

from Scientific.IO.TextFile import TextFile
import string

sum = 0.
for line in TextFile('data'):
    sum = sum + string.atof(string.split(line)[1])

print "The sum is: ", sum

Example: The following program prints the names of all user accounts on your computer:

例：次のプログラムは、あなたのコンピュータに登録されている全ての利用者名を印字します：

from Scientific.IO.TextFile import TextFile
from string import split

for line in TextFile('/etc/passwd'):
    print split(line, ':')[0]

Converting data into a string / データを文字列に変換する

Any Python object (numbers, strings, vectors, ...) can be turned into a string by writing it in inverse apostrophes:

任意の Python オブジェクト（数値、文字列、ベクトルなど）は、逆アポストロフィー中に挿入して書くことによって、文字列に変換できます：

from Scientific.Geometry import Vector

a = 42
b = 1./3.
c = Vector(0, 2, 1)

print `a` + ' ' + `b` + ' ' + `c`

This program prints "42 0.333333333333 Vector(0,2,1)".

このプログラムは、"42 0.333333333333 Vector(0,2,1)" と印字します。

Another way to convert anything into a string is the function str(data) (which is not in the module string, but part of the core language). The two methods do not always give the same result, although they do for numbers. In general, str(data) produces a "nice" representation of the value, whereas the inverse apostrophes return a representation that indicates not only the value, but also the type of the data. For example, if s is a string, then str(s) is the same as s, whereas `s` returns s enclosed in apostrophes to indicate that the data is a string. In practice, try both and use the one you like best.

文字列に変換する別の方法に、str(data) 関数があります。（これは、string モジュールに含まれているのではなく、言語の標準関数の一つです。）これら二つの方法は、必ずしも同じ結果を与えません。ただし、数値に関しては同じ結果になります。一般的に言って、str(data) は値を "綺麗に" 表示しますが、逆アポストロフィーは値だけでなくそのデータの型を示すような表示を返します。例えば、s を文字列とすると、str(s) は s と同じですが、`s` は s をアポストロフィーで囲んだものを返し、そのデータが文字列であることを明示します。実際には、両方を試してみて好きな方を使うのが良いでしょう。
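
The difference between the two conversions is easiest to see on a string. The sketch below assumes Python 3, where the inverse apostrophes are no longer part of the language; the built-in function repr(data) gives the same kind of representation.

```python
s = 'abc'
a = 42

plain = str(s)       # the text itself
quoted = repr(s)     # the text enclosed in apostrophes

print(plain)
print(quoted)
print(str(a) == repr(a))   # for an integer both give the same string
```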

The function join(words) takes a list of strings and returns the concatenation with words separated by a space. The last line of the preceding program could therefore simply be print string.join([`a`, `b`, `c`]). The word separator can again be changed to an arbitrary string.

関数 join(words) は、文字列のリストを受け取り、空白を区切りにして結合したものを返します。したがって、前のプログラムの最後の行は、単に print string.join([`a`, `b`, `c`]) とできます。ここでも、任意の文字列を単語の区切りに使うことができます。

The functions lower(string) and upper(string) convert a string to lower- or uppercase letters.

関数 lower(string) と upper(string) は、文字列を小文字あるいは大文字に変換します。

The function ljust(string, width) returns a string of at least width characters in which string is left-justified. The functions rjust and center work similarly, but place the supplied text at the end or in the center.

関数 ljust(string, width) は、少なくとも width 文字分の長さの幅に、文字列 string を左詰めにした文字列を返します。同じ様に、関数 rjust は右詰めに、center は中央に揃えます。
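
A short sketch of joining, case conversion and justification, assuming Python 3, where these operations are methods of strings and join is called on the separator. The words are made up for the illustration.

```python
words = ['42', 'is', 'the', 'answer']

sentence = ' '.join(words)          # words separated by a space
csv_line = ','.join(words)          # a different separator

shout = sentence.upper()
whisper = 'LOUD'.lower()

left = 'abc'.ljust(7)               # text at the start of 7 characters
right = 'abc'.rjust(7)              # text at the end
middle = 'abc'.center(7)            # text in the center

print(sentence)
print(csv_line)
print('[' + left + ']', '[' + right + ']', '[' + middle + ']')
```

If the text is longer than the requested width, the justification methods return it unchanged, which is why the width is only a minimum.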

Some useful functions not described here / ここでは取り上げなかった便利な関数について

Python has a very large collection of code for dealing with more or less specialized forms of text. It is impossible to describe them all here, or even list them. You can find all the information you need in the Python Library Reference.

Python には、もっと特殊なものからそうでないものまで、様々な形式のテキスト処理をするためのプログラムが膨大に集められています。ここでは、それらを全て記述することはもとより、一覧することさえ不可能です。必要な情報は全て、Python Library Reference から探し出すことができます。

First, there are many functions in the module string (http://www.python.org/doc/lib/module-string.html) that have not been described here.

まず第一に、string モジュールには、ここで取り上げなかった多くの関数があります。

An important set of functions deals with finding and changing data according to patterns, called regular expressions. These functions are located in the module re. They are very powerful, but the syntax of regular expressions (also used by Unix tools like grep and editors like vi and emacs) is a bit complicated.

正規表現と呼ばれる重要な一組の関数群は、あるパターンに従ってデータを検索したり変更したりします。これらの関数は、re モジュールにあります。これらは非常に強力ですが、正規表現の文法は（grep のような UNIX ツールや、vi、emacs といったエディターでも使われていますが）少し複雑です。
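
To give a first impression, here is a minimal sketch that finds and replaces all numbers in a line with the module re. It assumes Python 3 syntax; the sample line and the pattern are made up for the illustration, and the pattern handles only plain decimal numbers, not exponents.

```python
import re

line = 'x = 12.5, y = -3, z = 0.75'

# An optional minus sign, one or more digits, and an optional decimal part.
number = re.compile(r'-?\d+(?:\.\d+)?')

found = number.findall(line)                 # all matching pieces of text
values = [float(item) for item in found]     # converted to numbers
masked = number.sub('#', line)               # every match replaced by '#'

print(found)
print(values)
print(masked)
```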

The module htmllib contains functions to extract data from HTML files, which are typically used on the World-Wide Web. The module formatter provides a way to create HTML files.

htmllib モジュールには、HTML ファイルからデータを取り出す関数があり、通常 World-Wide Web において使用されます。 formatter モジュールは、HTML ファイルを作成する道具を提供します。

Fortran-formatted files / Fortran フォーマットのファイル

Fortran programs use text files that are emulations of punched card decks, and therefore use different formatting conventions. Items of data are identified by their position in the line rather than by separators like spaces. The layout of a line is defined by a format specification. For example, 2I4, 3F12.5 indicates two four-character fields containing integers, followed by three twelve-character fields containing real numbers with five decimal digits.

Fortran プログラムでは、印字カードの代用 (emulation) としてテキストファイルを用いることから、特異なフォーマットが使われます。各々のデータは、空白などの区切り文字によってではなく、行中の位置で指定されます。行の割付けは、フォーマット仕様によって定義されます。例えば、2I4, 3F12.5 は、４文字分の領域を割付けられた整数が二つ、その次に１２文字分の領域を割付けられた小数点以下５桁の実数が三つです。

The module Scientific.IO.FortranFormat takes care of converting between Python data objects and Fortran-formatted text strings. The first step is the creation of a format object, representing the Fortran format specification. The second step is creating a Fortran-formatted line from a list of data values (for output), or the inverse operation (for input).

Scientific.IO.FortranFormat モジュールは、Python のデータオブジェクトと Fortran フォーマットのテキスト文字列との間の変換を扱います。まず第一のステップで、Fortran のフォーマット仕様を表すフォーマットオブジェクトを生成します。第二ステップで、データ値のリストから Fortran フォーマットの行を生成します（出力の場合）。入力の場合には、その逆の操作をします。
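
The idea of position-based fields can be illustrated without the module. The sketch below is not how FortranFormat works internally; it assumes Python 3 and uses a hypothetical helper, read_fields, together with the % operator to imitate the specification 2I4, 3F12.5 from the example above.

```python
def read_fields(line, widths):
    """Cut a line into consecutive fields of the given widths."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width])
        start = start + width
    return fields

# Output: two integers in 4 characters each, then three real numbers in
# 12 characters each with 5 decimal digits.
line = '%4d%4d%12.5f%12.5f%12.5f' % (1, 20, 0.5, -1.25, 100.0)

# Input: the inverse operation, guided by the field widths alone.
pieces = read_fields(line, [4, 4, 12, 12, 12])
numbers = [int(pieces[0]), int(pieces[1])] + [float(p) for p in pieces[2:]]

print(repr(line))
print(numbers)
```

Because the fields are found by position, the numbers could even touch each other without any separating space and still be read correctly.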

The following example reads a PDB file and prints the name and position of each atom. Note that each line must be analyzed twice: the first time only the initial six characters are extracted, to identify the record type, and in the case of an atom definition the actual data is extracted using the specific format.

下の例は、PDB ファイルを読み込み、各原子の名前と位置を印字します。各行を２度ずつ解析しなければならない点に注意してください：最初は先頭の６文字だけを取り出して記録の種類を見分けます。それが原子の定義であった場合に、一定のフォーマットを用いて実際のデータを取り出します。

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from Scientific.Geometry import Vector

generic_format = FortranFormat('A6')

atom_format = FortranFormat('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,' +
                            '3X,3F8.3,2F6.2,7X,A4,2A2')
# Contents of the fields:
# record type, serial number, atom name, alternate location indicator,
# residue name, chain id, residue sequence number, insertion code,
# x coordinate, y coordinate, z coordinate, occupancy, temperature factor,
# segment id, element symbol, charge

# 各領域の内容：
# 記録種類、連番号、原子名、代替位置指標、残基名、鎖名、残基配列番号、挿入記号、
# ｘ座標、ｙ座標、ｚ座標、占有数、温度因子、
# 区分名、元素記号、電荷

for line in TextFile('protein.pdb'):
    record_type = FortranLine(line, generic_format)[0]
    if record_type == 'ATOM  ' or record_type == 'HETATM':
        data = FortranLine(line, atom_format)
        atom_name = data[2]
        position = Vector(data[8:11])
        print "Atom ", atom_name, " at position ", position

The next example shows how to write data in a Fortran-compatible way. The output file contains a sequence of numbers in the first column and their square roots in the second column.

次の例は、Fortran 形式でデータを書き出す方法を示します。出力ファイルには、第一列と第二列に各々一連の数値とその平方根が書き出されます。

from Scientific.IO.TextFile import TextFile
from Scientific.IO.FortranFormat import FortranFormat, FortranLine
from Numeric import sqrt

format = FortranFormat('2F12.5')

file = TextFile('sqrt.data', 'w')
for n in range(100):
    x = n/10.
    file.write(str(FortranLine([x, sqrt(x)], format)) + '\n')
file.close()

Exercises / 練習問題

1. Write a program that counts the number of lines and words in a file.
2. Write a program that reads a PDB file and counts the number of carbon atoms (i.e. the atoms whose name begins with 'C').
3. Write a program that converts a PDB file to the XYZ format used by XMol (and some other programs). The XYZ format is very simple: The first line contains the number of atoms, the second line a comment (use whatever you like), and the remaining lines contain one atom each, with four entries: first the element symbol (e.g. 'C' for carbon), and then the coordinates x, y, and z. The entries are separated by one or more spaces. This is an example for a single water molecule:

1. 一つのファイル中の行数と語数を数えて印字するプログラムを書きなさい。
2. PDB ファイルを読み込んで炭素原子（名前が 'C' で始まる原子）の数を印字するプログラムを書きなさい。
3. PDB ファイルを XMol（やその他のプログラム）で使われる XYZ フォーマットに変換するプログラムを書きなさい。XYZ フォーマットはとても単純です：１行目は原子数、２行目はコメント（何でも好きな用途に使って下さい）、残りは１行が１原子を表し各々四つの項目からなります：最初が元素記号（例えば炭素ならば 'C'）、その次にｘ、ｙ、ｚ座標です。これらの項目は一つ以上の空白で区切ります。下は、一個の水分子の例です：
```
3
One water molecule
O 0.  0.     0.
H 0.  0.957  0.
H 0. -0.24  -0.927
```
Note that the data does not have to be lined up in columns.
注意：データは列（縦）方向に整列している必要はありません。
