gencollagen: generate idealized segments of collagen

The gencollagen command reads the input file (or standard input if omitted), which specifies the amino acid sequence and helical parameters, and generates a Protein Data Bank (PDB) file containing the idealized atomic coordinates for a segment of collagen.

Command Line Flags

gencollagen [ -ehtv ] [ -a amino-acid-file ] [ -l library-file ] [ -o output-file ] [ input-file ]

-a amino-acid-file: Use a different amino acid data file than the standard one. This is usually used when the input sequence contains non-standard amino acids. The format of the amino acid data file is described below.
-e: Echo user input to standard output. All values, whether user-specified or default, are printed. This is a good way to find out what actual default values the program is using, regardless of what this manual page says.
-h: Do not convert prolines in position Y to hydroxyprolines. By default, prolines specified in the Y position of GXY triplets are automatically converted to hydroxyprolines. This option disables the conversion.
-l library-file: Use a different rotamer library than the standard one. This is usually only used if there is a newer version of the Dunbrack and Karplus library available. The format of the library is described in RotamerLibrary(3).
-o output-file: Send atomic coordinate data to output-file. If this flag is not supplied, the data is sent to standard output.
-t: Print timing data to standard error. The user, system and wall-clock times are printed for each phase of the program. Specifying -t automatically sets the -v flag as well.
-v: Print a message at the start of each phase of the program.

Description

The gencollagen command reads the input file and generates the idealized atomic coordinates for a segment of triple-helical collagen. The amino acid backbone information was obtained from Cornell et al. and the AMBER `parm94.dat' force field parameter file [21 October 1994]. The amino acid internal coordinates were obtained from AMBER94 geometry optimizations of the molecules X-A-Z, where X is the protecting group ACE, Z is the group NME and A is the amino acid in question. The minimizations were performed using a constant dielectric. The default values for the helical and symmetry parameters are from Miller, Nemethy and Scheraga. The program is divided into several phases, which are described below.

Reading the amino acid data file: The amino acid data file provides standard bond length, bond angle and dihedral angle data. Later, Cartesian coordinates will be computed from these internal coordinates. The format of the data file is described below.
Reading the input file: The input file provides data specific to the segment of collagen to be generated. This information includes the amino acid sequences of the chains, the chain arrangement, individual helical parameters, triple-helical symmetry parameters, and how side-chain atom coordinates are computed. The format of the input file is described below.
Creating the backbone atoms for a single chain: The backbone atoms, N, CA, C and O, for the first chain of collagen are computed in an arbitrary coordinate system by assigning coordinates to the atoms in the first amino acid, and then making sure that the coordinates of other amino acids satisfy the bond length, bond angle and dihedral angle constraints.
Atom N of the first residue is arbitrarily placed at (0, 0, 0); atom CA is placed at (0, 0, z), where z is the bond length between N and CA; atom C is placed at (0, y, z2), with y and z2 chosen such that the bond length and bond angle constraints are satisfied. The coordinates of all other backbone atoms may be derived from bond lengths, bond angles and phi and psi, by transforming from internal coordinates to Cartesian coordinates.
Computing the triple-helical axis based on the single chain: The triple-helical axis is computed from the coordinates of the CA atoms of the first, fourth, seventh and tenth amino acids (i.e., the CA atoms from the first amino acid of the first four repeat units). The mathematics for computing the helical axis derive from Sugeta and Miyazawa. The coordinates of the first chain are then transformed such that the triple-helical axis coincides with the Z axis.
Creating the backbone atoms for the other two chains from symmetry: The two other chains of collagen are related to the first chain by screw symmetry. Thus, the coordinates of the backbone atoms in the second and third chain may be computed by rotating and translating the first chain with respect to the triple-helical axis.
Ordering the amino acids for side-chain addition: Side-chains may be added to amino acids in one of three orders: from C-terminus to N-terminus, from N-terminus to C-terminus, or in random order. This phase of the program simply sorts the amino acids into the selected order.
Reading the rotamer library: The side-chain atoms may be added in a number of ways. Ponder and Richards showed that each amino acid has its own preferred chi conformation. Dunbrack and Karplus showed further that the preferred chi conformation depends on the values of phi and psi. The rotamer library from Roland Dunbrack includes both backbone-independent (Ponder and Richards) and backbone-dependent (Dunbrack and Karplus) data. By using this library, gencollagen can start with reasonable chi conformations for side-chains.
Adding side-chain atoms: Side-chain atoms are added to amino acids using the chi conformation data from the rotamer library. If a newly added atom is closer than one angstrom to an existing atom, an alternative position (if available) is tried. If no viable position is found, then a warning is reported and the last position computed is used.
Writing the output PDB file: The atomic coordinates of the collagen segment are output in Protein Data Bank format. Only ATOM, HETATM and TER records are emitted.

Input File Format

To generate a segment of collagen, gencollagen needs information such as the order of the three chains, their amino acid sequences, their helical parameters, and how side-chain atoms will be added. This information is specified in the input file via a series of statements, in arbitrary order. The supported statements and their default values are listed below. All statements may span multiple lines and must be terminated by a semicolon.

sequence name seq ...: Define the amino acid sequence for a single chain. Name is the name for the sequence and will be used as part of the chain statement (see below). Seq and all subsequent strings specify the amino acid sequence using one-letter codes. There is no difference between specifying a sequence using a single string or multiple strings. At least one sequence statement must be present in the input file and there is no default value.
chain seq1 seq2 seq3: Define the three chains that form the collagen. seq1, seq2 and seq3 are names of sequences (see above), and they need not be unique (i.e., two or all three names may specify the same sequence). There must be exactly one chain statement in the input file and there is no default value.
set phi[position] = angle
set psi[position] = angle
set omega[position] = angle: Define the helical parameters for each chain. position must be one of G, X or Y, corresponding to the GXY triplet positions. angle is specified in degrees. The default values are (phi_G, psi_G, omega_G, phi_X, psi_X, omega_X, phi_Y, psi_Y, omega_Y) = (-74°, 170°, 180°, -75°, 168°, 180°, -75°, 153°, 180°).
sidechain order type: Define the order in which amino acids will be assigned side-chains. If type is CtoN, then amino acids nearer the C-terminus will have their side-chain atoms added prior to those farther from the C-terminus. If type is NtoC, then amino acids nearer the N-terminus will have their side-chain atoms added prior to those farther from the N-terminus. If type is random, then a random order is used. The default order is random.
sidechain method type: Define how values for chi₁ and chi₂ are determined. If type is PandR, gencollagen will use the most probable backbone-independent values, as described by Ponder and Richards. If type is DandK, gencollagen will use the most probable backbone-dependent values, as described by Dunbrack and Karplus. If type is random, gencollagen will select a random rotamer from the backbone-independent data in the Dunbrack and Karplus library. The default method is random.
chi n min-angle max-angle: Define the minimum and maximum values that chi angles may take. n must be one of 3, 4 or 5. min-angle and max-angle are in degrees and have default values of 120° and 240° respectively.
retries number-of-retries: Define the number of times gencollagen should try to add an atom whose position depends on a random dihedral angle (e.g., chi₃). The default number of retries is 3.
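How the chi and retries statements might interact can be sketched as follows. This is a guess at the behavior, not the program's actual code: it assumes the random dihedral is drawn uniformly between min-angle and max-angle, that the retry count is the total number of attempts, and that viability is decided by a caller-supplied test (for example, the one-angstrom proximity check described earlier). The name pick_random_dihedral and its parameters are hypothetical.

```python
import random

def pick_random_dihedral(is_viable, min_angle=120.0, max_angle=240.0,
                         retries=3, rng=None):
    """Try up to `retries` random angles in [min_angle, max_angle].

    Returns (angle, ok). If no attempt passes is_viable, the last angle
    tried is returned with ok=False, mirroring the documented fallback of
    warning and keeping the last position computed.
    """
    rng = rng or random.Random()
    angle = None
    for _ in range(max(1, retries)):
        angle = rng.uniform(min_angle, max_angle)   # assumed uniform sampling
        if is_viable(angle):
            return angle, True
    return angle, False

# Example: accept only angles in the upper half of the allowed range.
angle, ok = pick_random_dihedral(lambda a: a >= 180.0, rng=random.Random(0))
if not ok:
    print("warning: no viable dihedral found; using last value", angle)
```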

A typical (and minimal) input file looks like:

(1)	chain a1, a1, a1;
(2)	sequence a1 GPZ GLA GPZ GES GRE GAZ GAE GSZ GRD GSZ;

Line 1 specifies that all three chains of the collagen segment will have the same sequence, which is given by a1. Line 2 specifies the actual sequence of type a1 chains, using amino acid one-letter codes. Note that whitespace (spaces, tabs and newlines) may be interspersed with the one-letter codes to improve readability.
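Before running gencollagen it can be handy to sanity-check a sequence outside the program. The Python sketch below joins one or more sequence strings, ignores whitespace, and splits the result into GXY triplets. It assumes that every chain starts at a G position and is a whole number of triplets long; the manual does not say whether gencollagen itself enforces this, and the function name split_triplets is hypothetical.

```python
def split_triplets(*strings):
    """Join sequence strings, drop whitespace, and return a list of triplets."""
    residues = "".join("".join(s.split()) for s in strings)
    if len(residues) % 3 != 0:
        raise ValueError("sequence length %d is not a multiple of 3"
                         % len(residues))
    triplets = [residues[i:i + 3] for i in range(0, len(residues), 3)]
    for index, triplet in enumerate(triplets, start=1):
        if triplet[0] != "G":
            raise ValueError("triplet %d (%s) does not start with G"
                             % (index, triplet))
    return triplets

# The a1 sequence from the example above, given as two strings.
a1 = split_triplets("GPZ GLA GPZ GES GRE", "GAZ GAE GSZ GRD GSZ")
print(len(a1), "triplets:", " ".join(a1))
```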

Amino Acid Data File Format

The amino acid data file describes bond lengths, bond angles and dihedral angles of amino acids. The backbone atoms, N, CA, C and O, are treated as identical in all amino acids, while the side-chain atoms are specific for individual types of amino acids.

The data file is divided into three sections: the first lists the bond lengths between backbone atoms; the second lists the bond angles among backbone atoms; and the third lists side-chain atoms and their relationship to other atoms. The beginning of the default data file looks like the following (the line numbers in parentheses do not appear in the file):

(1)	Backbone Lengths
(2)	N	CA	1.47
(3)	CA	C	1.53
(4)	C	N	1.32
(5)	C	O	1.24
(6)	Backbone Angles
(7)	N	CA	C	110
(8)	CA	C	N	118
(9)	CA	C	O	119
(10)	C	N	CA	126
(11)	AminoAcid ARG R
(12)	CB      CA      N       -C      1.525   111.1   phi - 120
(13)	CG      CB      CA      N       1.525   109.47  chi1
(14)	CD      CG      CB      CA      1.525   109.47  chi2
(15)	NE      CD      CG      CB      1.48    111.0   chi3
(16)	CZ      NE      CD      CG      1.33    123.0   chi4
(17)	NH1     CZ      NE      CD      1.33    122.0   chi5
(18)	NH2     CZ      NE      CD      1.33    118.0   chi5 + 180

The backbone bond length section runs from line 1 through line 5. Line 1 is the required title of the section. The subsequent lines consist of the names of two atoms and the distance (in angstroms) between them.

The backbone bond angle section runs from line 6 through 10. Line 6 is the required title of the section. The subsequent lines consist of the names of three atoms and the angle (in degrees) formed by them, with the second atom as the vertex.
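A reader for the two backbone sections might look like the Python sketch below. It is an illustration based only on the layout shown above, assuming whitespace-separated fields and section titles spelled exactly as in the listing; the function name parse_backbone is hypothetical and the real parser may be stricter or more lenient.

```python
def parse_backbone(text):
    """Return (lengths, angles) dictionaries keyed by tuples of atom names."""
    lengths, angles = {}, {}
    section = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue                          # skip blank lines
        if fields == ["Backbone", "Lengths"]:
            section = "lengths"
        elif fields == ["Backbone", "Angles"]:
            section = "angles"
        elif fields[0] == "AminoAcid":
            break                             # side-chain sections start here
        elif section == "lengths":
            a, b, value = fields
            lengths[(a, b)] = float(value)    # angstroms
        elif section == "angles":
            a, b, c, value = fields
            angles[(a, b, c)] = float(value)  # degrees, b is the vertex
        else:
            raise ValueError("data before first section title: %r" % raw)
    return lengths, angles

sample = """Backbone Lengths
N   CA  1.47
CA  C   1.53
C   N   1.32
C   O   1.24
Backbone Angles
N   CA  C   110
CA  C   N   118
CA  C   O   119
C   N   CA  126
AminoAcid ARG R
"""
lengths, angles = parse_backbone(sample)
```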

The side-chain section actually consists of multiple sub-sections, with each sub-section describing one type of amino acid. The side-chain description of arginine runs from line 11 through 18. Line 11 is the required title, and gives both the three-letter and one-letter code for the amino acid, ARG and R respectively in this case. The subsequent lines consist of the names of four atoms, followed by the bond length between the first two atoms, the bond angle among the first three (again, with the second atom as the vertex), and the dihedral angle formed by all four atoms. The name of an atom may be preceded by a dash, as is the case with C on line 12; the dash indicates that the name refers to an atom in the preceding amino acid in the sequence, rather than the same amino acid. The bond length is a real number denoting the distance in angstroms. The bond angle is a real number denoting the angle in degrees. The dihedral angle must have one of the following four forms:

symbolic-angle + constant
symbolic-angle - constant
symbolic-angle
constant

Symbolic-angle may be one of phi, psi, chi1, chi2, chi3, chi4 or chi5. Constant is a real number denoting an angle in degrees. The actual value for the dihedral is computed at run time when the symbolic angle is replaced by values fetched from a rotamer library or generated randomly.
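The four dihedral forms can be evaluated with a few lines of Python. The sketch below splits a side-chain line into its fields and resolves the dihedral expression against a table of angle values. It assumes the operator is separated from its operands by whitespace, as in the sample file ("phi - 120", "chi5 + 180"); the names parse_sidechain_line and evaluate_dihedral are hypothetical and do not come from the gencollagen source.

```python
SYMBOLIC_ANGLES = {"phi", "psi", "chi1", "chi2", "chi3", "chi4", "chi5"}

def parse_sidechain_line(line):
    """Split a side-chain line into (atoms, length, angle, dihedral_tokens).

    Each atom is a (name, in_previous_residue) pair; a leading dash marks
    an atom that belongs to the preceding amino acid.
    """
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected 4 atoms, length, angle and dihedral: %r" % line)
    atoms = [(name.lstrip("-"), name.startswith("-")) for name in fields[:4]]
    return atoms, float(fields[4]), float(fields[5]), fields[6:]

def evaluate_dihedral(tokens, angles):
    """Resolve one of the four dihedral forms to a value in degrees."""
    if len(tokens) == 1:
        token = tokens[0]
        return angles[token] if token in SYMBOLIC_ANGLES else float(token)
    if (len(tokens) == 3 and tokens[0] in SYMBOLIC_ANGLES
            and tokens[1] in ("+", "-")):
        sign = 1.0 if tokens[1] == "+" else -1.0
        return angles[tokens[0]] + sign * float(tokens[2])
    raise ValueError("unrecognized dihedral expression: %r" % " ".join(tokens))

# Line 12 of the listing, evaluated with the default phi for position G.
atoms, length, angle, dihedral = parse_sidechain_line(
    "CB      CA      N       -C      1.525   111.1   phi - 120")
value = evaluate_dihedral(dihedral, {"phi": -74.0})
```

The result is not wrapped into any particular range here; whether the program normalizes angles is not stated in this manual.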

Files

/usr/local/otf/lib/gencollagen.aa default amino acid data file

References

W.D. Cornell, P. Cieplak, C.I. Bayly, I.R. Gould, K.M. Merz Jr., D.M. Ferguson, D.C. Spellmeyer, T. Fox, J.W. Caldwell and P.A. Kollman, ``A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules,'' JACS (1995), 117, 5179-5197.

M.H. Miller, G. Nemethy and H.A. Scheraga, ``Calculations of the Structures of Collagen Models. Role of Interchain Interactions in Determining the Triple-Helical Coiled-Coil Conformation. 2. Poly(glycyl-prolyl-hydroxyprolyl),'' Macromolecules (1980), 13, 470-478.

H. Sugeta and T. Miyazawa, ``General Method for Calculating Helical Parameters of Polymer Chains from Bond Lengths, Bond Angles, and Internal-Rotation Angles,'' Biopolymers (1967), 5, 673-679.

J.W. Ponder and F.M. Richards, ``Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes,'' J. Mol. Biol. (1987), 193, 775-791.

R.L. Dunbrack Jr. and M. Karplus, ``Backbone-dependent Rotamer Library for Proteins: Application to Side-chain Prediction,'' J. Mol. Biol. (1993), 230, 543-574.

Conrad Huang, UCSF Computer Graphics Laboratory