Changes between Initial Version and Version 1 of mmcif-issues


Timestamp:
Mar 10, 2016, 12:21:49 PM (10 years ago)
Author:
gregc
     1= DRAFT -- Practical mmCIF Issues -- DRAFT =
     2(And how to improve mmCIF's utility)
     3
     4=== Contents (in no particular order yet) ===
     5
     6  * [[#back|Background]]
     7  * [[#suffix|No unique file suffix]]
     8  * [[#conn|Missing connectivity]]
     9  * [[#case|Mixed-case keywords and data names]]
     10  * [[#fixed|Optional PDB Styling]]
     11  * [[#metal|Metal coordination bonds]]
     12  * [[#water|Nonunique Waters]]
     13  * [[#size|Different sized models]]
     14  * [[#standard|mmCIF doesn't follow CIF standard]]
     15  * [[#refs|References]]
     16
     17== Background == [=#back]
     18
     19The [[http://mmcif.wwpdb.org/|mmCIF]] file format is used by the [[http://www.wwpdb.org/|Worldwide Protein Data Bank (wwPDB)]] consortium to share deposited molecular data with the global community. For the wwPDB, the mmCIF file format is the new and improved version of the venerable [[http://www.wwpdb.org/documentation/file-format|PDB file format]]. With the new format, there is the expectation that the PDB file format's deficiencies will be addressed and fixed.
     20
     21For us, as authors of molecular visualization software, the PDB file format's main deficiencies were:
     22
     23  1. limited to 99999 atoms
     24  1. required lots of heuristics to reconstruct the molecule
     25  1. required lots of custom code to account for variations from non-wwPDB sources
     26
     27Other non-wwPDB software may have other concerns.
     28
     29The mmCIF file format cleanly fixes the first deficiency: larger molecules and molecular assemblies are easy to represent. However, readers still depend heavily on heuristics to reconstruct the atomic structures. The custom code issue is partially addressed by making the mmCIF file format extensible, but other software still generates files that need additional heuristics to understand.
     30
     31Many problems with mmCIF stem from the lack of documentation about the protocols needed to reconstruct data that is left unspecified in mmCIF files. The response has been that you just need to apply chemical knowledge. That means that every application that uses mmCIF files needs to recode solutions to the same problems. This is especially unfortunate when the problem can be fixed with a small amount of additional data in the file.
     32
     33We have also found that the performance of reading mmCIF files could be drastically improved if some conventions were required.
     34
     35The mmCIF format is essentially a normalized relational database, with the label_* data values acting as the database table keys. When mmCIF writers ignore the database semantics, it creates problems for mmCIF readers.
     36
     37== No Unique File Suffix ==
     38
     39mmCIF files suffer from identity confusion. The wwPDB distributes mmCIF files whose filenames have a '''.cif''' suffix. That suffix is shared by many different file types that use the [[http://www.iucr.org/resources/cif/|CIF]] file format: for example, small molecule structure factors, image data, DDL1 and DDL2 dictionaries, macromolecular data, ''etc.'' Therefore, applications depend on the user to tell them what type of CIF file they are reading.
     40
     41The file contents are supposed to conform to a [[http://www.iucr.org/resources/cif/dictionaries| dictionary]], but most CIF files, including mmCIF files not from the wwPDB, do not mention which dictionary is the appropriate one. The dictionary is needed to validate the file, so naming it should be required in every CIF file. Knowing which dictionary was used also identifies the file type. Conversely, knowing the file type would limit the set of dictionaries that the file might conform to.
     42
     43Normally, different file types have different suffixes, so the computer's operating system can help the user and show a unique icon for the file type. That is not possible for '''.cif''' files.
     44
     45Usually there is a secondary way to identify files of a particular type: the file header, ''e.g.'', '''`###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0`''' (for CBF/imgCIF files). There isn't one for mmCIF. CIF files have an optional '''`#\#CIF_1.1`''' header, but that only tells you what the file format is, not the file type.
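
To illustrate how little a header check can tell a reader, here is a minimal sketch in Python. The function name `sniff_cif_header` and the returned labels are hypothetical, chosen only for this example, and the CIF magic comment is matched by its `#\#CIF` prefix alone.

```python
def sniff_cif_header(first_line: str) -> str:
    """Classify a file by its first line only (hypothetical helper)."""
    line = first_line.rstrip("\r\n")
    if line.startswith("###_CRYSTALLOGRAPHIC_BINARY_FILE"):
        # CBF/imgCIF files announce their file type in the header.
        return "CBF/imgCIF"
    if line.startswith("#\\#CIF"):
        # The optional CIF header names the format, not the file type.
        return "CIF (type unknown)"
    # Most mmCIF files start directly with a data_ block, so nothing is learned.
    return "unknown"
```

Even in the best case the sketch can only answer "some kind of CIF", which is why the application still has to ask the user.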
     46
     47=== Proposed fixes: ===
     48
     49The wwPDB should lead by example and:
     50
     51  * Adopt the '''.mmcif''' file suffix for all mmCIF files served by the members of the wwPDB.
     52  * Require that every mmCIF file have the audit_conform category with the dictionary information (this really should have been a CIF requirement).
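
If the audit_conform category were always present, identifying the file type could be as simple as the following sketch. It is not a CIF parser: it assumes the category is written as plain `name value` pairs rather than in a `loop_`, and the function name `find_dictionary_name` and the version number in the sample are made up for illustration.

```python
from typing import Optional


def find_dictionary_name(cif_text: str) -> Optional[str]:
    """Return the value of _audit_conform.dict_name, or None if it is absent."""
    for line in cif_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].lower() == "_audit_conform.dict_name":
            return fields[1]
    return None


# Illustrative file fragment; the dict_version value is a placeholder.
sample = """data_example
_audit_conform.dict_name     mmcif_pdbx.dic
_audit_conform.dict_version  5.0
"""

dictionary = find_dictionary_name(sample)
```

A reader could then map the dictionary name to a file type, and fall back to asking the user only when the function returns `None`.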
     53
     54== Missing Connectivity ==
     55
     56A major part of any molecular visualization application is showing the connectivity of the atoms, ''i.e.'', the bonds.
     57
     58Most bonds are missing in an mmCIF file. The connectivity is both (a) provided separately in a monolithic chemical component file, and (b) implied for polymers of amino and nucleic acids by the sequence given in the file. All other bonds should be explicitly given.
     59
     60101397 mmCIF files on 17 July 2014
     61
     62||= component used in at least this many mmCIF files =||= # of component templates found =||= % of mmCIF files =||= zipped disk space =||
     63|| 100 || 184 || 62.47% || 388K ||
     64|| 90 || 202 || 63.23% || 448K ||
     65|| 80 || 219 || 64.08% || 491K ||
     66|| 70 || 236 || 64.74% || 525K ||
     67|| 60 || 263 || 65.63% || 581K ||
     68|| 50 || 299 || 66.74% || 682K ||
     69|| 40 || 359 || 68.34% || 846K ||
     70|| 30 || 434 || 69.79% || 1.1M ||
     71|| 20 || 595 || 72.02% || 1.5M ||
     72|| 10 || 1040 || 76.30% || 2.7M ||
     73|| 2 || 5243 || 88.32% || 15M ||
     74|| 1 || 17339 || 100.00% || 52M ||
     75
     76From the above table, all of the chemical component templates together are about 52M zipped. mmCIF readers that want to correctly read all current mmCIF files need all of the templates. But new templates are added frequently, so applications that do not want to be frequently updated with the current templates need to use a web service to fetch them, with all of the problems that using a web service entails: the Internet has to be available, the template has to be up-to-date and correct, and the user must not mind leaking what they are working on.
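
The trade-off described above can be sketched as a template lookup that prefers bundled templates and only falls back to the network when it must. Everything here is hypothetical: the class name `TemplateCache`, its methods, and the `fetch` callback, which stands in for whatever web service an application chooses and is deliberately not tied to any real URL.

```python
from typing import Callable, Dict, Optional


class TemplateCache:
    """Look up chemical component templates, bundled ones first (a sketch)."""

    def __init__(self, bundled: Dict[str, str],
                 fetch: Optional[Callable[[str], str]] = None):
        self._templates = dict(bundled)  # e.g. the most frequently used components
        self._fetch = fetch              # hypothetical web-service callback, may be None

    def get(self, comp_id: str) -> Optional[str]:
        if comp_id in self._templates:
            return self._templates[comp_id]
        if self._fetch is None:
            return None                  # offline: caller must fall back to heuristics
        try:
            text = self._fetch(comp_id)
        except OSError:
            return None                  # network unavailable or service failed
        self._templates[comp_id] = text  # remember it so it is fetched only once
        return text
```

Bundling only the few hundred most common templates keeps the application small, but as the table shows, that still leaves a large fraction of files needing the fallback path.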
     77
     78=== Proposed fix: ===
     79
     80The [[http://www.rcsb.org/|RCSB PDB]]'s [[http://ligand-expo.rcsb.org/|Ligand Expo]] can be used as a web service for downloading individual chemical components. This is only a partial alternative to redistributing the monolithic chemical component file, because it has the web service problems of Internet connectivity, timeliness, and privacy.
     81
     82This also does not support undeposited data, since, by definition, undeposited components are not available yet.
     83
     84A better solution would be to adopt the [[http://www.ebi.ac.uk/pdbe/|PDBe]]'s updated mmCIF solution of embedding the chemical component data in the mmCIF file itself.
     85
     86Whatever the solution, it needs to be documented by the wwPDB.
     87
     88== Mixed-case keywords and data names ==
     89
     90In our study, supporting mixed-case keywords and data names slowed down parsing by approximately six percent in a large file (the percentage would be higher in smaller files, but less noticeable).
     91
     92=== Proposed fix: ===
     93
     94There is no need to support mixed case anymore. Just mandate lowercase for keywords and a consistent case for data names (atom_site.Cartn_[xyz] comes to mind; otherwise the wwPDB has been consistent in using lowercase everywhere).
     95
     96Also, data values should be in a consistent case. For example, there is no need for chem_comp_bond.value_order to be case insensitive.
     97
     98Being case insensitive is a waste of time and energy.
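
The extra work is easy to see in a sketch. The two hypothetical functions below recognize the CIF reserved words; the case-insensitive one has to build a lowered copy of every token before comparing, while the lowercase-only one compares the token as-is. This only illustrates where the cost comes from and is not the code that was benchmarked.

```python
# CIF reserved words, all of which end in an underscore.
KEYWORDS = ("data_", "loop_", "save_", "global_", "stop_")


def is_keyword_any_case(token: str) -> bool:
    """Case-insensitive check: allocates a lowered copy of every token."""
    return token.lower().startswith(KEYWORDS)


def is_keyword_lowercase_only(token: str) -> bool:
    """Check that is valid only if writers are required to use lowercase."""
    return token.startswith(KEYWORDS)
```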
     99
     100== Optional PDB Styling ==
     101
     102The wwPDB kindly provides the mmCIF atom_site category as fixed-width column data as described in the mmCIF FAQ section [[http://mmcif.wwpdb.org/docs/faqs/pdbx-mmcif-faq-general.html#collapse3| "format styling plans for PDBx/mmCIF"]]. This can be a huge benefit for mmCIF files that are written once and read many, many times: reading the file can be significantly sped up. In our case, for 3j3q, reading was 3.73 times faster!
     103
     104Yet, there is no annotation in an mmCIF file that indicates whether a data block or particular categories follow the PDB format styling rules. Since applications cannot reliably recover if the rules are violated (see below), the optimized reading code is mostly unused.
     105
     106||  ||= 3j3q.cif =||= Speedup =||
     107|| fully tokenized || 1.81 sec || 1x ||
     108|| lowercase categories/keywords at start of line || 1.73 sec || 1.05x ||
     109|| with fixed columns || 0.603 sec || 3.00x ||
     110|| + fixed length rows (trailing spaces) || 0.594 sec || 3.05x ||
     111|| + tables terminated with comment || 0.570 sec || 3.18x ||
     112|| with everything || 0.485 sec || 3.73x ||
     113
     114=== Proposed fix: ===
     115
     116It would be nice if the fixed column width optimization were available for more categories (other than the large atom_site and atom_site_anisotrop categories). The problem is that some fields might have multiline values, and when a table is assumed to have fixed column widths, the reader cannot be guaranteed to detect that the assumption is wrong. But the mmCIF writer could easily provide a clue. Since the writer has to compute the appropriate column widths before outputting the category, it knows whether any field is multiline. Consequently, if there is a multiline field, the writer can split the columns of the first row onto multiple lines (assuming it is a multicolumn table). Then, when the mmCIF reader computes the column offsets from the first row, it can see the newline clue (easy, since it keeps track of the line number for error messages) and skip the fixed column width optimization for that category.
     117
     118So the recommendation is that the wwPDB publish that the "format styling plans" are required, in the sense that if they are used, then the file must be annotated with that information. Different styling requirements could be annotated separately, ''i.e.'', having only lowercase keyword and category names ''vs.'' fixed width columns (either for the whole data block or for particular categories). The annotations could be in the audit_conform category or a new category. We don't care as long as they're there.
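
The newline clue can be sketched as follows. This is an illustration under simplifying assumptions, not the reader's actual code: values contain no quoted strings or embedded spaces, `lines` holds only the data rows of one category, and the function names `column_offsets` and `read_rows` are invented for the example.

```python
import re
from typing import List, Optional


def column_offsets(first_line: str, n_columns: int) -> Optional[List[int]]:
    """Start offset of each column, or None if the first row spans several lines."""
    starts = [m.start() for m in re.finditer(r"\S+", first_line)]
    if len(starts) != n_columns:
        return None  # the writer's clue: first row was split, so do not trust widths
    return starts


def read_rows(lines: List[str], n_columns: int) -> List[List[str]]:
    offsets = column_offsets(lines[0], n_columns)
    if offsets is None:
        # Slow path: tokenize everything and regroup into rows.
        tokens = " ".join(lines).split()
        return [tokens[i:i + n_columns] for i in range(0, len(tokens), n_columns)]
    # Fast path: slice every line at the offsets computed from the first row.
    ends = offsets[1:] + [None]
    return [[line[s:e].strip() for s, e in zip(offsets, ends)] for line in lines]
```

The fast path never inspects individual characters after the first row, which is where the speedup comes from; the clue costs the writer one extra newline per affected category.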
     119
     120== Metal coordination bonds ==
     121
     122One of the nice things in mmCIF is the explicit metal coordination bonds. Unfortunately, that information is missing if those bonds are within a chemical component. For example, the HEM chemical component gives four separate single bonds to the iron ion, but chemical knowledge says there can be at most two.
     123
     124=== Proposed fix: ===
     125
     126The value_order attribute values should be extended to support ionic bonds, and the appropriate chemical components should be updated.
     127
     128== Nonunique Waters ==
     129
     130The atom_site's label_comp_id, label_asym_id, label_entity_id, and label_seq_id data values are supposed to uniquely identify the residue. Unfortunately, in the wwPDB's mmCIF files, the HOH residues have identical label_* fields, so they are indistinguishable without chemical knowledge or by peeking at the non-required auth_seq_id data value.
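
A reader can detect the problem by treating the label_* values as a database key and looking for collisions. The sketch below assumes a single model with no alternate locations, so that a residue key plus the atom name should occur only once; the function name and the sample rows are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

LABEL_KEY = ("label_comp_id", "label_asym_id", "label_entity_id", "label_seq_id")


def ambiguous_atom_keys(atoms: List[Dict[str, str]]) -> List[Tuple[str, ...]]:
    """Return (residue key + atom name) tuples that occur more than once."""
    counts: Counter = Counter()
    for atom in atoms:
        key = tuple(atom[name] for name in LABEL_KEY) + (atom["label_atom_id"],)
        counts[key] += 1
    return [key for key, n in counts.items() if n > 1]


# Made-up atom_site rows: two waters that differ only in auth_seq_id.
atoms = [
    {"label_comp_id": "HOH", "label_asym_id": "C", "label_entity_id": "3",
     "label_seq_id": ".", "label_atom_id": "O", "auth_seq_id": "201"},
    {"label_comp_id": "HOH", "label_asym_id": "C", "label_entity_id": "3",
     "label_seq_id": ".", "label_atom_id": "O", "auth_seq_id": "202"},
    {"label_comp_id": "ALA", "label_asym_id": "A", "label_entity_id": "1",
     "label_seq_id": "1", "label_atom_id": "CA", "auth_seq_id": "1"},
]

collisions = ambiguous_atom_keys(atoms)
```

In this sample the two water oxygens collide, and only the optional `auth_seq_id` value tells them apart.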
     131
     132=== Proposed fix: ===
     133
     134Just give the waters separate sequence ids and preserve the integrity of the database.
     135
     136== Why is entity_poly_seq category not required? ==
     137
     138In the mmCIF dictionary documentation, the entity_poly_seq category is not required. But the documentation doesn't say when it is needed in practice: it is needed to give the inter-residue polymer connectivity.
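
Since the protocol is undocumented, the following is only our guess at the simplest version of it: residues with consecutive sequence numbers in the same entity are linked. The function name `polymer_links` and the sample rows are hypothetical, and a real reader would also have to check that both residues were actually modeled and choose the bonded atoms from chemical knowledge.

```python
from typing import Dict, List, Tuple


def polymer_links(poly_seq: List[Tuple[str, str, str]]) -> List[Tuple[str, int, int]]:
    """poly_seq rows are (entity_id, num, mon_id); returns (entity_id, num, next_num)."""
    by_entity: Dict[str, List[Tuple[int, str]]] = {}
    for entity_id, num, mon_id in poly_seq:
        by_entity.setdefault(entity_id, []).append((int(num), mon_id))
    links = []
    for entity_id, residues in by_entity.items():
        residues.sort()
        for (n1, _), (n2, _) in zip(residues, residues[1:]):
            if n2 == n1 + 1:  # only consecutive sequence positions are linked
                links.append((entity_id, n1, n2))
    return links


# Made-up entity_poly_seq rows for a tripeptide and a dinucleotide.
rows = [("1", "1", "MET"), ("1", "2", "ALA"), ("1", "3", "GLY"),
        ("2", "1", "DA"), ("2", "2", "DT")]
links = polymer_links(rows)
```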
     139
     140=== Proposed fix: ===
     141
     142Document the protocol used to determine connectivity. If there are no polymers, then it makes sense to leave it out, but in most mmCIF files, it is effectively required.
     143
     144== Different sized models ==
     145
     146mmCIF files with NMR ensembles sometimes have a different set of atoms in each model. This complicates restoring the connectivity. (TODO: haven't solved this yet.)
     147
     148=== Proposed fix: ===
     149
     150TODO: need to understand this more first
     151
     152== wwPDB doesn't follow CIF standard markup conventions ==
     153
     154CIF text has a standard markup for writing Greek letters, accented letters, super- and subscripts, and some typographic style codes. mmCIF uses old PDB format markup. Ideally, text data values would use Unicode encoded as UTF-8.
     155
     156=== Proposed fix: ===
     157
     158The mmCIF documentation needs to be explicit about which parts of the CIF standard it conforms to.
     159
     160== References ==
     161
     162* [[http://www.iucr.org/resources/cif/|CIF]]:: Crystallographic Information Framework, International Union of Crystallography
     163* [[http://mmcif.wwpdb.org/|mmCIF]]:: PDBx/mmCIF Dictionary Resources, wwPDB
     164* [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702783/|PDBe]]:: ''PDBe: improved accessibility of macromolecular structure data from PDB and EMDB'', Velankar ''et al.'', Nucleic Acids Res. 2016 Jan 4; 44(Database issue): D385-D395
     165* readcif:: ''Benchmarking readcif'', Greg Couch, RBVI, University of California at San Francisco, 2014 June 17