Changes between Version 1 and Version 2 of mmcif-issues


Timestamp:
Mar 10, 2016, 1:05:11 PM (10 years ago)
Author:
gregc
Comment:

--

  • mmcif-issues

    v1 v2  
    11= DRAFT -- Practical mmCIF Issues -- DRAFT =
    2 (And how to improve mmCIF's utility)
     2(And some proposed solutions)
    33
    44=== Contents (in no particular order yet) ===
    55
    66  * [[#back|Background]]
    7   * [[#suffix|No unique file suffix]]
    8   * [[#conn|Missing connectivity]]
    9   * [[#case|Mixed-case keywords and data names]]
    10   * [[#fixed|Optional PDB Styling]]
    11   * [[#metal|Metal coordination bonds]]
    12   * [[#water|Nonunique Waters]]
    13   * [[#size|Different sized models]]
    14   * [[#standard|mmCIF doesn't follow CIF standard]]
     7  * Performance issues
     8    * [[#case|Mixed-case keywords and data names]]
     9    * [[#fixed|Fixed-width columns]]
     10  * Semantics
     11    * [[#conn|Missing connectivity]]
     12    * [[#metal|Metal coordination bonds]]
     13    * [[#water|Nonunique Waters]]
     14    * [[#size|Different sized models]]
     15    * [[#optional|Missing why/when categories are optional]]
     16  * Miscellaneous
     17    * [[#suffix|Hard to identify]]
     18    * [[#standard|mmCIF doesn't follow CIF standard]]
    1519  * [[#refs|References]]
    1620
    17 == Background == [=#back]
     21
     22== Background == #back
    1823
    1924The [[http://mmcif.wwpdb.org/|mmCIF]] file format is used by the [[http://www.wwpdb.org/|Worldwide Protein Data Bank (wwPDB)]] consortium to share deposited molecular data with the global community. For the wwPDB, the mmCIF file format is the new and improved version of the venerable [[http://www.wwpdb.org/documentation/file-format|PDB file format]]. With the new format, there is the expectation that the PDB file format's deficiencies will be addressed and fixed.
     
    3540The mmCIF format is essentially a normalized relational database, with the label_* data values acting as the database table keys. When mmCIF writers ignore the database semantics, it creates problems for mmCIF readers.
    3641
    37 == No Unique File Suffix ==
     42
     43== Hard to Identify == #suffix
    3844
    3945mmCIF files suffer identity confusion. The wwPDB distributes mmCIF files whose filenames have a '''.cif''' suffix. That suffix is shared with many different file types that use the [[http://www.iucr.org/resources/cif/|CIF]] file format: for example, small molecule structure factors, image data, DDL1 and DDL2 dictionaries, and macromolecular data. Therefore, applications depend on the user to tell them what type of CIF file they are reading.
    4046
    41 The file contents are supposed to conform to a [[http://www.iucr.org/resources/cif/dictionaries| dictionary]], but most CIF files, including mmCIF files not from the wwPDB, do not mention which dictionary is the appropriate one. The dictionary used is needed to validate the file and should be required in every CIF file. Knowing which dictionary was used also identifies the file type. And conversely, knowing the file type, would limit the set of dictionaries that the file might conform to.
     47The file contents are supposed to conform to a [[http://www.iucr.org/resources/cif/dictionaries|dictionary]], but most CIF files, including mmCIF files not from the wwPDB, do not mention which dictionary is the appropriate one. The dictionary is needed to validate the file, so naming it should be required in every CIF file. Knowing which dictionary was used also identifies the file type. Conversely, knowing the file type would limit the set of dictionaries that the file might conform to.
    4248
    4349Normally, different file types have different suffixes, so the computer's operating system can help the user and show a unique icon for the file type. That is not possible for '''.cif''' files.
     
    4955The wwPDB should lead by example and:
    5056
     57  * Require a unique header, ''e.g.'', '''#\# mmcif_pdbx 4.052''' (dictionary name and version), at the start of the file
    5158  * Adopt the '''.mmcif''' file suffix for all mmCIF files served by the members of the wwPDB.
    52   * Require that every mmcif file have the audit_conform category with the dictionary information (this really should have been a CIF requirement)
     59  * Require that every mmcif file have the audit_conform category with the dictionary information (partially redundant with the unique header).  This is really a deficiency of the CIF standard.
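
A minimal sketch of how a reader might use the proposed header to identify a file from its first line. The header form follows the example in the list above; the regular expression and the function name `identify_dictionary` are assumptions made for illustration, not part of any existing tool.

```python
import re

# Assumed header form, taken from the proposal above:
#   #\# <dictionary name> <dictionary version>
HEADER_RE = re.compile(r"^#\\#\s*(?P<name>\S+)\s+(?P<version>\S+)\s*$")


def identify_dictionary(first_line):
    """Return (name, version) if the line looks like the proposed header, else None."""
    match = HEADER_RE.match(first_line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group("name"), match.group("version")
```

A file whose first line does not match would still have to be identified some other way, for example by reading its audit_conform category.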
    5360
    54 == Missing Connectivity ==
     61
     62== Missing Connectivity == #conn
    5563
    5664A major part of any molecular visualization application is showing the connectivity of the atoms, ''i.e.'', the bonds.
     
    6169
    6270||= component in # of mmCIF files or more =||= # of component templates found =||= % of mmCIF files =||= zipped disk space =||
    63 || 100 || 184 || 62.47% || 388K ||
    64 || 90 || 202 || 63.23% || 448K ||
    65 || 80 || 219 || 64.08% || 491K ||
    66 || 70 || 236 || 64.74% || 525K ||
    67 || 60 || 263 || 65.63% || 581K ||
    68 || 50 || 299 || 66.74% || 682K ||
    69 || 40 || 359 || 68.34% || 846K ||
    70 || 30 || 434 || 69.79% || 1.1M ||
    71 || 20 || 595 || 72.02% || 1.5M ||
    72 || 10 || 1040 || 76.30% || 2.7M ||
    73 || 2 || 5243 || 88.32% || 15M ||
    74 || 1 || 17339 || 100.00% || 52M ||
     71||  100  ||  184  ||  62.47%  ||  388K ||
     72||  90  ||  202  ||  63.23%  ||  448K ||
     73||  80  ||  219  ||  64.08%  ||  491K ||
     74||  70  ||  236  ||  64.74%  ||  525K ||
     75||  60  ||  263  ||  65.63%  ||  581K ||
     76||  50  ||  299  ||  66.74%  ||  682K ||
     77||  40  ||  359  ||  68.34%  ||  846K ||
     78||  30  ||  434  ||  69.79%  ||  1.1M ||
     79||  20  ||  595  ||  72.02%  ||  1.5M ||
     80||  10  ||  1040  ||  76.30%  ||  2.7M ||
     81||  2  ||  5243  ||  88.32%  ||  15M ||
     82||  1  ||  17339  ||  100.00%  ||  52M ||
    7583
    76 From the above table, all of the chemical component templates are about 52M zipped. mmCIF readers that want to correctly read all current mmCIF files need all of the templates. But new templates are frequently added, so if applications do not want to be frequently updated with the current templates, the applications need to use a web service to fetch the templates, with all of the problems that using a web service entail (the Internet has to be available, the template has to be up-to-date and correct, and the user is not worried about leaking what they are working on).
     84From the above table, all of the chemical component templates are about 52Mbytes zipped. mmCIF readers that want to correctly read all current mmCIF files need all of the templates. But new templates are frequently added, so if applications do not want to be frequently updated with the current templates, the applications need to use a web service to fetch the templates, with all of the problems that using a web service entails (the Internet has to be available, the template has to be up-to-date and correct, and the user must not be worried about leaking what they are working on).
    7785
    7886=== Proposed fix: ===
     
    8290This also does not support undeposited data, since, by definition, undeposited components are not available yet.
    8391
    84 A better solution would be to adopt the [[http://www.ebi.ac.uk/pdbe/|PDBe]]'s updated mmCIF solution of embedding the chemical component
     92A better solution would be to adopt the [[http://www.ebi.ac.uk/pdbe/|PDBe]]'s updated mmCIF solution of embedding the chemical component.
    8593
    8694Whatever the solution, it needs to be documented by the wwPDB.
    8795
    88 == Mixed-case keywords and data names ==
     96
     97== Mixed-case keywords and data names == #case
    8998
    9099In our study, supporting mixed case keywords and data names slowed down the parsing by approximately six percent in a large file (the percentage would be higher in smaller files, but less noticeable).
     
    92101=== Proposed fix: ===
    93102
    94 There is no need to support mixed case anymore. Just mandate lowercase for keywords and consistent case for data names (thinking about atom_site.Cartn_[xyz], otherwise the wwPDB was consistent in using lowercase everywhere).
     103There is no need to support mixed case anymore. Just mandate lowercase for keywords and consistent case for data names (thinking about ''atom_site.Cartn_[xyz]'', otherwise the wwPDB was consistent in using lowercase for data names).
    95104
    96 Also data values should be in a consistent case. For example, there is no need for chem_comp_bond.value_order to be case insensitive.
     105Also data values should be in a consistent case. For example, there is no need for ''chem_comp_bond.value_order'' to be case insensitive.
    97106
    98107Being case insensitive is a waste of time and energy.
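
To make the cost concrete, here is a small sketch contrasting the two lookups. It is not the readcif implementation; the function names are hypothetical and the data names are only examples.

```python
def find_column_insensitive(names, wanted):
    """Case-insensitive lookup: every candidate must be case-folded before comparing."""
    wanted = wanted.lower()
    for index, name in enumerate(names):
        if name.lower() == wanted:
            return index
    return -1


def find_column_exact(names, wanted):
    """Exact lookup: a plain comparison, possible once the case is mandated."""
    try:
        return names.index(wanted)
    except ValueError:
        return -1
```

With a mandated case, the exact lookup is sufficient and the per-name normalization disappears.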
    99108
    100 == Optional PDB Styling ==
    101109
    102 The wwPDB kindly provides the mmCIF atom_site category as fixed-width column data as described in the mmCIF FAQ section [[http://mmcif.wwpdb.org/docs/faqs/pdbx-mmcif-faq-general.html#collapse3| "format styling plans for PDBx/mmCIF"]]. This can be a huge benefit for mmCIF files that are written once and read many, many times, the reading the file can be significantly sped up &mdash in our case, for 3j3q, 3.73 times faster!
     110== Fixed-width Columns == #fixed
    103111
    104 Yet, there is no annotation in an mmCIF file that indicates if a data block or particular categories follow the PDB format styling rules. Since applications can not reliably recover if the rules are violated (see below), the code is mostly unused.
     112The wwPDB kindly provides the mmCIF ''atom_site'' category as fixed-width column data as described in the mmCIF FAQ section [[http://mmcif.wwpdb.org/docs/faqs/pdbx-mmcif-faq-general.html#collapse3| "format styling plans for PDBx/mmCIF"]]. This can be a huge benefit for mmCIF files that are written once and read many, many times.  For large categories, reading the file can be significantly sped up: in our case, for '''3j3q''', 3.73 times faster!
     113
     114Yet, there is no annotation in an mmCIF file that indicates if a data block or particular categories use fixed-width columns. Since applications cannot reliably recover if the rules are violated (see below), the code is mostly unused.
    105115
    106116||  ||= 3j3q.cif =||= Speedup =||
     
    118128So the recommendation is that the wwPDB publish that the "format styling plans" are required — in the sense that if they are used, then the file is annotated with that information. Different styling requirements could be annotated separately, ''i.e.'', having only lowercase keyword and category names ''vs.'' fixed width columns (either the whole data block or particular categories). The annotations could be in the audit_conform category or a new category. We don't care as long as they're there.
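
The benefit of fixed-width columns can be sketched as follows: the column positions are found once, from the first row, and every later row is sliced at those positions instead of being tokenized. This sketch assumes left-aligned columns and no quoted values containing whitespace; the function names are hypothetical.

```python
def column_starts(first_row):
    """Find the starting offset of each whitespace-separated field in the first row."""
    starts = []
    pos = 0
    for token in first_row.split():
        pos = first_row.index(token, pos)  # the next token begins at or after pos
        starts.append(pos)
        pos += len(token)
    return starts


def parse_fixed(row, starts):
    """Slice a row at the precomputed offsets (assumes the row obeys the same layout)."""
    ends = starts[1:] + [len(row)]
    return [row[start:end].strip() for start, end in zip(starts, ends)]
```

If a row violates the layout, the slices silently return wrong values, which is why a reader can only take this path when the file is annotated as using fixed-width columns.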
    119129
    120 == Metal coordination bonds ==
     130
     131== Metal coordination bonds == #metal
    121132
    122133One of the nice things in mmCIF is the explicit metal coordination bonds. Unfortunately, that information is missing if those bonds are within a chemical component. For example, the HEM chemical component gives four separate single bonds to the iron ion, but there can be at most two (chemical knowledge).
     
    124135=== Proposed fix: ===
    125136
    126 The value_order attribute values should be extended to support ionic bonds, and the appropriate chemical components should be updated.
     137The ''chem_comp_bond.value_order'' values should be extended to support ionic bonds, and the appropriate chemical components should be updated.
    127138
    128 == Nonunique Waters ==
    129139
    130 The atom_site's label_comp_id, label_asym_id, label_entity_id, and label_seq_id data values are supposed to uniquely identify the residue. Unfortunately, in the wwPDB's mmCIF files, the HOH residues have identical label_* fields, so they are indistinguishable without chemical knowledge or by peeking at the non-required auth_seq_id data value.
     140== Nonunique Waters == #water
     141
     142The atom_site's label_comp_id, label_asym_id, label_entity_id, and label_seq_id data values are supposed to uniquely identify the residue. Unfortunately, in the wwPDB's mmCIF files, the HOH residues have identical label_* fields, so they are indistinguishable without chemical knowledge or by peeking at the optional auth_seq_id data value.
    131143
    132144=== Proposed fix: ===
     
    134146Just give the waters separate sequence ids and preserve the integrity of the database.
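
A minimal sketch of how a reader could detect the problem, assuming each ''atom_site'' row has already been parsed into a dictionary keyed by data name. The function name is hypothetical, and including label_atom_id in the key is an assumption made here so that two atoms of the same residue do not count as a collision.

```python
from collections import Counter

# Data names that are supposed to identify a residue, plus the atom name.
KEY_FIELDS = ("label_asym_id", "label_entity_id", "label_seq_id",
              "label_comp_id", "label_atom_id")


def duplicate_atom_keys(atoms):
    """Return the label_* keys that are shared by more than one atom_site row."""
    counts = Counter(tuple(atom[field] for field in KEY_FIELDS) for atom in atoms)
    return [key for key, count in counts.items() if count > 1]
```

Rows that differ only by alternate location would also be reported by this sketch, so a real check would need to take label_alt_id into account.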
    135147
    136 == Why is entity_poly_seq category not required? ==
     148== Missing why/when categories are optional == #optional
    137149
    138150In the mmCIF dictionary documentation, the entity_poly_seq category is not required. But it doesn't say when it is required (it's required to give the inter-residue polymer connectivity).
     
    140152=== Proposed fix: ===
    141153
    142 Document the protocol used to determine connectivity. If there are no no polymers, then it makes since to leave it out, but in most mmCIF files, it is effectively required.
     154Document the protocol used to reconstruct connectivity. If there are no polymers, then it makes sense to leave it out, but in most mmCIF files, it is effectively required.
    143155
    144 == Different sized models ==
     156
     157== Different sized models == #size
    145158
    146159mmCIF files with NMR ensembles sometimes have a different set of atoms in each model. This complicates restoring the connectivity. (TODO: haven't solved this yet.)
     
    150163TODO: need to understand this more first
    151164
    152 == wwPDB doesn't follow CIF standard markup conventions ==
     165
     166== wwPDB doesn't follow CIF standard markup conventions == #standard
    153167
    154168CIF text has a standard markup to write greek letters, accented letters, super- and subscripts, and some typographic style codes. mmCIF uses old PDB format markup. Ideally, text data values would use UTF-8 Unicode standard.
     
    158172The mmCIF documentation needs to be explicit about which parts of the CIF standard it conforms to.
    159173
    160 == References ==
     174
     175== References == #refs
    161176
    162177* [[http://www.iucr.org/resources/cif/|CIF]]:: Crystallographic Information Framework, International Union of Crystallography
    163178* [[http://mmcif.wwpdb.org/|mmCIF]]:: PDBx/mmCIF Dictionary Resources, wwPDB
    164179* [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702783/|PDBe]]:: ''PDBe: improved accessibility of macromolecular structure data from PDB and EMDB'', Velankar ''et al.'', Nucleic Acids Res. 2016 Jan 4; 44(Database issue): D385-D395
    165 * readcif:: ''Benchmarking readcif'', Greg Couch, RBVI, University of California at San Francisco, 2014 June 17
     180* [[http://www.cgl.ucsf.edu/home/gregc/readcif/|readcif]]:: ''Benchmarking readcif'', Greg Couch, RBVI, University of California at San Francisco, 2014 June 17