Opened 4 years ago

Closed 3 years ago

Last modified 3 years ago

#5601 closed enhancement (fixed)

Log templates from mmCIF files for theoretical models

Reported by: ben@…
Owned by: pett
Priority: moderate
Milestone:
Component: Structure Comparison
Version:
Keywords:
Cc: Tom Goddard
Blocked By:
Blocking:
Notify when closed:
Platform: all
Project: ChimeraX

Description

Ben is interested in being able to show template sequence alignments, template structures, and scores for theoretical models from ModBase and possibly other sources such as the Model Archive database and AlphaFold models. Maybe this mmCIF table info with links could be logged in a table when the file is opened.

Begin forwarded message:

From: Ben Webb via ChimeraX-users <chimerax-users@…>
Subject: [chimerax-users] Any interest in reading ModelArchive metadata from mmCIF files?
Date: November 12, 2021 at 10:20:19 AM PST
To: ChimeraX Users Help <chimerax-users@…>
Reply-To: Ben Webb

Do you have any plans to extend ChimeraX's mmCIF reader to parse and display metadata on theoretical models, such as quality scores or the alignments to template structures?

The folks at PDB have recently done a lot of work to standardize this metadata in the MA mmCIF dictionary:
https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Index/

The dictionary has already been adopted by ModelArchive (e.g. AlphaFold2 models) and by ModBase (Modeller models) and I believe that other repositories such as SwissModel are also moving in that direction. See e.g. mmCIF downloads at
https://www.modelarchive.org/doi/10.5452/ma-bak-cepc-0250
https://modbase.compbio.ucsf.edu/modbase-cgi/model_search.cgi?databaseID=Q12321

(My ulterior motive: we've previously built Chimera web data files to download a ModBase model and the accompanying alignment, and display them in Chimera; now that this data is embedded in the mmCIF file, in principle ChimeraX could do this itself in a less clunky and not ModBase-specific fashion.)

Ben

Change History (12)

comment:1 by Tom Goddard, 4 years ago

From: Tom Goddard
Subject: Re: [chimerax-users] Any interest in reading ModelArchive metadata from mmCIF files?
Date: November 12, 2021 at 2:56:49 PM PST
To: Ben Webb
Cc: ChimeraX Users Help <chimerax-users@…>

Hi Ben,

It would be nice to show templates and sequence alignments used for predicted models from AlphaFold and Modeller. We could output an html table in the log that lists the templates with a link to show the sequence alignment and a link to load and align the template if it is from the PDB.

The AlphaFold models in the EBI AlphaFold database don't appear to say what template structures were used. For instance, I looked at AF-P12004-F1-model_v1.cif:

https://alphafold.ebi.ac.uk/entry/P12004

I believe AlphaFold2 finds the 20 best matching structures in the PDB and uses 4 (not sure how they are selected). I've run AlphaFold many times and the log output says what the 20 matches are but does not appear to say which 4 structures it actually used -- pretty unfortunate. The AlphaFold per-residue confidence scores are in an mmCIF table _ma_qa_metric_local:

#
loop_
_ma_qa_metric_local.label_asym_id
_ma_qa_metric_local.label_comp_id
_ma_qa_metric_local.label_seq_id
_ma_qa_metric_local.metric_id
_ma_qa_metric_local.metric_value
_ma_qa_metric_local.model_id
_ma_qa_metric_local.ordinal_id
A MET 1 2 91.95 1 1
A PHE 2 2 96.89 1 1
A GLU 3 2 98.01 1 1
A ALA 4 2 98.08 1 1
A ARG 5 2 97.76 1 1
A LEU 6 2 96.16 1 1
..

Currently ChimeraX colors AlphaFold models by confidence using the same scores taken from the bfactor column of the atom site table.
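
As an illustration only (this is not ChimeraX's mmCIF reader), the _ma_qa_metric_local loop shown above could be read with a few lines of Python. The function name read_local_scores and the variable LOOP are made up for this sketch, and it assumes one row per line with no quoted values, which holds for the excerpt but not for mmCIF in general.

```python
# First three rows of the _ma_qa_metric_local excerpt quoted above.
LOOP = """\
loop_
_ma_qa_metric_local.label_asym_id
_ma_qa_metric_local.label_comp_id
_ma_qa_metric_local.label_seq_id
_ma_qa_metric_local.metric_id
_ma_qa_metric_local.metric_value
_ma_qa_metric_local.model_id
_ma_qa_metric_local.ordinal_id
A MET 1 2 91.95 1 1
A PHE 2 2 96.89 1 1
A GLU 3 2 98.01 1 1
"""


def read_local_scores(text):
    """Map (chain id, residue number) to the per-residue score.

    Hypothetical helper: assumes one data row per line and no quoting.
    """
    columns, scores = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_ma_qa_metric_local."):
            # Keep only the item name after the category prefix.
            columns.append(line.split(".", 1)[1])
        elif line and line != "loop_" and not line.startswith("#"):
            row = dict(zip(columns, line.split()))
            key = (row["label_asym_id"], int(row["label_seq_id"]))
            scores[key] = float(row["metric_value"])
    return scores


scores = read_local_scores(LOOP)
print(scores[("A", 2)])  # 96.89
```

Keying on chain and residue number keeps the scores independent of the B-factor column, so a file could carry more than one per-residue metric if the metric_id were added to the key.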

The Model Archive example you gave (https://www.modelarchive.org/doi/10.5452/ma-bak-cepc-0250) has no template sequences or alignments in the mmCIF file and no per-residue scores, but it does have some global scores.

Your ModBase example (https://modbase.compbio.ucsf.edu/modbase-cgi/model_search.cgi?databaseID=Q12321) does have a template sequence, an alignment, and global scores, but no per-residue scores:

#
loop_
_ma_template_ref_db_details.template_id
_ma_template_ref_db_details.db_name
_ma_template_ref_db_details.db_accession_code
1 PDB 3nc1

#
loop_
_ma_template_poly.template_id
_ma_template_poly.seq_one_letter_code
_ma_template_poly.seq_one_letter_code_can
1 DMACDTFIKIAQKCRRHFVQVQVGEVMPFIDEILNNINTIICDLQPQQVHTFYEAVGYMIGAQTDQTVQEHLIEKYMLLPNQVWDSIIQQATKNVDILKDPETVKQLGSILKTNVRACKAVGHPFVIQLGRIYLDMLNVYKCLSENISAAIQANGEMVTKQPLIRSMRTVKRETLKLISGWVSRSNDPQMVAENFVPPLLDAVLIDYQRNVPAAREPEVLSTMAIIVNKLGGHITAEIPQIFDAVFECTLNMINKDFEEYPEHRTNFFLLLQAVNSHCFPAFLAIPPAQFKLVLDSIIWAFKHTMRNVADTGLQILFTLLQNVAQEEAAAQSFYQTYFCDILQHIFSVVTDTSHTAGLTMHASILAYMFNLVEEGKISTPLNPN DMACDTFIKIAQKCRRHFVQVQVGEVMPFIDEILNNINTIICDLQPQQVHTFYEAVGYMIGAQTDQTVQEHLIEKYMLLPNQVWDSIIQQATKNVDILKDPETVKQLGSILKTNVRACKAVGHPFVIQLGRIYLDMLNVYKCLSENISAAIQANGEMVTKQPLIRSMRTVKRETLKLISGWVSRSNDPQMVAENFVPPLLDAVLIDYQRNVPAAREPEVLSTMAIIVNKLGGHITAEIPQIFDAVFECTLNMINKDFEEYPEHRTNFFLLLQAVNSHCFPAFLAIPPAQFKLVLDSIIWAFKHTMRNVADTGLQILFTLLQNVAQEEAAAQSFYQTYFCDILQHIFSVVTDTSHTAGLTMHASILAYMFNLVEEGKISTPLNPN

#
loop_
_ma_alignment.ordinal_id
_ma_alignment.alignment_id
_ma_alignment.target_template_flag
_ma_alignment.sequence
1 1 2 DMACDTFIKIAQKCRRHFVQVQVGEVMPFIDEILNNINTIICDLQPQQVHTFYEAVGYMIGAQTDQTVQEHLIEKYMLLPNQVWDSIIQQATKNVDILKDPETVKQLGSILKTNVRACKAVGHPFVIQLGRIYLDMLNVYKCLSENISAAIQANGEMVTKQPLIRSMRTVKRETLKLISGWVSRSNDPQMVAENFVPPLLDAVLI---------DYQRNVPAAREPEVLSTMAIIVNKLGGHITAEIPQIFDAVFECTLNMINKDFEE---------YPEHRTNFFLLLQAVNSHCFPAFLAIPPAQ---FKLVLDSIIWAFKHTMRNVADTGLQILFTLLQNVAQEEAAAQSFYQTYFCDILQHIFSVVTDTSHTAGLTMHASILAYMFNLVEEGKISTPLNPN
2 1 1 DSYVETLDSMIELFKDYKPGSITLENITRLCQTL-GLESFTEELSNELSR--LSTASKIIVIDVDYNKKQDRIQDVKLVLASNFDNFDYFNQRDGEHEKSNILLNSLTKYPDLKAFHNNLKFLYLLDAYSHIESDSTSHNNGSSDKSLDSSNASFNNQGKLDLFKYFTELSHYIRQCFQDNCCDFKVRTNLNDKFGIYILTQGINGKEVPLAKIYLEENKSDSQYRFYEYIYSQETKSWINESAENFSNGISLVMEIVANAKESNYTDLIWFPEDFISPELIIDKVTCSSNSSSSPPIIDLFSNNNYNSRIQLMNDFTTKLINIKKFDISNDNLDLISEILKWV------------QWSRIVLQNVFKLVSTPSSNSNSSELEPDYQAPFSTSTKDKNSSTSNTE
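
A minimal sketch of how the two _ma_alignment rows could be paired up and compared. In the excerpt above the row with target_template_flag 2 matches the _ma_template_poly sequence, so this sketch treats flag 2 as the template and flag 1 as the target. The helper names and the short sequences are invented for illustration; they are not taken from the ModBase file.

```python
def split_alignment(rows):
    """Return (target, template) from (ordinal_id, alignment_id, flag, sequence) rows.

    Assumes a single alignment with exactly one target and one template row.
    """
    by_flag = {flag: seq for _ordinal, _alignment, flag, seq in rows}
    return by_flag["1"], by_flag["2"]


def percent_identity(target, template):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(target) != len(template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(target, template) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    same = sum(a == b for a, b in pairs)
    return 100.0 * same / len(pairs)


# Toy rows in the same column order as the loop above (made-up sequences).
rows = [
    ("1", "1", "2", "DMAC-TF"),  # template
    ("2", "1", "1", "DSYCETF"),  # target
]
target, template = split_alignment(rows)
identity = percent_identity(target, template)
```

Identity is computed over gap-free columns only; other conventions (dividing by the shorter sequence length, for example) give different numbers, so a displayed value should say which one is used.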

#
loop_
_ma_qa_metric.id
_ma_qa_metric.name
_ma_qa_metric.description
_ma_qa_metric.type
_ma_qa_metric.mode
_ma_qa_metric.other_details
_ma_qa_metric.software_group_id
1 MPQS 'ModPipe Quality Score' other global
'composite score, values >1.1 are considered reliable' 1
2 zDOPE 'Normalized DOPE' zscore global . 2
3 'TSVMod RMSD' 'TSVMod predicted RMSD (MSALL)' distance global . .
4 'TSVMod NO35' 'TSVMod predicted native overlap (MSALL)' other global . .

#
loop_
_ma_qa_metric_global.ordinal_id
_ma_qa_metric_global.model_id
_ma_qa_metric_global.metric_id
_ma_qa_metric_global.metric_value
1 1 1 0.665346
2 1 2 -0.11
3 1 3 14.527
4 1 4 0.036

So it looks like only ModBase would currently benefit from ChimeraX reading template sequences and alignments. I do not think it would be too hard to implement. I've made a ChimeraX feature request for that:

https://www.rbvi.ucsf.edu/trac/ChimeraX/ticket/5601

Tom
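
For reference, the _ma_qa_metric and _ma_qa_metric_global loops quoted in the comment above can be joined on the metric id to get named global scores. The sketch below is an assumption-laden illustration, not the ChimeraX implementation: parse_loop is a made-up helper, and it tokenizes with shlex, whose quoting rules only approximate mmCIF's (they are enough for this excerpt, including the row that wraps onto a second line).

```python
import shlex

METRICS = """\
loop_
_ma_qa_metric.id
_ma_qa_metric.name
_ma_qa_metric.description
_ma_qa_metric.type
_ma_qa_metric.mode
_ma_qa_metric.other_details
_ma_qa_metric.software_group_id
1 MPQS 'ModPipe Quality Score' other global
'composite score, values >1.1 are considered reliable' 1
2 zDOPE 'Normalized DOPE' zscore global . 2
3 'TSVMod RMSD' 'TSVMod predicted RMSD (MSALL)' distance global . .
4 'TSVMod NO35' 'TSVMod predicted native overlap (MSALL)' other global . .
"""

GLOBAL_VALUES = """\
loop_
_ma_qa_metric_global.ordinal_id
_ma_qa_metric_global.model_id
_ma_qa_metric_global.metric_id
_ma_qa_metric_global.metric_value
1 1 1 0.665346
2 1 2 -0.11
3 1 3 14.527
4 1 4 0.036
"""


def parse_loop(text):
    """Parse one simple loop_ block into a list of {item name: value} dicts."""
    columns, tokens = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_" or line.startswith("#"):
            continue
        if line.startswith("_") and not tokens:
            columns.append(line.split(".", 1)[1])
        else:
            # Rows may wrap across lines, so collect tokens and chunk later.
            tokens.extend(shlex.split(line))
    n = len(columns)
    return [dict(zip(columns, tokens[i:i + n])) for i in range(0, len(tokens), n)]


names = {row["id"]: row["name"] for row in parse_loop(METRICS)}
global_scores = {
    names[row["metric_id"]]: float(row["metric_value"])
    for row in parse_loop(GLOBAL_VALUES)
}
```

With the excerpt above, global_scores maps MPQS to 0.665346 and zDOPE to -0.11, which is the kind of name/value pairing a log table would need.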

comment:2 by pett, 4 years ago

Status: assigned → accepted

comment:3 by pett, 4 years ago

I have a question. For any particular model structure, is there only one template? Looking at the mmCIF dictionary, it seems like it has to be, but I want to make sure I'm not misunderstanding something.

comment:4 by pett, 4 years ago

Status: accepted → feedback

To get the ball rolling, tomorrow's build will show the template-target alignment if you open a ModBase mmCIF file. So in addition to my previous question, I'd like to know the immediate next steps you'd want to see to get to the information you are interested in, plus desired improvements to what I've already done. Also, should the alignment immediately show up (current behavior), or should there be a link in the log to show the alignment?

comment:5 by Ben Webb, 4 years ago

On 1/19/22 4:45 PM, ChimeraX wrote:
> I have a question. For any particular model structure, is there only one

Disclaimer: I didn't develop the MA mmCIF dictionary (this was a 
collaboration between PDB and the MA folks) although I did review it.

You can certainly have multiple templates for a given alignment 
(although all ModBase models are single-chain single-template models). 
This would result in multiple lines in the _ma_alignment_details loop 
with the same alignment_id but different 
template_segment_id/target_asym_id pairs. The only issue is that the 
_ma_alignment table doesn't provide enough information to uniquely 
identify the template in this case (see 
https://github.com/ihmwg/MA-dictionary/issues/4)

BTW, my understanding is that the SwissModel repository folks plan to 
also use this dictionary for their mmCIF models in the near future.

	Ben
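
A small sketch of the multi-template case described above: rows of _ma_alignment_details that share an alignment_id but differ in their template_segment_id/target_asym_id pairs. The rows below are invented for illustration (ModBase files only ever have one), and the helper name is hypothetical.

```python
from collections import defaultdict


def templates_per_alignment(details_rows):
    """Group (template_segment_id, target_asym_id) pairs by alignment_id."""
    grouped = defaultdict(list)
    for row in details_rows:
        pair = (row["template_segment_id"], row["target_asym_id"])
        grouped[row["alignment_id"]].append(pair)
    return dict(grouped)


# Made-up rows: alignment 1 uses two templates, alignment 2 uses one.
details = [
    {"alignment_id": "1", "template_segment_id": "1", "target_asym_id": "A"},
    {"alignment_id": "1", "template_segment_id": "2", "target_asym_id": "A"},
    {"alignment_id": "2", "template_segment_id": "3", "target_asym_id": "B"},
]

grouped = templates_per_alignment(details)
# An alignment with more than one pair cannot be labeled from _ma_alignment
# alone, because that table carries no template_segment_id.
ambiguous = sorted(aid for aid, pairs in grouped.items() if len(pairs) > 1)
```

A reader could use the ambiguous list to decide when a specific template name can be shown next to a sequence and when it has to fall back to a generic label.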

comment:6 by pett, 4 years ago

It was exactly the lack of a template_segment_id in the _ma_alignment data that misled me to think that there could only be one template. Without that info, the only name I can put next to the sequence is "template" -- not very informative.

comment:7 by Ben Webb, 4 years ago

Looks good to me just the way it is - thanks! Next steps? Ideally I'd 
like to see any quality scores for the models (ma_qa_metric* tables), 
perhaps in the log or as model attributes. ModBase models only have 
per-model scores, but AlphaFold models have a bunch of per-residue (and 
also I think per-residue-pair) scores too.

BTW, I shared this ticket with folks from SwissModel and AlphaFold so 
they may have additional suggestions.

Also, if it helps at all with interpretation of the various tables, my 
own Python code for parsing MA mmCIF is at 
https://github.com/ihmwg/python-ma.

	Ben

comment:8 by pett, 4 years ago

Thanks for the info. I'll work on getting the score information available and will let you know when something happens. Until the _ma_alignment table deficiency gets addressed, I'll do _something_ about files with multiple templates, probably resorting to just the name "template" instead of the more informative name I can use when I know which template is involved.

comment:9 by Ben Webb, 4 years ago

On 1/20/22 4:41 PM, ChimeraX wrote:

Sounds good. FWIW, the MA folks pointed out that the main advantages of 
using per-residue scores from the _ma_qa_metric_local table (if 
available) instead of relying on the b-factors are

"a) the type of metric used is described in _ma_qa_metric (i.e. instead 
of guessing, one can know that this is a "pLDDT" score)
b) there can be multiple per-residue scores (e.g. I could imagine to 
have separate predictions for accuracy and prob. for a residue being 
disordered)"

	Ben

comment:10 by pett, 4 years ago

Status: feedback → accepted

Okay, in tomorrow's build if you click on the "more info..." link for the model in the Log, that table will show the global scores.

Local scores are next, but it could be a while since I will likely work on some other things first.

comment:11 by pett, 3 years ago

Resolution: fixed
Status: accepted → closed

Okay, implemented local scores. Each score becomes a residue attribute (saved in sessions). Each score produces a log entry that makes it easy to color by that score, e.g.:

Color ma-bak-cepc-0250.cif by residue attribute pLDDT_score
color byattribute r:pLDDT_score #1 palette red:yellow:green
4185 atoms, 548 residues, atom pLDDT_score range 20.3 to 98.4

The "Color" link runs the color byattribute command, and the "attribute" link goes to the help page that describes what attributes are.
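
For scripting, the command shown in the log entry can be assembled from the attribute name. This is only a sketch: color_by_attribute_command is a made-up helper, and its defaults simply mirror the example above. Inside ChimeraX the resulting string could presumably be passed to chimerax.core.commands.run(session, cmd); that call is not exercised here.

```python
def color_by_attribute_command(attr_name, model_spec="#1",
                               palette="red:yellow:green"):
    """Build a 'color byattribute' command for a residue attribute.

    The 'r:' prefix marks the attribute as a residue attribute, as in the
    log entry quoted above.
    """
    return f"color byattribute r:{attr_name} {model_spec} palette {palette}"


cmd = color_by_attribute_command("pLDDT_score")
print(cmd)  # color byattribute r:pLDDT_score #1 palette red:yellow:green
```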

comment:12 by pett, 3 years ago

If you want additional functionality let me know and I will re-open this ticket.
