Opened 8 months ago
Closed 8 months ago
#16903 closed enhancement (fixed)
Allow homolog chains with different residue numbering to be associated with mutation scores
| Reported by: | | Owned by: | Tom Goddard |
|---|---|---|---|
| Priority: | moderate | Milestone: | |
| Component: | Structure Analysis | Version: | |
| Keywords: | | Cc: | Ever.ODonnell@…, Elaine Meng |
| Blocked By: | | Blocking: | |
| Notify when closed: | | Platform: | all |
| Project: | ChimeraX | | |
Description
Willow would like to associate mutation scores with homologous structures which may use different residue numbering. This will need to use a sequence alignment between the mutation score sequence and homologous sequence.
Begin forwarded message:
From: "Coyote-Maestas, Willow"
Subject: Re: Small suggestion for DMS chimera
Date: February 2, 2025 at 4:43:32 PM PST
To: Tom Goddard
Cc: "O'Donnell, Ever"
Just adding a challenge that I faced when mapping scores across these structures that I think will be a common one - many structures are solved based on a mutated form of a protein. For instance, we did a DMS on human NTCP but structures have been solved of homologs but also of a consensus sequence that is more stable. Mapping scores across these is challenging because the specific numberings may be slightly different.
Again no rush on taking care of this just thought it would be helpful info.
Be well,
Willow
Change History (7)
comment:1 by , 8 months ago
| Cc: | added |
|---|
comment:2 by , 8 months ago
One problem with making sequence alignments from homologs to the mutation score sequence is that we don't have the mutation score sequence. The .csv file data has scores for many residues and identifies their amino acid type, but if there are no scores for a residue then it does not appear in the file and we don't know that part of the sequence. So if we want to do sequence alignments and properly handle the case where scores are missing for some residues, the user may need to specify the mutation scores sequence separately.
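To illustrate the gap problem, here is a minimal Python sketch. It assumes, purely for illustration, that each score row names a mutation in the form M1A (wild-type residue, residue number, substituted residue). The function name partial_sequence and the use of X for unknown positions are hypothetical and not part of ChimeraX.

```python
import re

# Assumed (hypothetical) mutation name format: wild-type letter, residue number, new letter.
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def partial_sequence(mutation_names, unknown="X"):
    """Rebuild as much of the scored sequence as the mutation names reveal."""
    wild_type = {}
    for name in mutation_names:
        m = MUTATION.match(name)
        if m is None:
            continue  # skip rows that do not follow the assumed format
        res_type, res_num = m.group(1), int(m.group(2))
        previous = wild_type.setdefault(res_num, res_type)
        if previous != res_type:
            raise ValueError(f"conflicting residue types at position {res_num}")
    if not wild_type:
        return ""
    last = max(wild_type)
    # Positions without any scores stay unknown, and anything past the
    # last scored residue cannot be recovered at all.
    return "".join(wild_type.get(i, unknown) for i in range(1, last + 1))

print(partial_sequence(["M1A", "M1C", "S3A", "K4R"]))  # prints MXSK
```

Residue 2 has no scores in this toy input, so its amino acid type is unknown, which is why a separately supplied full sequence may be needed before aligning.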
comment:3 by , 8 months ago
It might make sense to use the ChimeraX "sequence align" command, which uses the Clustal Omega alignment web service, to create a multiple sequence alignment of the mutation scores sequence and each unique associated chain sequence. Showing the resulting alignment would be useful so that alignment problems, like segments not being aligned or being misaligned, can be spotted. To correct such problems the user could provide their own multiple sequence alignment, including the mutation scores sequence and all associated sequences, for the alignSequences option to use. These ideas provide a better quality alignment, allow the user to see the alignment, and allow the user to provide a custom alignment.
comment:4 by , 8 months ago
I changed the alignSequences option of the "mutationscores structure" command. It now takes the sequence for the mutation score data and aligns the specified chain sequences to it using the Clustal Omega sequence alignment method (via the ChimeraX "sequence align" command). The reason it takes the mutation scores sequence is that the full sequence is not included in that data (we can figure out the sequence if every residue has scores, but in general scores may not exist for some of the residues).
mutationscores structure #1,2 alignSequences ABCG2_HUMAN
The sequence alignment uses a web service (takes about 10 seconds), so you have to be connected to the internet. The multiple sequence alignment will be shown in a separate panel. You can also provide your own multiple sequence alignment, for instance by opening a FASTA file. That creates an alignment window with an ID in its title, and you can use that ID with the alignSequences option.
open abcg2_ab25g.fasta
mutationscores structure #1,2 alignSequences abcg2_ab25g.fasta
I removed my previous alignSequences option (true or false value) that used Needleman-Wunsch alignment.
comment:5 by , 8 months ago
Here are some more details about the alignSequences option. You can specify a single sequence in any of the usual ways: UniProt name or accession, chain specifier, raw sequence string, or sequence viewer ID. If you instead specify a multiple sequence alignment, the first sequence in the alignment must exactly match the mutation score data. If it does not match, an error will be raised explaining that. The alignment should contain each of the exact sequences for the chains you are associating, with the same residue numbering (i.e. the sequence starts with residue number 1). The alignment does not need to contain extra copies of the same sequence if multiple chains being associated have the same sequence. The alignment may contain extra sequences, which will be ignored. If you specify a single sequence, the Clustal Omega alignment will have that sequence first, followed by all the unique sequences in the specified chains (in any order). A multiple sequence alignment in ChimeraX has associations between open structure chains and matching sequences. Those associations will be used by the alignSequences option to pair the specified chains with their sequences, so it is important that those associations are correct. I think if there is an exact sequence match then ChimeraX will automatically make the correct association.
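As a rough illustration of how an alignment translates residue numbers, here is a small Python sketch. It is not the ChimeraX implementation. The function name residue_number_map, the homolog_start parameter, and the toy sequences are all hypothetical, and it assumes gaps are written as "-" and that the score sequence is numbered from 1.

```python
def residue_number_map(aligned_scores, aligned_homolog, homolog_start=1):
    """Map score residue numbers to homolog residue numbers.

    Both arguments are rows of the same alignment, with '-' for gaps.
    Only columns where both rows have a residue produce an association.
    """
    if len(aligned_scores) != len(aligned_homolog):
        raise ValueError("aligned rows must have equal length")
    mapping = {}
    score_num = 0                      # score sequence assumed to start at 1
    homolog_num = homolog_start - 1    # hypothetical numbering offset
    for s, h in zip(aligned_scores, aligned_homolog):
        if s != "-":
            score_num += 1
        if h != "-":
            homolog_num += 1
        if s != "-" and h != "-":
            mapping[score_num] = homolog_num
    return mapping

# Toy rows: the homolog lacks the first residue and has one insertion.
print(residue_number_map("MK-TLV", "-KATLV"))
# prints {2: 1, 3: 3, 4: 4, 5: 5}
```

Score residue 1 and homolog residue 2 each sit opposite a gap, so neither gets an association. Note that aligned residues are paired even when their amino acid types differ.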
comment:6 by , 8 months ago
I updated my mutation scores tutorial web page
https://www.rbvi.ucsf.edu/chimerax/data/mutation-scores-oct2024/mutation_scores.html
to describe associating multiple chains and associating homolog chains that have different residue numbering.
comment:7 by , 8 months ago
| Resolution: | → fixed |
|---|---|
| Status: | assigned → closed |
Added support for using Clustal Omega or custom alignment files to apply DMS scores to homolog structures.
I made a first try at this adding an "alignSequences" option to the "mutationscores structure" command that makes that command use Needleman-Wunsch sequence alignment to align the specified homologous structure sequence with the mutation scores sequence.
For example, here I try it on an Arabidopsis homolog of human ABCG2 that has 31% sequence identity and different residue numbering.
This associates any residues that get aligned; they need not have matching amino acid types. But it is severely limited in its control over the sequence alignment. The Needleman-Wunsch algorithm has lots of parameters, and the option does not let you set any of them. Another serious weakness is that it does not provide any way to show you the alignment it produced.
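To make the point about parameters concrete, here is a textbook-style Needleman-Wunsch sketch in Python. It is not the code ChimeraX uses. It assumes simple identity scoring (match and mismatch values) and a linear gap penalty, where a real implementation would typically use a substitution matrix and separate gap-open and gap-extend penalties. The default values are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return the two gapped strings."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair_score(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the corner, preferring a residue pair over a gap on ties.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("MKTLV", "KTLV"))  # prints ('MKTLV', '-KTLV')
```

Even in this reduced form there are three tunable values, and for low-identity homologs the resulting alignment can depend on them, which is why being able to see the alignment matters.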
Although this first try does not offer control over the sequence alignment, it allows all the mutationscores commands to use the residue association which may have different residue numbers between the homolog and score data.
This may work OK for high sequence identity homologs (80% identity) where it will get the alignment entirely correct without ambiguities. But more thought is needed about how to allow control of the sequence alignment in cases where the alignment is not obvious. I'm open to suggestions about how to do this. Some ideas are 1) Allow you to open a sequence alignment (e.g. a FASTA file) between the mutation score sequence and one or more homologs that you calculate however you want. 2) Give you some choice of sequence alignment algorithms (see the ChimeraX "sequence align" command, https://www.cgl.ucsf.edu/chimerax/docs/user/commands/sequence.html#align). 3) Allow you to specify parameters to tune the sequence alignment. 4) Whatever the method of specifying the alignment, you probably want to be able to show the sequence alignment to investigate places where it may be wrong.