Opened 8 months ago

Closed 8 months ago

#16903 closed enhancement (fixed)

Allow homolog chains with different residue numbering to be associated with mutation scores

Reported by: Willow.Coyote-Maestas@… Owned by: Tom Goddard
Priority: moderate Milestone:
Component: Structure Analysis Version:
Keywords: Cc: Ever.ODonnell@…, Elaine Meng
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

Willow would like to associate mutation scores with homologous structures which may use different residue numbering. This will need to use a sequence alignment between the mutation score sequence and homologous sequence.

Begin forwarded message:

From: "Coyote-Maestas, Willow"
Subject: Re: Small suggestion for DMS chimera
Date: February 2, 2025 at 4:43:32 PM PST
To: Tom Goddard
Cc: "O'Donnell, Ever"

Just adding a challenge that I faced when mapping scores across these structures that I think will be a common one - many structures are solved based on a mutated form of a protein. For instance, we did a DMS on human NTCP but structures have been solved of homologs but also of a consensus sequence that is more stable. Mapping scores across these is challenging because the specific numberings may be slightly different.

Again no rush on taking care of this just thought it would be helpful info.
Be well,
Willow

Change History (7)

comment:1 by Tom Goddard, 8 months ago

Cc: Elaine Meng added

I made a first try at this, adding an "alignSequences" option to the "mutationscores structure" command. The option makes the command use Needleman-Wunsch sequence alignment to align the specified homologous structure sequence with the mutation scores sequence.
For example, here I try it on an Arabidopsis homolog of human ABCG2 that has 31% sequence identity and different residue numbering:

open 8iwj
open abcg2.csv
mutation structure #1 alignSeq true

This associates any residues that get aligned -- they need not have matching amino acid types. But it gives almost no control over the sequence alignment. The Needleman-Wunsch algorithm has lots of parameters and the option does not let you set any of them. Another serious weakness is that it provides no way to show you the alignment it produced.

Although this first try does not offer control over the sequence alignment, it allows all the mutationscores commands to use the residue association which may have different residue numbers between the homolog and score data.
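To make the residue association concrete, here is a minimal Python sketch of the idea: a plain Needleman-Wunsch global alignment followed by a walk over the aligned columns that pairs chain residue numbers with score-sequence residue numbers. It is not the ChimeraX implementation. The function names, the match/mismatch/gap scores, and the example sequences are all assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b and return the two gapped strings.

    The scoring values are illustrative assumptions, not ChimeraX defaults.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def associate_residues(score_seq, chain_seq, chain_start=1):
    """Map chain residue number -> score-sequence residue number (1-based).

    Aligned residues are paired even if their amino acid types differ.
    chain_start is the (hypothetical) number of the first chain residue.
    """
    gapped_score, gapped_chain = needleman_wunsch(score_seq, chain_seq)
    mapping = {}
    score_num, chain_num = 0, chain_start - 1
    for s, c in zip(gapped_score, gapped_chain):
        if s != "-":
            score_num += 1
        if c != "-":
            chain_num += 1
        if s != "-" and c != "-":
            mapping[chain_num] = score_num
    return mapping


# Made-up sequences: the chain lacks the first two residues and is numbered from 101.
mapping = associate_residues("MKTAYIAK", "TAYIAK", chain_start=101)
```

With these made-up sequences, chain residue 101 is paired with score residue 3. For a real low-identity homolog the result depends heavily on the scoring parameters, which is the limitation described above.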

This may work OK for high sequence identity homologs (e.g. 80% identity), where it will get the alignment entirely correct without ambiguities. But more thought is needed about how to allow control of the sequence alignment in cases where the alignment is not obvious. I'm open to suggestions about how to do this. Some ideas are 1) Allow you to open a sequence alignment (e.g. a FASTA file) between the mutation score sequence and one or more homologs, calculated however you want. 2) Give you some choice of sequence alignment algorithms (see the ChimeraX "sequence align" command, https://www.cgl.ucsf.edu/chimerax/docs/user/commands/sequence.html#align). 3) Allow you to specify parameters to tune the sequence alignment. 4) Whatever the method of specifying the alignment, you probably want to be able to show the sequence alignment to investigate places where it may be wrong.

comment:2 by Tom Goddard, 8 months ago

One problem with making sequence alignments from homologs to the mutation score sequence is that we don't have the mutation score sequence. The .csv file data has scores for many residues and identifies their amino acid type, but if there are no scores for a residue then it does not appear in the file and we don't know that part of the sequence. So if we want to do sequence alignments and properly handle the case where scores are missing for some residues, the user may need to specify the mutation scores sequence separately.
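The gap problem can be illustrated with a short Python sketch that rebuilds whatever part of the sequence the score data reveals. The CSV layout here is an assumption made for illustration: a hypothetical "mutation" column holding values such as M1A (wild-type letter, residue number, mutant letter). The real mutation scores file format may differ.

```python
import csv
import io


def partial_sequence(csv_text, column="mutation", unknown="X"):
    """Rebuild the known part of the scored sequence from mutation names.

    Assumes each value in `column` looks like 'M1A': wild-type residue,
    residue number, mutant residue. Residues without any score are filled
    with `unknown`, and residues after the last scored one are not recovered.
    """
    residues = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row[column].strip()
        wild_type, number = name[0], int(name[1:-1])
        residues[number] = wild_type
    if not residues:
        return ""
    last = max(residues)
    return "".join(residues.get(i, unknown) for i in range(1, last + 1))


# Made-up data: residue 3 has no scores, so its amino acid type is unknown.
example = "mutation,score\nM1A,0.1\nM1C,0.2\nK2A,-0.5\nA4G,0.3\n"
sequence = partial_sequence(example)
```

In this made-up example the recovered sequence has a placeholder at position 3, and nothing is known past the last scored residue, which is why a separately supplied full sequence is the safer input for alignment.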

comment:3 by Tom Goddard, 8 months ago

It might make sense to use the ChimeraX "sequence align" command, which uses the Clustal Omega alignment web service, to create a multiple sequence alignment of the mutation scores sequence and each unique associated chain sequence. Showing the resulting alignment would be useful so that alignment problems, like segments not being aligned or being misaligned, can be spotted. To correct such problems the user could provide their own multiple sequence alignment, including the mutation scores sequence and all associated sequences, for the alignSequences option to use. These ideas provide a better quality alignment, allow the user to see the alignment, and allow the user to provide a custom alignment.

comment:4 by Tom Goddard, 8 months ago

I changed the alignSequences option of the "mutationscores structure" command. It now takes a sequence of the mutation score data and then aligns the specified chain sequences using the Clustal Omega sequence alignment method (using the ChimeraX "sequence align" command). The reason it takes the mutation scores sequence is that the full sequence is not included in that data (we can figure out the sequence if every residue has scores, but in general scores may not exist for some of the residues).

mutationscores structure #1,2 alignSequences ABCG2_HUMAN

The sequence alignment uses a web service (takes about 10 seconds), so you have to be connected to the internet. The multiple sequence alignment will be shown in a separate panel. You can also provide your own multiple sequence alignment, for instance by opening a FASTA file. That creates an alignment window with an ID in its title, and you can use that ID with the alignSequences option.

open abcg2_ab25g.fasta
mutationscores structure #1,2 alignSequences abcg2_ab25g.fasta

I removed my previous alignSequences option (true or false value) that used Needleman-Wunsch alignment.

comment:5 by Tom Goddard, 8 months ago

Here are some more details about the alignSequences option. You can specify a single sequence in any of the usual ways: UniProt name or accession, chain specifier, raw sequence string, or sequence viewer ID. If you instead specify a multiple sequence alignment, the first sequence in the alignment must exactly match the mutation score data. If it does not match, an error will be raised explaining that. The alignment should contain each of the exact sequences for the chains you are associating, with the same residue numbering (i.e. the sequence starts with residue number 1). The alignment does not need to contain extra copies of the same sequence if multiple chains being associated have the same sequence. The alignment can contain extra sequences, which will be ignored. If you specify a single sequence, the Clustal Omega alignment will have that sequence first, followed by all the unique sequences in the specified chains (in any order). A multiple sequence alignment in ChimeraX has associations between open structure chains and matching sequences. The alignSequences option uses those associations to pair the specified chains with their sequences, so it is important that the associations are correct. I think if there is an exact sequence match then ChimeraX will automatically make the correct association.
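The rules for a user-supplied alignment can be sketched in Python as two checks. This is only an illustration of the requirements described above, not ChimeraX code. The function names and data shapes are assumptions: the alignment is a list of gapped strings, and the score data is a dict from residue number to amino acid letter. "Exactly match" is interpreted here as every scored residue agreeing with the first sequence.

```python
def check_first_sequence(alignment, scored_residues):
    """Return the ungapped first sequence if it agrees with the score data.

    alignment: list of gapped strings, with the score sequence expected first.
    scored_residues: dict of residue number (1-based) -> amino acid letter.
    Raises ValueError naming the first residue that disagrees.
    """
    reference = alignment[0].replace("-", "")
    for number, amino_acid in sorted(scored_residues.items()):
        if number > len(reference) or reference[number - 1] != amino_acid:
            raise ValueError(
                f"First alignment sequence does not match score data "
                f"at residue {number} (expected {amino_acid})")
    return reference


def find_alignment_rows(alignment, chain_sequences):
    """Map each unique chain sequence to the index of its alignment row.

    A chain sequence must exactly equal an ungapped row. One row serves all
    chains sharing that sequence, and rows matching no chain are ignored.
    """
    ungapped = [row.replace("-", "") for row in alignment]
    rows = {}
    for sequence in set(chain_sequences):
        if sequence not in ungapped[1:]:
            raise ValueError(f"No alignment row matches chain sequence {sequence}")
        rows[sequence] = ungapped.index(sequence, 1)
    return rows


# Made-up alignment: score sequence first, then one chain row and one extra row.
alignment = ["MKTAYIAK", "--TAYLAK", "MKTA--AK"]
reference = check_first_sequence(alignment, {1: "M", 3: "T"})
rows = find_alignment_rows(alignment, ["TAYLAK", "TAYLAK"])
```

Here two chains with the same sequence resolve to the single row at index 1, and the third row is unused, mirroring the points about duplicate chains and extra sequences.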

comment:6 by Tom Goddard, 8 months ago

I updated my mutation scores tutorial web page

https://www.rbvi.ucsf.edu/chimerax/data/mutation-scores-oct2024/mutation_scores.html

to describe associating multiple chains and associating homolog chains with different residue numbering.

comment:7 by Tom Goddard, 8 months ago

Resolution: fixed
Status: assigned → closed

Added support for using Clustal Omega or custom alignment files to apply DMS scores to homolog structures.
