[Chimera-users] Analysis of the multiple alignments + structures

Mon Sep 1 13:07:39 PDT 2014

Hi James,
That’s a really broad question. Usually that is exactly why people are interested in showing conservation on a structure: to highlight the important residues and, in combination with structural analysis and/or mutagenesis, to suggest their roles in structure and function.  I don’t have specific papers in mind, but there should be no shortage if you try a few keyword searches.

My page on “sources of sequence alignments” has a bit more discussion of how to choose the set of sequences, or the source (web database or web server) of that set, for calculating conservation.
<http://www.cgl.ucsf.edu/home/meng/sources.html>
(That page is also linked to the “mapping sequence conservation” tutorial.)  

However, your question is really about making judgement calls as they pertain to your specific research project, which may be beyond the scope of this Chimera list, and further, I don’t claim to be an authority on the subject.  Logic would dictate that if you are looking for residues important in function X, you would want to include only sequences of proteins that perform function X, but if you are looking for patterns of conservation among related functions X, Y, Z in some set of homologous proteins, you would use a broader set.  Maybe it is known what percent IDs give the proper boundaries in your situation, but more often it is not.  By boundaries I mean how much variability the set can contain before it starts to include proteins that don’t have the function of interest.

The percent ID filtering you mention is a little bit different: that filtering removes sequences that are more similar to another sequence in the set than the specified cutoff. It can be considered separately from the issue of how broadly variable the set is in the first place (for example, only class A GPCRs, only amine neurotransmitter-binding GPCRs, or only Gs-stimulating GPCRs).
In other words, a data set could be big for at least two different reasons: (1) it’s broad, (2) it’s redundant. For redundancy-filtering, I’d personally only filter down far enough to get an alignment that is not too big for Chimera, and err on the side of including lots of sequences as long as the overall range of the set is appropriate for the function of interest.  There are sequence-weighting options in the AL2CO conservation scoring in Chimera that may help to mitigate any under- and over-representation of subsets of the sequences.
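
To make the broad-versus-redundant distinction concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how Chimera or AL2CO does anything: the function names, the toy sequences, the 80% cutoff, and the convention for counting gapped columns are all made up for the example. It computes percent identity between rows of an existing alignment, greedily drops near-duplicates, and reports the range of identities to the target as a crude measure of breadth.

```python
# Illustrative sketch only: toy data and hypothetical function names.
GAPS = "-."


def percent_identity(a, b):
    """Percent identity between two rows of the same alignment.

    Assumed convention: columns where both rows have a gap are ignored, and
    the denominator is the number of remaining columns. Other conventions
    exist and give different numbers.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if not (x in GAPS and y in GAPS)]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x not in GAPS)
    return 100.0 * matches / len(pairs)


def filter_redundant(alignment, cutoff):
    """Greedy redundancy filter over (name, aligned_sequence) pairs.

    A sequence is kept only if it is below `cutoff` percent identity to
    every sequence already kept. The result depends on input order, so the
    target is placed first to make sure it is retained.
    """
    kept = []
    for name, seq in alignment:
        if all(percent_identity(seq, kept_seq) < cutoff
               for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


def identity_range_to_target(alignment):
    """(min, max) percent identity of the other rows to the first row."""
    target = alignment[0][1]
    values = [percent_identity(target, seq) for _, seq in alignment[1:]]
    return min(values), max(values)


# Made-up aligned sequences; the first entry plays the role of the target.
toy_alignment = [
    ("target",    "MKTAYIAKQR"),
    ("near_copy", "MKTAYIAKQK"),
    ("homolog",   "MRTSYLAHQR"),
    ("distant",   "M-TGFLSH-R"),
]

if __name__ == "__main__":
    filtered = filter_redundant(toy_alignment, cutoff=80.0)
    print("kept:", [name for name, _ in filtered])
    print("identity range before filtering:",
          identity_range_to_target(toy_alignment))
    print("identity range after filtering:",
          identity_range_to_target(filtered))
```

In this toy case the filter removes only the near-duplicate of the target. The lowest identity to the target, i.e. how far the set reaches, is unchanged, and whether that reach is appropriate still has to be judged against the function of interest.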

Disclaimer:  this is just my opinion as someone who has dabbled in the area, and others with more experience may have divergent views or better ideas of how to attack the problem!

I hope this helps,
Elaine
-----
Elaine C. Meng, Ph.D.                       
UCSF Computer Graphics Lab (Chimera team) and Babbitt Lab
Department of Pharmaceutical Chemistry
University of California, San Francisco

On Sep 1, 2014, at 2:55 AM, James Starlight <jmsstarlight at gmail.com> wrote:

> Elaine,
> 
> I have not fully finished your tutorial, but I already have one question: have you seen any interesting papers covering my problem, namely predicting the functional properties of some amino acids (motifs) based on analysis of the 3D structure of the protein of interest together with analysis of the sequences of closely related homologues?
> 
> In particular, I'm interested in how many sequences I should include in the MSA and what sequence identity threshold (against my target protein) should be chosen. E.g., in a case where I deal with a set of G-protein coupled receptors (which have low sequence similarity but high structure conservation): I've obtained 2 different pictures of the conserved a.a. motifs when I used i) only several templates with low (40%) sequence identity vs. ii) a lot of sequences with higher identity (up to 60%). In the latter case I obtained much greater conservation in the motifs seen in the analysis of SS (which is trivial!), whereas in case i) there were only several highly conserved motifs. Does it mean that analysis of BIG datasets with higher sequence identity produces greater uncertainty in the final results, because we cannot tell which conserved elements are *really* functionally important?
> 
> James
>