#5217 closed enhancement (fixed)
blast protein lists same hit multiple times
Reported by: | Elaine Meng | Owned by: | Zach Pearson |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | Sequence | Version: | |
Keywords: | | Cc: | pett |
Blocked By: | | Blocking: | |
Notify when closed: | | Platform: | all |
Project: | ChimeraX | | |
Description
Isn't there some option to consolidate multiple hits to the same entry into one? If so, we should probably use that.
See attached image resulting from "alphafold search ldlr_human".
In Chimera we had an option "List only best-matching chain per PDB entry (default)", which was slightly different and collapsed even more things into one. Say the PDB structure is a homotetramer: chains A, B, C, and D all match, and this option would list just one of them in the output hit list (typically D, because it was evaluated last).
Attachments (1)
Change History (5)
by , 4 years ago
Attachment: | blast-multiples.png added |
---|
comment:1 by , 4 years ago
comment:2 by , 4 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
This commit should take care of it. I'm assuming that if it's got the same score and match_id it's the same thing, but there are other attributes we can use to test equality if those two don't give us what we want.
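As a rough illustration of the equality test described above (this is not the actual commit), duplicates can be dropped by treating the pair (match_id, score) as the identity of a hit. The `Hit` type and the `drop_duplicate_hits` function are hypothetical names made up for this sketch; only `match_id` and the score come from the comment.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical stand-in for a BLAST result row.
    match_id: str  # the value shown in the "Name" column
    score: float   # the hit score


def drop_duplicate_hits(hits):
    """Keep the first hit for each (match_id, score) pair, preserving order."""
    seen = set()
    unique = []
    for hit in hits:
        key = (hit.match_id, hit.score)
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique
```

Hits that share a name but differ in score are kept as separate rows under this test. Other attributes could be added to the key if name and score turn out not to be enough.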
comment:3 by , 4 years ago
Is match_id what is shown in the "Name" column? That is what I was using to judge whether the hit was the same or not. In my image (original attachment) there were several different scores associated with the same name. However, in my test it does work now, in that I don't get multiples of the same name. I was kinda surprised I didn't get any LRP1_HUMAN, since that was the example with many hits in the image, but maybe that's because we are now searching a different version of the AlphaFold database, which perhaps does not contain LRP1_HUMAN.
I would still like to put in a feature request for the "List only best-matching chain per PDB entry" option that was in Chimera. Should that be a separate ticket? Currently all hits per PDB entry are given, for example:
open 4hhb
blastprotein #1/A
... gives several hits that are 2 different chains in the same entry, e.g. 1A00_A & 1A00_C and several other pairs.
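A minimal sketch of what "best-matching chain per PDB entry" could look like, assuming hit names have the form ENTRY_CHAIN (as in 1A00_A) and that a higher score means a better match. The function name `best_chain_per_entry` and the (name, score) tuple layout are hypothetical, not part of ChimeraX.

```python
def best_chain_per_entry(hits):
    """Return one (name, score) pair per PDB entry: the best-scoring chain.

    `hits` is an iterable of (name, score) pairs, with names such as "1A00_A".
    Assumes a higher score is a better match.
    """
    best = {}
    for name, score in hits:
        # Everything before the first underscore is taken as the entry ID.
        entry = name.split("_", 1)[0]
        # Strict ">" keeps the first chain seen when scores are tied.
        if entry not in best or score > best[entry][1]:
            best[entry] = (name, score)
    # Dicts preserve insertion order, so entries come out in first-seen order.
    return list(best.values())
```

With tied scores this keeps the first chain encountered. Changing the comparison to `>=` would keep the last one instead, which is closer to the Chimera behavior described in the ticket description.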
comment:4 by , 4 years ago
> Is match_id what is shown in the "Name" column?
Yes it is. I'm assuming items with the same name and the same score are the same, but I can tweak those two assumptions really easily.
We should open another ticket for just showing the best chain.
Made this an enhancement rather than a bug ticket, since it does make sense that BLAST identifies multiple different segments of the same sequence, and these have different scores. Whether they should be collapsed is perhaps up for discussion... it's a gray area, and perhaps too much work if there isn't already a built-in BLAST option to do it. At least for PDB hits where multiple identical chains in the same entry have the same score, it clearly makes sense to list the hit only once.