Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#5217 closed enhancement (fixed)

blast protein lists same hit multiple times

Reported by: Elaine Meng
Owned by: Zach Pearson
Priority: high
Milestone:
Component: Sequence
Version:
Keywords:
Cc: pett
Blocked By:
Blocking:
Notify when closed:
Platform: all
Project: ChimeraX

Description

Isn't there some option to consolidate multiple hits to the same entry into one? If so, we should probably use that.

See attached image resulting from "alphafold search ldlr_human".

In Chimera we had an option "List only best-matching chain per PDB entry (default)" which was slightly different, collapsing even more things into one. Say the PDB structure is a homotetramer: chains A, B, C, and D all match, and this option would give just one of them in the output hit list (typically D, because it was evaluated last).

Attachments (1)

blast-multiples.png (417.0 KB) - added by Elaine Meng 4 years ago.


Change History (5)

by Elaine Meng, 4 years ago

Attachment: blast-multiples.png added

comment:1 by Elaine Meng, 4 years ago

Made this an enhancement rather than a bug ticket, since it does make sense that blast identifies multiple different segments of the same sequence and that these have different scores. Perhaps it is up for discussion whether they should be collapsed or not... it's a gray area, and perhaps too much work if there isn't already a built-in blast option to do it. At least in the case of PDB hits with multiple identical chains in the same entry and the same scores, it is clear that listing them only once makes sense.

comment:2 by Zach Pearson, 4 years ago

Resolution: fixed
Status: assigned → closed

This commit should take care of it. I'm assuming that if it's got the same score and match_id it's the same thing, but there are other attributes we can use to test equality if those two don't give us what we want.
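The equality assumption described above can be sketched roughly as follows. This is only an illustration, not the code from the commit: the `Hit` class, the `dedupe_hits` function, and the scores are hypothetical, and only the idea of keying on `match_id` plus score comes from this ticket.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical stand-in for a BLAST hit; not ChimeraX's actual class.
    match_id: str
    score: float


def dedupe_hits(hits):
    """Keep the first hit for each (match_id, score) pair, preserving order."""
    seen = set()
    unique = []
    for hit in hits:
        key = (hit.match_id, hit.score)  # assumed equality test
        if key in seen:
            continue
        seen.add(key)
        unique.append(hit)
    return unique


# Made-up scores, for illustration only.
hits = [
    Hit("LRP1_HUMAN", 120.0),
    Hit("LRP1_HUMAN", 120.0),  # exact duplicate, dropped
    Hit("LRP1_HUMAN", 95.5),   # same name, different score, kept
    Hit("LDLR_HUMAN", 300.0),
]
unique_hits = dedupe_hits(hits)
```

With this key, hits that share a name but have different scores remain separate rows. Collapsing them as well would mean keying on `match_id` alone.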

comment:3 by Elaine Meng, 4 years ago

Is match_id what is shown in the "Name" column? That is what I was using to judge whether the hit was the same or not. In my image (original attachment) there were several different scores associated with the same name. However, in my test it does work now in that I don't get multiples of the same name. I was kind of surprised I didn't get any LRP1_HUMAN, since that was the example with many hits in the image, but maybe that's because we are now searching a different version of the AlphaFold database, one that perhaps does not contain LRP1_HUMAN.

I would still like to put in a feature request for the "List only best-matching chain per PDB entry" option that was in Chimera. Should that be a separate ticket? Currently all hits per PDB entry are given, for example:

open 4hhb
blastprotein #1/A
... gives several hits that are 2 different chains in the same entry, e.g. 1A00_A & 1A00_C and several other pairs.
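One possible shape for the requested "best chain per entry" filter is sketched below. It assumes hit names follow the `ENTRY_CHAIN` pattern seen above (e.g. 1A00_A); the function name, the tuple layout, the entry 9ZZZ, and all scores are hypothetical. On ties it keeps the first chain seen, which differs from the Chimera behavior described in this ticket (where the last chain evaluated tended to win).

```python
def best_chain_per_entry(hits):
    """Return one (name, score) per PDB entry: the highest-scoring chain.

    `hits` is an iterable of (name, score) tuples, with names assumed to
    look like '1A00_A' (entry ID, underscore, chain ID).
    """
    best = {}
    for name, score in hits:
        entry, _, _chain = name.partition("_")
        # Strict '>' keeps the first chain seen when scores tie.
        if entry not in best or score > best[entry][1]:
            best[entry] = (name, score)
    # Dicts preserve insertion order, so entries keep first-seen order.
    return list(best.values())


# Made-up scores and a made-up second entry, for illustration only.
example = [
    ("1A00_A", 250.0),
    ("1A00_C", 250.0),
    ("9ZZZ_A", 300.0),
    ("9ZZZ_B", 310.0),
]
result = best_chain_per_entry(example)
```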

comment:4 by Zach Pearson, 4 years ago

> Is match_id what is shown in the "Name" column?

Yes it is. I'm assuming items with the same name and the same score are the same, but I can tweak those two assumptions really easily.

We should open another ticket for just showing the best chain.
