#14976 closed enhancement (fixed)

Make blastprotein command load structures and alignment without GUI

Reported by: Tom Goddard Owned by: Zach Pearson
Priority: high Milestone:
Component: Sequence Version:
Keywords: Cc: Elaine Meng, Eric Pettersen
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

I'd like 3 new options to the blastprotein command

blastprotein ... gui false loadStructures true showSequenceAlignment true

that allow loading all the atomic structures found and showing the sequence alignment without doing these steps through the GUI and without showing the GUI. The use is that I am making a tool to explore the differences between tens or hundreds or thousands of homolog structures that come from PDB or the AlphaFold database, looking at kinases with Klim Verba as an example system. Currently I have to run Blast then select all the lines in the gui and press the buttons. We'll be doing this for many different Blast searches, so being able to do it by command is much easier than all the mouse clicking. In general in ChimeraX we have tried to make command equivalents for all GUI tools to allow scripting, and heavy repeated use where the GUI is not needed.

Besides these 3 new options I also want the loaded structures to be associated with their sequences in the sequence alignment. Currently when using the BLAST GUI the structures get associated automatically and that causes them to be associated with the wrong sequence alignment lines, probably because their are identical sequences in the alignment. So structure 4xv9_A should associate with the alignment sequence named 4xv9_A. In my current kinase example with 211 sequence (4xv9 chain A sequcence, 1e-100 evalue cutoff) almost every structure is associated with a different sequence in the alignment.

Change History (34)

comment:1 by Tom Goddard, 18 months ago

I'm working with this with professor Klim Verba and postdoc Katrina Black (here in Genentech hall 4th floor) and they have been excited about the umap kinase structure comparison results so far and want to use it in ChimeraX. So I'd like to get the blastprotein part working effectively as described in this ticket next week if possible. If this is too rushed, let me know and I can work on the blastprotein changes. The aim is to keep these collaborators engaged.

comment:2 by Zach Pearson, 18 months ago

Priority: moderatehigh

Raising the priority because having wrong structure-sequence association is a pretty big deal

comment:3 by Zach Pearson, 18 months ago

I have a couple questions about the desired functionality.

1) Do you want these options to print a summary of what the GUI results table would be to the log as it does in the NoGUI case?

2) Is showSequenceAlignment supposed to show the GUI multiple alignment viewer or is there some GUI-less command I should use instead?

comment:4 by Tom Goddard, 18 months ago

If the "gui false" option is used I do not expect the results table to appear in the log.

The multiple sequence alignment should be shown in its gui. The "gui false" option to the blastprotein command means don't show the BLAST results gui even though I am running this command in gui mode. The use case is that I just want to see the structure and alignment and not the table.

comment:5 by Elaine Meng, 18 months ago

The keyword "gui" might be confusing.  Is there some better word to use?  Some thoughts:  listHits false, table false, showTable false, resultsWindow false ... I'm not militant about it, though.

comment:6 by goddard@…, 18 months ago

Maybe "resultsTable false"?

comment:7 by Zach Pearson, 18 months ago

Sure. showResultsTable, even. Should showResultsTable false imply the other two options? When are you going to run a BLAST and just throw away the results?

comment:8 by Tom Goddard, 18 months ago

No I don't think showResultsTable false should effect other options. You don't know if they just want the sequence alignment, or just the loaded structures, or both.

comment:9 by Zach Pearson, 18 months ago

I've split off the second paragraph of this ticket into a new one, #15010, so this ticket can be narrower. I've implemented what was originally showGui as showResultsTable and added the options loadStructures and showSequenceAlignment. I have to double check that all databases work and then I'll push the changes for tomorrow's daily build.

comment:10 by goddard@…, 18 months ago

Sounds good.

comment:11 by Zach Pearson, 18 months ago

If you run a BLAST of pdb for 5y5s/U, the GUI will have entries for 3WMM_{1, 3, 5, 7, 9, A, D, F, I, K, O, Q, S, U, W, Y}. If you select all those rows and open them, you just get 16 different copies of 3WMM that are oriented in different directions around 5y5s/U depending on which 3WMM chain was sent to matchmaker. Opening all the hits is extremely slow; over 600 structures get opened, most of them redundant. Is that the desired behavior? Do we have the ability to open just one chain of a PDB?

comment:12 by Elaine Meng, 18 months ago

I thought I'd requested the "best hit only per PDB" feature a while ago (like years, because we had it in Chimera) but I don't remember if there is a ticket, or if so, what number it is.
Elaine

comment:13 by Zach Pearson, 18 months ago

I found the ticket and can see that the originally committed fix is in our codebase. I'm not sure why I set the visibility of the toggle checkbox I put in to 'false'? What a jerk past-me was! It will be easy to modify code I already wrote for this ticket to also provide an up-front keyword argument to filter the results.

comment:14 by Elaine Meng, 18 months ago

Great!  Without knowing your past motivations (maybe you meant to fully implement it and just didn't get there), I wouldn't case aspersions on your past self. :-)

So at least in Chimera this option would omit duplicate-ID results from the hit list entirely, which I'm hoping is the case, as opposed to only from the seq alignment and/or loaded structures.

comment:15 by Tom Goddard, 18 months ago

Regarding comment 11, yes I want to load all the copies of a matching PDB for each separate chain that matches, because those copies have different conformations (kinases form asymmetric homo-dimers). A nice additional option would be to delete the non-aligned chains when opening matching structures. But I don't want to pile on additional requests in this ticket and I don't have an immediate need for deleting the non-aligned chains -- those chains can be important to understanding why the aligned chain assumes a given conformation so for my use I want to keep them.

comment:16 by Elaine Meng, 18 months ago

I definitely want them out of the hit list, but you could make it a separate ticket to focus on the other stuff first.

comment:17 by Tom Goddard, 18 months ago

Elaine, why do you want to remove multiple chain hits from the result table? Those are useful hits and the fact that they are chains in the PDB means they are not symmetry equivalent to each other. All of them in most cases will have identical sequences, so they will all have the exact same BLAST score, so how would you choose just one -- there is no best chain based on BLAST score. I guess the results table could be consolidated so it has only one row for a PDB entry and then in another columns lists all the chains that match in that PDB. But that is going to complicate things significantly -- how will the user choose the chain they want to align when the load the structure? There is also the case where there are multiple hits to chains in the same PDB where the sequences are different, and those will need different table rows.

comment:18 by Elaine Meng, 18 months ago

This is for the majority of interactive users.  They don't want to load the same structure from the PDB N times into ChimeraX,  nor see it multiple times in the sequence alignment.   The first one (not having to deal with so many copies of the same fetch) is a big one.  Lots of people aren't familiar with easily hiding/showing using the Models Panel, or even understand why they got 4 of the same thing, or why they are placed differently when the multiple chain copies usually look the same as each other.

 They may want to easily search for structures with ligands, certain resolution etc. without having double (or more) the vertical scrolling of hits to get through, or laboriously selecting every other row in the hit list to load only unique hits or get a less biased sequence alignment.

Typically they will get only a handful of structures at most and easily see when they are homomultimers, and in the (small minority I think) of cases where the different subunits actually have meaningfully different conformations from one another, they could explore that manually.

I feel that this is all so obvious and more convenient in the vast majority of use cases I'm kind of surprised you would even ask.  Most people are not trying to find every single conformation of every related structure in the PDB when they blast the PDB.

This is why we had it in Chimera (not even at my request) and it  was on by default.

Elaine

comment:19 by Tom Goddard, 18 months ago

I don't object to having an option to somehow filter out all but one chain from a structure, although I don't know how it will choose which one. I think the desired setting of that option will depend on what you are looking for. If you are interested in how the structures align (in which case you need to decide which chain to align and probably need to look at the different chains -- they often have different missing residues). But if you are only interested in which structures contain one or more of the chain, or which ligands can be found with the chain then you probably do want to filter to limit clutter in the table. I'm also fine with the default value of the option being either way. I think if it is one line per PDB in the results it really should list all the chains that matched in another column.

comment:20 by Elaine Meng, 18 months ago

Sequence alignment score to the query was the criterion in Chimera, and was identical for all the monomers, so you would get the last one (e.g. chain D of homotetramer).  I'm fine with that behavior.  It makes sense because it is evaluated before any structure is fetched.  I personally don't even think it needs to list all the chains that got the same score ... if you wanted the duplicates, you would turn off the option to omit them.  I'd rather just have the option than block its implementation by making it harder to implement.

When the option is on, the structure fetch for the unique hit will still get the whole asymmetric unit with all of the copies. So even in the unlikely case that some user turns the option off but still wants to explore the fit of each monomer, it can be done without adding to the number of models/atoms in the scene.

Elaine

comment:21 by Zach Pearson, 18 months ago

We don't actually load different PDBs for each entry, although it sounds like we should. I double checked Conrad's code, and even it split the PDB ID from its chain and ran the ID through the ChimeraX "open" command -- a behavior I kept when I ported blastprotein from HTML to QT.

comment:22 by Tom Goddard, 18 months ago

The current blast protein tool shows each PDB entry chain that matches in a separate results table row and if you select and load that row then opens the full PDB entry and aligns the specific chain to the reference structure used to do the search if it used a reference structure. If you select multiple table rows that all are for chains of the same PDB then the full PDB structure is opened for each rows and the specific chain for that row is aligned to the reference structure.

comment:23 by Zach Pearson, 18 months ago

Oh, I see now I misread what you wrote in comment 15 -- sorry!

comment:24 by Zach Pearson, 18 months ago

Resolution: fixed
Status: assignedclosed

I've pushed the changes. YMMV, it was fast for an Alphafold version 2 search (14 results) and gave me the beach ball of death until I got tired of waiting for a new frame and killed ChimeraX for a PDB search (>600 results).

comment:25 by Tom Goddard, 18 months ago

Thanks! I tested it using 4xv9 chain A with evalue cutoff 1e-100 and it showed the structures and sequence alignment in 73 seconds, although all the structures had already been fetched and were cached in my Downloads directory. This is the same speed I get using the Blast GUI.

comment:26 by Elaine Meng, 18 months ago

Is the loadStructures option only applicable to searching pdb and alphafold and esmfold, or does it also work to get all of the pdb-chain entry hits in the other searches (e.g. "NR" includes PDB)?  I just generated a traceback trying to formulate a short uniref50 search with turning on these new options.  Unfortunately the short search I tried didn't have any structure hits which is probably the cause of the traceback, ticket #15025.


comment:27 by Elaine Meng, 18 months ago

Minor wording thing: in the Blast results panel, can "List only best chain per hit" instead be "List only best-matching chain per PDB entry"?   

To me at least, each chain is a hit, so "chain per hit" doesn't make sense.  And as far as I know this pertains only to PDB since AlphaFold and ESMFold entries are all monomers, right?

comment:28 by Zach Pearson, 18 months ago

Yes, it attempts to get structures for all databases. I'll change the wording of the string.

comment:29 by Elaine Meng, 18 months ago

(1) I still don't know what is supposed to happen for nr and uniref databases, which as I understand it include as a subset PDB protein chain sequences.  I was wondering if loadStructures true would automatically open all the hits from their PDB subsets when these other databases are searched, or only when PDB was the specified search database.

(2) is this correct: loadStructures applies to searching alphafold and esmfold databases too

(3) is this correct: showResultsTable is default true, loadStructures default false, showSequenceAlignment default false, onlyBest false.

comment:30 by Zach Pearson, 18 months ago

1) I split hit names on underscore and if the first field isn't a PDB ID it doesn't get loaded, but it will look at *all* hits from *all* databases for those IDs and try to load them.
2) Yes
3) Yes

comment:31 by Elaine Meng, 18 months ago

Resolution: fixed
Status: closedreopened

wording (as per comment 27) still not changed as of UCSF ChimeraX version: 1.8.dev202404300104 (2024-04-30)

comment:32 by Zach Pearson, 18 months ago

OK, last question before closing this. It's really not hard to apply the filtering to each database, it's just maybe meaningless for databases that aren't PDB because the results are probably all unique. Should it be an error to pass in onlyBest=true or a warning in the log (or info, maybe) -- and should the checkbox be hidden for databases that aren't PDB?

comment:33 by Elaine Meng, 18 months ago

The suggested wording is "List only best-matching chain per PDB entry" which should serve to notify the user that it only pertains to PDB hits.  I don't think you have to be fancy and hide it if some other database is searched, unless (A) doing that is really easy and (B) you want to.  If this option is checked and PDB is not searched, I don't think the user should be accosted with any errors or warnings, it should  simply be ignored.

comment:34 by Zach Pearson, 18 months ago

Resolution: fixed
Status: reopenedclosed
Note: See TracTickets for help on using tickets.