#15591 closed enhancement (fixed)

Add Foldseek structure similarity search

Reported by: Tom Goddard Owned by: Tom Goddard
Priority: moderate Milestone:
Component: Structure Comparison Version:
Keywords: Cc:
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

Martin Steinegger asked if we could add a Foldseek search tool to ChimeraX. He says we can use their server to run the searches.

I've been wanting to add this for a few years since the biorxiv Foldseek preprint came out.

From: Martin Steinegger
Subject: Re: Using ColabFold in ChimeraX
Date: June 3, 2024 at 6:17:18 AM PDT
To: Tom Goddard
Cc: Cam Gilchrist

Hi Tom,

I hope you are doing well. I wanted to ask if you've had the chance to look into integrating Foldseek into ChimeraX.
We believe this would be a valuable addition to the community and are happy to assist with the integration if needed.
Also our API is not busy at all, so the extra searches should not be an issue.

Additionally, I would like to discuss another project from my postdoc, Cam (cc'd), who has developed a very fast multiple structure alignment tool.
We are curious if you have considered visualizing multiple structure alignments in ChimeraX.

Looking forward to your thoughts. Thank you so much.

Best regards,
Martin

Change History (8)

comment:1 by Tom Goddard, 16 months ago

I've added a foldseek command and gui user interface (menu Tools / Structure Analysis / Foldseek) that searches a single chain against the PDB or AlphaFold databases and displays results in a table. Selecting rows of the table and pressing an Open button can open them, align them to the query and trim them. A Coverage button can show a plot of the aligned sequences to see what parts of the structure were found and shows sequence identity and that plot is interactive and can open specific hits or select specific (e.g. conserved) residues using a context menu.

The maximum number of hits appears to be limited to 1000 and often queries produce a few hundred to 1000 hits.

I've been testing with PDB 8jnb, 8fse chain A, 6x0t chain X.

The PDB 8cop chain A returns an error status from the Foldseek server. Not sure why. I see it starts at sequence number -1 (not observed). I renumbered residues starting at 100, still gives a server error. Maybe there are non-standard residues. Don't see any. Low priority. Can ask Martin to provide better error message from server.

Version 0, edited 16 months ago by Tom Goddard (next)

comment:2 by Tom Goddard, 16 months ago

The PDB 8cop chain A returns an error status from the Foldseek server. Not sure why. I see it starts at sequence number -1 (not observed). I renumbered residues starting at 100, still gives a server error. Maybe there are non-standard residues. Don't see any. Low priority. Can ask Martin to provide better error message from server.

Hmm. 8giv also failed.

Foldseek query failed: FoldSeek jobs failed {'id': 'YX5MblT_ivgErB2CcBCKv9yuxcGwb-peeugE1A', 'status': 'ERROR'}

But both 8cop and 8giv worked with the Foldseek web page.

8jyd also failed. 1grl_A failed. 1svy failed. AlphaFold Q20655 failed. Appears that everything is failing except structures I ran before. And even the structures I ran before failed if I just change the name of the model which changes one line in the mmCIF query "data_8jnb_B" to "data_8jnb_test_B". Probably the server has a cache based on mmcif file hashes.

Still the Foldseek web page is working when I change the "data_" line in the input. So it seems only the REST API is broken.

Some testing with the Foldseek web page shows that if the mmCIF input starts with the data block line "data_xyz" but if it starts with a comment "# something" then it fails. ChimeraX writes mmcif files starting with a comment. Apparently the Foldseek server changed its mmcif parsings. I sent email to Martin Steinegger requesting a fix to the server mmcif handling.

I made the ChimeraX Foldseek tool strip the leading mmcif comment lines on July 10, 2024 so the jobs succeed until the Foldseek maintainers (Milot?) fix it.

Last edited 16 months ago by Tom Goddard (previous) (diff)

comment:3 by Tom Goddard, 16 months ago

Ideas for next steps.

1) Show multiple possible structure alignments somehow. The pruned structure alignment with a cutoff distance surprisingly often doesn't align the most residues. I tried doing a pruned alignment with the residues not used by the first pruned alignment and in the top 4 foldseek hits for PDB 8fse the second alignments all gave more aligned residues! 8fse is a multi-domain protein and hits have different domain packing so this is not too surprising. But it indicates that opening a hit with any one alignment is missing a lot of the picture.

2) Local distance difference test, LDDT. One way to score the variable conformations of multi-domain proteins is to compute LDDT. The Foldseek web page alignment view reports the average LDDT per residue. We could easily compute this using just the C-alpha positions in the results for every residue of every hit and use it to color the sequence plot view and opened structures.

3) Report % close at 2A, 4A, 8A. Currently the results table reports the percent of close residues at 2A distance cutoff. Might be useful to also report at 4A and 8A cutoff because sometimes the helix packing of hits gradually changes across a domain and this would reveal that.

4) Show sequence conservation. Would be nice to assign an attribute or directly color query sequence by conservation. Current sequence plot coloring can show high identity (> 30%) positions as red, and matching positions as green. Should also be able to show conserved positions that differ from query say in blue.

5) Filter by highly conserved residues. Might be useful to filter to sequences that have at least some fraction of the most highly conserved residues.

6) Show all bound ligands from hits. Could load each hit automatically, find all ligands that have nearby aligned residues, align those residues to query and if the RMSD is not too big, then map that ligand to the query. Form a new model containing all the mapped ligands using chain ids that include the PDB id. Could look at all non-polymer residues including waters and ions, or could exclude those. Probaby require at least 50% of residues near ligand in hit have alignment to query, and require RMSD of those aligned to query to be better than 4 Angstroms. All thresholds can be user specified.

7) Aggregate Uniprot per-residue annotations from all hits into a single list using query residue numbers. Might initially focus on just highly conserved positions. Probably will need away to summarize so the same annotation is not listed 100 times for a residue. Will need a way to refer back to which PDBs and Uniprot entries the annotations came from.

8) Color query domains by aligned hit regions. Might be useful to see which parts of the query have good structure matches and color different parts with different colors as a way to see the subunits of the query that are packed differently in the hits. Not sure exactly how to do this.

9) Animate between different query conformations observed among the hits. This could show the possible flexibility of the query to rearrange. Not sure exactly how to do this.

comment:4 by Tom Goddard, 16 months ago

Results for 8kcl include 8bdw_H-2. The H-2 is an actual chain id of the rcsb bioassembly (open 8bdw from rcsb_bio). But my code is fetching the standard PDB entry that has only chain id H. I think it doesn't make sense to use the bioassembly structures for monomer searches. So maybe a work-around is to strip from chain ids any '-' and following characters.

Stripped the "-" part.

Last edited 16 months ago by Tom Goddard (previous) (diff)

comment:5 by Tom Goddard, 16 months ago

For PDB 8kcl searching PDB, 30% of the hits covering the second domain have near 0 LDDT scores across the whole domain. Usually the first domain has some regions of decent LDDT. Why is foldseek including the second domain when it has no apparent sequence or structure similarity to the 8kcl second domain?

If this is common I may need to allow trimming the result alignments to exclude these bad regions. I can automatically determine the domain boundaries in the sequence by clustering the alignment end positions and then reject domains of sequences that have horrible LDDT and low sequence identity.

comment:6 by Tom Goddard, 15 months ago

Foldseek omits residues that are not in a list of about 130 3-letter codes. This broke the ChimeraX foldseek interpretation of the sequence alignment and was fixed as described in bug report #15652.

comment:7 by Tom Goddard, 15 months ago

ChimeraX sometimes reports a residue 1-letter code as X, such as 6uyy /ALY:39 even though the mod-res mmcif tables says it is based on lysine K. Foldseek alignments report it as K. This was causing the ChimeraX foldseek code to raise an error due to sequence mismatch. I reduced the error to a warning if ChimeraX is reporting an X as described in #15653. Also reported the ChimeraX mmCIF reading bug as #15654.

comment:8 by Tom Goddard, 12 months ago

Resolution: fixed
Status: assignedclosed

This has been added as the Similar Structures tools.

Note: See TracTickets for help on using tickets.