Opened 14 months ago
Closed 13 months ago
#15825 closed enhancement (fixed)
Add a PDB and AlphaFold database sequence search using RCSB mmseqs2
Reported by: | Tom Goddard | Owned by: | Tom Goddard |
---|---|---|---|
Priority: | moderate | Milestone: | |
Component: | Sequence | Version: | |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Notify when closed: | Platform: | all | |
Project: | ChimeraX |
Description
It might be useful to add a PDB and AlphaFold database sequence search that uses RCSB web services. Currently we offer BLAST sequence search in ChimeraX which is slower especially for large databases. The RCSB explains on their web pages they use the fast and modern mmseqs2 instead of BLAST for searches from their web page.
My main interest in this is that I wanted to allow my new structure comparison tools (backbone traces, sequence plots, ligand aggregation, umap clustering) to work on collections of hundreds of structures from methods other than foldseek. I considered making these capabilities work with BLAST, but it seems that mmseqs2 searches have mostly supplanted BLAST, giving similar sensitivity with much higher speed for handling the large databases (AlphaFold). Since mmseqs2 was developed by the same group that make Foldseek (Steinegger lab), its web service works almost exactly the same and the returned results are very similar. So the current foldseek code and result display can almost work with an mmseqs2 server.
Change History (8)
comment:1 by , 14 months ago
comment:2 by , 14 months ago
There are a fair number of small differences between using mmseqs2 and foldseek. The input is different (fasta versus mmcif). The database names are different. The output is different, in particular, the RCSB mmseqs2 output does not include the title of the PDB hits or species which are included with foldseek output. So a separate query would be needed to get descriptions and species. Another difference is that foldseek omit residues that don't have atomic coordinates in its alignments while mmseqs2 I believe keeps them. So associating loaded structure to the result sequences is a little different.
comment:3 by , 14 months ago
I need to check if the RCSB even officially supports this web service
and whether it is ok if ChimeraX uses it. I will send those questions to RCSB help desk.
comment:4 by , 14 months ago
The RCSB mmseqs2 server is an internal RCSB resource and Rachel suggests I use the official search API. But I do not see how to get the official API to return the mmseqs2 alignment.
From: Rachel Kramer Green
Subject: Re: [JIRA] (HELP-21471) RCSB mmseqs2 sequence search web service in ChimeraX
Date: August 28, 2024 at 10:43:30 AM PDT
To: Tom Goddard <goddard@…>
Dear Tom,
RCSB.org uses MMseqs2 to run sequence similarity searches and you are welcome to make use of this capability. However, please access it through our official Search API (example query and results). Let us know if you need help setting this up.
The seqsearch.east.k8s.rcsb.org URL is one of our development resources and shouldn’t be used directly.
Best wishes,
Rachel
RACHEL KRAMER GREEN, PH.D.
Scientific Support & Customer Service Lead
RCSB Protein Data Bank
comment:5 by , 14 months ago
From: Tom Goddard
Subject: Re: [JIRA] (HELP-21471) RCSB mmseqs2 sequence search web service in ChimeraX
Date: August 28, 2024 at 11:37:44 AM PDT
To: Info <info@…>
Hi Rachel,
Thanks for the quick response. The official search API does not give me the mmseqs2 alignment. The ChimeraX tool I am implementing uses the alignment. Is there a way to get the official RCSB search API to return the mmseqs2 alignment for each hit?
Tom
comment:6 by , 14 months ago
On Aug 29, 2024, at 5:17 AM, Rachel Kramer Green wrote:
Hi Tom,
Details are available if you switch to results_verbosity: 'verbose'. Example. Is this good enough?
Best,
Rachel
From: Tom Goddard
Subject: Re: [JIRA] (HELP-21471) RCSB mmseqs2 sequence search web service in ChimeraX
Date: August 29, 2024 at 12:12:15 PM PDT
To: Info <info@…>
Thanks Rachel, "results_verbosity: 'verbose'" looks like exactly what I need to get the mmseqs2 alignment info. I will try to use that. And I will study your documentation more closely before I ask about the API again. The documents do describe the match_context output using verbose mode but I didn't find it when I searched before because the docs API don't talk specifically about mmseqs2.
Tom
comment:7 by , 14 months ago
This example RCSB search query using "results_verbosity": "verbose" given to their API query editor
produces the mmseqs2 alignment info.
{ "query": { "type": "terminal", "service": "sequence", "parameters": { "value": "MGSSHHHHHHSSGLVPRGSHMTEQEDVLAKELEDVNKWGLHVFRIAELSGNRPLTVIMHTIFQERDLLKTFKIPVDTLITYLMTLEDHYHADVAYHNNIHAADVVQSTHVLLSTPALEAVFTDLEILAAIFASAIHDVDHPGVSNQFLINTNSELALMYNDSSVLENHHLAVGFKLLQEENCDIFQNLTKKQRQSLRKMVIDIVLATDMSKHMNLLADLKTMVETKKVTSSGVLLLDNYSDRIQVLQNMVHCADLSNPTKPLQLYRQWTDRIMEEFFRQGDRERERGMEISPMCDKHNASVEKSQVGFIDYIVHPLWETWADLVHPDAQDILDTLEDNREWYQSTIPQS", "identity_cutoff": 0.4, "sequence_type": "protein", "evalue_cutoff": 0.1 } }, "return_type": "polymer_entity", "request_options": { "paginate": { "start": 0, "rows": 1399 }, "results_verbosity": "verbose", "results_content_type": [ "experimental" ], "sort": [ { "sort_by": "score", "direction": "desc" } ], "scoring_strategy": "combined" } }
Results
{ "query_id": "6c85decd-ec7b-424f-8744-226e44832838", "result_type": "polymer_entity", "total_count": 1399, "result_set": [ { "identifier": "1XOM_1", "score": 1, "services": [ { "service_type": "sequence", "nodes": [ { "node_id": 568, "original_score": 709, "norm_score": 1, "match_context": [ { "sequence_identity": 1, "evalue": 7.82e-231, "bitscore": 709, "alignment_length": 349, "mismatches": 0, "gaps_opened": 0, "query_beg": 1, "query_end": 349, "subject_beg": 1, "subject_end": 349, "query_length": 349, "subject_length": 349, "query_aligned_seq": "MGSSHHHHHHSSGLVPRGSHMTEQEDVLAKELEDVNKWGLHVFRIAELSGNRPLTVIMHTIFQERDLLKTFKIPVDTLITYLMTLEDHYHADVAYHNNIHAADVVQSTHVLLSTPALEAVFTDLEILAAIFASAIHDVDHPGVSNQFLINTNSELALMYNDSSVLENHHLAVGFKLLQEENCDIFQNLTKKQRQSLRKMVIDIVLATDMSKHMNLLADLKTMVETKKVTSSGVLLLDNYSDRIQVLQNMVHCADLSNPTKPLQLYRQWTDRIMEEFFRQGDRERERGMEISPMCDKHNASVEKSQVGFIDYIVHPLWETWADLVHPDAQDILDTLEDNREWYQSTIPQS", "subject_aligned_seq": "MGSSHHHHHHSSGLVPRGSHMTEQEDVLAKELEDVNKWGLHVFRIAELSGNRPLTVIMHTIFQERDLLKTFKIPVDTLITYLMTLEDHYHADVAYHNNIHAADVVQSTHVLLSTPALEAVFTDLEILAAIFASAIHDVDHPGVSNQFLINTNSELALMYNDSSVLENHHLAVGFKLLQEENCDIFQNLTKKQRQSLRKMVIDIVLATDMSKHMNLLADLKTMVETKKVTSSGVLLLDNYSDRIQVLQNMVHCADLSNPTKPLQLYRQWTDRIMEEFFRQGDRERERGMEISPMCDKHNASVEKSQVGFIDYIVHPLWETWADLVHPDAQDILDTLEDNREWYQSTIPQS" } ] } ] } ] }, ...
comment:8 by , 13 months ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Done.
Added as part of the new Similar Structures tool, and the new "sequence search" command.
Here are commands I used to test the RCSB mmseqs2 server
$ curl -X POST -F q=@/Users/goddard/Desktop/rcsb_mmseqs2/test.fasta -F 'mode=accept' -F 'database[]=pdb_protein_sequence' https://seqsearch.east.k8s.rcsb.org/api/ticket
{"id":"40gNfGXI_wmwVKZkz0iG7Y7BI-yduA_kZ3arSA","status":"PENDING"}
$ curl https://seqsearch.east.k8s.rcsb.org/api/ticket/40gNfGXI_wmwVKZkz0iG7Y7BI-yduA_kZ3arSA
{"id":"40gNfGXI_wmwVKZkz0iG7Y7BI-yduA_kZ3arSA","status":"COMPLETE"}
$ curl https://seqsearch.east.k8s.rcsb.org/api/result/download/40gNfGXI_wmwVKZkz0iG7Y7BI-yduA_kZ3arSA -o test.tar.gz
$ tar tf test.tar.gz
alis_pdb_protein_sequence.m8
$ tar xf test.tar.gz
$ head -1 alis_pdb_protein_sequence.m8
8KFO_test 6QS9_1 1.000 607 0 0 1 607 1 607 0.000E+00 1262 607 607 MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA
# Output does not have description of hit or species. I didn't see in the mmseqs2 code any REST API parameter to control the output format. Could probably do another RCSB query to get the PDB entry titles and polymer entity species.
# The tab separated values seem to be the standard 12 from BLAST ouptut format 6
plus 4 more