#15825 closed enhancement (fixed)

Add a PDB and AlphaFold database sequence search using RCSB mmseqs2

Reported by: Tom Goddard Owned by: Tom Goddard
Priority: moderate Milestone:
Component: Sequence Version:
Keywords: Cc:
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

It might be useful to add a PDB and AlphaFold database sequence search that uses RCSB web services. Currently we offer BLAST sequence search in ChimeraX which is slower especially for large databases. The RCSB explains on their web pages they use the fast and modern mmseqs2 instead of BLAST for searches from their web page.

My main interest in this is that I wanted to allow my new structure comparison tools (backbone traces, sequence plots, ligand aggregation, umap clustering) to work on collections of hundreds of structures from methods other than foldseek. I considered making these capabilities work with BLAST, but it seems that mmseqs2 searches have mostly supplanted BLAST, giving similar sensitivity with much higher speed for handling the large databases (AlphaFold). Since mmseqs2 was developed by the same group that make Foldseek (Steinegger lab), its web service works almost exactly the same and the returned results are very similar. So the current foldseek code and result display can almost work with an mmseqs2 server.

Change History (8)

comment:1 by Tom Goddard, 14 months ago

Here are commands I used to test the RCSB mmseqs2 server

https://seqsearch.east.k8s.rcsb.org/search
https://search.mmseqs.com/docs/

$ curl -X POST -F q=@/Users/goddard/Desktop/rcsb_mmseqs2/test.fasta -F 'mode=accept' -F 'database[]=pdb_protein_sequence' https://seqsearch.east.k8s.rcsb.org/api/ticket
{"id":"40gNfGXI_wmwVKZkz0iG7Y7BI-yduA_kZ3arSA","status":"PENDING"}

$ curl https://seqsearch.east.k8s.rcsb.org/api/ticket/40gNfGXI_wmwVKZkz0iG7Y7BI-yduA_kZ3arSA
{"id":"40gNfGXI_wmwVKZkz0iG7Y7BI-yduA_kZ3arSA","status":"COMPLETE"}

$ curl https://seqsearch.east.k8s.rcsb.org/api/result/download/40gNfGXI_wmwVKZkz0iG7Y7BI-yduA_kZ3arSA -o test.tar.gz

$ tar tf test.tar.gz
alis_pdb_protein_sequence.m8

$ tar xf test.tar.gz

$ head -1 alis_pdb_protein_sequence.m8
8KFO_test 6QS9_1 1.000 607 0 0 1 607 1 607 0.000E+00 1262 607 607 MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA

# Output does not have description of hit or species. I didn't see in the mmseqs2 code any REST API parameter to control the output format. Could probably do another RCSB query to get the PDB entry titles and polymer entity species.

# The tab separated values seem to be the standard 12 from BLAST ouptut format 6

https://www.metagenomics.wiki/tools/blast/blastn-output-format-6

  1. qseqid query or source (gene) sequence id
  2. sseqid subject or target (reference genome) sequence id
  3. pident percentage of identical positions
  4. length alignment length (sequence overlap)
  5. mismatch number of mismatches
  6. gapopen number of gap openings
  7. qstart start of alignment in query
  8. qend end of alignment in query
  9. sstart start of alignment in subject
  1. send end of alignment in subject
  2. evalue expect value
  3. bitscore bit score

plus 4 more

  1. qlen query sequence length
  2. slen subject sequence length
  3. qseq aligned part of query sequence with gaps "-"
  4. sseq aligned part of subject sequence with gaps "-"

comment:2 by Tom Goddard, 14 months ago

There are a fair number of small differences between using mmseqs2 and foldseek. The input is different (fasta versus mmcif). The database names are different. The output is different, in particular, the RCSB mmseqs2 output does not include the title of the PDB hits or species which are included with foldseek output. So a separate query would be needed to get descriptions and species. Another difference is that foldseek omit residues that don't have atomic coordinates in its alignments while mmseqs2 I believe keeps them. So associating loaded structure to the result sequences is a little different.

comment:3 by Tom Goddard, 14 months ago

I need to check if the RCSB even officially supports this web service

https://seqsearch.east.k8s.rcsb.org/search

and whether it is ok if ChimeraX uses it. I will send those questions to RCSB help desk.

comment:4 by Tom Goddard, 14 months ago

The RCSB mmseqs2 server is an internal RCSB resource and Rachel suggests I use the official search API. But I do not see how to get the official API to return the mmseqs2 alignment.

From: Rachel Kramer Green
Subject: Re: [JIRA] (HELP-21471) RCSB mmseqs2 sequence search web service in ChimeraX
Date: August 28, 2024 at 10:43:30 AM PDT
To: Tom Goddard <goddard@…>

Dear Tom,

RCSB.org uses MMseqs2 to run sequence similarity searches and you are welcome to make use of this capability. However, please access it through our official Search API (example query and results). Let us know if you need help setting this up.
The seqsearch.east.k8s.rcsb.org URL is one of our development resources and shouldn’t be used directly.
Best wishes,
Rachel


RACHEL KRAMER GREEN, PH.D.
Scientific Support & Customer Service Lead
RCSB Protein Data Bank

comment:5 by Tom Goddard, 14 months ago

From: Tom Goddard
Subject: Re: [JIRA] (HELP-21471) RCSB mmseqs2 sequence search web service in ChimeraX
Date: August 28, 2024 at 11:37:44 AM PDT
To: Info <info@…>

Hi Rachel,

Thanks for the quick response. The official search API does not give me the mmseqs2 alignment. The ChimeraX tool I am implementing uses the alignment. Is there a way to get the official RCSB search API to return the mmseqs2 alignment for each hit?

Tom

comment:6 by Tom Goddard, 14 months ago

On Aug 29, 2024, at 5:17 AM, Rachel Kramer Green wrote:

Hi Tom,

Details are available if you switch to results_verbosity: 'verbose'. Example. Is this good enough?

Best,

Rachel

From: Tom Goddard
Subject: Re: [JIRA] (HELP-21471) RCSB mmseqs2 sequence search web service in ChimeraX
Date: August 29, 2024 at 12:12:15 PM PDT
To: Info <info@…>

Thanks Rachel, "results_verbosity: 'verbose'" looks like exactly what I need to get the mmseqs2 alignment info. I will try to use that. And I will study your documentation more closely before I ask about the API again. The documents do describe the match_context output using verbose mode but I didn't find it when I searched before because the docs API don't talk specifically about mmseqs2.

Tom

comment:7 by Tom Goddard, 14 months ago

This example RCSB search query using "results_verbosity": "verbose" given to their API query editor

https://search.rcsb.org/query-editor.html

produces the mmseqs2 alignment info.

{
  "query": {
    "type": "terminal",
    "service": "sequence",
    "parameters": {
      "value": "MGSSHHHHHHSSGLVPRGSHMTEQEDVLAKELEDVNKWGLHVFRIAELSGNRPLTVIMHTIFQERDLLKTFKIPVDTLITYLMTLEDHYHADVAYHNNIHAADVVQSTHVLLSTPALEAVFTDLEILAAIFASAIHDVDHPGVSNQFLINTNSELALMYNDSSVLENHHLAVGFKLLQEENCDIFQNLTKKQRQSLRKMVIDIVLATDMSKHMNLLADLKTMVETKKVTSSGVLLLDNYSDRIQVLQNMVHCADLSNPTKPLQLYRQWTDRIMEEFFRQGDRERERGMEISPMCDKHNASVEKSQVGFIDYIVHPLWETWADLVHPDAQDILDTLEDNREWYQSTIPQS",
      "identity_cutoff": 0.4,
      "sequence_type": "protein",
      "evalue_cutoff": 0.1
    }
  },
  "return_type": "polymer_entity",
  "request_options": {
    "paginate": {
      "start": 0,
      "rows": 1399
    },
    "results_verbosity": "verbose",
    "results_content_type": [
      "experimental"
    ],
    "sort": [
      {
        "sort_by": "score",
        "direction": "desc"
      }
    ],
    "scoring_strategy": "combined"
  }
}

Results

{
	"query_id": "6c85decd-ec7b-424f-8744-226e44832838",
	"result_type": "polymer_entity",
	"total_count": 1399,
	"result_set": [
		{
			"identifier": "1XOM_1",
			"score": 1,
			"services": [
				{
					"service_type": "sequence",
					"nodes": [
						{
							"node_id": 568,
							"original_score": 709,
							"norm_score": 1,
							"match_context": [
								{
									"sequence_identity": 1,
									"evalue": 7.82e-231,
									"bitscore": 709,
									"alignment_length": 349,
									"mismatches": 0,
									"gaps_opened": 0,
									"query_beg": 1,
									"query_end": 349,
									"subject_beg": 1,
									"subject_end": 349,
									"query_length": 349,
									"subject_length": 349,
									"query_aligned_seq": "MGSSHHHHHHSSGLVPRGSHMTEQEDVLAKELEDVNKWGLHVFRIAELSGNRPLTVIMHTIFQERDLLKTFKIPVDTLITYLMTLEDHYHADVAYHNNIHAADVVQSTHVLLSTPALEAVFTDLEILAAIFASAIHDVDHPGVSNQFLINTNSELALMYNDSSVLENHHLAVGFKLLQEENCDIFQNLTKKQRQSLRKMVIDIVLATDMSKHMNLLADLKTMVETKKVTSSGVLLLDNYSDRIQVLQNMVHCADLSNPTKPLQLYRQWTDRIMEEFFRQGDRERERGMEISPMCDKHNASVEKSQVGFIDYIVHPLWETWADLVHPDAQDILDTLEDNREWYQSTIPQS",
									"subject_aligned_seq": "MGSSHHHHHHSSGLVPRGSHMTEQEDVLAKELEDVNKWGLHVFRIAELSGNRPLTVIMHTIFQERDLLKTFKIPVDTLITYLMTLEDHYHADVAYHNNIHAADVVQSTHVLLSTPALEAVFTDLEILAAIFASAIHDVDHPGVSNQFLINTNSELALMYNDSSVLENHHLAVGFKLLQEENCDIFQNLTKKQRQSLRKMVIDIVLATDMSKHMNLLADLKTMVETKKVTSSGVLLLDNYSDRIQVLQNMVHCADLSNPTKPLQLYRQWTDRIMEEFFRQGDRERERGMEISPMCDKHNASVEKSQVGFIDYIVHPLWETWADLVHPDAQDILDTLEDNREWYQSTIPQS"
								}
							]
						}
					]
				}
			]
		},
...

comment:8 by Tom Goddard, 13 months ago

Resolution: fixed
Status: assignedclosed

Done.

Added as part of the new Similar Structures tool, and the new "sequence search" command.

Note: See TracTickets for help on using tickets.