Opened 12 months ago

Last modified 12 months ago

#16207 assigned defect

redundant foldseek results?

Reported by: Elaine Meng Owned by: Tom Goddard
Priority: low Milestone:
Component: Sequence Version:
Keywords: Cc:
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

I don't know if this is a bug or some issue with the pdb100 database, but when I open 1hxz and do

foldseek /D save keep database pdb

I get 14 results, and lines 1 and 2 are identical, as are lines 3 and 4.
See the attached screenshot.

Probably not important, but it could be if bigger searches return many duplicates.

Attachments (2)

Screenshot 2024-10-28 at 3.42.47 PM.png (576.9 KB ) - added by Elaine Meng 12 months ago.
weird.cxs (108.5 KB ) - added by Elaine Meng 12 months ago.
session

Change History (5)

by Elaine Meng, 12 months ago

Attachment: weird.cxs added

session

comment:1 by Tom Goddard, 12 months ago

Foldseek is a bit wacky. The output, as visible in the .sms file ChimeraX saves in ~/Downloads/ChimeraX/Foldseek/1hxz_foldseek_pdb.sms, indicates where the problem lies. The Foldseek pdb100 database uses PDB bioassembly files, so the two 1hxz_D entries are listed in the Foldseek output as 1hxz-assembly1.cif.gz_D and 1hxz-assembly1.cif.gz_D-2. If you create the assembly from the 1hxz mmCIF, it is a dimer with two copies of chain D. Foldseek should filter these out since they are identical, but apparently it does not.

I can filter these out. My guess is that Foldseek keeps them because it can do multimer searches, and in that case duplicate chains may pack differently, so all copies may be needed to find a matching complex.
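A minimal sketch of what such filtering could look like, assuming the hit names always follow the pattern seen above (entry-assemblyN.cif.gz_chain, with an optional -copy suffix for additional assembly copies). The function names and the regular expression are illustrative only and are not ChimeraX or Foldseek code.

```python
import re

# Assumed name pattern, inferred only from the two names quoted in this ticket:
#   <entry>-assembly<N>.cif.gz_<chain>[-<copy>]
_HIT_NAME = re.compile(
    r"^(?P<entry>[0-9A-Za-z]+)-assembly(?P<assembly>\d+)"
    r"\.cif\.gz_(?P<chain>[^-]+)(?:-(?P<copy>\d+))?$"
)


def hit_key(name):
    """Return a key that ignores the assembly copy suffix.

    Names that do not match the assumed pattern are returned unchanged,
    so they are never merged with anything else.
    """
    m = _HIT_NAME.match(name)
    if m is None:
        return name
    return (m.group("entry").lower(), m.group("assembly"), m.group("chain"))


def drop_assembly_copies(names):
    """Keep the first hit for each (entry, assembly, chain), preserving order."""
    seen = set()
    kept = []
    for name in names:
        key = hit_key(name)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


hits = ["1hxz-assembly1.cif.gz_D", "1hxz-assembly1.cif.gz_D-2"]
print(drop_assembly_copies(hits))
```

This drops copies by name only. It does not check that the alignments of the dropped hits are actually the same, so it would be the wrong filter for multimer searches.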

comment:2 by Elaine Meng, 12 months ago

I never noticed it before in much longer output. I put it at low priority since I figured it was some database issue rather than our bug, so you could ignore it if you don't think it usually causes problems.

comment:3 by Tom Goddard, 12 months ago

I looked for duplicate hits in Foldseek results of 15 different PDB structures and none of them had duplicates like 1hxz, although most had a few hundred to a thousand hits, many more than 1hxz. So I'm not sure why Foldseek is giving duplicate results for 1hxz. I'm a bit wary of filtering them out because it may be that Foldseek really can produce duplicates where different sequence alignments were found.
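One way to tell the two cases apart before filtering is sketched below: group hits by target and check whether the repeated hits carry the same alignment or different ones. The input shape, a list of (target, alignment) pairs with the alignment as a hashable tuple, is an assumption made for illustration and is not the format of the Foldseek output or the .sms file. The target names in the example are made up.

```python
from collections import defaultdict


def classify_duplicates(hits):
    """Split repeated targets into redundant and distinct cases.

    hits: iterable of (target, alignment) pairs, where alignment is any
    hashable value, for example (qstart, qend, tstart, tend).
    Returns (redundant, distinct):
      redundant: targets hit more than once, always with the same alignment
      distinct:  targets hit more than once with differing alignments
    """
    counts = defaultdict(int)
    alignments = defaultdict(set)
    for target, alignment in hits:
        counts[target] += 1
        alignments[target].add(alignment)

    redundant = sorted(
        t for t in counts if counts[t] > 1 and len(alignments[t]) == 1
    )
    distinct = sorted(t for t in counts if len(alignments[t]) > 1)
    return redundant, distinct


# Hypothetical hits, only to show the call.
example = [
    ("hitA", (1, 100, 1, 100)),
    ("hitA", (1, 100, 1, 100)),
    ("hitB", (1, 50, 3, 52)),
    ("hitB", (5, 60, 1, 56)),
    ("hitC", (1, 10, 1, 10)),
]
redundant, distinct = classify_duplicates(example)
print(redundant, distinct)
```

Only the targets in the first list would be safe to collapse. The second list is the case where dropping a hit would lose a different alignment.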

I filed a bug report in the Foldseek GitHub issue tracker, so hopefully they can address the problem.

https://github.com/steineggerlab/foldseek/issues/376
