Opened 12 months ago
Last modified 12 months ago
#16207 assigned defect
redundant foldseek results?
Reported by: | Elaine Meng | Owned by: | Tom Goddard |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | Sequence | Version: | |
Keywords: | | Cc: | |
Blocked By: | | Blocking: | |
Notify when closed: | | Platform: | all |
Project: | ChimeraX | | |
Description
I don't know whether this is a bug or an issue with the pdb100 database, but when I open 1hxz and run
foldseek /D save keep database pdb
I get 14 results in which lines 1 and 2 are identical and lines 3 and 4 are identical.
See the attached screenshot.
This is probably not important, but it could be if bigger searches return many duplicates.
Attachments (2)
Change History (5)
by , 12 months ago
Attachment: | Screenshot 2024-10-28 at 3.42.47 PM.png added |
---|
comment:1 by , 12 months ago
Foldseek is a bit wacky. The output, as visible in the .sms file ChimeraX saves in ~/Downloads/ChimeraX/Foldseek/1hxz_foldseek_pdb.sms, indicates where the problem lies. The Foldseek pdb100 database uses PDB bioassembly files, so the two 1hxz_D entries are listed in the Foldseek output as 1hxz-assembly1.cif.gz_D and 1hxz-assembly1.cif.gz_D-2. If you create the assembly from the 1hxz mmCIF, it is a dimer with two copies of chain D. Foldseek should filter these out since they are identical, but apparently it does not.
I can filter these out. My guess is that they are not filtered out because Foldseek can do multimeric searches and in that case duplicate chains may have different packing and so all copies may be needed to find a matching complex.
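As a rough illustration of the kind of filtering mentioned above, here is a minimal sketch. It is not the ChimeraX implementation. It assumes the hits are tab-separated lines with the query name in the first column and the target name in the second, and that assembly copies differ only by a trailing "-N" on the chain ID (as in 1hxz-assembly1.cif.gz_D-2). The function names are hypothetical. A hit is dropped only when every other column matches an earlier hit, so copies with a different alignment are kept.

```python
import re

# Assumption: an assembly copy is marked by "-<number>" after the chain ID,
# e.g. "1hxz-assembly1.cif.gz_D-2" is a copy of "1hxz-assembly1.cif.gz_D".
_COPY_SUFFIX = re.compile(r"(_[A-Za-z0-9]+)-\d+$")


def canonical_target(name):
    """Strip the trailing assembly-copy suffix from a target name, if present."""
    return _COPY_SUFFIX.sub(r"\1", name)


def drop_symmetry_copies(lines):
    """Keep the first hit for each (query, chain, alignment columns) combination.

    Assumes tab-separated hit lines with the query in column 1 and the
    target in column 2. Lines with fewer than two columns are passed through.
    """
    seen = set()
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            kept.append(line)
            continue
        # Two hits count as duplicates only if all remaining columns agree.
        key = (fields[0], canonical_target(fields[1]), tuple(fields[2:]))
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return kept
```

Keying on the full row rather than on the chain name alone is deliberate: if duplicate chains ever produce different alignments, both hits survive.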
comment:2 by , 12 months ago
I never noticed it before in much longer output. I put it at low priority since I figured it was some database issue rather than our bug, so you could ignore it if you don't think it usually causes problems.
comment:3 by , 12 months ago
I looked for duplicate hits in Foldseek results of 15 different PDB structures and none of them had duplicates like 1hxz, although most had a few hundred to a thousand hits, many more than 1hxz. So I'm not sure why Foldseek is giving duplicate results for 1hxz. I'm a bit wary of filtering them out because it may be that Foldseek really can produce duplicates where different sequence alignments were found.
I filed a bug report in the Foldseek GitHub issues, so hopefully they can address the problem.
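A check like the one described in this comment could be scripted roughly as follows. This is only a sketch under the same assumptions as before: tab-separated hit lines with the target name in the second column, and assembly copies marked by a trailing "-N" on the chain ID. The function name is hypothetical. It reports each chain that appears more than once and says whether the repeated hits are identical or carry different alignment columns, which is the distinction that matters before filtering anything out.

```python
import re
from collections import defaultdict


def find_repeated_chains(lines):
    """Map each repeated target chain to 'identical' or 'different'.

    Assumes tab-separated hit lines with the target name in column 2.
    'identical' means every repeated hit has the same remaining columns;
    'different' means at least one repeat has a different alignment.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumption: "-<number>" after the chain ID marks an assembly copy.
        chain = re.sub(r"(_[A-Za-z0-9]+)-\d+$", r"\1", fields[1])
        groups[chain].append(tuple(fields[2:]))

    report = {}
    for chain, rows in groups.items():
        if len(rows) > 1:
            report[chain] = "identical" if len(set(rows)) == 1 else "different"
    return report
```

An empty result means no chain was hit more than once; any 'different' entry would suggest that dropping duplicates by name alone could discard a distinct alignment.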