Opened 3 years ago

Last modified 15 months ago

#7387 reopened enhancement

BLAST of UniRef50, UniRef90, UniRef100 appears to be using old databases from 2012

Reported by: Tom Goddard Owned by: Zach Pearson
Priority: high Milestone:
Component: Sequence Version:
Keywords: Cc: Elaine Meng, Eric Pettersen, Greg Couch, Scooter
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description (last modified by Elaine Meng)

Not sure if these are the uniref databases being used by blast protein. They are from 2012. The UniRef100 file is 8 Gbytes, while the current UniRef100 is 83 Gbytes. So these old databases only have 1/10 of the sequences. They should be updated.

Change History (26)

comment:1 by Tom Goddard, 3 years ago

Forgot to include the path to the 2012 databases on plato

/databases/mol/blast/db_uniref

comment:2 by Eric Pettersen, 3 years ago

Cc: Greg Couch Scooter Morris added

The line in periodic.conf that would update the UniRef databases seems to be commented out:

# 0 6 28 * * su sacsdb -c "/usr/local/etc/periodic/scripts/get_uniref_blast.py |& /usr/bin/Mail -s 'UniRef update' scooter@…"
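
One way to spot other disabled update jobs is to scan the file for commented-out lines that still look like cron entries. This is only a sketch: it assumes periodic.conf uses five-field crontab syntax, as the line above suggests, and `find_disabled_jobs` is a hypothetical helper, not part of any existing script on plato.

```python
import re

# A '#' followed by five whitespace-separated schedule fields and a command
# (assumed crontab-style syntax; adjust if periodic.conf differs).
DISABLED_JOB = re.compile(r"^\s*#\s*((?:\S+\s+){5})(.+)$")
SCHEDULE_FIELD = re.compile(r"^[\d*/,\-]+$")


def find_disabled_jobs(text, keyword=""):
    """Return (schedule, command) pairs for commented-out cron-like lines."""
    jobs = []
    for line in text.splitlines():
        match = DISABLED_JOB.match(line)
        if not match:
            continue
        fields = match.group(1).split()
        # Skip ordinary prose comments: every schedule field must look like cron.
        if not all(SCHEDULE_FIELD.match(field) for field in fields):
            continue
        command = match.group(2).strip()
        if keyword in command:
            jobs.append((" ".join(fields), command))
    return jobs
```

Calling `find_disabled_jobs(open("periodic.conf").read(), "uniref")` would list any disabled entries whose command mentions uniref, together with their schedule (here, 06:00 on the 28th of each month).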

comment:3 by Elaine Meng, 3 years ago

same issue as #7458?

comment:4 by Zach Pearson, 3 years ago

Cc: Zach Pearson added; Scooter Morris removed
Owner: changed from Zach Pearson to Scooter Morris

comment:5 by Zach Pearson, 3 years ago

My user account is not allowed to write files in that directory. Additionally, that script doesn't exist(!)

Reassigning to Scooter.

in reply to:  6 ; comment:6 by Tom Goddard, 3 years ago

The uniref databases can be downloaded by ftp here; uniref100, uniref90 and uniref50 are each a single large gzipped fasta file.  Then a simple makeblastdb command makes the database files.  The database directory is owned by sacsdb with group sacs and is not writable by group sacs.

$ ls -ld /databases/mol/blast/db_uniref
drwxr-xr-x. 2 sacsdb sacs 42 Jul 28  2012 /databases/mol/blast/db_uniref
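
The decompress-then-makeblastdb step described above could look roughly like the following. It is a minimal sketch, not the actual get_uniref_blast script: the helper names are hypothetical, the download itself is left out, and the command is only built, not run. The `-in`, `-dbtype` and `-title` options are standard makeblastdb flags; with no `-out`, makeblastdb names the database after the input file.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(gz_path, out_dir):
    """Decompress a gzipped FASTA file into out_dir and return the new path."""
    gz_path = Path(gz_path)
    # Assumes the name ends in ".gz", e.g. uniref50.fasta.gz -> uniref50.fasta
    out_path = Path(out_dir) / gz_path.stem
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams, so large files are not read into memory
    return out_path


def makeblastdb_command(fasta_path, title):
    """Build (but do not run) a makeblastdb command line for a protein FASTA."""
    return [
        "makeblastdb",
        "-in", str(fasta_path),
        "-dbtype", "prot",
        "-title", title,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` by an account that is allowed to write to the database directory.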

comment:7 by Elaine Meng, 3 years ago

Priority: moderate → high

comment:8 by Elaine Meng, 3 years ago

Milestone: 1.5

comment:9 by Greg Couch, 3 years ago

Newer data, from 16 June 2021, is in /wynton/group/databases/UniProt/uniref/uniref{100,50,90}.

comment:10 by Greg Couch, 3 years ago

Milestone: 1.5 → 1.6

Give Scooter more time.

comment:11 by Tom Goddard, 3 years ago

Cc: Zach Pearson removed
Owner: changed from Scooter Morris to Zach Pearson

comment:12 by Zach Pearson, 3 years ago

I've re-created the missing get_uniref_blast script and am testing it now.

comment:13 by Zach Pearson, 3 years ago

Milestone: 1.6

comment:14 by Elaine Meng, 18 months ago

Description: modified (diff)

Still hoping for updates. :-)

comment:15 by Elaine Meng, 16 months ago

Did these databases ever get updated??  Easy for me to say, I guess, but using 2012 versions seems like a pretty bad thing that should be relatively easy to ameliorate.
Elaine

comment:16 by Eric Pettersen, 15 months ago

Resolution: fixed
Status: assigned → closed

As a side effect of all the databases on plato getting nuked, they all got updated and the scripts that update them were fixed.

comment:17 by Tom Goddard, 15 months ago

BLAST is not working for uniref, nr, or esmfold and those databases don't appear to be in blast/db_current on plato.

comment:18 by Elaine Meng, 15 months ago

Resolution: fixed
Status: closed → reopened

Yes, as Tom G said, I also got failures for uniref(N) and esmfold today. The others may be OK now: I could search pdb and alphafold, and nr is also running now that Scooter has gotten it into place, although I may not wait around for that job to complete. Tested in UCSF ChimeraX version: 1.9.dev202408060516 (2024-08-06)

comment:19 by Zach Pearson, 15 months ago

Cc: Scooter added

The error in the logs is:

BLAST Database error: No alias or index file found for protein database [/databases/mol/blast/db/uniref100]

Here are all the files in /databases/mol/blast/db containing the string uniref:

03:39:29 zjp@crick cxservices → ls /databases/mol/blast/db | grep uniref
lrwxrwxrwx.  1 sacsdb sacs   28 Aug  3 17:38 makeblastdb.log -> ../db_uniref/makeblastdb.log
lrwxrwxrwx.  1 sacsdb sacs   28 Aug  3 17:38 uniref100.fasta -> ../db_uniref/uniref100.fasta
lrwxrwxrwx.  1 sacsdb sacs   32 Aug  3 17:38 uniref100.fasta.pdb -> ../db_uniref/uniref100.fasta.pdb
lrwxrwxrwx.  1 sacsdb sacs   37 Aug  3 17:38 uniref100.fasta.pdb-lock -> ../db_uniref/uniref100.fasta.pdb-lock
lrwxrwxrwx.  1 sacsdb sacs   27 Aug  3 17:38 uniref50.fasta -> ../db_uniref/uniref50.fasta
lrwxrwxrwx.  1 sacsdb sacs   31 Aug  3 17:38 uniref50.fasta.pdb -> ../db_uniref/uniref50.fasta.pdb
lrwxrwxrwx.  1 sacsdb sacs   36 Aug  3 17:38 uniref50.fasta.pdb-lock -> ../db_uniref/uniref50.fasta.pdb-lock
lrwxrwxrwx.  1 sacsdb sacs   27 Aug  3 17:38 uniref90.fasta -> ../db_uniref/uniref90.fasta
lrwxrwxrwx.  1 sacsdb sacs   31 Aug  3 17:38 uniref90.fasta.pdb -> ../db_uniref/uniref90.fasta.pdb
lrwxrwxrwx.  1 sacsdb sacs   36 Aug  3 17:38 uniref90.fasta.pdb-lock -> ../db_uniref/uniref90.fasta.pdb-lock
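
The error means BLAST found neither an alias file nor an index file for the base name it was given. For protein databases those normally end in `.pal` and `.pin` (multi-volume databases use names like `NAME.00.pin`). A quick check along these lines can tell whether a given base name is usable; `has_protein_db` is a hypothetical helper, and the two-digit volume pattern is a simplification.

```python
from pathlib import Path


def has_protein_db(db_dir, base_name):
    """Return True if db_dir holds an alias (.pal) or index (.pin) file for base_name."""
    db_dir = Path(db_dir)
    # exists() follows symlinks, so a dangling link counts as missing.
    if (db_dir / f"{base_name}.pal").exists() or (db_dir / f"{base_name}.pin").exists():
        return True
    # Multi-volume databases use numbered volumes such as base_name.00.pin.
    return any(db_dir.glob(f"{base_name}.[0-9][0-9].pin"))
```

Going by the listing above, this would return False for both `uniref100` and `uniref100.fasta`, since only `.pdb` and `.pdb-lock` entries are present for either name.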

comment:20 by Zach Pearson, 15 months ago

NR runs; it's just the unirefs and esmfold that fail.

comment:21 by Scooter Morris, 15 months ago

Those "pdb" files are the database, supposedly. At least that's what makeblastdb outputs. I'll look into it a bit more.

comment:22 by Zach Pearson, 15 months ago

Maybe I need to append '.fasta' to the names of those databases in webservices like I do for AlphaFold and ESMFold.

Also, it's not really readily apparent why requests for v0 of ESMFold try to look for it in a folder called ESMFold/v1, so I just symlinked v1 to v0.

comment:23 by Scooter Morris, 15 months ago

OK, so the problem seems to be beegfs. I can build the blast database in /tmp, but not directly in /databases. Still looking at it.
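
A workaround in that spirit is to build on local scratch space and only then move the finished files onto the shared filesystem. The sketch below assumes the build step can be wrapped in a callable that writes into whatever directory it is handed; `build_then_install` and its parameters are hypothetical names, not the actual fix that was applied.

```python
import shutil
import tempfile
from pathlib import Path


def build_then_install(build, dest_dir):
    """Run build(scratch_dir) on local scratch space, then move its output to dest_dir.

    `build` is any callable that writes the database files into the
    directory it is given (for example by invoking makeblastdb there).
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    # TemporaryDirectory honours TMPDIR and usually lands under /tmp.
    with tempfile.TemporaryDirectory() as scratch:
        build(Path(scratch))
        for item in sorted(Path(scratch).iterdir()):
            target = dest_dir / item.name
            shutil.move(str(item), str(target))  # copies when crossing filesystems
            installed.append(target)
    return installed
```

Because the files reach the destination only after the build has finished, a failed build leaves the existing database directory untouched.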

comment:24 by Scooter Morris, 15 months ago

OK, uniref should now be working for blast.

comment:25 by Elaine Meng, 15 months ago

Thanks, Scooter!  

I just verified that I can blast esmfold and uniref50 (assuming the other unirefs will be the same).  However, only the uniref50 job returned results even though the task manager reports both are finished.  Here are the IDs in case they are useful in debugging.  Or maybe it's known that esmfold won't really work yet.

blastprotein /A database esmfold  Webservices job id: OPBU9S578S8KRPNX
blastprotein /A database uniref50  Webservices job id: JANU7RB1SUM2ZGUX

As an aside, it might be useful to enhance blastprotein to report something about database version or when it was last updated, but I have no idea of feasibility/difficulty so I'll leave it up to others to ticket or ignore.
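
As a rough illustration of what "last updated" reporting could be based on, the newest modification time among a database's files is easy to compute on the server side. This is only a sketch under the assumption that the files share a common base name; `database_last_updated` is a hypothetical helper and says nothing about how blastprotein or webservices would surface the value.

```python
from datetime import datetime, timezone
from pathlib import Path


def database_last_updated(db_dir, base_name):
    """Return the newest modification time (UTC) among files named base_name.*, or None."""
    mtimes = [
        path.stat().st_mtime  # stat() follows symlinks to the real file
        for path in Path(db_dir).glob(f"{base_name}.*")
        if path.is_file()
    ]
    if not mtimes:
        return None
    return datetime.fromtimestamp(max(mtimes), tz=timezone.utc)
```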

Elaine

comment:26 by Elaine Meng, 15 months ago

OK, in today's test I was able to blast esmfold and get results, thanks!
Can this be closed?  Feel free to close it if everything in this ticket is done.
Elaine