Opened 5 years ago

Last modified 5 years ago

#3380 assigned enhancement

Read compressed cif files

Reported by: lpravda@… Owned by: Eric Pettersen
Priority: moderate Milestone:
Component: Input/Output Version:
Keywords: Cc: Greg Couch, Tom Goddard
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

I was wondering if it would be possible to add an option for ChimeraX to read gzipped mmCIF files. It would make our lives a bit easier.

Change History (16)

comment:1 by Greg Couch, 5 years ago

Cc: Eric Pettersen added
Status: assigned → accepted

This was always intended to work, and may have worked at one time, but it has been broken for a while. It will slow down reading those files, since they will need to be decompressed first. If it is an option for you, a compressing filesystem like ZFS should give better performance than compressing the files yourself.

comment:2 by Eric Pettersen, 5 years ago

It's because mmcif.mmcif.open_mmcif() expects a path name argument and not a stream, and therefore mmcif's bundle_info.xml declares "want_path=true", which precludes it being passed a stream from gzip.open(). The pdb module handles both paths and streams, but does some pretty ugly things in the C++ layer to handle the stream.
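
For illustration, the mismatch looks like this (a sketch with a hypothetical file name, not ChimeraX code):

import gzip

# gzip.open() returns a file-like stream, not a filesystem path, so it
# cannot be handed to a reader whose bundle_info.xml says want_path=true.
stream = gzip.open("1abc.cif.gz", "rb")
print(type(stream))        # <class 'gzip.GzipFile'>
header = stream.read(80)   # fine as a stream...
# ...but open_mmcif() needs a real path string for its C++ parser.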

comment:3 by Eric Pettersen, 5 years ago

The simple dumb solution would be for mmcif.mmcif.open_mmcif() to read the entire stream, write a temporary file with the contents, and then call the C++ layer with the path to the temporary file.
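
A minimal sketch of that approach, assuming a generic path-only reader function (the names here are illustrative, not the actual mmcif API):

import gzip
import os
import tempfile

def open_via_temp_file(reader, session, path):
    # 'reader' is any open function that requires a real filesystem path.
    # Decompress the gzipped file to a temporary file, hand the temp
    # file's path to the reader, then clean up.
    with gzip.open(path, "rb") as stream:
        with tempfile.NamedTemporaryFile(suffix=".cif", delete=False) as tf:
            tf.write(stream.read())
            tmp_path = tf.name
    try:
        return reader(session, tmp_path)
    finally:
        os.remove(tmp_path)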

comment:4 by Greg Couch, 5 years ago

It would be nice to do this generically for every reader that declares "want_path=true". There was code in the previous core.io package that did that in open_data().

comment:5 by Eric Pettersen, 5 years ago

Cc: Tom Goddard added

Greg, I think you're mistaken. I tried both a ChimeraX from 1/3/18 and the 0.92 release and neither one could read a .cif.gz file -- the former with a traceback in the mmcif module, and the latter with the message "Cannot read compressed mmCIF files".

When I was implementing the new open/save scheme I thought about automatically generating a temporary file for readers that didn't support streams, but felt that it could lead to mysterious errors where the temp location didn't have enough space -- presumably these files are compressed because disk space is at a premium (though in some cases it could be for network I/O reasons). Also, the performance penalty incurred for writing and reading the (presumably huge) temp file would probably be blamed on ChimeraX, with the user never being aware that using an uncompressed file would be significantly faster -- though I suppose a message could be logged when such temp-file shenanigans occur.

So I *could* put in some automatic-temp-file code, but I would want some kind of consensus beforehand. I could see it being problematic with the way maps aren't directly saved into sessions.

comment:6 by Greg Couch, 5 years ago

I didn't say it worked in 0.9*. But the intent was there. I think that code was bypassed much earlier. If anyone cares, I can figure out if it ever worked.

comment:7 by Eric Pettersen, 5 years ago

I can see the temp-file code in the 0.92 io.py. It was short-circuited by code above it added by Tom (that shows the "Cannot read compressed file" message). My job in porting to the new open/save scheme was to preserve the current behavior, which in this case meant no auto-temp-file stuff. Since Tom was the one that short-circuited that code, presumably he had a reason, so I would want his approval before reimplementing it.

comment:8 by Eric Pettersen, 5 years ago

For reference, the commits that added those lines are 6e0836c5cd and e8a2d9d5de

comment:9 by Tom Goddard, 5 years ago

Here are the changes Eric mentions.

https://www.rbvi.ucsf.edu/trac/ChimeraX/changeset/6e0836c5cddd7fc1d3fad4d2f3430792ec8735c3/

https://www.rbvi.ucsf.edu/trac/ChimeraX/changeset/e8a2d9d5de521e6af1af652c18a606a2a7ce3476/

Those changes mean that if an open format requires a path and the file is compressed, so that the io.py code could only provide a stream (there being no provision for temp files), an error is given. In the first commit I added a new open-format attribute, "requires_path". But then in the second commit I decided I could just check by introspection whether the open function has a "path" argument, and if so conclude that it requires a path. The open_mmcif() routine has a path argument, thus it got labeled as "requires path". There is no doc string in open_mmcif() to indicate what type the path argument can be, and it immediately hands the argument to a C++ routine. Maybe it need not be a path but could be a stream. I'd suggest documenting what type the path argument is, and also not allowing it to be either a string or a stream -- that is just confusing. Use separate stream and path arguments for API clarity.
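
A rough illustration of that introspection (hypothetical reader signatures, not the actual io.py code):

import inspect

def requires_path(open_func):
    # Conclude the reader needs a real filesystem path if its open
    # function declares a "path" parameter.
    return "path" in inspect.signature(open_func).parameters

def open_mmcif(session, path):    # hypothetical path-based reader
    pass

def open_pdb(session, stream):    # hypothetical stream-capable reader
    pass

print(requires_path(open_mmcif))  # True  -> labeled "requires path"
print(requires_path(open_pdb))    # False -> can be handed a stream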

comment:10 by Tom Goddard, 5 years ago

One further comment. In almost every case where I've tried reading compressed files, it was a mistake that I later reversed, because performance is horrible compared to reading uncompressed files. This applies to ChimeraX sessions and density maps. Specifically, gzip compression is dog slow. I have used Blosc compression for segmentations that compress 50-fold, which is the one case I've seen where compression makes sense.

Of course we want compression when transmitting files over the network; e.g. it might increase performance when fetching a large mmCIF. But the file should be decompressed once, immediately after the fetch. This is what is done for EMDB maps.

comment:11 by Eric Pettersen, 5 years ago

Yeah, those were the good old days. Ignoring the performance issues (for now), you should know that core/io.py doesn't exist anymore -- we're only looking at it for historical purposes, to determine what the code used to do. In particular, that whole path/stream introspection thing is gone, now handled by the "want_path" attribute in bundle_info.xml. The mmCIF code does need a path and not a stream, and there was code below the error-raising code to handle auto-decompression, namely:

provide_stream = list(params.keys())[1] == "stream"
if provide_stream:
    # doesn't matter
    ...
elif compression:
    # need to use temporary file
    import tempfile
    filename, dname, stream = _compressed_open(filename, compression, 'rb')
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        tf.write(stream.read())
        tf.close()
    filename = tf.name
    delete_when_done = True
    args = (session, filename)
else:
    ...

If I comment out the error-raising lines, then I can open a .cif.gz file successfully in 0.92.

I guess that brings us to the performance issues. If I implemented this, I would definitely log something each time saying that using compressed files directly is slow and recommending decompression beforehand. I could also implement a bundle_info.xml keyword to block auto-decompression for particular formats (e.g. map formats due to the way they handle session saving). Would this be "good enough" do you think?
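
The logged message might look something like this (hypothetical helper and wording; session.logger is ChimeraX's logging interface):

def warn_slow_decompression(session, filename):
    # Hypothetical helper: called whenever a compressed file is silently
    # decompressed to a temporary file for a want_path reader.
    session.logger.warning(
        "%s is compressed; decompressing to a temporary file. Opening the "
        "uncompressed file directly would be significantly faster."
        % filename)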

in reply to: 11; comment:12 by goddard@…, 5 years ago

That sounds ok.  I wonder if it might be better to help the user out and uncompress the file for them in the same directory with the same file name (if such a file does not already exist; otherwise add a suffix like "-1"), then open that decompressed version.  So if they opened myfile.cif.gz they would then find myfile.cif in the same directory, and when they use the recent-file history it would open that uncompressed file.  Maybe there could be a preference setting to instead decompress to a temp directory, if the user prefers that and you want to go whole hog.

I think it is worth giving a bit of thought to whether we want to spend the time dealing with decompression at all.  Would anyone actually use this?  I never have.  We shouldn't give time to features almost no one will use.  I know this was requested by EBI and they are an important user, but they also have an unusual use case, and perhaps even their case would benefit from running gunzip *.cif.gz so that all their uses of ChimeraX run faster.

comment:13 by Eric Pettersen, 5 years ago

Cc: Greg Couch added; Eric Pettersen removed
Owner: changed from Greg Couch to Eric Pettersen
Status: accepted → assigned

I guess I will go "whole hog".

Lukas! Hey Lukas! Would you prefer the default behavior to be:

1) Decompress into the same directory as the compressed file, and leave the decompressed file for possible later use (including file history)
2) Decompress into a temporary file (and temporary directory) and remove afterwards

--Eric

in reply to: 13; comment:14 by lpravda@…, 5 years ago

Hi guys,

Thank you for looking into this so quickly. I wonder what the performance hit you mention for reading compressed files actually is. Is it 10%, 50%? Also, this performance hit would apply only to gzipped structures, am I correct? So if one decides not to decompress structures beforehand, it has no impact on other people using decompressed structures. I think most people would not really care about a 1 s delay in opening a coordinate structure, and the time spent manually decompressing a file before feeding it to ChimeraX may well exceed the possible performance hit, in my view.

Besides, both the wwPDB (https://ftp.wwpdb.org/pub/pdb/data/structures/divided/mmCIF/00/) and we at PDBe (http://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/divided/mmCIF/00/) distribute the coordinate data in compressed form through FTP; providing the data in uncompressed form is just an augmentation of a web server. Assembly files, for example, are available in *.cif.gz form only. So having the ability to natively read gzipped structures makes total sense from my point of view.

However, silently decompressing files into arbitrary directories can be tricky for all the reasons Eric pointed out (in fact our area is read-only, so option 1 would not work at all). At the end of the day this is a design decision you need to make, but rather than having ChimeraX decompress structures into arcane locations, I'd prefer to decompress the structure myself and leave it be.

Kind regards,
Lukas


comment:15 by Eric Pettersen, 5 years ago

Well, I'll probably still do this, but make it lower priority than I was going to. In my testing, decompressing and writing out an uncompressed file using Python takes less than a second for almost all .cif files. For something as large as 3j3q it takes less than 2 seconds. Using just Unix command-line tools, it's a factor of 2 faster than that. It's _compressing_ files that is really slow.
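
For reference, a quick way to reproduce that kind of measurement (hypothetical file names):

import gzip
import shutil
import time

start = time.time()
with gzip.open("3j3q.cif.gz", "rb") as src, open("3j3q.cif", "wb") as dst:
    shutil.copyfileobj(src, dst)   # stream-decompress to an uncompressed copy
print("decompressed in %.2f s" % (time.time() - start))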

in reply to: 15; comment:16 by goddard@…, 5 years ago

Although compression methods try to make decompression fast, it is dismally slow for large (many-Gbyte) data, typically 200 Mbytes/sec for gunzip.  That was not too big a problem in the era of spinning hard drives that read at 100 Mbytes/sec, but now SSDs with read speeds of 3 Gbytes/sec are common.  You are right that compression is much slower still.  mmCIF files are not big enough for the speed to matter, particularly since parsing those files is slower than decompressing them.

Although not relevant to small files like mmCIF, much higher decompression speeds (5 Gbytes/sec) are available from modern compression methods like Blosc, which is built into HDF5.
