Opened 4 years ago
Last modified 3 years ago
#4678 accepted defect
Large alignment performance
Reported by: | Owned by: | pett | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Sequence | Version: | |
Keywords: | Cc: | Elaine Meng | |
Blocked By: | Blocking: | ||
Notify when closed: | Platform: | all | |
Project: | ChimeraX |
Description
The following bug report has been submitted:
Platform: macOS-10.15.7-x86_64-i386-64bit
ChimeraX Version: 1.3.dev202105180049 (2021-05-18 00:49:36 UTC)
Description
Working with a largish sequence alignment is excruciatingly slow and takes an unexpectedly enormous amount of memory. I opened 6vxx, did a BLAST search and displayed a sequence alignment of the ~1000 sequences. This is 1000 sequences of 1300 amino acids, one Mbyte of data. It took about 5 minutes, with no progress messages, and 13 Gbytes of memory to show the sequence alignment. Switching conservation headers to identity and then back to AL2CO took about 3 minutes for each change, and now ChimeraX is using 26 Gbytes.
Log:
UCSF ChimeraX version: 1.3.dev202105180049 (2021-05-18)
© 2016-2021 Regents of the University of California. All rights reserved.
How to cite UCSF ChimeraX
> open 6vxx format mmcif fromDatabase pdb
6vxx title: Structure of the SARS-CoV-2 spike glycoprotein (closed state) [more info...]
Chain information for 6vxx #1
Chain | Description
---|---
A B C | spike glycoprotein
Non-standard residues in 6vxx #1
---
NAG — N-acetyl-D-glucosamine
> ui tool show "Blast Protein"
> blastprotein /A database pdb cutoff 1e-3 matrix BLOSUM62 maxSeqs 100 name bp1
Web Service: BlastProtein2 is a Python wrapper that calls blastp to search nr or pdb for sequences similar to the given protein sequence
Opal service URL: http://webservices.rbvi.ucsf.edu/opal2/services/BlastProtein2Service
Opal job id: appBlastProtein2Service16215643635791603944578
Opal status URL prefix: http://webservices.rbvi.ucsf.edu/appBlastProtein2Service16215643635791603944578
stdout.txt = standard output
stderr.txt = standard error
BlastProtein finished.
Alignment identifier is bp1 [1] Associated 6vxx chain A to #1/A with 0 mismatches Associated 6vxx chain B to #1/A with 0 mismatches Associated 6vxx chain C to #1/A with 0 mismatches Showing conservation header ("seq_conservation" residue attribute) for alignment bp1 [1] > color byattribute seq_conservation 23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24 > select /A-C:355 33 atoms, 30 bonds, 3 residues, 1 model selected > select /A-C:355 33 atoms, 30 bonds, 3 residues, 1 model selected > color byattribute seq_conservation palette redblue 23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24 > preset cartoons/nucleotides cylinders/stubs Changed 0 atom styles Preset expands to these ChimeraX commands: show nucleic hide protein|solvent|H surf hide style (protein|nucleic|solvent) & @@draw_mode=0 stick cartoon cartoon style modeh def arrows t arrowshelix f arrowscale 2 wid 2 thick 0.4 sides 12 div 20 cartoon style ~(nucleic|strand) x round cartoon style (nucleic|strand) x rect cartoon style protein modeh tube rad 2 sides 24 thick 0.6 cartoon style nucleic x round width 1.6 thick 1.6 nucleotides stubs > color byattribute seq_conservation palette 0,red:1,blue 23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24 > color byattribute seq_conservation palette 0,red:1.0,blue 23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24 > color byattribute seq_conservation palette -2,red:1.0,blue 23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24 > color byattribute seq_conservation palette 0,red:1.0,blue 23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24 > color byattribute seq_conservation palette -1,red:1.0,blue 23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24 > color byattribute seq_conservation palette cyanmaroon 23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24 > style sel sphere Changed 33 atom styles > select clear > style sphere 
Changed 23694 atom styles > select /C:494 6 atoms, 5 bonds, 1 residue, 1 model selected > sequence header conservation setting style "identity histogram" > color byattribute seq_conservation palette cyanmaroon 23694 atoms, 2979 residues, atom seq_conservation range 0.683 to 0.973 > sequence header conservation setting style AL2CO OpenGL version: 4.1 ATI-3.10.19 OpenGL renderer: AMD Radeon Pro Vega 20 OpenGL Engine OpenGL vendor: ATI Technologies Inc.Hardware: Hardware Overview: Model Name: MacBook Pro Model Identifier: MacBookPro15,3 Processor Name: 8-Core Intel Core i9 Processor Speed: 2.4 GHz Number of Processors: 1 Total Number of Cores: 8 L2 Cache (per Core): 256 KB L3 Cache: 16 MB Hyper-Threading Technology: Enabled Memory: 32 GB Boot ROM Version: 1554.100.64.0.0 (iBridge: 18.16.14556.0.0,0) Software: System Software Overview: System Version: macOS 10.15.7 (19H1030) Kernel Version: Darwin 19.6.0 Time since boot: 49 minutes Graphics/Displays: Intel UHD Graphics 630: Chipset Model: Intel UHD Graphics 630 Type: GPU Bus: Built-In VRAM (Dynamic, Max): 1536 MB Vendor: Intel Device ID: 0x3e9b Revision ID: 0x0002 Automatic Graphics Switching: Supported gMux Version: 5.0.0 Metal: Supported, feature set macOS GPUFamily2 v1 Radeon Pro Vega 20: Chipset Model: Radeon Pro Vega 20 Type: GPU Bus: PCIe PCIe Lane Width: x8 VRAM (Total): 4 GB Vendor: AMD (0x1002) Device ID: 0x69af Revision ID: 0x00c0 ROM Revision: 113-D2060I-087 VBIOS Version: 113-D20601MA0T-016 Option ROM Version: 113-D20601MA0T-016 EFI Driver Version: 01.01.087 Automatic Graphics Switching: Supported gMux Version: 5.0.0 Metal: Supported, feature set macOS GPUFamily2 v1 Displays: Color LCD: Display Type: Built-In Retina LCD Resolution: 2880 x 1800 Retina Framebuffer Depth: 24-Bit Color (ARGB8888) Main Display: Yes Mirror: Off Online: Yes Automatically Adjust Brightness: No Connection Type: Internal Locale: (None, 'UTF-8') PyQt5 5.15.2, Qt 5.15.2 Installed Packages: alabaster: 0.7.12 appdirs: 1.4.4 appnope: 
0.1.2 Babel: 2.9.1 backcall: 0.2.0 blockdiag: 2.0.1 certifi: 2020.12.5 cftime: 1.4.1 chardet: 4.0.0 ChimeraX-AddCharge: 1.1.3 ChimeraX-AddH: 2.1.6 ChimeraX-AlignmentAlgorithms: 2.0 ChimeraX-AlignmentHdrs: 3.2 ChimeraX-AlignmentMatrices: 2.0 ChimeraX-Alignments: 2.1 ChimeraX-AmberInfo: 1.0 ChimeraX-Arrays: 1.0 ChimeraX-Atomic: 1.20.1 ChimeraX-AtomicLibrary: 3.2.1 ChimeraX-AtomSearch: 2.0 ChimeraX-AtomSearchLibrary: 1.0 ChimeraX-AxesPlanes: 2.0 ChimeraX-BasicActions: 1.1 ChimeraX-BILD: 1.0 ChimeraX-BlastProtein: 1.1.1 ChimeraX-BondRot: 2.0 ChimeraX-BugReporter: 1.0 ChimeraX-BuildStructure: 2.5.2 ChimeraX-Bumps: 1.0 ChimeraX-BundleBuilder: 1.1 ChimeraX-ButtonPanel: 1.0 ChimeraX-CageBuilder: 1.0 ChimeraX-CellPack: 1.0 ChimeraX-Centroids: 1.1 ChimeraX-ChemGroup: 2.0 ChimeraX-Clashes: 2.1 ChimeraX-ColorActions: 1.0 ChimeraX-ColorGlobe: 1.0 ChimeraX-ColorKey: 1.3 ChimeraX-CommandLine: 1.1.4 ChimeraX-ConnectStructure: 2.0 ChimeraX-Contacts: 1.0 ChimeraX-Core: 1.3.dev202105180049 ChimeraX-CoreFormats: 1.0 ChimeraX-coulombic: 1.3 ChimeraX-Crosslinks: 1.0 ChimeraX-Crystal: 1.0 ChimeraX-CrystalContacts: 1.0 ChimeraX-DataFormats: 1.1 ChimeraX-Dicom: 1.0 ChimeraX-DistMonitor: 1.1.3 ChimeraX-DistUI: 1.0 ChimeraX-Dssp: 2.0 ChimeraX-ExperimentalCommands: 1.0 ChimeraX-FileHistory: 1.0 ChimeraX-FunctionKey: 1.0 ChimeraX-Geometry: 1.1 ChimeraX-gltf: 1.0 ChimeraX-Graphics: 1.1 ChimeraX-Hbonds: 2.1 ChimeraX-Help: 1.1 ChimeraX-HKCage: 1.3 ChimeraX-IHM: 1.1 ChimeraX-ImageFormats: 1.1 ChimeraX-IMOD: 1.0 ChimeraX-IO: 1.0.1 ChimeraX-ItemsInspection: 1.0 ChimeraX-Label: 1.1 ChimeraX-ListInfo: 1.1.1 ChimeraX-Log: 1.1.4 ChimeraX-LookingGlass: 1.1 ChimeraX-Maestro: 1.8.1 ChimeraX-Map: 1.1 ChimeraX-MapData: 2.0 ChimeraX-MapEraser: 1.0 ChimeraX-MapFilter: 2.0 ChimeraX-MapFit: 2.0 ChimeraX-MapSeries: 2.1 ChimeraX-Markers: 1.0 ChimeraX-Mask: 1.0 ChimeraX-MatchMaker: 1.2.1 ChimeraX-MDcrds: 2.2 ChimeraX-MedicalToolbar: 1.0.1 ChimeraX-Meeting: 1.0 ChimeraX-MLP: 1.1 ChimeraX-mmCIF: 2.3 ChimeraX-MMTF: 
2.1 ChimeraX-Modeller: 1.0.1 ChimeraX-ModelPanel: 1.1 ChimeraX-ModelSeries: 1.0 ChimeraX-Mol2: 2.0 ChimeraX-Morph: 1.0 ChimeraX-MouseModes: 1.1 ChimeraX-Movie: 1.0 ChimeraX-Neuron: 1.0 ChimeraX-Nucleotides: 2.0.1 ChimeraX-OpenCommand: 1.5 ChimeraX-PDB: 2.4.1 ChimeraX-PDBBio: 1.0 ChimeraX-PDBLibrary: 1.0.1 ChimeraX-PDBMatrices: 1.0 ChimeraX-PickBlobs: 1.0 ChimeraX-Positions: 1.0 ChimeraX-PresetMgr: 1.0.1 ChimeraX-PubChem: 2.0.1 ChimeraX-ReadPbonds: 1.0 ChimeraX-Registration: 1.1 ChimeraX-RemoteControl: 1.0 ChimeraX-ResidueFit: 1.0 ChimeraX-RestServer: 1.1 ChimeraX-RNALayout: 1.0 ChimeraX-RotamerLibMgr: 2.0 ChimeraX-RotamerLibsDunbrack: 2.0 ChimeraX-RotamerLibsDynameomics: 2.0 ChimeraX-RotamerLibsRichardson: 2.0 ChimeraX-SaveCommand: 1.4 ChimeraX-SchemeMgr: 1.0 ChimeraX-SDF: 2.0 ChimeraX-Segger: 1.0 ChimeraX-Segment: 1.0 ChimeraX-SelInspector: 1.0 ChimeraX-SeqView: 2.4 ChimeraX-Shape: 1.0.1 ChimeraX-Shell: 1.0 ChimeraX-Shortcuts: 1.1 ChimeraX-ShowAttr: 1.0 ChimeraX-ShowSequences: 1.0 ChimeraX-SideView: 1.0 ChimeraX-Smiles: 2.0.1 ChimeraX-SmoothLines: 1.0 ChimeraX-SpaceNavigator: 1.0 ChimeraX-StdCommands: 1.6 ChimeraX-STL: 1.0 ChimeraX-Storm: 1.0 ChimeraX-Struts: 1.0 ChimeraX-Surface: 1.0 ChimeraX-SwapAA: 2.0 ChimeraX-SwapRes: 2.1 ChimeraX-TapeMeasure: 1.0 ChimeraX-Test: 1.0 ChimeraX-Toolbar: 1.1 ChimeraX-ToolshedUtils: 1.2 ChimeraX-Tug: 1.0 ChimeraX-UI: 1.9.3 ChimeraX-uniprot: 2.1 ChimeraX-UnitCell: 1.0 ChimeraX-ViewDockX: 1.0 ChimeraX-Vive: 1.1 ChimeraX-VolumeMenu: 1.0 ChimeraX-VTK: 1.0 ChimeraX-WavefrontOBJ: 1.0 ChimeraX-WebCam: 1.0 ChimeraX-WebServices: 1.0 ChimeraX-Zone: 1.0 colorama: 0.4.4 comtypes: 1.1.10 cxservices: 1.0 cycler: 0.10.0 Cython: 0.29.23 decorator: 4.4.2 distlib: 0.3.1 docutils: 0.17.1 filelock: 3.0.12 funcparserlib: 0.3.6 grako: 3.16.5 html2text: 2020.1.16 idna: 2.10 ihm: 0.20 imagecodecs: 2021.4.28 imagesize: 1.2.0 ipykernel: 5.5.5 ipython: 7.23.1 ipython-genutils: 0.2.0 jedi: 0.18.0 Jinja2: 2.11.3 jupyter-client: 6.1.12 jupyter-core: 4.7.1 
kiwisolver: 1.3.1 lxml: 4.6.3 lz4: 3.1.3 MarkupSafe: 1.1.1 matplotlib: 3.4.2 matplotlib-inline: 0.1.2 msgpack: 1.0.2 netCDF4: 1.5.6 networkx: 2.5.1 numpy: 1.20.3 numpydoc: 1.1.0 openvr: 1.16.801 packaging: 20.9 ParmEd: 3.2.0 parso: 0.8.2 pexpect: 4.8.0 pickleshare: 0.7.5 Pillow: 8.2.0 pip: 21.1.1 pkginfo: 1.7.0 prompt-toolkit: 3.0.18 psutil: 5.8.0 ptyprocess: 0.7.0 pycollada: 0.7.1 pydicom: 2.1.2 Pygments: 2.9.0 PyOpenGL: 3.1.5 PyOpenGL-accelerate: 3.1.5 pyparsing: 2.4.7 PyQt5: 5.15.2 PyQt5-sip: 12.8.1 PyQtWebEngine: 5.15.2 python-dateutil: 2.8.1 pytz: 2021.1 pyzmq: 22.0.3 qtconsole: 5.1.0 QtPy: 1.9.0 requests: 2.25.1 scipy: 1.6.3 setuptools: 56.2.0 six: 1.16.0 snowballstemmer: 2.1.0 sortedcontainers: 2.3.0 Sphinx: 4.0.1 sphinxcontrib-applehelp: 1.0.2 sphinxcontrib-blockdiag: 2.0.0 sphinxcontrib-devhelp: 1.0.2 sphinxcontrib-htmlhelp: 1.0.3 sphinxcontrib-jsmath: 1.0.1 sphinxcontrib-qthelp: 1.0.3 sphinxcontrib-serializinghtml: 1.1.4 suds-jurko: 0.6 tifffile: 2021.4.8 tinyarray: 1.2.3 tornado: 6.1 traitlets: 5.0.5 urllib3: 1.26.4 wcwidth: 0.2.5 webcolors: 1.11.1 wheel: 0.36.2 wheel-filename: 1.3.0
Change History (9)
comment:1 by , 4 years ago
Cc: | added |
---|---|
Component: | Unassigned → Sequence |
Owner: | set to |
Platform: | → all |
Project: | → ChimeraX |
Status: | new → accepted |
Summary: | ChimeraX bug report submission → Large alignment performance |
follow-up: 2 comment:2 by , 4 years ago
I wanted to see the range of mutations available in the PDB on the available SARS-CoV-2 spike protein structures. I did not suspect that the minutes of waiting and very high memory use were because of the layout of a million characters. I figured it had to do with computation, like aligning the BLAST sequences, or some inefficiency in computing conservation. But I see from your layout description that could be the trouble. I would not have even been able to try this unless my laptop had 32 Gbytes of memory. I realize getting good performance is hard work. The question is whether ChimeraX sequence tools are intended just to handle a few sequences (e.g. 10). I think in 2021 biology having 1000 sequences aligned is not big. GISAID has over 1 million SARS-CoV-2 sequences and I plan on trying to look at conservation for some large subsets today and color a structure using that info. Large sequence alignments (hundreds of sequences) are common. It would be nice to handle these but maybe it is too difficult. If it is too difficult then we should have a progress indicator so the user knows it is hopeless when they show a large alignment, and they should be able to stop it instead of having to force quit and lose their session. The limitation to just a few sequences will be surprising to a researcher (it was surprising to me, who should know better), so not letting the user shoot themselves in the foot is especially important if we want them to be happy with our software. I guess progress messages would be pretty easy, but stopping is something we have not set up infrastructure for. So maybe when I try to show 1000 sequences, as a stopgap it should say "This is going to take about 13 Gbytes of memory and about 5 minutes to display. ChimeraX is not intended to handle large sequence alignments." where the numbers are simply estimated from the number of characters that need to be laid out.
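The stopgap warning suggested above could be estimated by scaling from the single data point in this report (about 1000 x 1300 characters, 13 Gbytes, 5 minutes). The following is only a sketch, not ChimeraX code: the function name, the cutoff, and the assumption that cost grows linearly with the number of characters are all hypothetical.

```python
# Reference measurement taken from this ticket: ~1000 sequences x ~1300 columns
# took about 13 Gbytes and about 5 minutes to display.
REF_CHARS = 1000 * 1300
REF_GBYTES = 13.0
REF_MINUTES = 5.0


def layout_warning(num_seqs, num_columns, warn_above_chars=100_000):
    """Return a warning string for a large alignment, or None if it is small.

    Assumes (hypothetically) that memory and time grow linearly with the
    number of characters laid out; warn_above_chars is an arbitrary cutoff.
    """
    chars = num_seqs * num_columns
    if chars <= warn_above_chars:
        return None
    scale = chars / REF_CHARS
    return (
        f"This is going to take about {REF_GBYTES * scale:.1f} Gbytes of memory "
        f"and about {REF_MINUTES * scale:.1f} minutes to display."
    )
```

A real implementation would need measurements at several alignment sizes to check whether linear scaling actually holds.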
comment:3 by , 4 years ago
With the confounding factor of the Sequence.characters memory leak removed, the memory consumption drops to a still-too-large 3.4GB. Changing the Conservation header takes ~6 minutes and raises the memory consumption to 5.73GB. Unfortunately, memory management of QGraphicsSceneItems is pretty tricky.
comment:4 by , 4 years ago
So, while starting to work on this I wanted to save an alignment file so I wouldn't have to run the Blast repeatedly, and noticed there seemed to be some kind of memory leak just saving an alignment file! It turns out that Sequences created in the Python layer start out with an extra reference count and therefore never get garbage collected. It took me a while to find and quash that bug, but unfortunately it has virtually no effect on the sequence-display performance problems described in this ticket, so further investigation will be necessary.
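To illustrate why one stray reference count keeps an object alive forever, here is a small CPython-only sketch. It does not use the ChimeraX Sequence class or its C++ layer; the `Sequence` stand-in and the use of `ctypes` to add the extra reference are assumptions made purely for demonstration.

```python
import ctypes
import gc
import weakref


class Sequence:
    """Stand-in for a sequence object; not the ChimeraX class."""


seq = Sequence()
probe = weakref.ref(seq)  # lets us check whether the object is still alive

# Simulate the bug: take one extra reference that nothing will ever release.
ctypes.pythonapi.Py_IncRef(ctypes.py_object(seq))

del seq
gc.collect()
leaked = probe() is not None  # the stray reference keeps the object alive

# Simulate the fix: drop the stray reference so the count can reach zero.
obj = probe()
ctypes.pythonapi.Py_DecRef(ctypes.py_object(obj))
del obj
gc.collect()
freed = probe() is None
```

The garbage collector cannot help in the leaked case because the object is not part of a cycle; its count simply never reaches zero.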
comment:5 by , 3 years ago
Well, measuring the memory use today, it's 2.4GB. So it's 1.0GB less than 17 months ago without me doing anything particular that I can remember. Is it just Qt5→Qt6, or did I forget doing something relevant in that time period? At any rate, 2.4GB is _still_ too large (about a 2.1GB increase), so investigation is still warranted.
comment:6 by , 3 years ago
Well, 1.5GB of that is just asking Qt to put the letters on the canvas, and the rest is Python overhead tracking those letters, so I don't know that I can make tremendous inroads on memory use without a major rewrite to not use individual letters. Time probably better spent on implementing Profile Grids.
comment:7 by , 3 years ago
Agree, maybe profile grids are a better direction. But I think there is an opportunity to acquire lots of devoted users by making working with alignments of thousands of sequences fast. In AlphaFold structure prediction, alignments of 10,000 sequences are the norm and I think there is lots of valuable info in those alignments. AlphaFold is believed to figure out the structure largely from seeing pairwise correlations in residue mutations. Profile grids will throw away that info. The data files for 10,000 sequences are no larger than typical PDB files, so naively the idea of working with them interactively, where most operations run in under a second, seems practical. Granted it may be too hard to implement.
I think deep sequence alignments are only going to become more important and I'll be trying to look at them. I think it is an interesting long term direction (years) to try to handle deep sequence alignments at interactive speeds. But you (Eric) would be the one to decide if that is a direction to head in.
comment:8 by , 3 years ago
I'm not sure where you're getting the idea that profile grids will be throwing away information. It still has the complete alignment information, it's just depicting it differently. With a profile grid, you can easily filter down to a subalignment with a mutation at a particular position to see what the profile at other positions looks like (depicted either as a profile grid or a "classic" alignment). I'd say it's easier to ferret out mutation correlations with a profile grid than a regular alignment.
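As a rough sketch of the filtering idea just described (not the ChimeraX Profile Grid implementation), a profile can be treated as per-column residue counts, and a subalignment can be selected by the residue at one column. The function names and the toy alignment are made up for illustration.

```python
from collections import Counter


def profile(seqs):
    """Per-column residue counts: one Counter per alignment column."""
    return [Counter(column) for column in zip(*seqs)]


def with_residue(seqs, column, residue):
    """Subalignment of sequences that have `residue` at 0-based `column`."""
    return [s for s in seqs if s[column] == residue]


# Toy alignment (invented for this example): columns 1 and 3 co-vary.
alignment = ["AKTD", "AKTD", "ARTE", "ARTE", "AKSD"]

full = profile(alignment)
sub = profile(with_residue(alignment, 1, "R"))
```

Here the profile of the filtered subalignment at column 3 contains only E, while the full profile at that column mixes D and E, which is the kind of correlation the filtering is meant to expose.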
follow-up: 9 comment:9 by , 3 years ago
Good point. Of course there is no need to throw away the fine grain information.
I guess the first question is what were you trying to get done by opening this large alignment? There might be better or more direct means we could implement to get what you needed done than via the alignment.
When I ported the viewer to ChimeraX I wanted the same capabilities as in Chimera, and two of those capabilities were individual alignment letter coloring and tooltips (showing structure association info). This means that each letter in the alignment is its own graphics item. Once the port was well underway and initial versions of the viewer were available, Elaine said she preferred black lettering for the alignment sequences. Removing the individual color requirement means that if I change to a slightly uglier fixed-width font I could lay out the sequences in large text blocks, which would be far more efficient. I would have to handle the tooltips in a far more manual fashion to show the per-letter association, and rework the quite complicated layout machinery, but I would like better performance.
Alternatively, it may just be better and more informative to use Profile Grids for large alignments.
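For the fixed-width text-block idea mentioned above, the manual tooltip handling amounts to converting a mouse position into a (sequence, column) pair by arithmetic instead of asking a per-letter graphics item. This is a sketch under assumed geometry (rows stacked from y = 0, text starting at x = 0); the function names are hypothetical and no Qt calls are involved.

```python
def item_counts(num_seqs, num_columns):
    """Graphics items needed: one per letter vs. one text block per sequence."""
    return num_seqs * num_columns, num_seqs


def letter_at(x, y, char_width, line_height, num_seqs, num_columns):
    """Map a mouse position to (sequence index, column index), or None.

    Assumes a fixed-width font, rows stacked from y = 0, text starting at x = 0.
    """
    if x < 0 or y < 0:
        return None
    row = int(y // line_height)
    col = int(x // char_width)
    if row >= num_seqs or col >= num_columns:
        return None
    return row, col
```

For the alignment in this ticket, `item_counts(1000, 1300)` gives 1,300,000 per-letter items against 1000 text blocks, which is where the efficiency gain would come from.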