#4678 accepted defect

Large alignment performance

Reported by:	goddard@…	Owned by:	Eric Pettersen
Priority:	normal	Milestone:
Component:	Sequence	Version:
Keywords:		Cc:	Elaine Meng
Blocked By:		Blocking:
Notify when closed:		Platform:	all
Project:	ChimeraX

Description

The following bug report has been submitted:
Platform:        macOS-10.15.7-x86_64-i386-64bit
ChimeraX Version: 1.3.dev202105180049 (2021-05-18 00:49:36 UTC)
Description
Working with a largish sequence alignment is excruciatingly slow and takes an unexpectedly enormous amount of memory.  I opened 6vxx, did a BLAST search and displayed a sequence alignment of the ~1000 sequences.  This is 1000 sequences of 1300 amino acids, about one Mbyte of data.  It took about 5 minutes, with no progress messages, and 13 Gbytes of memory to show the sequence alignment.  Switching conservation headers to identity and then back to AL2CO took about 3 minutes for each change, and now ChimeraX is using 26 Gbytes.

Log:
UCSF ChimeraX version: 1.3.dev202105180049 (2021-05-18)  
© 2016-2021 Regents of the University of California. All rights reserved.  
How to cite UCSF ChimeraX  

> open 6vxx format mmcif fromDatabase pdb

6vxx title:  
Structure of the SARS-CoV-2 spike glycoprotein (closed state) [more info...]  
  
Chain information for 6vxx #1  
---  
Chain | Description  
A B C | spike glycoprotein  
  
Non-standard residues in 6vxx #1  
---  
NAG — N-acetyl-D-glucosamine  
  

> ui tool show "Blast Protein"

> blastprotein /A database pdb cutoff 1e-3 matrix BLOSUM62 maxSeqs 100 name bp1

Web Service: BlastProtein2 is a Python wrapper that calls blastp to search nr
or pdb for sequences similar to the given protein sequence  
Opal service URL:
http://webservices.rbvi.ucsf.edu/opal2/services/BlastProtein2Service  
Opal job id: appBlastProtein2Service16215643635791603944578  
Opal status URL prefix:
http://webservices.rbvi.ucsf.edu/appBlastProtein2Service16215643635791603944578  
stdout.txt = standard output  
stderr.txt = standard error  
BlastProtein finished.  
Alignment identifier is bp1 [1]  
Associated 6vxx chain A to #1/A with 0 mismatches  
Associated 6vxx chain B to #1/A with 0 mismatches  
Associated 6vxx chain C to #1/A with 0 mismatches  
Showing conservation header ("seq_conservation" residue attribute) for
alignment bp1 [1]  

> color byattribute seq_conservation

23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24  

> select /A-C:355

33 atoms, 30 bonds, 3 residues, 1 model selected  

> select /A-C:355

33 atoms, 30 bonds, 3 residues, 1 model selected  

> color byattribute seq_conservation palette redblue

23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24  

> preset cartoons/nucleotides cylinders/stubs

Changed 0 atom styles  
Preset expands to these ChimeraX commands:

    
    
    show nucleic
    hide protein|solvent|H
    surf hide
    style (protein|nucleic|solvent) & @@draw_mode=0 stick
    cartoon
    cartoon style modeh def arrows t arrowshelix f arrowscale 2 wid 2 thick 0.4 sides 12 div 20
    cartoon style ~(nucleic|strand) x round
    cartoon style (nucleic|strand) x rect
    cartoon style protein modeh tube rad 2 sides 24 thick 0.6
    cartoon style nucleic x round width 1.6 thick 1.6
    nucleotides stubs

  

> color byattribute seq_conservation palette 0,red:1,blue

23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24  

> color byattribute seq_conservation palette 0,red:1.0,blue

23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24  

> color byattribute seq_conservation palette -2,red:1.0,blue

23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24  

> color byattribute seq_conservation palette 0,red:1.0,blue

23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24  

> color byattribute seq_conservation palette -1,red:1.0,blue

23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24  

> color byattribute seq_conservation palette cyanmaroon

23694 atoms, 2979 residues, atom seq_conservation range -2.88 to 1.24  

> style sel sphere

Changed 33 atom styles  

> select clear

> style sphere

Changed 23694 atom styles  

> select /C:494

6 atoms, 5 bonds, 1 residue, 1 model selected  

> sequence header conservation setting style "identity histogram"

> color byattribute seq_conservation palette cyanmaroon

23694 atoms, 2979 residues, atom seq_conservation range 0.683 to 0.973  

> sequence header conservation setting style AL2CO




OpenGL version: 4.1 ATI-3.10.19
OpenGL renderer: AMD Radeon Pro Vega 20 OpenGL Engine
OpenGL vendor: ATI Technologies Inc.

Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro15,3
      Processor Name: 8-Core Intel Core i9
      Processor Speed: 2.4 GHz
      Number of Processors: 1
      Total Number of Cores: 8
      L2 Cache (per Core): 256 KB
      L3 Cache: 16 MB
      Hyper-Threading Technology: Enabled
      Memory: 32 GB
      Boot ROM Version: 1554.100.64.0.0 (iBridge: 18.16.14556.0.0,0)

Software:

    System Software Overview:

      System Version: macOS 10.15.7 (19H1030)
      Kernel Version: Darwin 19.6.0
      Time since boot: 49 minutes

Graphics/Displays:

    Intel UHD Graphics 630:

      Chipset Model: Intel UHD Graphics 630
      Type: GPU
      Bus: Built-In
      VRAM (Dynamic, Max): 1536 MB
      Vendor: Intel
      Device ID: 0x3e9b
      Revision ID: 0x0002
      Automatic Graphics Switching: Supported
      gMux Version: 5.0.0
      Metal: Supported, feature set macOS GPUFamily2 v1

    Radeon Pro Vega 20:

      Chipset Model: Radeon Pro Vega 20
      Type: GPU
      Bus: PCIe
      PCIe Lane Width: x8
      VRAM (Total): 4 GB
      Vendor: AMD (0x1002)
      Device ID: 0x69af
      Revision ID: 0x00c0
      ROM Revision: 113-D2060I-087
      VBIOS Version: 113-D20601MA0T-016
      Option ROM Version: 113-D20601MA0T-016
      EFI Driver Version: 01.01.087
      Automatic Graphics Switching: Supported
      gMux Version: 5.0.0
      Metal: Supported, feature set macOS GPUFamily2 v1
      Displays:
        Color LCD:
          Display Type: Built-In Retina LCD
          Resolution: 2880 x 1800 Retina
          Framebuffer Depth: 24-Bit Color (ARGB8888)
          Main Display: Yes
          Mirror: Off
          Online: Yes
          Automatically Adjust Brightness: No
          Connection Type: Internal

Locale: (None, 'UTF-8')
PyQt5 5.15.2, Qt 5.15.2
Installed Packages:
    alabaster: 0.7.12
    appdirs: 1.4.4
    appnope: 0.1.2
    Babel: 2.9.1
    backcall: 0.2.0
    blockdiag: 2.0.1
    certifi: 2020.12.5
    cftime: 1.4.1
    chardet: 4.0.0
    ChimeraX-AddCharge: 1.1.3
    ChimeraX-AddH: 2.1.6
    ChimeraX-AlignmentAlgorithms: 2.0
    ChimeraX-AlignmentHdrs: 3.2
    ChimeraX-AlignmentMatrices: 2.0
    ChimeraX-Alignments: 2.1
    ChimeraX-AmberInfo: 1.0
    ChimeraX-Arrays: 1.0
    ChimeraX-Atomic: 1.20.1
    ChimeraX-AtomicLibrary: 3.2.1
    ChimeraX-AtomSearch: 2.0
    ChimeraX-AtomSearchLibrary: 1.0
    ChimeraX-AxesPlanes: 2.0
    ChimeraX-BasicActions: 1.1
    ChimeraX-BILD: 1.0
    ChimeraX-BlastProtein: 1.1.1
    ChimeraX-BondRot: 2.0
    ChimeraX-BugReporter: 1.0
    ChimeraX-BuildStructure: 2.5.2
    ChimeraX-Bumps: 1.0
    ChimeraX-BundleBuilder: 1.1
    ChimeraX-ButtonPanel: 1.0
    ChimeraX-CageBuilder: 1.0
    ChimeraX-CellPack: 1.0
    ChimeraX-Centroids: 1.1
    ChimeraX-ChemGroup: 2.0
    ChimeraX-Clashes: 2.1
    ChimeraX-ColorActions: 1.0
    ChimeraX-ColorGlobe: 1.0
    ChimeraX-ColorKey: 1.3
    ChimeraX-CommandLine: 1.1.4
    ChimeraX-ConnectStructure: 2.0
    ChimeraX-Contacts: 1.0
    ChimeraX-Core: 1.3.dev202105180049
    ChimeraX-CoreFormats: 1.0
    ChimeraX-coulombic: 1.3
    ChimeraX-Crosslinks: 1.0
    ChimeraX-Crystal: 1.0
    ChimeraX-CrystalContacts: 1.0
    ChimeraX-DataFormats: 1.1
    ChimeraX-Dicom: 1.0
    ChimeraX-DistMonitor: 1.1.3
    ChimeraX-DistUI: 1.0
    ChimeraX-Dssp: 2.0
    ChimeraX-ExperimentalCommands: 1.0
    ChimeraX-FileHistory: 1.0
    ChimeraX-FunctionKey: 1.0
    ChimeraX-Geometry: 1.1
    ChimeraX-gltf: 1.0
    ChimeraX-Graphics: 1.1
    ChimeraX-Hbonds: 2.1
    ChimeraX-Help: 1.1
    ChimeraX-HKCage: 1.3
    ChimeraX-IHM: 1.1
    ChimeraX-ImageFormats: 1.1
    ChimeraX-IMOD: 1.0
    ChimeraX-IO: 1.0.1
    ChimeraX-ItemsInspection: 1.0
    ChimeraX-Label: 1.1
    ChimeraX-ListInfo: 1.1.1
    ChimeraX-Log: 1.1.4
    ChimeraX-LookingGlass: 1.1
    ChimeraX-Maestro: 1.8.1
    ChimeraX-Map: 1.1
    ChimeraX-MapData: 2.0
    ChimeraX-MapEraser: 1.0
    ChimeraX-MapFilter: 2.0
    ChimeraX-MapFit: 2.0
    ChimeraX-MapSeries: 2.1
    ChimeraX-Markers: 1.0
    ChimeraX-Mask: 1.0
    ChimeraX-MatchMaker: 1.2.1
    ChimeraX-MDcrds: 2.2
    ChimeraX-MedicalToolbar: 1.0.1
    ChimeraX-Meeting: 1.0
    ChimeraX-MLP: 1.1
    ChimeraX-mmCIF: 2.3
    ChimeraX-MMTF: 2.1
    ChimeraX-Modeller: 1.0.1
    ChimeraX-ModelPanel: 1.1
    ChimeraX-ModelSeries: 1.0
    ChimeraX-Mol2: 2.0
    ChimeraX-Morph: 1.0
    ChimeraX-MouseModes: 1.1
    ChimeraX-Movie: 1.0
    ChimeraX-Neuron: 1.0
    ChimeraX-Nucleotides: 2.0.1
    ChimeraX-OpenCommand: 1.5
    ChimeraX-PDB: 2.4.1
    ChimeraX-PDBBio: 1.0
    ChimeraX-PDBLibrary: 1.0.1
    ChimeraX-PDBMatrices: 1.0
    ChimeraX-PickBlobs: 1.0
    ChimeraX-Positions: 1.0
    ChimeraX-PresetMgr: 1.0.1
    ChimeraX-PubChem: 2.0.1
    ChimeraX-ReadPbonds: 1.0
    ChimeraX-Registration: 1.1
    ChimeraX-RemoteControl: 1.0
    ChimeraX-ResidueFit: 1.0
    ChimeraX-RestServer: 1.1
    ChimeraX-RNALayout: 1.0
    ChimeraX-RotamerLibMgr: 2.0
    ChimeraX-RotamerLibsDunbrack: 2.0
    ChimeraX-RotamerLibsDynameomics: 2.0
    ChimeraX-RotamerLibsRichardson: 2.0
    ChimeraX-SaveCommand: 1.4
    ChimeraX-SchemeMgr: 1.0
    ChimeraX-SDF: 2.0
    ChimeraX-Segger: 1.0
    ChimeraX-Segment: 1.0
    ChimeraX-SelInspector: 1.0
    ChimeraX-SeqView: 2.4
    ChimeraX-Shape: 1.0.1
    ChimeraX-Shell: 1.0
    ChimeraX-Shortcuts: 1.1
    ChimeraX-ShowAttr: 1.0
    ChimeraX-ShowSequences: 1.0
    ChimeraX-SideView: 1.0
    ChimeraX-Smiles: 2.0.1
    ChimeraX-SmoothLines: 1.0
    ChimeraX-SpaceNavigator: 1.0
    ChimeraX-StdCommands: 1.6
    ChimeraX-STL: 1.0
    ChimeraX-Storm: 1.0
    ChimeraX-Struts: 1.0
    ChimeraX-Surface: 1.0
    ChimeraX-SwapAA: 2.0
    ChimeraX-SwapRes: 2.1
    ChimeraX-TapeMeasure: 1.0
    ChimeraX-Test: 1.0
    ChimeraX-Toolbar: 1.1
    ChimeraX-ToolshedUtils: 1.2
    ChimeraX-Tug: 1.0
    ChimeraX-UI: 1.9.3
    ChimeraX-uniprot: 2.1
    ChimeraX-UnitCell: 1.0
    ChimeraX-ViewDockX: 1.0
    ChimeraX-Vive: 1.1
    ChimeraX-VolumeMenu: 1.0
    ChimeraX-VTK: 1.0
    ChimeraX-WavefrontOBJ: 1.0
    ChimeraX-WebCam: 1.0
    ChimeraX-WebServices: 1.0
    ChimeraX-Zone: 1.0
    colorama: 0.4.4
    comtypes: 1.1.10
    cxservices: 1.0
    cycler: 0.10.0
    Cython: 0.29.23
    decorator: 4.4.2
    distlib: 0.3.1
    docutils: 0.17.1
    filelock: 3.0.12
    funcparserlib: 0.3.6
    grako: 3.16.5
    html2text: 2020.1.16
    idna: 2.10
    ihm: 0.20
    imagecodecs: 2021.4.28
    imagesize: 1.2.0
    ipykernel: 5.5.5
    ipython: 7.23.1
    ipython-genutils: 0.2.0
    jedi: 0.18.0
    Jinja2: 2.11.3
    jupyter-client: 6.1.12
    jupyter-core: 4.7.1
    kiwisolver: 1.3.1
    lxml: 4.6.3
    lz4: 3.1.3
    MarkupSafe: 1.1.1
    matplotlib: 3.4.2
    matplotlib-inline: 0.1.2
    msgpack: 1.0.2
    netCDF4: 1.5.6
    networkx: 2.5.1
    numpy: 1.20.3
    numpydoc: 1.1.0
    openvr: 1.16.801
    packaging: 20.9
    ParmEd: 3.2.0
    parso: 0.8.2
    pexpect: 4.8.0
    pickleshare: 0.7.5
    Pillow: 8.2.0
    pip: 21.1.1
    pkginfo: 1.7.0
    prompt-toolkit: 3.0.18
    psutil: 5.8.0
    ptyprocess: 0.7.0
    pycollada: 0.7.1
    pydicom: 2.1.2
    Pygments: 2.9.0
    PyOpenGL: 3.1.5
    PyOpenGL-accelerate: 3.1.5
    pyparsing: 2.4.7
    PyQt5: 5.15.2
    PyQt5-sip: 12.8.1
    PyQtWebEngine: 5.15.2
    python-dateutil: 2.8.1
    pytz: 2021.1
    pyzmq: 22.0.3
    qtconsole: 5.1.0
    QtPy: 1.9.0
    requests: 2.25.1
    scipy: 1.6.3
    setuptools: 56.2.0
    six: 1.16.0
    snowballstemmer: 2.1.0
    sortedcontainers: 2.3.0
    Sphinx: 4.0.1
    sphinxcontrib-applehelp: 1.0.2
    sphinxcontrib-blockdiag: 2.0.0
    sphinxcontrib-devhelp: 1.0.2
    sphinxcontrib-htmlhelp: 1.0.3
    sphinxcontrib-jsmath: 1.0.1
    sphinxcontrib-qthelp: 1.0.3
    sphinxcontrib-serializinghtml: 1.1.4
    suds-jurko: 0.6
    tifffile: 2021.4.8
    tinyarray: 1.2.3
    tornado: 6.1
    traitlets: 5.0.5
    urllib3: 1.26.4
    wcwidth: 0.2.5
    webcolors: 1.11.1
    wheel: 0.36.2
    wheel-filename: 1.3.0

Change History (9)

comment:1 by Eric Pettersen, 5 years ago

Cc:	Elaine Meng added
Component:	Unassigned → Sequence
Owner:	set to Eric Pettersen
Platform:	→ all
Project:	→ ChimeraX
Status:	new → accepted
Summary:	ChimeraX bug report submission → Large alignment performance

I guess the first question is what were you trying to get done by opening this large alignment? There might be better or more direct means we could implement to get what you needed done than via the alignment.

When I ported the viewer to ChimeraX I wanted the same capabilities as in Chimera, and two of those capabilities were individual alignment letter coloring and tooltips (showing structure association info). This means that each letter in the alignment is its own graphics item. Once the port was well underway and initial versions of the viewer were available, Elaine said she preferred black lettering for the alignment sequences. Removing the individual color requirement means that if I change to a slightly uglier fixed-width font I could lay out the sequences in large text blocks, which would be far more efficient. I would have to handle the tooltips in a far more manual fashion to show the per-letter association, and rework the quite complicated layout machinery, but I would like better performance.
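With a fixed-width font and block text, the per-letter tooltip lookup could in principle be reduced to arithmetic on the mouse position. The following is only a minimal sketch of that idea, assuming every character cell has the same width and height; `letter_at`, `char_width`, and `line_height` are hypothetical names and are not taken from the ChimeraX sequence viewer:

```python
from typing import Optional, Sequence, Tuple


def letter_at(x: float, y: float, rows: Sequence[str],
              char_width: float, line_height: float
              ) -> Optional[Tuple[int, int, str]]:
    """Map a point in block-local coordinates to (row, column, letter).

    Assumes a fixed-width font, so each letter occupies a cell of
    char_width x line_height.  Returns None if the point is outside
    the laid-out text.
    """
    if x < 0 or y < 0:
        return None
    row = int(y // line_height)   # which sequence the pointer is over
    col = int(x // char_width)    # which alignment column
    if row >= len(rows) or col >= len(rows[row]):
        return None
    return row, col, rows[row][col]


# Hypothetical three-sequence alignment drawn as one text block.
rows = ["MFVFLV", "MFIFLL", "MF-FLL"]
hit = letter_at(x=25.0, y=31.0, rows=rows, char_width=10.0, line_height=15.0)
```

The point of the sketch is that one text block per sequence (or per group of sequences) replaces one graphics item per letter, while the (row, column) pair is still enough to look up the structure association for the tooltip.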

Alternatively, it may just be better and more informative to use Profile Grids for large alignments.

Last edited 5 years ago by Eric Pettersen

in reply to: 2 ; follow-up: 2 comment:2 by goddard@…, 5 years ago

I wanted to see the range of mutations available in the PDB on the available SARS-CoV-2 spike protein structures.

I did not suspect that the minutes of waiting and very high memory use were because of the layout of a million characters. I figured it had to do with computation, like aligning the BLAST sequences, or some inefficiency in computing conservation. But I see from your layout description that could be the trouble. I would not have even been able to try this unless my laptop had 32 Gbytes of memory.

I realize getting good performance is hard work. The question is whether ChimeraX sequence tools are intended just to handle a few sequences (e.g. 10). I think in 2021 biology having 1000 sequences aligned is not big. GISAID has over 1 million SARS-CoV-2 sequences and I plan on trying to look at conservation for some large subsets today and color a structure using that info. Large sequence alignments (hundreds of sequences) are common. It would be nice to handle these but maybe it is too difficult. If it is too difficult then we should have a progress indicator so the user knows it is hopeless when they show a large alignment, and they should be able to stop it instead of having to force quit and lose their session. The limitation to just a few sequences will be surprising to a researcher (it was surprising to me who should know better), so not letting the user shoot themselves in the foot is especially important if we want them to be happy with our software.

I guess progress messages would be pretty easy, but stopping is something we have not set up infrastructure for. So maybe when I try to show 1000 sequences as a stopgap it should say "This is going to take about 13 Gbytes of memory and about 5 minutes to display. ChimeraX is not intended to handle large sequence alignments." where the numbers are simply estimated from the number of characters that need to be laid out.
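A rough sketch of how such a stopgap estimate might be computed. It assumes memory and time scale linearly with the number of characters, calibrated only from the single data point in this report (1000 x 1300 characters, 13 Gbytes, 5 minutes); the function names and the `max_gbytes` threshold are hypothetical, not existing ChimeraX code:

```python
# Calibration taken from the one measurement in this ticket (assumption:
# cost grows linearly with the number of characters laid out).
REFERENCE_CHARS = 1000 * 1300
REFERENCE_GBYTES = 13.0
REFERENCE_MINUTES = 5.0


def estimate_display_cost(num_seqs, num_columns):
    """Return (gbytes, minutes) estimated for showing the alignment."""
    scale = (num_seqs * num_columns) / REFERENCE_CHARS
    return REFERENCE_GBYTES * scale, REFERENCE_MINUTES * scale


def large_alignment_warning(num_seqs, num_columns, max_gbytes=1.0):
    """Return a warning string, or None if the alignment looks small enough."""
    gbytes, minutes = estimate_display_cost(num_seqs, num_columns)
    if gbytes <= max_gbytes:
        return None
    return (f"This is going to take about {gbytes:.0f} Gbytes of memory "
            f"and about {minutes:.0f} minutes to display. ChimeraX is not "
            "intended to handle large sequence alignments.")


message = large_alignment_warning(1000, 1300)
```

The calibration constants would need to be re-measured whenever the layout code changes, since they describe one machine and one development build.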

comment:3 by Eric Pettersen, 5 years ago

With the confounding factor of the Sequence.characters memory leak removed, the memory consumption drops to a still-too-large 3.4GB. Changing the Conservation header takes ~6 minutes and raises the memory consumption to 5.73GB. Unfortunately memory management of QGraphicsSceneItems is pretty tricky.

comment:4 by Eric Pettersen, 5 years ago

So, while starting to work on this I wanted to save an alignment file so I wouldn't have to run the Blast repeatedly, and noticed there seemed to be some kind of memory leak just saving an alignment file! It turns out that Sequences created in the Python layer start out with an extra reference count and therefore never get garbage collected. It took me a while to find and quash that bug, but unfortunately it has virtually no effect on the sequence-display performance problems described in this ticket, so further investigation will be necessary.
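The effect of a stray extra reference count can be reproduced in isolation. This is a CPython-specific illustration only: it simulates the extra count with `ctypes.pythonapi.Py_IncRef` on a stand-in `Sequence` class, and is not the actual ChimeraX code path where the bug lived:

```python
import ctypes
import gc
import weakref

# CPython C-API call used here only to simulate a stray extra reference.
_incref = ctypes.pythonapi.Py_IncRef
_incref.argtypes = [ctypes.py_object]
_incref.restype = None


class Sequence:
    """Stand-in for a sequence object; not the ChimeraX class."""

    def __init__(self, characters):
        self.characters = characters


def survives_deletion(add_extra_reference):
    """Return True if the object is still alive after its last name is deleted."""
    seq = Sequence("MFVFLVLLPLVSSQ")
    probe = weakref.ref(seq)      # lets us observe the object without owning it
    if add_extra_reference:
        _incref(seq)              # count is now one higher than it should be
    del seq
    gc.collect()
    return probe() is not None


normal_case = survives_deletion(False)   # object is freed
leaky_case = survives_deletion(True)     # object is never freed
```

Because the count never reaches zero, neither reference counting nor the cycle collector reclaims the object, which matches the "never get garbage collected" behavior described above.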

comment:5 by Eric Pettersen, 3 years ago

Well, measuring the memory use today, it's 2.4GB. So it's 1.0GB less than 17 months ago without me doing anything particular that I can remember. Is it just Qt5→Qt6, or did I forget doing something relevant in that time period? At any rate, 2.4GB is _still_ too large (about a 2.1GB increase), so investigation is still warranted.

comment:6 by Eric Pettersen, 3 years ago

Well, 1.5GB of that is just asking Qt to put the letters on the canvas, and the rest is Python overhead tracking those letters, so I don't know that I can make tremendous inroads on memory use without a major rewrite to not use individual letters. Time probably better spent on implementing Profile Grids.

comment:7 by Tom Goddard, 3 years ago

Agree, maybe profile grids are a better direction. But I think there is an opportunity to acquire lots of devoted users by making working with alignments of thousands of sequences fast. In AlphaFold structure prediction, alignments of 10,000 sequences are the norm and I think there is lots of valuable info in those alignments. AlphaFold is believed to figure out the structure largely from seeing pairwise correlations in residue mutations. Profile grids will throw away that info. The data files for 10,000 sequences are no larger than typical PDB files, so naively the idea of working with them interactively, where most operations run in under a second, seems practical. Granted it may be too hard to implement.

I think deep sequence alignments are only going to become more important and I'll be trying to look at them. I think it is an interesting long term direction (years) to try to handle deep sequence alignments at interactive speeds. But you (Eric) would be the one to decide if that is a direction to head in.

comment:8 by Eric Pettersen, 3 years ago

I'm not sure where you're getting the idea that profile grids will be throwing away information. It still has the complete alignment information, it's just depicting it differently. With a profile grid, you can easily filter down to a subalignment with a mutation at a particular position to see what the profile at other positions looks like (depicted either as a profile grid or a "classic" alignment). I'd say it's easier to ferret out mutation correlations with a profile grid than a regular alignment.
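The filtering described here can be sketched in a few lines. This is only an illustration on a toy alignment held as a list of equal-length strings; `column_profile` and `subalignment_with` are hypothetical helper names, not the Profile Grid implementation:

```python
from collections import Counter


def column_profile(seqs, column):
    """Count how often each residue occurs at one alignment column."""
    return Counter(seq[column] for seq in seqs)


def subalignment_with(seqs, column, residue):
    """Keep only the sequences that have `residue` at `column`."""
    return [seq for seq in seqs if seq[column] == residue]


# Toy alignment: column 0 is A or G, column 2 is D or K.
alignment = ["ACDE", "ACDF", "GCKE", "GCKF", "ACDE"]

full_profile = column_profile(alignment, column=2)
mutants = subalignment_with(alignment, column=0, residue="G")
mutant_profile = column_profile(mutants, column=2)
```

In this toy data the full profile at column 2 is mixed (D and K), but after filtering to sequences with G at column 0 it is entirely K, which is the kind of correlation between positions that the filtering is meant to expose.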

in reply to: 9 ; follow-up: 9 comment:9 by goddard@…, 3 years ago

Good point.  Of course there is no need to throw away the fine-grained information.
