Chimera Speed and Memory Use
What causes Chimera to run slowly or run out of memory? This page is intended as a guide for users and developers of Chimera.
- Graphics speed
- Calculation speed
- Memory requirements
- 32-bit versus 64-bit Chimera binaries
- Multi-core calculations
Graphics Speed
Why is rotating the scene slow? Large molecules or volume data can slow display updating below the normal 30 frames per second.
Graphics card. Drawing speed is usually limited by the speed of the graphics card. See the table of benchmark results for different graphics cards running Chimera. Chimera uses OpenGL for 3-dimensional graphics, so a card that is fast for playing video games is likely to be fast for Chimera. Integrated Intel graphics in laptops and desktops are usually the slowest. Graphics cards from Nvidia and ATI are common and work well. On Linux, installing a vendor graphics driver is usually necessary: if the Chimera menu entry Help / Report a Bug lists the OpenGL renderer as Mesa in the Gathered Information section, then you need to install a graphics driver.
Selected objects. The green outline around selected objects is created by drawing the selected objects 5 times. If everything is selected, rendering can be 5 times slower.
Transparency. Rotating large transparent surfaces can be slow. For each new orientation, the triangles making up the surface are sorted by depth and drawn in order from farthest to nearest. The sorting is done on the CPU, not the graphics card, and the time to sort at every redraw often limits the frame rate.
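Below is a minimal NumPy sketch of that per-frame sorting step, assuming the surface is given as vertex and triangle arrays; it illustrates the technique, not Chimera's actual code.

```python
import numpy as np

def depth_sorted(vertices, triangles, view_dir):
    """Order triangles farthest to nearest along the view direction.

    vertices  : (V, 3) array of vertex positions
    triangles : (T, 3) array of vertex indices, one row per triangle
    view_dir  : (3,) unit vector pointing from the eye into the scene
    """
    # Depth of a triangle = projection of its centroid onto the view direction.
    centroids = vertices[triangles].mean(axis=1)            # (T, 3)
    depths = centroids @ np.asarray(view_dir, dtype=float)  # (T,)
    # Farthest first, so nearer transparent triangles blend over farther ones.
    return triangles[np.argsort(-depths)]

# Example: 2 triangles on 4 vertices, viewed along +z.
v = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 5]], dtype=float)
t = np.array([[0, 1, 2], [0, 1, 3]])
print(depth_sorted(v, t, [0, 0, 1]))   # the triangle using the deep vertex comes first
```

Even vectorized, this sort costs O(T log T) on the CPU for every redraw, which is why large transparent surfaces rotate slowly.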
Molecule display styles. Displaying all atoms of a large molecule (e.g. 100,000 atoms) can cause slow rendering. The wire display style is fastest, while stick, ball-and-stick, and sphere styles are typically 10 to 100 times slower. It would be possible to display those styles perhaps 5 times faster using different OpenGL drawing techniques: currently each sphere or cylinder is positioned with a matrix, the list of matrices is not kept on the graphics card, and the lack of hardware acceleration of the matrix stack is the main bottleneck for those styles. Showing only the backbone of a protein in ribbon style is faster than showing all atoms (except wire). The Chimera subdivision quality setting controls the smoothness of these molecule depictions; higher values cause slower display.
Calculation Speed
Chimera primarily performs computations that take less than a second on modest-size data. For large data sets (molecules of 100,000 atoms, 512³ volumes) even simple operations like coloring atoms can take many seconds. Many molecule operations are done in Python, where looping over all atoms is slow; volume data operations are done in C++ for optimal speed. Almost no Chimera calculations use more than one CPU.
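The Python-versus-C difference is easy to demonstrate; the Atom class below is a hypothetical stand-in for Chimera's per-atom objects, not its real API.

```python
import time
import numpy as np

class Atom:                        # hypothetical stand-in, not Chimera's Atom
    def __init__(self):
        self.color = None

atoms = [Atom() for _ in range(100000)]

t0 = time.time()
for a in atoms:                    # one interpreted loop iteration per atom
    a.color = (1.0, 0.0, 0.0)
t1 = time.time()

colors = np.empty((100000, 3))
t2 = time.time()
colors[:] = (1.0, 0.0, 0.0)        # one C-level operation covering all atoms
t3 = time.time()

print('Python loop %.3f s, array assignment %.5f s' % (t1 - t0, t3 - t2))
```

On typical hardware the array assignment is orders of magnitude faster, which is why the C++ volume code stays fast on large data while per-atom Python operations do not.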
Memory Requirements
Molecules require about 2 Kbytes of memory per atom with 32-bit Chimera and about 4 Kbytes per atom with 64-bit Chimera, so 500,000 atoms use 1-2 Gbytes of memory. A memory-efficient implementation would use about 200 bytes per atom, roughly 10 times less. The large memory use is primarily because each atom (and bond) has an associated Python object; the memory used by the C++ data representation is (probably) much smaller. Having individual atoms accessible from Python makes it easy to add new features to Chimera. It would be possible to create the Python atom data structures only when needed: if the basic calculations over atoms (at least those done for every opened molecule) were ported to C++, Chimera could work with larger numbers of atoms, while advanced, unoptimized features would still use Python atoms. This optimization would speed up basic operations on large atomic models and still allow slow, unoptimized computations when enough memory is available.
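The per-object overhead can be measured directly; the Atom class below is a hypothetical minimal example, not Chimera's data structure.

```python
import sys
import numpy as np

class Atom:                          # hypothetical minimal atom object
    def __init__(self, x, y, z, element):
        self.x, self.y, self.z = x, y, z
        self.element = element

a = Atom(0.0, 0.0, 0.0, 'C')
# Object header plus per-instance attribute dictionary (attribute values extra).
print('Python object:', sys.getsizeof(a) + sys.getsizeof(a.__dict__), 'bytes')

# A packed C-style record: three float64 coordinates plus a 1-byte element code.
rec = np.zeros(1, dtype=[('xyz', 'f8', 3), ('element', 'u1')])
print('packed record:', rec.itemsize, 'bytes')
```

A real Chimera atom carries far more attributes than this toy example, but the ratio between a boxed Python object and a packed record is the source of the roughly 10x difference cited above.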
Volume data displayed at full resolution usually takes an amount of memory equal to the data file size. For text file formats (e.g. APBS), the in-memory size will be much smaller (about 4x), since the text encoding of a number is larger than its binary value. If volume data is displayed subsampled with step size > 1, only the needed data is read in: at step size 2, only 1/8 of the full data is read into memory. If a subregion of a volume is shown, only that data is read in. The native numeric type of the volume data is preserved -- for example, 8-bit data is represented in memory using 8 bits per value; volume operations do not convert the data to floating point (4 bytes per value). Interpolating a volume data set, even at just one point, causes the full data set to be read in.
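The step-size arithmetic is easy to see with a NumPy array standing in for the data file (in Chimera only the subsampled values are actually read from disk; the slicing here just illustrates the 1/8 factor):

```python
import numpy as np

full = np.zeros((512, 512, 512), dtype=np.uint8)   # 512^3 bytes = 128 Mbytes
step2 = full[::2, ::2, ::2].copy()                  # every 2nd value on each axis

print(full.nbytes // 2**20, 'Mbytes at step 1')     # 128
print(step2.nbytes // 2**20, 'Mbytes at step 2')    # 16, i.e. 1/8 of the data
print(step2.dtype)                                   # uint8: native type preserved
```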
32-bit versus 64-bit Chimera binaries
Chimera 32-bit and 64-bit binaries are available for Windows, Linux, and Mac operating systems. Windows 7, any 64-bit Linux, or Mac OS 10.6 is required to run the 64-bit binaries. If you have 4 Gbytes or more of physical memory, we recommend the 64-bit Chimera versions. Speed is nearly the same for the 32-bit and 64-bit versions (within 10% in tests). The primary advantage of 64-bit is addressing: 32-bit versions can only address 4 Gbytes of memory, which in practice allows only about 1 Gbyte of data in Chimera, because the address space is also used for other purposes (shared libraries, stack, code, sometimes the operating system) and becomes fragmented so that no large contiguous block is available. The 64-bit address space is about 4 billion times larger, so all physical memory can be used: with 4 Gbytes of memory, probably about 3 Gbytes will be usable for data, and with more than 4 Gbytes of memory, all of it will be usable.
A drawback of the 64-bit Chimera versions is that molecule data appears to take about 1.7 times more memory than in 32-bit Chimera. Some additional memory is expected, since pointers take twice as much space. We do not have definitive measurements of how much extra is used, because in tests on Mac OS the 64-bit system memory allocator may be allocating larger blocks.
Multi-core Calculations
Current (2011) desktop computers have 4 or 8 CPU cores, and laptops typically have 2. Future increases in computation speed will come mostly from having more cores, rather than from the faster individual cores that accounted for past speed-ups in computer hardware.
Most Chimera computations scale linearly with the size of the data, and once optimized for a single thread (e.g. by coding in C++ instead of Python) they tend to be limited by memory bandwidth rather than compute-bound.
Currently only one Chimera calculation, the Coulomb electrostatic potential, uses multiple CPU cores. On a 4-core machine the speed-up is about 3.5 times, and on 16 cores about 14 times. The parallelization uses OpenMP and is only in the 64-bit Chimera versions. For Intel processors with hyperthreading, the operating system often reports twice the number of actual physical cores; hyperthreading does not speed up the Coulomb calculation, and the speed-up factor is approximately equal to the number of actual physical cores.
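The OpenMP code itself is C++, but the chunking strategy can be sketched in Python using multiprocessing (which, unlike threads, sidesteps the GIL discussed below); the charges, positions, and grid here are hypothetical stand-ins.

```python
import numpy as np
from multiprocessing import Pool

rng = np.random.default_rng(0)        # fixed seed: spawned workers rebuild the same data
charges = rng.random(200)             # hypothetical atomic partial charges
positions = rng.random((200, 3)) * 50.0

def potential(points):
    """Coulomb sum q_i / d_i at each point (constants and units omitted)."""
    d = np.linalg.norm(points[:, None, :] - positions[None, :, :], axis=2)
    return (charges / d).sum(axis=1)

if __name__ == '__main__':
    grid = rng.random((8000, 3)) * 50.0          # flattened potential grid points
    with Pool(4) as pool:                        # one worker per physical core
        parts = pool.map(potential, np.array_split(grid, 4))
    phi = np.concatenate(parts)
    print(phi.shape)                             # (8000,)
```

Each grid chunk is independent of the others, so the work divides cleanly, matching the near-linear speed-ups reported above.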
It may be that energy minimization done in Chimera with the third-party MMTK toolkit can use multiple cores.
Two other multi-core computing tests have been tried in Chimera. Volume contour surface calculation ran about 1.7 times faster using 4 cores; the extra code to stitch together the surfaces from the separate cores has not been written, and it would make the total speed-up somewhat less. The computation appears to be memory-bandwidth limited, so additional cores may or may not help, depending on how well the separate per-core or per-CPU memory caches are used. Current Intel i7 processors have 256 Kbytes of L2 cache per core, which may be too small to relieve the memory bottleneck when the contour calculation is chunked into a separate z-slab for each core.
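A hedged sketch of that z-slab decomposition, using scikit-image's marching_cubes as a stand-in for Chimera's C++ contour code; the slabs share one plane of overlap so their surfaces meet, and the unwritten stitching step would merge the duplicated boundary vertices.

```python
import numpy as np
from skimage.measure import marching_cubes   # stand-in for Chimera's contour code

volume = np.random.rand(64, 64, 64)
nslabs = 4                                   # one slab per core in the real test
edges = np.linspace(0, volume.shape[0], nslabs + 1).astype(int)

pieces = []
for k in range(nslabs):                      # each iteration could run on its own core
    z0 = edges[k]
    z1 = min(edges[k + 1] + 1, volume.shape[0])     # one plane of overlap
    verts, faces, normals, values = marching_cubes(volume[z0:z1], level=0.5)
    verts[:, 0] += z0                        # shift the slab back into volume coordinates
    pieces.append((verts, faces))

print(sum(len(v) for v, _ in pieces), 'vertices across', nslabs, 'slabs')
```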
Fitting atomic models in volume data was sped up about 20% using 2 cores, and only 15% using 4 cores(!). The test case was fitting from 64 starting rotational orientations using the Fit to Segments tool with rotational search enabled. The fitting calculation is partly in Python and partly in C++, and the test used the Python threading module. The poor results are probably because Python code can only run in one thread at a time, and running Python code from multiple threads causes about a 2x slow-down due to context-switching overhead (as explained by Dave Beazley). Tests showed that 2- or 4-core calculations in which the C++ code did not release the Python global interpreter lock (GIL) were about 2 times slower than the same calculation on a single core. So if the fitting spends 40% of its time in Python and 60% in C++, then no matter how many cores are used, the Python part alone will take at least 80% of the original single-thread time (the 40% doubled by thread contention), limiting the overall speed-up. Porting the Python part of the fitting code to C++ would mean translating about 500 lines of Python. The obtainable speed-up is not certain, since the fitting may be memory-bandwidth limited: the main step is interpolating volume data values and volume gradient values at points corresponding to atom locations, which is not compute-intensive (trilinear interpolation using the 8 nearest grid values).
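The GIL effect described above is easy to reproduce with a pure-Python workload; this is essentially Beazley's demonstration.

```python
import time
from threading import Thread

def count(n):                     # pure Python: holds the GIL while it runs
    while n > 0:
        n -= 1

N = 10**7

t0 = time.time()
count(N); count(N)                # the same work done sequentially
t1 = time.time()

a = Thread(target=count, args=(N,))
b = Thread(target=count, args=(N,))
t2 = time.time()
a.start(); b.start(); a.join(); b.join()   # two threads contending for one GIL
t3 = time.time()

print('sequential %.2f s, two threads %.2f s' % (t1 - t0, t3 - t2))
```

The threaded version is no faster, and on multi-core machines it is often slower than the sequential one because the threads fight over the lock.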
A good candidate for another multi-core test is the molecular trajectory RMSD map, which computes RMSD values between all pairs of trajectory frames.
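A minimal sketch of that computation, assuming the frames are already superimposed (a real RMSD map would first fit each pair of frames); the arrays here are hypothetical.

```python
import numpy as np

def rmsd_map(frames):
    """All-by-all RMSD for frames: an (F, N, 3) array of F frames, N atoms."""
    F, N = frames.shape[0], frames.shape[1]
    m = np.zeros((F, F))
    for i in range(F):             # rows are independent: one row per core
        d = frames - frames[i]                      # (F, N, 3) displacements
        m[i] = np.sqrt((d * d).sum(axis=(1, 2)) / N)
    return m

frames = np.random.rand(100, 500, 3)   # 100 trajectory frames, 500 atoms
print(rmsd_map(frames).shape)          # (100, 100)
```

Each row of the map depends only on one frame and the full trajectory, so the rows divide cleanly across cores.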