Changes between Version 3 and Version 4 of Performance


Timestamp: Feb 17, 2011, 8:10:53 PM
Author: goddard

== Chimera Speed and Memory Use ==

What causes Chimera to run slowly or run out of memory?  This page is intended as a guide
for users and developers of Chimera.

 * [#GraphicsSpeed Graphics speed]

Displaying all atoms of a large molecule (e.g. 100,000 atoms) can cause slow rendering.
The ''wire'' display style is fastest, while ''stick'', ''ball and stick'', and ''sphere''
display styles are typically 10 to 100 times slower.  It would be possible to display those
styles perhaps 5 times faster using different OpenGL drawing techniques.  Currently each sphere
or cylinder is placed with a matrix, and the list of matrices is not kept on the graphics card.
Lack of hardware acceleration of the matrix stack is the main bottleneck for those styles.
Showing only the backbone of a protein using ''ribbon'' style is faster than showing all atoms
(except for ''wire'').  The Chimera '''subdivision quality''' setting controls the smoothness
of these molecule depictions, and higher values cause slower display.
== Calculation Speed ==

Chimera primarily performs computations that take less than a second on modest-size data.
For large data sets (molecules of 100,000 atoms, 512^3^ volumes) simple operations like
coloring atoms can take many seconds.  For molecules, many operations are done in Python,
where any operation over all atoms can be slow.  For volume data, operations are done in
C++ for optimal speed.  Almost no calculations use more than one CPU.
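The Python-versus-C++ cost difference can be sketched with a toy per-atom operation.  The `Atom` class below is hypothetical, not Chimera's API; it just illustrates a per-object interpreted loop versus an array-at-a-time bulk operation:

```python
# Toy illustration of why per-atom Python loops are slow compared to
# array-at-a-time operations.  The Atom class is hypothetical -- Chimera's
# real atom objects carry far more state (hence ~2-4 Kbytes each).
import time
import numpy as np

class Atom:
    __slots__ = ('color',)
    def __init__(self):
        self.color = (1.0, 1.0, 1.0)

n = 100_000
atoms = [Atom() for _ in range(n)]

# Per-object Python loop: one interpreted attribute store per atom.
t0 = time.perf_counter()
for a in atoms:
    a.color = (1.0, 0.0, 0.0)
loop_time = time.perf_counter() - t0

# Array-at-a-time alternative (what a C++/array representation enables):
# one bulk assignment instead of 100,000 interpreted iterations.
colors = np.empty((n, 3), np.float32)
t0 = time.perf_counter()
colors[:] = (1.0, 0.0, 0.0)
array_time = time.perf_counter() - t0

print(f'loop {loop_time:.4f}s  array {array_time:.4f}s')
```

The bulk assignment typically runs orders of magnitude faster, since the per-atom work happens in compiled code rather than the interpreter.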

== Memory Requirements ==

Molecules require about 2 Kbytes of memory per atom with 32-bit Chimera and about
4 Kbytes per atom with 64-bit Chimera.  500,000 atoms will use 1-2 Gbytes of memory.
A memory-efficient implementation would use about 200 bytes per atom, about 10 times
less memory.  The large memory use is primarily because each atom (and bond) has an
associated object in Python.  The memory use from the C++ data representation is
(probably) much smaller.  Having the individual atoms accessible in Python makes it
easy to add new features to Chimera.  It would be possible to create Python atom
data structures only when needed.  If basic calculations over atoms (at least ones
done for every opened molecule) were ported to C++, this would allow working with
larger numbers of atoms.  Advanced, unoptimized features would still use Python atoms.
This optimization would speed up basic operations on large atomic models and still
allow slow unoptimized computations if enough memory is available.
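The per-atom figures translate directly into totals; a quick back-of-envelope check using the approximate numbers quoted above:

```python
# Back-of-envelope memory totals from the per-atom figures quoted above.
atoms = 500_000
kb = 1024

gb_32bit = atoms * 2 * kb / 1024**3     # ~2 Kbytes/atom, 32-bit Chimera
gb_64bit = atoms * 4 * kb / 1024**3     # ~4 Kbytes/atom, 64-bit Chimera
gb_compact = atoms * 200 / 1024**3      # hypothetical ~200-byte/atom layout

print(f'32-bit:  {gb_32bit:.2f} GB')    # 0.95 GB
print(f'64-bit:  {gb_64bit:.2f} GB')    # 1.91 GB
print(f'compact: {gb_compact:.3f} GB')  # 0.093 GB
```

This reproduces the 1-2 Gbyte range for 500,000 atoms, and shows the roughly 10-fold saving a compact representation would give.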

Volume data displayed at full resolution usually takes an amount of memory equal to
the data file size.  For text file formats (e.g. APBS) the in-memory size will be much
(4x) smaller.  If volume data is displayed subsampled using a step size > 1, then only
the needed data is read in.  For step size 2, only 1/8 of the full data size is read
into memory.  If a sub-region of a volume is shown, only that data is read in.  The
native numeric type of the volume data is preserved -- for example, 8-bit data is
represented in memory using 8 bits per value.  Volume operations do not convert the
data to floating point (4 bytes per value).  Interpolating a volume data set, even at
just one point, will cause the full data set to be read in.
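The step-size arithmetic can be checked with a small NumPy sketch.  Chimera subsamples at read time rather than slicing a loaded array; slicing here only illustrates the memory arithmetic:

```python
import numpy as np

# Full-resolution 8-bit volume: the native numeric type is preserved,
# so each value costs exactly one byte in memory.
full = np.zeros((64, 64, 64), dtype=np.int8)
print(full.nbytes)                      # 262144 bytes (64^3)

# Step size 2 keeps every other sample along each axis: 1/8 of the data.
# (Chimera reads only these samples from disk; the slice here just
# demonstrates the arithmetic.)
sub = full[::2, ::2, ::2].copy()
print(sub.nbytes)                       # 32768 bytes = 262144 / 8

# Converting 8-bit data to floating point would quadruple memory use,
# which is why volume operations avoid the conversion.
print(full.astype(np.float32).nbytes)   # 1048576 bytes
```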

== 32-bit versus 64-bit Chimera binaries == #bit64

Chimera 32-bit and 64-bit binaries are available for Windows, Linux and Mac operating
systems.  Windows 7, any 64-bit Linux, or Mac OS 10.6 is required to run the 64-bit binaries.
If you have 4 Gbytes or more of physical memory, we recommend the 64-bit Chimera versions.
The speed is nearly the same for 32-bit and 64-bit versions (within 10% in tests).  The
primary limitation of 32-bit versions is that they can only address 4 Gbytes of memory.  This
typically allows only about 1 Gbyte of data in Chimera, because the address space is used
for other purposes (shared libraries, stack, code, sometimes the operating system) and
becomes fragmented so that no large contiguous block is available.  With 64-bit Chimera
versions the address space is 4 billion times bigger, so all physical memory can be
used -- with 4 Gbytes of memory, probably about 3 Gbytes will be usable for data, and with
more than 4 Gbytes of memory, all memory will be usable.

A drawback of 64-bit Chimera is that molecule data appears to take about 1.7 times
more memory than in 32-bit Chimera.  Some extra memory is expected, since pointers take
twice as much space, but we don't have definitive measurements of how much is actually
being used, because in tests on Mac OS the system memory allocator for 64-bit binaries
may be allocating larger blocks.

== Multi-core Calculations == #multicore

Current (2011) desktop computers have 4 or 8 CPU cores, and laptops typically have 2 cores.
Future increases in computation speed will mostly come from having more cores, rather than
from the faster individual cores that accounted for past speed-ups in computer hardware.
     107
     108Most Chimera computations scale linearly with the size of the data, and after optimized
     109for a single thread (e.g. by coding in C++ instead of Python) tend to be memory-bandwidth
     110limited rather than compute-bound.

Currently only one Chimera calculation, Coulomb electrostatic potential, uses multiple CPU
cores.  On a 4-core machine the speed-up is about 3.5 times, and on 16 cores about 14 times.
The parallelization uses OpenMP and is only in 64-bit Chimera versions.  For Intel
processors with hyperthreading, the operating system will often report twice the number
of actual physical cores.  Hyperthreading does not speed up the Coulomb calculation; the
speed-up factor is approximately equal to the number of actual physical cores.
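The kind of decomposition such a parallelization uses can be sketched in Python: split the potential grid into one z-slab per core, compute each slab independently, and stitch the results together.  The slabs are computed serially here for clarity (the real code runs them as OpenMP threads in C++), and the charges and grid size are made-up test data:

```python
# Sketch of splitting a Coulomb-potential grid calculation into z-slabs,
# one per core.  Shown serially; Chimera's version does the same
# partitioning with OpenMP threads in C++.  Charges are made-up test data.
import numpy as np

charges = [(1.0, (3.0, 3.0, 3.0)), (-1.0, (8.0, 8.0, 8.0))]  # (q, position)
nz, ny, nx = 16, 16, 16

def potential_slab(z0, z1):
    """Coulomb potential sum(q/r) over charges, for grid planes z0..z1-1."""
    z, y, x = np.mgrid[z0:z1, 0:ny, 0:nx].astype(float)
    v = np.zeros((z1 - z0, ny, nx))
    for q, (cx, cy, cz) in charges:
        r = np.sqrt((x - cx)**2 + (y - cy)**2 + (z - cz)**2)
        v += q / np.maximum(r, 1e-6)   # clamp to avoid dividing by zero
    return v

cores = 4
bounds = np.linspace(0, nz, cores + 1).astype(int)  # slab boundaries
slabs = [potential_slab(bounds[i], bounds[i + 1]) for i in range(cores)]
parallel_v = np.concatenate(slabs)    # stitch slabs back together

# The decomposed result matches the one-shot calculation exactly.
assert np.allclose(parallel_v, potential_slab(0, nz))
```

Because each slab touches disjoint output memory and only reads the shared charge list, the slabs can run concurrently with no locking, which is why the speed-up tracks the physical core count.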

It may be that energy minimization done in Chimera with the third-party MMTK toolkit can
use multiple cores.

Two other multi-core computing tests have been tried in Chimera.  Volume contour surface
calculation ran about 1.7 times faster using 4 cores.  The extra code to stitch together
the surfaces from the separate cores has not been written and would make the total speed-up
somewhat less.  The computation appears to be memory-bandwidth limited, so additional cores
may or may not provide additional speed-up, depending on utilization of separate memory
caches per core or per CPU.  Current Intel i7 processors have 256 Kbytes of L2 cache per
core.  This may be too small to relieve memory bottlenecks when chunking contour
calculations into separate z-slabs for each core.

Fitting atomic models in volume data was sped up about 20% using 2 cores and just 15% using
4 cores(!).  The test case was fitting from 64 starting rotational orientations using the
Fit to Segments tool with rotational search enabled.  The fitting calculation is partly in
Python and partly in C++, and this test used the Python threading module.  The poor results
were probably because Python code can only run in one thread at a time, combined with the
fact that running Python code in multiple threads causes about a 2x slow-down from
context-switching overhead (as explained by Dave Beazley).  Tests showed that 2- or 4-core
calculations where the C++ code did not release the Python global interpreter lock (GIL)
were about 2 times slower than the same calculation on a single core.  If the fitting spends
40% of its time in Python and 60% in C++, then no matter how many cores are used, the Python
part alone will still take at least 80% of the single-thread time (40% doubled by the
threading slow-down), limiting the overall speed-up.  Porting the Python part of the fitting
code to C++ would involve translating about 500 lines of Python.  The obtainable speed-up is
not certain, since fitting may be memory-bandwidth limited.  The main step is interpolation
of volume data values and volume gradient values at points corresponding to atom locations,
which is not compute-intensive (tri-linear interpolation using the 8 nearest grid values).
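The 80% figure is an Amdahl-style bound; a small worked check, using the fractions assumed above:

```python
# Amdahl-style bound for the threaded fitting test: the Python fraction
# cannot run in parallel (GIL) and is assumed to slow down ~2x under
# threading; only the C++ fraction, which releases the GIL, scales with
# core count.  Fractions are the ones assumed in the text.
python_frac = 0.40   # fraction of single-thread time spent in Python
cpp_frac = 0.60      # fraction spent in GIL-releasing C++

def threaded_time(cores):
    """Run time relative to single-thread time, on the given core count."""
    return 2 * python_frac + cpp_frac / cores

# Even with unlimited cores, run time never drops below 80% of the
# single-thread time, so the best possible speed-up is 1/0.8 = 1.25x.
print(threaded_time(10**9))            # -> ~0.8
print(round(1 / threaded_time(4), 2))  # -> 1.05 (best case on 4 cores)
```

Under this model, 4 cores can give at most a few percent improvement, consistent with the meager 15-20% observed (the measured runs evidently paid less than the full 2x Python penalty).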

A good candidate for another multi-core test is the molecular trajectory RMSD map, which
computes all-by-all trajectory frame RMSD values.