Changes between Version 3 and Version 4 of Performance


Timestamp: Feb 17, 2011, 8:10:53 PM
Author: goddard

== Chimera Speed and Memory Use ==

What causes Chimera to run slowly or run out of memory?  This page is intended as a guide
for users and developers of Chimera.

 * [#GraphicsSpeed Graphics speed]

Displaying all atoms of a large molecule (e.g. 100,000 atoms) can cause slow rendering.
The ''wire'' display style is fastest, while ''stick'', ''ball and stick'', and ''sphere''
display styles are typically 10 to 100 times slower.  It would be possible to display those
styles perhaps 5 times faster using different OpenGL drawing techniques.  Currently each sphere
or cylinder is placed with a matrix, and the list of matrices is not kept on the graphics card.
Lack of hardware acceleration of the matrix stack is the main bottleneck for those styles.
Showing only the backbone of a protein using ''ribbon'' style is faster than showing all atoms
(except for ''wire'').  The Chimera '''subdivision quality''' setting controls the smoothness
of these molecule depictions, and higher values cause slower display.
== Calculation Speed ==

Chimera primarily performs computations that take less than a second on modest-size data.
For large data sets (molecules of 100,000 atoms, 512^3^ volumes) simple operations like
coloring atoms can take many seconds.  For molecules, many operations are done in Python,
where any operation over all atoms can be slow.  For volume data, operations are done in
C++ for optimal speed.  Almost no calculations use more than one CPU.
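The Python-versus-C++ cost difference can be sketched with a toy per-atom operation.  The `Atom` class below is hypothetical, not Chimera's API; it just illustrates a per-object interpreted loop versus an array-at-a-time bulk operation:

```python
# Toy illustration of why per-atom Python loops are slow compared to
# array-at-a-time operations.  The Atom class is hypothetical -- Chimera's
# real atom objects carry far more state (hence ~2-4 Kbytes each).
import time
import numpy as np

class Atom:
    __slots__ = ('color',)
    def __init__(self):
        self.color = (1.0, 1.0, 1.0)

n = 100_000
atoms = [Atom() for _ in range(n)]

# Per-object Python loop: one interpreted attribute store per atom.
t0 = time.perf_counter()
for a in atoms:
    a.color = (1.0, 0.0, 0.0)
loop_time = time.perf_counter() - t0

# Array-at-a-time alternative (what a C++/array representation enables):
# one bulk assignment instead of 100,000 interpreted iterations.
colors = np.empty((n, 3), np.float32)
t0 = time.perf_counter()
colors[:] = (1.0, 0.0, 0.0)
array_time = time.perf_counter() - t0

print(f'loop {loop_time:.4f}s  array {array_time:.4f}s')
```

The bulk assignment typically runs orders of magnitude faster, since the per-atom work happens in compiled code rather than the interpreter.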

== Memory Requirements ==

Molecules require about 2 Kbytes of memory per atom with 32-bit Chimera and about
4 Kbytes per atom with 64-bit Chimera.  500,000 atoms will use 1-2 Gbytes of memory.
A memory-efficient implementation would use about 200 bytes per atom, about 10 times
less memory.  The large memory use is primarily because each atom (and bond) has an
associated object in Python.  The memory use from the C++ data representation is
(probably) much smaller.  Having the individual atoms accessible in Python makes it
easy to add new features to Chimera.  It would be possible to create Python atom
data structures only when needed.  If basic calculations over atoms (at least ones
done for every opened molecule) were ported to C++, this would allow working with
larger numbers of atoms.  Advanced, unoptimized features would still use Python atoms.
This optimization would speed up basic operations on large atomic models and still
allow slow unoptimized computations if enough memory is available.
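The per-atom figures translate directly into totals; a quick back-of-envelope check using the approximate numbers quoted above:

```python
# Back-of-envelope memory totals from the per-atom figures quoted above.
atoms = 500_000
kb = 1024

gb_32bit = atoms * 2 * kb / 1024**3     # ~2 Kbytes/atom, 32-bit Chimera
gb_64bit = atoms * 4 * kb / 1024**3     # ~4 Kbytes/atom, 64-bit Chimera
gb_compact = atoms * 200 / 1024**3      # hypothetical ~200-byte/atom layout

print(f'32-bit:  {gb_32bit:.2f} GB')    # 0.95 GB
print(f'64-bit:  {gb_64bit:.2f} GB')    # 1.91 GB
print(f'compact: {gb_compact:.3f} GB')  # 0.093 GB
```

This reproduces the 1-2 Gbyte range for 500,000 atoms, and shows the roughly 10-fold saving a compact representation would give.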

Volume data displayed at full resolution usually takes an amount of memory equal to
the data file size.  For text file formats (e.g. APBS) the in-memory size will be much
(4x) smaller.  If volume data is displayed subsampled using a step size > 1, then only
the needed data is read in.  For step size 2, only 1/8 of the full data size is read
into memory.  If a sub-region of a volume is shown, only that data is read in.  The
native numeric type of the volume data is preserved -- for example, 8-bit data is
represented in memory using 8 bits per value.  Volume operations do not convert the
data to floating point (4 bytes per value).  Interpolating a volume data set, even at
just one point, will cause the full data set to be read in.
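The step-size arithmetic can be checked with a small NumPy sketch.  Chimera subsamples at read time rather than slicing a loaded array; slicing here only illustrates the memory arithmetic:

```python
import numpy as np

# Full-resolution 8-bit volume: the native numeric type is preserved,
# so each value costs exactly one byte in memory.
full = np.zeros((64, 64, 64), dtype=np.int8)
print(full.nbytes)                      # 262144 bytes (64^3)

# Step size 2 keeps every other sample along each axis: 1/8 of the data.
# (Chimera reads only these samples from disk; the slice here just
# demonstrates the arithmetic.)
sub = full[::2, ::2, ::2].copy()
print(sub.nbytes)                       # 32768 bytes = 262144 / 8

# Converting 8-bit data to floating point would quadruple memory use,
# which is why volume operations avoid the conversion.
print(full.astype(np.float32).nbytes)   # 1048576 bytes
```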

== 32-bit versus 64-bit Chimera binaries == #bit64

Chimera 32-bit and 64-bit binaries are available for Windows, Linux and Mac operating
systems.  Windows 7, any 64-bit Linux, or Mac OS 10.6 is required to run the 64-bit binaries.
If you have 4 Gbytes or more of physical memory, we recommend the 64-bit Chimera versions.
The speed is nearly the same for 32-bit and 64-bit versions (within 10% in tests).  The
primary limitation of 32-bit versions is that they can only address 4 Gbytes of memory.  This
typically allows only about 1 Gbyte of data in Chimera, because the address space is used
for other purposes (shared libraries, stack, code, sometimes the operating system) and
becomes fragmented so that no large contiguous block is available.  With 64-bit Chimera
versions the address space is 4 billion times bigger, so all physical memory can be
used -- with 4 Gbytes of memory, probably about 3 Gbytes will be usable for data, and with
more than 4 Gbytes of memory, all memory will be usable.

A drawback of 64-bit Chimera is that molecule data appears to take about 1.7 times
more memory than in 32-bit Chimera.  Some extra memory is expected, since pointers take
twice as much space, but we don't have definitive measurements of how much is actually
being used, because in tests on Mac OS the system memory allocator for 64-bit binaries
may be allocating larger blocks.

== Multi-core Calculations == #multicore

Current (2011) desktop computers have 4 or 8 CPU cores, and laptops typically have 2 cores.
Future increases in computation speed will mostly come from having more cores, rather than
from the faster individual cores that accounted for past speed-ups in computer hardware.
     107
     108Most Chimera computations scale linearly with the size of the data, and after optimized
     109for a single thread (e.g. by coding in C++ instead of Python) tend to be memory-bandwidth
     110limited rather than compute-bound.

Currently only one Chimera calculation, Coulomb electrostatic potential, uses multiple CPU
cores.  On a 4-core machine the speed-up is about 3.5 times, and on 16 cores about 14 times.
The parallelization uses OpenMP and is only in 64-bit Chimera versions.  For Intel
processors with hyperthreading, the operating system will often report twice the number
of actual physical cores.  Hyperthreading does not speed up the Coulomb calculation; the
speed-up factor is approximately equal to the number of actual physical cores.
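The kind of decomposition such a parallelization uses can be sketched in Python: split the potential grid into one z-slab per core, compute each slab independently, and stitch the results together.  The slabs are computed serially here for clarity (the real code runs them as OpenMP threads in C++), and the charges and grid size are made-up test data:

```python
# Sketch of splitting a Coulomb-potential grid calculation into z-slabs,
# one per core.  Shown serially; Chimera's version does the same
# partitioning with OpenMP threads in C++.  Charges are made-up test data.
import numpy as np

charges = [(1.0, (3.0, 3.0, 3.0)), (-1.0, (8.0, 8.0, 8.0))]  # (q, position)
nz, ny, nx = 16, 16, 16

def potential_slab(z0, z1):
    """Coulomb potential sum(q/r) over charges, for grid planes z0..z1-1."""
    z, y, x = np.mgrid[z0:z1, 0:ny, 0:nx].astype(float)
    v = np.zeros((z1 - z0, ny, nx))
    for q, (cx, cy, cz) in charges:
        r = np.sqrt((x - cx)**2 + (y - cy)**2 + (z - cz)**2)
        v += q / np.maximum(r, 1e-6)   # clamp to avoid dividing by zero
    return v

cores = 4
bounds = np.linspace(0, nz, cores + 1).astype(int)  # slab boundaries
slabs = [potential_slab(bounds[i], bounds[i + 1]) for i in range(cores)]
parallel_v = np.concatenate(slabs)    # stitch slabs back together

# The decomposed result matches the one-shot calculation exactly.
assert np.allclose(parallel_v, potential_slab(0, nz))
```

Because each slab touches disjoint output memory and only reads the shared charge list, the slabs can run concurrently with no locking, which is why the speed-up tracks the physical core count.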

It may be that energy minimization done in Chimera with the third-party MMTK toolkit can
use multiple cores.

Two other multi-core computing tests have been tried in Chimera.  Volume contour surface
calculation ran about 1.7 times faster using 4 cores.  The extra code to stitch together
the surfaces from the separate cores has not been written and would make the total speed-up
somewhat less.  The computation appears to be memory-bandwidth limited, so additional cores
may or may not provide additional speed-up, depending on utilization of separate memory
caches per core or per CPU.  Current Intel i7 processors have 256 Kbytes of L2 cache per
core.  This may be too small to relieve memory bottlenecks when chunking contour
calculations into separate z-slabs for each core.

Fitting atomic models in volume data was sped up about 20% using 2 cores and just 15% using
4 cores(!).  The test case was fitting from 64 starting rotational orientations using the
Fit to Segments tool with rotational search enabled.  The fitting calculation is partly in
Python and partly in C++, and this test used the Python threading module.  The poor results
were probably because Python code can only run in one thread at a time, combined with the
fact that running Python code in multiple threads causes about a 2x slow-down from
context-switching overhead (as explained by Dave Beazley).  Tests showed that 2- or 4-core
calculations where the C++ code did not release the Python global interpreter lock (GIL)
were about 2 times slower than the same calculation on a single core.  If the fitting spends
40% of its time in Python and 60% in C++, then no matter how many cores are used, the Python
part alone will still take at least 80% of the single-thread time (40% doubled by the
threading slow-down), limiting the overall speed-up.  Porting the Python part of the fitting
code to C++ would involve translating about 500 lines of Python.  The obtainable speed-up is
not certain, since fitting may be memory-bandwidth limited.  The main step is interpolation
of volume data values and volume gradient values at points corresponding to atom locations,
which is not compute-intensive (tri-linear interpolation using the 8 nearest grid values).
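The 80% figure is an Amdahl-style bound; a small worked check, using the fractions assumed above:

```python
# Amdahl-style bound for the threaded fitting test: the Python fraction
# cannot run in parallel (GIL) and is assumed to slow down ~2x under
# threading; only the C++ fraction, which releases the GIL, scales with
# core count.  Fractions are the ones assumed in the text.
python_frac = 0.40   # fraction of single-thread time spent in Python
cpp_frac = 0.60      # fraction spent in GIL-releasing C++

def threaded_time(cores):
    """Run time relative to single-thread time, on the given core count."""
    return 2 * python_frac + cpp_frac / cores

# Even with unlimited cores, run time never drops below 80% of the
# single-thread time, so the best possible speed-up is 1/0.8 = 1.25x.
print(threaded_time(10**9))            # -> ~0.8
print(round(1 / threaded_time(4), 2))  # -> 1.05 (best case on 4 cores)
```

Under this model, 4 cores can give at most a few percent improvement, consistent with the meager 15-20% observed (the measured runs evidently paid less than the full 2x Python penalty).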

A good candidate for another multi-core test is the molecular trajectory RMSD map, which
computes all-by-all trajectory frame RMSD values.