Molecules require about 2 Kbytes of memory per atom in 32-bit Chimera and about 4 Kbytes
per atom in 64-bit Chimera, so 500,000 atoms will use 1-2 Gbytes of memory. A
memory-efficient implementation would use about 200 bytes per atom, roughly 10 times less.
The large memory use is primarily because each atom (and bond) has an associated Python
object; the memory used by the C++ data representation is probably much smaller. Having
the individual atoms accessible in Python makes it easy to add new features to Chimera.
It would be possible to create the Python atom data structures only when needed. If basic
calculations over atoms (at least the ones done for every opened molecule) were ported to
C++, this would allow working with larger numbers of atoms, while advanced, unoptimized
features would still use Python atoms. This optimization would speed up basic operations
on large atomic models and still allow slower unoptimized computations when enough memory
is available.

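As a rough back-of-the-envelope check of these numbers, a sketch like the following can
estimate molecule memory use (the per-atom constants are the approximate figures quoted
above, not measured values for any particular structure):

```python
# Rough estimate of molecule memory use from the per-atom figures above.
# The constants are the approximate values quoted on this page.
BYTES_PER_ATOM = {"32-bit": 2 * 1024, "64-bit": 4 * 1024}

def molecule_memory_gbytes(num_atoms, build="64-bit"):
    """Estimated in-memory size of a molecule, in Gbytes."""
    return num_atoms * BYTES_PER_ATOM[build] / float(1024 ** 3)

for build in ("32-bit", "64-bit"):
    print("%s: %.1f Gbytes for 500,000 atoms"
          % (build, molecule_memory_gbytes(500000, build)))
# Prints about 1.0 Gbytes (32-bit) and 1.9 Gbytes (64-bit).
```
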
Volume data displayed at full resolution usually takes an amount of memory equal to the
data file size. For text file formats (e.g. APBS) the in-memory size will be much smaller
(about 4 times). If volume data is displayed subsampled, using a step size greater than 1,
then only the needed data is read in; for step size 2, only 1/8 of the full data is read
into memory. If a subregion of a volume is shown, only that data is read in. The native
numeric type of the volume data is preserved -- for example, 8-bit data is represented in
memory using 8 bits per value, and volume operations do not convert the data to floating
point (4 bytes per value). Interpolating a volume data set, even at just one point, will
cause the full data set to be read in.

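To make the step-size arithmetic concrete, here is a small sketch of how the in-memory
size scales with subsampling (the 512-cube map dimensions and 8-bit data type are made-up
example values):

```python
import numpy as np

def volume_memory_mbytes(grid_size, step=1, dtype=np.uint8):
    """In-memory size, in Mbytes, of a volume read with the given step.
    Subsampling with step N keeps roughly 1/N**3 of the grid points, and
    the native numeric type (8-bit here) is preserved rather than being
    promoted to 4-byte floating point."""
    ni, nj, nk = (max(1, n // step) for n in grid_size)
    return ni * nj * nk * np.dtype(dtype).itemsize / float(1024 ** 2)

size = (512, 512, 512)                      # hypothetical map dimensions
print(volume_memory_mbytes(size, step=1))   # 128 Mbytes at full resolution
print(volume_memory_mbytes(size, step=2))   # 16 Mbytes, 1/8 of the full size
```
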
Chimera 32-bit and 64-bit binaries are available for Windows, Linux and Mac operating
systems. Windows 7, any 64-bit Linux, or Mac OS 10.6 is required to run the 64-bit
binaries. If you have 4 Gbytes or more of physical memory we recommend using the 64-bit
Chimera versions. Speed is nearly the same for the 32-bit and 64-bit versions (within 10%
in tests). The primary limitation of the 32-bit versions is that they can only address
4 Gbytes of memory, which in practice allows only about 1 Gbyte of data in Chimera,
because the address space is also used for other purposes (shared libraries, stack, code,
sometimes the operating system) and becomes fragmented so that no large contiguous block
is available. With 64-bit Chimera versions the address space is 4 billion times larger,
so all physical memory can be used -- with 4 Gbytes of memory, probably about 3 Gbytes
will be usable for data, and with more than 4 Gbytes of memory, all of it will be usable.

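If you are unsure which build you are running, one quick check from a Python shell (for
example Chimera's IDLE tool) is the interpreter's pointer size. This is a generic Python
check, not a Chimera-specific API:

```python
import struct
import sys

# The size of a C pointer tells you whether the interpreter (and hence
# the Chimera build it is embedded in) is a 32-bit or 64-bit program.
bits = 8 * struct.calcsize("P")
print("%d-bit interpreter" % bits)
print("largest container index (sys.maxsize): %d" % sys.maxsize)
```
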
A drawback of the 64-bit Chimera versions is that molecule data appears to take about
1.7 times more memory than in 32-bit Chimera. Some additional memory use is expected
since pointers take twice as much space, but we do not have definitive measurements of
how much extra is used, because in tests on Mac OS the 64-bit system memory allocator
may be allocating larger blocks.

Current (2011) desktop computers have 4 or 8 CPU cores and laptops typically have 2 cores.
Future increases in computation speed will come mostly from having more cores, rather than
from the faster individual cores that accounted for past speed-ups in computer hardware.

Most Chimera computations scale linearly with the size of the data, and once optimized for
a single thread (e.g. by coding in C++ instead of Python) they tend to be limited by memory
bandwidth rather than compute-bound.

Currently only one Chimera calculation, Coulomb electrostatic potential, uses multiple CPU
cores. On a 4-core machine the speed-up is about 3.5 times, and on 16 cores it is about 14
times. The parallelization uses OpenMP and is available only in 64-bit Chimera versions.
For Intel processors with hyperthreading, the operating system will often report twice the
number of actual physical cores; hyperthreading does not speed up the Coulomb calculation,
and the speed-up factor is approximately equal to the number of actual physical cores.

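Since the speed-up tracks physical cores rather than hyperthreads, it can be useful to
check both counts. A minimal sketch (psutil is a third-party package and an assumption
here, not something Chimera ships):

```python
import multiprocessing

# The OS-reported count includes hyperthreads; on Intel CPUs with
# hyperthreading it is often twice the number of physical cores.
print("logical cores reported:", multiprocessing.cpu_count())

# psutil, if installed, can report the physical core count directly.
try:
    import psutil
    print("physical cores:", psutil.cpu_count(logical=False))
except ImportError:
    print("psutil not installed; physical core count unavailable")
```
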
It may be that energy minimization, done in Chimera with the third-party MMTK toolkit, can
use multiple cores.

Two other multi-core computing tests have been tried in Chimera. Volume contour surface
calculation ran about 1.7 times faster using 4 cores. The extra code to stitch together
the surfaces produced by the separate cores has not been written and would make the total
speed-up somewhat less. The computation appears to be memory-bandwidth limited, so
additional cores may or may not provide additional speed-up, depending on how the separate
memory caches per core or per CPU are utilized. Current Intel i7 processors have 256 Kbytes
of L2 cache per core, which may be too small to relieve the memory bottleneck when the
contour calculation is chunked into separate z-slabs, one per core.

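The z-slab decomposition described above might look roughly like the following sketch.
The contour_slab function is only a stand-in for the real marching-cubes style contouring,
and the stitching of surface pieces across slab boundaries is omitted, as in the test:

```python
import multiprocessing
import numpy as np

def contour_slab(args):
    # Stand-in for contouring one z-slab; a real implementation would run
    # a marching-cubes style surface calculation here.
    slab, level = args
    return int((slab > level).sum())

def parallel_contour(volume, level, n_workers=4):
    # Split the grid into z-slabs with one plane of overlap so surface
    # pieces could later be stitched across slab boundaries (the
    # stitching step is omitted, as in the test described above).
    bounds = np.linspace(0, volume.shape[0], n_workers + 1).astype(int)
    slabs = [volume[max(b0 - 1, 0):b1] for b0, b1 in zip(bounds, bounds[1:])]
    # Separate processes sidestep the Python GIL entirely.
    pool = multiprocessing.Pool(n_workers)
    try:
        return pool.map(contour_slab, [(s, level) for s in slabs])
    finally:
        pool.close()

if __name__ == "__main__":
    vol = np.random.rand(128, 128, 128).astype(np.float32)
    print(parallel_contour(vol, level=0.5))
```
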
Fitting atomic models in volume data was sped up only about 20% using 2 cores and just 15%
using 4 cores(!). The test case was fitting from 64 starting rotational orientations using
the Fit to Segments tool with rotational search enabled. The fitting calculation is partly
in Python and partly in C++, and this test used the Python threading module. The poor
results are probably because Python code can only run in one thread at a time, combined
with the fact that running Python code in multiple threads causes about a 2x slow-down due
to context-switching overhead (as explained by Dave Beazley). Tests showed that 2- or
4-core calculations in which the C++ code did not release the Python global interpreter
lock (GIL) were about 2 times slower than the same calculation on a single core. If the
fitting spends 40% of its time in Python and 60% in C++, then no matter how many cores are
used the Python part will still take at least 80% of the single-thread time (double its
single-thread cost), limiting performance. Porting the Python part of the fitting code to
C++ would involve translating about 500 lines of Python. The obtainable speed-up is not
certain since fitting may be memory-bandwidth limited; the main step is interpolation of
volume data values and volume gradient values at points corresponding to atom positions,
which is not compute intensive (trilinear interpolation using the 8 nearest grid values).

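The 80% figure can be checked with a small worked example. This toy model just restates
the argument above (40% Python that stays serialized and pays the roughly 2x threading
penalty, 60% C++ that parallelizes perfectly); the numbers are illustrative, not
measurements:

```python
def threaded_fitting_time(cores, python_fraction=0.4, gil_penalty=2.0):
    """Best-case run time relative to a single thread (1.0), assuming the
    C++ part scales perfectly with cores while the Python part stays
    serialized by the GIL and, with more than one thread, pays roughly a
    2x context-switching penalty."""
    cpp_fraction = 1.0 - python_fraction
    penalty = gil_penalty if cores > 1 else 1.0
    return python_fraction * penalty + cpp_fraction / cores

for cores in (1, 2, 4, 16):
    t = threaded_fitting_time(cores)
    print("%2d cores: relative time %.2f, speed-up %.2fx" % (cores, t, 1.0 / t))
# With many cores the Python part alone still costs 0.8 of the original
# single-thread time, so the speed-up cannot exceed about 1.25x.
```
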
A good candidate for another multi-core test is the molecular trajectory RMSD map that
computes all-by-all trajectory frame RMSD values.
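
A minimal sketch of how that computation might be spread over cores using processes (so
the GIL is not an issue). The RMSD here assumes frames are already superimposed coordinate
arrays, and the frame count, atom count and worker count are made-up example values:

```python
import multiprocessing
import numpy as np

def frame_pair_rmsd(args):
    # RMSD between two frames of already-superimposed coordinates; a real
    # trajectory RMSD map would usually do a best-fit alignment first.
    a, b = args
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

def rmsd_map(frames, n_workers=4):
    n = len(frames)
    pairs = [(frames[i], frames[j]) for i in range(n) for j in range(i + 1, n)]
    pool = multiprocessing.Pool(n_workers)
    try:
        values = pool.map(frame_pair_rmsd, pairs)
    finally:
        pool.close()
    m = np.zeros((n, n))
    m[np.triu_indices(n, k=1)] = values   # same ordering as the pairs list
    return m + m.T

if __name__ == "__main__":
    trajectory = [np.random.rand(1000, 3) for _ in range(20)]   # toy frames
    print(rmsd_map(trajectory).shape)
```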