Opened 5 years ago
Closed 5 years ago
#4073 closed enhancement (fixed)
Need reliable linking for python C++ modules using libarrays
| Reported by: | Tom Goddard | Owned by: | Tom Goddard |
|---|---|---|---|
| Priority: | moderate | Milestone: | |
| Component: | Infrastructure | Version: | |
| Keywords: | | Cc: | Eric Pettersen, Greg Couch |
| Blocked By: | | Blocking: | |
| Notify when closed: | | Platform: | all |
| Project: | ChimeraX | | |
Description
About 35 ChimeraX modules have C++ code that uses the arrays module for parsing numpy array Python arguments. Loading those C++ modules will fail unless "import arrays; arrays.load_libarrays()" is done first to dlopen libarrays. Currently we just call load_libarrays() in two modules (geometry and atomic) and pray that no other module is imported earlier. This burned us in bug #4068, where the Sphinx documentation imports each module in no specified order.
In the future it would be interesting to allow Python developers to simply import a ChimeraX module, without running the ChimeraX app, and use whatever functionality they need. For that to work we can't rely on load_libarrays() being called in one specific place in the application. Every place a C++ module using libarrays is used must ensure load_libarrays() is called.
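The job load_libarrays() has to do is presumably a dlopen with RTLD_GLOBAL, so that the symbols libarrays exports are visible to extension modules imported later. A minimal sketch of such a loader, with hypothetical names (load_shared_library is not ChimeraX's actual API, and the directory lookup is a placeholder):

```python
import ctypes
import ctypes.util
import os

def load_shared_library(name, directory=None):
    """Load a shared library so its exported symbols are visible to
    C/C++ extension modules imported afterwards.

    RTLD_GLOBAL is the key detail: with the default RTLD_LOCAL the
    symbols stay private to this handle, and dependent extension
    modules would still fail to resolve them at import time.
    """
    if directory is not None:
        path = os.path.join(directory, name)
    else:
        path = ctypes.util.find_library(name)
    if path is None:
        raise OSError('shared library %r not found' % name)
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
```

In ChimeraX the real implementation must also locate libarrays inside the installed arrays bundle; how it finds that directory is exactly the fragile part this ticket is about.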
The normal solution for runtime linking is that _map.dylib, which needs libarrays.dylib, would just reference it with either @rpath/libarrays.dylib or a relative path such as ../arrays/libarrays.dylib. If using @rpath, then the executable (python) needs to define an rpath, or an environment variable (DYLD_LIBRARY_PATH on macOS, LD_LIBRARY_PATH on Linux) must define it. Neither of those rpath definitions is convenient for using a Python module. The relative path could work as long as all Python bundles are installed in the same place; with the Toolshed installing updates in the user's home directory instead of in the app, that assumption would break. I don't think Toolshed updates of individual ChimeraX bundles are going to be useful. But relying on relative paths built into the C++ libraries seems fragile.
A more robust solution would be a wrapper Python module that calls load_libarrays() before importing the actual C++ module. For instance, alongside the _map C++ library there would also be a map_cpp.py file that calls load_libarrays() and then imports everything from _map. All code would use map_cpp; only map_cpp would directly import _map. It is a bit painful to have an ugly shim module just to ensure runtime linking works, but it seems the most robust. Having the arrays C++ library used by many Python modules is an unusual organization for Python, and runtime linking is always somewhat nightmarish for large software, since any change in the directory structure of the distributed code is likely to break it.
Change History (7)
comment:1 by , 5 years ago
comment:2 by , 5 years ago
A quick static-linking test: the 34 chimerax/*/*.so modules totaled 5 Mbytes. A small example is _mlp.so, which is 50 Kbytes when dynamically linked to libarrays; statically linked to libarrays it becomes 830 Kbytes. libarrays.dylib is 800 Kbytes, but oddly libarrays.a is 3 Mbytes. MLP uses FArray and some Python array parsing, so it is somewhat puzzling that it grew by nearly 800 Kbytes. If that size increase is typical, static linking would increase the ChimeraX C++ modules from 5 Mbytes to maybe 25 Mbytes.
comment:3 by , 5 years ago
It looks like any module that statically links to libarrays is sure to include the entire 800 Kbytes. The reason is that almost every use involves using routines with PyArg_ParseTupleAndKeywords such as parse_float_n3_array() that parse numpy arrays. Any such routine calls some generic code in libarrays that handles an array of any of 11 numeric value types and uses C++ classes with methods for all those types, including methods to cast to any other type. So any use is going to pull all that in.
Since the typical ChimeraX shared library size is 100 Kbytes, it is costly to add 800 Kbytes to most of those.
49920 arrays/_arrays.cpython-38-darwin.so*
50464 atom_search_lib/_load_libs.cpython-38-darwin.so*
50472 pdb_lib/_load_libs.cpython-38-darwin.so*
50656 atomic_lib/_load_libs.cpython-38-darwin.so*
50848 core/_mac_util.so*
50920 dssp/_dssp.cpython-38-darwin.so*
50944 core/_appdirs.so*
52008 spacenavigator/_spacenavigator.cpython-38-darwin.so*
52808 mlp/_mlp.cpython-38-darwin.so*
52944 alignment_algs/_sw.cpython-38-darwin.so*
53984 webcam/_webcam.cpython-38-darwin.so*
54160 morph/_morph.cpython-38-darwin.so*
55512 stl/_stl.cpython-38-darwin.so*
58176 connect_structure/_cs.cpython-38-darwin.so*
58880 coulombic/_esp.cpython-38-darwin.so*
79720 graphics/_graphics.cpython-38-darwin.so*
90400 md_crds/_gromacs.cpython-38-darwin.so*
95136 alignment_algs/_nw.cpython-38-darwin.so*
101712 mask/_mask.cpython-38-darwin.so*
111856 atom_search/ast.cpython-38-darwin.so*
128832 atomic/cytmpl.cpython-38-darwin.so*
164072 atomic/_ribbons.cpython-38-darwin.so*
174248 chem_group/_chem_group.cpython-38-darwin.so*
213592 geometry/_geometry.cpython-38-darwin.so*
231032 mmtf/_mmtf.cpython-38-darwin.so*
252544 core/_serialize.cpython-38-darwin.so*
266928 surface/_surface.cpython-38-darwin.so*
283960 pdb/_pdbio.cpython-38-darwin.so*
482648 segment/_segment.cpython-38-darwin.so*
517288 atomic/cymol.cpython-38-darwin.so*
728816 map/_map.cpython-38-darwin.so*
1044848 mmcif/_mmcif.cpython-38-darwin.so*
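A listing like the one above can be regenerated with a short script; a sketch (pass it whatever chimerax site-packages tree you want to survey):

```python
import os

def extension_sizes(root):
    """Per-file and total sizes of compiled Python extension modules
    (*.so files) under a directory tree, as in the listing above."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.so'):
                path = os.path.join(dirpath, name)
                sizes[path] = os.path.getsize(path)
    return sizes, sum(sizes.values())

def report(root):
    # Print smallest to largest, then the grand total.
    sizes, total = extension_sizes(root)
    for path, size in sorted(sizes.items(), key=lambda item: item[1]):
        print('%8d %s' % (size, os.path.relpath(path, root)))
    print('total %d bytes in %d modules' % (total, len(sizes)))
```

Running report() on both a dynamically linked and a statically linked build tree would give the size comparison discussed in this ticket.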
comment:4 by , 5 years ago
It would probably be possible to rewrite the libarrays code so that bundles that don't need to handle all types of numpy arrays (i.e., don't handle volume data) could statically link to something small. But it would be a lot of trouble and looks like a bad solution.
The two viable solutions are to have bundles reference libarrays via a relative path like ../arrays/libarrays.dylib, or to use a shim Python module, e.g. map_cpp.py, that imports the C++ _map module after loading libarrays, with all other code importing the shim. The relative path will not work if a bundle is not installed in site-packages where the arrays bundle is installed, and that happens when the Toolshed installs into a user-specific directory instead of directly into the application. The shim module is the only reliable approach.
comment:5 by , 5 years ago
I surveyed the 21 modules that depend on libarrays (grep ChimeraX-Arrays */bundle_info.xml) to see how they use it. Some of the code that works on atomic structures uses it only in trivial ways, creating float and int numpy arrays (python_int_array(), python_float_array()), which could be done in a page of code. But those modules depend on libatomstruct from atomic_lib, which links to libarrays, so all the derived modules also need to runtime-link to libarrays.
So it looks like these 21 modules would need a shim to ensure that libarrays is loaded first.
align_algs - array_from_python(), Numeric_Array
atomic - array_from_python(), Numeric_Array in molc.cpp, parse_float_*() in ribbon xsection.cpp, spline.cpp, normals.cpp
atomic_lib - array_from_python(), Numeric_Array in session_restore()
chem_group - python_voidp_array(), links to libarrays because it links to libatomstruct.
coulombic - parse_float_n3_array(), parse_float_n_array()
dssp - does not use libarrays directly but links to libarrays because it links to libatomstruct which needs libarrays
geometry
graphics
leap_motion - c_array_to_python() finger positions returned as numpy arrays
map
mask
mlp
mmcif - python_voidp_array(), links to libarrays because it links to libatomstruct.
mmtf - python_voidp_array(), links to libarrays because it links to libatomstruct.
morph
pdb - python_voidp_array(), array_from_python(), links to libarrays because it links to libatomstruct.
realsense
segment
stl
surface
webcam
comment:6 by , 5 years ago
Most of the C++ modules also have a Python part. Instead of a shim map_cpp that imports _map after loading libarrays, another approach is to load libarrays directly in map/__init__.py. This is the current approach used by atomic_lib, atomic/mobject.py, and geometry. It is a little less cumbersome and about as clear in cases where the C++ library is imported in __init__.py.
Or, if the C++ module is only imported in one place, then load libarrays right before that import.
comment:7 by , 5 years ago
| Resolution: | → fixed |
|---|---|
| Status: | assigned → closed |
Fixed.
Made all 21 C++ Python modules that depend on libarrays call load_libarrays() before they are imported. In cases where the C++ module is imported by the bundle __init__.py, I called load_libarrays() before the import. In other cases, where the C++ module is imported from multiple locations but not from __init__.py, I used a shim module, e.g. mask_cpp.py.
Another solution to this problem would be to eliminate the libarrays dynamic library and have every module that uses it link against a static version of the library instead. The main drawback would be increased code size from including some of the same code in 35 different C++ modules. It might be a small increase. libarrays.dylib is 800 Kbytes on macOS, which would be bad to replicate 35 times, but static linking might pull in only a small part of that. We would need to see what size increases occur in going from dynamic to static linking. Maybe that test is simple: produce a libarrays.a static library, build all bundles with it, and compare the sizes to the dynamic-library version.