Opened 5 years ago
Closed 5 years ago
#3630 closed enhancement (fixed)
Remove binary prereqs from ChimeraX git repository
Reported by: | Tom Goddard | Owned by: | Tom Goddard |
---|---|---|---|
Priority: | moderate | Milestone: | |
Component: | Infrastructure | Version: | |
Keywords: | Cc: | chimera-programmers, tic20@… | |
Blocked By: | Blocking: | ||
Notify when closed: | Platform: | all | |
Project: | ChimeraX |
Description
Remove the large binary distributions of Qt, numpy, wxPython and similar packages from the ChimeraX git repository to reduce its current 11 Gbyte size.
In order to move ChimeraX source code to GitHub we have to remove all files larger than 100 Mbytes since GitHub does not allow such large files (even with a paid account). The overwhelming consensus online is that large binary blobs should not be put in a git repository because they make cloning the repository exceedingly slow, often resulting in repositories where more than 95% of the bytes are obsolete binaries. That is the state of the ChimeraX repository, which is 11 Gbytes in size. 8 Gbytes of our repository are Qt, PyQt and numpy Windows wheels. The next biggest chunk is a dozen wxPython binaries taking 0.5 Gbytes. Then come a variety of ffmpeg binaries, a 50 Mbyte ViewDockX test file, openmm, hdf5, llvm, python source and binaries, and scipy.
It appears our actual code and documentation is about 700 Mbytes.
I suggest we purge Qt, PyQt, numpy and wxPython to reduce our repository from 11 Gbytes to 2.5 Gbytes. The current versions of these files used in builds can be kept on plato and obtained by rsync for our builds; outside developers can fetch them from plato using https (curl or wget). We can put the fetching code into the prereq Makefiles. With this cleaning our largest files will be under 50 Mbytes.
Attachments (1)
Change History (10)
by , 5 years ago
comment:1 by , 5 years ago
Here is the one-liner to list the sizes of all files in the git history (needs brew install coreutils on Mac for gnumfmt command). I've attached the output for the ChimeraX repository.
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --padding=10 --round=nearest > git-sizes
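As a follow-up to the listing above, here is a minimal sketch that filters the same object list for blobs above a size threshold, for example to find files that would exceed GitHub's per-file limit. The function name blobs_over and the default threshold (100 MiB, my assumption for the exact limit) are illustrative and not part of the ChimeraX build.

```shell
# Print "size path" for every blob in the full history larger than a
# byte limit. The default of 104857600 bytes (100 MiB) is an assumed
# value for the hosting limit; pass a different number to change it.
blobs_over() {
  limit="${1:-104857600}"
  git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | awk -v limit="$limit" '
        # Keep only blobs over the limit, then drop the leading "blob ".
        $1 == "blob" && $2 + 0 > limit + 0 { sub(/^blob /, ""); print }'
}
```

Run inside a clone, "blobs_over" lists the offending files and "blobs_over 50000000" lowers the threshold to 50 Mbytes.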
comment:2 by , 5 years ago
I made the ChimeraX build fetch the numpy wheel on Windows from https://www.rbvi.ucsf.edu/chimerax/data/prereqs using curl. Curl is available on stock Windows 10. (On Mac and Linux numpy is installed from PyPI.) I removed the numpy wheel from git (still in git history though). I also removed the Qt source tarball (500 Mbytes) from git (still in history) and made the build fetch it when needed. We only build Qt from source for rare debugging, so the Qt source will not be used by the daily or production builds.
The next step will be to purge git history of all numpy wheels, Qt source tarballs and wxPython wheels. Will try this tomorrow barring objections.
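The fetch-on-demand step described above could look roughly like the sketch below. It is not the code in the ChimeraX Makefiles; the function names prereq_url and fetch_prereq and the PREREQ_BASE_URL variable are hypothetical, and only the base URL is taken from this comment.

```shell
# Base URL from the comment above; overridable for mirrors (hypothetical variable).
PREREQ_BASE_URL="${PREREQ_BASE_URL:-https://www.rbvi.ucsf.edu/chimerax/data/prereqs}"

# Build the download URL for a prereq file name.
prereq_url() {
  printf '%s/%s\n' "$PREREQ_BASE_URL" "$1"
}

# Download a prereq into the current directory unless it is already there.
# The download goes to a temporary name first so an interrupted transfer
# never leaves a truncated file under the final name.
fetch_prereq() {
  file="$1"
  [ -f "$file" ] && return 0
  curl --fail --silent --show-error --location \
       --output "$file.part" "$(prereq_url "$file")" \
    && mv "$file.part" "$file"
}
```

A Makefile rule would call something like "fetch_prereq some-package.whl" (a placeholder file name) before the step that installs it.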
comment:3 by , 5 years ago
It turns out git is not designed to allow deleting files from the history. It can be done but it requires cooperation from all developers who have cloned the repository. Specifically, if after a file is deleted from the repository a developer with a previously cloned version which still has the deleted file does a "git pull" and "git push", then the deleted file gets added back into the repository. Instead, for all previously cloned copies of the repository it is necessary to rebase the cloned copy with "git pull -r" before doing a "git push". This apparently removes the deleted file from the cloned copy. I tested this with a new repository on GitHub: I added two files, file1.txt then file2.txt, ran git rm file1.txt, then deleted file1.txt from the history using bfg (installed on Mac with "brew install bfg"):
git clone --mirror https://github.com/tomgoddard/testdelete.git remove.git
bfg --delete-files file1.txt remove.git
cd remove.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push
Then, with a previously cloned copy containing file1.txt and file2.txt, I tried "git pull" (without rebase) after I had committed a new change to file2.txt. Surprisingly it said there was a merge conflict with file2.txt. It is not clear why; apparently the deletion of file1.txt had some unintended effect. I resolved the "conflict" and did "git push", and file1.txt reappeared in the repository. I repeated the deletion steps for file1.txt, then used "git pull -r" on another previously cloned version of the repository. There were no errors. I changed file2.txt, committed and pushed successfully, and file1.txt did not reappear. Listing all local git blobs showed no file1.txt in the history.
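The final check in this test, confirming that a path no longer appears anywhere in the history, can be sketched as follows. The helper name path_in_history is hypothetical. It only looks at objects reachable from branches and tags, so objects kept alive only by the reflog would not be reported.

```shell
# Succeed (exit 0) if any commit reachable from any ref still contains
# an object stored under exactly the given path.
path_in_history() {
  git rev-list --objects --all \
    | cut -s -d' ' -f2- \
    | grep -Fxq -- "$1"
}
```

For example, "path_in_history file1.txt && echo still present" run in the clone reports whether the deleted file came back.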
comment:4 by , 5 years ago
Cc: | added |
---|
Because all cloned copies of the repository need to be rebased after deleting files from the git history, we don't want to do this more than once. So I propose we remove all the prereq binaries that we want to remove when we create the GitHub repository. We will leave all the binaries in the original plato repository. This way, when all developers switch from plato to GitHub, which requires changing the origin of their clones, they can also rebase.
To see that this will work we will need to test. We will need to see that Eric's feature branches migrate properly. First I will do a test making a trial ChimeraX github repository from our plato repository with several binaries (qt, numpy, wxpython) removed and then will copy my local clone and try to change its origin and rebase and commit a change and pull and push. If that all works, Eric can try merging a copy of one of his branches into this trial github repository. If these tests succeed I can then make a more complete list of the binaries we want to purge, and after the ChimeraX 1.1 release we can attempt the switch over to GitHub with a fresh GitHub repository copied from the plato one with binaries deleted.
comment:5 by , 5 years ago
I copied the plato git repository to GitHub, removing qt, numpy and wxPython binaries using bfg. That all worked smoothly, reducing the repository .git directory from 9 Gbytes to 1.4 Gbytes. At that point the bfg git file deletion documentation says:
"At this point, you're ready for everyone to ditch their old copies of the repo and do fresh clones of the nice, new pristine data. It's best to delete all old clones, as they'll have dirty history that you don't want to risk pushing back into your newly cleaned repo."
In other words, every developer is expected to abandon their clones of the old repository and clone the new one. This is OK unless clones have unpushed changes. Eric has some long-term feature branches that I believe he said are not pushed to the plato repository and that have changes. So I tried to reset the origin of a clone of the plato repository to point to GitHub and fix any problems that result from the deleted files. After five hours of effort without success, I think this approach is ill-advised.
The trouble is that bfg deletes the unwanted files by rewriting the entire git commit history. Other git delete methods do the same (e.g. git filter-branch). This is unavoidable by the design of git: a commit id is the SHA-1 hash of the commit object, and a commit object references its parent commit and its tree of files, so once a very old file is deleted, every commit made after that file was added gets a new id. In theory an old cloned repository could be updated to use the new history, involving hard resets of all branch heads and all tags, flushing the git reflogs (which record previous ref values), repacking all objects to eliminate unreachable ones, and so on. I'm not sure what else the "and so on" has to cover, since after doing all that I still didn't quite manage to purge all references to the old commit history.
All of this monkeying is something git never intended; the repository history was clearly designed to never have things deleted. The trouble is that if any references to the old commit history remain, a git push could put them all back on the server (GitHub), restoring the deleted files and potentially causing havoc. For example, in my tests I sometimes got into a state where I had one unpushed commit but git status said I had 14500 unpushed commits; basically git decided that since all that old history was not on the remote server I must have just committed it locally.
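The dependence of commit ids on their parents can be seen directly in any clone. The sketch below, with the hypothetical helper name recompute_commit_id, rehashes the raw content of a commit object and reproduces its id; because that content includes a "parent" line, changing any ancestor changes every descendant id.

```shell
# Recompute a commit id from the raw content of the commit object.
# hash-object without -w only computes the hash; nothing is written.
recompute_commit_id() {
  git cat-file commit "$1" | git hash-object -t commit --stdin
}
```

Comparing "recompute_commit_id HEAD" with "git rev-parse HEAD" gives the same value, and "git cat-file -p HEAD" shows the tree and parent lines that feed into the hash.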
The upshot of this is I think the only sane approach is that if we delete files, indeed all old clones are discarded, and all developers make new clones.
In order to handle Eric's longterm branches which have unpushed commits I propose that Eric push those branches to plato so that plato has the entire state with no unpushed commits. Then we delete the big files and make the new repository on github, and we use only new clones from github. When Eric eventually merges his feature branches into the develop branch he can then delete the feature branch on the server if he wants to. Eric, does that sound feasible?
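Before old clones are discarded, each developer would want to confirm nothing is left unpushed. A minimal sketch, assuming the remote is named origin and using the hypothetical helper name unpushed_commits:

```shell
# List commits on any local branch that are not on any remote-tracking
# branch, i.e. work that would be lost if this clone were deleted.
unpushed_commits() {
  git log --branches --not --remotes --oneline
}

# To publish every local branch before the switch one could then run:
#   git push --all origin
```

An empty result from unpushed_commits after the push means the server holds every local branch commit. Stashes and uncommitted changes are not covered by this check.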
comment:6 by , 5 years ago
I think my plan is that once 1.1 is out, I will create a new branch in my clone -- homed the same way as my other branches -- commit some trivial changes to it and then try to "de-home" it and push it and see if I get a functional branch on plato. Once that works right, I will leave my other branches as is until it's almost "flag day" (since it's a pain in the ass to work with the de-homed versions) and then switch them over and push all changes.
--Eric
comment:7 by , 5 years ago
I made ChimeraX fetch about 30 tarballs and wheels of third party packages from plato using curl and https and removed those large files from the ChimeraX git repository with git rm, so they are still in the history. I also tested the command to remove these files from git history in preparation for migrating the repository to GitHub, shown below. This reduces the repository size about 100-fold from 9 Gbytes to 88 Mbytes. The plan is to leave these files in the plato git repository and only remove them from the GitHub repository.
bfg --delete-files "{qt-everywhere*,numpy-1*.whl,PyQt-gpl-*.tar.gz,PyQt5_gpl-*.tar.gz,PyQt-win-gpl*.zip,numpy-1*.tar.gz,Python-3*.tar.xz,Python-3*.tar.bz2,Python-2*.tar.bz2,python-3*-amd64.exe,python-3.5.1-amd64-webinstall.exe,scipy-1*.whl,scipy-0*.whl,openmm-7*tar.bz2,OpenMM-7*tar.bz2,ffmpeg-3*.exe,ffmpeg-3*.zip,ffmpeg-3*.tar.bz2,ffmpeg-2*.tar.bz2,libvpx-1.6.1.tar.xz,libtheora-1.1.1.tar.xz,yasm-1.3.0.tar.bz2,x264*.tar.xz,x264*.tar.bz2,libogg-1.3.2.tar.xz,wxPython_Phoenix*.tar.gz,wxPython_Phoenix*.whl,giant.mol2,pdb3cc4_atoms.py,pdb3k9f_atoms.py,mmtf-cpp-master.zip,msgpack-c-master.zip,hdf5-1.8.16-win64-vs2015-shared.zip,hdf5-1*.tar.bz2,swagger-codegen-cli.jar,llvm-3*.src.tar.gz,llvm-3*.src.tar.xz,cmake-3.5.2-win32-x86.msi,cmake-2.8.10.2-win32-x86.zip,cmake-2.8.10.2.tar.gz,innosetup-5.5.9-unicode.exe,ovr_sdk_macos_0.4.4_beta.tar.gz,mesa-*tar.gz,mesa-*.tar.xz,Pillow-*.tar.gz,Pillow-*.zip,openssl-*tar.gz,pycollada-*.tar.gz,tables-3*.tar.gz,tables-3*.whl,freefont-ttf-20120719.tar.gz,libxml2*.tar.gz,libxslt-1.1.28.tar.gz,p7zip_9.20.1_src_all.tar.bz2,pyside-*.tar.bz2,shiboken*.tar.bz2,gdcm-*.tar.bz2,gdcm-*.whl,Sphinx-1*.tar.gz,birkenfeld-sphinx-contrib-*.zip,PyOpenGL*win_amd64.whl,PyOpenGL-*.tar.gz,PyOpenGL-*.tar.bz2,Cython-0*.tar.gz,Pygments-*.tar.gz,leap-*.tar.gz,docutils-0*.tar.gz,pcre-*.tar.bz2,jpeg*.tar.gz,distribute-*.zip,distribute-*.tar.gz,Jinja2-*.tar.gz,blew-*.tgz,pythomnic3k-*.tar.gz,xdg-utils-*.tar.gz,pkg-config-0*.tar.gz,libtool-2*.tar.gz,networkx-*.tar.gz,sip-4*.tar.gz,seggerx_*.tar.gz,setuptools-*.tar.gz,zlib-*.tar.gz,expat-2.1.0.tar.gz,blockdiag-*.tar.gz,python-dateutil-*.tar.gz,py-dateutil-*.tar.gz,beautifulsoup4-4.3.2.tar.gz,numexpr-*-win_amd64.whl,numexpr-*.tar.gz,chrpath-0.16.tar.gz,line_profiler-*.tar.gz,msgpack-python-0.5.tar.gz,funcparserlib-*.tar.gz,pyflakes-*.tar.gz,numpydoc-0.5.tar.gz,flake8-2.2.5.tar.gz,six-1.8.0.tar.gz,ordered-set-1.3.tar.gz,filelock-2.0.6.tar.gz,webcolors-1.4.tar.gz,appdirs-1.4.0.tar.gz,semantic_version-2.3.1.tar.gz,pdbfixer-1.3.1.tar.bz2,LeapC.dll,LeapC.lib,HoloPlayCore.dll,libHoloPlayCore.*,glew-1*.tgz}" --no-blob-protection chimerax.git
cd chimerax.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
Here are the largest files remaining in the ChimeraX git repository, with the git hash in the first column and the size in bytes in the second. They include some data, test data, movies, images and a bit of third party code. The largest file is 14 Mbytes, the Dunbrack rotamer data.
3073b813806b 14510386 src/bundles/rotamer_libs/Dunbrack/src/dependentRotamerData.zip
1229552b8cae 12780834 testdata/1vqn.cif
586aed89269f 6723813 src/bundles/viewdockx/src/example_files/dock3.7/tcams.mol2
e6b4339d7de0 6533892 src/hydra/molecule/bond_templates
1f6fb52f425f 6517016 src/apps/hydra/molecule/bond_templates
cf9c72449e78 3453147 src/bundles/maestro/test-data/kegg_dock5.mae
e63e059451a9 2205623 testdata/cell15_timeseries.cmap
c4852751a520 2109215 testdata/1gcf.cif
a3fd718b2ccb 1964699 src/bundles/maestro/test-data/glide-test2.mae
86bb74fbab25 1655182 src/bundles/maestro/test-data/InducedFit_tut1_all2-out.mae
d0c7c247b385 1592798 docs/quickstart/images/spin.ogv
3aa3781e4a7a 1406724 src/bundles/maestro/test-data/InducedFit_tut1-out.mae
5d3ca70f1695 1324918 docs/quickstart/images/spin.ogv
558aa8e42f9d 1312098 src/bundles/viewdockx/src/example_files/nudock.pdb
088008bbb67f 1302602 src/bundles/meeting/src/face.png
078a2a899dcf 1139147 docs/quickstart/images/spin.mp4
5191d92927bc 1077447 docs/quickstart/images/spin.mp4
a85e1683dbeb 974311 docs/quickstart/images/spin.mp4
51cfeb69f6b6 958303 src/apps/ChimeraX/Chimera-icon.ai
9cce7b85a721 936046 webdemo/www/1a0m.json
cccfae7cc95a 913870 src/bundles/map_data/src/ims/imaris.webarchive
f48802c6bfd8 888870 docs/quickstart/images/spin.mp4
7daa4e4cf27b 812018 docs/user/tutorials/oculus-touch.png
7128d73afd33 807843 src/bundles/shortcuts/src/icons/hydrophobicity.cxs
319adef12882 772038 docs/user/blimp.png
281e8cfb0aec 740180 webdemo/www/three/Three.js
94de3f7df27b 739526 webdemo/www/three/Three.js
9f8cee8ebd80 725039 src/bundles/help_viewer/Minicons-Free-Pack.zip
f240b6cd3fe7 712891 src/bundles/viewdockx/src/example_files/gold_protein.mol2
e1201e0f15be 704618 docs/user/blimp.png
c472a48418a5 649913 src/bundles/viewdockx/flot-0.8.3.zip
7a8079a64f0a 644291 docs/user/tutorials/dicom-vr3.png
ebeeed3db437 632344 src/bundles/viewdockx/src/example_files/dock3.5.pdb
d4aa740ea43f 624951 docs/user/tutorials/dicom-vr4-full.png
0733208403e6 604004 src/apps/hydra/Hydra.app/Contents/Resources/hydra.icns
b98348116c98 491311 docs/quickstart/images/chimerax.png
comment:8 by , 5 years ago
I pushed the 88 Mbyte ChimeraX repository that does not include prereq binaries to GitHub, then made a new clone from GitHub and built on Mac. All worked fine. A new clone takes about 200 Mbytes of space, the .git directory is 88 Mbytes and the rest ~110 Mbytes is checked out files. This makes sense since the .git pack files are compressed. Most of the remaining files in the repository are data, example files, images and movies. There is probably about 10 Mbytes that is our actual code, documentation, Makefiles -- shocking how little there is.
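The size figures quoted here can be rechecked in any clone without measuring the .git directory by hand. A minimal sketch, with the hypothetical helper name pack_size_kib, that reads the packed size reported by git itself:

```shell
# Print the total size of the pack files of the current repository in KiB,
# as reported on the "size-pack" line of git count-objects.
pack_size_kib() {
  git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}
```

Running "git count-objects -vH" instead prints the same report with human-readable units.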
It remains for Eric to figure out pushing his local branches to plato. Then after the ChimeraX 1.1 release we will be ready to move to GitHub.
comment:9 by , 5 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
The ChimeraX source code was migrated to github on September 14 with binary prereqs removed.
List of files sorted by size in git history.