Opened 5 years ago

Closed 5 years ago

#3630 closed enhancement (fixed)

Remove binary prereqs from ChimeraX git repository

Reported by: Tom Goddard Owned by: Tom Goddard
Priority: moderate Milestone:
Component: Infrastructure Version:
Keywords: Cc: chimera-programmers, tic20@…
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

Remove the large binary distributions of Qt, numpy, wxPython, etc. from the ChimeraX git repository to reduce its current 11 Gbyte size.

In order to move ChimeraX source code to GitHub we have to remove all files larger than 100 Mbytes, since GitHub does not allow such large files (even with a paid account). The overwhelming consensus online is that large binary blobs should not be put in a git repository because they make cloning the repository exceedingly slow, often resulting in repositories where more than 95% of the bytes are obsolete binaries. That is the state of the ChimeraX repository, which is 11 Gbytes in size. 8 Gbytes of our repository are Qt, PyQt and numpy Windows wheels. The next biggest chunk is a dozen wxPython binaries taking 0.5 Gbytes. Then comes a variety of ffmpeg binaries, a 50 Mbyte ViewDockX test file, openmm, hdf5, llvm, python source and binaries, and scipy.

It appears our actual code and documentation is about 700 Mbytes.

I suggest we purge Qt, PyQt, numpy and wxPython to reduce our repository from 11 Gbytes to 2.5 Gbytes. The current versions of these files used in builds can be kept on plato and obtained by rsync for our builds, and outside developers can fetch them from plato using https (curl or wget). We can put the fetching code into the prereq Makefiles. With this cleaning our largest files will be under 50 Mbytes.

Attachments (1)

git-sizes (2.1 MB ) - added by Tom Goddard 5 years ago.
List of files sorted by size in git history.

Change History (10)

by Tom Goddard, 5 years ago

Attachment: git-sizes added

List of files sorted by size in git history.

comment:1 by Tom Goddard, 5 years ago

Here is the one-liner to list the sizes of all files in the git history (needs brew install coreutils on Mac for gnumfmt command). I've attached the output for the ChimeraX repository.

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --padding=10 --round=nearest > git-sizes
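
To pick out only the blobs that would hit GitHub's 100 Mbyte limit, the same object listing can be filtered by size. This is a minimal sketch, not part of the ticket: the function name filter_large_blobs and the default threshold of 100 * 1024 * 1024 bytes are assumptions for illustration.

```shell
# Hypothetical helper: read "type hash size path" lines (the output of
# git cat-file --batch-check with the format used in the one-liner above)
# and print "size path" for every blob larger than the given byte limit.
filter_large_blobs() {
  limit=${1:-104857600}   # assumed default: 100 * 1024 * 1024 bytes
  awk -v limit="$limit" '
    $1 == "blob" && $3 + 0 > limit + 0 {
      path = $0
      # Strip the first three fields so paths containing spaces survive intact.
      sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", path)
      print $3, path
    }'
}

# Usage inside a clone:
#   git rev-list --objects --all \
#   | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
#   | filter_large_blobs
```

Reading from standard input keeps the filter independent of git, so it can also be run on a saved listing.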

comment:2 by Tom Goddard, 5 years ago

I made the ChimeraX build fetch the numpy wheel on Windows from https://www.rbvi.ucsf.edu/chimerax/data/prereqs using curl. Curl is available on stock Windows 10. (On Mac and Linux numpy is installed from PyPI.) I removed the numpy wheel from git (still in git history though). I also removed the Qt source tarball (500 Mbytes) from git (still in history) and made the build fetch it when needed. We only build Qt from source for rare debugging, so the Qt source will not be used by the daily or production builds.
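
The fetch step could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in the ChimeraX Makefiles: the function name fetch_prereq, the PREREQ_URL variable and the ".part" temporary file are hypothetical, and only the base URL comes from this comment.

```shell
# Base URL taken from the comment above; overridable for mirrors or tests.
PREREQ_URL=${PREREQ_URL:-https://www.rbvi.ucsf.edu/chimerax/data/prereqs}

# Hypothetical helper: download a prerequisite only if it is not already here.
fetch_prereq() {
  file=$1
  if [ -f "$file" ]; then
    echo "have $file"
    return 0
  fi
  # --fail makes HTTP errors return a nonzero status; --location follows redirects.
  # Download to a temporary name so an interrupted transfer is never mistaken
  # for a complete file on the next run.
  curl --fail --silent --show-error --location \
       --output "$file.part" "$PREREQ_URL/$file" &&
    mv "$file.part" "$file"
}

# Usage (the file name is a placeholder, not a real prerequisite):
#   fetch_prereq some-package.whl
```

Skipping the download when the file exists keeps repeated builds from hitting the server each time.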

Next step will be to purge git history of all numpy wheels and Qt source tar balls and wxPython wheels. Will try this tomorrow barring objections.

comment:3 by Tom Goddard, 5 years ago

It turns out git is not designed to allow deleting files from the history. It can be done, but it requires cooperation from all developers who have cloned the repository. Specifically, if a developer has a clone made before a file was deleted from the repository history, so the clone still contains the file, and does a "git pull" and "git push", then the deleted file gets added back into the repository. Instead, every previously cloned copy of the repository has to be rebased with "git pull -r" before doing a "git push". This apparently removes the deleted file from the cloned copy. I tested this with a new repository on GitHub: I added two files, file1.txt then file2.txt, then did git rm file1.txt, then deleted file1.txt from the history using bfg (installed on Mac with "brew install bfg"):

git clone --mirror https://github.com/tomgoddard/testdelete.git remove.git
bfg --delete-files file1.txt remove.git
cd remove.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

Then, in a previously cloned copy that still had file1.txt and file2.txt, I committed a new change to file2.txt and tried "git pull" (without rebase). It surprisingly said there was a merge conflict with file2.txt. It is not clear why; apparently the deletion of file1.txt had some unintended effect. I resolved the "conflict" and did "git push", and file1.txt reappeared in the repository. I repeated the deletion steps for file1.txt, then used "git pull -r" on another previously cloned version of the repository. There were no errors. I changed file2.txt, committed and pushed successfully, and file1.txt did not reappear. Listing all local git blobs showed no file1.txt in the history.
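
The final check, that no file1.txt remains anywhere in the local history, can be scripted. This is a minimal sketch of one way to do it; the function name history_has_file is an assumption, and it matches on the file's base name in any directory, much as bfg --delete-files does.

```shell
# Hypothetical helper: succeed if any object reachable from any ref has a path
# whose last component equals the given file name.
history_has_file() {
  name=$1
  git rev-list --objects --all |
    awk -v n="$name" '
      {
        path = $0
        sub(/^[^ ]+ ?/, "", path)        # drop the object hash, keep the path
        count = split(path, parts, "/")
        if (count > 0 && parts[count] == n) found = 1
      }
      END { exit !found }'
}

# Usage inside a clone:
#   if history_has_file file1.txt; then echo "still in history"; fi
```

Note that objects kept alive only by the reflog are not listed by rev-list --objects --all, so this checks what a push could send rather than everything stored under .git.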

comment:4 by Tom Goddard, 5 years ago

Cc: tic20@… added

Because all cloned copies of the repository need to be rebased after deleting files from the git history we don't want to do this more than once. So I propose we remove all the prereq binaries that we want to remove when we create the GitHub repository. We will leave all the binaries in the original plato repository. This way when all developers switch from plato to github which requires changing the origin of their clones they can also rebase.

To see that this will work we will need to test. We will need to see that Eric's feature branches migrate properly. First I will do a test making a trial ChimeraX github repository from our plato repository with several binaries (qt, numpy, wxpython) removed and then will copy my local clone and try to change its origin and rebase and commit a change and pull and push. If that all works, Eric can try merging a copy of one of his branches into this trial github repository. If these tests succeed I can then make a more complete list of the binaries we want to purge, and after the ChimeraX 1.1 release we can attempt the switch over to GitHub with a fresh GitHub repository copied from the plato one with binaries deleted.

comment:5 by Tom Goddard, 5 years ago

I copied the plato git repository to GitHub removing qt, numpy and wxPython binaries using bfg. That all worked smoothly reducing the repository .git directory from 9 Gbytes to 1.4 Gbytes. At that point the bfg git file deletion documentation says:

"At this point, you're ready for everyone to ditch their old copies of the repo and do fresh clones of the nice, new pristine data. It's best to delete all old clones, as they'll have dirty history that you don't want to risk pushing back into your newly cleaned repo."

In other words, every developer is expected to abandon their clones of the old repository and clone the new one. This is ok unless clones have unpushed changes. Eric has some feature branches that are long-term projects that I believe he said are not pushed to the plato repository and that have changes. So I tried to reset the origin of a clone of the plato repository to point to GitHub and fix any problems that result from the deleted files. Five hours of effort later, and not succeeding, I think this approach is ill-advised. The trouble is that bfg deletes the unwanted files by rewriting the entire git commit history. Other git delete methods do the same (e.g. git filter-branch). This is unavoidable by the design of git: each commit id is the SHA-1 hash of the commit object, and a commit object references its parent commit and the tree of files it contains, so once a very old file is deleted, every commit after the one that added the file gets a new id. In theory an old cloned repository could be updated to use the new history, involving hard resets of all branch head commit ids, all tags, flushing git reflogs (which are cache values), repacking all objects eliminating unreachable ones, .... I'm not sure about the "..." in the last sentence, since after doing all that I still didn't quite manage to purge all references to the old commit history. All of this monkeying is something git never intended; the repository history was clearly designed to never have things deleted. The trouble is that if any references to the old commit history remain, then a git push could put them all back on the server (GitHub), undoing the deleted files and potentially causing havoc. For example, in my tests I sometimes got into a state where I had one unpushed commit but git status said I had 14500 unpushed commits. Basically git decided that since all that old history was not on the remote server, I must have just committed it locally.
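
The reason every later commit id changes can be shown without a repository. Git names a commit by the SHA-1 of the string "commit <size>", a NUL byte, and the commit body, and the body contains the parent's id. The sketch below is only an illustration: the function name commit_id is an assumption, and the tree and parent values are placeholders rather than real object ids.

```shell
# Use sha1sum where available, otherwise shasum (which defaults to SHA-1).
sha1_tool=$(command -v sha1sum || command -v shasum)

# Hypothetical helper: hash a commit body the way git hashes a commit object.
commit_id() {
  body=$1
  size=$(( $(printf '%s' "$body" | wc -c) ))   # byte length of the body
  { printf 'commit %d\0' "$size"; printf '%s' "$body"; } |
    "$sha1_tool" | cut -d' ' -f1
}

# Two commits with the same tree and message but different parents.
# "T", "P1" and "P2" stand in for real 40-character ids.
child_of_p1=$(commit_id "$(printf 'tree T\nparent P1\n\nsame message')")
child_of_p2=$(commit_id "$(printf 'tree T\nparent P2\n\nsame message')")
[ "$child_of_p1" != "$child_of_p2" ] && echo "ids differ"
```

Because the child's id depends on the parent's id, rewriting one old commit forces new ids on all of its descendants, which is why old clones no longer share history with the cleaned repository.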

The upshot of this is I think the only sane approach is that if we delete files, indeed all old clones are discarded, and all developers make new clones.

In order to handle Eric's longterm branches which have unpushed commits I propose that Eric push those branches to plato so that plato has the entire state with no unpushed commits. Then we delete the big files and make the new repository on github, and we use only new clones from github. When Eric eventually merges his feature branches into the develop branch he can then delete the feature branch on the server if he wants to. Eric, does that sound feasible?
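
Before old clones are discarded, each developer would want to confirm nothing is left unpushed. The following is a minimal sketch of such a check, assuming branches track their remote counterparts in the usual way; the function name unpushed_branches is hypothetical and this is not a procedure agreed in this ticket.

```shell
# Hypothetical helper: list local branches that either have no upstream or
# have commits their upstream does not have.
unpushed_branches() {
  git for-each-ref --format='%(refname:short) %(upstream:short)' refs/heads |
    while read -r branch upstream; do
      if [ -z "$upstream" ]; then
        echo "$branch (no upstream)"
      else
        ahead=$(git rev-list --count "$upstream..$branch")
        if [ "$ahead" -gt 0 ]; then
          echo "$branch ($ahead unpushed)"
        fi
      fi
    done
}

# Usage inside a clone, after "git fetch" so upstream refs are current:
#   unpushed_branches
```

An empty result means every local branch is fully present on the server. Uncommitted working-tree changes and stashes are not covered and need a separate look with git status and git stash list.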


comment:6 by pett, 5 years ago

I think my plan is that once 1.1 is out, I will create a new branch in my clone -- homed the same way as my other branches -- commit some trivial changes to it and then try to "de-home" it and push it and see if I get a functional branch on plato. Once that works right, I will leave my other branches as is until it's almost "flag day" (since it's a pain in the ass to work with the de-homed versions) and then switch them over and push all changes.

--Eric

comment:7 by Tom Goddard, 5 years ago

I made ChimeraX fetch about 30 tarballs and wheels of third party packages from plato using curl and https, and removed those large files from the ChimeraX git repository with git rm, so they are still in the history. I also tested the command to remove these files from git history in preparation for migrating the repository to GitHub, shown below. This reduces the repository size about 100-fold, from 9 Gbytes to 88 Mbytes. The plan is to leave these files in the plato git repository and only remove them from the GitHub repository.

bfg --delete-files "{qt-everywhere*,numpy-1*.whl,PyQt-gpl-*.tar.gz,PyQt5_gpl-*.tar.gz,PyQt-win-gpl*.zip,numpy-1*.tar.gz,Python-3*.tar.xz,Python-3*.tar.bz2,Python-2*.tar.bz2,python-3*-amd64.exe,python-3.5.1-amd64-webinstall.exe,scipy-1*.whl,scipy-0*.whl,openmm-7*tar.bz2,OpenMM-7*tar.bz2,ffmpeg-3*.exe,ffmpeg-3*.zip,ffmpeg-3*.tar.bz2,ffmpeg-2*.tar.bz2,libvpx-1.6.1.tar.xz,libtheora-1.1.1.tar.xz,yasm-1.3.0.tar.bz2,x264*.tar.xz,x264*.tar.bz2,libogg-1.3.2.tar.xz,wxPython_Phoenix*.tar.gz,wxPython_Phoenix*.whl,giant.mol2,pdb3cc4_atoms.py,pdb3k9f_atoms.py,mmtf-cpp-master.zip,msgpack-c-master.zip,hdf5-1.8.16-win64-vs2015-shared.zip,hdf5-1*.tar.bz2,swagger-codegen-cli.jar,llvm-3*.src.tar.gz,llvm-3*.src.tar.xz,cmake-3.5.2-win32-x86.msi,cmake-2.8.10.2-win32-x86.zip,cmake-2.8.10.2.tar.gz,innosetup-5.5.9-unicode.exe,ovr_sdk_macos_0.4.4_beta.tar.gz,mesa-*tar.gz,mesa-*.tar.xz,Pillow-*.tar.gz,Pillow-*.zip,openssl-*tar.gz,pycollada-*.tar.gz,tables-3*.tar.gz,tables-3*.whl,freefont-ttf-20120719.tar.gz,libxml2*.tar.gz,libxslt-1.1.28.tar.gz,p7zip_9.20.1_src_all.tar.bz2,pyside-*.tar.bz2,shiboken*.tar.bz2,gdcm-*.tar.bz2,gdcm-*.whl,Sphinx-1*.tar.gz,birkenfeld-sphinx-contrib-*.zip,PyOpenGL*win_amd64.whl,PyOpenGL-*.tar.gz,PyOpenGL-*.tar.bz2,Cython-0*.tar.gz,Pygments-*.tar.gz,leap-*.tar.gz,docutils-0*.tar.gz,pcre-*.tar.bz2,jpeg*.tar.gz,distribute-*.zip,distribute-*.tar.gz,Jinja2-*.tar.gz,blew-*.tgz,pythomnic3k-*.tar.gz,xdg-utils-*.tar.gz,pkg-config-0*.tar.gz,libtool-2*.tar.gz,networkx-*.tar.gz,sip-4*.tar.gz,seggerx_*.tar.gz,setuptools-*.tar.gz,zlib-*.tar.gz,expat-2.1.0.tar.gz,blockdiag-*.tar.gz,python-dateutil-*.tar.gz,py-dateutil-*.tar.gz,beautifulsoup4-4.3.2.tar.gz,numexpr-*-win_amd64.whl,numexpr-*.tar.gz,chrpath-0.16.tar.gz,line_profiler-*.tar.gz,msgpack-python-0.5.tar.gz,funcparserlib-*.tar.gz,pyflakes-*.tar.gz,numpydoc-0.5.tar.gz,flake8-2.2.5.tar.gz,six-1.8.0.tar.gz,ordered-set-1.3.tar.gz,filelock-2.0.6.tar.gz,webcolors-1.4.tar.gz,appdirs-1.4.0.tar.gz,semantic_version-2.3.1.tar.gz,pdbfixer-1.3.1.tar.bz2,LeapC.dll,LeapC.lib,HoloPlayCore.dll,libHoloPlayCore.*,glew-1*.tgz}" --no-blob-protection chimerax.git

cd chimerax.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive

Here are the largest files remaining in the ChimeraX git repository, with the size in bytes in the second column (the first column is the git hash). They include some data, test data, movies, images, and a bit of third party code. The largest file is 14 Mbytes, the Dunbrack rotamer data.

3073b813806b   14510386 src/bundles/rotamer_libs/Dunbrack/src/dependentRotamerData.zip
1229552b8cae   12780834 testdata/1vqn.cif
586aed89269f    6723813 src/bundles/viewdockx/src/example_files/dock3.7/tcams.mol2
e6b4339d7de0    6533892 src/hydra/molecule/bond_templates
1f6fb52f425f    6517016 src/apps/hydra/molecule/bond_templates
cf9c72449e78    3453147 src/bundles/maestro/test-data/kegg_dock5.mae
e63e059451a9    2205623 testdata/cell15_timeseries.cmap
c4852751a520    2109215 testdata/1gcf.cif
a3fd718b2ccb    1964699 src/bundles/maestro/test-data/glide-test2.mae
86bb74fbab25    1655182 src/bundles/maestro/test-data/InducedFit_tut1_all2-out.mae
d0c7c247b385    1592798 docs/quickstart/images/spin.ogv
3aa3781e4a7a    1406724 src/bundles/maestro/test-data/InducedFit_tut1-out.mae
5d3ca70f1695    1324918 docs/quickstart/images/spin.ogv
558aa8e42f9d    1312098 src/bundles/viewdockx/src/example_files/nudock.pdb
088008bbb67f    1302602 src/bundles/meeting/src/face.png
078a2a899dcf    1139147 docs/quickstart/images/spin.mp4
5191d92927bc    1077447 docs/quickstart/images/spin.mp4
a85e1683dbeb     974311 docs/quickstart/images/spin.mp4
51cfeb69f6b6     958303 src/apps/ChimeraX/Chimera-icon.ai
9cce7b85a721     936046 webdemo/www/1a0m.json
cccfae7cc95a     913870 src/bundles/map_data/src/ims/imaris.webarchive
f48802c6bfd8     888870 docs/quickstart/images/spin.mp4
7daa4e4cf27b     812018 docs/user/tutorials/oculus-touch.png
7128d73afd33     807843 src/bundles/shortcuts/src/icons/hydrophobicity.cxs
319adef12882     772038 docs/user/blimp.png
281e8cfb0aec     740180 webdemo/www/three/Three.js
94de3f7df27b     739526 webdemo/www/three/Three.js
9f8cee8ebd80     725039 src/bundles/help_viewer/Minicons-Free-Pack.zip
f240b6cd3fe7     712891 src/bundles/viewdockx/src/example_files/gold_protein.mol2
e1201e0f15be     704618 docs/user/blimp.png
c472a48418a5     649913 src/bundles/viewdockx/flot-0.8.3.zip
7a8079a64f0a     644291 docs/user/tutorials/dicom-vr3.png
ebeeed3db437     632344 src/bundles/viewdockx/src/example_files/dock3.5.pdb
d4aa740ea43f     624951 docs/user/tutorials/dicom-vr4-full.png
0733208403e6     604004 src/apps/hydra/Hydra.app/Contents/Resources/hydra.icns
b98348116c98     491311 docs/quickstart/images/chimerax.png

comment:8 by Tom Goddard, 5 years ago

I pushed the 88 Mbyte ChimeraX repository that does not include prereq binaries to GitHub, then made a new clone from GitHub and built on Mac. All worked fine. A new clone takes about 200 Mbytes of space, the .git directory is 88 Mbytes and the rest ~110 Mbytes is checked out files. This makes sense since the .git pack files are compressed. Most of the remaining files in the repository are data, example files, images and movies. There is probably about 10 Mbytes that is our actual code, documentation, Makefiles -- shocking how little there is.

It remains for Eric to figure out pushing his local branches to plato. Then after the ChimeraX 1.1 release we will be ready to move to GitHub.

comment:9 by Tom Goddard, 5 years ago

Resolution: fixed
Status: assigned → closed

The ChimeraX source code was migrated to github on September 14 with binary prereqs removed.

https://github.com/RBVI/ChimeraX/tree/develop
