Opened 6 months ago
Last modified 5 months ago
#17555 assigned enhancement
Optimize Boltz predictions for large structures
Reported by: | Tom Goddard | Owned by: | Tom Goddard |
---|---|---|---|
Priority: | moderate | Milestone: | |
Component: | Structure Prediction | Version: | |
Keywords: | | Cc: | |
Blocked By: | | Blocking: | |
Notify when closed: | | Platform: | all |
Project: | ChimeraX | | |
Description
Boltz runs out of memory or takes a very long time predicting structures over 1000 residues. Here are some timings:
https://www.rbvi.ucsf.edu/chimerax/data/boltz-apr2025/boltz_help.html#runtimes
One idea to improve the run time and memory use is to use bfloat16 values instead of float32 for both the model weights and the activations. Another idea is to let matrix multiplications use the bfloat16 operations provided by tensor cores on Nvidia GPUs. Test these ideas to see if they help.
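For reference, a minimal standalone sketch (not Boltz code) of the two ideas, assuming a CUDA GPU is available; the matrix sizes are arbitrary:
import torch

# Idea 1: bfloat16 halves the memory per value compared to float32
# (2 bytes vs 4 bytes) for both weights and activations.
a32 = torch.randn(4096, 4096, device='cuda')   # float32, about 64 MB
a16 = a32.to(torch.bfloat16)                   # bfloat16, about 32 MB

# Idea 2: let eligible operations (like matmul) run in bfloat16 on the
# GPU tensor cores while the tensors stay float32, via autocast.
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    b32 = torch.randn(4096, 4096, device='cuda')
    c = a32 @ b32                              # computed with bfloat16 tensor cores
print(c.dtype)                                 # torch.bfloat16 under autocast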
Change History (11)
comment:1 by , 6 months ago
comment:2 by , 6 months ago
Changing the Boltz code to use bfloat16 weights and activations allowed successful predictions in cases that formerly gave out-of-memory errors: 9fz5 (1025 tokens) ran in 181 seconds and 8sa0 (1371 residues) in 331 seconds. I hoped it might predict up to twice the former size limit of 900 residues on the Nvidia 4090 on minsky, but unfortunately not: tests from 1467 to 1770 residues all failed with out-of-memory errors. The structures looked fine but should be compared to the same predictions on the Mac Studio to check accuracy.
The Boltz code change in main.py in function predict() changes
trainer.predict(...)
to
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    trainer.predict(...)
I am not sure whether this autocast approach can be used on Mac. There is another approach,
model = model.to(torch.bfloat16)
data = data.to(torch.bfloat16)
that might work on Mac. Online sources suggest Mac M2 and newer support bfloat16 in Metal shaders, but it is not clear whether PyTorch supports it on Mac.
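A fuller sketch of that casting approach for the MPS backend; the cast_batch_to_bfloat16 helper and the tensors are illustrative stand-ins, not actual Boltz data structures, and whether MPS accepts bfloat16 depends on the hardware and PyTorch version:
import torch

def cast_batch_to_bfloat16(batch):
    # Hypothetical helper: cast only floating-point tensors to bfloat16,
    # leaving integer tensors (indices, masks) unchanged.
    return {key: value.to(torch.bfloat16)
            if torch.is_tensor(value) and value.is_floating_point() else value
            for key, value in batch.items()}

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
model = torch.nn.Linear(128, 128).to(device).to(torch.bfloat16)   # stand-in model
batch = {'features': torch.randn(8, 128, device=device),
         'indices': torch.arange(8, device=device)}
batch = cast_batch_to_bfloat16(batch)
with torch.no_grad():
    print(model(batch['features']).dtype)    # torch.bfloat16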
I should test this improvement on Windows with Nvidia cards with less memory (our 3070 and 4070).
If it appears to be a valuable optimization I can make a pull request on the Boltz repository to add an option "--use_bfloat16 true" to the boltz predict command, so this can be used without modifying the Boltz source during installation.
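A rough sketch of what such an option could look like, assuming a click-style command like the one boltz predict uses; the option name, the flag handling, and the placeholder body are illustrative, not the actual Boltz source:
import contextlib
import click
import torch

@click.command()
@click.option('--use_bfloat16', is_flag=True, default=False,
              help='Run the prediction under bfloat16 autocast on CUDA.')
def predict(use_bfloat16):
    # Use autocast only when requested; otherwise a no-op context keeps
    # the default float32 behavior unchanged.
    context = (torch.autocast(device_type='cuda', dtype=torch.bfloat16)
               if use_bfloat16 else contextlib.nullcontext())
    with context:
        pass  # trainer.predict(...) would go here in the real code

if __name__ == '__main__':
    predict()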
comment:3 by , 6 months ago
This post suggests torch.autocast() to bfloat16 can work on Mac with MPS.
comment:4 by , 6 months ago
Tests show that using bfloat16 on Windows with Nvidia is very helpful, but on Windows with only CPU it is much slower.
I tried torch.autocast('cuda', dtype=torch.bfloat16) on Windows 11 with an Nvidia 3070 (8 GB) and an i7-12700K CPU (vizvault.cgl.ucsf.edu), and it improved speeds for all runs. Here are some timings.
8rf4, no autocast 1 min, with bfloat16 1 min
1hho, no autocast 290 sec, with bfloat16 108 sec, almost all side chain positions look identical
9moj, no autocast 7 min, with bfloat16 2.6 min (149 sec)
9h1k, no autocast 17 min, with bfloat16 7 min (482 sec)
I also tried bfloat16 with the 'cpu' device (no CUDA) on this same machine, using torch.autocast('cpu', dtype=torch.bfloat16).
8rf4, no autocast 2.5 min (151 sec), with bfloat16 16 min (971 sec)
So apparently bfloat16 is not optimized on this CPU and is much, much slower than float32.
I also tried autocast('cuda', dtype=torch.bfloat16) while using the cpu device and, as expected, it had no effect on the run time.
8rf4, autocast cuda bfloat16, 2.2 min (135 sec)
comment:5 by , 6 months ago
From the previous post it looks like bfloat16 may not be supported in hardware on the i7-12700K CPU tested in the preceding comment. The Intel Extension for PyTorch (https://github.com/intel/intel-extension-for-pytorch), which is Linux-only, suggests that Intel CPUs need AVX512-BF16 and AMX-BF16 instruction set support for good performance. AVX-512 was dropped in Intel Core i3/i5/i7/i9 CPUs because their efficiency cores do not have it and the performance cores disable it so that all cores behave the same; it is now essentially an Intel Xeon feature. So this will not help people with laptop and desktop machines, since Xeon processors are usually only found in servers.
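A quick Linux-only check of whether a CPU advertises the bfloat16 instruction sets mentioned above (it just reads /proc/cpuinfo and does not involve PyTorch):
# Print which of the relevant instruction-set flags this CPU reports.
with open('/proc/cpuinfo') as f:
    flags = set()
    for line in f:
        if line.startswith('flags'):
            flags = set(line.split(':', 1)[1].split())
            break

for name in ('avx512f', 'avx512bw', 'avx512vl', 'avx512dq',
             'avx512_bf16', 'amx_bf16'):
    print(name, 'yes' if name in flags else 'no')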
It also seems that PyTorch does not support common laptop and desktop Intel GPUs. Here is the PyTorch documentation describing Intel GPU support (platform torch.xpu).
https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html
It may be that there is nothing that can be done to optimize Boltz on Intel CPUs, so it will simply run much slower on those machines than on Mac ARM or on Windows/Linux with Nvidia GPUs.
comment:6 by , 6 months ago
Tried bfloat16 on Mac ARM using
with torch.autocast(device_type='mps', dtype=torch.bfloat16):
    trainer.predict(...)
PDB 8rf4 took 52 seconds with no autocast and 73 seconds with autocast bfloat16.
PDB 9moj took 130 sec (10.8 - 12.8 GB) with no autocast and 159 sec with autocast bfloat16 (9.5 - 11.5 GB in activity monitor).
PDB 9fz5 took 394 sec (39.9 - 42.3 GB) with no autocast, and 436 sec with autocast bfloat16 (32.5 - 35 GB).
So on Mac it appears to run about 10% slower on a large prediction, 20% slower on a medium one, and 40% slower on a small one, while using only about 20% less memory on the large prediction and 10% less on the medium one.
It might be worth casting the model and data to bfloat16 instead of using autocast, to see if a larger memory reduction is achieved.
comment:7 by , 6 months ago
On Mac I tried using Trainer(precision = 16) instead of precision = 32; 8rf4 took 47 seconds. But I got a new warning to stderr suggesting autocast was not used:
/Users/goddard/boltz/lib/python3.11/site-packages/torch/amp/autocast_mode.py:266: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
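For what it's worth, recent PyTorch Lightning versions spell the bfloat16 choice as a precision string rather than the integer 16 (which selects float16 mixed precision and, judging by the warning above, a CUDA autocast path). A sketch, assuming a Lightning 2.x Trainer; whether this works with the MPS accelerator is something to verify:
from pytorch_lightning import Trainer

# 'bf16-mixed' keeps float32 weights but runs eligible ops in bfloat16;
# precision=16 (or '16-mixed') selects float16 mixed precision instead.
trainer = Trainer(accelerator='mps', devices=1, precision='bf16-mixed')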
comment:8 by , 5 months ago
I timed Boltz on PDB 9moj (660 residues) on an Intel Xeon processor on Linux with AVX-512 CPU instructions to see if torch is much faster there. It completed successfully in 18 minutes, compared to 29 minutes on Windows with an i7-12700K processor.
Details: 4 Intel Xeon Gold 6226R CPUs @ 2.90 GHz (16 cores each), watson.cgl.ucsf.edu, 9moj, 1098 seconds = 18.3 minutes, 2600-2800% CPU reported by top, 12.5G VIRT, 6.8G RSS.
Given that it used about 28 cores versus the 12 cores in the i7 test, this Intel Xeon from 2020 seems to be slower than the i7 per core. I did not install the Intel Extension for PyTorch, which is supposed to use AVX-512 CPU instructions, so this result probably just means the default PyPI Linux pytorch (which includes CUDA support) does not have these Intel-specific optimizations.
comment:9 by , 5 months ago
I tried PyTorch 2.7.0 for xpu (Intel GPU) on Linux (minsky.cgl.ucsf.edu). It did not find any available GPU. Apparently this PyTorch build only supports Intel Arc discrete graphics, Intel Core Ultra processors with built-in Intel Arc graphics, and Intel Data Center GPU Max Series, not the integrated graphics of generic desktop CPUs like our Core i9-13900K.
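A minimal check of whether an xpu build of PyTorch sees a supported Intel GPU, using the torch.xpu interface described in the page linked above:
import torch

print(torch.__version__)             # expect an xpu build, e.g. '2.7.0+xpu'
print(torch.xpu.is_available())      # False on this machine: no supported Intel GPU
if torch.xpu.is_available():
    print(torch.xpu.get_device_name(0))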
comment:10 by , 5 months ago
I tried the Intel Extension for PyTorch (IPEX) on minsky and had no luck.
9blp, 103 sec on minsky with IPEX imported, shows 2400% CPU
9blp, 103 sec on minsky with the standard cuda torch using only the cpu, no IPEX
In the IPEX run I was not calling ipex.optimize().
Calling ipex.optimize(model, dtype=torch.bfloat16) ran but hit this error:
"AssertionError: BF16 weight prepack needs the cpu support avx_ne_convert or avx512bw, avx512vl and avx512dq, but the desired instruction sets are not available. Please set dtype to torch.float or set weights_prepack to False."
Calling ipex.optimize(model, dtype=torch.float) ran but got this error:
File "/home/goddard/boltz_0.4.1_cpu_ipex/lib/python3.11/site-packages/boltz/model/modules/trunk.py", line 375, in forward
msa_dropout = get_dropout_mask(self.msa_dropout, m, self.training)
AttributeError: 'NoneType' object has no attribute 'msa_dropout'
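The first error message itself suggests a possible workaround: keep dtype=torch.bfloat16 but set weights_prepack=False. A sketch with a stand-in model, assuming the documented ipex.optimize signature (untested here):
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(64, 64).eval()   # stand-in for the Boltz model

# Without avx512_bf16 / amx_bf16 the BF16 weight prepack is unavailable,
# but weights_prepack=False may still allow a bfloat16 run (likely slow
# on this CPU, judging by the earlier CPU bfloat16 timings).
model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)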
comment:11 by , 5 months ago
I added code for a bfloat16 option to the ChimeraX Boltz GUI and command, and tested it with a modified github Boltz repository where I added a --use_cuda_bfloat16 option. I've commented out the ChimeraX code while I wait for the Boltz github pull request to be accepted, since that is needed before this option can be used.
I ran tests on minsky.cgl.ucsf.edu with Nvidia 4090 graphics, editing the Boltz Python code. Allowing matrix multiplications to use several bfloat16 operations instead of one float32 operation in tensor cores did not help: the run time decreased a negligible amount, from 155 to 149 seconds, for the 911-residue PDB 9b3h, and the out-of-memory error still occurred for 9fz5 with 1025 tokens.
The Boltz code configures matrix multiplication so that it always uses full float32 operations.
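For reference, PyTorch's global switch for this behavior is torch.set_float32_matmul_precision; the exact lines in the Boltz source and in my edit are not reproduced here, so this is just a sketch of the two settings involved:
import torch

# 'highest' keeps every float32 matmul in full float32 (no bfloat16
# tensor-core path); 'medium' lets float32 matmuls run internally as
# several bfloat16 tensor-core operations.
torch.set_float32_matmul_precision('medium')    # allow bfloat16 tensor cores
# torch.set_float32_matmul_precision('highest') # always full float32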