Opened 4 days ago
Last modified 26 hours ago
#19227 assigned task
Survey of Claude AI problems controlling ChimeraX
| Reported by: | Tom Goddard | Owned by: | Tom Goddard |
|---|---|---|---|
| Priority: | moderate | Milestone: | |
| Component: | UI | Version: | |
| Keywords: | Cc: | Zach Pearson, Eric Pettersen, a.rohou@…, iamkaant@… | |
| Blocked By: | Blocking: | ||
| Notify when closed: | Platform: | all | |
| Project: | ChimeraX |
Description
The purpose of this ticket is to document the many problems that prevent Claude from successfully controlling ChimeraX, and use this list to figure out how best to improve Claude's capabilities with ChimeraX.
I've used Claude to control ChimeraX through MCP (Model Context Protocol) for about 5 hours on real-world biology problems, ranging from easy to hard, that ChimeraX can handle. I'd estimate Claude runs into some problem about 20 times per hour, and I have encountered dozens of different failure modes.
Since there are so many problems I'll number them so it is easy to add comments referring to specific issues. Some of the problems are very general (e.g. Claude hits maximum conversation size limit) and some are narrow and technical (e.g. Claude gets model spec ranges wrong, writing #1.1-1.5 instead of #1.1-5). The general ones are fewer and I think more interesting in terms of improving the utility of Claude controlling ChimeraX, so I'll put a "G" in the number for those issues to highlight them.
Change History (3)
comment:1 by , 4 days ago
comment:2 by , 4 days ago
G13) Claude usage limit. With the free Claude plan I was only able to ask about 15 questions before it said I had reached the usage limit for a 5 hour period. That was about 30 minutes of use. So I signed up for the Claude Pro plan, $20 per month, which gives 5x the usage limits. I have also run out of usage with the Pro plan twice so far in a total of just 5 hours of use over 3 days. I only get about 90 minutes of use before it says I must wait another few hours to get more usage. The next Claude price plan, Max, starts at $100 per month, which would probably limit use by academic biologists. The Max plan gives 5x more usage than Pro and would probably allow a full day of work with Claude without hitting limits.
comment:3 by , 26 hours ago
| Cc: | added |
|---|
Here are just a few of the many problems I encountered with Claude. I tried to recall the more general problems.
1) Help command. Claude uses the ChimeraX help command (e.g. help boltz predict) but the command does not return the actual documentation text so I don't think Claude gets any benefit.
2) Usage command. Claude uses the ChimeraX usage command (e.g. usage boltz predict) but the syntax of the allowed values for some options is not very helpful. For instance the boltz prediction "affinity" option says it is "a text string", but it really has to be a CCD code or SMILES string of a ligand in the prediction.
3) Superfluous command options. Claude uses frivolous command options that mess up the result. For instance I asked it to color a cryoEM atomic model by bfactor and it used "color bfactor #1 palette alphafold". The alphafold palette is only for predicted structures with pLDDT scores in the bfactor column. This command gave a wrong and confusing all blue coloring (typical of high B-factors) and only by checking what command Claude ran did I catch the error.
4) Commands that do nothing. Claude sometimes runs ChimeraX commands that do nothing (e.g. hide #1.1-1.5) but the command does not say it did nothing in the return value. This commonly happens when an atomspec refers to nothing. Many ChimeraX commands silently do nothing in that case. Some do return an indication that the atomspec was empty, for instance, color #9 red -> "colored 0 atoms, 0 surfaces..." status message but nothing is logged and the Python return value does not indicate this.
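As a rough illustration of how a silent no-op could be surfaced to Claude, here is a minimal sketch that inspects a status message such as "colored 0 atoms, 0 surfaces" and flags it when every reported count is zero. The function name looks_like_noop is hypothetical and is not part of ChimeraX or the MCP server. It assumes the status text is available as a string, which, as noted above, is not currently the case for the Python return value.

```python
import re

# Matches an integer followed by a word, e.g. "0 atoms" or "120 surfaces".
_COUNT = re.compile(r"\b(\d+)\s+[A-Za-z]")


def looks_like_noop(status: str) -> bool:
    """Hypothetical helper: True if the message reports counts and all are zero.

    A message with no counts at all returns False, since we cannot tell
    whether the command did anything.
    """
    counts = [int(n) for n in _COUNT.findall(status)]
    return bool(counts) and all(n == 0 for n in counts)
```

A wrapper around command execution could append a warning like "atomspec matched nothing" to the return value whenever this check is true, so the problem is visible without looking at the graphics.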
G5) Wrong result only visible in graphics. This is a huge problem, and will be a fundamental limitation on how useful Claude is controlling ChimeraX. In many cases where Claude goes astray it is obvious from what happened in the graphics pane, even though there were no errors. The original videos Alexis sent of Claude controlling ChimeraX showed Claude hiding everything (graphics blank) when it was supposed to show something. I've asked for morphs and the molecules fly wildly with immensely long bonds seen in the graphics. Claude has positioned a carfentanil on top of a fentanyl, but there were two fentanyls in 8ef5 and the carfentanil ended up incorrectly halfway between the two distant fentanyls. Claude tried to adjust carfentanil torsions to match fentanyl and it was visually obvious it got it all wrong. Claude tried to match different atom names between fentanyl (8ef5) and carfentanil (SMILES) and could not "see" the labels in the graphics that showed it got it wrong. Zach added a Capture View MCP method that could help with this. But it is hard to imagine Claude will be able to make sense of what it sees.
G6) Different server and ChimeraX file systems. Claude fails to realize that files ChimeraX writes are not visible to Python scripts that Claude writes on the server, and that ChimeraX Python scripts Claude writes on the server (in /tmp) are not visible to ChimeraX. This is a pretty fundamental problem that defeats a great many steps Claude attempts, for example, using Python RDKit installed on the server to analyze a ligand structure saved to a file by ChimeraX. Claude tries these scripts and always gets file not found errors and is then confused and says "Let me try something else". It seems like Anthropic could fix this. Also, in some cases Claude can get this right. I asked it to write all questions and answers to a log file. It was only able to do that on the server, not my local file system, but it could provide a link in Claude Desktop that showed the server log file in TextEdit on my Mac. I also saw Claude attempt to write a Python script on the server and then include it inline as an option to a ChimeraX command (maybe it was "runscript"). By allowing that to work we could fix part of this problem.
G7) Failure to plan ahead. Claude often fails to plan ahead when deciding on an approach to a problem, leading to disastrous results, and it doesn't seem to go back and realize a different approach was better. For example, I asked it to replace fentanyl with carfentanil with the same pose in opioid receptor structure 8ef5. It decided to open a SMILES carfentanil and try to align it to fentanyl, instead of doing what I would do and just using build structure to add a few atoms to the fentanyl. By using SMILES it got all different atom names and bond rotations, which it tried for literally hours to fix unsuccessfully. I think this is probably another fundamental limitation, and the only solution is probably that the user has to direct Claude to a better approach. Unfortunately the user often does not know the better approach.
G8) Conversation too long. About 10 times Claude ended my conversation with "Maximum conversation size reached. Please start a new chat." This is disastrous because it then loses all context from the original conversation. Somehow the Capture View code triggered this after only about 3 prompts, possibly a defect in the capture view MCP code. But more troubling, I hit it after conversations with 30 prompts when Claude ran some commands like "info distmat" that produced output with 20,000 words. Alexis Rohou notes he hit this a lot when Claude uses "info residues" on very large structures. If we make the help command return Elaine's long documentation pages that may also cause this problem. This is another pretty fundamental limitation. Maybe Claude can be told to truncate long output from commands. I think the standard Claude context window size limit is 200,000 tokens.
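One possible way to truncate long command output before it is returned to Claude is sketched below. This is only an illustration under assumptions: truncate_output is a hypothetical name, the 8000 character default is an arbitrary choice, and nothing here reflects how the ChimeraX MCP code currently handles return values.

```python
def truncate_output(text: str, max_chars: int = 8000) -> str:
    """Hypothetical helper: cap command output at roughly max_chars characters.

    The cut is moved back to the last complete line when possible, and a
    note is appended so the reader knows output was dropped.
    """
    if len(text) <= max_chars:
        return text
    kept = text[:max_chars]
    cut = kept.rfind("\n")
    if cut > 0:
        # Drop the partial final line rather than cutting mid-line.
        kept = kept[:cut]
    omitted = len(text) - len(kept)
    return f"{kept}\n[truncated: {omitted} characters omitted]"
```

The appended note matters: without it Claude might treat a partial "info residues" listing as complete. With it, Claude can rerun the command on a narrower atomspec.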
G9) Claude doesn't remember what it learns about ChimeraX between conversations. I am not sure how true this is. But I have often observed it making the same mistakes and solving them again when I repeat the same analysis in two different conversations. For example, it will try to align two ligands with matchmaker, get the error, then try align, then since the number of atoms differ try other things. The next conversation that calls for this it repeats all the same mistakes. If Claude learned from previous ChimeraX conversations that could be incredibly powerful. If it can't learn this will be a severe fundamental limitation.
10) Claude cannot inspect the ChimeraX log except when it gets a command return value. For example, I had Claude run a boltz prediction with affinity, which runs in the background and later spontaneously adds the affinity results to the log. I asked Claude to compare affinity values and it tried mightily to look at the Log using various hallucinated commands but failed and eventually had to ask me to tell it what the log said. We could add a command (e.g. "log text maxLines 100") that returns the log text, maybe limited to some number of lines because it can be very long.
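The line-limiting part of such a command could look roughly like the sketch below. It assumes the log contents are already available as plain text. How to obtain that text from ChimeraX is not shown, and log_tail and its max_lines parameter are hypothetical names chosen to mirror the proposed "maxLines" option.

```python
def log_tail(log_text: str, max_lines: int = 100) -> str:
    """Hypothetical helper: return at most the last max_lines lines of the log."""
    if max_lines <= 0:
        # Guard: lines[-0:] would return the whole list, not an empty one.
        return ""
    lines = log_text.splitlines()
    return "\n".join(lines[-max_lines:])
```

Returning the tail rather than the head fits the boltz case above, where the affinity results are appended to the log some time after the command returns.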
G11) Claude does not know what I did manually in ChimeraX. While using Claude I often hide and show models, change their styles, even open new models. Also Boltz predictions spontaneously open new models. Claude does not get notifications about any of this. So when asked to change what is shown Claude can fail, or at least has to look at all the state of ChimeraX to try to guess what has changed. I'm not sure if MCP allows spontaneous messages from an application to be sent to Claude so it can track what the user is doing directly in ChimeraX. Lacking this, it will be quite limiting trying to use Claude and manual ChimeraX interaction in the same session.
G12) Claude fails to find easily available information. For example, I asked it how many hydrogen bonds fentanyl makes with the opioid receptor in 8ef5. It amazingly homed in on a valid hbond command which returned to Claude the message "0 hydrogen bonds found". Yet Claude then went to heroic lengths to figure out how many hbonds there were (e.g. saving the hbond list to a file then trying to read the file and count the lines, or finding the hbond pseudobond model and using the info command to report how many pseudobonds there were). As another example I asked it to select the S4 helix in 9my3. It did internet searches and found a generic residue range 220-245, but the actual range from UniProt annotations is 217-232 for Xenopus. I pointed it directly at the UniProt page yet it said it could not find the correct range. I also tried to point it at the ChimeraX ability to open UniProt annotations, which it used and which has the right range, but it still failed.