Opened 5 years ago

Closed 5 years ago

#3430 closed defect (fixed)

Support >4 character atom names

Reported by: vlad_gunko@… Owned by: Eric Pettersen
Priority: moderate Milestone:
Component: Core Version:
Keywords: Cc: Tom Goddard
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

Formats such as mol2 can have >4 character atom names. To avoid memory bloat, use the same approach as Chimera (a symbol table) which will probably overall save memory. Could be used for residue names too.

Attachments (1)

sio88.mol2 (29.5 KB ) - added by Eric Pettersen 5 years ago.
example mol2 file with >4 character atom names

Download all attachments as: .zip

Change History (10)

by Eric Pettersen, 5 years ago

Attachment: sio88.mol2 added

example mol2 file with >4 character atom names

in reply to:  2 comment:1 by goddard@…, 5 years ago

I think ChimeraX should definitely be using C++ strings for atom names and residue names.  The memory saving of custom solutions is not worth future development costs. A single pointer is 8 bytes, with say 8 bytes for a character string (that is likely the minimum malloc size) means 16 bytes per name, for a 1 million atom structure, 16 Mbytes, negligible.  I have on multiple occasions the pleasure of unraveling our own custom string implementation where it differs from C++ std::string and think it is not sensible.

comment:2 by Eric Pettersen, 5 years ago

My recollection is that the default memory allocation for std::string is more on the order of 256 bytes, so about a 0.25 GB for a million-atom structure. Clearly, this simple thing to do here is write a trivial C++ program that creates a million std::strings and see how much memory it consumes. Also, calls like .reserve() and .shrink_to_fit() are "advisory" and do not require the string to reduce its allocated memory (and in my experience, they don't).

I'll probably write the test program sooner rather than later since this is trivial to fix if the memory usage is closer to what you think it is than what I think it is.

comment:3 by Eric Pettersen, 5 years ago

Well, the memory usage of a string varies quite a bit by platform. The usage is a combination of the size of the string object itself plus the minimum memory capacity it allocates for the string. Which is:

my home Mac (clang++): 24 + 23(!) = 47
plato (g++ 4.8.5): 8 + actual length = 10-13
rbvi-gassho (g++ 7.5.0): 32 + 16 = 48
essex (g++/clang++): 24 + 23 = 47

So the actual usage is between our estimates, about 3x yours and about 1/5 mine, roughly 44.8MB for a million-atom structure (+40MB over current).

I don't have a good feel for whether 40MB is "large enough to matter". Certainly one fast/cheap intermediate solution would be to increase the fixed atom-name maximum to 8, which would allow the mol2 file in this bug report to be read.

comment:4 by Tom Goddard, 5 years ago

ChimeraX residue names and chain ids currently use C++ std::string. But of course atom names are the most numerous. The original description in this report suggested using a symbol table --that may have bad speed implications reading PDB files since C++ hash maps are quite slow. And it adds complexity.

While knowing the memory use for std::string is nice what we really care about is performance in use. So I would find it more interesting to know how std::string memory use and say mmCIF load speed compares to our current CString, say with 3j3q for a big enough test where the difference may be visible. This takes into account hard to predict stuff like how much hidden space the memory allocator wastes and how much copying operations slow things down. It looks like only a few lines of code need to be changed to test that, a typedef. What I recall from having this discussion maybe 5 years ago was that the memory / performance differences were a few percent, too small to make the code more complex. But it may not be a few percent, I didn't run those tests. An additional important consideration is whether anyone cares. If no one else is going to write C++ code with our atomic data structures besides Eric then this is probably not worth messing with -- leave it as is. But if we envision other developers will one day use C++ Atom and Atom::name, then it becomes more attractive to use std::string to give those developers one less quirky thing to figure out.

comment:5 by Tom Goddard, 5 years ago

One more thought. The most important thing is a long atom name should not cause an inexplicable crash. It definitely needs to propagate an error say "Atom name abcdef is too long, max supported length is 4 characters."

comment:6 by Tom Goddard, 5 years ago

I just tested the mmCIF reader and it does give a good error message if a name is too long.

comment:7 by Eric Pettersen, 5 years ago

So I switched from char*[5] to string on my home Mac and for 3j3q the time to open was identical (12.9 seconds) and the memory use for char* was 1.73GB, vs. 1.78GB for string, which is a bit less of a difference than my estimate. My estimate was 40MB for a million atoms whereas here it was 50MB for 2.4 million atoms.

I don't think a symbol table would be slow, because map lookups only become slow once the binary tree underlying the map becomes large, and atom names are typically massively duplicative so there wouldn't be that many unique map entries. The point about code complexity for a symbol table is quite valid, though if I did implement a symbol table approach it could also be used for IDATM types, which would be a win.

So now that we have at least one set of numbers, any feeling about which way to go? My gut assessment is to switch to string, but put implementing a symbol-table approach on a low-priority to-do list since it also benefits atom types.

in reply to:  9 comment:8 by goddard@…, 5 years ago

I think the few percent improvement in memory use with char *[5] instead of std::string is probably not worth the weird limitation on length, but it is a close call.  I would go with std::string and skip the symbol table for now.  A symbol table isn't likely to decrease memory use more than a few percent so it does not look attractive to me if it increases complexity.  But maybe the symbol table would speed up some parts of ChimeraX significantly?  I have my doubts, and could not even suggest what it might speed up -- looking up millions of atoms by name seems uncommon.

comment:9 by Eric Pettersen, 5 years ago

Resolution: fixed
Status: assignedclosed

Okay, buckle your seatbelts, I've changed atom.name to string. That involved bumping the AtomicLibrary version to 2.0 and all that entails.

Note: See TracTickets for help on using tickets.