Skip to content

Conversation

drroe
Copy link
Contributor

@drroe drroe commented Sep 26, 2017

At long last, trying to address #71.

This PR expands cpptraj's mask selection syntax by introducing the following:

  • Selection by molecule index. Uses the caret ^ character, e.g.
^1-2:ASP

would select residues named ASP in molecules 1 and 2. I decided on ^ because it's highly unlikely to be used in an atom/residue name and it had no overlap with existing operators/operands. @swails, @hainm, @dacase, can you think of any reasons why not to use ^ here?

  • Distance-based selection by molecule. Use ^ in distance-based mask instead of : or @, e.g.
@5<^3.0

would select all molecules within 3.0 Angstroms of atom 5.

  • Selection by chain ID. Uses forward slash / modifier for residue :, e.g.
:/B

would select all residues with chain ID B.

  • Selection by original (typically PDB) residue number. Uses semicolon ; modifier for residue :, e.g.
:;2-4,8

would select all residues originally numbered 2-4 and 8.

This also includes a slight change to how argument lists search for masks. Previously an argument could be considered a mask if it contained @, :, or * anywhere in the argument string. The new method only considers an argument string a mask if it contains @ : ^ * or = after any leading ! and ( characters. This is more robust, but can lead to situations where bad masks are ignored and effectively changed to *. For example, with previous method select test@ fails with Tokenize: Wrong syntax because test@ is treated as a mask, but with the new method test@ is ignored and select assumes no mask present and selects everything. Not sure which behavior I like better and would be interested in opinions from @swails and @hainm.

This PR also includes some cleanup of the MaskToken and MaskTokenArray classes, and adds tests for the new syntax.

Daniel R. Roe added 23 commits September 26, 2017 10:14
…Modify ArgList::ArgIsMask so that it looks for characters which are valid to start masks, not just mask characters in a string.
…ill add option to return to original behavior later.
@drroe drroe self-assigned this Sep 26, 2017
@hainm
Copy link
Contributor

hainm commented Sep 26, 2017

@drroe let me have a bit thinking first.

@hainm
Copy link
Contributor

hainm commented Sep 26, 2017

Just quick question.

Will you update mask selection for other programs in AT too? then ParmEd needs to update too.

@hainm
Copy link
Contributor

hainm commented Sep 26, 2017

Saying

^1-2:ASP@CA: does this mean "select CA atoms from ASP in mol 1 and 2"? If yes, can you add test for that?

  • so on for @5<^3.0@CA

Overall, I think the change is pretty good. Will add some testing for pytraj for sure.

@swails
Copy link
Contributor

swails commented Sep 26, 2017

I'll respond when I'm on my computer tonight.

@swails
Copy link
Contributor

swails commented Sep 27, 2017

Selection by chain ID. Uses forward slash / modifier for residue :, e.g.

What Chimera does here is :: for chains, which is a scheme familiar to a non-negligible crowd of users and from their experience seems to have been working fine for quite some time. Any reason not to do the same? This could actually double as the molecule selector where integers select molecule numbers and letters select chains. The arguments for this approach are

  1. Chains and molecules are very similar conceptually and I've yet to see any chain named with a number so there won't be any practical conflict.
  2. Other operators behave the same way in terms of treating integers as indexes and alphanums as names (@ and :)

The arguments against this approach are

  1. Chains and molecules are similar, but also different. Cross-linked chains are the same molecules, and using the same syntax for both may lead to confusion (and S-S bonds are quite common).
  2. Solvent is typically all one chain, but are all separate molecules
  3. You could also just have ::<number> select the <number>th chain.

If the arguments against are more persuasive, I think ^ is a reasonable character to use for molecules. In fact, I think point 3 above has pushed me away from this idea anyway (and in support of chain #). However, I do like :: better than :/ for chain selection. The former is what Chimera does and the latter will be rendered into an emoji whenever you try to send it to someone 😏

Selection by original (typically PDB) residue number. Uses semicolon ; modifier for residue :, e.g.

What would happen if someone used this syntax on a prmtop file that wasn't modified via addPDB? It would probably be quite perplexing to users that didn't realize prmtop files don't store the information needed to use :; by default (after all, the prmtop is a bit of a black-box format), so maybe an error is the way to go. But I don't know. I can see the utility of original residue number, but I don't have strong feelings about it at the moment.

As a general question, what happens when selecting based on information that doesn't exist? (Chain, original residue number, ... not sure if I'm missing one.)

@drroe
Copy link
Contributor Author

drroe commented Sep 27, 2017

Thanks @hainm and @swails for the feedback! I'll address concerns in order.

Will you update mask selection for other programs in AT too? then ParmEd needs to update too.

I will try to make this a priority, I don't have a lot of time for development these days. There are quite a few programs that would need to be updated: parmed, sander/ambmask, pmemd (feels like there should be a common parser for those two), mdgx, nab, and probably some I'm missing. I'd at the very least create an Issue or bug report asking for these to be added :-).

^1-2:ASP@CA: does this mean "select CA atoms from ASP in mol 1 and 2"? If yes, can you add test for that?

Yes, it does mean that, it does work, and I'll add a test.

so on for @5<^3.0@CA

There is already a test for distance selection by molecule in Test_DistBasedMask (unless I'm not understanding your meaning).

What Chimera does here is :: for chains ... Any reason not to do the same?

I did try to implement this initially, and to make a long story short the way the cpptraj mask tokenizer currently works (character by character) this is not straightforward to do. I'd have to introduce some extra logic that says "if this character is : and the character just prior was also : ..." (currently it just tracks "was : the last selection token I encountered?"). This admittedly isn't conceptually hard but is potentially time consuming - really best solution is probably to just rewrite the tokenizer. And as we all know rewrites often sound like they'll take 2 hours tops and then end up taking 2 days :-). That I'm willing to do the work if it's the best solution - get it right the first time.

Chains and molecules are similar, but also different.

I would go farther and say that they're very different. In a non-insignificant fraction of PDBs I have encountered, chains span multiple molecules (for example in dimers/tetramers etc all molecules that make up a monomer are often given the same chain ID).

What would happen if someone used this syntax on a prmtop file that wasn't modified via addPDB?

For chain ID, you'd get no atoms selected. For original residue number, if it's not present cpptraj fills it in with 1-based ordering (so e.g. residue number and original residue number are exactly the same for Amber topology files).

To to summarize it seems that ^ is good for molecule selection, and :: is recommended for chain ID selection. If the latter is implemented should :/ be used for original residue number, or keep it as :;? The only reason I used :/ in the first place was because / is already a "modifier" for @, but I'm not set on using it.

@hainm
Copy link
Contributor

hainm commented Sep 27, 2017

@5<^3.0@CA

I mean "does this select CA atoms of molecule within 3 A from atom 5?"

@drroe
Copy link
Contributor Author

drroe commented Sep 27, 2017

I mean "does this select CA atoms of molecule within 3 A from atom 5?"

No, but to be fair it doesn't mean that for @ or : either - you have always needed to explicitly provide the & in this case.

@drroe
Copy link
Contributor Author

drroe commented Sep 27, 2017

Turns out the tokenizer modification wasn't as bad as I thought. Chain ID selection is now ::.

@drroe
Copy link
Contributor Author

drroe commented Sep 27, 2017

@swails @hainm any comments on the change in how argument lists search for masks (final paragraph of the original PR post)?

@hainm
Copy link
Contributor

hainm commented Sep 27, 2017

No I have not.

@drroe
Copy link
Contributor Author

drroe commented Sep 27, 2017

In that case I'll merge now - easy enough to change behavior back later if needed.

@drroe drroe merged commit b517576 into Amber-MD:master Sep 27, 2017
@drroe drroe deleted the expand_mask branch September 27, 2017 17:58
@drroe drroe mentioned this pull request Apr 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants