Skip to content

Conversation

fwaibl
Copy link
Contributor

@fwaibl fwaibl commented Feb 16, 2022

This pull request contains two changes to the GIST entropy calculation:

  • Correction term for the "reference volume" in the nearest neighbor entropy. This significantly improves convergence of the entropy in bulk, at the cost of a minor slow-down. There is a flag "oldnnvolume" to recover the old behavior.
  • This implements a new version of the nearest neighbor search. The new version allow the user to choose how many layers of neighboring voxels should be used, using a new option "nnsearchlayers". With nnsearchlayers 1 (default), the results are almost exactly the same. Using a higher value improves bulk convergence of the entropy. Furthermore, the new implementation deals differently with waters that are numerically at the same position (this is very rare).

Timings (Action Post, using 10000 frames of a pure water simulation):

  • old, already including the previous speedup: 258s
  • with the volume correction: 267 s
  • with new NN search (1 layers), and the volume correction: 287 s
  • with new NN search (5 layers), and the volume correction: 439 s (all values >1 should have similar timing, since voxels are omitted if it can be proven that the nearest neighbor is already found)

I attach a plot of the average entropy contributions calculated from a pure water MD (lower is better, since the entropy is relative to bulk). Note the log scale.

average-entropy-of-pure-water

In my opinion, the volume correction term is a significant improvement, since it reduces the bulk entropy a lot, and the slow-down is very small. I am not so sure about the new NN search however, since it it less relevant for the convergence (the log scale makes it look bigger, but in fact most of the error is compensated by the volume correction), and the calculation is always a bit slower. I also don't know exactly why it is slower with "nnsearchlayers 1".

Another question: I added some unit tests in the unitTests folder. To do so, I needed to move the new functionality into a separate header (GistEntropyUtils.h) and cpp file. Is the file naming etc. ok with you? I could easily move everything into Action_GIST.cpp, but then I can't use the unit tests any more.

In GIST, the six-dimensional entropy calculation requires the
translational and orientational distance to the nearest neighbor (NN).
However, if the translational distance is larger than the 6D distance to
the current NN, the orientational distance can be skipped safely.  This
is a significant speedup (about 2x in my test), since the orientational
distance contains an arccos.
The nearest-neighbor density is approximated by finding the distance to
the nearest neighbor of each point, computing a sphere volume from this
distance, and comparing this to the volume that is expected from a
homogeneous density. In GIST, the sphere volume in rotational space and
the combined translation+rotation space was approximated by a Taylor
expansion. This patch implements the exact volume in rotational space
analytically, and uses a tabulated correction factor for distances up to
10 Angstrom to correct the 6-D volume.

The result is a better convergence of the bulk entropy, at the cost of
about 3.4% calculation time (Action Post) in my test (a pure water box
containg 3333 water molecules, using 10000 frames). The bulk entropy
(should be zero) is reduced from 2.3e-4 to 5e-5 kcal/mol/A^3 for dTSsix
and from 7.7e-4 to 2.3e-4 for dTSorient. The individual voxel values are
almost perfectly correlated to the old results (R^2=0.99996).
In the old algorithm, neighbors were searched in the respective voxel
and in all of its direct neighbors. The new algorithm lets the user
choose how many layers of neighboring voxels should be search. The
default is 1, which corresponds to the old behavior.
There is a tiny difference in how distances of 0 are treated. In the old
code, they were omitted. In the new code, there is a constant
(GIST_TINY) which is used if the resulting distance is 0. This leads to
numerically different values in about 0.01% of the voxels.
@AmberJenkins
Copy link
Collaborator

The PGI build in Jenkins failed.

@fwaibl
Copy link
Contributor Author

fwaibl commented Feb 16, 2022

Obviously, the tests are failing because the GIST results are not the same any more. How is this usually handled in cpptraj? Do we need to increase the major (or minor) version number, or do we wait until a new cpptraj version is released?

@fwaibl
Copy link
Contributor Author

fwaibl commented Feb 16, 2022

If we need to obtain exactly the same results, I can replace the "oldnnvolume" flag for a "newnnvolume" one and make the old behavior the default. Further, I would have to revert the behavior for water molecules at the same position to the old behavior (I am not sure which behavior is actually preferred...)
Should I do that?

@drroe
Copy link
Contributor

drroe commented Feb 16, 2022

Obviously, the tests are failing because the GIST results are not the same any more. How is this usually handled in cpptraj? Do we need to increase the major (or minor) version number, or do we wait until a new cpptraj version is released?

So typically the way I have been versioning is revisions are for changes which do not change existing output, minor versions are for when input/output changes, and major versions are for when a significant part of the code (API etc) are altered. In this case, we would probably want a minor version bump (resetting the revision number) since output changes (see description in sec/Version.h).

If we need to obtain exactly the same results, I can replace the "oldnnvolume" flag for a "newnnvolume" one and make the old behavior the default.

Is there any advantage to retaining the old behavior? The new method seems like it is better behaved. However, there is often a benefit to retaining old behavior (for e.g. reproducing old results), particularly if the new results are numerically different. @gosldorf @EricChen521 @tkurtzman what do you think?

@fwaibl if we decide to make this the default, the test results (after being validated) should be updated. If we decide to make it an option, a test should be added for this option. Let me know if this makes sense.

@drroe
Copy link
Contributor

drroe commented Feb 16, 2022

@fwaibl also I forgot to mention this, but please run make depend in the src/ directory so that the file dependencies are all updated. Thanks!

@AmberJenkins
Copy link
Collaborator

The PGI build in Jenkins failed.

@drroe
Copy link
Contributor

drroe commented Feb 17, 2022

I'm going to bring your branch up to date. Let me know if it causes any issues.

@AmberJenkins
Copy link
Collaborator

The PGI build in Jenkins failed.

fwaibl and others added 6 commits February 22, 2022 15:35
In the GIST input, the gridcntr and griddim arguments were previously
parsed as positional arguments (but if the keyword was present), rather
than being parsed behind the keyword. I.e.,
gist 0.6 1.0 1.6 50 gridcntr griddim 60 70 gridspacn 0.5 out gist.dat refdens 0.0329
was valid, while
gist griddim 50 60 70 gridspacn 0.5 out gist.dat refdens 0.0329 gridcntr 0.6 1.0 1.6
was invalid (because the center values had to go first)

This commit changes the parsing to the expected bevavior, i.e., the first
example is now invalid and the second is valid.
Changes:
* after the nearest neighbor (NN) search, introduces a check whether the
  distance is smaller than sqrt(GIST_HUGE) (i.e., a NN has been found). If
  not, this distance is omitted from the entropy calculation.
* fix a bug where some non-border voxels were omitted instead of voxels
  on the border. (GIST voxels on the border of the grid are supposed to
  obtain 0 entropy)
* don't look for nearest neighbors if we are on the border.

Explanation:
In GIST, the entropy is calculated per water molecule based on the
NN distance, and binned to a grid. With the recent change of the entropy
calculation, I removed some checks that:
* omit the entropy of waters if no NN distance is found. Instead, a high
  value (sqrt(GIST_HUGE)) is assigned to the distance.
* omit the NN if the distance to it is numerically close to zero.
  Instead, a small value (GIST_TINY) is then assigned to the distance.
The second first behavior is reasonable (omitting the NN means
that an arbitrary higher distance is chosen, and GIST_TINY is probably
smaller than that arbitrary distance, i.e., closer to the real one.
However, the first one is not so reasonable, because sqrt(GIST_HUGE) is
probably far off from the real NN distance. Therefore, I re-introduce
the old behavior in this case.

This is very relevant for the results of the cpptraj tests, since they
use very few frames, leading to many cases of high NN distances.
* This adds 2 tests for the new flags oldnnvolume and nnsearchlayers.
  The .save files for oldnnvolume (Gist6) are just renamed from the
  previous Gist2 test. The other test .save files are changed to match
  the new entropy results.
* Increase cpptraj version to 6.3.0
@fwaibl
Copy link
Contributor Author

fwaibl commented Feb 25, 2022

I updated the tests and also found a bug meanwhile (in the changed entropy - same pull request), where some voxels that are not on the border of the grid were assigned zero entropy values.
I also took the liberty of changing the command line parsing, which was IMO buggy (it was previously assumed that gridcntr is specified before griddim, and the order of the arguments was independent of the keywords - see the commit message for examples.)

nx = dimArgs.getNextInteger(-1);
ny = dimArgs.getNextInteger(-1);
nz = dimArgs.getNextInteger(-1);
}
griddim_ = Vec3((double)nx, (double)ny, (double)nz);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should exit with an error here if negative dimension values have been specified. I should have done this in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can add this quickly. I guess it should also exit if a dimension is zero?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it should also exit if a dimension is zero?

Yes definitely - I don't think there's any situation where one of the dims can be zero and it would be a valid calculation.

mprintf("Warning: No grid center values specified, using default (origin)\n");
gridcntr_ = Vec3(0.0);
} else {
gridcntr_[0] = centerArgs.getNextDouble(-1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change. If we want to be extra careful we can make sure the arguments are valid doubles with the validDouble() routine from StringRoutines.h, e.g.

if (!validDouble(centerArgs[0])) {
  mprinterr("Error: '%s' is not a valid argument for 'gridcntr'\n");
  return Action::ERR;
}
gridcntr_[0] = convertToDouble(centerArgs[0]);

and so on for the other arguments. Can do something similar for the grid dimensions with validInteger().

@@ -0,0 +1,88 @@
#ifndef GIST_ENTROPY_UTILS_H
#define GIST_ENTROPY_UTILS_H

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we could put all of this in a namespace, e.g.

namespace Cpptraj {
namespace GistEntropyUtils {

I've done this with the clustering routines recently (see the Cluster subdirectory) and I think it makes for nicer code. For some things however it doesn't work since I made Cpptraj a class (when I started cpptraj I knew far less about good C++ practices than I do now). I think since only the GIST action will pull this in it would be ok. Let me know what you think.


typedef std::vector<float> Farray;

#define GIST_HUGE 10000.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to make this a const double in a GistEntropyUtils namespace, see above.

@drroe
Copy link
Contributor

drroe commented Feb 25, 2022

@fwaibl I just made a bunch of comments, let me know if you'd like me to handle any of these if you don't have the time. I think I should be able to work on this PR somehow. I can also accept as-is and make the changes in a future PR. Let me know what works best for you.

Thanks for this, it's great work!

@drroe
Copy link
Contributor

drroe commented Feb 25, 2022

Also be aware that the code deadline for the next Amber release is March 11, so any changes we want to get into the next Amber release should be done a few days before that.

@fwaibl
Copy link
Contributor Author

fwaibl commented Feb 25, 2022

Thanks for your comments. I can add them on Monday, but probably not today. Otherwise, feel free to change something yourself. I think you have write access, since you already pushed a merge commit to my fork (but I am no expert on git...)

Also be aware that the code deadline for the next Amber release is March 11, so any changes we want to get into the next Amber release should be done a few days before that.

That's good to know. I am confident we can finish this pull request until then, but I probably won't be on time for the extension to ions and multiple solvents, since it turns out that there is quite some stuff that needs to be changed (mostly bookkeeping, i.e., which atoms are solute/solvent, etc.)

@AmberJenkins
Copy link
Collaborator

The PGI build in Jenkins failed.

fwaibl and others added 2 commits February 28, 2022 09:28
* Move GistEntropyUtils functions into a GistEntropyUtils namespace
* make GIST_HUGE a const double (instead of a macro)
* add validDouble and validInteger checks for griddim and gridcntr
  arguments.
@drroe
Copy link
Contributor

drroe commented Feb 28, 2022

@fwaibl thanks!

@drroe drroe merged commit 8a1543b into Amber-MD:master Feb 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants