-
Notifications
You must be signed in to change notification settings - Fork 66
GIST Entropy improvements #947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In GIST, the six-dimensional entropy calculation requires the translational and orientational distance to the nearest neighbor (NN). However, if the translational distance is larger than the 6D distance to the current NN, the orientational distance can be skipped safely. This is a significant speedup (about 2x in my test), since the orientational distance contains an arccos.
The nearest-neighbor density is approximated by finding the distance to the nearest neighbor of each point, computing a sphere volume from this distance, and comparing this to the volume that is expected from a homogeneous density. In GIST, the sphere volume in rotational space and the combined translation+rotation space was approximated by a Taylor expansion. This patch implements the exact volume in rotational space analytically, and uses a tabulated correction factor for distances up to 10 Angstrom to correct the 6-D volume. The result is a better convergence of the bulk entropy, at the cost of about 3.4% calculation time (Action Post) in my test (a pure water box containg 3333 water molecules, using 10000 frames). The bulk entropy (should be zero) is reduced from 2.3e-4 to 5e-5 kcal/mol/A^3 for dTSsix and from 7.7e-4 to 2.3e-4 for dTSorient. The individual voxel values are almost perfectly correlated to the old results (R^2=0.99996).
In the old algorithm, neighbors were searched in the respective voxel and in all of its direct neighbors. The new algorithm lets the user choose how many layers of neighboring voxels should be search. The default is 1, which corresponds to the old behavior. There is a tiny difference in how distances of 0 are treated. In the old code, they were omitted. In the new code, there is a constant (GIST_TINY) which is used if the resulting distance is 0. This leads to numerically different values in about 0.01% of the voxels.
The PGI build in Jenkins failed. |
Obviously, the tests are failing because the GIST results are not the same any more. How is this usually handled in cpptraj? Do we need to increase the major (or minor) version number, or do we wait until a new cpptraj version is released? |
If we need to obtain exactly the same results, I can replace the "oldnnvolume" flag for a "newnnvolume" one and make the old behavior the default. Further, I would have to revert the behavior for water molecules at the same position to the old behavior (I am not sure which behavior is actually preferred...) |
So typically the way I have been versioning is revisions are for changes which do not change existing output, minor versions are for when input/output changes, and major versions are for when a significant part of the code (API etc) are altered. In this case, we would probably want a minor version bump (resetting the revision number) since output changes (see description in
Is there any advantage to retaining the old behavior? The new method seems like it is better behaved. However, there is often a benefit to retaining old behavior (for e.g. reproducing old results), particularly if the new results are numerically different. @gosldorf @EricChen521 @tkurtzman what do you think? @fwaibl if we decide to make this the default, the test results (after being validated) should be updated. If we decide to make it an option, a test should be added for this option. Let me know if this makes sense. |
@fwaibl also I forgot to mention this, but please run |
The PGI build in Jenkins failed. |
I'm going to bring your branch up to date. Let me know if it causes any issues. |
The PGI build in Jenkins failed. |
In the GIST input, the gridcntr and griddim arguments were previously parsed as positional arguments (but if the keyword was present), rather than being parsed behind the keyword. I.e., gist 0.6 1.0 1.6 50 gridcntr griddim 60 70 gridspacn 0.5 out gist.dat refdens 0.0329 was valid, while gist griddim 50 60 70 gridspacn 0.5 out gist.dat refdens 0.0329 gridcntr 0.6 1.0 1.6 was invalid (because the center values had to go first) This commit changes the parsing to the expected bevavior, i.e., the first example is now invalid and the second is valid.
Changes: * after the nearest neighbor (NN) search, introduces a check whether the distance is smaller than sqrt(GIST_HUGE) (i.e., a NN has been found). If not, this distance is omitted from the entropy calculation. * fix a bug where some non-border voxels were omitted instead of voxels on the border. (GIST voxels on the border of the grid are supposed to obtain 0 entropy) * don't look for nearest neighbors if we are on the border. Explanation: In GIST, the entropy is calculated per water molecule based on the NN distance, and binned to a grid. With the recent change of the entropy calculation, I removed some checks that: * omit the entropy of waters if no NN distance is found. Instead, a high value (sqrt(GIST_HUGE)) is assigned to the distance. * omit the NN if the distance to it is numerically close to zero. Instead, a small value (GIST_TINY) is then assigned to the distance. The second first behavior is reasonable (omitting the NN means that an arbitrary higher distance is chosen, and GIST_TINY is probably smaller than that arbitrary distance, i.e., closer to the real one. However, the first one is not so reasonable, because sqrt(GIST_HUGE) is probably far off from the real NN distance. Therefore, I re-introduce the old behavior in this case. This is very relevant for the results of the cpptraj tests, since they use very few frames, leading to many cases of high NN distances.
* This adds 2 tests for the new flags oldnnvolume and nnsearchlayers. The .save files for oldnnvolume (Gist6) are just renamed from the previous Gist2 test. The other test .save files are changed to match the new entropy results. * Increase cpptraj version to 6.3.0
I updated the tests and also found a bug meanwhile (in the changed entropy - same pull request), where some voxels that are not on the border of the grid were assigned zero entropy values. |
nx = dimArgs.getNextInteger(-1); | ||
ny = dimArgs.getNextInteger(-1); | ||
nz = dimArgs.getNextInteger(-1); | ||
} | ||
griddim_ = Vec3((double)nx, (double)ny, (double)nz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should exit with an error here if negative dimension values have been specified. I should have done this in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I can add this quickly. I guess it should also exit if a dimension is zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it should also exit if a dimension is zero?
Yes definitely - I don't think there's any situation where one of the dims can be zero and it would be a valid calculation.
mprintf("Warning: No grid center values specified, using default (origin)\n"); | ||
gridcntr_ = Vec3(0.0); | ||
} else { | ||
gridcntr_[0] = centerArgs.getNextDouble(-1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change. If we want to be extra careful we can make sure the arguments are valid doubles with the validDouble()
routine from StringRoutines.h
, e.g.
if (!validDouble(centerArgs[0])) {
mprinterr("Error: '%s' is not a valid argument for 'gridcntr'\n");
return Action::ERR;
}
gridcntr_[0] = convertToDouble(centerArgs[0]);
and so on for the other arguments. Can do something similar for the grid dimensions with validInteger()
.
@@ -0,0 +1,88 @@ | |||
#ifndef GIST_ENTROPY_UTILS_H | |||
#define GIST_ENTROPY_UTILS_H | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we could put all of this in a namespace, e.g.
namespace Cpptraj {
namespace GistEntropyUtils {
I've done this with the clustering routines recently (see the Cluster subdirectory) and I think it makes for nicer code. For some things however it doesn't work since I made Cpptraj
a class (when I started cpptraj I knew far less about good C++ practices than I do now). I think since only the GIST action will pull this in it would be ok. Let me know what you think.
src/GistEntropyUtils.h
Outdated
|
||
typedef std::vector<float> Farray; | ||
|
||
#define GIST_HUGE 10000.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to make this a const double
in a GistEntropyUtils
namespace, see above.
@fwaibl I just made a bunch of comments, let me know if you'd like me to handle any of these if you don't have the time. I think I should be able to work on this PR somehow. I can also accept as-is and make the changes in a future PR. Let me know what works best for you. Thanks for this, it's great work! |
Also be aware that the code deadline for the next Amber release is March 11, so any changes we want to get into the next Amber release should be done a few days before that. |
Thanks for your comments. I can add them on Monday, but probably not today. Otherwise, feel free to change something yourself. I think you have write access, since you already pushed a merge commit to my fork (but I am no expert on git...)
That's good to know. I am confident we can finish this pull request until then, but I probably won't be on time for the extension to ions and multiple solvents, since it turns out that there is quite some stuff that needs to be changed (mostly bookkeeping, i.e., which atoms are solute/solvent, etc.) |
The PGI build in Jenkins failed. |
* Move GistEntropyUtils functions into a GistEntropyUtils namespace * make GIST_HUGE a const double (instead of a macro) * add validDouble and validInteger checks for griddim and gridcntr arguments.
@fwaibl thanks! |
This pull request contains two changes to the GIST entropy calculation:
Timings (Action Post, using 10000 frames of a pure water simulation):
I attach a plot of the average entropy contributions calculated from a pure water MD (lower is better, since the entropy is relative to bulk). Note the log scale.
In my opinion, the volume correction term is a significant improvement, since it reduces the bulk entropy a lot, and the slow-down is very small. I am not so sure about the new NN search however, since it it less relevant for the convergence (the log scale makes it look bigger, but in fact most of the error is compensated by the volume correction), and the calculation is always a bit slower. I also don't know exactly why it is slower with "nnsearchlayers 1".
Another question: I added some unit tests in the unitTests folder. To do so, I needed to move the new functionality into a separate header (GistEntropyUtils.h) and cpp file. Is the file naming etc. ok with you? I could easily move everything into Action_GIST.cpp, but then I can't use the unit tests any more.