-
Notifications
You must be signed in to change notification settings - Fork 26
Description
Describe the Problem
Multiple runtime problems were discovered when MET was compiled and tested with the -O2
level of optimization in support of creating a METplus build through Conda. This issue is to resolve these runtime problems related to optimization. It will only be complete when (1) all MET unit tests run to completion with no error and (2) no significant differences remain between the output from MET compiled with optimization and without.
Additional recommendations:
- Update the Dockerfile to routinely test MET with optimization (GNU
-O2
) enabled to more closely match how it is run in production. - GNU unit testing.
- Compiled and tested with
-O1
,-O2
, and-O3
. - Unit tests run to completion and output is compared to main_v12.0 NB20250204 nightly build output.
- For
-O1
, 3 files have size diffs > 1% and real diffs are flagged in 1 NetCDF file:
- Compiled and tested with
file1: ../../test_output/pb2nc_indy/nam.20210311.t00z.prepbufr.tm00.pbl.nc
file2: /d1/projects/MET/MET_regression/main_v12.0/NB20250204/MET-main_v12.0/test_output/pb2nc_indy/nam.20210311.t00z.prepbufr.tm00.pbl.nc
ERROR: found 4 differences in var obs_val - max abs: 0.05004883
- For
-O2
, same 4 diff files as-O1
plus 15 more related to BCLP derivation in TC-Pairs. - For
-O3
, same 19 diff files as-O2
. - So not optimizing with
FFLAGS
did NOT solve the BCLP derivation diffs. - Intel unit testing.
- Compiled and tested with
-O1
,-O2
, and-O3
. - Note that
unit_modis.xml
andunit_lidar2nc.xml
are skipped because those tools are not compiled. - Also note that
unit_tc_diag.xml
andunit_python.xml
are skipped because of issues running Python. unit_point2grid.xml
segfaults on theTEST: point2grid_GOES_16_AOD_TO_G212_GAUSSIAN
test:
- Compiled and tested with
DEBUG 1: Reading data file: /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M3_G16_s20181341702215_e20181341704588_c20181341711418.nc
terminate called after throwing an instance of 'netCDF::exceptions::NcAttMeta'
what(): NetCDF: Can't open HDF5 attribute
file: ncCheck.cpp line:92
FATAL ERROR (SEGFAULT): Process 3478583 got signal 11 @ local time = 2025-02-05 18:09:28Z
- For
-O1
, ... - For
-O2
, ... - For
-O3
, ...
The runtime problems, along with their fixes, are described below:
-
unit_pb2nc.xml
fails as described in MET issue Bugfix: Fix the PARUSR BUFRLIB failure when PB2NC is compiled with-O2
optimization #3054 and fixed by pull request Bugfix #3054 main_v12.0 parusr #3068. This was due to referencing temporary memory that was impacted by optimization. -
unit_point2grid.xml
segfaults on thepoint2grid_GOES_16_ADP
test.
The point2grid_GOES_16_ADP unit test fails to run when installed via conda. I suspect that this is due to optimization being turned on, similar to the bug from #3054.
Here is an example command that fails on seneca:
point2grid \
/d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20192662141196_e20192662143569_c20192662145547.nc \
G212 \
/d1/personal/mccabe/met_test_output/point2grid/point2grid_GOES_16_ADP.nc \
-field name="AOD_Smoke"; level="(*,*)"; \
-adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20192662141196_e20192662143569_c20192662144526.nc \
-goes_qc 0,1 -method MAX -v 1
This produces a segfault when compiled with the -O2
optimization flag.
This problem has been fixed on the bugfix branch for this issue and was due to buffer overflow, reading > 512 characters into a buffer of that size.
-
unit_tc_pairs.xml
produces bad output in the TCDIAG lines, where the diagnostic names contains garbage strings.
While tc_pairs
completes without error, the unit test validation logic fails when reading the corrupted strings. This was fixed by updating the TrackInfo::diag_name()
accessor function to return a string copy of the diagnostic name rather than a const char *
pointer to temporary memory.
-
unit_point2grid.xml
aborts on thepoint2grid_2D_time
test on line 442 ofnc_cf_file.cc
. The debugger indicates that the NetCDF groupId value being read is corrupted:
p *_yDim
$31 = {nullObject = false, myId = 0, groupId = 1167840000}
When it should have the same groupId as _xDim (confirmed by debugging the non-optimized version):
p *_xDim
$32 = {nullObject = false, myId = 1, groupId = 131072}
The problem lies in the read_netcdf_grid()
function in nc_cf_file.cc
where the _xDim
and _yDim
pointers were pointing to a local variable. The fix is to search for and point them to class variables that won't go out of scope.
- All unit tests now run to completion when compiled using
-g -O2
. However, comparing to the output from the unoptimized nightly build runs reveals differences, including differences in the number of TCST lines written bytc_pairs
:
file1: test_output/tc_pairs/al022013_interp12_fill.tcst
file2: /d1/projects/MET/MET_regression/main_v12.0/NB20250204/MET-main_v12.0/test_output/tc_pairs/al022013_interp12_fill.tcst
ERROR: differing number of rows 1162 vs. 1198 for row type TCST_TCMPR between versions 12_0 vs. 12_0
However, all the differences in these 2 files are for the BCLP model, which is computed using NHC-provided FORTRAN code. So it's possible an optimization problem also resides in the in FORTRAN code within MET. For the time being, I'll retest WITHOUT setting FFLAGS='-O2'
to see if differences remain.
-
unit_tc_diag.xml
segfaults the same way on Intel builds for all optimization levels. The problem occurs in the Python diagnostic computation step.unit_python.xml
also fails on thepython_plot_point_obs_met_nc_to_pandas
test after completing many successful tests.
These Python failures could very well be related to us linking to this /nrit/ral/met-python3/bin/python3.10
version of Python built with GNU. Perhaps we need to link to an Intel Python build instead? For the time being, I've just skipped over unit_tc_diag.xml
and unit_python.xml
.
Expected Behavior
All MET unit tests should run to completion and create numerically equivalent output regardless of the optimization level.
Environment
Describe your runtime environment:
Reproduced on the MET development machine named seneca.
To Reproduce
Describe the steps to reproduce the behavior:
- Compile the MET main_v12.0 branch with
CFLAGS
,CXXFLAGS
, andFFLAGS
set to include-O2
. - Run
bin/unit_test.sh
to run all of the MET unit tests.
Relevant Deadlines
List relevant project deadlines here or state NONE.
Funding Source
2702701 to support conda builds.
Define the Metadata
Assignee
- Select engineer(s) or no engineer required
- Select scientist(s) or no scientist required
Labels
- Review default alert labels
- Select component(s)
- Select priority
- Select requestor(s)
Milestone and Projects
- Select Milestone as the next bugfix version
- Select Coordinated METplus-X.Y Support project for support of the current coordinated release
- Select MET-X.Y.Z Development project for development toward the next official release
Define Related Issue(s)
Consider the impact to the other METplus components.
Bugfix Checklist
See the METplus Workflow for details.
- Complete the issue definition above, including the Time Estimate and Funding Source.
- Fork this repository or create a branch of main_<Version>.
Branch name:bugfix_<Issue Number>_main_<Version>_<Description>
- Fix the bug and test your changes.
- Add/update log messages for easier debugging.
- Add/update unit tests.
- Add/update documentation.
- Push local changes to GitHub.
- Submit a pull request to merge into main_<Version>.
Pull request:bugfix <Issue Number> main_<Version> <Description>
- Define the pull request metadata, as permissions allow.
Select: Reviewer(s) and Development issue
Select: Milestone as the next bugfix version
Select: Coordinated METplus-X.Y Support project for support of the current coordinated release - Iterate until the reviewer(s) accept and merge your changes.
- Delete your fork or branch.
- Complete the steps above to fix the bug on the develop branch.
Branch name:bugfix_<Issue Number>_develop_<Description>
Pull request:bugfix <Issue Number> develop <Description>
Select: Reviewer(s) and Development issue
Select: Milestone as the next official version
Select: MET-X.Y.Z Development project for development toward the next official release - Close this issue.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status