Skip to content

Conversation

practicalswift
Copy link
Contributor

@practicalswift practicalswift commented Mar 7, 2020

Add MemorySanitizer (MSan) in Travis to detect use of uninitialized memory.

First UBSan, then ASan followed by TSan... and now: yes, the wait is over -- MSan is finally here! :)

Some historical context:

What is the next step? What tools should we add to CI to keep bugs from entering master? :)

@practicalswift practicalswift force-pushed the msan-in-travis branch 3 times, most recently from 356a2d1 to 9bcded4 Compare March 7, 2020 21:08
@maflcko
Copy link
Member

maflcko commented Mar 7, 2020

We already run against valgrind and msan is a subset of valgrind according to: "MSan implements a subset of functionality found in Valgrind (Memcheck tool). It is significantly faster than Memcheck (TODO:benchmark)." (https://github.com/google/sanitizers/wiki/MemorySanitizer)

@practicalswift
Copy link
Contributor Author

practicalswift commented Mar 7, 2020

@MarcoFalke

I think we should use both. While both tools try to find uses of uninitialized memory (UUM) they are implemented in totally different ways: Valgrind's memcheck is using binary translation while MSan is using compiler-based instrumentation. From my experience it is not guaranteed that Valgrind finds all issues reported by MSan and vice versa.

Furthermore, having the combination libFuzzer+MSan readily available is really crucial when fuzzing. The combination libFuzzer+Valgrind is only usable for testing against an existing fuzzing corpus like we do in Travis -- not for doing any actual fuzzing. For the libFuzzer+MSan combination to be available it is important to have the cumbersome MSan setup process documented and also kept up-to-date and working over time. We get that for free by having an MSan build in Travis.

Also MSan is order of magnitudes faster than Valgrind which means that we can run also tests that are too slow for Valgrind in Travis.

And last but not least MSan can report the name of the variable that has not been initialized (thanks to MSan using compiler-based instrumentation). Valgrind only points to the function name.

In summary: I think having MSan in Travis is a very clear win. Let's use all available tools to try to kill the uninitialized memory (UUM) bug class :)

Let me know if you can think of any problems that could arise from testing also under MSan in Travis.

@practicalswift
Copy link
Contributor Author

practicalswift commented Mar 7, 2020

@MarcoFalke

We already run against valgrind […]

Correct if you mean the unit tests.

Incorrect if you mean the functional tests :)

Due to Valgrind's slowness we currently only run a single(!) functional test under Valgrind (TEST_RUNNER_EXTRA="p2p_segwit.py").

Note that this PR runs all available functional tests (and unit tests) under MSan.

@gmaxwell
Copy link
Contributor

gmaxwell commented Mar 8, 2020

I'd love to see a bug that msan catches that valgrind doesn't.

@practicalswift
Copy link
Contributor Author

practicalswift commented Mar 8, 2020

From my experience it is not guaranteed that Valgrind finds all issues reported by MSan and vice versa.

I'd love to see a bug that msan catches that valgrind doesn't.

Sorry about making a vague claim that I cannot back up with an example. Until I can provide an example we can simplify the discussion by assuming that I was flat out wrong about the possibility of Valgrind not catching an issue found by MSan (given enough runtime to compensate for being orders of magnitude slower than MSan).

From a practical perspective the main reason to start using MSan is the speedup that will allow us to go from detecting use of uninitialized memory in code covered by a single(!) functional test (p2p_segwit.py) to detecting use of uninitialized memory in all code covered by our functional tests.

Please note that I'm not suggesting removing the Valgrind job: I think we should use both and ideally bump the timeouts so that Valgrind can test a couple of more functional tests at least :)

(Just to be clear: I love both Valgrind (proof of love: #18166, #18162, #18159, #17633, #17640, #17624, #17455, #15117, #11035, #11024, #10977) and MSan – I use them daily :))

Let me know if you can think of any problems that could arise from testing also under MSan in Travis.

@gmaxwell
Copy link
Contributor

gmaxwell commented Mar 8, 2020

That's fine, if you ever find any-- I'd be interested in seeing them whenever that is. The question was primarily driven out of my own curiosity.

Faster can be pretty useful, though I've found msan in the past to be somewhat lacking compared to valgrind in terms of sensitivity -- though my experience is mostly from when msan was very new. I've found msan most useful in the context of fuzzing, where any speed difference makes for a meaningful throughput difference which ultimately results in a sensitivity difference.

The fact that even paid travis is so cpu starved that this is a big issue for just CI tests is something of is own problem-- labour is expensive, insight is rare. CPU time is the one resource needed by the project which should never be the limiting ingredient, this matters because a surplus of CPU time can be a force multiplier for human labour (though my idea of 'reasonable amounts of computing power' starts at about a thousand core ghz per person, and I've always found it really weird how software people who are at the top of their field often work from just a single underpowered laptop; so maybe I'm just a weirdo). FWIW, for a number of years I tested every release of bitcoin in valgrind, including doing IBDs (overnight).

I don't see any problem with making more use of msan, though it's not a complete replacement for valgrind. But more is more. :)

@practicalswift
Copy link
Contributor Author

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK. I had issues getting valgrind run on non-x86 arches, so this might help in that regard. Not sure though if we should also run this on travis, given that travis hopefully will run valgrind on everything: #18304

@practicalswift
Copy link
Contributor Author

practicalswift commented Mar 10, 2020

Not sure though if we should also run this on travis, given that travis hopefully will run valgrind on everything: #18304

Even if we assume that we get Valgrind to run all the unit and functional tests within the Travis time limit (which I unfortunately doubt is technically possible) I still don't see why we would not like to use both Valgrind and MSan in Travis? :)

As @gmaxwell put it: "more is more" :)

MSan is literally orders of magnitude faster than Valgrind and also gives better debugging info in case of presence of UUM.

Let me know if you can think of any problems that could arise from testing also under MSan in Travis.

maflcko pushed a commit that referenced this pull request Mar 11, 2020
4444edc ci: Enable all functional tests in valgrind (MarcoFalke)

Pull request description:

  The travis timeout for our repo has been bumped to 2h, so we can run all tests in valgrind now

ACKs for top commit:
  practicalswift:
    ACK 4444edc -- regarding the three disabled cases (`feature_abortnode`, `feature_block` and `rpc_bind`): not a big deal since MSan will take care of those once #18288 is merged. More is more :)

Tree-SHA512: ea2f798112911b6d1f3d88cfcdf0a7cdb698687248343703d6fe55da144542c961c15d888bffb41672c10aa76765615cb7c7ff93d468bfad3c51f962f24e7abb
@Sjors
Copy link
Member

Sjors commented Mar 12, 2020

Concept ACK. cd120cf looks sane, doesn't hurt my macOS build, but otherwise a bit above my pay-grade. The changes are contained to their own machine, except:

For the libFuzzer+MSan combination to be available it is important to have the cumbersome MSan setup process documented and also kept up-to-date and working over time. We get that for free by having an MSan build in Travis.

Keeping this up to date, by having it run on Travis, seems useful. We can always triage Travis resources if funding runs dry.

The build took 1 hours and 15 minutes the first time. You could push a trivial commit to see much of that is shaved off by the cache.

Slow Travis builds can be a burden on review, because Github won't show a build as failed until all Travis machines of the current phase are done. I often find myself reviewing PRs in EU morning with trivial bugs not spotted by US devs before they went to be.

This can be mitigated if you add a new stage grind(?) and move the machine there. (I also agree with @gmaxwell about developers not using enough tooling, guilty as charged)

It could also run as part of a cron job on master ("$TRAVIS_EVENT_TYPE" = "cron"), but I'm not sure who gets notified when those fail. And it's better to catch stuff before merge, so let's not do that unless we run short on resources.

@practicalswift
Copy link
Contributor Author

The build took 1 hours and 15 minutes the first time.

That seems to be roughly the time it takes also with caching. That might sound long but it is still 45 minutes quicker than the current valgrind job which only covers a subset of the functional tests whereas this one covers all of them :)

(To be clear: even with the two hour valgrind job I think it easily is worth taking that cost in terms of Travis build time to protect our users from at least a subset of UUM bugs. This kind of (partial) protection is long-overdue :))

@jonatack
Copy link
Member

Concept ACK

Copy link
Contributor

@fjahr fjahr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

Code looks good but I don't have extensive experience with the CI scripts so far.

@practicalswift
Copy link
Contributor Author

@fanquake

@practicalswift can you drop your boost changes here, and rebase on #18820? We'll get the macOS issues sorted out.

That has been done :) Ready for final review? :)

Copy link
Member

@fanquake fanquake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. I also wrapped up these changes into a msan dockerfile for my own use. Will check that we can use a special case list rather than patching random.cpp

@practicalswift
Copy link
Contributor Author

@MarcoFalke @fanquake Thanks a lot for great and detailed feedback!

All feedback should now be addressed:

  • Dropped --disable-asm.
  • Now building with ZeroMQ (dropped NO_ZMQ=1 and --disable-zmq).
  • Bumped to clang-9 and llvm-9.
  • No longer patching src/random.cpp with MemorySanitizer annotations.
  • Removed unneccessary bash export of LIBCXX_DIR.

The current version of this PR only touches only .travis.yml and ci/test/.

Should be ready for final review :)

@maflcko
Copy link
Member

maflcko commented Jun 24, 2020

ACK 870f0cd

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (for a follow-up)

export BDB_PREFIX="${BASE_ROOT_DIR}/db4"

export CONTAINER_NAME="ci_native_msan"
export PACKAGES="clang-9 llvm-9 cmake"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
export PACKAGES="clang-9 llvm-9 cmake"
export PACKAGES="clang-9 llvm-9 cmake" # When changing the clang version, also adjust ci/test/04_install.sh

@maflcko maflcko merged commit 7027c67 into bitcoin:master Jul 2, 2020
@practicalswift practicalswift deleted the msan-in-travis branch April 10, 2021 19:42
vijaydasmp pushed a commit to vijaydasmp/dash that referenced this pull request Oct 4, 2021
…etect use of uninitialized memory

870f0cd build: Add MemorySanitizer (MSan) in Travis to detect use of uninitialized memory (practicalswift)

Pull request description:

  Add MemorySanitizer (MSan) in Travis to detect use of uninitialized memory.

  First UBSan, then ASan followed by TSan... and now: yes, the wait is over -- **MSan is finally here!** :)

  Some historical context:
  * 2017: Continuous compilation with Clang Thread Safety analysis enabled (bitcoin#10866, bitcoin#10923)
  * 2018: Continuous testing with trapping on signed integer overflows (`-ftrapv`) (bitcoin#12686)
  * 2018: Continuous testing of use of locale dependent functions (bitcoin#13041)
  * 2018: Continuous testing of format strings (bitcoin#13705)
  * 2018: Continuous compilation with MSVC `TreatWarningAsError` (bitcoin#14151)
  * 2018: Continuous testing under UndefinedBehaviorSanitizer – UBSan (bitcoin#14252, bitcoin#14673, bitcoin#17006)
  * 2018: Continuous testing under AddressSanitizer – ASan (bitcoin#14794, bitcoin#17205, bitcoin#17674)
  * 2018: Continuous testing under ThreadSanitizer – TSan (bitcoin#14829)
  * 2019: Continuous testing in an unsigned char environment (`-funsigned-char`) (bitcoin#15134)
  * 2019: Continuous compile-time testing of assumptions we're making (bitcoin#15391)
  * 2019: Continuous testing of fuzz test cases under Valgrind (bitcoin#17633, bitcoin#18159, bitcoin#18166)
  * 2020: Finally... MemorySanitizer – MSAN! :)

  What is the next step? What tools should we add to CI to keep bugs from entering `master`? :)

ACKs for top commit:
  MarcoFalke:
    ACK 870f0cd

Tree-SHA512: 38327c8b75679d97d469fe42e704cacd1217447a5a603701dd8a58ee50b3be2c10248f8d68a479ed081c0c4b254589d3081c9183f991640b06ef689061f75578
maflcko pushed a commit to bitcoin-core/gui that referenced this pull request Mar 10, 2022
…MSAN job

3566353 ci: remove compiled-but-unused BDB from MSAN job (fanquake)

Pull request description:

  Self-compiled BDB was added to this job as opposed to using depends BDB [due to linking issues](bitcoin/bitcoin#18288 (comment)), however the compiled BDB is not actually used. Remove it for now, given we don't actually lose any coverage (note that BDB is also not currently used in the naitve MSAN fuzz job or for [OSS Fuzz](https://github.com/google/oss-fuzz/blob/master/projects/bitcoin-core/build.sh#L32) builds).

  In future, we can use depends BDB, however introducing it now will cause false positives, which can be fixed by upgrading the versions of Clang / LLVM we use, however upgrading to those newer versions causes other issues, which appear in standard library code, and require more involved suppressing, which can be solved in a follow up or another PR i.e #23008.

Top commit has no ACKs.

Tree-SHA512: 9e8fdd95246cafa27cda7bcf0641b428d4573f6748ecdf07cc6205a64351db22ba383ec943e88a69df3694ccb9f125d994b64345a4e44fb6fea4a014504760d1
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants