Merge 2.8 #11604
Conversation
This reverts commit f122f4b, re-enabling the tests so that they can be fixed properly.
The raft workers needed some extra steps to get them running: the cluster (of one machine) needed bootstrapping and the raft transport needs to know its own address before the raft worker can start. This is normally published by the peergrouper, but the peergrouper was failing because mongo isn't running with --replSet in these tests. I initially tried turning it on and getting the replicaset configured but for reasons I couldn't work out the config changes wouldn't stick. Instead we get hold of the hub and publish the test address directly. (It's likely that the code grabbing the hub is racy - I couldn't check because of the unsafe pointer conversion error when running the race checker.) There's one test remaining that relies on claiming a singular lease in state before starting the agent - I'll fix that next.
This was relying on being able to get a singular claimer from state. Unfortunately there's no easy way to get the equivalent claimer from the lease manager from outside the dependency engine - the API server gets it as a dependency through the manifolds. Instead inject a snapshot into the raft store in the same way as we do when migrating from legacy leases in upgrades (but simpler because it doesn't need to go through the raft logs to find config and current term and index).
And remove MachineLegacyLeasesSuite.
Having the other side of the websocket go away isn't an error. We now check to see if the error is a close error type, and if it is, just log at trace. Only log an error if there is some other error type. Also had the pubsub handler use a more detailed logger for writing errors so they can be filtered more easily.
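The commit above describes the check in prose; below is a minimal, self-contained sketch of the same idea using gorilla/websocket (the websocket library juju uses). The function name and plain log calls are illustrative stand-ins, not juju's actual logging API.

```go
package main

import (
	"log"

	"github.com/gorilla/websocket"
)

// logSocketError logs expected peer disconnects at trace level and
// anything else as an error, mirroring the behaviour described above.
func logSocketError(err error) {
	if err == nil {
		return
	}
	// Close frames sent when the other side goes away are expected and
	// not worth an ERROR-level line.
	if websocket.IsCloseError(err,
		websocket.CloseNormalClosure,
		websocket.CloseGoingAway,
		websocket.CloseNoStatusReceived,
	) {
		log.Printf("TRACE websocket closed: %v", err)
		return
	}
	log.Printf("ERROR websocket error: %v", err)
}

func main() {
	// Simulate the peer going away: gorilla returns a *CloseError from
	// ReadMessage when the other end sends a close frame.
	logSocketError(&websocket.CloseError{Code: websocket.CloseGoingAway, Text: "peer shutting down"})
}
```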
… one remote state for remote-init;
juju#11585

Having the other side of the websocket go away isn't an error. We now check to see if the error is a close error type, and if it is, just log at trace. Only log an error if there is some other error type. Also had the pubsub handler use a more detailed logger for writing errors so they can be filtered more easily.

## QA steps

```sh
# bootstrap and enable-ha
juju switch controller
juju model-config logging-config=juju=WARN:juju.apiserver.pubsub=TRACE
# in one window
juju debug-log
# in another window
juju ssh 2 sudo service jujud-machine-2 stop
```

Observe in the log that we get a "websocket closed" log line, not an error.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1759372
These were all changed to `if false` in a prior commit, now remove them fully.
Some generic features are still used for interacting with state/raftlease. The state/lease package is kept because it is used by upgrade code. Remove the code in globalclockupdater that allows it to take either a state or a lease manager as input - now it always requires a lease manager (and a raft).
The only use of this was for the legacy-leases-flag worker in the machine agent.
This means we won't create it in the db. The leases collection is still used in upgrade steps. Updated MigrateLeasesToGlobalTime to use a jujutxn.Runner directly, since st.db().Run() uses a multimodel runner that looks in allCollections to find out whether a given collection is global.
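For illustration, here is a hedged sketch of running a transaction against the leases collection with a runner built directly on the database handle, so no model-aware collection lookup is involved. The real change uses a jujutxn.Runner and juju vendors its own mgo fork; the upstream gopkg.in/mgo.v2 imports, the "txns" collection name, and the `migrated` field below are assumptions made for the sketch.

```go
package main

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
	"gopkg.in/mgo.v2/txn"
)

// migrateLeaseDoc is a stand-in for an upgrade-step helper: it runs a
// transaction against the leases collection using a runner created from
// the database itself, bypassing any multimodel wrapper.
func migrateLeaseDoc(db *mgo.Database, id string) error {
	runner := txn.NewRunner(db.C("txns"))
	ops := []txn.Op{{
		C:      "leases",
		Id:     id,
		Assert: txn.DocExists,
		Update: bson.M{"$set": bson.M{"migrated": true}},
	}}
	// An empty id lets the runner allocate a transaction id itself.
	return runner.Run(ops, "", nil)
}

func main() {
	session, err := mgo.Dial("localhost")
	if err != nil {
		panic(err)
	}
	defer session.Close()
	if err := migrateLeaseDoc(session.DB("juju"), "some-lease"); err != nil {
		panic(err)
	}
}
```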
juju#11583

### Checklist

- [ ] ~Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?~
- [ ] ~Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?~
- [ ] ~Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?~
- [x] Do comments answer the question of why design decisions were made?

----

## Description of change

In the `upgrade-charm` process, the pod will be re-created if the k8s spec changed. Because the container running status is refreshed when the operator gets a watcher event triggered by state, this might happen before the container is recreated. Thus, the remote-init process might be execing into the newly created container using the running status of the previously terminated container.

In this PR, instead of using the cached running status in the uniter watcher's remote state, we fetch the current running status before starting the remote-init process. We also retry the remote-init process if we get a container-not-running error.

## QA steps

```console
## deploy mariadb;
$ juju deploy /tmp/charm-builds/mariadb-k8s/ --debug --resource mysql_image=mariadb --storage database=100M
$ juju scale-application mariadb-k8s -m t1 5

## change the k8s spec, then re-build the charm and run upgrade-charm;
$ juju upgrade-charm mariadb-k8s --path /tmp/charm-builds/mariadb-k8s/ --debug --resource mysql_image=mariadb --storage database=100M

## watch the status of the pods to see that they are rolling-upgraded peacefully.
$ mkubectl get all -o wide
```

## Documentation changes

No

## Bug reference

https://bugs.launchpad.net/juju/+bug/1877935
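A rough sketch of the fetch-then-retry flow described above, in plain Go; the helper names (`fetchRunningStatus`, `runRemoteInit`, `errContainerNotRunning`) are placeholders, not the uniter's real API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errContainerNotRunning is a stand-in for the "container not running"
// error that the PR retries on.
var errContainerNotRunning = errors.New("container not running")

// Placeholder helpers: in the real code these query the cloud for the
// current container status and exec into the workload container.
func fetchRunningStatus(unit string) (bool, error) { return true, nil }
func runRemoteInit(unit string) error              { return nil }

// remoteInitWithRetry fetches the current running status (rather than a
// cached one) and retries remote-init while the container isn't ready yet.
func remoteInitWithRetry(unit string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		running, statusErr := fetchRunningStatus(unit)
		if statusErr != nil {
			return statusErr
		}
		if running {
			if err = runRemoteInit(unit); err == nil {
				return nil
			}
		} else {
			err = errContainerNotRunning
		}
		if !errors.Is(err, errContainerNotRunning) {
			return err
		}
		time.Sleep(time.Second) // back off before re-checking
	}
	return fmt.Errorf("remote-init failed after %d attempts: %w", attempts, err)
}

func main() {
	fmt.Println(remoteInitWithRetry("mariadb-k8s/0", 3))
}
```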
juju#11582 Under load, or on slow instances, the testing.ShortWait time (50ms) is not sufficient for the dependent code to reach the required state of waiting on the clock. There is no real impact from changing testing.ShortWait to testing.LongWait: the WaitAdvance code still checks every 10ms for the appropriate number of waiters. This backports a fix that landed in develop by mistake.
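As a concrete illustration of why the longer wait is harmless, here is a hedged test sketch using github.com/juju/clock/testclock (which provides WaitAdvance); the worker goroutine, package name, and the `longWait` constant are illustrative stand-ins for testing.ShortWait/LongWait.

```go
// This would live in a *_test.go file.
package example

import (
	"testing"
	"time"

	"github.com/juju/clock/testclock"
)

const longWait = 10 * time.Second // stand-in for testing.LongWait

func TestTimerFires(t *testing.T) {
	clk := testclock.NewClock(time.Time{})

	done := make(chan struct{})
	go func() {
		// Code under test: blocks on the clock until a minute "passes".
		<-clk.After(time.Minute)
		close(done)
	}()

	// WaitAdvance keeps checking (every 10ms) until one waiter is
	// registered on the clock, then advances it. Giving it longWait
	// rather than a 50ms ShortWait only matters on a loaded machine;
	// the happy path returns as soon as the waiter shows up.
	if err := clk.WaitAdvance(time.Minute, longWait, 1); err != nil {
		t.Fatal(err)
	}

	select {
	case <-done:
	case <-time.After(longWait):
		t.Fatal("timed out waiting for timer to fire")
	}
}
```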
…tests

juju#11573

### Checklist

- [ ] Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?
- [ ] Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?
- [ ] Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?
- [ ] Do comments answer the question of why design decisions were made?

----

## Description of change

This PR makes a number of changes to complete the removal of legacy leases that was begun in juju#11543. It's probably easiest to review in separate commits.

### Fix the skipped machine agent tests

These were in `MachineLegacyLeasesSuite` because they depended on leases working and there were a few prerequisites preventing the raft workers from operating. They were fixed by bootstrapping the (one machine) raft cluster so that it knew the machine had a vote, and publishing a test address on the central hub so that the raft-transport worker had the current address (which was preventing the raft worker from starting). The latter would normally be done by the peergrouper, but that doesn't run properly in the tests because the replicaset isn't configured, and I couldn't work out how to do it in a way that would stick. The tests have all been moved back to `MachineSuite` and `MachineLegacyLeasesSuite` is gone.

### Remove the `if false` blocks from juju#11543

Including removing anything that's not used anymore because of that. In particular, the state leadership and singular lease managers are gone, which only leaves the `txnlog` worker. The `state/lease` package is still kept around because it's used from upgrades (for migrating leases to a new time format).

### Remove the state input from globalclockupdater

Now that we only run one of them in an agent it doesn't need to be able to switch between taking a state or a lease manager as input.

### Remove the featureflag worker

It's not specific to legacy leases, but the legacy-leases-flag was the only use of it in the codebase. Hopefully when it's needed next time someone remembers "oh yeah, we used to have something like that" and resurrects it rather than coding it again (although that's better than having to maintain it just in case).

## QA steps

Same as for juju#11543:

```sh
juju bootstrap lxd f --build-agent
# Deploy an application with three units.
juju deploy ~jameinel/ubuntu-lite -n 3
# Remove the leader and see that the leadership is claimed by one of the others.
juju remove-unit ubuntu-lite/0
```

In the database check that the leases collection is empty and the leaseholders collection (filled by the raft lease store) is populated:

```javascript
db.leases.find()
db.leaseholders.find()
```

## Documentation changes

None

## Bug reference

None
This sometimes fails because the raft workers take too long to get started; use the longer wait that was added to fix this problem.
…nt-machine-test

juju#11591

## Description of change

Now that the agent that gets run in the machine agent tests uses raft leases, the raft workers sometimes take too long to get started and ready, which made `TestJobManageModelRunsMinUnitsWorker` sometimes fail. Use the longer timeout (double the LongWait) that other tests in the suite already use.

## QA steps

Running the test under stress doesn't fail after a lot of runs.

## Documentation changes

None

## Bug reference

None
This allows tests to change the restart time so older wall-clock-style tests don't have to wait three seconds for workers to restart. In particular some of the MachineSuite tests are impacted by this, as the api caller worker is restarted when the addresses change, which is signalled by another worker. Without this patch, the TestManageModelRunsCleaner test ran in approximately six or ten seconds, or timed out about one run in 12. With this patch the test is a consistent six seconds, which is still too long, but at least consistent.
juju#11592

This allows tests to change the restart time so older wall-clock-style tests don't have to wait three seconds for workers to restart. In particular some of the MachineSuite tests are impacted by this, as the api caller worker is restarted when the addresses change, which is signalled by another worker. Without this patch, the TestManageModelRunsCleaner test ran in approximately six or ten seconds, or timed out about one run in 12. With this patch the test is a consistent six seconds, which is still too long, but at least consistent.

## QA steps

```sh
cd cmd/jujud/agent
stress -check.v -check.f TestManageModelRunsCleaner
```

## Documentation changes

Test change only.
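A minimal sketch of the pattern this PR describes; `EngineErrorDelay` and the `agent` package name are made-up stand-ins for the restart delay the change exposes, not juju's actual symbols.

```go
// Package agent is illustrative only.
package agent

import "time"

// EngineErrorDelay is how long the dependency engine waits before
// restarting a worker that exited with an error. Exposing it as a package
// variable (rather than a constant buried in the engine config) lets tests
// shrink it and avoid multi-second waits after every induced restart.
var EngineErrorDelay = 3 * time.Second
```

A test can then shorten it for the duration of a run, for example with the PatchValue helper from github.com/juju/testing, which restores the original value at teardown: `s.PatchValue(&agent.EngineErrorDelay, 10*time.Millisecond)`.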
… new units initialization;
- Introduces a new command, juju model, that will be triggered as a k8s deployment by juju for each model created.
- Moves the caas admission logic to this new command.
- Adds passwords for the operator to the model doc state.
- New http server worker for simple agents.
juju#11507

### Checklist

- [x] Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?
- [x] Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?
- [x] Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?
- [x] Do comments answer the question of why design decisions were made?

----

## Description of change

Deploys the jujud model command into a caas env.

## QA steps

```sh
juju bootstrap microk8s
juju add-model test
kubectl get deploy -n test
```
juju#11595

### Checklist

- [ ] ~Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?~
- [ ] ~Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?~
- [ ] ~Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?~
- [x] Do comments answer the question of why design decisions were made?

----

## Description of change

Fixes:

1. Uniter should not block the `containerStartChan` in the operator;
2. Change to use a shared exec client for uniters;
3. The CAAS uniter can now restart itself correctly.

## QA steps

```console
$ juju deploy /tmp/charm-builds/mariadb-k8s/ --debug --resource mysql_image=mariadb --storage database=100M

## exec into the operator to enable the wrench
$ mkubectl exec pod/mariadb-k8s-operator-0 -n t1 -it -- bash

## enable the wrench
$ mkdir wrench && echo "fatal-error" > wrench/remote-init

## scale the application to 2 to create a new unit
$ juju scale-application mariadb-k8s -m t1 2

## from the log we can see the uniter tries to restart itself because the wrench returns an error in the remote-init process
$ juju debug-log -m k1:t1 --color --replay --level WARNING

## remove the wrench in the operator pod;
$ rm wrench/remote-init

## now we can see the uniter restarts successfully;
```

## Documentation changes

No

## Bug reference

https://bugs.launchpad.net/juju/+bug/1878329
Shim out the concrete *state.StatePool. The unsafe pointer conversion causes Go 1.14's checkptr to raise panics in the race test suite.
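The shim mentioned here is essentially an interface boundary. A hedged sketch of the pattern (with made-up names, not juju's multiwatcher types): instead of unsafe-casting a concrete pool into a test double, which Go 1.14's checkptr flags under -race, the code under test depends on a narrow interface that both the real pool and a fake satisfy.

```go
package multiwatcher

// pool is the slice of *state.StatePool behaviour the watcher needs;
// the method here is a stand-in for illustration.
type pool interface {
	SystemUUID() string
}

type watcher struct {
	pool pool
}

// newWatcher accepts the interface, so tests can pass a lightweight fake
// without any pointer conversion at all.
func newWatcher(p pool) *watcher {
	return &watcher{pool: p}
}

// fakePool would live in the test code in practice.
type fakePool struct{ uuid string }

func (f fakePool) SystemUUID() string { return f.uuid }
```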
juju#11597 Backports juju#11596 from develop, which fixes unsafe pointer conversion used in multi-watcher tests.
… remove the manual expire api. Also, introduce a RevokeLeases() api to allow a lease to be manually revoked.
Extract the lease upgrade related logic into its own file so all the remaining legacy lease stuff is kept together.
… of a unit when the unit goes to dead.
juju#11598 Forward port of juju#11585.
juju#11593

## Description of change

The highlight is that when a unit transitions to dead, its leadership lease is revoked. There are 4 main bits of work:

1. Update the leadership worker - we only use a raft-based lease store now, which auto-expires leases, so remove the manual expire api. Add a revoke api.
2. Update the raft fsm to support revoking leases.
3. Remove legacy lease code in state and consolidate the remaining legacy code needed for upgrades (move code from updates.go to leaseupdates.go).
4. Update the api facades to remove the unit leadership when the unit goes to dead.

## QA steps

- Deploy 3 units
- Watch juju status
- Remove the leader unit
- See that juju status shows the transition to a new leader as soon as the deleted leader is gone

## Bug reference

https://bugs.launchpad.net/bugs/1469731
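As a sketch of the revoke shape described in point 1 above, hedged because the interfaces below are illustrative rather than juju's exact core/lease API:

```go
package lease

import "time"

// Claimer is the claim-side of the store, shown for context; the method
// set here is illustrative.
type Claimer interface {
	Claim(leaseName, holderName string, duration time.Duration) error
}

// Revoker releases a lease ahead of its natural expiry - for example when
// the holding unit transitions to dead - so a new leader can be elected
// immediately instead of waiting for the lease to time out.
type Revoker interface {
	Revoke(leaseName, holderName string) error
}
```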
juju#11594

## Description of change

- Fix deferreturn panic on go1.14 on arm64 and ppc64 by disabling optimizations
- Fix unaligned ptrs by moving to go.etcd.io/bbolt

## QA steps

- Run unit tests
- Build for ppc64 and arm64 (juju bins will be quite large)
- Deploy a HA controller (+ anything else you can think to test raft)

## Documentation changes

N/A

## Bug reference

https://golang.org/issue/39049
etcd-io/bbolt#201
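For reference, a minimal hedged sketch of opening a database with go.etcd.io/bbolt, the fork this PR moves to; the file path, bucket name, and options are illustrative, not the raft store's real configuration.

```go
package main

import (
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// bbolt is the maintained fork of boltdb/bolt; switching to it picks
	// up the fixes for the unaligned-pointer problems mentioned above.
	db, err := bolt.Open("/tmp/raft-example.db", 0600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Sanity check: create a bucket and write a key.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("example"))
		if err != nil {
			return err
		}
		return b.Put([]byte("hello"), []byte("world"))
	})
	if err != nil {
		log.Fatal(err)
	}
}
```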
- Changed the name used on the model operator kubernetes deployment
juju#11599

### Checklist

- [x] Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?
- [x] Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?
- [x] Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?
- [x] Do comments answer the question of why design decisions were made?

----

## Description of change

Changed the name used on the model operator kubernetes deployment.

## QA steps

```sh
cd tests && ./main.sh caasadmission
```
juju#11600

## Description of change

Merge 2.8-rc branch with these commits:

juju#11585 don't write errors when pubsub socket disconnects
juju#11583 k8s pod initialisation races
juju#11582 increase max wait time for WaitAdvance
juju#11573 finish legacy lease removal
juju#11591 increase timeout in machine agent tests
juju#11592 expose engine worker restart time as package variable
juju#11507 add k8s model agent pod
juju#11595 fix caas uniter restart on error
juju#11593 add revoke leases so dead units lose their lease immediately
juju#11594 fix deferreturn and bolt
juju#11599 update deployment name used for model op

## QA steps

See individual PRs.
!!build!!

hpidcock approved these changes on May 19, 2020
Description
Merge 2.8 branch with these commits:
#11585 don't write errors when pubsub socket disconnects
#11583 k8s pod initialisation races
#11582 increase max wait time for WaitAdvance
#11573 finish legacy lease removal
#11591 increase timeout in machine agent tests
#11592 expose engine worker restart time as package variable
#11507 add k8s model agent pod
#11595 fix caas uniter restart on error
#11593 add revoke leases so dead units lose their lease immediately
#11594 fix deferreturn and bolt
#11599 update deployment name used for model op
QA
See individual PRs.