Merge 2.8 #11604
Conversation
This reverts commit f122f4b, re-enabling the tests so that they can be fixed properly.
The raft workers needed some extra steps to get them running: the cluster (of one machine) needed bootstrapping and the raft transport needs to know its own address before the raft worker can start. This is normally published by the peergrouper, but the peergrouper was failing because mongo isn't running with --replSet in these tests. I initially tried turning it on and getting the replicaset configured but for reasons I couldn't work out the config changes wouldn't stick. Instead we get hold of the hub and publish the test address directly. (It's likely that the code grabbing the hub is racy - I couldn't check because of the unsafe pointer conversion error when running the race checker.) There's one test remaining that relies on claiming a singular lease in state before starting the agent - I'll fix that next.
This was relying on being able to get a singular claimer from state. Unfortunately there's no easy way to get the equivalent claimer from the lease manager from outside the dependency engine - the API server gets it as a dependency through the manifolds. Instead inject a snapshot into the raft store in the same way as we do when migrating from legacy leases in upgrades (but simpler because it doesn't need to go through the raft logs to find config and current term and index).
And remove MachineLegacyLeasesSuite.
Having the other side of the websocket go away isn't an error. We now check to see if the error is a close error type, and if it is, just log at trace. Only log an error if there is some other error type. Also had the pubsub handler use a more detailed logger for writing errors so they can be filtered more easily.
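The commit above describes the check in prose; below is a minimal, self-contained sketch of the same idea using gorilla/websocket (the websocket library juju uses). The function name and plain log calls are illustrative stand-ins, not juju's actual logging API.

```go
package main

import (
	"log"

	"github.com/gorilla/websocket"
)

// logSocketError logs expected peer disconnects at trace level and
// anything else as an error, mirroring the behaviour described above.
func logSocketError(err error) {
	if err == nil {
		return
	}
	// Close frames sent when the other side goes away are expected and
	// not worth an ERROR-level line.
	if websocket.IsCloseError(err,
		websocket.CloseNormalClosure,
		websocket.CloseGoingAway,
		websocket.CloseNoStatusReceived,
	) {
		log.Printf("TRACE websocket closed: %v", err)
		return
	}
	log.Printf("ERROR websocket error: %v", err)
}

func main() {
	// Simulate the peer going away: gorilla returns a *CloseError from
	// ReadMessage when the other end sends a close frame.
	logSocketError(&websocket.CloseError{Code: websocket.CloseGoingAway, Text: "peer shutting down"})
}
```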
… one remote state for remote-init;
juju#11585

Having the other side of the websocket go away isn't an error. We now check to see if the error is a close error type, and if it is, just log at trace. Only log an error if there is some other error type. Also had the pubsub handler use a more detailed logger for writing errors so they can be filtered more easily.

## QA steps

```sh
# bootstrap and enable-ha
juju switch controller
juju model-config logging-config=juju=WARN:juju.apiserver.pubsub=TRACE
# in one window
juju debug-log
# in another window
juju ssh 2 sudo service jujud-machine-2 stop
```

Observe in the log that we get a "websocket closed" log line, not an error.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1759372
These were all changed to `if false` in a prior commit, now remove them fully.
Some generic features are still used for interacting with state/raftlease. The state/lease package is kept because it is used by upgrade code. Remove the code in globalclockupdater that allows it to take either a state or a lease manager as input - now it always requires a lease manager (and a raft).
The only use of this was for the legacy-leases-flag worker in the machine agent.
This means we won't create it in the db. The leases collection is still used in upgrade steps. Updated MigrateLeasesToGlobalTime to use a jujutxn.Runner directly, since st.db().Run() uses a multimodel runner that looks in allCollections to find out whether a given collection is global.
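For illustration, here is a hedged sketch of running a transaction against the leases collection with a runner built directly on the database handle, so no model-aware collection lookup is involved. The real change uses a jujutxn.Runner and juju vendors its own mgo fork; the upstream gopkg.in/mgo.v2 imports, the "txns" collection name, and the `migrated` field below are assumptions made for the sketch.

```go
package main

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
	"gopkg.in/mgo.v2/txn"
)

// migrateLeaseDoc is a stand-in for an upgrade-step helper: it runs a
// transaction against the leases collection using a runner created from
// the database itself, bypassing any multimodel wrapper.
func migrateLeaseDoc(db *mgo.Database, id string) error {
	runner := txn.NewRunner(db.C("txns"))
	ops := []txn.Op{{
		C:      "leases",
		Id:     id,
		Assert: txn.DocExists,
		Update: bson.M{"$set": bson.M{"migrated": true}},
	}}
	// An empty id lets the runner allocate a transaction id itself.
	return runner.Run(ops, "", nil)
}

func main() {
	session, err := mgo.Dial("localhost")
	if err != nil {
		panic(err)
	}
	defer session.Close()
	if err := migrateLeaseDoc(session.DB("juju"), "some-lease"); err != nil {
		panic(err)
	}
}
```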
juju#11583

### Checklist

- [ ] ~Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?~
- [ ] ~Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?~
- [ ] ~Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?~
- [x] Do comments answer the question of why design decisions were made?

----

## Description of change

In the `upgrade-charm` process, the pod will be re-created if the k8s spec changed. Because the container running status is refreshed when the operator gets a watcher event triggered by state, this might happen before the container is recreated. Thus, the remote-init process might be execing into the newly created container using the running status of the previously terminated container.

In this PR, instead of using the cached running status in the uniter watcher's remote state, we fetch the current running status before starting the remote-init process. We also retry the remote-init process if we get a container-not-running error.

## QA steps

```console
## deploy mariadb;
$ juju deploy /tmp/charm-builds/mariadb-k8s/ --debug --resource mysql_image=mariadb --storage database=100M
$ juju scale-application mariadb-k8s -m t1 5

## change the k8s spec, then re-build the charm and run upgrade-charm;
$ juju upgrade-charm mariadb-k8s --path /tmp/charm-builds/mariadb-k8s/ --debug --resource mysql_image=mariadb --storage database=100M

## watch the status of the pods to see that they are rolling-upgraded peacefully.
$ mkubectl get all -o wide
```

## Documentation changes

No

## Bug reference

https://bugs.launchpad.net/juju/+bug/1877935
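A rough sketch of the fetch-then-retry flow described above, in plain Go; the helper names (`fetchRunningStatus`, `runRemoteInit`, `errContainerNotRunning`) are placeholders, not the uniter's real API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errContainerNotRunning is a stand-in for the "container not running"
// error that the PR retries on.
var errContainerNotRunning = errors.New("container not running")

// Placeholder helpers: in the real code these query the cloud for the
// current container status and exec into the workload container.
func fetchRunningStatus(unit string) (bool, error) { return true, nil }
func runRemoteInit(unit string) error              { return nil }

// remoteInitWithRetry fetches the current running status (rather than a
// cached one) and retries remote-init while the container isn't ready yet.
func remoteInitWithRetry(unit string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		running, statusErr := fetchRunningStatus(unit)
		if statusErr != nil {
			return statusErr
		}
		if running {
			if err = runRemoteInit(unit); err == nil {
				return nil
			}
		} else {
			err = errContainerNotRunning
		}
		if !errors.Is(err, errContainerNotRunning) {
			return err
		}
		time.Sleep(time.Second) // back off before re-checking
	}
	return fmt.Errorf("remote-init failed after %d attempts: %w", attempts, err)
}

func main() {
	fmt.Println(remoteInitWithRetry("mariadb-k8s/0", 3))
}
```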
juju#11582 Under load, or on slow instances, the testing.ShortWait time (50ms) is not sufficient for the dependent code to reach the required state of waiting on the clock. There is no real impact from changing testing.ShortWait to testing.LongWait: the WaitAdvance code still checks every 10ms for the appropriate number of waiters. This backports a fix that landed in develop by mistake.
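As a concrete illustration of why the longer wait is harmless, here is a hedged test sketch using github.com/juju/clock/testclock (which provides WaitAdvance); the worker goroutine, package name, and the `longWait` constant are illustrative stand-ins for testing.ShortWait/LongWait.

```go
// This would live in a *_test.go file.
package example

import (
	"testing"
	"time"

	"github.com/juju/clock/testclock"
)

const longWait = 10 * time.Second // stand-in for testing.LongWait

func TestTimerFires(t *testing.T) {
	clk := testclock.NewClock(time.Time{})

	done := make(chan struct{})
	go func() {
		// Code under test: blocks on the clock until a minute "passes".
		<-clk.After(time.Minute)
		close(done)
	}()

	// WaitAdvance keeps checking (every 10ms) until one waiter is
	// registered on the clock, then advances it. Giving it longWait
	// rather than a 50ms ShortWait only matters on a loaded machine;
	// the happy path returns as soon as the waiter shows up.
	if err := clk.WaitAdvance(time.Minute, longWait, 1); err != nil {
		t.Fatal(err)
	}

	select {
	case <-done:
	case <-time.After(longWait):
		t.Fatal("timed out waiting for timer to fire")
	}
}
```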
…tests

juju#11573

### Checklist

- [ ] Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?
- [ ] Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?
- [ ] Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?
- [ ] Do comments answer the question of why design decisions were made?

----

## Description of change

This PR makes a number of changes to complete the removal of legacy leases that was begun in juju#11543. It's probably easiest to review in separate commits.

### Fix the skipped machine agent tests

These were in `MachineLegacyLeasesSuite` because they depended on leases working and there were a few prerequisites preventing the raft workers from operating. They were fixed by bootstrapping the (one machine) raft cluster so that it knew the machine had a vote, and publishing a test address on the central hub so that the raft-transport worker had the current address (which was preventing the raft worker from starting). The latter would normally be done by the peergrouper, but that doesn't run properly in the tests because the replicaset isn't configured, and I couldn't work out how to do it in a way that would stick. The tests have all been moved back to `MachineSuite` and `MachineLegacyLeasesSuite` is gone.

### Remove the `if false` blocks from juju#11543

Including removing anything that's not used anymore because of that. In particular, the state leadership and singular lease managers are gone, which only leaves the `txnlog` worker. The `state/lease` package is still kept around because it's used from upgrades (for migrating leases to a new time format).

### Remove the state input from globalclockupdater

Now that we only run one of them in an agent it doesn't need to be able to switch between taking a state or a lease manager as input.

### Remove the featureflag worker

It's not specific to legacy leases, but the legacy-leases-flag was the only use of it in the codebase. Hopefully when it's needed next time someone remembers "oh yeah, we used to have something like that" and resurrects it rather than coding it again (although that's better than having to maintain it just in case).

## QA steps

Same as for juju#11543:

```sh
juju bootstrap lxd f --build-agent
# Deploy an application with three units.
juju deploy ~jameinel/ubuntu-lite -n 3
# Remove the leader and see that the leadership is claimed by one of the others.
juju remove-unit ubuntu-lite/0
```

In the database check that the leases collection is empty and the leaseholders collection (filled by the raft lease store) is populated:

```javascript
db.leases.find()
db.leaseholders.find()
```

## Documentation changes

None

## Bug reference

None
This sometimes fails because the raft workers take too long to get started; use the longer wait that was added to fix this problem.
…nt-machine-test

juju#11591

## Description of change

Now that the agent that gets run in the machine agent tests uses raft leases, the raft workers sometimes take too long to get started and ready, which made `TestJobManageModelRunsMinUnitsWorker` sometimes fail. Use the longer timeout (double the LongWait) that other tests in the suite already use.

## QA steps

Running the test under stress doesn't fail after a lot of runs.

## Documentation changes

None

## Bug reference

None
This allows tests to change the restart time so older wall-clock-style tests don't have to wait three seconds for workers to restart. In particular some of the MachineSuite tests are impacted by this, as the api caller worker is restarted when the addresses change, which is signalled by another worker. Without this patch, the TestManageModelRunsCleaner test ran in approximately six or ten seconds, or timed out about one run in 12. With this patch the test is a consistent six seconds, which is still too long, but at least consistent.
juju#11592

This allows tests to change the restart time so older wall-clock-style tests don't have to wait three seconds for workers to restart. In particular some of the MachineSuite tests are impacted by this, as the api caller worker is restarted when the addresses change, which is signalled by another worker. Without this patch, the TestManageModelRunsCleaner test ran in approximately six or ten seconds, or timed out about one run in 12. With this patch the test is a consistent six seconds, which is still too long, but at least consistent.

## QA steps

```sh
cd cmd/jujud/agent
stress -check.v -check.f TestManageModelRunsCleaner
```

## Documentation changes

Test change only.
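A minimal sketch of the pattern this PR describes; `EngineErrorDelay` and the `agent` package name are made-up stand-ins for the restart delay the change exposes, not juju's actual symbols.

```go
// Package agent is illustrative only.
package agent

import "time"

// EngineErrorDelay is how long the dependency engine waits before
// restarting a worker that exited with an error. Exposing it as a package
// variable (rather than a constant buried in the engine config) lets tests
// shrink it and avoid multi-second waits after every induced restart.
var EngineErrorDelay = 3 * time.Second
```

A test can then shorten it for the duration of a run, for example with the PatchValue helper from github.com/juju/testing, which restores the original value at teardown: `s.PatchValue(&agent.EngineErrorDelay, 10*time.Millisecond)`.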
… new units initialization;
- Introduces a new command, juju model, that will be triggered as a k8s deployment by juju for each model created.
- Moves the caas admission logic to this new command.
- Adds passwords for the operator to the model doc state.
- New http server worker for simple agents.
juju#11507

### Checklist

- [x] Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?
- [x] Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?
- [x] Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?
- [x] Do comments answer the question of why design decisions were made?

----

## Description of change

Deploys the jujud model command into a caas env.

## QA steps

```sh
juju bootstrap microk8s
juju add-model test
kubectl get deploy -n test
```
juju#11595

### Checklist

- [ ] ~Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?~
- [ ] ~Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?~
- [ ] ~Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?~
- [x] Do comments answer the question of why design decisions were made?

----

## Description of change

Fixes:

1. Uniter should not block the `containerStartChan` in the operator;
2. Change to use a shared exec client for uniters;
3. The CAAS uniter can now restart itself correctly.

## QA steps

```console
$ juju deploy /tmp/charm-builds/mariadb-k8s/ --debug --resource mysql_image=mariadb --storage database=100M

## exec into the operator to enable the wrench
$ mkubectl exec pod/mariadb-k8s-operator-0 -n t1 -it -- bash

## enable the wrench
$ mkdir wrench && echo "fatal-error" > wrench/remote-init

## scale the application to 2 to create a new unit
$ juju scale-application mariadb-k8s -m t1 2

## from the log we can see the uniter tries to restart itself because the wrench returns an error in the remote-init process
$ juju debug-log -m k1:t1 --color --replay --level WARNING

## remove the wrench in the operator pod;
$ rm wrench/remote-init

## now we can see the uniter restarts successfully;
```

## Documentation changes

No

## Bug reference

https://bugs.launchpad.net/juju/+bug/1878329
Shim out the concrete *state.StatePool. The unsafe pointer conversion causes Go 1.14's checkptr to raise panics in the race test suite.
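The shim mentioned here is essentially an interface boundary. A hedged sketch of the pattern (with made-up names, not juju's multiwatcher types): instead of unsafe-casting a concrete pool into a test double, which Go 1.14's checkptr flags under -race, the code under test depends on a narrow interface that both the real pool and a fake satisfy.

```go
package multiwatcher

// pool is the slice of *state.StatePool behaviour the watcher needs;
// the method here is a stand-in for illustration.
type pool interface {
	SystemUUID() string
}

type watcher struct {
	pool pool
}

// newWatcher accepts the interface, so tests can pass a lightweight fake
// without any pointer conversion at all.
func newWatcher(p pool) *watcher {
	return &watcher{pool: p}
}

// fakePool would live in the test code in practice.
type fakePool struct{ uuid string }

func (f fakePool) SystemUUID() string { return f.uuid }
```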
juju#11597 Backports juju#11596 from develop, which fixes unsafe pointer conversion used in multi-watcher tests.
… remove the manual expire api. Also, introduce a RevokeLeases() api to allow a lease to be manually revoked.
Extract the lease upgrade related logic into its own file so all the remaining legacy lease stuff is kept together.
… of a unit when the unit goes to dead.
juju#11598 Forward port of juju#11585.
juju#11593

## Description of change

The highlight is that when a unit transitions to dead, its leadership lease is revoked. There are 4 main bits of work:

1. Update the leadership worker - we only use a raft-based lease store now, which auto-expires leases, so remove the manual expire api. Add a revoke api.
2. Update the raft fsm to support revoking leases.
3. Remove legacy lease code in state and consolidate the remaining legacy code needed for upgrades (move code from updates.go to leaseupdates.go).
4. Update the api facades to remove the unit leadership when the unit goes to dead.

## QA steps

- Deploy 3 units
- Watch juju status
- Remove the leader unit
- See that juju status shows the transition to a new leader as soon as the deleted leader is gone

## Bug reference

https://bugs.launchpad.net/bugs/1469731
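As a sketch of the revoke shape described in point 1 above, hedged because the interfaces below are illustrative rather than juju's exact core/lease API:

```go
package lease

import "time"

// Claimer is the claim-side of the store, shown for context; the method
// set here is illustrative.
type Claimer interface {
	Claim(leaseName, holderName string, duration time.Duration) error
}

// Revoker releases a lease ahead of its natural expiry - for example when
// the holding unit transitions to dead - so a new leader can be elected
// immediately instead of waiting for the lease to time out.
type Revoker interface {
	Revoke(leaseName, holderName string) error
}
```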
juju#11594

## Description of change

- Fix deferreturn panic on go1.14 on arm64 and ppc64 by disabling optimizations
- Fix unaligned ptrs by moving to go.etcd.io/bbolt

## QA steps

- Run unit tests
- Build for ppc64 and arm64 (juju bins will be quite large)
- Deploy a HA controller (+ anything else you can think to test raft)

## Documentation changes

N/A

## Bug reference

https://golang.org/issue/39049
etcd-io/bbolt#201
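For reference, a minimal hedged sketch of opening a database with go.etcd.io/bbolt, the fork this PR moves to; the file path, bucket name, and options are illustrative, not the raft store's real configuration.

```go
package main

import (
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// bbolt is the maintained fork of boltdb/bolt; switching to it picks
	// up the fixes for the unaligned-pointer problems mentioned above.
	db, err := bolt.Open("/tmp/raft-example.db", 0600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Sanity check: create a bucket and write a key.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("example"))
		if err != nil {
			return err
		}
		return b.Put([]byte("hello"), []byte("world"))
	})
	if err != nil {
		log.Fatal(err)
	}
}
```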
- Changed the name used on the model operator kubernetes deployment
juju#11599

### Checklist

- [x] Checked if it requires a [pylibjuju](https://github.com/juju/python-libjuju) change?
- [x] Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR?
- [x] Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed?
- [x] Do comments answer the question of why design decisions were made?

----

## Description of change

Changed the name used on the model operator kubernetes deployment.

## QA steps

```sh
cd tests && ./main.sh caasadmission
```
juju#11600

## Description of change

Merge 2.8-rc branch with these commits:

juju#11585 don't write errors when pubsub socket disconnects
juju#11583 k8s pod initialisation races
juju#11582 increase max wait time for WaitAdvance
juju#11573 finish legacy lease removal
juju#11591 increase timeout in machine agent tests
juju#11592 expose engine worker restart time as package variable
juju#11507 add k8s model agent pod
juju#11595 fix caas uniter restart on error
juju#11593 add revoke leases so dead units lose their lease immediately
juju#11594 fix deferreturn and bolt
juju#11599 update deployment name used for model op

## QA steps

See individual PRs.
!!build!!

hpidcock approved these changes on May 19, 2020
Description
Merge 2.8 branch with these commits:
#11585 don't write errors when pubsub socket disconnects
#11583 k8s pod initialisation races
#11582 increase max wait time for WaitAdvance
#11573 finish legacy lease removal
#11591 increase timeout in machine agent tests
#11592 expose engine worker restart time as package variable
#11507 add k8s model agent pod
#11595 fix caas uniter restart on error
#11593 add revoke leases so dead units lose their lease immediately
#11594 fix deferreturn and bolt
#11599 update deployment name used for model op
QA
See individual PRs.