It started during RSS on a new 32-sled rack, where we hit this failure:

```
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Current Rack Setup Status │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Status: Initialization Failed │
│ Last initialization operation ID: 116f7904-35dd-4301-a895-139f71c603f4 │
│ Message: Errors initializing datasets: Error initializing dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crucible: Zpool Not │
│ Found: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773, Error initializing dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/debug: │
│ Zpool Not Found: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773, Error initializing dataset │
│ oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crypt/debug: Zpool Not Found: oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing │
│ dataset oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crypt/zone/oxz_crucible_65a0d412-2bc5-4b37-9f57-d5830d688c86: Zpool Not Found: │
│ oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing dataset oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crucible: Zpool Not │
│ Found: oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/zone: │
│ Zpool Not Found: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773, Error initializing dataset │
│ oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crypt/zone: Zpool Not Found: oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing │
│ dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/zone/oxz_crucible_586474b6-bcc3-4013-894c-6422ed120cef: Zpool Not Found: │
│ oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773
```

Before starting RSS, we verified that all sleds were seen and that all sleds had their disks present (per diskinfo).

Logs gathered can be found here: /staff/core/omicron-8203

Interleaved here is an edited chat log with highlights of the debugging we walked through.

john (he/him): (cc sean (he/him), who has more context) I'm pretty confused by this failure, on its face.

It's reported by RSS here: https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-agent/src/rack_setup/service.rs#L429-L450

The error came from some sled here, I think: https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-storage/src/manager.rs#L1507-L1518

but based on the RSS check, we just successfully set up all the disks, so I'm not sure why we would be missing them when we go to do datasets immediately after

("immediately after" comes from the HTTP handler for putting a new config: we set up the disks and then immediately set up the datasets, but only if the disks all succeeded: https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-agent/src/sled_agent.rs#L978-L1000)
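For readers without the linked code open, the ordering john is describing boils down to the following. This is only a simplified sketch; the function and type names (`disks_ensure`, `datasets_ensure`, `omicron_config_put`) are illustrative stand-ins, not the actual handler code.

```rust
/// Hypothetical, simplified stand-ins for the real sled-agent types.
struct DisksResult {
    has_error: bool,
}
struct DatasetsResult;

fn disks_ensure() -> DisksResult {
    // ... manage every disk in the requested config, collecting per-disk errors ...
    DisksResult { has_error: false }
}

fn datasets_ensure() -> DatasetsResult {
    // ... create the requested datasets on the zpools backing those disks ...
    DatasetsResult
}

/// The ordering described above: datasets are only ensured if every disk
/// was set up without error, immediately beforehand, in the same request.
fn omicron_config_put() -> Result<DatasetsResult, DisksResult> {
    let disks = disks_ensure();
    if disks.has_error {
        return Err(disks);
    }
    Ok(datasets_ensure())
}

fn main() {
    let _ = omicron_config_put();
}
```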
alan: We did find the pools were created on sled 26:

```
support@oxz_switch0:~$ pilot host exec -c 'zfs list | grep oxp_8809d9a9' 0-31
...
26  BRM06240015  ok: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773              1.44M  2.82T    96K  /oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt         648K  2.82T   248K  /pool/ext/8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/debug   200K   100G   200K  /pool/ext/8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/debug
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/zone
```

sean (he/him):

> john (he/him):
> the error came from some sled here, I think: https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-storage/src/manager.rs#L1507-L1518

This looks like it's invoking https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-agent/src/sled_agent.rs#L978-L1008, which, as you said, should be setting up disks, reporting zero errors, and then attempting to set up datasets.

I'm going to dig into the code more closely, but the ordering here should be okay, unless some part of "disk setup" is returning too early and racing with dataset initialization.

john (he/him): possible smoking gun in sled-agent.log

At the end of ensuring physical disks, we see 10 successes:

```
01:50:52.920Z INFO SledAgent (StorageManager): Successfully synchronized disks without error
    file = sled-storage/src/manager.rs:709
    result = DisksManagementResult { status: [DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800536" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800633" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800674" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800723" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800724" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800895" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800985" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801016" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801059" }, err: None }] }
```

but in the next couple of lines, we see two disks removed:

```
01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,2722@0/blkdev@w0014EE8401D00800,0", dev_path: Some("/dev/dsk/c6t0014EE8401D00800d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }
    file = sled-storage/src/manager.rs:615
01:50:52.920Z INFO SledAgent (StorageResources): Removing disk
    file = sled-storage/src/resources.rs:569
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }
01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,3/pci1b96,2722@0/blkdev@w0014EE8401D0C500,0", dev_path: Some("/dev/dsk/c7t0014EE8401D0C500d0") }, slot: 7, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801016" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }
    file = sled-storage/src/manager.rs:615
01:50:52.920Z INFO SledAgent (StorageResources): Removing disk
    file = sled-storage/src/resources.rs:569
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801016" }
```

john (he/him): looks like that tracks:

```
{
  "identity": {
    "vendor": "1b96",
    "model": "WUS5EA138ESP7E3",
    "serial": "23470B800991"
  },
  "id": "ece45ca3-f756-43bc-836e-425e650fe194",
  "pool_id": "6ee32179-9c93-4ad2-9816-05d8e5a9b414"
},
{
  "identity": {
    "vendor": "1b96",
    "model": "WUS5EA138ESP7E3",
    "serial": "23470B801016"
  },
  "id": "c873bf6a-bcb3-45fd-95a1-1c0362861289",
  "pool_id": "8809d9a9-1544-42d5-b3b7-28129a2f6773"
},
```

those two serials do match the pool IDs that we see failures for

so: sled-agent successfully started managing all 10 disks, then thinks those two disks went away

sean (he/him): so, next question: why did we see "raw disk removal"?

john (he/him): it looks like one of them then immediately came back?

```
01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDisk { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,2722@0/blkdev@w0014EE8401D00800,0", dev_path: Some("/dev/dsk/c6t0014EE8401D00800d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }
    file = sled-storage/src/manager.rs:615
01:50:52.920Z INFO SledAgent (StorageResources): Inserting disk
    file = sled-storage/src/resources.rs:491
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }
```

but maybe we didn't "re-manage" it in time

sean (he/him): I think the other one also came back (serial: 23470B801016), within about a second

```
01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDisk { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,2722@0/blkdev@w0014EE8401D00800,0", dev_path: Some("/dev/dsk/c6t0014EE8401D00800d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }
    file = sled-storage/src/manager.rs:615
01:50:52.920Z INFO SledAgent (StorageResources): Inserting disk
    file = sled-storage/src/resources.rs:491
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }
```

sean (he/him): To lay out a timeline:

01:48:41.541Z INFO SledAgent: physical disks ensure - I think we started the "set omicron config" call here

01:48:46.802Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Device with serial 23470B800536 was observed removed from the device tree

01:48:51.869Z INFO SledAgent (StorageResources): Ensuring zpool has datasets - Interestingly, I see additional calls to "Ensure zpool has datasets" which are continuing to operate on the disk which was "removed" (serial: 23470B800536). I don't think the removal interrupts the "physical disks ensure" call, and the operations are succeeding, so it's still going. This includes setting the LBA on that disk to 4096, doing other setup, etc.

01:48:54.309Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - We see the disk come back, with serial 23470B800536.

01:50:27.018Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B800991

01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B801016

01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - Addition of disk with serial 23470B800991

01:50:52.556Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval - We see the disk with serial 23470B800536 go away again.

01:50:52.556Z INFO SledAgent (StorageManager): Received DetectedRawDisk - We see the same disk with serial 23470B800536 come back.

01:50:52.920Z INFO SledAgent (StorageManager): Successfully synchronized disks without error - We think the disk exists again. I see the serial in the result output for DisksManagementResult.

01:50:53.336Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - Addition of disk with serial 23470B801016

01:50:53.685Z INFO SledAgent: datasets ensure - We try to ensure datasets. I see the "Failed to find zpool" errors here, in particular for the zpool with UUID 8809d9a9-1544-42d5-b3b7-28129a2f6773. Digging back further, this is the zpool associated with the disk with serial 23470B801016 - this was removed earlier, and was just detected by the device tree again.

01:51:02.863Z INFO SledAgent (StorageResources): Managing disk - I see the sled agent trying to manage the disk with serial 23470B801016; I think this is a downstream attempt to re-synchronize the disks. Looks successful.

01:51:03.147Z INFO SledAgent (StorageManager): Successfully synchronized disks without error - Finishes the "disk synchronization" from earlier.

TL;DR:

We asked the sled agent to manage the disks it could; this seemed to happen successfully.

The device tree was flapping a lot, showing multiple devices being added, removed, and re-added. Weirdly, while the devices were missing from the device tree, they still appeared to be mutable.

While one of those devices was "removed", we asked to put datasets on it. That failed, which I believe is currently expected, because we consider "managed" devices to be ones that we see in the configuration + ones that we think are physically plugged in.

The disk later re-appeared, and was "managed" by the control plane again, but this was after the request to put datasets on it failed.
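To make that failure mode concrete, the sketch below shows how a "disks I can currently see" view, mutated asynchronously by hardware events, yields the Zpool Not Found error for a pool whose disk is mid-reformat. The types here (`ManagedDisks`, `DatasetError`, `dataset_ensure`) are illustrative stand-ins, not the real sled-agent code; only the UUID, serial, and ordering come from the logs above.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a zpool identifier.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ZpoolUuid(String);

#[derive(Debug)]
enum DatasetError {
    /// Mirrors the "Zpool Not Found" error in the RSS report above.
    ZpoolNotFound(ZpoolUuid),
}

/// The storage manager's view of disks that are both configured *and*
/// currently visible in the device tree.
#[derive(Default)]
struct ManagedDisks {
    online: BTreeMap<ZpoolUuid, String>, // zpool -> disk serial
}

impl ManagedDisks {
    // Hardware-monitor events mutate this view asynchronously.
    fn disk_added(&mut self, pool: ZpoolUuid, serial: &str) {
        self.online.insert(pool, serial.to_string());
    }
    fn disk_removed(&mut self, pool: &ZpoolUuid) {
        self.online.remove(pool);
    }

    /// Dataset ensure only succeeds for pools whose disk is currently online.
    fn dataset_ensure(&self, pool: &ZpoolUuid) -> Result<(), DatasetError> {
        if self.online.contains_key(pool) {
            Ok(())
        } else {
            Err(DatasetError::ZpoolNotFound(pool.clone()))
        }
    }
}

fn main() {
    let pool = ZpoolUuid("8809d9a9-1544-42d5-b3b7-28129a2f6773".into());
    let mut disks = ManagedDisks::default();

    // Disks ensure reports success; the disk is online and managed.
    disks.disk_added(pool.clone(), "23470B801016");

    // The reformat detaches blkdev; a device-tree poll reports the disk gone.
    disks.disk_removed(&pool);

    // Datasets ensure runs before the re-added disk has been re-managed.
    assert!(matches!(
        disks.dataset_ensure(&pool),
        Err(DatasetError::ZpoolNotFound(_))
    ));
}
```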
rm: Just thinking out loud for a moment: to do the namespace reformat, we actually detach the blkdev instance to issue the format, and once that's done it'll come back. So that is a way we can see that happen without the NVMe-level device disappearing. But I don't know where we do that kind of disk device manipulation in that timeline, or whether it's relevant; just trying to think out loud here with what little we can get from what's in here. It's certainly possible (though weird) that the NVMe device went away and came back.

rm: It's specifically doing a blkdev detach so it can reformat. The problem is that we have to do something that destroys all the data on the device and renumbers all the sectors, so we can't do it while it is up.

mike: Not at the laptop at the moment, but yeah: detach, format, attach.

rm: So if rescanning and this operation are independent from the hardware monitor's loop, we're going to be a little split-brained. Do we have logs around the reformat actions that we might be able to correlate?

sean (he/him): We do. I think this roughly lines up with the timeline I posted earlier:

01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B801016
01:50:36.228Z INFO SledAgent (StorageResources): Resized 23470B801016 from 3840 to 3200
01:50:40.627Z INFO SledAgent (StorageResources): Formatted disk with serial 23470B801016 to an LBA with data size 4096
01:50:53.336Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded

So: yeah, it looks like the formatting happened in the window between "device removed" and "device added".

sean (he/him): So this makes more sense:

* We decide "what disks are plugged in" by scanning the device tree in the HardwareMonitor
* This information exists in the sled's inventory, and gets plumbed to RSS
* When RSS asks for a disk to be used, we call ensure_partition_layout -> if it's a U.2, we call ensure_size_and_formatting, which can remove the device from the device tree
* If we poll the device tree during this period of removal, this is interpreted as "this disk has been removed"
* When the disk later re-appears in the device tree, we'll try to manage it again, and "re-synchronize" the requested config with the disks observed in the device tree
* However, if datasets are requested while we are not synchronized (which can happen as part of formatting the disk, when the block device is detached), they'll show up as "not found"

sean (he/him): To answer the question "Why is the disk not visible immediately after it's re-added to the device tree?", the answer is basically: "Monitoring hardware and propagating that info to the sled agent is asynchronous". In this particular case, we may want to add messaging to force things to be more synchronous -- e.g., after formatting any disks, ensure we're operating on an up-to-date view of the device tree, and ensure that storage manager state is synchronized. By "synchronized", I mean "at least up-to-date enough that it can see the block device which has been re-added".
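One way that "force things to be more synchronous" idea could look is sketched below. This is purely illustrative and is not something sled-agent implements today: `device_tree_contains` and `wait_for_reattach` are hypothetical helpers standing in for whatever would query the hardware view.

```rust
use std::time::Duration;

/// Hypothetical error for the sketch: the device never came back.
#[derive(Debug)]
struct DeviceNotBack;

/// Stand-in for "is this serial visible in the current device-tree snapshot?"
/// In a real implementation this would consult the hardware monitor's view.
fn device_tree_contains(serial: &str) -> bool {
    let _ = serial;
    true
}

/// After a reformat detaches and re-attaches blkdev, poll until the device
/// re-appears (or give up) before declaring the disk "managed" again.
async fn wait_for_reattach(serial: &str, attempts: u32) -> Result<(), DeviceNotBack> {
    for _ in 0..attempts {
        if device_tree_contains(serial) {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
    Err(DeviceNotBack)
}

#[tokio::main]
async fn main() {
    // e.g. after the resize/format step on the disk with this serial:
    wait_for_reattach("23470B801016", 60)
        .await
        .expect("disk never re-appeared after reformat");
    // Only now acknowledge the disk as managed / let dataset requests proceed.
}
```

Polling is only one option; an explicit message from the formatting path to the storage manager would do the same job. The point is simply that dataset requests should not race the re-attach.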
sean (he/him): I think the timeline is:

* ensure_partition_layout removes the device and formats it.
* If the HardwareMonitor scans the device tree, it'll see the device as missing. This fires off a message to the "storage manager" part of the sled agent, telling it "this disk is gone".
* If the HardwareMonitor scans the device tree and sees the disk again, it'll send a message to the "storage manager" telling it "this disk was just added, adopt it into the control plane at your convenience".
* Every ten seconds, the storage manager "synchronizes disks", which means: it reads the configured disks, it looks at the disks that it thinks are online, and it adopts the disks it thinks should be part of the control plane.

rm: I think if we scanned NVMe devices rather than blkdev, we could probably avoid that. (And we wouldn't get tripped up by devices missing blkdev namespaces, or ones with too many.)

sean (he/him): Yeah, this makes sense to me, which would basically mean: from the perspective of the sled agent, reformatting the disk does not cause it to "go away".
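For anyone unfamiliar with the sled-agent internals, the event flow sean describes above is roughly the following. This is a toy model, not the real StorageManager/HardwareMonitor code: the message names mirror the log entries quoted earlier, but the types, channel wiring, and cadence handling are illustrative.

```rust
use std::collections::BTreeSet;
use std::time::Duration;
use tokio::sync::mpsc;

/// Simplified stand-ins for the hardware-monitor messages seen in the logs.
#[derive(Debug)]
enum DiskEvent {
    DetectedRawDisk { serial: String },
    DetectedRawDiskRemoval { serial: String },
}

/// A toy storage manager: applies events as they arrive, and reconciles the
/// requested config against the disks it currently believes are present on a
/// fixed ten-second cadence, as described in the timeline above.
async fn storage_manager(mut rx: mpsc::Receiver<DiskEvent>, configured: BTreeSet<String>) {
    let mut present = BTreeSet::new();
    let mut sync = tokio::time::interval(Duration::from_secs(10));
    loop {
        tokio::select! {
            Some(event) = rx.recv() => match event {
                DiskEvent::DetectedRawDisk { serial } => { present.insert(serial); }
                DiskEvent::DetectedRawDiskRemoval { serial } => { present.remove(&serial); }
            },
            _ = sync.tick() => {
                // "Synchronize disks": adopt configured disks that are present.
                let adopted: Vec<_> = configured.intersection(&present).collect();
                println!("managing {adopted:?}");
                // Anything configured-but-absent (e.g. mid-reformat) is skipped,
                // and dataset requests against it fail until it returns.
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    let configured = BTreeSet::from(["23470B801016".to_string()]);
    tokio::spawn(storage_manager(rx, configured));

    // A reformat window: the device-tree poll sees the disk vanish and return.
    tx.send(DiskEvent::DetectedRawDiskRemoval { serial: "23470B801016".into() }).await.unwrap();
    tx.send(DiskEvent::DetectedRawDisk { serial: "23470B801016".into() }).await.unwrap();
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```

The important property illustrated here is that the "present" set is only eventually consistent with the device tree, and the dataset request in this incident landed exactly inside that window.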