It started during RSS on a new 32-sled rack, where we hit this failure:

```
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Current Rack Setup Status │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Status: Initialization Failed │
│ Last initialization operation ID: 116f7904-35dd-4301-a895-139f71c603f4 │
│ Message: Errors initializing datasets: Error initializing dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crucible: Zpool Not │
│ Found: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773, Error initializing dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/debug: │
│ Zpool Not Found: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773, Error initializing dataset │
│ oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crypt/debug: Zpool Not Found: oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing │
│ dataset oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crypt/zone/oxz_crucible_65a0d412-2bc5-4b37-9f57-d5830d688c86: Zpool Not Found: │
│ oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing dataset oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crucible: Zpool Not │
│ Found: oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/zone: │
│ Zpool Not Found: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773, Error initializing dataset │
│ oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crypt/zone: Zpool Not Found: oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing │
│ dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/zone/oxz_crucible_586474b6-bcc3-4013-894c-6422ed120cef: Zpool Not Found: │
│ oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773
```

Before starting RSS, we verified that all sleds were seen and that all sleds had their disks present (per diskinfo).

Logs gathered can be found here: /staff/core/omicron-8203

Interleaved here is an edited chat log with highlights of the debugging we walked through.

john (he/him): (cc sean (he/him), who has more context) I'm pretty confused by this failure, on its face.

It's reported by RSS here: https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-agent/src/rack_setup/service.rs#L429-L450

The error came from some sled here, I think: https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-storage/src/manager.rs#L1507-L1518

but based on the RSS check, we just successfully set up all the disks, so I'm not sure why we would be missing them when we go to do datasets immediately after

("immediately after" comes from the HTTP handler for putting a new config: we set up the disks and then immediately set up the datasets, but only if the disks all succeeded: https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-agent/src/sled_agent.rs#L978-L1000)
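For readers without the linked code open, the ordering john is describing boils down to the following. This is only a simplified sketch; the function and type names (`disks_ensure`, `datasets_ensure`, `omicron_config_put`) are illustrative stand-ins, not the actual handler code.

```rust
/// Hypothetical, simplified stand-ins for the real sled-agent types.
struct DisksResult {
    has_error: bool,
}
struct DatasetsResult;

fn disks_ensure() -> DisksResult {
    // ... manage every disk in the requested config, collecting per-disk errors ...
    DisksResult { has_error: false }
}

fn datasets_ensure() -> DatasetsResult {
    // ... create the requested datasets on the zpools backing those disks ...
    DatasetsResult
}

/// The ordering described above: datasets are only ensured if every disk
/// was set up without error, immediately beforehand, in the same request.
fn omicron_config_put() -> Result<DatasetsResult, DisksResult> {
    let disks = disks_ensure();
    if disks.has_error {
        return Err(disks);
    }
    Ok(datasets_ensure())
}

fn main() {
    let _ = omicron_config_put();
}
```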
alan: We did find the pools were created on sled 26:

```
support@oxz_switch0:~$ pilot host exec -c 'zfs list | grep oxp_8809d9a9' 0-31
...
26  BRM06240015  ok: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773              1.44M  2.82T    96K  /oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt         648K  2.82T   248K  /pool/ext/8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/debug   200K   100G   200K  /pool/ext/8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/debug
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/zone
```

sean (he/him):

> john (he/him):
> the error came from some sled here, I think: https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-storage/src/manager.rs#L1507-L1518

This looks like it's invoking https://github.com/oxidecomputer/omicron/blob/d4e0bc1465ced854ba33fb3aec5e9c49c02665c9/sled-agent/src/sled_agent.rs#L978-L1008, which, as you said, should be setting up disks, reporting zero errors, and then attempting to set up datasets.

I'm going to dig into the code more closely, but the ordering here should be okay, unless some part of "disk setup" is returning too early and racing with dataset initialization.

john (he/him): possible smoking gun in sled-agent.log

At the end of ensuring physical disks, we see 10 successes:

```
01:50:52.920Z INFO SledAgent (StorageManager): Successfully synchronized disks without error
    file = sled-storage/src/manager.rs:709
    result = DisksManagementResult { status: [DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800536" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800633" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800674" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800723" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800724" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800895" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800985" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801016" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801059" }, err: None }] }
```

but in the next couple of lines, we see two disks removed:

```
01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,2722@0/blkdev@w0014EE8401D00800,0", dev_path: Some("/dev/dsk/c6t0014EE8401D00800d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }
    file = sled-storage/src/manager.rs:615
01:50:52.920Z INFO SledAgent (StorageResources): Removing disk
    file = sled-storage/src/resources.rs:569
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }
01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,3/pci1b96,2722@0/blkdev@w0014EE8401D0C500,0", dev_path: Some("/dev/dsk/c7t0014EE8401D0C500d0") }, slot: 7, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801016" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }
    file = sled-storage/src/manager.rs:615
01:50:52.920Z INFO SledAgent (StorageResources): Removing disk
    file = sled-storage/src/resources.rs:569
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801016" }
```

john (he/him): looks like that tracks:

```
{
  "identity": {
    "vendor": "1b96",
    "model": "WUS5EA138ESP7E3",
    "serial": "23470B800991"
  },
  "id": "ece45ca3-f756-43bc-836e-425e650fe194",
  "pool_id": "6ee32179-9c93-4ad2-9816-05d8e5a9b414"
},
{
  "identity": {
    "vendor": "1b96",
    "model": "WUS5EA138ESP7E3",
    "serial": "23470B801016"
  },
  "id": "c873bf6a-bcb3-45fd-95a1-1c0362861289",
  "pool_id": "8809d9a9-1544-42d5-b3b7-28129a2f6773"
},
```

those two serials do match the pool IDs that we see failures for

so: sled-agent successfully started managing all 10 disks, then thinks those two disks went away

sean (he/him): so, next question: why did we see "raw disk removal"?

john (he/him): it looks like one of them then immediately came back?

```
01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDisk { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,2722@0/blkdev@w0014EE8401D00800,0", dev_path: Some("/dev/dsk/c6t0014EE8401D00800d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }
    file = sled-storage/src/manager.rs:615
01:50:52.920Z INFO SledAgent (StorageResources): Inserting disk
    file = sled-storage/src/resources.rs:491
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }
```

but maybe we didn't "re-manage" it in time

sean (he/him): I think the other one also came back (serial: 23470B801016), within about a second

```
01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDisk { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,2722@0/blkdev@w0014EE8401D00800,0", dev_path: Some("/dev/dsk/c6t0014EE8401D00800d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }
    file = sled-storage/src/manager.rs:615
01:50:52.920Z INFO SledAgent (StorageResources): Inserting disk
    file = sled-storage/src/resources.rs:491
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }
```

sean (he/him): To lay out a timeline:

01:48:41.541Z INFO SledAgent: physical disks ensure - I think we started the "set omicron config" call here

01:48:46.802Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Device with serial 23470B800536 was observed removed from the device tree

01:48:51.869Z INFO SledAgent (StorageResources): Ensuring zpool has datasets - Interestingly, I see additional calls to "Ensure zpool has datasets" which are continuing to operate on the disk which was "removed" (serial: 23470B800536). I don't think the removal interrupts the "physical disks ensure" call, and the operations are succeeding, so it's still going. This includes setting the LBA on that disk to 4096, doing other setup, etc.

01:48:54.309Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - We see the disk come back, with serial 23470B800536.

01:50:27.018Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B800991

01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B801016

01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - Addition of disk with serial 23470B800991

01:50:52.556Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval - We see the disk with serial 23470B800536 go away again.

01:50:52.556Z INFO SledAgent (StorageManager): Received DetectedRawDisk - We see the same disk with serial 23470B800536 come back.

01:50:52.920Z INFO SledAgent (StorageManager): Successfully synchronized disks without error - We think the disk exists again. I see the serial in the result output for DisksManagementResult.

01:50:53.336Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - Addition of disk with serial 23470B801016

01:50:53.685Z INFO SledAgent: datasets ensure - We try to ensure datasets. I see the "Failed to find zpool" errors here, in particular for the zpool with UUID 8809d9a9-1544-42d5-b3b7-28129a2f6773. Digging back further, this is the zpool associated with the disk with serial 23470B801016 - this was removed earlier, and was just detected by the device tree again.

01:51:02.863Z INFO SledAgent (StorageResources): Managing disk - I see the sled agent trying to manage the disk with serial 23470B801016; I think this is a downstream attempt to re-synchronize the disks. Looks successful.

01:51:03.147Z INFO SledAgent (StorageManager): Successfully synchronized disks without error - Finishes the "disk synchronization" from earlier.

TL;DR:

We asked the sled agent to manage the disks it could; this seemed to happen successfully.

The device tree was flapping a lot, showing multiple devices being added, removed, and re-added. Weirdly, while the devices were missing from the device tree, they still appeared to be mutable.

While one of those devices was "removed", we asked to put datasets on it. That failed, which I believe is currently expected, because we consider "managed" devices to be ones that we see in the configuration + ones that we think are physically plugged in.

The disk later re-appeared, and was "managed" by the control plane again, but this was after the request to put datasets on it failed.
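To make that failure mode concrete, the sketch below shows how a "disks I can currently see" view, mutated asynchronously by hardware events, yields the Zpool Not Found error for a pool whose disk is mid-reformat. The types here (`ManagedDisks`, `DatasetError`, `dataset_ensure`) are illustrative stand-ins, not the real sled-agent code; only the UUID, serial, and ordering come from the logs above.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a zpool identifier.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ZpoolUuid(String);

#[derive(Debug)]
enum DatasetError {
    /// Mirrors the "Zpool Not Found" error in the RSS report above.
    ZpoolNotFound(ZpoolUuid),
}

/// The storage manager's view of disks that are both configured *and*
/// currently visible in the device tree.
#[derive(Default)]
struct ManagedDisks {
    online: BTreeMap<ZpoolUuid, String>, // zpool -> disk serial
}

impl ManagedDisks {
    // Hardware-monitor events mutate this view asynchronously.
    fn disk_added(&mut self, pool: ZpoolUuid, serial: &str) {
        self.online.insert(pool, serial.to_string());
    }
    fn disk_removed(&mut self, pool: &ZpoolUuid) {
        self.online.remove(pool);
    }

    /// Dataset ensure only succeeds for pools whose disk is currently online.
    fn dataset_ensure(&self, pool: &ZpoolUuid) -> Result<(), DatasetError> {
        if self.online.contains_key(pool) {
            Ok(())
        } else {
            Err(DatasetError::ZpoolNotFound(pool.clone()))
        }
    }
}

fn main() {
    let pool = ZpoolUuid("8809d9a9-1544-42d5-b3b7-28129a2f6773".into());
    let mut disks = ManagedDisks::default();

    // Disks ensure reports success; the disk is online and managed.
    disks.disk_added(pool.clone(), "23470B801016");

    // The reformat detaches blkdev; a device-tree poll reports the disk gone.
    disks.disk_removed(&pool);

    // Datasets ensure runs before the re-added disk has been re-managed.
    assert!(matches!(
        disks.dataset_ensure(&pool),
        Err(DatasetError::ZpoolNotFound(_))
    ));
}
```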
rm: Just thinking out loud for a moment: to do the namespace reformat, we actually detach the blkdev instance to issue the format, and once that's done it'll come back. So that is a way we can see that happen without the NVMe-level device disappearing. But I don't know where we do that kind of disk device manipulation in that timeline, or whether it's relevant; just trying to think out loud here with what little we can get from what's in here. It's certainly possible (though weird) that the NVMe device went away and came back.

rm: It's specifically doing a blkdev detach so it can reformat. The problem is that we have to do something that destroys all the data on the device and renumbers all the sectors, so we can't do it while it is up.

mike: Not at the laptop at the moment, but yeah: detach, format, attach.

rm: So if rescanning and this operation are independent from the hardware monitor's loop, we're going to be a little split-brained. Do we have logs around the reformat actions that we might be able to correlate?

sean (he/him): We do. I think this roughly lines up with the timeline I posted earlier:

01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B801016
01:50:36.228Z INFO SledAgent (StorageResources): Resized 23470B801016 from 3840 to 3200
01:50:40.627Z INFO SledAgent (StorageResources): Formatted disk with serial 23470B801016 to an LBA with data size 4096
01:50:53.336Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded

So: yeah, it looks like the formatting happened in the window between "device removed" and "device added".

sean (he/him): So this makes more sense:

* We decide "what disks are plugged in" by scanning the device tree in the HardwareMonitor
* This information exists in the sled's inventory, and gets plumbed to RSS
* When RSS asks for a disk to be used, we call ensure_partition_layout -> if it's a U.2, we call ensure_size_and_formatting, which can remove the device from the device tree
* If we poll the device tree during this period of removal, this is interpreted as "this disk has been removed"
* When the disk later re-appears in the device tree, we'll try to manage it again, and "re-synchronize" the requested config with the disks observed in the device tree
* However, if datasets are requested while we are not synchronized (which can happen as part of formatting the disk, when the block device is detached), they'll show up as "not found"

sean (he/him): To answer the question "Why is the disk not visible immediately after it's re-added to the device tree?", the answer is basically: "Monitoring hardware and propagating that info to the sled agent is asynchronous". In this particular case, we may want to add messaging to force things to be more synchronous -- e.g., after formatting any disks, ensure we're operating on an up-to-date view of the device tree, and ensure that storage manager state is synchronized. By "synchronized", I mean "at least up-to-date enough that it can see the block device which has been re-added".
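One way that "force things to be more synchronous" idea could look is sketched below. This is purely illustrative and is not something sled-agent implements today: `device_tree_contains` and `wait_for_reattach` are hypothetical helpers standing in for whatever would query the hardware view.

```rust
use std::time::Duration;

/// Hypothetical error for the sketch: the device never came back.
#[derive(Debug)]
struct DeviceNotBack;

/// Stand-in for "is this serial visible in the current device-tree snapshot?"
/// In a real implementation this would consult the hardware monitor's view.
fn device_tree_contains(serial: &str) -> bool {
    let _ = serial;
    true
}

/// After a reformat detaches and re-attaches blkdev, poll until the device
/// re-appears (or give up) before declaring the disk "managed" again.
async fn wait_for_reattach(serial: &str, attempts: u32) -> Result<(), DeviceNotBack> {
    for _ in 0..attempts {
        if device_tree_contains(serial) {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
    Err(DeviceNotBack)
}

#[tokio::main]
async fn main() {
    // e.g. after the resize/format step on the disk with this serial:
    wait_for_reattach("23470B801016", 60)
        .await
        .expect("disk never re-appeared after reformat");
    // Only now acknowledge the disk as managed / let dataset requests proceed.
}
```

Polling is only one option; an explicit message from the formatting path to the storage manager would do the same job. The point is simply that dataset requests should not race the re-attach.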
sean (he/him): I think the timeline is:

* ensure_partition_layout removes the device and formats it.
* If the HardwareMonitor scans the device tree, it'll see the device as missing. This fires off a message to the "storage manager" part of the sled agent, telling it "this disk is gone".
* If the HardwareMonitor scans the device tree and sees the disk again, it'll send a message to the "storage manager" telling it "this disk was just added, adopt it into the control plane at your convenience".
* Every ten seconds, the storage manager "synchronizes disks", which means: it reads the configured disks, it looks at the disks that it thinks are online, and it adopts the disks it thinks should be part of the control plane.

rm: I think if we scanned NVMe devices rather than blkdev, we could probably avoid that. (And we wouldn't get tripped up by devices missing blkdev namespaces, or ones with too many.)

sean (he/him): Yeah, this makes sense to me, which would basically mean: from the perspective of the sled agent, reformatting the disk does not cause it to "go away".
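For anyone unfamiliar with the sled-agent internals, the event flow sean describes above is roughly the following. This is a toy model, not the real StorageManager/HardwareMonitor code: the message names mirror the log entries quoted earlier, but the types, channel wiring, and cadence handling are illustrative.

```rust
use std::collections::BTreeSet;
use std::time::Duration;
use tokio::sync::mpsc;

/// Simplified stand-ins for the hardware-monitor messages seen in the logs.
#[derive(Debug)]
enum DiskEvent {
    DetectedRawDisk { serial: String },
    DetectedRawDiskRemoval { serial: String },
}

/// A toy storage manager: applies events as they arrive, and reconciles the
/// requested config against the disks it currently believes are present on a
/// fixed ten-second cadence, as described in the timeline above.
async fn storage_manager(mut rx: mpsc::Receiver<DiskEvent>, configured: BTreeSet<String>) {
    let mut present = BTreeSet::new();
    let mut sync = tokio::time::interval(Duration::from_secs(10));
    loop {
        tokio::select! {
            Some(event) = rx.recv() => match event {
                DiskEvent::DetectedRawDisk { serial } => { present.insert(serial); }
                DiskEvent::DetectedRawDiskRemoval { serial } => { present.remove(&serial); }
            },
            _ = sync.tick() => {
                // "Synchronize disks": adopt configured disks that are present.
                let adopted: Vec<_> = configured.intersection(&present).collect();
                println!("managing {adopted:?}");
                // Anything configured-but-absent (e.g. mid-reformat) is skipped,
                // and dataset requests against it fail until it returns.
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    let configured = BTreeSet::from(["23470B801016".to_string()]);
    tokio::spawn(storage_manager(rx, configured));

    // A reformat window: the device-tree poll sees the disk vanish and return.
    tx.send(DiskEvent::DetectedRawDiskRemoval { serial: "23470B801016".into() }).await.unwrap();
    tx.send(DiskEvent::DetectedRawDisk { serial: "23470B801016".into() }).await.unwrap();
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```

The important property illustrated here is that the "present" set is only eventually consistent with the device tree, and the dataset request in this incident landed exactly inside that window.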