Add CloudRift backend #2771
Conversation
BTW, @6erun, is this expected? I see only the following offers in

Yes, we have more types now and we will update

@peterschmidt85 I have added RTX 5090 and RTX PRO 6000 here: dstackai/gpuhunt#158. We will start adding PRO 6000 nodes next week. These are all the GPUs we have for on-demand rental at the moment; the others we're offering on a month-to-month basis and cannot be exposed for on-demand for now. Is there anything else we need to look into?

No problem. There is also a small PR in `gpuhunt` to enable RTX 5090 and RTX PRO 6000: dstackai/gpuhunt#158
@6erun, thanks again for the PR. A couple of things didn't work for me at first, but I managed to make them work with some minor tweaks. Please see my review comments for details. I've provided suggestions for most of them, so I hope they will be easy to address.

Additionally, I've merged dstackai/gpuhunt#158. However, I couldn't get RTX 5090 to work with `dstack`:
- I've tried a few times to run a dev environment on `rtx59-16c-nr.2`. The dev environment started successfully, but I got this error in the container shell:

```
(workflow) root@riftvm:~# nvidia-smi
-bash: nvidia-smi: command not found
```

And this one on the host:

```
riftuser@riftvm:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
```
- I've tried running a dev environment on `rtx59-16c-nr.1` once, but it didn't start. It was stuck in `provisioning` for 10 minutes. The `instance_info` variable in `update_provisioning_data()` contained this (a parsing sketch follows after this comment):

```
{'id': '89a57132-5072-11f0-8f1e-db636327bfe3', 'status': 'Inactive', 'node_id': 'e381ba8a-1d41-11f0-aa9a-cfdc5b0bc398', 'node_mode': 'Container', 'node_status': 'Ready', 'cpu': {'vendor': 'AMD', 'vendor_logo_url': None, 'brand': 'AMD EPYC 7B13 64-Core Processor', 'brand_short': 'EPYC 7B13', 'physical_core_count': 128, 'logical_core_count': 256}, 'cpu_mask': 'ffff00000000', 'cpu_limit': 16, 'dram': 1081953382400, 'dram_limit': 107374182400, 'disk_limit': 0, 'gpus': [{'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:01:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:24:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:41:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:61:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:81:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:a1:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:c1:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:e1:00.0'}], 'gpu_mask': '4', 'gpu_limit': 1, 'host_address': '142.214.185.236', 'resource_info': {'provider_name': 'NeuralRack', 'instance_type': 'rtx59-16c-nr.1', 'cost_per_hour': 65.0}, 'usage_info': {'usage': {'secs': 1, 'nanos': 705464000}, 'accounted_usage': {'secs': 0, 'nanos': 0}, 'user_email': '*** redacted ***'}, 'virtual_machines': [], 'containers': [], 'instructions': {'instructions_template': '*** redacted ***', 'placeholder_values': [['NODE_IP', '142.214.185.236'], ['EXECUTOR_SHORT_ID', '89a57132']]}, 'reservation_data': None}
```
After 10 minutes, `dstack` tried to terminate the instance, but also failed:

```
ComputeError('Failed to terminate instance 89a57132-5072-11f0-8f1e-db636327bfe3 in region us-east-nc-nr-1.')
```

We couldn't find the instance in the console after that, so I assume it wasn't created correctly.

`rtx49-8c-nr.1` and `rtx49-8c-nr.2` worked as expected.
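For readers following along: below is a minimal, purely illustrative sketch of the kind of readiness check that `update_provisioning_data()` could apply to an `instance_info` payload like the one dumped above. The status values and the helper's name are assumptions inferred only from that dump (`'Inactive'` for the instance, `'Ready'` for the node); they are not CloudRift's documented API, and the actual backend code in this PR may look quite different.

```python
from typing import Any, Dict, Optional


def extract_ready_host(instance_info: Dict[str, Any]) -> Optional[str]:
    """Return the instance's host address once it looks usable, else None.

    Hypothetical helper: the readiness criteria are guessed from the dump
    above, where the stuck instance reported status 'Inactive' while its
    node was 'Ready'. A healthy instance presumably reports a different,
    non-'Inactive' status before dstack marks it as provisioned.
    """
    if instance_info.get("node_status") != "Ready":
        return None  # the underlying node is not ready yet
    if instance_info.get("status") == "Inactive":
        return None  # instance created but not running; keep polling
    return instance_info.get("host_address") or None


# Example against an abridged version of the payload above: status 'Inactive'
# means no host is returned, so provisioning would keep waiting.
sample = {"status": "Inactive", "node_status": "Ready", "host_address": "142.214.185.236"}
assert extract_ready_host(sample) is None
```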
```python
            return response_data.get("instance_types", [])
        return []

    def list_recipies(self) -> List[Dict]:
```
(nit) Typo here and in a few more places below
```diff
-    def list_recipies(self) -> List[Dict]:
+    def list_recipes(self) -> List[Dict]:
```
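As a side note for readers of the excerpt above, here is a self-contained sketch of the pattern it follows (return the parsed payload when present, otherwise an empty list), with the rename applied. The class name, endpoint path, response key, and use of `requests` are hypothetical placeholders, not CloudRift's actual API or the code in this PR.

```python
from typing import Dict, List

import requests  # assumed HTTP client; the PR may use a different one


class RiftClient:
    """Minimal stand-in for the API client under review; all names are hypothetical."""

    def __init__(self, api_url: str, api_key: str) -> None:
        self.api_url = api_url.rstrip("/")
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def list_recipes(self) -> List[Dict]:
        """Return available recipes, or an empty list if the request yields nothing."""
        # Hypothetical endpoint and response key, mirroring the
        # instance-types pattern in the excerpt above.
        response = requests.get(f"{self.api_url}/recipes", headers=self.headers)
        if response.ok:
            response_data = response.json()
            return response_data.get("recipes", [])
        return []
```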
Thank you @jvstme for the feedback! I think I addressed all your comments.

@jvstme, regarding the issues with the 5090: we've had some issues with drivers on those machines lately that require us to reset the machine manually. I think it's better to test with the 4090 for now.
Looks good to me! I'll merge the PR now, so the integration will be part of our next release this week. Thank you for the contribution.
> regarding the issues with the 5090: we've had some issues with drivers on those machines lately that require us to reset the machine manually.

Okay, no worries. However, if you expect this problem to persist for some time, I'd recommend temporarily excluding 5090s from `gpuhunt` so that they are not suggested to users who may want to try the integration once we announce it. We can easily remove or add `gpuhunt` offers without a release.
Thanks for the tip and the help with the integration! We've made some changes to the virtualization stack and will test whether that helps with the 5090 instability. If it doesn't, we'll either remove the 5090 from the `gpuhunt` CloudRift manifest generation logic or disable it in the backend.
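As a rough illustration of the "remove the 5090 from the manifest generation logic" option, a temporary filter along these lines could be applied while building the CloudRift offer list. The offer shape and the `gpu_name` field are assumptions made for the sketch, not `gpuhunt`'s confirmed schema.

```python
from typing import Dict, Iterable, List

# GPUs temporarily hidden while the driver issue persists (assumed naming).
EXCLUDED_GPUS = {"RTX 5090"}


def filter_offers(offers: Iterable[Dict]) -> List[Dict]:
    """Drop offers whose GPU is on the temporary exclusion list."""
    return [offer for offer in offers if offer.get("gpu_name") not in EXCLUDED_GPUS]


# Example: only the RTX 4090 offer survives the filter.
offers = [{"gpu_name": "RTX 5090"}, {"gpu_name": "RTX 4090"}]
assert filter_offers(offers) == [{"gpu_name": "RTX 4090"}]
```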
Added the CloudRift backend (cloudrift.ai). The `gpuhunt` part was done in dstackai/gpuhunt#133.