Add CloudRift backend #2771
Conversation
BTW, @6erun, is this expected? I see only the following offers in

Yes, we have more types now and we will update

@peterschmidt85 I have added RTX 5090 and RTX PRO 6000 here: dstackai/gpuhunt#158. We will start adding PRO 6000 nodes next week. These are all the GPUs we have for on-demand rental at the moment; the others we're offering on a month-to-month basis and cannot be exposed for on-demand for now. Is there anything else we need to look into?

No problem. There is also a small PR in `gpuhunt` to enable RTX 5090 and RTX PRO 6000: dstackai/gpuhunt#158
@6erun, thanks again for the PR. A couple of things didn't work for me at first, but I managed to make them work with some minor tweaks. Please see my review comments for details. I've provided suggestions for most of them, so I hope they will be easy to address.

Additionally, I've merged dstackai/gpuhunt#158. However, I couldn't get RTX 5090 to work with `dstack`:
- I've tried a few times to run a dev environment on `rtx59-16c-nr.2`. The dev environment started successfully, but I got this error in the container shell:

```
(workflow) root@riftvm:~# nvidia-smi
-bash: nvidia-smi: command not found
```

And this one on the host:

```
riftuser@riftvm:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
```
- I've tried running a dev environment on `rtx59-16c-nr.1` once, but it didn't start. It was stuck in `provisioning` for 10 minutes. The `instance_info` variable in `update_provisioning_data()` contained this (a parsing sketch follows after this comment):

```
{'id': '89a57132-5072-11f0-8f1e-db636327bfe3', 'status': 'Inactive', 'node_id': 'e381ba8a-1d41-11f0-aa9a-cfdc5b0bc398', 'node_mode': 'Container', 'node_status': 'Ready', 'cpu': {'vendor': 'AMD', 'vendor_logo_url': None, 'brand': 'AMD EPYC 7B13 64-Core Processor', 'brand_short': 'EPYC 7B13', 'physical_core_count': 128, 'logical_core_count': 256}, 'cpu_mask': 'ffff00000000', 'cpu_limit': 16, 'dram': 1081953382400, 'dram_limit': 107374182400, 'disk_limit': 0, 'gpus': [{'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:01:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:24:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:41:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:61:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:81:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:a1:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:c1:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:e1:00.0'}], 'gpu_mask': '4', 'gpu_limit': 1, 'host_address': '142.214.185.236', 'resource_info': {'provider_name': 'NeuralRack', 'instance_type': 'rtx59-16c-nr.1', 'cost_per_hour': 65.0}, 'usage_info': {'usage': {'secs': 1, 'nanos': 705464000}, 'accounted_usage': {'secs': 0, 'nanos': 0}, 'user_email': '*** redacted ***'}, 'virtual_machines': [], 'containers': [], 'instructions': {'instructions_template': '*** redacted ***', 'placeholder_values': [['NODE_IP', '142.214.185.236'], ['EXECUTOR_SHORT_ID', '89a57132']]}, 'reservation_data': None}
```
After 10 minutes, `dstack` tried to terminate the instance, but also failed:

```
ComputeError('Failed to terminate instance 89a57132-5072-11f0-8f1e-db636327bfe3 in region us-east-nc-nr-1.')
```

We couldn't find the instance in the console after that, so I assume it wasn't created correctly.

`rtx49-8c-nr.1` and `rtx49-8c-nr.2` worked as expected.
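For readers following along: below is a minimal, purely illustrative sketch of the kind of readiness check that `update_provisioning_data()` could apply to an `instance_info` payload like the one dumped above. The status values and the helper's name are assumptions inferred only from that dump (`'Inactive'` for the instance, `'Ready'` for the node); they are not CloudRift's documented API, and the actual backend code in this PR may look quite different.

```python
from typing import Any, Dict, Optional


def extract_ready_host(instance_info: Dict[str, Any]) -> Optional[str]:
    """Return the instance's host address once it looks usable, else None.

    Hypothetical helper: the readiness criteria are guessed from the dump
    above, where the stuck instance reported status 'Inactive' while its
    node was 'Ready'. A healthy instance presumably reports a different,
    non-'Inactive' status before dstack marks it as provisioned.
    """
    if instance_info.get("node_status") != "Ready":
        return None  # the underlying node is not ready yet
    if instance_info.get("status") == "Inactive":
        return None  # instance created but not running; keep polling
    return instance_info.get("host_address") or None


# Example against an abridged version of the payload above: status 'Inactive'
# means no host is returned, so provisioning would keep waiting.
sample = {"status": "Inactive", "node_status": "Ready", "host_address": "142.214.185.236"}
assert extract_ready_host(sample) is None
```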
```python
            return response_data.get("instance_types", [])
        return []

    def list_recipies(self) -> List[Dict]:
```
(nit) Typo here and in a few more places below
```diff
-    def list_recipies(self) -> List[Dict]:
+    def list_recipes(self) -> List[Dict]:
```
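As a side note for readers of the excerpt above, here is a self-contained sketch of the pattern it follows (return the parsed payload when present, otherwise an empty list), with the rename applied. The class name, endpoint path, response key, and use of `requests` are hypothetical placeholders, not CloudRift's actual API or the code in this PR.

```python
from typing import Dict, List

import requests  # assumed HTTP client; the PR may use a different one


class RiftClient:
    """Minimal stand-in for the API client under review; all names are hypothetical."""

    def __init__(self, api_url: str, api_key: str) -> None:
        self.api_url = api_url.rstrip("/")
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def list_recipes(self) -> List[Dict]:
        """Return available recipes, or an empty list if the request yields nothing."""
        # Hypothetical endpoint and response key, mirroring the
        # instance-types pattern in the excerpt above.
        response = requests.get(f"{self.api_url}/recipes", headers=self.headers)
        if response.ok:
            response_data = response.json()
            return response_data.get("recipes", [])
        return []
```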
Thank you @jvstme for the feedback! I think I addressed all your comments.

@jvstme, regarding the issues with the 5090: we've had some issues with drivers on those machines lately that require us to reset the machine manually. I think it's better to test with the 4090 for now.
Looks good to me! I'll merge the PR now, so the integration will be part of our next release this week. Thank you for the contribution.
> regarding the issues with the 5090: we've had some issues with drivers on those machines lately that require us to reset the machine manually.

Okay, no worries. However, if you expect this problem to persist for some time, I'd recommend temporarily excluding 5090s from `gpuhunt` so that they are not suggested to users who may want to try the integration once we announce it. We can easily remove or add `gpuhunt` offers without a release.
Thanks for the tip and the help with the integration! We've made some changes to the virtualization stack and will test whether that helps with the 5090 instability. If it doesn't, we'll either remove the 5090 from the `gpuhunt` CloudRift manifest generation logic or disable it in the backend.
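As a rough illustration of the "remove the 5090 from the manifest generation logic" option, a temporary filter along these lines could be applied while building the CloudRift offer list. The offer shape and the `gpu_name` field are assumptions made for the sketch, not `gpuhunt`'s confirmed schema.

```python
from typing import Dict, Iterable, List

# GPUs temporarily hidden while the driver issue persists (assumed naming).
EXCLUDED_GPUS = {"RTX 5090"}


def filter_offers(offers: Iterable[Dict]) -> List[Dict]:
    """Drop offers whose GPU is on the temporary exclusion list."""
    return [offer for offer in offers if offer.get("gpu_name") not in EXCLUDED_GPUS]


# Example: only the RTX 4090 offer survives the filter.
offers = [{"gpu_name": "RTX 5090"}, {"gpu_name": "RTX 4090"}]
assert filter_offers(offers) == [{"gpu_name": "RTX 4090"}]
```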
Added the CloudRift backend (cloudrift.ai). The `gpuhunt` part was done in dstackai/gpuhunt#133.