Contributor

@6erun 6erun commented Jun 6, 2025

Added CloudRift backend (cloudrift.ai).

The gpuhunt part was done in dstackai/gpuhunt#133.

@peterschmidt85 peterschmidt85 requested a review from jvstme June 9, 2025 08:57
@peterschmidt85
Contributor

BTW,

python -m gpuhunt cloudrift --output cloudrift.csv
2025-06-09 12:21:53,188 INFO Fetching offers for cloudrift
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'RTX 4090 Ext'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'RTX 4090 Max'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'Genoa'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'H100'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'A100'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'L40S'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'L40S'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'A100'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'H100'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'RTX 6000 Ada'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'H200'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'RTX 5090 Pro'
2025-06-09 12:21:55,773 WARNING Failed to find GPU name matching 'MI300X'

@6erun Is this expected? I see only the following offers in gpuhunt now:

instance_name,location,price,cpu,memory,gpu_count,gpu_name,gpu_memory,spot,disk_size,gpu_vendor,flags,cpu_arch
rtx49-8c-nr.1,us-east-nc-nr-1,0.4,8,50.0,1,RTX4090,24,False,500.0,nvidia,,x86
rtx49-8c-nr.2,us-east-nc-nr-1,0.8,16,100.0,2,RTX4090,24,False,1000.0,nvidia,,x86
rtx49-8c-nr.3,us-east-nc-nr-1,1.2,24,150.0,3,RTX4090,24,False,1500.0,nvidia,,x86
rtx49-8c-nr.4,us-east-nc-nr-1,1.6,32,200.0,4,RTX4090,24,False,2000.0,nvidia,,x86
rtx49-8c-nr.5,us-east-nc-nr-1,2.0,40,250.0,5,RTX4090,24,False,2500.0,nvidia,,x86
rtx49-8c-nr.6,us-east-nc-nr-1,2.4,48,300.0,6,RTX4090,24,False,3000.0,nvidia,,x86
rtx49-8c-nr.7,us-east-nc-nr-1,2.8,56,350.0,7,RTX4090,24,False,3500.0,nvidia,,x86
rtx49-8c-nr.8,us-east-nc-nr-1,3.2,64,400.0,8,RTX4090,24,False,4000.0,nvidia,,x86
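
As a quick sanity check of the offers above, here is a minimal sketch (not part of the PR) that parses this kind of CSV with Python's standard `csv` module and computes the hourly price per GPU. The sample rows are copied from the output above. The helper name `price_per_gpu` is hypothetical.

```python
import csv
import io

# Three rows copied from the gpuhunt output above (header included).
SAMPLE = """instance_name,location,price,cpu,memory,gpu_count,gpu_name,gpu_memory,spot,disk_size,gpu_vendor,flags,cpu_arch
rtx49-8c-nr.1,us-east-nc-nr-1,0.4,8,50.0,1,RTX4090,24,False,500.0,nvidia,,x86
rtx49-8c-nr.2,us-east-nc-nr-1,0.8,16,100.0,2,RTX4090,24,False,1000.0,nvidia,,x86
rtx49-8c-nr.8,us-east-nc-nr-1,3.2,64,400.0,8,RTX4090,24,False,4000.0,nvidia,,x86
"""


def price_per_gpu(csv_text):
    """Map instance_name -> hourly price per GPU, rounded to 4 decimals.

    Hypothetical helper for illustration; not a gpuhunt function.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return {
        row["instance_name"]: round(float(row["price"]) / int(row["gpu_count"]), 4)
        for row in rows
    }


per_gpu = price_per_gpu(SAMPLE)
# Distinct GPU names present in the catalog sample.
gpu_names = {row["gpu_name"] for row in csv.DictReader(io.StringIO(SAMPLE))}
```

With the rows shown, every instance type works out to the same per-GPU price, and `RTX4090` is the only GPU name present, which matches the observation that no other offers made it into the catalog.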

@6erun
Contributor Author

6erun commented Jun 9, 2025

Yes, we have more types now and we will update gpuhunt soon. It was expected and can be ignored for now.

@Slonegg
Contributor

Slonegg commented Jun 14, 2025

@peterschmidt85 I have added RTX 5090 and RTX PRO 6000 here: dstackai/gpuhunt#158

We will start adding PRO 6000 nodes next week.

These are all the GPUs we have for on-demand rental at the moment; the others are offered on a month-to-month basis and cannot be exposed for on-demand rental for now.

Is there anything else we need to look into?

@jvstme
Collaborator

jvstme commented Jun 17, 2025

@Slonegg and @6erun, thank you for the PRs, and I apologize for the delayed review. I will be able to review them within a couple of days.

@Slonegg
Contributor

Slonegg commented Jun 19, 2025

> @Slonegg and @6erun, thank you for the PRs, and I apologize for the delayed review. I will be able to review them within a couple of days.

No problem. There is also a small PR in gpuhunt to enable RTX 5090 and RTX PRO 6000: dstackai/gpuhunt#158

Collaborator

@jvstme jvstme left a comment

@6erun, thanks again for the PR. A couple of things didn't work for me at first, but I managed to make them work with some minor tweaks. Please see my review comments for details. I've provided suggestions for most of them, so I hope they will be easy to address.

Additionally, I've merged dstackai/gpuhunt#158. However, I couldn't get RTX 5090 to work with dstack:

  1. I've tried a few times to run a dev environment on rtx59-16c-nr.2. The dev environment started successfully, but I got this error in the container shell:

    (workflow) root@riftvm:~# nvidia-smi
    -bash: nvidia-smi: command not found

    And this one on the host:

    riftuser@riftvm:~$ nvidia-smi
    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
  2. I've tried running a dev environment on rtx59-16c-nr.1 once, but it didn't start. It was stuck in provisioning for 10 minutes. The instance_info variable in update_provisioning_data() contained this:

    instance_info
    {'id': '89a57132-5072-11f0-8f1e-db636327bfe3', 'status': 'Inactive', 'node_id': 'e381ba8a-1d41-11f0-aa9a-cfdc5b0bc398', 'node_mode': 'Container', 'node_status': 'Ready', 'cpu': {'vendor': 'AMD', 'vendor_logo_url': None, 'brand': 'AMD EPYC 7B13 64-Core Processor', 'brand_short': 'EPYC 7B13', 'physical_core_count': 128, 'logical_core_count': 256}, 'cpu_mask': 'ffff00000000', 'cpu_limit': 16, 'dram': 1081953382400, 'dram_limit': 107374182400, 'disk_limit': 0, 'gpus': [{'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:01:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:24:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:41:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:61:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:81:00.0'}, 
{'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:a1:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:c1:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:e1:00.0'}], 'gpu_mask': '4', 'gpu_limit': 1, 'host_address': '142.214.185.236', 'resource_info': {'provider_name': 'NeuralRack', 'instance_type': 'rtx59-16c-nr.1', 'cost_per_hour': 65.0}, 'usage_info': {'usage': {'secs': 1, 'nanos': 705464000}, 'accounted_usage': {'secs': 0, 'nanos': 0}, 'user_email': '*** redacted ***'}, 'virtual_machines': [], 'containers': [], 'instructions': {'instructions_template': '*** redacted ***', 'placeholder_values': [['NODE_IP', '142.214.185.236'], ['EXECUTOR_SHORT_ID', '89a57132']]}, 'reservation_data': None}
    

    After 10 minutes, dstack tried to terminate the instance, but also failed:

    ComputeError('Failed to terminate instance 89a57132-5072-11f0-8f1e-db636327bfe3 in region us-east-nc-nr-1.')
    

    We couldn't find the instance in the console after that, so I assume it wasn't created correctly.

rtx49-8c-nr.1 and rtx49-8c-nr.2 worked as expected.
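
The failure mode in item 2 (an instance that stays `Inactive` until the provisioning deadline passes) can be pictured with a small polling sketch. This is not dstack's actual code. `wait_until_active`, `fetch_info`, and the `'Active'` target status are hypothetical. Only the `status` key and the `'Inactive'` value come from the dump above.

```python
import time
from typing import Callable, Dict, Optional


def wait_until_active(
    fetch_info: Callable[[], Optional[Dict]],
    timeout: float,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll fetch_info() until it reports an active instance or timeout elapses.

    Returns True if the instance became active, False on timeout.
    The 'Active' status string is an assumption made for this sketch.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        info = fetch_info()
        if info is not None and info.get("status") == "Active":
            return True
        # Not ready yet (for example status == 'Inactive'): wait and retry.
        sleep(interval)
    return False
```

A caller that gets `False` back would then attempt cleanup, which is the step that failed with `ComputeError` in the report above. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.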

return response_data.get("instance_types", [])
return []

def list_recipies(self) -> List[Dict]:
Collaborator

(nit) Typo here and in a few more places below

Suggested change
- def list_recipies(self) -> List[Dict]:
+ def list_recipes(self) -> List[Dict]:

@6erun
Contributor Author

6erun commented Jun 24, 2025

Thank you @jvstme for the feedback! I think I addressed all your comments.

@6erun
Contributor Author

6erun commented Jun 24, 2025

@jvstme regarding issues with 5090, we've had some issues with drivers on those machines lately, which require us to reset the machines manually. I think it's better to test it with 4090 for now.

Collaborator

@jvstme jvstme left a comment

Looks good to me! I'll merge the PR now, so the integration will be part of our next release this week. Thank you for the contribution.

> regarding issues with 5090, we've had some issues with drivers on those machines lately, which require us to reset the machines manually.

Okay, no worries. However, if you expect this problem to persist for some time, I'd recommend temporarily excluding 5090s from gpuhunt so that they are not suggested to users who may want to try the integration once we announce it. We can easily remove or add offers to gpuhunt without a release.
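
One way to picture the suggested exclusion, as a sketch only, is to filter catalog rows by GPU name before they are published. The row shape mirrors the CSV columns shown earlier in this thread. `exclude_gpus` and the `RTX5090` spelling are assumptions, not gpuhunt's actual API or naming.

```python
from typing import Dict, Iterable, List, Set


def exclude_gpus(offers: Iterable[Dict[str, str]], blocked: Set[str]) -> List[Dict[str, str]]:
    """Return offers whose gpu_name is not in the blocked set.

    Names are compared case-insensitively and with spaces removed, so
    'RTX 5090' and 'RTX5090' are treated as the same GPU.
    Hypothetical helper; not part of gpuhunt.
    """
    blocked_norm = {name.replace(" ", "").upper() for name in blocked}
    return [
        offer
        for offer in offers
        if offer["gpu_name"].replace(" ", "").upper() not in blocked_norm
    ]


offers = [
    {"instance_name": "rtx49-8c-nr.1", "gpu_name": "RTX4090"},
    {"instance_name": "rtx59-16c-nr.1", "gpu_name": "RTX5090"},  # gpu_name spelling assumed
]
kept = exclude_gpus(offers, {"RTX 5090"})
```

Because the filter would run at catalog generation time, it fits the point above that offers can be removed or re-added without a dstack release.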

@jvstme jvstme merged commit 2db1752 into dstackai:master Jun 24, 2025
25 checks passed
@Slonegg
Contributor

Slonegg commented Jun 24, 2025

> Looks good to me! I'll merge the PR now, so the integration will be part of our next release this week. Thank you for the contribution.
>
> > regarding issues with 5090, we've had some issues with drivers on those machines lately, which require us to reset the machines manually.
>
> Okay, no worries. However, if you expect this problem to persist for some time, I'd recommend temporarily excluding 5090s from gpuhunt so that they are not suggested to users who may want to try the integration once we announce it. We can easily remove or add offers to gpuhunt without a release.

Thanks for the tip and the help with the integration! We've made some changes to the virtualization stack and will test to see if that helps with 5090 instability. If it doesn't, we'll either remove 5090 from the gpuhunt CloudRift manifest generation logic or disable it in the backend.
