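Ray Cluster Deployment and Maintenance
1. Environment Setup
1.1 Install Dependencies
Run the command for your cloud platform to install the required dependencies: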
AWS
pip install -U "ray[default]" boto3
GCP
pip install -U "ray[default]" google-api-python-client
Azure
pip install -U "ray[default]" azure-cli azure-core
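After installing, it can help to confirm that the packages are actually importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` and the mapping from platform to import names are assumptions made for illustration (for example, google-api-python-client is imported as `googleapiclient`).

```python
import importlib.util

# Import names assumed to correspond to the pip packages listed above.
REQUIRED_MODULES = {
    "aws": ["ray", "boto3"],
    "gcp": ["ray", "googleapiclient"],
    "azure": ["ray", "azure.cli", "azure.core"],
}


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "azure") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing
```

Calling `missing_modules(REQUIRED_MODULES["aws"])` returns an empty list when both `ray` and `boto3` are installed in the current environment.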
1.2 Configure Cloud Credentials
AWS
Configure the ~/.aws/credentials file as described in the AWS documentation.
GCP
Set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
Azure
Log in and select the subscription:
az login
az account set -s <subscription_id>
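A quick pre-flight check can catch a missing credential before the cluster launcher does. The sketch below only looks at the two file-based sources named in this section; the function name `credential_hints` is hypothetical. It does not cover other ways AWS or GCP can obtain credentials, nor Azure, whose login state is managed by the Azure CLI.

```python
import os
from pathlib import Path


def credential_hints(env, home):
    """Report which credential sources from this section look configured.

    `env` is a mapping such as os.environ; `home` is the user's home directory.
    """
    aws_file = Path(home) / ".aws" / "credentials"
    gcp_path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return {
        "aws": aws_file.is_file(),
        # The variable must be set AND point at an existing file.
        "gcp": bool(gcp_path) and Path(gcp_path).is_file(),
    }


if __name__ == "__main__":
    print(credential_hints(os.environ, Path.home()))
```

A `False` entry only means that particular source was not found; it is a hint, not proof that authentication will fail.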
2. Cluster Deployment
2.1 Create the Configuration File
Create a config.yaml file. Minimal example configurations for each platform follow:
AWS
cluster_name: minimal
provider:
    type: aws
    region: us-west-2
auth:
    ssh_user: ubuntu
GCP
cluster_name: minimal
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: <project_id>  # replace with your GCP project ID
auth:
    ssh_user: ubuntu
Azure
cluster_name: minimal
provider:
    type: azure
    location: westus2
    resource_group: ray-cluster
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa
    ssh_public_key: ~/.ssh/id_rsa.pub
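The settings in these examples are nested: `type`, `region`, `location`, and `resource_group` belong under `provider`, and the `ssh_*` keys belong under `auth`. As a rough illustration of that structure, the sketch below builds a minimal config as a Python dict and renders it as indented YAML text. Both helper names (`minimal_config`, `to_yaml`) are hypothetical, and the renderer handles only plain strings and nested dicts; a real project would more likely use a YAML library.

```python
def minimal_config(provider_type, **provider_fields):
    """Build a minimal cluster config as a nested dict."""
    provider = {"type": provider_type}
    provider.update(provider_fields)
    return {
        "cluster_name": "minimal",
        "provider": provider,
        "auth": {"ssh_user": "ubuntu"},
    }


def to_yaml(mapping, indent=0):
    """Render a dict of strings and nested dicts as block-style YAML text."""
    lines = []
    pad = " " * indent
    for key, value in mapping.items():
        if isinstance(value, dict):
            # A nested mapping: emit the key, then its children one level deeper.
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 4))
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


azure = minimal_config("azure", location="westus2", resource_group="ray-cluster")
print(to_yaml(azure))
```

The printed text has `type`, `location`, and `resource_group` indented beneath `provider:`, which is the layout the cluster launcher expects.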
2.2 Start the Cluster
ray up -y config.yaml
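The -y flag skips the interactive confirmation prompt, which makes the command suitable for scripts. If the launch is driven from Python, one possible approach is a thin wrapper around `subprocess`. This is a sketch under that assumption; the helper names `ray_up_command` and `launch` are hypothetical and not part of Ray.

```python
import shutil
import subprocess


def ray_up_command(config_path, assume_yes=True):
    """Return the argument list for `ray up`, mirroring the command above."""
    args = ["ray", "up"]
    if assume_yes:
        args.append("-y")  # skip the interactive confirmation prompt
    args.append(config_path)
    return args


def launch(config_path):
    """Run `ray up` and raise CalledProcessError if it exits non-zero."""
    if shutil.which("ray") is None:
        raise RuntimeError("the `ray` CLI was not found on PATH")
    subprocess.run(ray_up_command(config_path), check=True)


print(ray_up_command("config.yaml"))
```

Passing the arguments as a list rather than a single string avoids shell quoting issues when the config path contains spaces.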
3. Using the Cluster
3.1 Submit a Job
ray exec config.yaml 'python -c "import ray; ray.init()"'
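The command above relies on nested quoting: the outer single quotes make the whole Python invocation one argument to `ray exec`, and the inner double quotes survive to be interpreted on the cluster. The sketch below uses the standard `shlex` module to show how a POSIX shell splits the line and how the same line can be rebuilt safely. The helper name `ray_exec_command` is hypothetical.

```python
import shlex

command_line = "ray exec config.yaml 'python -c \"import ray; ray.init()\"'"

# Split the line the way a POSIX shell would.
args = shlex.split(command_line)
remote_command = args[3]  # the single argument that is run on the cluster


def ray_exec_command(config_path, remote_command):
    """Build a shell-safe `ray exec` command line."""
    parts = ["ray", "exec", shlex.quote(config_path), shlex.quote(remote_command)]
    return " ".join(parts)


print(args)
print(ray_exec_command("config.yaml", remote_command))
```

Here `args` has four elements, and rebuilding the line with `shlex.quote` reproduces the original command.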
3.2 Connect to the Cluster
ray attach config.yaml
3.3 Run a Sample Application
Create a script.py file:
from collections import Counter
import socket
import time

import ray

ray.init()

print(f'''This cluster consists of
    {len(ray.nodes())} nodes in total
    {ray.cluster_resources()['CPU']} CPU resources in total
''')

@ray.remote
def