Open Source Observability For AWS Inferentia Nodes Within Amazon EKS Clusters | Amazon Web Services

Recent developments in machine learning (ML) have led to increasingly large models, some with hundreds of billions of parameters. Although these models are more powerful, training them and running inference on them requires significant computational resources. Even with advanced distributed training libraries, it’s common for training and inference jobs to need hundreds of accelerators (GPUs or purpose-built ML chips such as AWS Trainium and AWS Inferentia), and therefore tens or hundreds of instances.

In such distributed environments, observability of both instances and ML chips becomes key to model performance fine-tuning and cost optimization. Metrics allow teams to understand workload behavior, optimize resource allocation and utilization, diagnose anomalies, and increase overall infrastructure efficiency. For data scientists, ML chip utilization and saturation are also relevant for capacity planning.

This post walks you through the Open Source Observability pattern for AWS Inferentia, which shows you how to monitor the performance of ML chips used in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, with data plane nodes based on Amazon Elastic Compute Cloud (Amazon EC2) instances of type Inf1 and Inf2.

The pattern is part of the AWS CDK Observability Accelerator, a set of opinionated modules to help you set up observability for Amazon EKS clusters. The AWS CDK Observability Accelerator is organized around patterns, which are reusable units for deploying multiple resources. The open source observability set of patterns instruments observability with Amazon Managed Grafana dashboards, an AWS Distro for OpenTelemetry collector to collect metrics, and Amazon Managed Service for Prometheus to store them.

Solution overview

The following diagram illustrates the solution architecture.

This solution deploys an Amazon EKS cluster with a node group that includes Inf1 instances.

The AMI type of the node group is AL2_x86_64_GPU, which uses the Amazon EKS-optimized accelerated Amazon Linux AMI. In addition to the standard Amazon EKS-optimized AMI configuration, the accelerated AMI includes the NeuronX runtime.

To access the ML chips from Kubernetes, the pattern deploys the AWS Neuron device plugin.

Metrics are exposed to Amazon Managed Service for Prometheus by the neuron-monitor DaemonSet, which deploys a minimal container, with the Neuron tools installed. Specifically, the neuron-monitor DaemonSet runs the neuron-monitor command piped into the neuron-monitor-prometheus.py companion script (both commands are part of the container):

neuron-monitor | neuron-monitor-prometheus.py --port <port>

The command uses the following components:

neuron-monitor collects metrics and stats from the Neuron applications running on the system and streams the collected data to stdout in JSON format
neuron-monitor-prometheus.py maps and exposes the telemetry data from JSON format into Prometheus-compatible format
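
To make the second stage of that pipe more concrete, the following is a minimal sketch of how a JSON report could be turned into Prometheus text exposition lines. It is not the code of neuron-monitor-prometheus.py: the field names in the sample report (neuroncore_utilization, memory_used_bytes), the instance_type label, and the helper names flatten and to_prometheus are all hypothetical, and the real neuron-monitor schema is richer than this.

```python
import json


def flatten(prefix, value, out):
    """Recursively collect numeric leaves as (metric_name, value) pairs."""
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(f"{prefix}_{key}" if prefix else key, child, out)
    elif isinstance(value, bool):
        return  # bool is a subclass of int; skip it on purpose
    elif isinstance(value, (int, float)):
        out.append((prefix, value))


def to_prometheus(report_line, labels):
    """Map one JSON report (a string) to Prometheus text exposition lines."""
    report = json.loads(report_line)
    samples = []
    flatten("", report, samples)
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return "\n".join(f"{name}{{{label_text}}} {value}" for name, value in samples)


# Hypothetical report; the real neuron-monitor output uses a different schema.
sample = '{"neuroncore_utilization": {"core0": 12.5, "core1": 0}, "memory_used_bytes": 1024}'
exposition = to_prometheus(sample, {"instance_type": "inf1.2xlarge"})
print(exposition)
```

In a real pipe, each report read from stdin would go through a function like to_prometheus, and the result would be served over HTTP on the chosen port for the collector to scrape. Whether reports arrive one per line is an assumption of this sketch.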

Data is visualized in Amazon Managed Grafana by the corresponding dashboard.

The rest of the setup to collect and visualize metrics with Amazon Managed Service for Prometheus and Amazon Managed Grafana is similar to that used in other open source based patterns, which are included in the AWS Observability Accelerator for CDK GitHub repository.

Prerequisites

You need the following to complete the steps in this post:

Set up the environment

Complete the following steps to set up your environment:

Open a terminal window and run the following commands:

export AWS_REGION=<YOUR AWS REGION>
export ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)

Retrieve the workspace IDs of any existing Amazon Managed Grafana workspace:

aws grafana list-workspaces

The following is our sample output:

{
  "workspaces": [
    {
      "authentication": {
        "providers": [
          "AWS_SSO"
        ]
      },
      "created": "2023-06-07T12:23:56.625000-04:00",
      "description": "accelerator-workspace",
      "endpoint": "g-XYZ.grafana-workspace.us-east-2.amazonaws.com",
      "grafanaVersion": "9.4",
      "id": "g-XYZ",
      "modified": "2023-06-07T12:30:09.892000-04:00",
      "name": "accelerator-workspace",
      "notificationDestinations": [
        "SNS"
      ],
      "status": "ACTIVE",
      "tags": {}
    }
  ]
}

Assign the values of id and endpoint to the following environment variables:

export COA_AMG_WORKSPACE_ID="<<YOUR-WORKSPACE-ID, similar to the above g-XYZ, without quotation marks>>"
export COA_AMG_ENDPOINT_URL="<<https://YOUR-WORKSPACE-URL, including protocol (i.e. https://), without quotation marks, similar to the above https://g-XYZ.grafana-workspace.us-east-2.amazonaws.com>>"

COA_AMG_ENDPOINT_URL needs to include https://.
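
If you prefer not to copy the values by hand, the following sketch shows one way to derive both variables from the list-workspaces JSON output. The helper name workspace_settings is hypothetical, and the sample is a trimmed copy of the output shown earlier; in practice you would feed it the real output of aws grafana list-workspaces.

```python
import json


def workspace_settings(list_workspaces_json, name):
    """Return (workspace_id, endpoint_url) for the workspace called `name`."""
    for workspace in json.loads(list_workspaces_json)["workspaces"]:
        if workspace["name"] == name:
            endpoint = workspace["endpoint"]
            # The CLI returns a bare host name; the pattern expects the protocol too.
            if not endpoint.startswith("https://"):
                endpoint = "https://" + endpoint
            return workspace["id"], endpoint
    raise LookupError(f"no workspace named {name!r}")


# Trimmed copy of the sample output above.
sample = json.dumps({
    "workspaces": [
        {
            "endpoint": "g-XYZ.grafana-workspace.us-east-2.amazonaws.com",
            "id": "g-XYZ",
            "name": "accelerator-workspace",
            "status": "ACTIVE",
        }
    ]
})

workspace_id, endpoint_url = workspace_settings(sample, "accelerator-workspace")
print(f"export COA_AMG_WORKSPACE_ID={workspace_id}")
print(f"export COA_AMG_ENDPOINT_URL={endpoint_url}")
```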

Create a Grafana API key from the Amazon Managed Grafana workspace (the key below is valid for 432,000 seconds, that is, 5 days):

export AMG_API_KEY=$(aws grafana create-workspace-api-key \
  --key-name "grafana-operator-key" \
  --key-role "ADMIN" \
  --seconds-to-live 432000 \
  --workspace-id "$COA_AMG_WORKSPACE_ID" \
  --query key \
  --output text)

Set up a secret in AWS Systems Manager:

aws ssm put-parameter --name "/cdk-accelerator/grafana-api-key" \
  --type "SecureString" \
  --value "$AMG_API_KEY" \
  --region "$AWS_REGION"

The secret will be accessed by the External Secrets add-on and made available as a native Kubernetes secret in the EKS cluster.

Bootstrap the AWS CDK environment

The first step to any AWS CDK deployment is bootstrapping the environment. You use the cdk bootstrap command in the AWS CDK CLI to prepare the environment (a combination of AWS account and AWS Region) with resources required by AWS CDK to perform deployments into that environment. AWS CDK bootstrapping is needed for each account and Region combination, so if you already bootstrapped AWS CDK in a Region, you don’t need to repeat the bootstrapping process.

cdk bootstrap aws://$ACCOUNT_ID/$AWS_REGION

Deploy the solution

Complete the following steps to deploy the solution:

Clone the cdk-aws-observability-accelerator repository and install the dependency packages. This repository contains AWS CDK v2 code written in TypeScript.

git clone https://github.com/aws-observability/cdk-aws-observability-accelerator.git
cd cdk-aws-observability-accelerator

The actual settings for Grafana dashboard JSON files are expected to be specified in the AWS CDK context. You need to update context in the cdk.json file, located in the current directory. The location of the dashboard is specified by the fluxRepository.values.GRAFANA_NEURON_DASH_URL parameter, and neuronNodeGroup is used to set the instance type, number, and Amazon Elastic Block Store (Amazon EBS) size used for the nodes.

Enter the following snippet into cdk.json, replacing context:

"context": {
    "fluxRepository": {
      "name": "grafana-dashboards",
      "namespace": "grafana-operator",
      "repository": {
        "repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
        "name": "grafana-dashboards",
        "targetRevision": "main",
        "path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
      },
      "values": {
        "GRAFANA_CLUSTER_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
        "GRAFANA_KUBELET_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
        "GRAFANA_NSWRKLDS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
        "GRAFANA_NODEEXP_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
        "GRAFANA_NODES_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
        "GRAFANA_WORKLOADS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json",
        "GRAFANA_NEURON_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/neuron/neuron-monitor.json"
      },
      "kustomizations": [
        {
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
        },
        {
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/neuron"
        }
      ]
    },
     "neuronNodeGroup": {
      "instanceClass": "inf1",
      "instanceSize": "2xlarge",
      "desiredSize": 1, 
      "minSize": 1, 
      "maxSize": 3,
      "ebsSize": 512
    }
  }
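
Before deploying, a quick sanity check of the neuronNodeGroup values can catch typos early. The following sketch is only an illustration: check_node_group is a hypothetical helper, the restriction to inf1 and inf2 reflects the scope of this post rather than a rule enforced by the pattern, and joining instanceClass and instanceSize with a dot is an assumption about how the full instance type name is formed.

```python
def check_node_group(group):
    """Return a list of problems found in a neuronNodeGroup-style dict."""
    problems = []
    if group["instanceClass"] not in ("inf1", "inf2"):
        problems.append(f"unexpected instanceClass: {group['instanceClass']}")
    if not group["minSize"] <= group["desiredSize"] <= group["maxSize"]:
        problems.append("expected minSize <= desiredSize <= maxSize")
    if group["ebsSize"] <= 0:
        problems.append("ebsSize must be positive")
    return problems


# Same values as the neuronNodeGroup entry in the snippet above.
node_group = {
    "instanceClass": "inf1",
    "instanceSize": "2xlarge",
    "desiredSize": 1,
    "minSize": 1,
    "maxSize": 3,
    "ebsSize": 512,
}

problems = check_node_group(node_group)
# Assumed naming convention: "<class>.<size>".
instance_type = f"{node_group['instanceClass']}.{node_group['instanceSize']}"
print(instance_type, problems or "looks consistent")
```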

You can replace the Inf1 instance type with Inf2 and change the size as needed. To check availability in your selected Region, run the following command (amend Values as you see fit):

aws ec2 describe-instance-type-offerings \
  --filters Name=instance-type,Values="inf1*" \
  --query "InstanceTypeOfferings[].InstanceType" \
  --region "$AWS_REGION"

Install the project dependencies:

npm install

Run the following commands to deploy the open source observability pattern:

make build
make pattern single-new-eks-inferentia-opensource-observability deploy

Validate the solution

Complete the following steps to validate the solution:

Run the update-kubeconfig command. You should be able to get the command from the output message of the previous command:

aws eks update-kubeconfig --name single-new-eks-inferentia-opensource... --region <your region> --role-arn arn:aws:iam::xxxxxxxxx:role/single-new-eks-....

Verify the resources you created:

kubectl get pods -A

The following screenshot shows our sample output.

Make sure the neuron-device-plugin-daemonset DaemonSet is running:

kubectl get ds neuron-device-plugin-daemonset --namespace kube-system

The following is our expected output:

NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin-daemonset   1         1         1       1            1           <none>          2h

Verify that the neuron-monitor DaemonSet is running:

kubectl get ds neuron-monitor --namespace kube-system

The following is our expected output:

NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         1       1            1           <none>          2h
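
If you want to automate this check, for example in a deployment script, the following sketch parses the tabular output of kubectl get ds and reports whether every listed DaemonSet has as many ready pods as desired. The helper name daemonset_ready is hypothetical, and the sketch assumes the default column layout shown above.

```python
def daemonset_ready(table_text):
    """Return True when every row has DESIRED > 0 and DESIRED == READY."""
    lines = [line for line in table_text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    desired_col = header.index("DESIRED")
    ready_col = header.index("READY")
    rows = [line.split() for line in lines[1:]]
    return bool(rows) and all(
        int(row[desired_col]) > 0 and row[desired_col] == row[ready_col]
        for row in rows
    )


# Expected output copied from above.
healthy = """
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         1       1            1           <none>          2h
"""

# Made-up example of a DaemonSet whose pod is not ready yet.
pending = """
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         0       1            0           <none>          5s
"""

print(daemonset_ready(healthy), daemonset_ready(pending))
```

Parsing the human-readable table keeps the sketch close to the output shown in this post; for anything long-lived, requesting structured output from kubectl is likely the more robust choice.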

To verify that the Neuron devices and cores are visible, run the neuron-ls and neuron-top commands from, for example, your neuron-monitor pod (you can get the pod’s name from the output of kubectl get pods -A):

kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-ls"

The following screenshot shows our expected output.

kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-top"

The following screenshot shows our expected output.

Visualize data using the Grafana Neuron dashboard

Log in to your Amazon Managed Grafana workspace and navigate to the Dashboards panel. You should see a dashboard named Neuron / Monitor.

To see some interesting metrics on the Grafana dashboard, we apply the following manifest:

curl https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/k8s-deployment-manifest-templates/neuron/pytorch-inference-resnet50.yml | kubectl apply -f -

This is a sample workload that compiles the torchvision ResNet50 model and runs repetitive inference in a loop to generate telemetry data.

To verify that the pod was successfully deployed, run the following command:

kubectl get pods

You should see a pod named pytorch-inference-resnet50.

After a few minutes, looking into the Neuron / Monitor dashboard, you should see the gathered metrics similar to the following screenshots.

Grafana Operator and Flux always work together to synchronize your dashboards with Git. If you delete your dashboards by accident, they will be re-provisioned automatically.

Clean up

You can delete the whole AWS CDK stack with the following command:

make pattern single-new-eks-inferentia-opensource-observability destroy

Conclusion

In this post, we showed you how to introduce observability, with open source tooling, into an EKS cluster featuring a data plane running EC2 Inf1 instances. We started by selecting the Amazon EKS-optimized accelerated AMI for the data plane nodes, which includes the Neuron container runtime, providing access to AWS Inferentia and Trainium Neuron devices. Then, to expose the Neuron cores and devices to Kubernetes, we deployed the Neuron device plugin. The actual collection and mapping of telemetry data into Prometheus-compatible format was achieved via neuron-monitor and neuron-monitor-prometheus.py. Metrics were sourced from Amazon Managed Service for Prometheus and displayed on the Neuron dashboard of Amazon Managed Grafana.

We recommend that you explore additional observability patterns in the AWS Observability Accelerator for CDK GitHub repo. To learn more about Neuron, refer to the AWS Neuron documentation.

About the Author

Riccardo Freschi is a Sr. Solutions Architect at AWS, focusing on application modernization. He works closely with partners and customers to help them transform their IT landscapes in their journey to the AWS Cloud by refactoring existing applications and building new ones.

Source: https://aws.amazon.com/blogs/machine-learning/open-source-observability-for-aws-inferentia-nodes-within-amazon-eks-clusters/
