Integrate HyperPod clusters with Active Directory for seamless multi-user login | Amazon Web Services

Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption.

Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files, run their own jobs, and want to avoid impacting each other’s work. To achieve this multi-user environment, you can take advantage of Linux’s user and group mechanism and statically create multiple users on each instance through lifecycle scripts. The drawback to this approach, however, is that user and group settings are duplicated across multiple instances in the cluster, making it difficult to configure them consistently on all instances, such as when a new team member joins.

To solve this pain point, we can use Lightweight Directory Access Protocol (LDAP) and LDAP over TLS/SSL (LDAPS) to integrate with a directory service such as AWS Directory Service for Microsoft Active Directory. With the directory service, you can centrally maintain users and groups, and their permissions.

In this post, we introduce a solution to integrate HyperPod clusters with AWS Managed Microsoft AD, and explain how to achieve a seamless multi-user login environment with a centrally maintained directory.

Solution overview

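The solution uses the following AWS services and resources: a SageMaker HyperPod cluster, AWS Managed Microsoft AD, a Network Load Balancer (NLB) that terminates TLS in front of the directory, AWS Certificate Manager (ACM) to hold the certificate, and an EC2 Windows instance to administer users and groups in the directory.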

We also use AWS CloudFormation to deploy a stack to create the prerequisites for the HyperPod cluster: VPC, subnets, security group, and Amazon FSx for Lustre volume.

The following diagram illustrates the high-level solution architecture.

Architecture diagram for HyperPod and Active Directory integration

In this solution, HyperPod cluster instances use the LDAPS protocol to connect to the AWS Managed Microsoft AD via an NLB. We use TLS termination by installing a certificate on the NLB. To configure LDAPS in HyperPod cluster instances, the lifecycle script installs and configures System Security Services Daemon (SSSD), an open source client software for LDAP/LDAPS.

Prerequisites

This post assumes you already know how to create a basic HyperPod cluster without SSSD. For more details on how to create HyperPod clusters, refer to Getting started with SageMaker HyperPod and the HyperPod workshop.

Also, in the setup steps, you will use a Linux machine to generate a self-signed certificate and obtain an obfuscated password for the AD reader user. If you don’t have a Linux machine, you can create an EC2 Linux instance or use AWS CloudShell.

Create a VPC, subnets, and a security group

Follow the instructions in the Own Account section of the HyperPod workshop. You will deploy a CloudFormation stack and create prerequisite resources such as VPC, subnets, security group, and FSx for Lustre volume. You need to create both a primary subnet and backup subnet when deploying the CloudFormation stack, because AWS Managed Microsoft AD requires at least two subnets in different Availability Zones.

In this post, for simplicity, we use the same VPC, subnets, and security group for both the HyperPod cluster and directory service. If you need to use different networks for the cluster and directory service, make sure security groups and route tables are configured so that they can communicate with each other.

Create AWS Managed Microsoft AD on Directory Service

Complete the following steps to set up your directory:

  1. On the Directory Service console, choose Directories in the navigation pane.
  2. Choose Set up directory.
  3. For Directory type, select AWS Managed Microsoft AD.
  4. Choose Next.
    Directory type selection screen
  5. For Edition, select Standard Edition.
  6. For Directory DNS name, enter your preferred directory DNS name (for example, hyperpod.abc123.com).
  7. For Admin password, set a password and save it for later use.
  8. Choose Next.
    Directory creation configuration screen
  9. In the Networking section, specify the VPC and two private subnets you created.
  10. Choose Next.
    Directory network configuration screen
  11. Review the configuration and pricing, then choose Create directory.
    Directory creation confirmation screen
    The directory creation starts. Wait until the status changes from Creating to Active, which can take 20–30 minutes.
  12. When the status changes to Active, open the detail page of the directory and take note of the DNS addresses for later use.
    Directory details screen
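
The directory DNS name chosen in step 6 reappears later in two derived forms: the short name used as the login prefix on the Windows instance, and the LDAP search base used by SSSD. The following is a minimal sketch of that derivation, assuming the short name is simply the first DNS label (the default described in this post; a directory created with a custom NetBIOS name would differ). The function name `directory_names` is hypothetical.

```python
from typing import Tuple


def directory_names(dns_name: str) -> Tuple[str, str]:
    """Return (short_name, base_dn) derived from a directory DNS name.

    Assumption: the short name is the first DNS label, and the base DN is
    one dc= component per label.
    """
    labels = [label for label in dns_name.strip().strip(".").split(".") if label]
    if len(labels) < 2:
        raise ValueError(f"expected a fully qualified DNS name, got {dns_name!r}")
    short_name = labels[0]
    base_dn = ",".join(f"dc={label}" for label in labels)
    return short_name, base_dn


short_name, base_dn = directory_names("hyperpod.abc123.com")
print(short_name)  # hyperpod
print(base_dn)     # dc=hyperpod,dc=abc123,dc=com
```

Deriving both values from one input avoids typing the domain components by hand in several places.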

Create an NLB in front of Directory Service

To create the NLB, complete the following steps:

  1. On the Amazon EC2 console, choose Target groups in the navigation pane.
  2. Choose Create target group.
  3. Create a target group with the following parameters:
    1. For Choose a target type, select IP addresses.
    2. For Target group name, enter LDAP.
    3. For Protocol: Port, choose TCP and enter 389.
    4. For IP address type, select IPv4.
    5. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    6. For Health check protocol, choose TCP.
  4. Choose Next.
    Load balancing target creation configuration screen
  5. In the Register targets section, register the directory service’s DNS addresses as the targets.
  6. For Ports, choose Include as pending below.
    Load balancing target registration screen
    The addresses are added in the Review targets section with Pending status.
  7. Choose Create target group.
    Load balancing target review screen
  8. On the Load Balancers console, choose Create load balancer.
  9. Under Network Load Balancer, choose Create.
    Load balancer type choosing screen
  10. Configure an NLB with the following parameters:
    1. For Load balancer name, enter a name (for example, nlb-ds).
    2. For Scheme, select Internal.
    3. For IP address type, select IPv4.
       NLB creation basic configuration section
    4. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    5. Under Mappings, select the two private subnets and their CIDR ranges (which you created with the CloudFormation template).
    6. For Security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
       NLB creation network mapping and security groups configurations
  11. In the Listeners and routing section, specify the following parameters:
    1. For Protocol, choose TCP.
    2. For Port, enter 389.
    3. For Default action, choose the target group named LDAP.

    Here, we are adding a listener for LDAP. We will add LDAPS later.

  12. Choose Create load balancer.
    NLB listeners routing configuration screen
    Wait until the status changes from Provisioning to Active, which can take 3–5 minutes.
  13. When the status changes to Active, open the detail page of the provisioned NLB and take note of the DNS name (xyzxyz.elb.region-name.amazonaws.com) for later use.
    NLB details screen
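
Once the NLB is active, it can be useful to confirm that the listener port is reachable before moving on. The following is a minimal sketch of a TCP reachability probe. It is not part of the original setup, the helper name `port_is_open` is hypothetical, and because the NLB is internal the probe only makes sense from a machine inside the same VPC.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `port_is_open("xyzxyz.elb.region-name.amazonaws.com", 389)` would probe the LDAP listener, with the placeholder replaced by the DNS name you noted. A successful connect only shows that something accepted the TCP connection; it does not validate the directory itself.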

Create a self-signed certificate and import it to Certificate Manager

To create a self-signed certificate, complete the following steps:

  1. On your Linux-based environment (local laptop, EC2 Linux instance, or CloudShell), run the following OpenSSL commands to create a self-signed certificate and private key:
    $ openssl genrsa 2048 > ldaps.key
    
    $ openssl req -new -key ldaps.key -out ldaps_server.csr
    
    You are about to be asked to enter information that will be incorporated
    into your certificate request.
    What you are about to enter is what is called a Distinguished Name or a DN.
    There are quite a few fields but you can leave some blank
    For some fields there will be a default value,
    If you enter '.', the field will be left blank.
    -----
    Country Name (2 letter code) [AU]:US
    State or Province Name (full name) [Some-State]:Washington
    Locality Name (eg, city) []:Bellevue
    Organization Name (eg, company) [Internet Widgits Pty Ltd]:CorpName
    Organizational Unit Name (eg, section) []:OrgName
    Common Name (e.g., server FQDN or YOUR name) []:nlb-ds-abcd1234.elb.region.amazonaws.com
    Email Address []:[email protected]
    
    Please enter the following 'extra' attributes
    to be sent with your certificate request
    A challenge password []:
    An optional company name []:
    
    $ openssl x509 -req -sha256 -days 365 -in ldaps_server.csr -signkey ldaps.key -out ldaps.crt
    
    Certificate request self-signature ok
    subject=C = US, ST = Washington, L = Bellevue, O = CorpName, OU = OrgName, CN = nlb-ds-abcd1234.elb.region.amazonaws.com, emailAddress = [email protected]
    
    $ chmod 600 ldaps.key

  2. On the Certificate Manager console, choose Import.
  3. Enter the certificate body and private key, from the contents of ldaps.crt and ldaps.key respectively.
  4. Choose Next.
    Certificate importing screen
  5. Add any optional tags, then choose Next.
    Certificate tag editing screen
  6. Review the configuration and choose Import.
    Certificate import review screen
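
A common slip when importing by hand is to paste the key into the certificate field or the other way around. The following is a minimal sketch of a pre-import sanity check that only inspects PEM block labels; it does not validate the cryptographic contents, and the helper names are hypothetical.

```python
import re
from typing import List

# Matches one PEM block and captures its label, e.g. "CERTIFICATE".
PEM_BLOCK = re.compile(r"-----BEGIN ([A-Z0-9 ]+)-----\s.*?-----END \1-----", re.DOTALL)


def pem_labels(text: str) -> List[str]:
    """Return the labels of all PEM blocks found in text, in order."""
    return [match.group(1) for match in PEM_BLOCK.finditer(text)]


def fields_look_right(cert_text: str, key_text: str) -> bool:
    """True if cert_text holds one certificate and key_text one private key."""
    key_labels = pem_labels(key_text)
    return (
        pem_labels(cert_text) == ["CERTIFICATE"]
        and len(key_labels) == 1
        and key_labels[0].endswith("PRIVATE KEY")
    )
```

You could call `fields_look_right(open("ldaps.crt").read(), open("ldaps.key").read())` before pasting the two files into the console. The `endswith` check accepts both `PRIVATE KEY` and `RSA PRIVATE KEY`, because the label that `openssl genrsa` writes depends on the OpenSSL version.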

Add an LDAPS listener

We added a listener for LDAP already in the NLB. Now we add a listener for LDAPS with the imported certificate. Complete the following steps:

  1. On the Load Balancers console, navigate to the NLB details page.
  2. On the Listeners tab, choose Add listener.
    NLB listeners screen with add listener button
  3. Configure the listener with the following parameters:
    1. For Protocol, choose TLS.
    2. For Port, enter 636.
    3. For Default action, choose LDAP.
    4. For Certificate source, select From ACM.
    5. For Certificate, enter what you imported in ACM.
  4. Choose Add.
    NLB listener configuration screen
    Now the NLB listens to both LDAP and LDAPS. We recommend deleting the LDAP listener because it transmits data without encryption, unlike LDAPS.
    NLB listeners list with LDAP and LDAPS
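
With the TLS listener in place, a client that trusts ldaps.crt should be able to complete a handshake on port 636. The following is a minimal sketch of such a check using Python's standard `ssl` module. It is an illustration rather than part of the original setup, the function names are hypothetical, and it must run from inside the VPC because the NLB is internal.

```python
import socket
import ssl
from typing import Optional


def make_ldaps_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client context that verifies the server certificate and host name."""
    # PROTOCOL_TLS_CLIENT enables CERT_REQUIRED and check_hostname by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file is not None:
        # For a self-signed certificate, the certificate itself is the trust anchor.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


def ldaps_handshake_ok(host: str, ca_file: str, port: int = 636,
                       timeout: float = 5.0) -> bool:
    """Return True if a verified TLS handshake with host:port succeeds."""
    ctx = make_ldaps_context(ca_file)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        # ssl.SSLError and certificate verification errors are OSError subclasses.
        return False
```

For example, `ldaps_handshake_ok("nlb-ds-xyzxyz.elb.us-west-2.amazonaws.com", "ldaps.crt")` would test the listener against the self-signed certificate. The host name passed in should match the Common Name entered when the certificate was created.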

Create an EC2 Windows instance to administer users and groups in the AD

To create and maintain users and groups in the AD, complete the following steps:

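  1. On the Amazon EC2 console, choose Instances in the navigation pane.
  2. Choose Launch instances.
  3. For Name, enter a name for your instance.
  4. For Amazon Machine Image, choose Microsoft Windows Server 2022 Base.
  5. For Instance type, choose t2.micro.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose either of the two subnets you created with the CloudFormation template.
    3. For Common security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
  7. For Configure storage, set storage to 30 GB gp2.
  8. In the Advanced details section, for Domain join directory, choose the AD you created.
  9. For IAM instance profile, choose an AWS Identity and Access Management (IAM) role with at least the AmazonSSMManagedEC2InstanceDefaultPolicy policy.
  10. Review the summary and choose Launch instance.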

Create users and groups in AD using the EC2 Windows instance

Using Remote Desktop, connect to the EC2 Windows instance you created in the previous step. Using an RDP client is recommended over using a browser-based Remote Desktop so that you can exchange the contents of the clipboard with your local machine using copy-paste operations. For more details about connecting to EC2 Windows instances, refer to Connect to your Windows instance.

If you are prompted for a login credential, use hyperpod\Admin (where hyperpod is the first part of your directory DNS name) as the user name, and use the admin password you set for the directory service.

  1. When the Windows desktop screen opens, choose Server Manager from the Start menu.
    Dashboard screen on Server Manager
  2. Choose Local Server in the navigation pane, and confirm that the domain is what you specified for the directory service.
    Local Server screen on Server Manager
  3. On the Manage menu, choose Add Roles and Features.
    Drop down menu opened from Manage button
  4. Choose Next until you are at the Features page.
    Add Roles and Features Wizard
  5. Expand the feature Remote Server Administration Tools, expand Role Administration Tools, and select AD DS and AD LDS Tools and Active Directory Rights Management Service.
  6. Choose Next and Install.
    Features selection screen
    Feature installation starts.
  7. When the installation is complete, choose Close.
    Feature installation progress screen
  8. Open Active Directory Users and Computers from the Start menu.
    Active Directory Users and Computers window
  9. Under hyperpod.abc123.com, expand hyperpod.
  10. Choose (right-click) hyperpod, choose New, and choose Organizational Unit.
    Context menu opened to create an Organizational Unit
  11. Create an organizational unit called Groups.
    Organizational Unit creation dialog
  12. Choose (right-click) Groups, choose New, and choose Group.
    Context menu opened to create groups
  13. Create a group called ClusterAdmin.
    Group creation dialog for ClusterAdmin
  14. Create a second group called ClusterDev.
    Group creation dialog for ClusterDev
  15. Choose (right-click) Users, choose New, and choose User.
  16. Create a new user.
    User creation dialog
  17. Choose (right-click) the user and choose Add to a group.
    Context menu opened to add a user to a group
  18. Add your users to the groups ClusterAdmin or ClusterDev.
    Group selection screen to add a user to a group
    Users added to the ClusterAdmin group will have sudo privilege on the cluster.
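
Group membership is what later decides who may log in to which node type and who gets sudo. The following is a minimal sketch of that decision, assuming a simple mapping from node type to allowed group names. The tables `SSH_ALLOW` and `SUDOERS` and the function `is_permitted` are hypothetical illustrations, not the lifecycle script's actual logic or values.

```python
from typing import Dict, Iterable, List

# Hypothetical policy tables for illustration only; the real values live in
# the lifecycle script configuration.
SSH_ALLOW: Dict[str, List[str]] = {
    "controller": ["ClusterAdmin"],
    "compute": ["ClusterAdmin", "ClusterDev"],
    "login": ["ClusterAdmin", "ClusterDev"],
}
SUDOERS: Dict[str, List[str]] = {
    "controller": ["ClusterAdmin"],
    "compute": ["ClusterAdmin"],
    "login": ["ClusterAdmin"],
}


def is_permitted(node_type: str, user_groups: Iterable[str],
                 policy: Dict[str, List[str]]) -> bool:
    """True if any of the user's groups is listed for the given node type."""
    allowed = set(policy.get(node_type, []))
    return not allowed.isdisjoint(user_groups)


# A developer who is only in ClusterDev:
for node in ("controller", "compute", "login"):
    print(node,
          is_permitted(node, ["ClusterDev"], SSH_ALLOW),
          is_permitted(node, ["ClusterDev"], SUDOERS))
```

Keeping the rule as "any group matches" means a user is granted access by adding them to a group in AD, with no change on the cluster instances.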

Create a ReadOnly user in AD

Create a user called ReadOnly under Users. The ReadOnly user is used by the cluster to programmatically access users and groups in AD.

User creation dialog to create ReadOnly user

Take note of the password for later use.

Password entering screen for ReadOnly user

(For SSH public key authentication) Add SSH public keys to users

By storing an SSH public key to a user in AD, you can log in without entering a password. You can use an existing key pair, or you can create a new key pair with OpenSSH’s ssh-keygen command. For more information about generating a key pair, refer to Create a key pair for your Amazon EC2 instance.

  1. In Active Directory Users and Computers, on the View menu, enable Advanced Features.
    View menu opened to enable Advanced Features
  2. Open the Properties dialog of the user.
  3. On the Attribute Editor tab, choose altSecurityIdentities and choose Edit.
    Attribute Editor tab on User Properties dialog
  4. For Value to add, choose Add.
  5. For Values, add an SSH public key.
  6. Choose OK.
    Attribute editing dialog for altSecurityIdentities
    Confirm that the SSH public key appears as an attribute.
    Attribute Editor tab with altSecurityIdentities configured
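
A public key that was truncated or wrapped during copy-paste will fail silently at login time. The following is a minimal sketch of a check you could run on the key line before pasting it into the attribute. It relies on the OpenSSH public key layout, in which the base64 data begins with a length-prefixed copy of the key type. The function name `ssh_public_key_type` is hypothetical.

```python
import base64
import struct


def ssh_public_key_type(line: str) -> str:
    """Validate an OpenSSH public key line and return its key type.

    Expected form: '<type> <base64-data> [comment]'. Raises ValueError if the
    data is not valid base64 or if the type embedded in the data does not
    match the declared type.
    """
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError("expected '<type> <base64-data> [comment]'")
    declared, b64 = parts[0], parts[1]
    # binascii.Error raised here is a ValueError subclass.
    blob = base64.b64decode(b64, validate=True)
    if len(blob) < 4:
        raise ValueError("key data too short")
    (name_len,) = struct.unpack(">I", blob[:4])  # big-endian uint32 length
    embedded = blob[4:4 + name_len]
    if len(embedded) != name_len or embedded.decode("ascii", "replace") != declared:
        raise ValueError("declared key type does not match key data")
    return declared
```

This only checks the framing of the key line; it does not confirm that the key belongs to the user or that the matching private key exists.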

Get an obfuscated password for the ReadOnly user

To avoid including a plain text password in the SSSD configuration file, you obfuscate the password. For this step, you need a Linux environment (local laptop, EC2 Linux instance, or CloudShell).

Install the sssd-tools package on the Linux machine to install the Python module pysss for obfuscation:

# Ubuntu
$ sudo apt install sssd-tools

# Amazon Linux
$ sudo yum install sssd-tools

Run the following one-line Python script. Input the password of the ReadOnly user. You will get the obfuscated password.

$ python3 -c "import getpass,pysss; print(pysss.password().encrypt(getpass.getpass('AD reader user password: ').strip(), pysss.password().AES_256))"
AD reader user password: (Enter ReadOnly user password) 
AAAQACK2....

Create a HyperPod cluster with an SSSD-enabled lifecycle script

Next, you create a HyperPod cluster with LDAPS/Active Directory integration.

  1. Find the configuration file config.py in your lifecycle script directory, open it with your text editor, and edit the properties in the Config class and SssdConfig class:
    1. Set True for enable_sssd to enable setting up SSSD.
    2. The SssdConfig class contains configuration parameters for SSSD.
    3. Make sure you use the obfuscated password for the ldap_default_authtok property, not a plain text password.
    # Basic configuration parameters
    class Config:
             :
        # Set true if you want to install SSSD for ActiveDirectory/LDAP integration.
        # You need to configure parameters in SssdConfig as well.
        enable_sssd = True
    # Configuration parameters for ActiveDirectory/LDAP/SSSD
    class SssdConfig:
    
        # Name of domain. Can be default if you are not sure.
        domain = "default"
    
        # Comma separated list of LDAP server URIs
        ldap_uri = "ldaps://nlb-ds-xyzxyz.elb.us-west-2.amazonaws.com"
    
        # The default base DN to use for performing LDAP user operations
        ldap_search_base = "dc=hyperpod,dc=abc123,dc=com"
    
        # The default bind DN to use for performing LDAP operations
        ldap_default_bind_dn = "CN=ReadOnly,OU=Users,OU=hyperpod,DC=hyperpod,DC=abc123,DC=com"
    
        # "password" or "obfuscated_password". Obfuscated password is recommended.
        ldap_default_authtok_type = "obfuscated_password"
    
        # You need to modify this parameter with the obfuscated password, not plain text password
        ldap_default_authtok = "placeholder"
    
        # SSH authentication method - "password" or "publickey"
        ssh_auth_method = "publickey"
    
        # Home directory. You can change it to "/home/%u" if your cluster doesn't use FSx volume.
        override_homedir = "/fsx/%u"
    
        # Group names to accept SSH login
        ssh_allow_groups = {
            "controller" : ["ClusterAdmin", "ubuntu"],
            "compute" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
            "login" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
        }
    
        # Group names for sudoers
        sudoers_groups = {
            "controller" : ["ClusterAdmin", "ClusterDev"],
            "compute" : ["ClusterAdmin", "ClusterDev"],
            "login" : ["ClusterAdmin", "ClusterDev"],
        }
    

  2. Copy the certificate file ldaps.crt to the same directory (where config.py exists).
  3. Upload the modified lifecycle script files to your Amazon Simple Storage Service (Amazon S3) bucket, and create a HyperPod cluster with it.
  4. Wait until the status changes to InService.
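
To make the role of the SssdConfig values more concrete, the following is a minimal sketch of how such parameters could be rendered into an sssd.conf file. This is an assumption about a plausible mapping, not the lifecycle script's actual implementation. The function `render_sssd_conf` and the certificate path are hypothetical, while the option names come from the sssd.conf and sssd-ldap manual pages.

```python
import configparser
import io


def render_sssd_conf(domain: str, ldap_uri: str, search_base: str, bind_dn: str,
                     authtok: str, ca_cert_path: str, homedir: str) -> str:
    """Render a minimal sssd.conf for an LDAPS-backed Active Directory domain."""
    # interpolation=None keeps '%u' in override_homedir literal.
    cp = configparser.ConfigParser(interpolation=None)
    cp["sssd"] = {
        "config_file_version": "2",
        "domains": domain,
        "services": "nss, pam, ssh",
    }
    cp[f"domain/{domain}"] = {
        "id_provider": "ldap",
        "auth_provider": "ldap",
        "ldap_schema": "AD",
        "ldap_id_mapping": "True",
        "ldap_uri": ldap_uri,
        "ldap_search_base": search_base,
        "ldap_default_bind_dn": bind_dn,
        "ldap_default_authtok_type": "obfuscated_password",
        "ldap_default_authtok": authtok,
        "ldap_tls_cacert": ca_cert_path,
        "ldap_tls_reqcert": "hard",
        # Attribute holding the SSH public key, as stored in AD earlier.
        "ldap_user_ssh_public_key": "altSecurityIdentities",
        "override_homedir": homedir,
    }
    buf = io.StringIO()
    cp.write(buf)
    return buf.getvalue()


print(render_sssd_conf(
    domain="default",
    ldap_uri="ldaps://nlb-ds-xyzxyz.elb.us-west-2.amazonaws.com",
    search_base="dc=hyperpod,dc=abc123,dc=com",
    bind_dn="CN=ReadOnly,OU=Users,OU=hyperpod,DC=hyperpod,DC=abc123,DC=com",
    authtok="placeholder",                # obfuscated password goes here
    ca_cert_path="/etc/ldap/ldaps.crt",   # hypothetical location on the instance
    homedir="/fsx/%u",
))
```

SSSD expects its configuration file to be owned by root with restrictive permissions, so a real script would also need to set ownership and mode after writing the file.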

Verification

Let’s verify the solution by logging in to the cluster with SSH. Because the cluster was created in a private subnet, you can’t directly SSH into the cluster from your local environment. You can choose from two options to connect to the cluster.

Option 1: SSH login through AWS Systems Manager

You can use AWS Systems Manager as a proxy for the SSH connection. Add a host entry to the SSH configuration file ~/.ssh/config using the following example. For the HostName field, specify the Systems Manager target name in the format of sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]. For the IdentityFile field, specify the file path to the user’s SSH private key. This field is not required if you chose password authentication.

Host MyCluster-LoginNode
    HostName sagemaker-cluster:abcd1234_LoginGroup-i-01234567890abcdef
    User user1
    IdentityFile ~/keys/my-cluster-ssh-key.pem
    ProxyCommand aws --profile default --region us-west-2 ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p

Run the ssh command using the host name you specified. Confirm you can log in to the instance with the specified user.

$ ssh MyCluster-LoginNode
   :
   :
   ____              __  ___     __             __ __                  ___          __
  / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ ___  ___/ /
 _ / _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ / -_) __/ ___/ _ / _  /
/___/_,_/_, /__/_/  /_/_,_/_/___/_/   /_//_/_, / .__/__/_/ /_/   ___/_,_/
         /___/                                    /___/_/
You're on the controller
Instance Type: ml.m5.xlarge
user1@ip-10-1-111-222:~$
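
Because the Systems Manager target name follows a fixed pattern, host entries for several nodes or users can be generated rather than typed. The following is a minimal sketch that builds the target name and a matching ~/.ssh/config entry in the format shown in this section. The function names are hypothetical, and the output is meant to be reviewed before it is added to your SSH configuration.

```python
from typing import Optional


def ssm_target(cluster_id: str, instance_group: str, instance_id: str) -> str:
    """Build the target name: sagemaker-cluster:[cluster-id]_[group]-[instance-id]."""
    return f"sagemaker-cluster:{cluster_id}_{instance_group}-{instance_id}"


def ssh_host_entry(alias: str, target: str, user: str, region: str,
                   profile: str = "default",
                   identity_file: Optional[str] = None) -> str:
    """Render one Host block that tunnels SSH through Systems Manager."""
    lines = [f"Host {alias}", f"    HostName {target}", f"    User {user}"]
    if identity_file:
        # Omitted for password authentication.
        lines.append(f"    IdentityFile {identity_file}")
    lines.append(
        f"    ProxyCommand aws --profile {profile} --region {region} "
        "ssm start-session --target %h "
        "--document-name AWS-StartSSHSession --parameters portNumber=%p"
    )
    return "\n".join(lines) + "\n"


target = ssm_target("abcd1234", "LoginGroup", "i-01234567890abcdef")
print(ssh_host_entry("MyCluster-LoginNode", target, "user1", "us-west-2",
                     identity_file="~/keys/my-cluster-ssh-key.pem"))
```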

At this point, users can still use the Systems Manager default shell session to log in to the cluster as ssm-user with administrative privileges. To block the default Systems Manager shell access and enforce SSH access, you can configure your IAM policy by referring to the following example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": [
                "arn:aws:sagemaker:us-west-2:123456789012:cluster/abcd1234efgh",
                "arn:aws:ssm:us-west-2:123456789012:document/AWS-StartSSHSession"
            ],
            "Condition": {
                "BoolIfExists": {
                    "ssm:SessionDocumentAccessCheck": "true"
                }
            }
        }
    ]
}

For more details on how to enforce SSH access, refer to Start a session with a document by specifying the session documents in IAM policies.
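
If you manage several clusters or accounts, the example policy can be produced from parameters instead of edited by hand. The following is a minimal sketch that builds the same policy document as a Python dictionary. The function name `ssh_only_session_policy` is hypothetical, and the structure simply mirrors the JSON shown in this section.

```python
import json


def ssh_only_session_policy(region: str, account_id: str, cluster_id: str) -> dict:
    """Build a policy that allows sessions only through AWS-StartSSHSession."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ssm:StartSession", "ssm:TerminateSession"],
                "Resource": [
                    f"arn:aws:sagemaker:{region}:{account_id}:cluster/{cluster_id}",
                    f"arn:aws:ssm:{region}:{account_id}:document/AWS-StartSSHSession",
                ],
                "Condition": {
                    "BoolIfExists": {"ssm:SessionDocumentAccessCheck": "true"}
                },
            }
        ],
    }


policy = ssh_only_session_policy("us-west-2", "123456789012", "abcd1234efgh")
print(json.dumps(policy, indent=4))
```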

Option 2: SSH login through bastion host

Another option to access the cluster is to use a bastion host as a proxy. You can use this option when the user doesn’t have permission to use Systems Manager sessions, or to troubleshoot when Systems Manager is not working.

  1. Create a bastion security group that allows inbound SSH access (TCP port 22) from your local environment.
  2. Update the security group for the cluster to allow inbound SSH access from the bastion security group.
  3. Create an EC2 Linux instance.
  4. For Amazon Machine Image, choose Ubuntu Server 20.04 LTS.
  5. For Instance type, choose t3.small.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose the public subnet you created with the CloudFormation template.
    3. For Common security groups, choose the bastion security group you created.
  7. For Configure storage, set storage to 8 GB.
  8. Identify the public IP address of the bastion host and the private IP address of the target instance (for example, the login node of the cluster), and add two host entries in the SSH config, by referring to the following example:
    Host Bastion
        HostName 11.22.33.44
        User ubuntu
        IdentityFile ~/keys/my-bastion-ssh-key.pem
    
    Host MyCluster-LoginNode-with-Proxy
        HostName 10.1.111.222
        User user1
        IdentityFile ~/keys/my-cluster-ssh-key.pem
        ProxyCommand ssh -q -W %h:%p Bastion

  9. Run the ssh command using the target host name you specified earlier, and confirm you can log in to the instance with the specified user:
    $ ssh MyCluster-LoginNode-with-Proxy
       :
       :
       ____              __  ___     __             __ __                  ___          __
      / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ ___  ___/ /
     _ / _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ / -_) __/ ___/ _ / _  /
    /___/_,_/_, /__/_/  /_/_,_/_/___/_/   /_//_/_, / .__/__/_/ /_/   ___/_,_/
             /___/                                    /___/_/
    You're on the controller
    Instance Type: ml.m5.xlarge
    user1@ip-10-1-111-222:~$

Clean up

Clean up the resources in the following order:

  1. Delete the HyperPod cluster.
  2. Delete the Network Load Balancer.
  3. Delete the load balancing target group.
  4. Delete the certificate imported to Certificate Manager.
  5. Delete the EC2 Windows instance.
  6. Delete the EC2 Linux instance for the bastion host.
  7. Delete the AWS Managed Microsoft AD.
  8. Delete the CloudFormation stack for the VPC, subnets, security group, and FSx for Lustre volume.

Conclusion

This post provided steps to create a HyperPod cluster integrated with Active Directory. This solution removes the hassle of user maintenance on large-scale clusters and allows you to manage users and groups centrally in one place.

For more information about HyperPod, check out the HyperPod workshop and the SageMaker HyperPod Developer Guide. Leave your feedback on this solution in the comments section.


About the Authors

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in Cloud side technology. In his free time, he enjoys playing video games, reading books, and writing software.

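Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.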

Monidipa Chakraborty currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. She is committed to assisting customers by designing and implementing robust and scalable systems that demonstrate operational excellence. Bringing nearly a decade of software development experience, Monidipa has contributed to various sectors within Amazon, including Video, Retail, Amazon Go, and AWS SageMaker.

Satish Pasumarthi is a Software Developer at Amazon Web Services. With several years of software engineering and an ML background, he loves to bridge the gap between ML and systems and is passionate about building systems that make large-scale model training possible. He has worked on projects in a variety of domains, including machine learning frameworks, model benchmarking, and building the HyperPod beta, involving a broad set of AWS services. In his free time, Satish enjoys playing badminton.
