Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption.
Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files, run their own jobs, and want to avoid impacting each other’s work. To achieve this multi-user environment, you can take advantage of Linux’s user and group mechanism and statically create multiple users on each instance through lifecycle scripts. The drawback to this approach, however, is that user and group settings are duplicated across multiple instances in the cluster, making it difficult to configure them consistently on all instances, such as when a new team member joins.
To solve this pain point, we can use Lightweight Directory Access Protocol (LDAP) and LDAP over TLS/SSL (LDAPS) to integrate with a directory service such as AWS Directory Service for Microsoft Active Directory. With the directory service, you can centrally maintain users, groups, and their permissions.
In this post, we introduce a solution to integrate HyperPod clusters with AWS Managed Microsoft AD, and explain how to achieve a seamless multi-user login environment with a centrally maintained directory.
Solution overview
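The solution uses the following AWS services and resources:
- SageMaker HyperPod, for the training cluster
- AWS Managed Microsoft AD on AWS Directory Service, to centrally manage users and groups
- A Network Load Balancer (NLB), to terminate TLS in front of the directory service
- AWS Certificate Manager (ACM), to hold the certificate that the NLB uses
- Amazon EC2, to run a Windows instance for administering users and groups in the AD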
We also use AWS CloudFormation to deploy a stack to create the prerequisites for the HyperPod cluster: VPC, subnets, security group, and Amazon FSx for Lustre volume.
The following diagram shows the high-level solution architecture.
In this solution, HyperPod cluster instances use the LDAPS protocol to connect to the AWS Managed Microsoft AD through an NLB. We use TLS termination by installing a certificate on the NLB. To configure LDAPS in HyperPod cluster instances, the lifecycle script installs and configures System Security Services Daemon (SSSD), an open source client for LDAP/LDAPS.
Prerequisites
This post assumes you already know how to create a basic HyperPod cluster without SSSD. For more details on how to create HyperPod clusters, refer to Getting started with SageMaker HyperPod and the HyperPod workshop.
Also, in the setup steps, you will use a Linux machine to generate a self-signed certificate and obtain an obfuscated password for the AD reader user. If you don’t have a Linux machine, you can create an EC2 Linux instance or use AWS CloudShell.
Create a VPC, subnets, and a security group
Follow the instructions in the Own Account section of the HyperPod workshop. You will deploy a CloudFormation stack and create prerequisite resources such as VPC, subnets, security group, and FSx for Lustre volume. You need to create both a primary subnet and a backup subnet when deploying the CloudFormation stack, because AWS Managed Microsoft AD requires at least two subnets in different Availability Zones.
In this post, for simplicity, we use the same VPC, subnets, and security group for both the HyperPod cluster and directory service. If you need to use different networks for the cluster and the directory service, make sure security groups and route tables are configured so that they can communicate with each other.
Create AWS Managed Microsoft AD on Directory Service
Complete the following steps to set up your directory:
- On the Directory Service console, choose Directories in the navigation pane.
- Choose Set up directory.
- For Directory type, select AWS Managed Microsoft AD.
- Choose Next.
- For Edition, select Standard Edition.
- For Directory DNS name, enter your preferred directory DNS name (for example, hyperpod.abc123.com).
- For Admin password, set a password and save it for later use.
- Choose Next.
- In the Networking section, specify the VPC and two private subnets you created.
- Choose Next.
- Review the configuration and pricing, then choose Create directory. The directory creation starts. Wait until the status changes from Creating to Active, which can take 20–30 minutes.
- When the status changes to Active, open the detail page of the directory and take note of the DNS addresses for later use.
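If you prefer to script the console steps above, the following is a minimal Python sketch, not part of the original walkthrough. It assumes boto3 is installed and credentials are configured; the function names and the argument values in the usage note are hypothetical. The request builder is kept separate from the API call so that it can be checked without touching AWS.

```python
def build_directory_request(dns_name, admin_password, vpc_id, subnet_ids):
    """Build keyword arguments for the Directory Service create_microsoft_ad call."""
    subnets = list(dict.fromkeys(subnet_ids))  # drop duplicates, keep order
    if len(subnets) < 2:
        # The post states that the directory needs subnets in two Availability Zones.
        raise ValueError("AWS Managed Microsoft AD needs two subnets in different Availability Zones")
    return {
        "Name": dns_name,              # directory DNS name, e.g. hyperpod.abc123.com
        "Password": admin_password,    # password of the directory's Admin user
        "Edition": "Standard",
        "VpcSettings": {"VpcId": vpc_id, "SubnetIds": subnets},
    }


def create_directory(request):
    """Send the request and return the new directory ID (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("ds").create_microsoft_ad(**request)["DirectoryId"]
```

For example, `create_directory(build_directory_request("hyperpod.abc123.com", password, vpc_id, [subnet_a, subnet_b]))` starts the same creation process as the console; you still need to wait for the Active status and look up the DNS addresses afterward.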
Create an NLB in front of Directory Service
To create the NLB, complete the following steps:
- On the Amazon EC2 console, choose Target groups in the navigation pane.
- Choose Create target group.
- Create a target group with the following parameters:
  - For Choose a target type, select IP addresses.
  - For Target group name, enter LDAP.
  - For Protocol: Port, choose TCP and enter 389.
  - For IP address type, select IPv4.
  - For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
  - For Health check protocol, choose TCP.
- Choose Next.
- In the Register targets section, register the directory service’s DNS addresses as the targets.
- For Ports, choose Include as pending below. The addresses are added in the Review targets section with Pending status.
- Choose Create target group.
- On the Load Balancers console, choose Create load balancer.
- Under Network Load Balancer, choose Create.
- Configure an NLB with the following parameters:
  - For Load balancer name, enter a name (for example, nlb-ds).
  - For Scheme, select Internal.
  - For IP address type, select IPv4.
  - For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
  - Under Mappings, select the two private subnets and their CIDR ranges (which you created with the CloudFormation template).
  - For Security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
- In the Listeners and routing section, specify the following parameters:
  - For Protocol, choose TCP.
  - For Port, enter 389.
  - For Default action, choose the target group named LDAP.
  Here, we are adding a listener for LDAP. We will add LDAPS later.
- Choose Create load balancer. Wait until the status changes from Provisioning to Active, which can take 3–5 minutes.
- When the status changes to Active, open the detail page of the provisioned NLB and take note of the DNS name (xyzxyz.elb.region-name.amazonaws.com) for later use.
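The same target group, load balancer, and LDAP listener could be created programmatically. The sketch below is an illustration under assumptions, not the method used in this post: it relies on the boto3 elbv2 client, and the function names and the `directory_ips` parameter (the directory DNS addresses you noted earlier) are hypothetical.

```python
def target_group_request(vpc_id):
    """Parameters matching the console choices for the LDAP target group."""
    return {"Name": "LDAP", "Protocol": "TCP", "Port": 389, "VpcId": vpc_id,
            "TargetType": "ip", "IpAddressType": "ipv4", "HealthCheckProtocol": "TCP"}


def load_balancer_request(name, subnet_ids, security_group_id):
    """Parameters for an internal IPv4 Network Load Balancer."""
    return {"Name": name, "Type": "network", "Scheme": "internal", "IpAddressType": "ipv4",
            "Subnets": list(subnet_ids), "SecurityGroups": [security_group_id]}


def ldap_listener_request(load_balancer_arn, target_group_arn):
    """Parameters for the plain LDAP listener on TCP port 389."""
    return {"LoadBalancerArn": load_balancer_arn, "Protocol": "TCP", "Port": 389,
            "DefaultActions": [{"Type": "forward", "TargetGroupArn": target_group_arn}]}


def create_nlb(vpc_id, subnet_ids, security_group_id, directory_ips, name="nlb-ds"):
    """Create the resources and return the NLB DNS name (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builders above work without boto3

    elbv2 = boto3.client("elbv2")
    group = elbv2.create_target_group(**target_group_request(vpc_id))["TargetGroups"][0]
    elbv2.register_targets(
        TargetGroupArn=group["TargetGroupArn"],
        Targets=[{"Id": ip, "Port": 389} for ip in directory_ips],
    )
    balancer = elbv2.create_load_balancer(
        **load_balancer_request(name, subnet_ids, security_group_id))["LoadBalancers"][0]
    elbv2.create_listener(
        **ldap_listener_request(balancer["LoadBalancerArn"], group["TargetGroupArn"]))
    return balancer["DNSName"]
```

The returned DNS name is the value to note for later use; the load balancer still takes a few minutes to become active.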
Create a self-signed certificate and import it to Certificate Manager
To create a self-signed certificate, complete the following steps:
- On your Linux-based environment (local laptop, EC2 Linux instance, or CloudShell), run OpenSSL commands to create a self-signed certificate (ldaps.crt) and private key (ldaps.key).
- On the Certificate Manager console, choose Import.
- Enter the certificate body and private key from the contents of ldaps.crt and ldaps.key, respectively.
- Choose Next.
- Add any optional tags, then choose Next.
- Review the configuration and choose Import.
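The exact OpenSSL commands are not reproduced in this text, so the following Python sketch shows one possible way to produce ldaps.key and ldaps.crt with a single `openssl req -x509` invocation. The key size, validity period, and the choice of the NLB DNS name as the certificate common name are assumptions for illustration; the function names are hypothetical.

```python
import subprocess


def openssl_self_signed_command(common_name, key_path="ldaps.key",
                                cert_path="ldaps.crt", days=365):
    """Return an openssl command that writes an unencrypted RSA key and a self-signed cert."""
    return [
        "openssl", "req", "-x509",
        "-newkey", "rsa:2048",   # generate a new 2048-bit RSA key (assumed size)
        "-nodes",                # do not encrypt the private key
        "-keyout", key_path,
        "-out", cert_path,
        "-days", str(days),      # validity period (assumed)
        "-subj", f"/CN={common_name}",
    ]


def create_self_signed_certificate(common_name):
    """Run the command; requires the openssl binary on PATH."""
    subprocess.run(openssl_self_signed_command(common_name), check=True)
```

Calling `create_self_signed_certificate("xyzxyz.elb.region-name.amazonaws.com")` would leave both files in the current directory, ready to paste into the Certificate Manager import form.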
Add an LDAPS listener
We added a listener for LDAP already in the NLB. Now we add a listener for LDAPS with the imported certificate. Complete the following steps:
- On the Load Balancers console, navigate to the NLB details page.
- On the Listeners tab, choose Add listener.
- Configure the listener with the following parameters:
  - For Protocol, choose TLS.
  - For Port, enter 636.
  - For Default action, choose LDAP.
  - For Certificate source, select From ACM.
  - For Certificate, enter what you imported in ACM.
- Choose Add. Now the NLB listens to both LDAP and LDAPS. We recommend deleting the LDAP listener because, unlike LDAPS, it transmits data without encryption.
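As a scripted counterpart to the listener settings above, this short sketch builds the parameters for a TLS listener on port 636 that forwards to the LDAP target group. It assumes the boto3 elbv2 client; the function names and the ARN arguments are placeholders you would fill with your own values.

```python
def ldaps_listener_request(load_balancer_arn, target_group_arn, certificate_arn):
    """Parameters for a TLS listener that terminates LDAPS at the NLB."""
    return {
        "LoadBalancerArn": load_balancer_arn,
        "Protocol": "TLS",
        "Port": 636,
        "Certificates": [{"CertificateArn": certificate_arn}],  # certificate imported in ACM
        "DefaultActions": [{"Type": "forward", "TargetGroupArn": target_group_arn}],
    }


def add_ldaps_listener(request):
    """Create the listener and return its ARN (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    response = boto3.client("elbv2").create_listener(**request)
    return response["Listeners"][0]["ListenerArn"]
```

Because TLS ends at the load balancer, the listener forwards to the same port 389 target group that the plain LDAP listener uses.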
Create an EC2 Windows instance to administer users and groups in the AD
To create and maintain users and groups in the AD, complete the following steps:
- On the Amazon EC2 console, choose Instances in the navigation pane.
- Choose Launch instances.
- For Name, enter a name for your instance.
- For Amazon Machine Image, choose Microsoft Windows Server 2022 Base.
- For Instance type, choose t2.micro.
- In the Network settings section, provide the following parameters:
  - For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
  - For Subnet, choose either of the two subnets you created with the CloudFormation template.
  - For Common security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
- For Configure storage, set storage to 30 GB gp2.
- In the Advanced details section, for Domain join directory, choose the AD you created.
- For IAM instance profile, choose an AWS Identity and Access Management (IAM) role with at least the AmazonSSMManagedEC2InstanceDefaultPolicy policy.
- Review the summary and choose Launch instance.
Create users and groups in AD using the EC2 Windows instance
Using Remote Desktop, connect to the EC2 Windows instance you created in the previous step. We recommend an RDP client over the browser-based Remote Desktop so that you can copy and paste between the instance and your local machine. For more details about connecting to EC2 Windows instances, refer to Connect to your Windows instance.
If you are prompted for a login credential, use hyperpod\Admin (where hyperpod is the first part of your directory DNS name) as the user name, and use the admin password you set for the directory service.
- When the Windows desktop screen opens, choose Server Manager from the Start menu.
- Choose Local Server in the navigation pane, and confirm that the domain is what you specified for the directory service.
- On the Manage menu, choose Add Roles and Features.
- Choose Next until you are at the Features page.
- Expand the feature Remote Server Administration Tools, expand Role Administration Tools, and select AD DS and AD LDS Tools and Active Directory Rights Management Service.
- Choose Next and Install. Feature installation starts.
- When the installation is complete, choose Close.
- Open Active Directory Users and Computers from the Start menu.
- Under hyperpod.abc123.com, expand hyperpod.
- Choose (right-click) hyperpod, choose New, and choose Organizational Unit.
- Create an organizational unit called Groups.
- Choose (right-click) Groups, choose New, and choose Group.
- Create a group called ClusterAdmin.
- Create a second group called ClusterDev.
- Choose (right-click) Users, choose New, and choose User.
- Create a new user.
- Choose (right-click) the user and choose Add to a group.
- Add your users to the groups ClusterAdmin or ClusterDev. Users added to the ClusterAdmin group will have sudo privilege on the cluster.
Create a ReadOnly user in AD
Create a user called ReadOnly under Users. The ReadOnly user is used by the cluster to programmatically access users and groups in AD.
Take note of the password for later use.
(For SSH public key authentication) Add SSH public keys to users
By storing an SSH public key to a user in AD, you can log in without entering a password. You can use an existing key pair, or you can create a new key pair with OpenSSH’s ssh-keygen command. For more information about generating a key pair, refer to Create a key pair for your Amazon EC2 instance.
- In Active Directory Users and Computers, on the View menu, enable Advanced Features.
- Open the Properties dialog of the user.
- On the Attribute Editor tab, choose altSecurityIdentities and choose Edit.
- For Value to add, choose Add.
- For Values, add an SSH public key.
- Choose OK. Confirm that the SSH public key appears as an attribute.
Get an obfuscated password for the ReadOnly user
To avoid including a plain text password in the SSSD configuration file, you obfuscate the password. For this step, you need a Linux environment (local laptop, EC2 Linux instance, or CloudShell).
Install the sssd-tools package on the Linux machine to get the Python module pysss, which performs the obfuscation.
Then run a one-line Python script that calls pysss, and enter the password of the ReadOnly user when prompted. The script prints the obfuscated password.
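The one-line script itself is missing from this text. The sketch below is a reconstruction under assumptions: it uses the `pysss.password().encrypt(...)` call with the `AES_256` method, as SSSD's own sss_obfuscate tool does, and it only works on a machine where sssd-tools has installed pysss. The function names are hypothetical.

```python
import getpass


def obfuscate_password(plain_text):
    """Return the obfuscated form of a password using SSSD's pysss module."""
    import pysss  # provided by the sssd-tools package; imported lazily

    obfuscator = pysss.password()
    return obfuscator.encrypt(plain_text, obfuscator.AES_256)


def main():
    """Prompt for the ReadOnly user's password without echoing it, then print the result."""
    password = getpass.getpass("ReadOnly user password: ").strip()
    print(obfuscate_password(password))
```

Call `main()` from an interactive Python session on the Linux machine and copy the printed value; it is what goes into the SSSD configuration in place of the plain text password.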
Create a HyperPod cluster with an SSSD-enabled lifecycle script
Next, you create a HyperPod cluster with LDAPS/Active Directory integration.
- Find the configuration file config.py in your lifecycle script directory, open it with your text editor, and edit the properties in the Config class and SssdConfig class:
  - Set True for enable_sssd to enable setting up SSSD.
  - The SssdConfig class contains configuration parameters for SSSD.
  - Make sure you use the obfuscated password for the ldap_default_authtok property, not a plain text password.
- Copy the certificate file ldaps.crt to the same directory (where config.py exists).
- Upload the modified lifecycle script files to your Amazon Simple Storage Service (Amazon S3) bucket, and create a HyperPod cluster with it.
- Wait until the status changes to InService.
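To make the config.py edits more concrete, here is a hedged sketch of what the two classes might look like. Only `enable_sssd` and `ldap_default_authtok` are named in this post; the other attributes are assumptions named after the corresponding SSSD options, the helper `dns_name_to_base_dn` is hypothetical, and the bind DN assumes the ReadOnly user sits under the Users organizational unit of the hyperpod directory. Check the real config.py in your lifecycle scripts for the actual property names.

```python
def dns_name_to_base_dn(dns_name):
    """Convert a directory DNS name into an LDAP base DN (hypothetical helper)."""
    return ",".join(f"dc={part}" for part in dns_name.split("."))


DIRECTORY_DNS_NAME = "hyperpod.abc123.com"           # example name used in this post
NLB_DNS_NAME = "xyzxyz.elb.region-name.amazonaws.com"  # placeholder from this post


class Config:
    enable_sssd = True  # turn on SSSD setup in the lifecycle script


class SssdConfig:
    # Attribute names below other than ldap_default_authtok are assumptions.
    ldap_uri = f"ldaps://{NLB_DNS_NAME}"              # connect through the NLB over LDAPS
    ldap_search_base = dns_name_to_base_dn(DIRECTORY_DNS_NAME)
    ldap_default_bind_dn = "CN=ReadOnly,OU=Users,OU=hyperpod," + ldap_search_base
    ldap_default_authtok = "<obfuscated password>"    # never the plain text password
```

Deriving the search base from the DNS name keeps the two values consistent if you later change the directory name.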
Verification
Let’s verify the solution by logging in to the cluster with SSH. Because the cluster was created in a private subnet, you can’t directly SSH into the cluster from your local environment. You can choose from two options to connect to the cluster.
Option 1: SSH login through AWS Systems Manager
You can use AWS Systems Manager as a proxy for the SSH connection. Add a host entry to the SSH configuration file ~/.ssh/config. For the HostName field, specify the Systems Manager target name in the format sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]. For the IdentityFile field, specify the file path to the user’s SSH private key. This field is not required if you chose password authentication.
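The host entry described above can be generated with a small helper. This is a sketch under assumptions: the function name, the host alias, and the IDs in the usage note are hypothetical, and the ProxyCommand line uses the AWS CLI `ssm start-session` command with the AWS-StartSSHSession document, which requires the AWS CLI and the Session Manager plugin on your local machine.

```python
def ssm_host_entry(alias, cluster_id, instance_group, instance_id, user, identity_file=None):
    """Render an ~/.ssh/config entry that tunnels SSH through Systems Manager."""
    target = f"sagemaker-cluster:{cluster_id}_{instance_group}-{instance_id}"
    lines = [
        f"Host {alias}",
        f"    HostName {target}",
        f"    User {user}",
    ]
    if identity_file:  # omit for password authentication
        lines.append(f"    IdentityFile {identity_file}")
    lines.append(
        "    ProxyCommand aws ssm start-session --target %h "
        "--document-name AWS-StartSSHSession --parameters portNumber=%p"
    )
    return "\n".join(lines) + "\n"
```

For example, `print(ssm_host_entry("my-cluster-login", "abcde12345", "LoginGroup", "i-0123456789abcdef0", "user1", "~/keys/user1"))` prints an entry you can append to ~/.ssh/config.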
Run the ssh command using the host name you specified. Confirm you can log in to the instance with the specified user.
At this point, users can still use the Systems Manager default shell session to log in to the cluster as ssm-user with administrative privileges. To block the default Systems Manager shell access and enforce SSH access, you can configure your IAM policy to restrict which session documents users can start sessions with.
For more details on how to enforce SSH access, refer to Start a session with a document by specifying the session documents in IAM policies.
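The example policy is not reproduced in this text, so the following sketch builds one plausible policy document in Python. It is an assumption-laden illustration: the function name is hypothetical, the cluster ARN format and the wildcard document ARN are assumed, and the `ssm:SessionDocumentAccessCheck` condition is what makes Systems Manager require an explicitly permitted session document. Validate the policy against the referenced documentation before using it.

```python
import json


def ssh_only_session_policy(region, account_id, cluster_id):
    """Build an IAM policy that allows sessions only with the SSH session document."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ssm:StartSession"],
                "Resource": [
                    # Assumed ARN format for the HyperPod cluster.
                    f"arn:aws:sagemaker:{region}:{account_id}:cluster/{cluster_id}",
                    "arn:aws:ssm:*:*:document/AWS-StartSSHSession",
                ],
                "Condition": {
                    "BoolIfExists": {"ssm:SessionDocumentAccessCheck": "true"}
                },
            }
        ],
    }


def policy_json(region, account_id, cluster_id):
    """Return the policy as formatted JSON, ready to paste into the IAM console."""
    return json.dumps(ssh_only_session_policy(region, account_id, cluster_id), indent=2)
```

With this shape, a session started without specifying a document is not covered by the Allow statement, so users have to connect through SSH as shown in this section.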
Option 2: SSH login through bastion host
Another option to access the cluster is to use a bastion host as a proxy. You can use this option when the user doesn’t have permission to use Systems Manager sessions, or to troubleshoot when Systems Manager is not working.
- Create a bastion security group that allows inbound SSH access (TCP port 22) from your local environment.
- Update the security group for the cluster to allow inbound SSH access from the bastion security group.
- Create an EC2 Linux instance.
- For Amazon Machine Image, choose Ubuntu Server 20.04 LTS.
- For Instance type, choose t3.small.
- In the Network settings section, provide the following parameters:
  - For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
  - For Subnet, choose the public subnet you created with the CloudFormation template.
  - For Common security groups, choose the bastion security group you created.
- For Configure storage, set storage to 8 GB.
- Identify the public IP address of the bastion host and the private IP address of the target instance (for example, the login node of the cluster), and add two host entries in the SSH config: one for the bastion host, and one for the target instance that is reached through the bastion host.
- Run the ssh command using the target host name you specified earlier, and confirm you can log in to the instance with the specified user.
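The two host entries for the bastion setup can be sketched as follows. This is an illustration, not the configuration from the original post: the function name, aliases, and user names are hypothetical, the IP addresses in the usage note are documentation placeholders, and the second entry uses OpenSSH's ProxyJump option to hop through the bastion host.

```python
def bastion_host_entries(bastion_ip, target_ip, bastion_user, cluster_user,
                         bastion_alias="bastion", target_alias="my-cluster-login",
                         bastion_key=None, cluster_key=None):
    """Render two ~/.ssh/config entries: the bastion host and the target behind it."""
    lines = [
        f"Host {bastion_alias}",
        f"    HostName {bastion_ip}",   # public IP address of the bastion host
        f"    User {bastion_user}",
    ]
    if bastion_key:
        lines.append(f"    IdentityFile {bastion_key}")
    lines += [
        "",
        f"Host {target_alias}",
        f"    HostName {target_ip}",    # private IP address of the cluster instance
        f"    User {cluster_user}",     # user maintained in Active Directory
        f"    ProxyJump {bastion_alias}",
    ]
    if cluster_key:
        lines.append(f"    IdentityFile {cluster_key}")
    return "\n".join(lines) + "\n"
```

For example, `print(bastion_host_entries("203.0.113.10", "10.1.0.5", "ubuntu", "user1"))` prints both entries; after adding them to ~/.ssh/config, `ssh my-cluster-login` connects through the bastion host.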
Clean up
Clean up the resources in the following order:
- Delete the HyperPod cluster.
- Delete the Network Load Balancer.
- Delete the load balancing target group.
- Delete the certificate imported to Certificate Manager.
- Delete the EC2 Windows instance.
- Delete the EC2 Linux instance for the bastion host.
- Delete the AWS Managed Microsoft AD.
- Delete the CloudFormation stack for the VPC, subnets, security group, and FSx for Lustre volume.
Conclusion
This post provided steps to create a HyperPod cluster integrated with Active Directory. This solution removes the hassle of user maintenance on large-scale clusters and allows you to manage users and groups centrally in one place.
For more information about HyperPod, check out the HyperPod workshop and the SageMaker HyperPod Developer Guide. Leave your feedback on this solution in the comments section.
About the authors
Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in Cloud side technology. In his free time, he enjoys playing video games, reading books, and writing software.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Monidipa Chakraborty currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. She is committed to assisting customers by designing and implementing robust and scalable systems that demonstrate operational excellence. Bringing nearly a decade of software development experience, Monidipa has contributed to various sectors within Amazon, including Video, Retail, Amazon Go, and AWS SageMaker.
Satish Pasumarthi is a Software Developer at Amazon Web Services. With several years of software engineering and an ML background, he loves to bridge the gap between ML and systems and is passionate about building systems that make large-scale model training possible. He has worked on projects in a variety of domains, including machine learning frameworks, model benchmarking, and building the HyperPod beta, involving a broad set of AWS services. In his free time, Satish enjoys playing badminton.
Source: https://aws.amazon.com/blogs/machine-learning/integrate-hyperpod-clusters-with-active-directory-for-seamless-multi-user-login/