Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption.
Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files, run their own jobs, and want to avoid impacting each other’s work. To achieve this multi-user environment, you can take advantage of Linux’s user and group mechanism and statically create multiple users on each instance through lifecycle scripts. The drawback to this approach, however, is that user and group settings are duplicated across multiple instances in the cluster, making it difficult to configure them consistently on all instances, such as when a new team member joins.
To solve this pain point, we can use Lightweight Directory Access Protocol (LDAP) and LDAP over TLS/SSL (LDAPS) to integrate with a directory service such as AWS Directory Service for Microsoft Active Directory. With the directory service, you can centrally maintain users and groups, and their permissions.
In this post, we introduce a solution to integrate HyperPod clusters with AWS Managed Microsoft AD, and explain how to achieve a seamless multi-user login environment with a centrally maintained directory.
Solution overview
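The solution uses the following AWS services and resources:
- SageMaker HyperPod
- AWS Managed Microsoft AD
- A Network Load Balancer (NLB)
- AWS Certificate Manager (ACM)
- An Amazon EC2 Windows instance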
We also use AWS CloudFormation to deploy a stack to create the prerequisites for the HyperPod cluster: VPC, subnets, security group, and Amazon FSx for Lustre volume.
The following diagram illustrates the high-level solution architecture.
In this solution, HyperPod cluster instances use the LDAPS protocol to connect to the AWS Managed Microsoft AD via an NLB. We use TLS termination by installing a certificate to the NLB. To configure LDAPS in HyperPod cluster instances, the lifecycle script installs and configures System Security Services Daemon (SSSD), an open source client software for LDAP/LDAPS.
Prerequisites
This post assumes you already know how to create a basic HyperPod cluster without SSSD. For more details on how to create HyperPod clusters, refer to Getting started with SageMaker HyperPod and the HyperPod workshop.
Also, in the setup steps, you will use a Linux machine to generate a self-signed certificate and obtain an obfuscated password for the AD reader user. If you don’t have a Linux machine, you can create an EC2 Linux instance or use AWS CloudShell.
Create a VPC, subnets, and a security group
Follow the instructions in the Own Account section of the HyperPod workshop. You will deploy a CloudFormation stack and create prerequisite resources such as VPC, subnets, security group, and FSx for Lustre volume. You need to create both a primary subnet and a backup subnet when deploying the CloudFormation stack, because AWS Managed Microsoft AD requires at least two subnets in different Availability Zones.
In this post, for simplicity, we use the same VPC, subnets, and security group for both the HyperPod cluster and the directory service. If you need to use different networks for the cluster and the directory service, make sure security groups and route tables are configured so that they can communicate with each other.
Create AWS Managed Microsoft AD on Directory Service
Complete the following steps to set up your directory:
- On the Directory Service console, choose Directories in the navigation pane.
- Choose Set up directory.
- For Directory type, select AWS Managed Microsoft AD.
- Choose Next.
- For Edition, select Standard Edition.
- For Directory DNS name, enter your preferred directory DNS name (for example, hyperpod.abc123.com).
- For Admin password, set a password and save it for later use.
- Choose Next.
- In the Networking section, specify the VPC and two private subnets you created.
- Choose Next.
- Review the configuration and pricing, then choose Create directory.
The directory creation starts. Wait until the status changes from Creating to Active, which can take 20–30 minutes.
- When the status changes to Active, open the detail page of the directory and take note of the DNS addresses for later use.
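The console steps above can also be scripted. The following is a minimal sketch, assuming the boto3 Directory Service client and its create_microsoft_ad and describe_directories calls; the post itself uses the console, and the helper names, VPC ID, and subnet IDs are hypothetical placeholders.

```python
import time


def build_directory_request(dns_name, admin_password, vpc_id, subnet_ids):
    """Build keyword arguments for the Directory Service create_microsoft_ad call."""
    if len(set(subnet_ids)) != 2:
        # AWS Managed Microsoft AD needs two subnets in different Availability Zones.
        raise ValueError("exactly two distinct subnets are required")
    return {
        "Name": dns_name,
        "Password": admin_password,
        "Edition": "Standard",
        "VpcSettings": {"VpcId": vpc_id, "SubnetIds": list(subnet_ids)},
    }


def wait_until_active(ds_client, directory_id, poll_seconds=60, max_polls=60):
    """Poll describe_directories until the directory is Active; return its DNS addresses."""
    for _ in range(max_polls):
        response = ds_client.describe_directories(DirectoryIds=[directory_id])
        description = response["DirectoryDescriptions"][0]
        if description["Stage"] == "Active":
            return description["DnsIpAddrs"]
        if description["Stage"] == "Failed":
            raise RuntimeError("directory creation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("directory did not become Active in time")


# Usage (not executed here; requires boto3 and AWS credentials):
#   ds = boto3.client("ds")
#   request = build_directory_request("hyperpod.abc123.com", password, vpc_id, subnet_ids)
#   directory_id = ds.create_microsoft_ad(**request)["DirectoryId"]
#   dns_addresses = wait_until_active(ds, directory_id)
```

The DNS addresses returned by the polling helper are the values you later register as NLB targets.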
Create an NLB in front of Directory Service
To create the NLB, complete the following steps:
- On the Amazon EC2 console, choose Target groups in the navigation pane.
- Choose Create target groups.
- Create a target group with the following parameters:
  - For Choose a target type, select IP addresses.
  - For Target group name, enter LDAP.
  - For Protocol: Port, choose TCP and enter 389.
  - For IP address type, select IPv4.
  - For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
  - For Health check protocol, choose TCP.
- Choose Next.
- In the Register targets section, register the directory service’s DNS addresses as the targets.
- For Ports, choose Include as pending below.
The addresses are added in the Review targets section with Pending status.
- Choose Create target group.
- On the Load Balancers console, choose Create load balancer.
- Under Network Load Balancer, choose Create.
- Configure an NLB with the following parameters:
  - For Load balancer name, enter a name (for example, nlb-ds).
  - For Scheme, select Internal.
  - For IP address type, select IPv4.
  - For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
  - Under Mappings, select the two private subnets and their CIDR ranges (which you created with the CloudFormation template).
  - For Security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
- In the Listeners and routing section, specify the following parameters:
  - For Protocol, choose TCP.
  - For Port, enter 389.
  - For Default action, choose the target group named LDAP.
Here, we are adding a listener for LDAP. We will add LDAPS later.
- Choose Create load balancer.
Wait until the status changes from Provisioning to Active, which can take 3–5 minutes.
- When the status changes to Active, open the detail page of the provisioned NLB and take note of the DNS name (xyzxyz.elb.region-name.amazonaws.com) for later use.
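To make the parameters above easier to review in one place, the following sketch collects them as request dictionaries for the boto3 elbv2 client (create_target_group, register_targets, create_load_balancer, create_listener). Using boto3 is an assumption, since the post uses the console, and the function names and IDs are hypothetical.

```python
def build_nlb_requests(vpc_id, subnet_ids, security_group_id, directory_dns_ips):
    """Return elbv2 request parameters for the LDAP target group, its targets, and the NLB."""
    target_group = {
        "Name": "LDAP",
        "Protocol": "TCP",
        "Port": 389,
        "VpcId": vpc_id,
        "TargetType": "ip",
        "IpAddressType": "ipv4",
        "HealthCheckProtocol": "TCP",
    }
    # The directory's DNS addresses become the IP targets on port 389.
    targets = [{"Id": ip, "Port": 389} for ip in directory_dns_ips]
    load_balancer = {
        "Name": "nlb-ds",
        "Type": "network",
        "Scheme": "internal",
        "IpAddressType": "ipv4",
        "Subnets": list(subnet_ids),
        "SecurityGroups": [security_group_id],
    }
    return {"target_group": target_group, "targets": targets, "load_balancer": load_balancer}


def build_ldap_listener(load_balancer_arn, target_group_arn):
    """Return create_listener parameters for the plain LDAP listener on TCP port 389."""
    return {
        "LoadBalancerArn": load_balancer_arn,
        "Protocol": "TCP",
        "Port": 389,
        "DefaultActions": [{"Type": "forward", "TargetGroupArn": target_group_arn}],
    }


# Usage (not executed here; requires boto3 and AWS credentials):
#   elbv2 = boto3.client("elbv2")
#   reqs = build_nlb_requests(vpc_id, subnet_ids, security_group_id, dns_addresses)
#   tg_arn = elbv2.create_target_group(**reqs["target_group"])["TargetGroups"][0]["TargetGroupArn"]
#   elbv2.register_targets(TargetGroupArn=tg_arn, Targets=reqs["targets"])
#   lb_arn = elbv2.create_load_balancer(**reqs["load_balancer"])["LoadBalancers"][0]["LoadBalancerArn"]
#   elbv2.create_listener(**build_ldap_listener(lb_arn, tg_arn))
```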
Create a self-signed certificate and import it to Certificate Manager
To create a self-signed certificate, complete the following steps:
- In your Linux-based environment (local laptop, EC2 Linux instance, or CloudShell), run OpenSSL commands to create a self-signed certificate (ldaps.crt) and private key (ldaps.key).
- On the Certificate Manager console, choose Import.
- Enter the certificate body and private key, from the contents of ldaps.crt and ldaps.key respectively.
- Choose Next.
- Add any optional tags, then choose Next.
- Review the configuration and choose Import.
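As an illustration of the certificate steps, the sketch below builds a single OpenSSL command that writes ldaps.key and ldaps.crt, and imports the result with the boto3 ACM client. This is not necessarily the exact command sequence the authors used: the choice of `openssl req -x509` with `-addext` (OpenSSL 1.1.1 or later), the use of the NLB DNS name as the certificate subject, and the helper names are all assumptions.

```python
import shlex


def build_openssl_command(nlb_dns_name, days=365):
    """Return an argv list that creates ldaps.key and a self-signed ldaps.crt."""
    return [
        "openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes",
        "-keyout", "ldaps.key", "-out", "ldaps.crt",
        "-days", str(days),
        # Assumption: the certificate is issued for the NLB DNS name noted earlier.
        "-subj", f"/CN={nlb_dns_name}",
        "-addext", f"subjectAltName=DNS:{nlb_dns_name}",
    ]


def import_certificate(acm_client, cert_path="ldaps.crt", key_path="ldaps.key"):
    """Import the certificate body and private key into ACM; return the certificate ARN."""
    with open(cert_path, "rb") as cert_file:
        certificate = cert_file.read()
    with open(key_path, "rb") as key_file:
        private_key = key_file.read()
    response = acm_client.import_certificate(Certificate=certificate, PrivateKey=private_key)
    return response["CertificateArn"]


# Usage (not executed here):
#   print(shlex.join(build_openssl_command("xyzxyz.elb.region-name.amazonaws.com")))
#   subprocess.run(build_openssl_command(nlb_dns_name), check=True)
#   certificate_arn = import_certificate(boto3.client("acm"))
```

Keep ldaps.crt at hand after the import, because the lifecycle script needs a copy of it later.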
Add an LDAPS listener
We already added a listener for LDAP in the NLB. Now we add a listener for LDAPS with the imported certificate. Complete the following steps:
- On the Load Balancers console, navigate to the NLB details page.
- On the Listeners tab, choose Add listener.
- Configure the listener with the following parameters:
  - For Protocol, choose TLS.
  - For Port, enter 636.
  - For Default action, choose LDAP.
  - For Certificate source, select From ACM.
  - For Certificate, enter what you imported in ACM.
- Choose Add.
Now the NLB listens to both LDAP and LDAPS. It is recommended to delete the LDAP listener because it transmits data without encryption, unlike LDAPS.
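The TLS termination described above comes down to one listener definition: the NLB accepts TLS on port 636 with the ACM certificate and forwards to the LDAP target group on port 389. A minimal sketch of the equivalent boto3 elbv2 create_listener parameters follows; the function name and ARNs are hypothetical.

```python
def build_ldaps_listener(load_balancer_arn, target_group_arn, certificate_arn):
    """Return create_listener parameters for a TLS listener on port 636."""
    return {
        "LoadBalancerArn": load_balancer_arn,
        "Protocol": "TLS",
        "Port": 636,
        # The NLB terminates TLS with the certificate imported into ACM.
        "Certificates": [{"CertificateArn": certificate_arn}],
        # Decrypted traffic is forwarded to the target group named LDAP.
        "DefaultActions": [{"Type": "forward", "TargetGroupArn": target_group_arn}],
    }


# Usage (not executed here; requires boto3 and AWS credentials):
#   boto3.client("elbv2").create_listener(**build_ldaps_listener(lb_arn, tg_arn, certificate_arn))
```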
Create an EC2 Windows instance to administer users and groups in the AD
To create and maintain users and groups in the AD, complete the following steps:
- On the Amazon EC2 console, choose Instances in the navigation pane.
- Choose Launch instances.
- For Name, enter a name for your instance.
- For Amazon Machine Image, choose Microsoft Windows Server 2022 Base.
- For Instance type, choose t2.micro.
- In the Network settings section, provide the following parameters:
  - For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
  - For Subnet, choose either of the two subnets you created with the CloudFormation template.
  - For Common security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
- For Configure storage, set storage to 30 GB gp2.
- In the Advanced details section, for Domain join directory, choose the AD you created.
- For IAM instance profile, choose an AWS Identity and Access Management (IAM) role with at least the AmazonSSMManagedEC2InstanceDefaultPolicy policy.
- Review the summary and choose Launch instance.
Create users and groups in AD using the EC2 Windows instance
With Remote Desktop, connect to the EC2 Windows instance you created in the previous step. Using an RDP client is recommended over using a browser-based Remote Desktop so that you can exchange the contents of the clipboard with your local machine using copy-paste operations. For more details about connecting to EC2 Windows instances, refer to Connect to your Windows instance.
If you are prompted for a login credential, use hyperpod\Admin (where hyperpod is the first part of your directory DNS name) as the user name, and use the admin password you set for the directory service.
- When the Windows desktop screen opens, choose Server Manager from the Start menu.
- Choose Local Server in the navigation pane, and confirm that the domain is what you specified to the directory service.
- On the Manage menu, choose Add Roles and Features.
- Choose Next until you are at the Features page.
- Expand the feature Remote Server Administration Tools, expand Role Administration Tools, and select AD DS and AD LDS Tools and Active Directory Rights Management Service.
- Choose Next and Install.
Feature installation starts.
- When the installation is complete, choose Close.
- Open Active Directory Users and Computers from the Start menu.
- Under hyperpod.abc123.com, expand hyperpod.
- Choose (right-click) hyperpod, choose New, and choose Organizational Unit.
- Create an organizational unit called Groups.
- Choose (right-click) Groups, choose New, and choose Group.
- Create a group called ClusterAdmin.
- Create a second group called ClusterDev.
- Choose (right-click) Users, choose New, and choose User.
- Create a new user.
- Choose (right-click) the user and choose Add to a group.
- Add your users to the groups ClusterAdmin or ClusterDev.
Users added to the ClusterAdmin group will have sudo privilege on the cluster.
Create a ReadOnly user in AD
Create a user called ReadOnly under Users. The ReadOnly user is used by the cluster to programmatically access users and groups in AD.
Take note of the password for later use.
(For SSH public key authentication) Add SSH public keys to users
By storing an SSH public key to a user in AD, you can log in without entering a password. You can use an existing key pair, or you can create a new key pair with OpenSSH’s ssh-keygen command. For more information about generating a key pair, refer to Create a key pair for your Amazon EC2 instance.
- In Active Directory Users and Computers, on the View menu, enable Advanced Features.
- Open the Properties dialog of the user.
- On the Attribute Editor tab, choose altSecurityIdentities, then choose Edit.
- For Value to add, choose Add.
- For Values, add an SSH public key.
- Choose OK.
Confirm that the SSH public key appears as an attribute.
Get an obfuscated password for the ReadOnly user
To avoid including a plain text password in the SSSD configuration file, you obfuscate the password. For this step, you need a Linux environment (local laptop, EC2 Linux instance, or CloudShell).
Install the sssd-tools package on the Linux machine, which provides the Python module pysss used for obfuscation.
Then run a one-line Python script that prompts for the password of the ReadOnly user and prints the obfuscated password.
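The obfuscation step can be sketched as follows. This assumes the pysss module exposes a password() object with an encrypt method and an AES_256 constant, which is the interface used by the sss_obfuscate tool shipped with sssd-tools; the wrapper function name is hypothetical. The module is passed in as an argument so that the function can be read and exercised on a machine where pysss is not installed.

```python
def obfuscate_password(plain_text, pysss_module):
    """Obfuscate plain_text with the AES-256 method exposed by the given pysss module."""
    encoder = pysss_module.password()
    return encoder.encrypt(plain_text, encoder.AES_256)


# Usage on the Linux machine (not executed here; pysss ships with sssd-tools):
#   import getpass, pysss
#   print(obfuscate_password(getpass.getpass("ReadOnly user password: ").strip(), pysss))
```

Reading the password with getpass keeps it out of the shell history. Note that obfuscation only hides the password from casual viewing and is not a substitute for restricting access to the configuration file.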
Create a HyperPod cluster with an SSSD-enabled lifecycle script
Next, you create a HyperPod cluster with LDAPS/Active Directory integration.
- Find the configuration file config.py in your lifecycle script directory, open it with your text editor, and edit the properties in the Config class and SssdConfig class:
  - Set True for enable_sssd to enable setting up SSSD.
  - The SssdConfig class contains configuration parameters for SSSD.
  - Make sure you use the obfuscated password for the ldap_default_authtok property, not a plain text password.
- Copy the certificate file ldaps.crt to the same directory (where config.py exists).
- Upload the modified lifecycle script files to your Amazon Simple Storage Service (Amazon S3) bucket, and create a HyperPod cluster with it.
- Wait until the status changes to InService.
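To give a feel for the edits in config.py, here is a minimal sketch of the two classes. Only enable_sssd, SssdConfig, and ldap_default_authtok are named in this post; the other property names mirror standard sssd.conf LDAP options and are assumptions about the lifecycle script, as are the example values and the helper function. Check the actual config.py in your lifecycle script directory for the real property list.

```python
def search_base_from_dns(directory_dns_name):
    """Convert a directory DNS name such as hyperpod.abc123.com into an LDAP search base."""
    return ",".join(f"dc={part}" for part in directory_dns_name.split("."))


class Config:
    # Turn on SSSD setup in the lifecycle script.
    enable_sssd = True


class SssdConfig:
    # Assumption: LDAPS endpoint is the DNS name of the NLB created earlier.
    ldap_uri = "ldaps://xyzxyz.elb.region-name.amazonaws.com"
    # Assumption: search base derived from the directory DNS name.
    ldap_search_base = search_base_from_dns("hyperpod.abc123.com")
    # Assumption: distinguished name of the ReadOnly user created under Users.
    ldap_default_bind_dn = "CN=ReadOnly,OU=Users,OU=hyperpod,DC=hyperpod,DC=abc123,DC=com"
    # Obfuscated password from the previous section, never the plain text one.
    ldap_default_authtok = "<obfuscated-password>"
```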
Verification
Let’s verify the solution by logging in to the cluster with SSH. Because the cluster was created in a private subnet, you can’t directly SSH into the cluster from your local environment. You can choose from two options to connect to the cluster.
Option 1: SSH login through AWS Systems Manager
You can use AWS Systems Manager as a proxy for the SSH connection. Add a host entry to the SSH configuration file ~/.ssh/config. For the HostName field, specify the Systems Manager target name in the format of sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]. For the IdentityFile field, specify the file path to the user’s SSH private key. This field is not required if you chose password authentication.
Run the ssh command using the host name you specified. Confirm you can log in to the instance with the specified user.
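The host entry described above can be generated with a small helper. This is a sketch: the target name format comes from this post and the ProxyCommand follows the documented Session Manager SSH setup (aws ssm start-session with the AWS-StartSSHSession document), while the function names, host alias, user name, and key path are hypothetical.

```python
def ssm_target_name(cluster_id, instance_group_name, instance_id):
    """Build the Systems Manager target name for a HyperPod cluster instance."""
    return f"sagemaker-cluster:{cluster_id}_{instance_group_name}-{instance_id}"


def ssm_ssh_config_entry(alias, user, target, identity_file=None):
    """Return a ~/.ssh/config host entry that tunnels SSH through Systems Manager."""
    lines = [
        f"Host {alias}",
        f"    User {user}",
        f"    HostName {target}",
        # %h expands to the HostName value and %p to the SSH port.
        '    ProxyCommand sh -c "aws ssm start-session --target %h '
        '--document-name AWS-StartSSHSession --parameters portNumber=%p"',
    ]
    if identity_file:
        # Only needed for public key authentication.
        lines.append(f"    IdentityFile {identity_file}")
    return "\n".join(lines) + "\n"


# Usage:
#   target = ssm_target_name("abcd1234", "LoginGroup", "i-0123456789abcdef0")
#   print(ssm_ssh_config_entry("MyCluster-LoginNode", "user1", target, "~/keys/user1-key"))
```

With such an entry in place, the login command is `ssh MyCluster-LoginNode`, where the alias is whatever name you chose for the Host line.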
At this point, users can still use the Systems Manager default shell session to log in to the cluster as ssm-user with administrative privileges. To block the default Systems Manager shell access and enforce SSH access, you can configure your IAM policy accordingly.
For more details on how to enforce SSH access, refer to Start a session with a document by specifying the session documents in IAM policies.
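One way to express such a restriction is a policy that allows ssm:StartSession only on the cluster and the AWS-StartSSHSession document, with the ssm:SessionDocumentAccessCheck condition so that a session document must be specified. The sketch below generates that policy as JSON; it is an illustration under these assumptions, not the authors' exact policy, and the function name, Region, account ID, and cluster ID are placeholders. Validate it against the Systems Manager documentation before use.

```python
import json


def ssh_only_session_policy(region, account_id, cluster_id):
    """Return an IAM policy document that allows only SSH sessions through Systems Manager."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ssm:StartSession"],
                "Resource": [
                    f"arn:aws:sagemaker:{region}:{account_id}:cluster/{cluster_id}",
                    "arn:aws:ssm:*:*:document/AWS-StartSSHSession",
                ],
                # Require an explicit session document, which blocks the default shell session.
                "Condition": {"BoolIfExists": {"ssm:SessionDocumentAccessCheck": "true"}},
            }
        ],
    }


# Usage:
#   print(json.dumps(ssh_only_session_policy("us-west-2", "123456789012", "abcd1234efgh"), indent=4))
```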
Option 2: SSH login through bastion host
Another option to access the cluster is to use a bastion host as a proxy. You can use this option when the user doesn’t have permission to use Systems Manager sessions, or to troubleshoot when Systems Manager is not working.
- Create a bastion security group that allows inbound SSH access (TCP port 22) from your local environment.
- Update the security group for the cluster to allow inbound SSH access from the bastion security group.
- Create an EC2 Linux instance.
  - For Amazon Machine Image, choose Ubuntu Server 20.04 LTS.
  - For Instance type, choose t3.small.
  - In the Network settings section, provide the following parameters:
    - For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    - For Subnet, choose the public subnet you created with the CloudFormation template.
    - For Common security groups, choose the bastion security group you created.
  - For Configure storage, set storage to 8 GB.
- Identify the public IP address of the bastion host and the private IP address of the target instance (for example, the login node of the cluster), and add two host entries in the SSH config.
- Run the ssh command using the target host name you specified earlier, and confirm you can log in to the instance with the specified user.
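The two host entries from the steps above might look like the output of the following helper. It is a sketch that assumes OpenSSH's ProxyJump option to route through the bastion (a ProxyCommand-based entry would work as well) and the default ubuntu user of the Ubuntu AMI; the function name, host aliases, IP addresses, and key paths are hypothetical.

```python
def bastion_ssh_config(bastion_ip, target_ip, target_user, bastion_key, target_key=None):
    """Return two ~/.ssh/config entries: the bastion host and the target reached through it."""
    bastion = [
        "Host MyCluster-Bastion",
        f"    HostName {bastion_ip}",
        "    User ubuntu",
        f"    IdentityFile {bastion_key}",
    ]
    target = [
        "Host MyCluster-LoginNode-with-Bastion",
        f"    HostName {target_ip}",
        f"    User {target_user}",
        # Connect to the bastion first, then hop to the private IP address.
        "    ProxyJump MyCluster-Bastion",
    ]
    if target_key:
        # Only needed for public key authentication on the cluster.
        target.append(f"    IdentityFile {target_key}")
    return "\n".join(bastion + [""] + target) + "\n"


# Usage:
#   print(bastion_ssh_config("12.34.56.78", "10.1.111.111", "user1", "~/keys/bastion-key.pem"))
```

The login command is then `ssh MyCluster-LoginNode-with-Bastion`, using the alias of the second entry.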
Clean up
Clean up the resources in the following order:
- Delete the HyperPod cluster.
- Delete the Network Load Balancer.
- Delete the load balancing target group.
- Delete the certificate imported to Certificate Manager.
- Delete the EC2 Windows instance.
- Delete the EC2 Linux instance for the bastion host.
- Delete the AWS Managed Microsoft AD.
- Delete the CloudFormation stack for the VPC, subnets, security group, and FSx for Lustre volume.
Conclusion
This post provided steps to create a HyperPod cluster integrated with Active Directory. This solution removes the hassle of user maintenance on large-scale clusters and allows you to manage users and groups centrally in one place.
For more information about HyperPod, check out the HyperPod workshop and the SageMaker HyperPod Developer Guide. Leave your feedback on this solution in the comments section.
About the Authors
Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing soccer.
Monidipa Chakraborty currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. She is committed to assisting customers by designing and implementing robust and scalable systems that demonstrate operational excellence. Bringing almost a decade of software development experience, Monidipa has contributed to various sectors within Amazon, including Video, Retail, Amazon Go, and AWS SageMaker.
Satish Pasumarthi is a Software Developer at Amazon Web Services. With several years of software engineering and an ML background, he loves to bridge the gap between ML and systems and is passionate about building systems that make large-scale model training possible. He has worked on projects in a variety of domains, including machine learning frameworks, model benchmarking, and building the HyperPod beta, involving a broad set of AWS services. In his free time, Satish enjoys playing badminton.