Deploying on AWS EKS
This HOWTO outlines the setup of a Compute Gateway on AWS and is aimed at the operators in charge of that setup. The described setup is suited for Compute Gateways deployed on an AWS EKS cluster using Terraform.
Please make sure to read through the Prerequisites before beginning a setup.
For existing AWS accounts:
- Deployment of the Compute Gateway into your AWS account should take less than 1 hour
- You may choose any AWS region for your deployment
- We strongly recommend that you do NOT use your root user to deploy the Compute Gateway; instead, use an IAM user or another IAM principal
- For details about support options, additional services, technical support tiers, and SLAs of Apheris, please don't hesitate to contact us via info@apheris.com
Prerequisites
Software dependencies
You need the AWS CLI to access the AWS APIs and, specifically, to connect to your cluster with kubectl.
To install it, follow the official guide and verify the installation afterwards with:
aws --version
To set up a Compute Gateway for AWS EKS using the Apheris-provided Terraform module, you first need Terraform itself. Follow the official installation guide and verify the installation with:
terraform version
Access to the EKS cluster
In order to provision/modify the Kubernetes objects in the cluster, you need to be able to reach the Kubernetes API and have the proper rights in the cluster.
You can run the following command to check your identity:
aws sts get-caller-identity
The returned Account should match your target account, and the role in the Arn should have the right set of permissions to grant you access to the cluster.
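For reference, the command returns a JSON document of the following shape (the values below are purely illustrative):
{
    "UserId": "AIDAEXAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/DevAdmin"
}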
Deliverables
Your Apheris representative provides you with the following deliverables:
- gateway-<version>.zip - a zipped version of the Gateway module
- A Bitwarden Send link containing sensitive Helm Chart values
Apheris shares the password to the Bitwarden Send link via a separate channel for security reasons.
The Bitwarden Send link has an expiration time and is accessible exactly once.
The Gateway-specific secrets shared via the Bitwarden Send link come as a YAML file, ready to be used as the values.yaml for the Gateway Helm chart:
tenant: "tenantID"
auth:
domain: auth.app.apheris.net
orchestrator:
clientId: "clientId"
clientSecret: "clientSecret"
helmRepoUsername: "helmRepoUsername"
helmRepoPassword: "helmRepoPassword"
Roles and users for an Apheris-configured Compute Gateway
When an AWS EKS cluster is created, the IAM principal that creates the cluster is automatically granted system:masters permissions in the cluster's role-based access control configuration in the Amazon EKS control plane. To grant additional IAM principals the ability to interact with your cluster, the aws-auth ConfigMap must be edited. Please refer to the official documentation for more details.
A list of users and/or roles to add to the aws-auth ConfigMap should be shared with Apheris before the installation, so that the set of permissions needed to run the Terraform commands can be granted.
The following example shows how a role will be added to the aws-auth ConfigMap from the Gateway module variables:
aws_auth_roles = [
{
rolearn = "arn:aws:iam::ACCOUNT_ID:role/ROLE"
username = "cluster-admin"
groups = ["system:masters"]
},
]
The following example shows how a user will be added to the aws-auth ConfigMap:
aws_auth_users = [
{
userarn = "arn:aws:iam::ACCOUNT_ID:user/USER"
username = "USER"
groups = ["system:masters"]
},
]
If no groups are explicitly specified, the aws_auth_roles and aws_auth_users entries will have admin access. For more details, please refer to the Inputs section of the Gateway module's README.md file (included in the gateway-<version>.zip).
Scaling
During operations, many computations may be scheduled on the cluster. There is currently a 1:1 relationship between a Kubernetes pod and a computation. Computation pods can be identified by the presence of the apheris_job_id label. The cluster will autoscale to accommodate inbound computations; however, the default Karpenter settings may not be optimal in all cases. Ideally, the available Kubernetes node memory should be a multiple of the most common computation request.
To query the resource requests of currently running computations, run:
kubectl get po -n apheris -l apheris_job_id -o jsonpath='{.items[*].spec.containers[*].resources}'
The ideal scenario is a near-100% memory utilization rate, which minimizes costs.
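To judge how close you are to that, you can compare those requests against the capacity and usage of the running nodes, for example (assuming the Kubernetes Metrics Server is installed in your cluster):
# Current CPU/memory usage per node (requires the Metrics Server)
kubectl top nodes
# Requested vs. allocatable resources per node
kubectl describe nodes | grep -A 8 'Allocated resources:'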
Install / Update procedure
Please follow the described procedure step by step and do not hesitate to contact your Apheris representative if you have questions or concerns.
Adding the Gateway module
Create a folder for your deployment, for example, apheris, and navigate there:
mkdir -p apheris && cd apheris
Copy gateway-<version>.zip to your deployment folder and extract it to the gateway sub-folder. You can use the unzip utility to do that from your shell. For example, for version 0.8.1:
unzip gateway-0.8.1.zip -d gateway
If you list the gateway folder with:
ls gateway/
You should see a set of *.tf files and a modules folder. For example, your output could look similar to the following:
CHANGELOG.md main.tf modules outputs.tf providers.tf README.md variables.tf versions.tf
Adding necessary Terraform files
Unless you have received a customized gateway.tf file, create a new Terraform file like gateway.tf.
The required minimum looks like:
module "gateway" {
source = "./gateway"
name = "gateway"
helm_repository_username = "helmRepoUsername" # value from the Bitwarden Send link
helm_repository_password = "helmRepoPassword" # value from the Bitwarden Send link
helm_chart_values = [file("values.yaml")]
access_entries = []
}
Additionally, create a provider.tf to set the providers that will be used by Terraform. It would look like:
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.0"
}
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.0"
}
}
required_version = ">= 1.6.0"
}
provider "aws" {
region = "eu-central-1" # Replace with your AWS region
}
data "aws_eks_cluster_auth" "default" {
name = module.gateway.cluster_name
}
provider "helm" {
kubernetes {
host = module.gateway.endpoint
cluster_ca_certificate = module.gateway.certificate_authority_data
token = data.aws_eks_cluster_auth.default.token
}
}
provider "kubectl" {
host = module.gateway.endpoint
cluster_ca_certificate = module.gateway.certificate_authority_data
token = data.aws_eks_cluster_auth.default.token
load_config_file = false
}
Copy the values.yaml file included in the Bitwarden Send link described in the Deliverables section of this documentation to the deployment folder.
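For example, assuming you saved the file to your home directory:
cp ~/values.yaml .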
Finally, configure your terraform_backend.tf file to have the Terraform state stored in your preferred backend. As a reference, the file contents might look like:
terraform {
backend "s3" {
bucket = "terraform-s3-bucket"
dynamodb_table = "terraform-dynamodb-table"
encrypt = true
key = "terraform.tfstate"
region = "eu-central-1"
}
}
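If the state bucket and lock table do not exist yet, you could create them with the AWS CLI. A minimal sketch using the placeholder names and region from the example above:
# Create the S3 bucket for the Terraform state (bucket names are globally unique)
aws s3api create-bucket --bucket terraform-s3-bucket --region eu-central-1 --create-bucket-configuration LocationConstraint=eu-central-1
# Create the DynamoDB table for state locking; the S3 backend expects a string partition key named LockID
aws dynamodb create-table --table-name terraform-dynamodb-table --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --billing-mode PAY_PER_REQUEST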
At this point, if you list the directory with:
ls
You should see the following files alongside the gateway folder:
gateway gateway.tf provider.tf terraform_backend.tf values.yaml
Optional: Enabling access to external AWS S3 buckets
To allow access to an S3 bucket that is not managed by the gateway Terraform module, you will need to grant the Kubernetes service account for the Apheris Data Access Layer (DAL) access to the bucket.
The gateway module makes use of IAM Roles for Service Accounts; you can use the dal_role_arn output of the gateway Terraform module to get the ARN of the dedicated IAM role for the DAL service account.
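Note that dal_role_arn is an output of the module, not of your root configuration. A minimal sketch to read it with the Terraform CLI, assuming you re-export it from your root module (e.g., in gateway.tf):
output "dal_role_arn" {
  value = module.gateway.dal_role_arn
}
After the next terraform apply, running terraform output -raw dal_role_arn prints the ARN.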
Use that IAM role as the principal in the IAM policy document for the respective bucket policy.
An example bucket policy looks like:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::ACCOUNT_ID:role/EXAMPLE_DAL_IAM_ROLE_NAME"
]
},
"Action": [
"s3:ListBucket",
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET_NAME",
"arn:aws:s3:::EXAMPLE_BUCKET_NAME/*"
]
}
]
}
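Assuming you have saved the policy above as bucket-policy.json, one way to attach it to the bucket is via the AWS CLI:
aws s3api put-bucket-policy --bucket EXAMPLE_BUCKET_NAME --policy file://bucket-policy.json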
You also need to add the bucket name to the additional_data_buckets list of the gateway Terraform module to set up the network policies, and add it to the values passed to the Helm chart.
For example:
module "gateway" {
...
additional_data_buckets = ["EXAMPLE_BUCKET_NAME"]
helm_chart_values = [
...
yamlencode({
dal : {
sources : {
s3 : ["EXAMPLE_BUCKET_NAME"]
}
}
}),
...
]
...
}
Optional: Enable DAL storage
The DAL can store intermediate data written by computations. This data always stays on the gateway.
You can enable this feature by setting enable_dal_persistence = true for the gateway Terraform module.
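For example, in the module block of your gateway.tf:
module "gateway" {
  # ... existing configuration ...
  enable_dal_persistence = true
}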
Optional: Configure custom node pools
You can configure Karpenter node pools that are better suited to your particular compute needs than the default ones via the karpenter_node_pools variable of the gateway Terraform module.
Apheris computations that use GPUs add a toleration for the nvidia.com/gpu taint; computations that do not use GPUs lack that toleration. Make sure to include the following taint on your GPU node pools, so that non-GPU computations are not scheduled there:
{
key = "nvidia.com/gpu"
effect = "NoSchedule"
}
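For illustration, a GPU node pool carrying this taint might look like the following (the pool name and instance families here are placeholders; the variable shape follows the karpenter_node_pools example shown below):
karpenter_node_pools = {
  gpu-pool = {
    # Placeholder instance families; pick GPU instance types that fit your workloads
    instance_families = ["g5"]
    capacity_type     = ["on-demand"]
    taints = [
      {
        key    = "nvidia.com/gpu"
        effect = "NoSchedule"
      }
    ]
  }
}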
The Apheris Compute Gateway agent supports a basic way of mapping computations to nodes with a specific taint, based on the number of GPUs requested in the compute spec.
For example, to enable computations that use 4 or 8 GPUs to be scheduled on nodes with a specific taint (example.com/expensive), use:
module "gateway" {
...
karpenter_node_pools = {
expensive-pool = {
instance_families = [...]
capacity_type = ["on-demand"]
taints = [
{
key = "example.com/expensive"
effect = "NoSchedule"
}
]
}
}
...
...
helm_chart_values = [
yamlencode(
job = {
nodePoolMapping = {
4 = "example.com/expensive"
8 = "example.com/expensive"
}
}
)
]
...
}
Please note that the mapping only adds a toleration for the specified taints. More involved node pool configurations require adding specific taints to all node pools in order to prevent scheduling of unwanted computations there.
Optional: Enable Asset Policy Signature Validation
Please refer to the guide.
Running terraform init
The terraform init command is required at least once before applying. It's safe to run the terraform apply command directly though (see the Running terraform apply section); Terraform will complain by outputting a message like Run "terraform init", should it be required.
As long as the following holds true:
- the AWS CLI is configured to have access to the cluster
- the gateway module is added
- all the *.tf and the values.yaml files are present
the init command can be run:
terraform init
After the completion, a message similar to the following should be displayed:
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see any changes that are required for your infrastructure. All Terraform commands should now work.
If you ever set or change modules or backend configuration for Terraform, rerun this command to reinitialize your working directory. If you forget, other commands will detect it and remind you to do so if necessary.
Running terraform apply
After the successful execution of the init command, the apply command can be executed:
terraform apply
The following message should show up on your screen:
Plan: X to add, Y to change, Z to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value:
You can scroll up and take a look at the planned changes. Please contact your Apheris representative via your dedicated support channel if you have doubts; otherwise, type yes to approve.
After the completion, a message similar to the following should be displayed:
Apply complete! Resources: X added, Y changed, Z destroyed.
Post-install / Post-upgrade
After the apply command has completed, you may want to verify the following:
- Gateway agent version
- Network connectivity
Verifying the Gateway agent version
To verify that the running agent version matches the expected one, run:
kubectl get deployment -l apheris-component=agent -n apheris -o=jsonpath="{.items[*].spec.template.spec.containers[0].image}" | sed 's/.*:/agent:\t/'
Please confirm that the returned version matches the version that Apheris has communicated while sharing the deliverables.
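For example, if Apheris communicated agent version 0.8.1 with the deliverables, the output would look like:
agent:  0.8.1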
If for some reason you get an error while executing the previous command, please try the following step-by-step guidance:
- Identify the name of the deployment by running:
kubectl get deployment -l apheris-component=agent -n apheris
You should see only one deployment listed; please copy its name. The name will likely be apheris-gateway-agent.
- Check the image tag by running:
kubectl get deployment/YOUR_DEPLOYMENT_NAME -n apheris -o yaml | grep -o 'image:.*'| sed 's/.*:/agent:\t/'
Please make sure that you replace YOUR_DEPLOYMENT_NAME with the correct name (e.g., apheris-gateway-agent).
Verifying the network connectivity
During setup and upgrades, the Gateway accesses endpoints to download additional components.
During normal runtime, the Gateway communicates with a number of Apheris-managed services, Auth0 for user authentication, and (typically) an external Docker registry to pull the Apheris computation images.
The exact list of endpoints that must be accessible from the Gateway is dependent on your platform setup.
The Gateway agent Helm chart comes with a set of connectivity tests that check if the necessary endpoints are reachable. The outcome will display each test suite and its result. The tests can be run with the following command:
helm test apheris-gateway -n apheris
Logs can be inspected using:
kubectl logs apheris-gateway-connectivity-test -n apheris
Please don't hesitate to contact your Apheris representative if you need more details or want to receive the exact list of endpoints that must be accessible for your platform setup.
Wrapping up
Congratulations! The install / upgrade has been completed!
Please contact your Apheris representative on your dedicated support channel or by mailing to support@apheris.com to confirm the completion and request to run the smoke tests.
Uninstall
Perform a terraform destroy on the root module of your gateway, as shown below, and wait for Terraform to remove all related resources.
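From your deployment folder:
terraform destroy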
FAQs
How do I solve Error: cannot re-use a name that is still in use?
In some cases, Helm releases are already present on the cluster but not recorded in the Terraform state.
This will result in error messages that look similar to:
│ Error: cannot re-use a name that is still in use
│
│   with module.gateway.module.gateway_agent[0].helm_release.gateway_agent[0],
│   on .terraform/modules/gateway.gateway_agent/gateway/modules/gateway_agent/gateway_agent.tf line 17, in resource "helm_release" "gateway_agent":
│   17: resource "helm_release" "gateway_agent" {
First, please make sure that the Helm release is not managed by another system.
To resolve the error:
1. Uninstall the respective Helm release (e.g., helm uninstall apheris-gateway -n apheris for the above case)
2. Run terraform apply again
Can I manage the AWS VPC setup?
The Apheris gateway Terraform module will create the VPC resources by default. However, this behavior can be overridden to disable the creation of these resources. To do so, the module needs to be provided with information about which VPC and subnets will be used for deploying the EKS cluster.
Below is an example of how the module receives these inputs:
create_vpc = false
vpc_id = "vpc-045da9dc92fb2a887"
node_subnet_ids = ["subnet-0640da74d7ffb177f", "subnet-0d0e19e8e7ee019a0", "subnet-02e5590fa6f9712c9"]
If you've set up your VPC using the gateway module and have made manual changes to your VPC resources that you want to preserve without being overwritten during the next upgrade, you need to remove the VPC resources from the Terraform state. You can use the following command to achieve this:
terraform state list | grep 'module.vpc' | xargs -I {} terraform state rm '{}'
By doing this, Terraform won't attempt to modify or destroy these resources during future upgrades, safeguarding your customizations.
Note
When managing your VPC resources, please consider creating subnets with a sufficient number of IP addresses, since each Kubernetes pod will allocate an IP address. The reference setup from our Terraform module will create three subnets, each with 254 addresses.