Deploying on AWS EKS
This HOWTO outlines the setup of a Compute Gateway on AWS and is aimed at the operators in charge of that setup. The described setup is suited for Compute Gateways deployed on an AWS EKS cluster using Terraform.
Please make sure to read through the Prerequisites before beginning a setup.
For existing AWS accounts:
- Deployment of the Compute Gateway into your AWS account should take less than 1 hour
- You may choose any AWS region for your deployment
- We strongly recommend that you do NOT use your root user to deploy the Compute Gateway; instead, use an IAM user or another IAM principal
- For details about support options, additional services, technical support tiers, and SLAs of Apheris, please don't hesitate to contact us via info@apheris.com
Prerequisites
Software dependencies
You need the AWS CLI to access the AWS APIs and, specifically, to connect to your cluster with kubectl.
To install it, follow the official guide and verify the installation afterwards with:
aws --version
To set up a Compute Gateway for AWS EKS using the Apheris-provided Terraform module, you first need Terraform itself. Follow the official installation guide and verify the installation with:
terraform version
Access to the EKS cluster
In order to provision/modify the Kubernetes objects in the cluster, you need to be able to reach the Kubernetes API and have the proper rights in the cluster.
You can run the following command to check your identity:
aws sts get-caller-identity
The returned Account should match your target account, and the role in the Arn should have the right set of permissions to grant you access to the cluster.
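For reference, the command returns a JSON document of the following shape (the values below are purely illustrative):
{
    "UserId": "AIDAEXAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/DevAdmin"
}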
Deliverables
Your Apheris representative provides you with the following deliverables:
- gateway-<version>.zip - a zipped version of the Gateway module
- A Bitwarden Send link containing sensitive Helm Chart values
Apheris shares the password to the Bitwarden Send link via a separate channel for security reasons.
The Bitwarden Send link has an expiration time and is accessible exactly once.
The Gateway-specific secrets shared via the Bitwarden Send link come as a YAML file, ready to be used as the values.yaml for the Gateway Helm chart:
tenant: "tenantID"
auth:
domain: auth.app.apheris.net
orchestrator:
clientId: "clientId"
clientSecret: "clientSecret"
helmRepoUsername: "helmRepoUsername"
helmRepoPassword: "helmRepoPassword"
Roles and users for an Apheris-configured Compute Gateway
When an AWS EKS cluster is created, the IAM principal that creates the cluster is automatically granted system:masters permissions in the cluster's role-based access control configuration in the Amazon EKS control plane. To grant additional IAM principals the ability to interact with your cluster, the aws-auth ConfigMap must be edited. Please refer to the official documentation for more details.
A list of users and/or roles to add to the aws-auth ConfigMap should be shared with Apheris before the installation, so that the set of permissions needed to run the Terraform commands can be granted.
The following example shows how a role will be added to the aws-auth ConfigMap from the Gateway module variables:
aws_auth_roles = [
{
rolearn = "arn:aws:iam::ACCOUNT_ID:role/ROLE"
username = "cluster-admin"
groups = ["system:masters"]
},
]
The following example shows how a user will be added to the aws-auth ConfigMap:
aws_auth_users = [
{
userarn = "arn:aws:iam::ACCOUNT_ID:user/USER"
username = "USER"
groups = ["system:masters"]
},
]
If no groups are explicitly specified, the aws_auth_roles and aws_auth_users entries will have admin access. For more details, please refer to the Inputs section of the Gateway module's README.md file (included in the gateway-<version>.zip).
Scaling
During operations, many computations may be scheduled on the cluster. There is currently a 1:1 relationship between a Kubernetes pod and a computation. Computation pods can be identified by the presence of the apheris_job_id label. The cluster will autoscale to accommodate inbound computations; however, the default Karpenter settings may not be optimal in all cases. Ideally, the available Kubernetes node memory should be a multiple of the most common computation request.
To query the resource requests of currently running computations, run:
kubectl get po -n apheris -l apheris_job_id -o jsonpath='{.items[*].spec.containers[*].resources}'
The ideal scenario is a near-100% memory utilization rate, which minimizes costs.
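To judge how close you are to that, you can compare those requests against the capacity and usage of the running nodes, for example (assuming the Kubernetes Metrics Server is installed in your cluster):
# Current CPU/memory usage per node (requires the Metrics Server)
kubectl top nodes
# Requested vs. allocatable resources per node
kubectl describe nodes | grep -A 8 'Allocated resources:'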
Install / Update procedure
Please follow the described procedure step by step and do not hesitate to contact your Apheris representative if you have questions or concerns.
Adding the Gateway module
Create a folder for your deployment, for example, apheris, and navigate there:
mkdir -p apheris && cd apheris
Copy gateway-<version>.zip to your deployment folder and extract it to the gateway sub-folder. You can use the unzip utility to do that from your shell. For example, for version 0.8.1:
unzip gateway-0.8.1.zip -d gateway
If you list the gateway folder with:
ls gateway/
You should see a set of *.tf files and a modules folder. For example, your output could look similar to the following:
CHANGELOG.md main.tf modules outputs.tf providers.tf README.md variables.tf versions.tf
Adding necessary Terraform files
Unless you have received a customized gateway.tf file, create a new Terraform file like gateway.tf.
The required minimum looks like:
module "gateway" {
source = "./gateway"
name = "gateway"
helm_repository_username = "helmRepoUsername" # value from the Bitwarden Send link
helm_repository_password = "helmRepoPassword" # value from the Bitwarden Send link
helm_chart_values = [file("values.yaml")]
access_entries = []
}
Additionally, create a provider.tf to set the providers that will be used by Terraform. It would look like:
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.0"
}
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.0"
}
}
required_version = ">= 1.6.0"
}
provider "aws" {
region = "eu-central-1" # Replace with your AWS region
}
data "aws_eks_cluster_auth" "default" {
name = module.gateway.cluster_name
}
provider "helm" {
kubernetes {
host = module.gateway.endpoint
cluster_ca_certificate = module.gateway.certificate_authority_data
token = data.aws_eks_cluster_auth.default.token
}
}
provider "kubectl" {
host = module.gateway.endpoint
cluster_ca_certificate = module.gateway.certificate_authority_data
token = data.aws_eks_cluster_auth.default.token
load_config_file = false
}
Copy the values.yaml file included in the Bitwarden Send link described in the Deliverables section of this documentation to the deployment folder.
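For example, assuming you saved the file to your home directory:
cp ~/values.yaml .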
Finally, configure your terraform_backend.tf file to have the Terraform state stored in your preferred backend. As a reference, the file contents might look like:
terraform {
backend "s3" {
bucket = "terraform-s3-bucket"
dynamodb_table = "terraform-dynamodb-table"
encrypt = true
key = "terraform.tfstate"
region = "eu-central-1"
}
}
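If the state bucket and lock table do not exist yet, you could create them with the AWS CLI. A minimal sketch using the placeholder names and region from the example above:
# Create the S3 bucket for the Terraform state (bucket names are globally unique)
aws s3api create-bucket --bucket terraform-s3-bucket --region eu-central-1 --create-bucket-configuration LocationConstraint=eu-central-1
# Create the DynamoDB table for state locking; the S3 backend expects a string partition key named LockID
aws dynamodb create-table --table-name terraform-dynamodb-table --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --billing-mode PAY_PER_REQUEST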
At this point, if you list the directory with:
ls
You should see the following files alongside the gateway folder:
gateway gateway.tf provider.tf terraform_backend.tf values.yaml
Optional: Enabling access to external AWS S3 buckets
To allow access to an S3 bucket that is not managed by the gateway Terraform module, you will need to grant the Kubernetes service account for the Apheris Data Access Layer (DAL) access to the bucket.
The gateway module makes use of IAM Roles for Service Accounts; you can use the dal_role_arn output of the gateway Terraform module to get the ARN of the dedicated IAM role for the DAL service account.
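Note that dal_role_arn is an output of the module, not of your root configuration. A minimal sketch to read it with the Terraform CLI, assuming you re-export it from your root module (e.g., in gateway.tf):
output "dal_role_arn" {
  value = module.gateway.dal_role_arn
}
After the next terraform apply, running terraform output -raw dal_role_arn prints the ARN.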
Use that IAM role as the principal in the IAM policy document for the respective bucket policy.
An example bucket policy looks like:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::ACCOUNT_ID:role/EXAMPLE_DAL_IAM_ROLE_NAME"
]
},
"Action": [
"s3:ListBucket",
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET_NAME",
"arn:aws:s3:::EXAMPLE_BUCKET_NAME/*"
]
}
]
}
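Assuming you have saved the policy above as bucket-policy.json, one way to attach it to the bucket is via the AWS CLI:
aws s3api put-bucket-policy --bucket EXAMPLE_BUCKET_NAME --policy file://bucket-policy.json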
You also need to add the bucket name to the additional_data_buckets list of the gateway Terraform module to set up the network policies, and add it to the values passed to the Helm chart.
For example:
module "gateway" {
...
additional_data_buckets = ["EXAMPLE_BUCKET_NAME"]
helm_chart_values = [
...
yamlencode({
dal : {
sources : {
s3 : ["EXAMPLE_BUCKET_NAME"]
}
}
}),
...
]
...
}
Optional: Enable DAL storage
The DAL can store intermediate data written by computations. This data always stays on the gateway.
You can enable this feature by setting enable_dal_persistence = true for the gateway Terraform module.
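For example, in the module block of your gateway.tf:
module "gateway" {
  # ... existing configuration ...
  enable_dal_persistence = true
}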
Optional: Configure custom node pools
You can configure Karpenter node pools that are better suited to your particular compute needs than the default ones via the karpenter_node_pools variable of the gateway Terraform module.
Apheris computations that use GPUs add a toleration for the nvidia.com/gpu taint; computations that do not use GPUs lack that toleration. Make sure to include the following taint on your GPU node pools, so that non-GPU computations are not scheduled there:
{
key = "nvidia.com/gpu"
effect = "NoSchedule"
}
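For illustration, a GPU node pool carrying this taint might look like the following (the pool name and instance families here are placeholders; the variable shape follows the karpenter_node_pools example shown below):
karpenter_node_pools = {
  gpu-pool = {
    # Placeholder instance families; pick GPU instance types that fit your workloads
    instance_families = ["g5"]
    capacity_type     = ["on-demand"]
    taints = [
      {
        key    = "nvidia.com/gpu"
        effect = "NoSchedule"
      }
    ]
  }
}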
The Apheris Compute Gateway agent supports a basic way of mapping computations to nodes with a specific taint, based on the number of GPUs requested in the compute spec.
For example, to enable computations that use 4 or 8 GPUs to be scheduled on nodes with a specific taint (example.com/expensive), use:
module "gateway" {
...
karpenter_node_pools = {
expensive-pool = {
instance_families = [...]
capacity_type = ["on-demand"]
taints = [
{
key = "example.com/expensive"
effect = "NoSchedule"
}
]
}
}
...
...
helm_chart_values = [
yamlencode(
job = {
nodePoolMapping = {
4 = "example.com/expensive"
8 = "example.com/expensive"
}
}
)
]
...
}
Please note that the mapping only adds a toleration for the specified taints. More involved node pool configurations require adding specific taints to all node pools in order to prevent scheduling of unwanted computations there.
Optional: Enable Asset Policy Signature Validation
Please refer to the guide.
Running terraform init
The terraform init command is required at least once before applying. It's safe to run the terraform apply command directly though (see the Running terraform apply section); Terraform will complain by outputting a message like Run "terraform init", should it be required.
As long as the following holds true:
- the AWS CLI is configured to have access to the cluster
- the gateway module is added
- all the *.tf and the values.yaml files are present
the init command can be run:
terraform init
After the completion, a message similar to the following should be displayed:
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see any changes that are required for your infrastructure. All Terraform commands should now work.
If you ever set or change modules or backend configuration for Terraform, rerun this command to reinitialize your working directory. If you forget, other commands will detect it and remind you to do so if necessary.
Running terraform apply
After the successful execution of the init command, the apply command can be executed:
terraform apply
The following message should show up on your screen:
Plan: X to add, Y to change, Z to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value:
You can scroll up and take a look at the planned changes. Please contact your Apheris representative via your dedicated support channel if you have doubts; otherwise, type yes to approve.
After the completion, a message similar to the following should be displayed:
Apply complete! Resources: X added, Y changed, Z destroyed.
Post-install / Post-upgrade
After the apply command has completed, you may want to verify the following:
- Gateway agent version
- Network connectivity
Verifying the Gateway agent version
To verify that the running agent version matches the expected one, run:
kubectl get deployment -l apheris-component=agent -n apheris -o=jsonpath="{.items[*].spec.template.spec.containers[0].image}" | sed 's/.*:/agent:\t/'
Please confirm that the returned version matches the version that Apheris has communicated while sharing the deliverables.
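For example, if Apheris communicated agent version 0.8.1 with the deliverables, the output would look like:
agent:  0.8.1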
If for some reason you get an error while executing the previous command, please try the following step-by-step guidance:
- Identify the name of the deployment by running:
kubectl get deployment -l apheris-component=agent -n apheris
You should see only one deployment listed; please copy its name. The name will likely be apheris-gateway-agent.
- Check the image tag by running:
kubectl get deployment/YOUR_DEPLOYMENT_NAME -n apheris -o yaml | grep -o 'image:.*'| sed 's/.*:/agent:\t/'
Please make sure that you replace YOUR_DEPLOYMENT_NAME with the correct name (e.g., apheris-gateway-agent).
Verifying the network connectivity
During setup and upgrades, the Gateway accesses endpoints to download additional components.
During normal runtime, the Gateway communicates with a number of Apheris-managed services, Auth0 for user authentication, and (typically) an external Docker registry to pull the Apheris computation images.
The exact list of endpoints that must be accessible from the Gateway is dependent on your platform setup.
The Gateway agent Helm chart comes with a set of connectivity tests that check if the necessary endpoints are reachable. The outcome will display each test suite and its result. The tests can be run with the following command:
helm test apheris-gateway -n apheris
Logs can be inspected using:
kubectl logs apheris-gateway-connectivity-test -n apheris
Please don't hesitate to contact your Apheris representative if you need more details or want to receive the exact list of endpoints that must be accessible for your platform setup.
Wrapping up
Congratulations! The install / upgrade has been completed!
Please contact your Apheris representative on your dedicated support channel or by mailing to support@apheris.com to confirm the completion and request to run the smoke tests.
Uninstall
Perform a terraform destroy on the root module of your gateway, as shown below, and wait for Terraform to remove all related resources.
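From your deployment folder:
terraform destroy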
FAQs
How do I solve Error: cannot re-use a name that is still in use?
In some cases, Helm releases are already present on the cluster but not recorded in the Terraform state.
This will result in error messages that look similar to:
│ Error: cannot re-use a name that is still in use
│
│   with module.gateway.module.gateway_agent[0].helm_release.gateway_agent[0],
│   on .terraform/modules/gateway.gateway_agent/gateway/modules/gateway_agent/gateway_agent.tf line 17, in resource "helm_release" "gateway_agent":
│   17: resource "helm_release" "gateway_agent" {
First, please make sure that the Helm release is not managed by another system.
To resolve the error:
1. Uninstall the respective Helm release (e.g., helm uninstall apheris-gateway -n apheris for the above case)
2. Run terraform apply again
Can I manage the AWS VPC setup?
The Apheris gateway Terraform module will create the VPC resources by default. However, this behavior can be overridden to disable the creation of these resources. To do so, the module needs to be provided with information about which VPC and subnets will be used for deploying the EKS cluster.
Below is an example of how the module receives these inputs:
create_vpc = false
vpc_id = "vpc-045da9dc92fb2a887"
node_subnet_ids = ["subnet-0640da74d7ffb177f", "subnet-0d0e19e8e7ee019a0", "subnet-02e5590fa6f9712c9"]
If you've set up your VPC using the gateway module and have made manual changes to your VPC resources that you want to preserve without being overwritten during the next upgrade, you need to remove the VPC resources from the Terraform state. You can use the following command to achieve this:
terraform state list | grep 'module.vpc' | xargs -I {} terraform state rm '{}'
By doing this, Terraform won't attempt to modify or destroy these resources during future upgrades, safeguarding your customizations.
Note
When managing your VPC resources, please consider creating subnets with a sufficient number of IP addresses, since each Kubernetes pod will allocate an IP address. The reference setup from our Terraform module will create three subnets, each with 254 addresses.