
Troubleshooting Guide

This guide addresses common operational issues encountered when deploying or running the Apheris Hub. For each scenario, recommended actions and example commands are provided.

Accessing Hub Logs

To diagnose most issues, review the Hub logs:

For Docker deployments:

docker logs apheris-hub

To follow logs in real-time:

docker logs -f apheris-hub

For Kubernetes/Helm deployments:

kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-hub

To follow logs in real-time:

kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-hub -f
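
If you need to narrow the output or share it with support, both commands accept time and line filters, and the output can be redirected to a file. For example (the one-hour window, line count, and file name are placeholders; adjust as needed):

docker logs --since 1h --tail 500 apheris-hub > hub.log 2>&1

kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-hub --since=1h --tail=500 > hub.log 2>&1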

Common Issues and Solutions

Insufficient Disk Space (Docker Deployments)

Symptoms: Errors such as no space left on device or failed model installation.

Basic Solution:

  • Ensure you have sufficient free disk space to accommodate large model images and Docker data; some models and their dependencies require significant storage (commands to check current usage are shown after this list).
  • Remove unused Docker images and volumes:

    docker system prune -a
    
  • After freeing space, try your operation again.
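
To check how much space is left and how much of it Docker itself is using, you can run the following (assuming the default data root of /var/lib/docker; adjust the path if yours differs):

df -h /var/lib/docker
docker system df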

Advanced: Changing Docker's Data Root (if the disk holding Docker's default storage path is full):

If your Docker root directory (often /var/lib/docker) is full, you can move Docker's storage to a different disk or partition with more space. This is especially useful if you cannot free up enough space on the default volume.

Steps:

  • Stop Docker and its socket:

    sudo systemctl stop docker
    sudo systemctl stop docker.socket
    

    This ensures Docker is not using any files while you change its configuration.

  • Create the new Docker data root directory:

    sudo mkdir -p /somepath
    

    Replace /somepath with the path to a disk or partition with sufficient free space.

  • (If needed) Create the Docker config directory:

    sudo mkdir -p /etc/docker
    

    Only necessary if /etc/docker does not already exist. This directory is required for the daemon.json configuration file.

  • Configure Docker to use the new data root: Add the following to /etc/docker/daemon.json (create the file if it does not exist):

    {
      "data-root": "/somepath"
    }
    

    This tells Docker to store all images, containers, and volumes in the new location.

  • Restart Docker:

    sudo systemctl start docker
    
  • Verify the new Docker root directory:

    docker info | grep Root
    

    This should show the new data root path you configured.

Note

Before proceeding, consider the following:

  • You may need to migrate existing images/volumes if you want to preserve them (a minimal migration sketch follows this note). For most troubleshooting, starting with a clean data root is sufficient.
  • Always ensure the new location has enough space for large model images and future growth.
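
If you do want to preserve existing images and volumes when changing the data root, one common approach is to copy the old directory to the new location while Docker is stopped. This is a minimal sketch, assuming rsync is installed and /somepath is the new data root from the steps above; verify the copy before removing the old directory:

sudo rsync -aP /var/lib/docker/ /somepath/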

GPU or Driver Issues

Symptoms:

  • The Hub fails to detect a GPU or reports driver/toolkit errors.
  • Model inference fails or is unstable.

Solution:

  • Check GPU hardware and driver status:

    • List available GPUs and their memory:

      nvidia-smi
      

      This shows GPU model, driver version, and available VRAM. Confirm you have enough VRAM for your workload (e.g., OpenFold3 requires GPUs with at least 40GB VRAM).

    • Check driver version:

      nvidia-smi --query-gpu=driver_version --format=csv
      

      Ensure the driver version matches the requirements for your CUDA toolkit and models.

    • If nvidia-smi fails, the driver may not be installed or loaded. Try:

      lsmod | grep nvidia
      

      If nothing is returned, the Nvidia kernel module is not loaded.

  • Confirm Nvidia Container Toolkit is installed:

    • Check installation:

      • On Debian/Ubuntu systems (using apt):

        dpkg -l | grep nvidia-container-toolkit
        
      • On Red Hat/CentOS/Fedora systems (using yum/dnf):

        rpm -qa | grep nvidia-container-toolkit
        
      • On other distributions, check your package manager's documentation for how to list installed packages.
    • For Docker, verify GPU access. The nvidia/cuda image does not publish a latest tag, so use a specific version tag that is available on Docker Hub, for example:

      docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
      

      This should show your GPU details inside the container. If it fails, the toolkit or driver may not be installed or configured correctly.

  • Check VRAM requirements for your model:

    • Some models (e.g., OpenFold3) require GPUs with at least 40GB VRAM for stable inference. If your GPU has less, inference may fail or be unstable.
    • Use nvidia-smi to check available VRAM and compare it with the model's requirements (see the query example after this list).
  • General debugging tips (Docker Deployments):

    • Reboot the system after installing or updating drivers.
    • Ensure Docker is started with GPU support (--gpus all).
    • Check container logs for CUDA or driver errors:

      docker ps --filter "label=apheris.hub=true" --format "{{.Names}}" | xargs -r -I {} docker logs {} 2>&1 | grep -Ei "cuda|nvidia"
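
For a quick view of total and used memory per GPU, the standard nvidia-smi query flags can be used, for example:

nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv

Compare memory.total against the VRAM requirement of the model you plan to run.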
      

Advanced: Nvidia Container Runtime, cgroups, and VRAM Troubleshooting

If you have installed the Nvidia Container Toolkit but still cannot start containers with GPU support, or see errors like unknown or invalid runtime name: nvidia, follow these steps:

  • Verify Nvidia driver and hardware on the host:

    sudo nvidia-smi
    

    This should show your GPUs and driver status. If it fails, the driver is not installed or loaded.

  • Test container GPU access (without sandboxing):

    sudo docker run --rm --runtime=nvidia --gpus=all --privileged ubuntu nvidia-smi
    
  • Test with Nvidia environment variables:

    sudo docker run --rm --runtime=nvidia \
      -e NVIDIA_VISIBLE_DEVICES=all \
      -e NVIDIA_DRIVER_CAPABILITIES=all \
      ubuntu nvidia-smi || true
    
  • Fix cgroups issues in Nvidia container runtime:

    If you see errors related to cgroups or the Nvidia runtime, you may need to comment out the no-cgroups setting in the Nvidia container runtime config. This is a common fix for GPU access issues on some systems.

    Run:

    sudo sed -i 's/no-cgroups = /#no-cgroups = /' /etc/nvidia-container-runtime/config.toml
    

    Then test GPU access in a container:

    sudo docker run --rm --runtime=nvidia \
      -e NVIDIA_VISIBLE_DEVICES=all \
      -e NVIDIA_DRIVER_CAPABILITIES=all \
      ubuntu nvidia-smi
    

    If you see your GPU listed, the issue is resolved.

  • Check Docker and system cgroup configuration:

    sudo docker info | grep -iE '(version|cgrou)'
    sudo mount | grep cgroup2
    

If you attempted rootless setup, ensure it is not interfering with GPU access. Revert to standard Docker if needed.
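
If the error is specifically unknown or invalid runtime name: nvidia, the runtime may not be registered with Docker at all. Recent versions of the Nvidia Container Toolkit include the nvidia-ctk helper, which writes the required runtime entries into /etc/docker/daemon.json; a sketch of that approach:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker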

Pairing session: If you continue to have issues, we recommend scheduling a pairing session with Apheris support to resolve complex setup problems interactively.

VRAM Requirements for OpenFold3 and Other Models

For OpenFold3, we recommend GPUs with at least 40GB VRAM for stable inference. On GPUs with less VRAM, inference may work for small inputs, but errors or out-of-memory (OOM) events are likely with larger workloads.

If you encounter OOM errors, check the logs for clear error messages. If the model fails due to insufficient memory, consider upgrading your hardware or using cloud resources with larger GPUs.
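
To see whether a job is approaching the limit, you can poll GPU memory while the workload runs (the 5-second interval below is arbitrary; stop with Ctrl+C):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5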

If you need help interpreting error messages or hardware requirements, contact Apheris support.

GPU Resource Conflicts Between Models

Symptoms:

  • A job crashes with NVML errors after another model has been running
  • Error messages like pynvml.NVMLError_Unknown: Unknown Error or Can't initialize NVML in model logs
  • One model works fine initially, but subsequent models fail to access the GPU
  • Jobs fail with GPU-related errors even though the GPU appears available

Cause:

When multiple models share the same GPU(s) using --gpus all, GPU resources (particularly CUDA contexts, memory, and NVML state) may not be properly released when a model completes its work. This can prevent other models from initializing their GPU access, especially with models that have complex GPU requirements like OpenFold3 and Boltz2.

Solution:

The most reliable way to resolve GPU resource conflicts is to restart the affected model containers/pods to force a clean GPU state:

For Docker deployments:

Restart the affected model container:

docker restart <modelname>

For example, to restart both the OpenFold3 and Boltz2 containers after a conflicting run:

docker restart openfold3 boltz2

For Kubernetes/Helm deployments:

Restart the affected model deployment:

kubectl rollout restart deployment/hub-<modelname> -n apheris-hub

For example, to restart the OpenFold3 model:

kubectl rollout restart deployment/hub-openfold3 -n apheris-hub
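
After restarting, you can verify on the host that no stale processes from the previous run are still holding GPU memory (an optional check using standard nvidia-smi query flags; an empty result apart from the header means the GPU is free):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv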

Prevention:

While GPU resource conflicts can occur with any multi-model setup, restarting model containers is a reliable workaround. If you encounter this issue frequently, consider:

  • Running models on dedicated GPUs if you have multiple GPUs available
  • Configuring GPU device IDs in your deployment to isolate models (see the sketch after this list)
  • Contacting support@apheris.com for guidance on GPU resource management strategies
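
As an illustration of the device-pinning approach for Docker deployments, a container can be restricted to a single GPU with the --gpus device syntax. This is a sketch only; the device index is an example, and your deployment may expose this setting through its own configuration rather than a raw docker run command:

# Only GPU 0 is visible inside this container:
docker run --rm --gpus '"device=0"' nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

On Kubernetes, the equivalent is typically handled by requesting nvidia.com/gpu resources so the device plugin assigns distinct GPUs to different pods.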

Model Not Deployed

Symptoms: When attempting to check model status, you see an error message on the model's page: Model not deployed.

Cause: This error indicates that the requested model is either:

  1. Disabled in the deployment configuration - The model was not enabled in your deployment settings
  2. Unreachable - The model is not running or the Hub cannot connect to it

Solution:

  • Check if the model is enabled in your configuration:

    Review your deployment configuration and verify the model's enabled setting. For example, to enable OpenFold3:

    models:
      openfold3:
        enabled: true
    

    If the model was disabled, update the configuration and redeploy.

  • Test model connectivity from your local machine:

    For Docker deployments:

    Check if the model is responding by querying its weights endpoint. Replace <MODEL_PORT> with the model's configured port from config.yaml (e.g., Mock defaults to 7771):

    curl http://127.0.0.1:<MODEL_PORT>/weights
    

    If the model is running and accessible, you should see a JSON response with available weights.

    For Kubernetes/Helm deployments:

    First, identify the model's service and port:

    kubectl get svc -n apheris-hub -l app.kubernetes.io/name=hub-<modelname>
    

    For example, to find the mock model service:

    kubectl get svc -n apheris-hub -l app.kubernetes.io/name=hub-mock
    

    This shows the service name and ports. Then set up a port-forward to access it locally:

    kubectl port-forward -n apheris-hub svc/hub-<modelname> <LOCAL_PORT>:<SERVICE_PORT>
    

    For example, to port-forward the mock model (typically runs on port 8000):

    kubectl port-forward -n apheris-hub svc/hub-mock 7771:8000
    

    In another terminal, test the connection:

    curl http://127.0.0.1:7771/weights
    

    If the model is running correctly, you should see a JSON response with available weights:

    {
      "available_weights": [
        {
          "model_type": "mock",
          "model_version_id": "mock:vX.Y.Z",
          ...
        }
      ]
    }
    

    If you get a connection error, the model is either not running or not exposed on that port.

  • Check if the model pod/container is running:

    For Docker deployments:

    docker ps --filter "label=apheris.hub=true"
    

    Look for containers with names matching the model (e.g., mock, openfold3). For example, to check if the mock model is running:

    docker ps --filter "name=mock"
    

    For Kubernetes/Helm deployments:

    kubectl get pods -n apheris-hub -l app.kubernetes.io/name=hub-<modelname>
    

    Replace <modelname> with the model name (e.g., mock, openfold3). For example:

    kubectl get pods -n apheris-hub -l app.kubernetes.io/name=hub-mock
    
  • Check model logs for errors:

    For Docker deployments:

    docker logs <modelname>
    

    For example, to check mock model logs:

    docker logs mock
    

    For Kubernetes/Helm deployments:

    kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-<modelname>
    

    For example, to check mock model logs:

    kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-mock
    

    Look for startup errors, port conflicts, GPU/driver issues, or resource constraints (e.g., out of memory).

  • Verify network connectivity between Hub and model:

    The Hub must be able to reach the model service. You can verify connectivity by checking the model logs to see if the Hub is successfully making requests.

    For Docker deployments:

    Check the model container logs to see if the Hub is making successful requests:

    docker logs <modelname> --tail=20
    

    For example, to check if the Hub is connecting to the mock model:

    docker logs mock --tail=20
    

    You should see log entries showing HTTP requests from the Hub's IP address (e.g., INFO: 172.19.0.4:50222 - "GET /weights HTTP/1.1" 200 OK). If you see successful 200 responses, the Hub can reach the model.

    For Kubernetes/Helm deployments:

    Check the model pod logs to see if the Hub is making successful requests:

    kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-<modelname> --tail=20
    

    For example, to check if the Hub is connecting to the mock model:

    kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-mock --tail=20
    

    You should see log entries showing HTTP requests from the Hub's IP address (e.g., INFO: 10.244.0.9:55890 - "GET /weights HTTP/1.1" 200 OK). If you see successful 200 responses, the Hub can reach the model.

    If you don't see any requests from the Hub, or see connection errors, there may be network policy issues, DNS resolution problems, or the model service is not properly exposed.
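
    As an additional check, you can try to reach the model endpoint from inside the Hub itself, which exercises the same network path the Hub uses. This is a sketch only: the Hub image may not include curl, and the hostnames, deployment name, and ports below are placeholders you need to adapt to your setup.

    For Docker deployments:

    docker exec apheris-hub curl -s http://<modelname>:<MODEL_PORT>/weights

    For Kubernetes/Helm deployments:

    kubectl exec -n apheris-hub deploy/<HUB_DEPLOYMENT> -- curl -s http://hub-<modelname>:<SERVICE_PORT>/weights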

If the issue persists after following these steps, please contact support@apheris.com with the diagnostic information.

Support ZIP

The Support ZIP feature provides a convenient way to collect diagnostic information that can help support teams better understand your deployment setup. This archive contains structured JSON data with:

  • System and hardware details (OS, architecture, Go version, CPU count, GPU/driver status)
  • Hub configuration with sensitive data redacted (input/output directories, API URL, authentication settings, timeouts, websocket settings, etc.)
  • Installed model versions with their current status, endpoints, and available weights

While this information can provide useful context about your environment, additional troubleshooting steps may still be needed to resolve specific issues.

For a detailed breakdown of what is included in the Support ZIP archive and how redaction works, see Support ZIP Archive: Contents and Redaction Details. For further questions, contact support.

How to Generate a Support ZIP

You can generate a Support ZIP archive in two ways:

From the UI:

  1. In the Apheris Hub UI, navigate to the Settings tab.
  2. Go to the Support section.
  3. Click on the Download Support Zip button to generate and download the archive.
  4. Provide the downloaded archive when contacting support (email: support@apheris.com).

Via API: Use the following curl command, replacing <HUB_API_URL> with your Hub's URL (e.g., http://localhost:8080):

curl -o support.zip <HUB_API_URL>/api/v1/support/zip
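
To confirm the download produced a valid archive, you can list its contents (assuming the unzip utility is installed):

unzip -l support.zip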

Additional Support

If issues persist, contact support@apheris.com. When reaching out, include relevant logs and, if the Hub is running, a Support ZIP archive (Settings > Support > Download Support Zip) to help provide context about your environment and configuration.