Troubleshooting Guide
This guide addresses common operational issues encountered when deploying or running the Apheris Hub. For each scenario, recommended actions and example commands are provided.
Accessing Hub Logs
To diagnose most issues, review the Hub container logs:
docker logs apheris-hub
If you face "permission denied" errors, add sudo before the command for elevated privileges.
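When hunting for a specific failure, it often helps to follow the logs live or filter them for error-level messages. A small sketch using standard docker options (the container name matches the command above; the grep pattern is a heuristic):

```bash
# Follow the Hub logs live, starting from the last 200 lines
docker logs -f --tail 200 apheris-hub

# Or scan only the last hour of logs for error-level messages
docker logs --since 1h apheris-hub 2>&1 | grep -iE "error|fatal|panic"
```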
Common Issues and Solutions
Insufficient Disk Space
Symptoms: Errors such as no space left on device or failed model installation.
Basic Solution:
- Ensure you have sufficient free disk space to accommodate large model images and Docker data. Some models and their dependencies can require significant storage.
- Remove unused Docker data (stopped containers, unused images, and build cache):
docker system prune -a
Add --volumes to also remove unused volumes.
- After freeing space, try your operation again.
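Before and after pruning, it is useful to quantify where the space is going. A quick sketch (the path assumes Docker's default data root):

```bash
# Free space on the filesystem backing Docker's default data root
df -h /var/lib/docker

# Docker's own accounting of images, containers, volumes, and build cache
docker system df
```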
Advanced: Changing Docker's Data Root (if disk is full on default Docker storage path):
If your Docker root directory (often /var/lib/docker) is full, you can move Docker's storage to a different disk or partition with more space. This is especially useful if you cannot free up enough space on the default volume.
Steps:
- Stop Docker and its socket:
sudo systemctl stop docker
sudo systemctl stop docker.socket
This ensures Docker is not using any files while you change its configuration.
- Create the new Docker data root directory:
sudo mkdir -p /somepath
Replace /somepath with the path to a disk or partition with sufficient free space.
- (If needed) Create the Docker config directory:
sudo mkdir -p /etc/docker
Only necessary if /etc/docker does not already exist. This directory is required for the daemon.json configuration file.
- Configure Docker to use the new data root: add the following to /etc/docker/daemon.json (create the file if it does not exist):
{
  "data-root": "/somepath"
}
This tells Docker to store all images, containers, and volumes in the new location.
- Restart Docker:
sudo systemctl start docker
- Verify the new Docker root directory:
docker info | grep Root
This should show the new data root path you configured.
Note
Before proceeding, consider the following:
- You may need to migrate existing images/volumes if you want to preserve them. For most troubleshooting, starting with a clean data root is sufficient.
- Always ensure the new location has enough space for large model images and future growth.
Missing or Invalid API Key
Symptoms:
- Model installation or image pulls fail.
- The UI displays: Authentication to the registry failed. Please verify APH_HUB_API_KEY and try again.
- In the logs, you may see errors such as registry unauthorized or similar messages.
Solution:
- Set or update the API key:
Obtain a valid API key from Apheris if needed.
To set the variable for your Hub container, run:
docker run --name apheris-hub \
  ... \
  -e APH_HUB_API_KEY=your-api-key-here \
  ...
- Verify resolution:
  - Try the operation again in the UI. The error message should no longer appear if the key is valid.
  - Check the logs for absence of unauthorized or authentication errors:
docker logs apheris-hub | grep -i unauthorized
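You can also confirm that the variable actually reached the container's environment. A sketch (assumes the container image ships standard utilities; avoid pasting the key's value into support tickets):

```bash
# Check whether APH_HUB_API_KEY is set inside the running container
if docker exec apheris-hub env | grep -q '^APH_HUB_API_KEY='; then
  echo "API key is set"
else
  echo "API key is missing"
fi
```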
Using a Custom Application Definition File
You can provide a custom application definition YAML file to the Hub using the APH_HUB_APPLICATION_DEFINITION_FILE environment variable. This allows you to override or extend the default application configuration.
How to use with Docker:
- Place your custom YAML file (e.g., my-app-def.yaml) on the host machine.
- Add the following options to your docker run command:
docker run --name apheris-hub \
  ... \
  -e APH_HUB_APPLICATION_DEFINITION_FILE=/app/data.yaml \
  -v /path/to/my-app-def.yaml:/app/data.yaml:ro \
  ...
Replace /path/to/my-app-def.yaml with the full path to your YAML file on the host.
This mounts your custom YAML file into the container and tells the Hub to use it as the application definition.
If the file is not found or the environment variable is not set, the Hub will use its default configuration.
- Confirm the Hub is using your custom definition by checking the application list in the UI or logs:
docker logs apheris-hub | grep applicationDefinitionFile
If the output is empty, the Hub is using the default application definition file.
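To double-check the mount itself, you can inspect the running container. A sketch, assuming the /app/data.yaml target path used above and an image that provides printenv and ls:

```bash
# The env var should point at the mounted path
docker exec apheris-hub printenv APH_HUB_APPLICATION_DEFINITION_FILE

# The file should be present (and read-only) at that path
docker exec apheris-hub ls -l /app/data.yaml
```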
If issues persist, contact Apheris support for further assistance.
GPU or Driver Issues
Symptoms:
- The Hub fails to detect a GPU or reports driver/toolkit errors.
- Model inference fails or is unstable.
Solution:
- Check GPU hardware and driver status:
  - List available GPUs and their memory:
nvidia-smi
This shows GPU model, driver version, and available VRAM. Confirm you have enough VRAM for your workload (e.g., OpenFold3 requires GPUs with at least 40GB VRAM).
  - Check driver version:
nvidia-smi --query-gpu=driver_version --format=csv
Ensure the driver version matches the requirements for your CUDA toolkit and models.
  - If nvidia-smi fails, the driver may not be installed or loaded. Try:
lsmod | grep nvidia
If nothing is returned, the Nvidia kernel module is not loaded.
- Confirm the Nvidia Container Toolkit is installed:
  - On Debian/Ubuntu systems (using apt):
dpkg -l | grep nvidia-container-toolkit
  - On Red Hat/CentOS/Fedora systems (using yum/dnf):
rpm -qa | grep nvidia-container-toolkit
  - On other distributions, check your package manager's documentation for how to list installed packages.
- For Docker, verify GPU access:
docker run --rm --gpus all nvidia/cuda:latest nvidia-smi
This should show your GPU details inside the container. If it fails, the toolkit or driver may not be installed or configured correctly. If the latest tag is unavailable, substitute a CUDA image tag that exists in your registry.
- Check VRAM requirements for your model:
  - Some models (e.g., OpenFold3) require GPUs with at least 40GB VRAM for stable inference. If your GPU has less, inference may fail or be unstable.
  - Use nvidia-smi to check available VRAM and compare with model requirements.
- General debugging tips:
  - Reboot the system after installing or updating drivers.
  - Ensure containers are started with GPU support (--gpus all).
  - Check container logs for CUDA or driver errors:
```bash
docker ps --filter "label=apheris.hub=true" --format "{{.Names}}" | xargs -r -I {} docker logs {} 2>&1 | grep -Ei "cuda|nvidia"
```
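Comparing available VRAM against a model's requirement can also be scripted. A sketch using standard nvidia-smi query flags, with the 40GB (40960 MiB) threshold from the OpenFold3 guidance above:

```bash
# Print each GPU's total VRAM in MiB and flag GPUs below 40GB
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits \
  | awk -F', ' '{ ok = ($2 >= 40960) ? "ok" : "below 40GB"; print $1 ": " $2 " MiB (" ok ")" }'
```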
Advanced: Nvidia Container Runtime, cgroups, and VRAM Troubleshooting
If you have installed the Nvidia Container Toolkit but still cannot start containers with GPU support, or see errors like unknown or invalid runtime name: nvidia, follow these steps:
- Verify Nvidia driver and hardware on the host:
sudo nvidia-smi
This should show your GPUs and driver status. If it fails, the driver is not installed or loaded.
- Test container GPU access (without sandboxing):
sudo docker run --rm --runtime=nvidia --gpus=all --privileged ubuntu nvidia-smi
- Test with Nvidia environment variables:
sudo docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  ubuntu nvidia-smi || true
- Fix cgroups issues in the Nvidia container runtime:
If you see errors related to cgroups or the Nvidia runtime, you may need to comment out the no-cgroups setting in the Nvidia container runtime config. This is a common fix for GPU access issues on some systems. Run:
sudo sed -i 's/no-cgroups = /#no-cgroups = /' /etc/nvidia-container-runtime/config.toml
Then test GPU access in a container:
sudo docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  ubuntu nvidia-smi
If you see your GPU listed, the issue is resolved.
- Check Docker and system cgroup configuration:
sudo docker info | grep -iE '(version|cgrou)'
sudo mount | grep cgroup2
If you attempted rootless setup, ensure it is not interfering with GPU access. Revert to standard Docker if needed.
Pairing session: If you continue to have issues, we recommend scheduling a pairing session with Apheris support to resolve complex setup problems interactively.
VRAM Requirements for OpenFold3 and Other Models
For OpenFold3, we recommend GPUs with at least 40GB VRAM for stable inference. Inference may work for small inputs, but errors or out-of-memory (OOM) events are likely with larger workloads.
If you encounter OOM errors, check the logs for clear error messages. If the model fails due to insufficient memory, consider upgrading your hardware or using cloud resources with larger GPUs.
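A quick way to confirm an OOM is to filter the Hub logs for memory-related errors. A sketch (the pattern is a heuristic, not an exhaustive list of OOM messages):

```bash
# Look for out-of-memory indicators in the Hub's logs
docker logs apheris-hub 2>&1 | grep -iE "out.?of.?memory|oom|allocat"
```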
If you need help interpreting error messages or hardware requirements, contact Apheris support.
Port Conflict
Symptoms: Errors such as Bind for 0.0.0.0:8080 failed: port is already allocated.
Cause: Another process is already using port 8080 on your system.
Solution: Map a different local port (e.g., 8081) to the container's internal port 8080 by modifying the docker run command:
--publish=127.0.0.1:8081:8080
After starting, access the Hub at http://localhost:8081.
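Alternatively, you can free the port by finding what currently holds it. A sketch (ss ships with iproute2 on most Linux systems; lsof is an alternative where installed):

```bash
# Show the listener bound to TCP port 8080, including the owning process
sudo ss -ltnp 'sport = :8080'

# Or, with lsof:
sudo lsof -iTCP:8080 -sTCP:LISTEN
```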
Support ZIP
The Support ZIP feature provides a one-click way to collect all relevant diagnostic information for troubleshooting and support. This archive helps resolve issues faster by bundling:
- System and hardware details (system_info.json)
- Application configuration (with sensitive data redacted)
- Container and image summaries
- Model registry (data.yaml)
- Container logs with tailing/truncation (logs/)
- Sanitized container inspection data (inspections/)
For a detailed breakdown of what is included in the Support ZIP archive and how redaction works, see Support ZIP Archive: Contents and Redaction Details. For further questions, contact support.
How to Generate a Support ZIP
You can generate a Support ZIP archive in two ways:
From the UI:
- In the Apheris Hub UI, navigate to the Settings tab.
- Go to the Support section.
- Click on the Download Support Zip button to generate and download the archive.
- Provide the downloaded archive when contacting support (email: support@apheris.com).
Via API:
Use the following curl command, replacing <HUB_API_URL> with your Hub's URL (e.g., http://localhost:8080):
curl -o support.zip "<HUB_API_URL>/api/v1/support/zip?tailLines=1000"
You can adjust the tailLines parameter to control how many lines of logs are included (set to 0 for full logs).
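The download and a sanity check can be combined before sending the archive (localhost:8080 is the example URL from above; unzip must be installed on the host):

```bash
# Download the Support ZIP and fail loudly on HTTP errors
curl -fsS -o support.zip "http://localhost:8080/api/v1/support/zip?tailLines=1000"

# List the archive contents to confirm logs and system info are included
unzip -l support.zip
```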
Additional Support
If issues persist, contact support@apheris.com and provide logs or a Support ZIP archive if available (Settings > Support > Download Support Zip).