
Troubleshooting Guide

This guide addresses common operational issues encountered when deploying or running the Apheris Hub. For each scenario, recommended actions and example commands are provided.

Accessing Hub Logs

To diagnose most issues, review the Hub logs:

For Docker deployments:

docker logs apheris-hub

To follow logs in real-time:

docker logs -f apheris-hub

For Kubernetes/Helm deployments:

kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-hub

To follow logs in real-time:

kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-hub -f
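
If you need to narrow the output or share it with support, both commands accept time and line filters, and the output can be redirected to a file. For example (the one-hour window, line count, and file name are placeholders; adjust as needed):

docker logs --since 1h --tail 500 apheris-hub > hub.log 2>&1

kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-hub --since=1h --tail=500 > hub.log 2>&1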

Common Issues and Solutions

Insufficient Disk Space (Docker Deployments)

Symptoms: Errors such as no space left on device or failed model installation.

Basic Solution:

  • Ensure you have sufficient free disk space to accommodate large model images and Docker data; some models and their dependencies require significant storage (commands to check current usage are shown after this list).
  • Remove unused Docker images and volumes:

    docker system prune -a
    
  • After freeing space, try your operation again.
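
To check how much space is left and how much of it Docker itself is using, you can run the following (assuming the default data root of /var/lib/docker; adjust the path if yours differs):

df -h /var/lib/docker
docker system df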

Advanced: Changing Docker's Data Root (if the disk holding Docker's default storage path is full):

If your Docker root directory (often /var/lib/docker) is full, you can move Docker's storage to a different disk or partition with more space. This is especially useful if you cannot free up enough space on the default volume.

Steps:

  • Stop Docker and its socket:

    sudo systemctl stop docker
    sudo systemctl stop docker.socket
    

    This ensures Docker is not using any files while you change its configuration.

  • Create the new Docker data root directory:

    sudo mkdir -p /somepath
    

    Replace /somepath with the path to a disk or partition with sufficient free space.

  • (If needed) Create the Docker config directory:

    sudo mkdir -p /etc/docker
    

    Only necessary if /etc/docker does not already exist. This directory is required for the daemon.json configuration file.

  • Configure Docker to use the new data root: Add the following to /etc/docker/daemon.json (create the file if it does not exist):

    {
      "data-root": "/somepath"
    }
    

    This tells Docker to store all images, containers, and volumes in the new location.

  • Restart Docker:

    sudo systemctl start docker
    
  • Verify the new Docker root directory:

    docker info | grep Root
    

    This should show the new data root path you configured.

Note

Before proceeding, consider the following:

  • You may need to migrate existing images/volumes if you want to preserve them (a minimal migration sketch follows this note). For most troubleshooting, starting with a clean data root is sufficient.
  • Always ensure the new location has enough space for large model images and future growth.
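
If you do want to preserve existing images and volumes when changing the data root, one common approach is to copy the old directory to the new location while Docker is stopped. This is a minimal sketch, assuming rsync is installed and /somepath is the new data root from the steps above; verify the copy before removing the old directory:

sudo rsync -aP /var/lib/docker/ /somepath/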

GPU or Driver Issues

Symptoms:

  • The Hub fails to detect a GPU or reports driver/toolkit errors.
  • Model inference fails or is unstable.

Solution:

  • Check GPU hardware and driver status:

    • List available GPUs and their memory:

      nvidia-smi
      

      This shows GPU model, driver version, and available VRAM. Confirm you have enough VRAM for your workload (e.g., OpenFold3 requires GPUs with at least 40GB VRAM).

    • Check driver version:

      nvidia-smi --query-gpu=driver_version --format=csv
      

      Ensure the driver version matches the requirements for your CUDA toolkit and models.

    • If nvidia-smi fails, the driver may not be installed or loaded. Try:

      lsmod | grep nvidia
      

      If nothing is returned, the Nvidia kernel module is not loaded.

  • Confirm Nvidia Container Toolkit is installed:

    • Check installation:

      • On Debian/Ubuntu systems (using apt):

        dpkg -l | grep nvidia-container-toolkit
        
      • On Red Hat/CentOS/Fedora systems (using yum/dnf):

        rpm -qa | grep nvidia-container-toolkit
        
      • On other distributions, check your package manager's documentation for how to list installed packages.
    • For Docker, verify GPU access. The nvidia/cuda image does not publish a latest tag, so use a specific version tag that is available on Docker Hub, for example:

      docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
      

      This should show your GPU details inside the container. If it fails, the toolkit or driver may not be installed or configured correctly.

  • Check VRAM requirements for your model:

    • Some models (e.g., OpenFold3) require GPUs with at least 40GB VRAM for stable inference. If your GPU has less, inference may fail or be unstable.
    • Use nvidia-smi to check available VRAM and compare it with the model's requirements (see the query example after this list).
  • General debugging tips (Docker Deployments):

    • Reboot the system after installing or updating drivers.
    • Ensure Docker is started with GPU support (--gpus all).
    • Check container logs for CUDA or driver errors:

      docker ps --filter "label=apheris.hub=true" --format "{{.Names}}" | xargs -r -I {} docker logs {} 2>&1 | grep -Ei "cuda|nvidia"
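
For a quick view of total and used memory per GPU, the standard nvidia-smi query flags can be used, for example:

nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv

Compare memory.total against the VRAM requirement of the model you plan to run.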
      

Advanced: Nvidia Container Runtime, cgroups, and VRAM Troubleshooting

If you have installed the Nvidia Container Toolkit but still cannot start containers with GPU support, or see errors like unknown or invalid runtime name: nvidia, follow these steps:

  • Verify Nvidia driver and hardware on the host:

    sudo nvidia-smi
    

    This should show your GPUs and driver status. If it fails, the driver is not installed or loaded.

  • Test container GPU access (without sandboxing):

    sudo docker run --rm --runtime=nvidia --gpus=all --privileged ubuntu nvidia-smi
    
  • Test with Nvidia environment variables:

    sudo docker run --rm --runtime=nvidia \
      -e NVIDIA_VISIBLE_DEVICES=all \
      -e NVIDIA_DRIVER_CAPABILITIES=all \
      ubuntu nvidia-smi || true
    
  • Fix cgroups issues in Nvidia container runtime:

    If you see errors related to cgroups or the Nvidia runtime, you may need to comment out the no-cgroups setting in the Nvidia container runtime config. This is a common fix for GPU access issues on some systems.

    Run:

    sudo sed -i 's/no-cgroups = /#no-cgroups = /' /etc/nvidia-container-runtime/config.toml
    

    Then test GPU access in a container:

    sudo docker run --rm --runtime=nvidia \
      -e NVIDIA_VISIBLE_DEVICES=all \
      -e NVIDIA_DRIVER_CAPABILITIES=all \
      ubuntu nvidia-smi
    

    If you see your GPU listed, the issue is resolved.

  • Check Docker and system cgroup configuration:

    sudo docker info | grep -iE '(version|cgrou)'
    sudo mount | grep cgroup2
    

If you attempted rootless setup, ensure it is not interfering with GPU access. Revert to standard Docker if needed.
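
If the error is specifically unknown or invalid runtime name: nvidia, the runtime may not be registered with Docker at all. Recent versions of the Nvidia Container Toolkit include the nvidia-ctk helper, which writes the required runtime entries into /etc/docker/daemon.json; a sketch of that approach:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker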

Pairing session: If you continue to have issues, we recommend scheduling a pairing session with Apheris support to resolve complex setup problems interactively.

VRAM Requirements for OpenFold3 and Other Models

For OpenFold3, we recommend GPUs with at least 40GB VRAM for stable inference. On GPUs with less VRAM, inference may work for small inputs, but errors or out-of-memory (OOM) events are likely with larger workloads.

If you encounter OOM errors, check the logs for clear error messages. If the model fails due to insufficient memory, consider upgrading your hardware or using cloud resources with larger GPUs.
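
To see whether a job is approaching the limit, you can poll GPU memory while the workload runs (the 5-second interval below is arbitrary; stop with Ctrl+C):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5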

If you need help interpreting error messages or hardware requirements, contact Apheris support.

GPU Resource Conflicts Between Models

Symptoms:

  • A job crashes with NVML errors after another model has been running
  • Error messages like pynvml.NVMLError_Unknown: Unknown Error or Can't initialize NVML in model logs
  • One model works fine initially, but subsequent models fail to access the GPU
  • Jobs fail with GPU-related errors even though the GPU appears available

Cause:

When multiple models share the same GPU(s) using --gpus all, GPU resources (particularly CUDA contexts, memory, and NVML state) may not be properly released when a model completes its work. This can prevent other models from initializing their GPU access, especially with models that have complex GPU requirements like OpenFold3 and Boltz2.

Solution:

The most reliable way to resolve GPU resource conflicts is to restart the affected model containers/pods to force a clean GPU state:

For Docker deployments:

Restart the affected model container:

docker restart <modelname>

For example, to restart both the OpenFold3 and Boltz2 containers after a conflicting run:

docker restart openfold3 boltz2

For Kubernetes/Helm deployments:

Restart the affected model deployment:

kubectl rollout restart deployment/hub-<modelname> -n apheris-hub

For example, to restart the OpenFold3 model:

kubectl rollout restart deployment/hub-openfold3 -n apheris-hub
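
After restarting, you can verify on the host that no stale processes from the previous run are still holding GPU memory (an optional check using standard nvidia-smi query flags; an empty result apart from the header means the GPU is free):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv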

Prevention:

While GPU resource conflicts can occur with any multi-model setup, restarting model containers is a reliable workaround. If you encounter this issue frequently, consider:

  • Running models on dedicated GPUs if you have multiple GPUs available
  • Configuring GPU device IDs in your deployment to isolate models (see the sketch after this list)
  • Contacting support@apheris.com for guidance on GPU resource management strategies
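
As an illustration of the device-pinning approach for Docker deployments, a container can be restricted to a single GPU with the --gpus device syntax. This is a sketch only; the device index is an example, and your deployment may expose this setting through its own configuration rather than a raw docker run command:

# Only GPU 0 is visible inside this container:
docker run --rm --gpus '"device=0"' nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

On Kubernetes, the equivalent is typically handled by requesting nvidia.com/gpu resources so the device plugin assigns distinct GPUs to different pods.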

Model Not Deployed

Symptoms: When attempting to check model status, you see an error message on the model's page: Model not deployed.

Cause: This error indicates that the requested model is either:

  1. Disabled in the deployment configuration - The model was not enabled in your deployment settings
  2. Unreachable - The model is not running or the Hub cannot connect to it

Solution:

  • Check if the model is enabled in your configuration:

    Review your deployment configuration and verify the model's enabled setting. For example, to enable OpenFold3:

    models:
      openfold3:
        enabled: true
    

    If the model was disabled, update the configuration and redeploy.

  • Test model connectivity from your local machine:

    For Docker deployments:

    Check if the model is responding by querying its weights endpoint. Replace <MODEL_PORT> with the model's configured port from config.yaml (e.g., Mock defaults to 7771):

    curl http://127.0.0.1:<MODEL_PORT>/weights
    

    If the model is running and accessible, you should see a JSON response with available weights.

    For Kubernetes/Helm deployments:

    First, identify the model's service and port:

    kubectl get svc -n apheris-hub -l app.kubernetes.io/name=hub-<modelname>
    

    For example, to find the mock model service:

    kubectl get svc -n apheris-hub -l app.kubernetes.io/name=hub-mock
    

    This shows the service name and ports. Then set up a port-forward to access it locally:

    kubectl port-forward -n apheris-hub svc/hub-<modelname> <LOCAL_PORT>:<SERVICE_PORT>
    

    For example, to port-forward the mock model (typically runs on port 8000):

    kubectl port-forward -n apheris-hub svc/hub-mock 7771:8000
    

    In another terminal, test the connection:

    curl http://127.0.0.1:7771/weights
    

    If the model is running correctly, you should see a JSON response with available weights:

    {
      "available_weights": [
        {
          "model_type": "mock",
          "model_version_id": "mock:vX.Y.Z",
          ...
        }
      ]
    }
    

    If you get a connection error, the model is either not running or not exposed on that port.

  • Check if the model pod/container is running:

    For Docker deployments:

    docker ps --filter "label=apheris.hub=true"
    

    Look for containers with names matching the model (e.g., mock, openfold3). For example, to check if the mock model is running:

    docker ps --filter "name=mock"
    

    For Kubernetes/Helm deployments:

    kubectl get pods -n apheris-hub -l app.kubernetes.io/name=hub-<modelname>
    

    Replace <modelname> with the model name (e.g., mock, openfold3). For example:

    kubectl get pods -n apheris-hub -l app.kubernetes.io/name=hub-mock
    
  • Check model logs for errors:

    For Docker deployments:

    docker logs <modelname>
    

    For example, to check mock model logs:

    docker logs mock
    

    For Kubernetes/Helm deployments:

    kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-<modelname>
    

    For example, to check mock model logs:

    kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-mock
    

    Look for startup errors, port conflicts, GPU/driver issues, or resource constraints (e.g., out of memory).

  • Verify network connectivity between Hub and model:

    The Hub must be able to reach the model service. You can verify connectivity by checking the model logs to see if the Hub is successfully making requests.

    For Docker deployments:

    Check the model container logs to see if the Hub is making successful requests:

    docker logs <modelname> --tail=20
    

    For example, to check if the Hub is connecting to the mock model:

    docker logs mock --tail=20
    

    You should see log entries showing HTTP requests from the Hub's IP address (e.g., INFO: 172.19.0.4:50222 - "GET /weights HTTP/1.1" 200 OK). If you see successful 200 responses, the Hub can reach the model.

    For Kubernetes/Helm deployments:

    Check the model pod logs to see if the Hub is making successful requests:

    kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-<modelname> --tail=20
    

    For example, to check if the Hub is connecting to the mock model:

    kubectl logs -n apheris-hub -l app.kubernetes.io/name=hub-mock --tail=20
    

    You should see log entries showing HTTP requests from the Hub's IP address (e.g., INFO: 10.244.0.9:55890 - "GET /weights HTTP/1.1" 200 OK). If you see successful 200 responses, the Hub can reach the model.

    If you don't see any requests from the Hub, or see connection errors, there may be network policy issues, DNS resolution problems, or the model service is not properly exposed.
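
    As an additional check, you can try to reach the model endpoint from inside the Hub itself, which exercises the same network path the Hub uses. This is a sketch only: the Hub image may not include curl, and the hostnames, deployment name, and ports below are placeholders you need to adapt to your setup.

    For Docker deployments:

    docker exec apheris-hub curl -s http://<modelname>:<MODEL_PORT>/weights

    For Kubernetes/Helm deployments:

    kubectl exec -n apheris-hub deploy/<HUB_DEPLOYMENT> -- curl -s http://hub-<modelname>:<SERVICE_PORT>/weights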

If the issue persists after following these steps, please contact support@apheris.com with the diagnostic information.

Support ZIP

The Support ZIP feature provides a convenient way to collect diagnostic information that can help support teams better understand your deployment setup. This archive contains structured JSON data with:

  • System and hardware details (OS, architecture, Go version, CPU count, GPU/driver status)
  • Hub configuration with sensitive data redacted (input/output directories, API URL, authentication settings, timeouts, websocket settings, etc.)
  • Installed model versions with their current status, endpoints, and available weights

While this information can provide useful context about your environment, additional troubleshooting steps may still be needed to resolve specific issues.

For a detailed breakdown of what is included in the Support ZIP archive and how redaction works, see Support ZIP Archive: Contents and Redaction Details. For further questions, contact support.

How to Generate a Support ZIP

You can generate a Support ZIP archive in two ways:

From the UI:

  1. In the Apheris Hub UI, navigate to the Settings tab.
  2. Go to the Support section.
  3. Click on the Download Support Zip button to generate and download the archive.
  4. Provide the downloaded archive when contacting support (email: support@apheris.com).

Via API: Use the following curl command, replacing <HUB_API_URL> with your Hub's URL (e.g., http://localhost:8080):

curl -o support.zip <HUB_API_URL>/api/v1/support/zip
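
To confirm the download produced a valid archive, you can list its contents (assuming the unzip utility is installed):

unzip -l support.zip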

Additional Support

If issues persist, contact support@apheris.com. When reaching out, include relevant logs and, if the Hub is running, a Support ZIP archive (Settings > Support > Download Support Zip) to help provide context about your environment and configuration.