Monitoring Gateways

Basic overview in the Governance Portal

The Governance Portal shows the most important settings of your Gateways as well as their running status.

Accessing Gateways

Gateways have a dedicated navigation item in the Governance Portal sidebar, making them easily accessible from anywhere in the application.

Access Permissions

Access to the gateway monitoring views is role-based and depends on your user role and organization configuration:

Role	Access Level	Description
Owner + Admin Organization configured	Full Access	Users who have the Owner role in an organization that is configured as an admin organization can view all gateways associated with the orchestrator across all organizations. An admin organization is one whose ID is listed in the orchestrator config setting `adminOrgIds`. Please contact your Apheris representative to configure this for your environment.
Owner, Data Steward	Organization Scope	Can only view gateways associated with their own organization. Gateways from other organizations are not visible.
Other roles	No Access	Cannot access the gateway monitoring views.
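
The access rules in the table can be read as a small decision function. The following Python sketch is for illustration only: the function name `gateway_access_level`, the role strings, and the return values are hypothetical and are not part of the Governance Portal or any Apheris API. Only the `adminOrgIds` setting is taken from the table above.

```python
def gateway_access_level(role: str, org_id: str, admin_org_ids: set[str]) -> str:
    """Return the access level described in the table above.

    Hypothetical helper for illustration; the role spellings are assumptions.
    """
    if role == "Owner" and org_id in admin_org_ids:
        return "full"          # all gateways across all organizations
    if role in {"Owner", "Data Steward"}:
        return "organization"  # only gateways of the user's own organization
    return "none"              # gateway monitoring views are not accessible


# Stands in for the orchestrator's adminOrgIds setting (made-up ID).
admin_orgs = {"org_admin"}
print(gateway_access_level("Owner", "org_admin", admin_orgs))
print(gateway_access_level("Data Steward", "org_admin", admin_orgs))
print(gateway_access_level("Viewer", "org_other", admin_orgs))
```

Note that the Data Steward role stays at organization scope even inside an admin organization, since the table grants full access only to Owners.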

Gateway List View

The gateway list provides a comprehensive overview of all your gateways with the following columns:

Column	Description
Name	The gateway identifier
Organization	The organization the gateway belongs to
Namespace	The Kubernetes namespace where the gateway is deployed
Release Version	The current software version of the gateway
Active compute specs	Number of active computation specifications
CPU	CPU resource allocation
GPU	GPU resource allocation
Memory	Memory resource allocation
Status	Current operational status
Signing	Digital signature verification status

Use the search box to filter gateways by name, toggle "Live only" to show only active gateways, or click the refresh button to update the list. Click any row to view detailed gateway information.

Gateway Details View

Clicking on a gateway opens the details view, which provides comprehensive configuration and status information organized into the following sections:

Status & Identity - Heartbeat interval, last restart time, organization, namespace, release version, gateway and agent IDs, and deployment flavor
Current Resource Usage - Active compute specs, CPU, GPU, and memory utilization
Digital Signatures - Verification status and public signing keys with certificate information
Auth0 Configuration - Domain, audience, and token validation settings
Gateway Authentication - Runtime verification, NVFlare validation, and public keys

Gateway logs

Note

All Apheris component logs use the UTC timezone.

Log Event Ingestion

The Apheris Gateway components emit logs in jsonline format to stdout/stderr. This integrates with any logging system that is designed for Kubernetes.

Note

No Apheris component maintains log files.

Log shipping, ingestion and indexing are out of scope of this guide, because the setup depends on the logging system you use. Please refer to the documentation for your specific logging system for information about setting them up.

If you have further questions, please contact your Apheris representative or reach out via support@apheris.com.

Log Event Format

All Gateway components produce logs in jsonline format (one JSON document per log event on a single line) and emit them to the container's (and pod's) stdout.

The logs are leveled; the default level is info. The log levels can be set via the agent.logLevel and dal.logLevel Helm values.

Fields

field	description
level	the log level of the event
ts	timestamp of the event as Unix epoch seconds, with a fractional part
msg	main message
error	(optional) error message if present
stacktrace	(optional) stacktrace if present
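
As a minimal sketch of how these fields can be consumed, the following Python snippet parses one log line and converts `ts` to a UTC datetime. The function name `parse_log_line` and the added `time` key are hypothetical, and the sample line is a shortened version of the error event shown in this guide.

```python
import json
from datetime import datetime, timezone


def parse_log_line(line: str) -> dict:
    """Parse one jsonline log event and add a UTC datetime under `time`."""
    event = json.loads(line)
    # `ts` is Unix epoch seconds with a fractional part; all logs use UTC.
    event["time"] = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return event


# Shortened sample event; real events carry more fields.
line = '{"level": "error", "ts": 1686125489.6039624, "msg": "receiving event", "error": "server misbehaving"}'
event = parse_log_line(line)
print(event["time"].isoformat(), event["level"], event["msg"])
# `error` and `stacktrace` are optional, so read them with a default.
print(repr(event.get("error", "")), repr(event.get("stacktrace", "")))
```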

Gateway Agent Logs

The following examples are reformatted for readability.

An example error event:

{
  "level": "error",
  "ts": 1686125489.6039624,
  "caller": "app/result_adapter.go:19",
  "msg": "receiving event",
  "agent_id": "35d1f1d5-318a-458e-9432-97d892c6c296",
  "error": "Get \"http://orchestrator/computations\": dial tcp: lookup orchestrator on 10.96.0.10:53: server misbehaving",
  "stacktrace": "main.resultAdapter.func1\n\t/go/src/app/result_adapter.go:19"
}

An example computation request event:

{
  "level": "info",
  "ts": 1710169487.2425287,
  "caller": "agent/computation_pipeline.go:186",
  "msg": "computation request",
  "agent_id": "c4e84dc3-3248-44b2-890b-b4b6f0b472d0",
  "request": {
    "id": "a1f76a60-300c-43cd-af9a-f7f3cfec9e69",
    "resources": {
      "cpu": 0.5,
      "memory": 500
    },
    "authentication": {
      "userSession": "..."
    },
    "execution": {
      "image": "quay.io/apheris/statistics:0.3.0",
      "dataSources": [
        {
          "path": "s3://apheris-tutorials-data/whas/worcester/data.csv",
          "key": "whas1_gateway-1_org-1"
        }
      ],
      "Parameters": {
        "NvflareParameters": {
          "arguments": "-u -m nvflare.private.fed.app.client.client_train -m /workspace -s fed_client.json --set secure_train=true uid=f44f2052-659a-43fd-84f8-8942627d222c org=org_yJz0JV5nAkFTkyl9 config_folder=config",
          "deploymentID": "88aaf187-3ca2-4460-9271-359b1a4ef57d"
        }
      },
      "Statement": {
        "NvflareStatement": {
          "command": "/usr/local/bin/python3"
        }
      }
    },
    "replicas": 1
  }
}
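
To illustrate how such an event might be inspected, the following Python sketch extracts the requested resources and data source paths from a computation request event. The helper `summarize_request` and its output keys are hypothetical; the field names it reads are the ones visible in the example above, and the sample is a shortened copy of that example.

```python
import json


def summarize_request(event: dict) -> dict:
    """Pick resource and data source details out of a computation request event."""
    request = event["request"]
    resources = request.get("resources", {})
    execution = request.get("execution", {})
    return {
        "id": request.get("id"),
        "image": execution.get("image"),
        "cpu": resources.get("cpu"),
        "memory": resources.get("memory"),
        "data_paths": [source.get("path") for source in execution.get("dataSources", [])],
    }


# Shortened copy of the example event, serialized to a single line as in real logs.
line = json.dumps({
    "level": "info",
    "msg": "computation request",
    "request": {
        "id": "a1f76a60-300c-43cd-af9a-f7f3cfec9e69",
        "resources": {"cpu": 0.5, "memory": 500},
        "execution": {
            "image": "quay.io/apheris/statistics:0.3.0",
            "dataSources": [
                {"path": "s3://apheris-tutorials-data/whas/worcester/data.csv"}
            ],
        },
    },
})
summary = summarize_request(json.loads(line))
print(summary)
```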

An example heartbeat error event:

{
  "level": "error",
  "ts": 1687431881.4542866,
  "caller": "app/main.go:179",
  "msg": "heartbeat",
  "agent_id": "972a5b9d-d67e-4474-a3fb-1240cbfedd67",
  "error": "error response from server: <html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body>\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n</body>\r\n</html>\r\n",
  "stacktrace": "main.main.func4\n\t/go/src/app/main.go:179\ngithub.com/apheris/node-agent/pkg/orchestrator.Client.GatewayHeartbeat.func1\n\t/go/src/app/pkg/orchestrator/orchestrator.go:155"
}
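
Events like the ones above can be picked out of a log stream by their `msg` and `level` fields. The following Python sketch shows one way to do this, assuming the log lines have already been collected by your logging system; the function name `select_agent_events` is hypothetical, and the sample lines are shortened or made up for illustration.

```python
import json


def select_agent_events(lines, wanted=(("computation request", "info"), ("heartbeat", "error"))):
    """Return parsed events whose (msg, level) pair is listed in `wanted`."""
    selected = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON documents
        if isinstance(event, dict) and (event.get("msg"), event.get("level")) in wanted:
            selected.append(event)
    return selected


sample = [
    '{"level": "info", "ts": 1710169487.2425287, "msg": "computation request", "request": {"id": "a1f76a60-300c-43cd-af9a-f7f3cfec9e69"}}',
    '{"level": "error", "ts": 1687431881.4542866, "msg": "heartbeat", "error": "error response from server"}',
    '{"level": "info", "ts": 1687431882.0, "msg": "some other event"}',  # made-up event
    'not a json line',
]
events = select_agent_events(sample)
for event in events:
    print(event["level"], event["msg"])
```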

Notable events

msg field	level	when	description
"configuration"	info	once at startup	agent configuration
"computation request"	info	for every computation request event	the entire payload of the computation request event
"heartbeat"	error	for every heartbeat error event	the error message and the stacktrace of the heartbeat error event

Data Access Layer (DAL) Logs

The following examples are reformatted for readability.

An example data access log event:

{
  "level": "info",
  "ts": 1709735744.1490877,
  "caller": "dal/http_middleware.go:58",
  "msg": "request",
  "instance_id": "64646a19-62f0-43c1-9c6a-30844a31f749",
  "http_status": 200,
  "http_method": "GET",
  "url": "/datasets/s3://apheris-tutorials-data/whas/worcester/data.csv",
  "request_duration": 0.327430324,
  "error": ""
}
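
Data access events like this one can be aggregated per dataset. The following Python sketch counts requests and failures per `url`; the function name `summarize_dal_requests` and the output keys are hypothetical, and treating a non-empty `error` field or an HTTP status of 400 or higher as a failure is an assumption, not documented DAL behavior.

```python
import json
from collections import defaultdict


def summarize_dal_requests(lines):
    """Aggregate request count, failure count and total duration per dataset url."""
    summary = defaultdict(lambda: {"requests": 0, "errors": 0, "total_duration": 0.0})
    for line in lines:
        event = json.loads(line)
        if event.get("msg") != "request":
            continue  # only data access events are of interest here
        entry = summary[event["url"]]
        entry["requests"] += 1
        entry["total_duration"] += event.get("request_duration", 0.0)
        # Assumption: a non-empty `error` or a status >= 400 marks a failed request.
        if event.get("error") or event.get("http_status", 0) >= 400:
            entry["errors"] += 1
    return dict(summary)


url = "/datasets/s3://apheris-tutorials-data/whas/worcester/data.csv"
sample = [
    json.dumps({"level": "info", "msg": "request", "http_status": 200, "url": url,
                "request_duration": 0.327430324, "error": ""}),
    # Made-up failing request, for illustration only.
    json.dumps({"level": "info", "msg": "request", "http_status": 404, "url": url,
                "request_duration": 0.1, "error": "not found"}),
    json.dumps({"level": "info", "msg": "configuration"}),
]
result = summarize_dal_requests(sample)
print(result[url]["requests"], result[url]["errors"])
```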

Notable events

msg field	level	when	description
"configuration"	info	once at startup	agent configuration
"request"	info	for every request for a dataset that DAL (Data Access Layer) serves	includes the dataset url (as `url` field)