Hosted Stellar Relayer on GCP: Operator Deployment Guide

A step-by-step guide for infrastructure teams running a hosted Stellar relayer service on Google Cloud Platform.

Who this is for: infrastructure operators who have run production GCP workloads but are new to OpenZeppelin's relayer stack.

What you get: a hosted Stellar Channels service in your own GCP project, sized to serve the same workload OpenZeppelin runs today (roughly 2M+ transactions per day across about 2,500 relayers).

1. Overview

OpenZeppelin runs a hosted Stellar relayer service at channels.openzeppelin.com (mainnet) and channels.openzeppelin.com/testnet (testnet). The service takes on the hard parts of submitting Stellar transactions in parallel: managing a pool of channel accounts, fee bumping, arbitrating sequence numbers, and failing over between RPC providers. Downstream callers just talk to a simple HTTP API.

This guide shows you how to run that same service in your own GCP project.

What You End Up With

By the end of this guide you will have:

  • A production-ready hosted Stellar Channels service in your own GCP project, served from a domain you control (for example, channels.your-company.com).
  • A Cloud Run compute tier with autoscaling, sitting behind an External HTTPS Load Balancer with a Google-managed SSL certificate.
  • Memorystore Redis for state and deferred-job scheduling. In production this runs as STANDARD_HA with automatic failover.
  • Eight Pub/Sub topics and subscriptions that handle the distributed transaction-processing pipeline (when queue_backend = "pubsub").
  • An optional Cloudflare Worker in front of the load balancer for self-serve API-key issuance (the /gen flow), per-user rate limiting, and usage analytics.
  • A Secret Manager entry for every secret. Secrets are injected as environment variables when the container starts.
  • Cloud KMS for ED25519 transaction signing. The module provisions a keyring and an asymmetric signing key.
  • An Artifact Registry remote repository configured to proxy the public ECR image, giving Cloud Run a GCP-native pull path.
  • Optional Cloud Functions for fund-relayer balance monitoring.

The service handles two transaction-submission modes:

  • Signed XDR mode: the caller signs a complete Stellar transaction envelope and submits it. The service only fee-bumps and submits.
  • Soroban func + auth mode: the caller submits a Soroban host function plus authorization entries. The service assembles the transaction, simulates it, signs with a channel account, fee-bumps, and submits.

What This Guide Assumes You Already Have

  • A strong GCP background: VPC, Cloud Run, IAM, Cloud DNS, Memorystore, Pub/Sub.
  • Terraform fluency (1.5.0 or later).
  • A target GCP project where you can create the full resource set.
  • A domain you control. DNS can live in Route53, Cloud DNS, or another provider.
  • Optionally, a Cloudflare account if you want the /gen API-key gateway.

1.5 How Channels Works on Stellar

Every Stellar transaction has a source account with a monotonically increasing sequence number. Only one transaction per source account can be in-flight at a time. This is the constraint that limits parallel throughput on Stellar.

The Channels service works around it with a pool of dedicated source accounts: the channel accounts. Each in-flight transaction acquires one channel account from the pool, uses its sequence number, and releases it after confirmation. The pool size determines how many transactions can run in parallel.
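The acquire and release cycle described above can be pictured with a small in-memory model. This is only an illustrative sketch: the real service keeps this state in Redis behind distributed locks, and the `ChannelPool` class, its method names, and the `channel-0001` style identifiers are assumptions made for the example.

```python
from collections import deque


class PoolExhausted(Exception):
    """Raised when every channel account is already carrying a transaction."""


class ChannelPool:
    """Toy model of a channel-account pool (hypothetical, in-memory only)."""

    def __init__(self, size: int) -> None:
        # Each channel tracks its own sequence number, as a Stellar source account does.
        self._sequence = {f"channel-{i:04d}": 0 for i in range(1, size + 1)}
        self._free = deque(self._sequence)
        self._busy = set()

    def acquire(self) -> tuple[str, int]:
        """Reserve a free channel and return it with the next sequence number to use."""
        if not self._free:
            raise PoolExhausted("no free channel accounts")
        channel = self._free.popleft()
        self._busy.add(channel)
        return channel, self._sequence[channel] + 1

    def release(self, channel: str, confirmed: bool) -> None:
        """Return a channel to the pool; the sequence advances only if the tx was confirmed."""
        self._busy.remove(channel)
        if confirmed:
            self._sequence[channel] += 1
        self._free.append(channel)

    @property
    def in_flight(self) -> int:
        """Number of channels currently carrying a transaction."""
        return len(self._busy)
```

With a pool of size N, the (N+1)th concurrent `acquire()` raises `PoolExhausted`, which is the in-memory analogue of the pool acting as a throughput ceiling.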

The fund account is a separate Stellar account that holds the XLM balance. When the service submits a transaction, it wraps the channel-signed envelope in a fee-bump transaction, a Stellar primitive that lets a second account pay the network fee. Both accounts are backed by Cloud KMS ED25519 keys.

The pool size you provision in Step 5.10 is your throughput ceiling. See §10.1 for the sizing formula before you bootstrap.

2. Architecture

Cloud Architecture

The whole stack is provisioned by the gcp Terraform module (modules/gcp) in OpenZeppelin/relayer-channels-infra. You consume it either by cloning the repo or by referencing it as an external module from your own Terraform.

Components

| Component | GCP Service | Purpose |
|---|---|---|
| Edge gateway | Cloudflare Worker + KV (optional) | API-key issuance, rate limiting, usage tracking |
| Load balancer | External HTTPS LB + Google-managed cert | TLS termination, HTTPS-only, health-checked routing |
| Compute | Cloud Run v2 Service | Runs the relayer container with autoscaling |
| State | Memorystore Redis 7.2 | Transaction records, sequence counters, distributed locks |
| Queue | 8 Pub/Sub topics + 8 subscriptions | Distributed transaction processing pipeline |
| Secrets | Secret Manager | API keys, admin secrets, encryption keys |
| Signing | Cloud KMS (EC_SIGN_ED25519) | Transaction signing for fund + channel accounts |
| Image registry | Artifact Registry (remote repo) | Proxies ECR Public image for Cloud Run |
| Observability | Cloud Logging + Cloud Monitoring | Application logs, metrics |
| Networking | VPC + VPC Connector + Private Service Access | Private connectivity to Memorystore |
| Optional monitors | Cloud Functions + Cloud Scheduler | Balance-check function |

App Architecture (Channels Plugin Runtime)

Transaction Lifecycle

Pub/Sub Queue Topology

The relayer's distributed processing layer uses eight Pub/Sub topics with pull subscriptions. The Pub/Sub backend handles retries through Redis sorted sets (a store-and-run-when-due pattern), so there are no dead-letter topics.

Deferred job pattern: Pub/Sub has no native delayed delivery, so deferred jobs (retries with backoff) are stored in Redis sorted sets keyed by their due time. A due-sweep worker runs every 1 to 5 seconds per queue type, claims due jobs from Redis, and publishes them to the topic. The topic only ever carries jobs that are already due.
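The store-and-run-when-due pattern can be sketched as follows. This is a minimal illustration, not the relayer's implementation: a heap stands in for the Redis sorted set, `DeferredQueue` and `backoff_due_time` are hypothetical names, and the exponential backoff schedule is an assumption (the text above only says retries use backoff).

```python
import heapq
import itertools
from typing import Callable


class DeferredQueue:
    """In-memory stand-in for a Redis sorted set scored by due time (hypothetical)."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._tie = itertools.count()  # keeps ordering stable for equal due times

    def schedule(self, job: str, due_at: float) -> None:
        """Store a job until due_at (with Redis this would be a ZADD scored by due_at)."""
        heapq.heappush(self._heap, (due_at, next(self._tie), job))

    def sweep(self, now: float, publish: Callable[[str], None]) -> int:
        """Publish every job whose due time has passed and return how many were sent."""
        sent = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, job = heapq.heappop(self._heap)
            publish(job)  # stand-in for publishing to the Pub/Sub topic
            sent += 1
        return sent


def backoff_due_time(now: float, attempt: int, base_seconds: float = 2.0) -> float:
    """Assumed schedule: attempt 1 waits base, attempt 2 waits 2x base, and so on."""
    return now + base_seconds * (2 ** (attempt - 1))
```

A sweep worker would call `sweep(time.time(), publish)` on its 1 to 5 second interval, so the topic only ever receives jobs that are already due.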

Capacity Profile

The reference deployment OpenZeppelin runs handles a growing load of about 2M+ transactions per day, served by roughly 2,500 relayers (fund and channel-account entities combined). The module defaults are sized conservatively for new deployments. Expect to grow into something closer to the production shape as your workload scales.

| Resource | Module default (prod) | Current GCP deployment |
|---|---|---|
| CPU | 1 vCPU | 4 vCPU |
| Memory | 2 Gi | 8 Gi |
| Min instances | 2 | 3 |
| Max instances | 10 | 20 |
| Redis tier | STANDARD_HA | STANDARD_HA |
| Redis memory | 5 GB | 5 GB |

The module defaults work fine for a new deployment that is ramping up. The GCP deployment was raised above defaults to handle concurrent transaction stress testing. Tune further as your workload grows.


3. Prerequisites

GCP access, tooling, and Stellar-side accounts must be in place before you run terraform apply.

Accounts and Access

  • A GCP project with billing enabled and permission to create Cloud Run services, Memorystore instances, Pub/Sub topics and subscriptions, Secret Manager secrets, Cloud KMS keyrings and keys, Compute Engine load balancers, VPC connectors, Artifact Registry repositories, and IAM role bindings.
  • A service account for Terraform with these roles:
    • roles/editor for general resource creation
    • roles/resourcemanager.projectIamAdmin to grant IAM roles to service accounts
    • roles/compute.networkAdmin for VPC peering used by Private Service Access
    • roles/cloudkms.admin to create KMS keyrings and keys
    • roles/pubsub.admin to create topics and subscriptions and set IAM policies
    • roles/secretmanager.admin to create secrets and set IAM policies
    • roles/run.admin to manage Cloud Run services
    • roles/artifactregistry.admin to create repositories and set IAM policies
  • A domain you control, with access to create DNS records (Route53, Cloud DNS, or another provider).
  • Optionally, a Cloudflare account with a zone matching your domain, if you want the /gen API-key gateway.

Tooling

| Tool | Version | Why |
|---|---|---|
| Terraform | 1.5.0 or later | Module language constraints |
| Google provider | 5.0 or later, below 7.0 | Pinned in versions.tf |
| Cloudflare provider | ~> 5.0 | Required even when enable_cloudflare = false (a Terraform constraint) |
| gcloud CLI | recent stable | Auth, Artifact Registry, debugging |
| Node.js 18+ and pnpm 10+ | recent stable | Only if you modify the Channels plugin |

Stellar-Side Prerequisites

  • Soroban RPC access: for mainnet, use at least two independent private providers from different infrastructure operators (QuickNode and Ankr are the providers OpenZeppelin uses). "Independent" means different node operators, not different API wrappers on the same underlying node. The public image ships with a public RPC endpoint by default; override it with private providers after deployment (see Step 5.8).
  • Initial XLM funding: each Stellar account must hold a minimum balance of 1 XLM (two base reserves of 0.5 XLM each). For 200 channel accounts plus the fund account, that is 201 XLM locked in reserves, so budget at least 250 XLM before transaction fees. Fund the fund relayer's Stellar account first, because oz-channels bootstrap draws channel account balances from it.
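The funding figure above is simple arithmetic, sketched here so you can rerun it for a different pool size. The function names are hypothetical, and the 1 XLM per-account minimum is taken from the bullet above; the sketch ignores any additional reserves an account might need for trustlines or other ledger entries.

```python
from decimal import Decimal

MIN_ACCOUNT_BALANCE_XLM = Decimal("1")  # per-account minimum stated in this guide


def minimum_reserve_xlm(channel_accounts: int, fund_accounts: int = 1) -> Decimal:
    """XLM locked up as minimum balances, before any transaction fees."""
    return (channel_accounts + fund_accounts) * MIN_ACCOUNT_BALANCE_XLM


def headroom_xlm(budget: Decimal, channel_accounts: int) -> Decimal:
    """XLM left over for fees once every account holds its minimum balance."""
    return budget - minimum_reserve_xlm(channel_accounts)
```

For 200 channel accounts, `minimum_reserve_xlm(200)` is 201 XLM, so a 250 XLM budget leaves 49 XLM of headroom for fees.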

Reference Repositories

| Repo | Role | Visibility |
|---|---|---|
| OpenZeppelin/relayer-channels-infra | Terraform modules and operator CLIs (oz-relayer, oz-channels) | Public |
| OpenZeppelin/openzeppelin-relayer | The relayer application | Public |
| OpenZeppelin/relayer-plugin-channels | The Channels plugin runtime (TypeScript) | Public |

4. Environments

We recommend running separate environments with isolated state:

| Environment | Stellar network | GCP project pattern | Cloud Run service | Pub/Sub prefix |
|---|---|---|---|---|
| prod | Stellar Mainnet | Production project | relayer-channels-service | relayer-mainnet-prod- |
| stg | Stellar Testnet | Same or separate project | relayer-channels-stg-service | relayer-testnet-stg- |

The module derives service naming from app_name plus environment. When environment = "prod", the resource-name suffix is dropped. For other environments, names are suffixed with -<environment>.
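The naming rule can be expressed as a few lines of code. This is a sketch of the rule as described here, not the module's Terraform: the function names are hypothetical, the `relayer-channels` app name and `-service` suffix are inferred from the environment table, and the trailing hyphen on the Pub/Sub prefix follows that table as well.

```python
def resource_base_name(app_name: str, environment: str) -> str:
    """Drop the suffix for prod; append -<environment> for everything else."""
    return app_name if environment == "prod" else f"{app_name}-{environment}"


def cloud_run_service_name(app_name: str, environment: str) -> str:
    """Cloud Run service name as it appears in the environment table."""
    return f"{resource_base_name(app_name, environment)}-service"


def pubsub_topic_prefix(stellar_network: str, environment: str) -> str:
    """Pub/Sub prefix; unlike service names, it keeps the environment even for prod."""
    return f"relayer-{stellar_network}-{environment}-"
```

Note the asymmetry: the service name drops `prod`, while the Pub/Sub prefix in the table keeps it.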

Each environment gets its own:

  • Terraform state (use separate GCS backend prefixes).
  • Terraform working directory (examples/gcp/ for stg, examples/gcp-prod/ for prod).
  • VPC connector CIDR range (for example 10.8.0.0/28 for stg and 10.9.0.0/28 for prod if they share a VPC).
  • Secret Manager secrets, KMS keys, and Pub/Sub topics.
  • Cloudflare Worker, if enabled, with distinct names like relayer-channels-stg-gcp-gateway.
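When environments share a VPC, their connector ranges must not collide. A quick pre-flight check along these lines can catch a copy-pasted CIDR before terraform apply does; the `overlapping_ranges` helper is a hypothetical name, and the ranges are the example values from the list above.

```python
import ipaddress
from itertools import combinations


def overlapping_ranges(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of environments whose connector CIDR ranges overlap."""
    networks = {env: ipaddress.ip_network(cidr) for env, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]
```

An empty result for `{"stg": "10.8.0.0/28", "prod": "10.9.0.0/28"}` means the two connectors can coexist in one VPC.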

5. Step-by-Step Deployment

Full provisioning sequence from authentication through end-to-end verification. Steps 5.1–5.4 set up credentials and configuration; 5.5–5.6 set up the container image and apply infrastructure; 5.7–5.11 wire up DNS, RPC endpoints, signers, and channel accounts.

Step 5.1: Set Up Authentication

export GOOGLE_APPLICATION_CREDENTIALS="$HOME/path/to/service-account-key.json"

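The export above points Terraform at a service account key file (create one under IAM & Admin > Service Accounts > Keys > Create new key > JSON). Use this path when your GCP org blocks gcloud auth application-default login; where that command is allowed, either method works.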

Step 5.2: Get the Module

Option A, reference as an external module (recommended):

module "relayer_channels" {
  source = "git::https://github.com/OpenZeppelin/relayer-channels-infra.git//modules/gcp?ref=main"
  # ... variables
}

Option B, clone the repo:

git clone https://github.com/OpenZeppelin/relayer-channels-infra.git
cd relayer-channels-infra/examples/gcp  # or examples/gcp-prod

Step 5.3: Configure the Terraform Backend

In versions.tf, configure remote state. Do not keep state on a laptop in production.

terraform {
  backend "gcs" {
    bucket = "your-org-terraform-state"
    prefix = "relayer-channels/prod.tfstate"
  }
}

Initialize:

terraform init

Step 5.4: Create Your tfvars

cp terraform.tfvars.example terraform.tfvars

Minimum required configuration:

project_id      = "my-gcp-project"
region          = "us-east1"
environment     = "prod"                # or "stg"
network         = "default"
subnetwork      = "default"
domain_name     = "channels.your-company.com"
container_image = "us-east1-docker.pkg.dev/my-gcp-project/ecr-public/w5h5k2p1/openzeppelin-relayer-channels:mainnet-latest"
stellar_network = "mainnet"             # or "testnet"
queue_backend   = "pubsub"

# Secrets, never commit these
relayer_api_key        = ""  # set via TF_VAR_relayer_api_key
channels_admin_secret  = ""  # set via TF_VAR_channels_admin_secret
storage_encryption_key = ""  # set via TF_VAR_storage_encryption_key

Generate secrets:

export TF_VAR_relayer_api_key="$(uuidgen | tr '[:upper:]' '[:lower:]')"
export TF_VAR_channels_admin_secret="$(openssl rand -base64 32)"
export TF_VAR_webhook_signing_key="$(openssl rand -hex 32)"
export TF_VAR_storage_encryption_key="$(openssl rand -base64 32)"   # must be base64-encoded 32 bytes
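Because the storage encryption key has a strict format, it can be worth validating the value before it reaches Terraform. The check below is a small sketch; `is_valid_storage_key` is a hypothetical helper, not part of the module or the CLIs.

```python
import base64
import binascii


def is_valid_storage_key(value: str) -> bool:
    """True when value is standard base64 that decodes to exactly 32 bytes."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 32
```

A value produced by `openssl rand -base64 32` passes. A 64-character hex string (the format used for the webhook signing key) does not, because it decodes to 48 bytes when read as base64.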

Step 5.5: Set Up Artifact Registry

Cloud Run cannot pull directly from ECR Public. Configure an Artifact Registry remote repository to proxy it:

  1. GCP Console > Artifact Registry > Create Repository
  2. Format: Docker, Mode: Remote, Source: Custom, URL: https://public.ecr.aws
  3. Name it ecr-public, choose your region

Then reference the proxied image in your container_image tfvar (as shown in Step 5.4).

Tag scheme: mainnet-<version> (pinned, recommended for prod), mainnet-latest (tracks latest), testnet-<version>, testnet-latest.

The public image ships with a public Soroban RPC endpoint that rate-limits under production load. Override it with private providers after deployment in Step 5.8.

Step 5.6: Plan and Apply

terraform plan -out plan.tfplan
terraform apply plan.tfplan

The initial apply takes 10 to 15 minutes. Memorystore provisioning is the slowest leg. Private Service Access peering and SSL cert provisioning also take a few minutes.

Key outputs:

| Output | Used for |
|---|---|
| cloud_run_service_name | Service management, gcloud run commands |
| cloud_run_service_uri | Direct Cloud Run access (bypasses the LB) |
| load_balancer_ip | DNS record creation |
| redis_host | Manual Redis inspection (from a VM in the VPC) |
| pubsub_topics | Map of queue names to Pub/Sub topic names |
| kms_signing_key_id | Full KMS key ID for signer creation |
| artifact_registry_url | Artifact Registry URL |

Step 5.7: Set Up DNS and SSL

The Google-managed SSL certificate needs DNS to point at the load balancer IP before it can provision.

Without Cloudflare:

  1. Create an A record: channels.your-company.com to <load_balancer_ip>.
  2. Wait 15 to 60 minutes for the certificate to provision (check status in GCP Console > Network Services > Load Balancing > certificate tab).

With Cloudflare:

  1. Create a Cloudflare A record: channels.your-company.com to <load_balancer_ip> (proxy OFF initially, grey cloud).
  2. Create an A record in your authoritative DNS (Route53 in this example): channels.your-company.com to <load_balancer_ip>.
  3. Wait for the Google-managed cert to become ACTIVE.
  4. Switch the Route53 record to a CNAME: channels.your-company.com to channels.your-company.com.cdn.cloudflare.net.
  5. Turn the Cloudflare proxy ON (orange cloud).

Step 5.8: Override RPC Endpoints

After the service is healthy, replace the image's default public Soroban RPC endpoint, which rate-limits under production load, with private providers. This is a one-time call: the config persists in Redis across restarts.

curl -s \
  -H "Authorization: Bearer <your-relayer-api-key>" \
  -H "Content-Type: application/json" \
  -X PATCH https://channels.your-company.com/api/v1/networks/stellar:mainnet \
  -d '{
    "rpc_urls": [
      { "url": "https://your-primary-rpc.com/key", "weight": 100 },
      { "url": "https://your-secondary-rpc.com/key", "weight": 100 }
    ]
  }'

Verify:

curl -s -H "Authorization: Bearer <your-relayer-api-key>" \
  "https://channels.your-company.com/api/v1/networks?per_page=200" \
  | jq '.data[] | select(.id=="stellar:mainnet") | .rpc_urls'

Use at least two independent providers from different operators. The relayer load-balances by weight and rotates on failure.
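To make the effect of the weight field concrete, here is a generic weighted pick that skips paused endpoints. It is a sketch of the general technique only: the relayer's actual selection and rotation logic is not documented in this guide, and `RpcEndpoint`, `pick_endpoint`, and the `paused` flag are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass
class RpcEndpoint:
    url: str
    weight: int
    paused: bool = False  # set when a provider has been taken out of rotation


def pick_endpoint(endpoints: list[RpcEndpoint], rng: random.Random) -> RpcEndpoint:
    """Choose among non-paused endpoints with probability proportional to weight."""
    healthy = [e for e in endpoints if not e.paused and e.weight > 0]
    if not healthy:
        raise RuntimeError("all RPC endpoints are paused")
    return rng.choices(healthy, weights=[e.weight for e in healthy], k=1)[0]
```

With two endpoints at weight 100 each, traffic splits roughly evenly; if one is paused, every pick goes to the other, which is why a second independent provider matters.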

Re-run this PATCH only if you restart with RESET_STORAGE_ON_START=true, which wipes Redis including the network config. Normal restarts and redeployments preserve it.

Step 5.9: Create the Fund-Relayer Signer

Create a Cloud KMS signer using the provided script:

ENV=mainnet API_KEY="$TF_VAR_relayer_api_key" \
GCP_SA_KEY_FILE="$HOME/path/to/sa-key.json" \
./scripts/gcp-kms-signer.sh

This calls the relayer API with "type": "google_cloud_kms" and creates a signer backed by the Cloud KMS key that Terraform provisioned.

Then create the fund relayer:

curl -s -X POST https://channels.your-company.com/api/v1/relayers \
  -H "Authorization: Bearer $TF_VAR_relayer_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "channels-fund",
    "name": "channels-fund",
    "network": "mainnet",
    "signer_id": "<signer-id-from-above>",
    "network_type": "stellar",
    "paused": false,
    "policies": { "min_balance": 0, "fee_payment_strategy": "relayer" }
  }'

Step 5.10: Bootstrap the Channel-Account Pool

Size the pool before bootstrapping. Formula: min_pool = ceil(target_TPS x avg_settlement_seconds x safety_factor). Stellar settlement averages 5 to 7 seconds; use 1.5 as the safety factor. At 23 TPS sustained and 5-second settlement, that gives ceil(23 x 5 x 1.5) = 173 channels minimum; at 7-second settlement the same load needs 242 (see §10.1 for detail). For a new deployment with no existing traffic, 50 to 100 channels is a reasonable starting point. Use --dry-run to preview what will be created before committing.
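The sizing formula is easy to rerun for your own numbers. The sketch below implements it directly; the function names are hypothetical, and the conversion from daily volume to TPS assumes traffic is spread evenly over 24 hours, which understates peak load.

```python
import math


def sustained_tps(transactions_per_day: float) -> float:
    """Average transactions per second over a 24-hour day."""
    return transactions_per_day / 86_400


def min_pool_size(target_tps: float, settlement_seconds: float, safety_factor: float = 1.5) -> int:
    """min_pool = ceil(target_TPS x avg_settlement_seconds x safety_factor)."""
    return math.ceil(target_tps * settlement_seconds * safety_factor)
```

For example, `min_pool_size(23, 5)` returns 173 and `min_pool_size(23, 7)` returns 242, so the settlement time you assume moves the result substantially. Two million transactions per day averages out to about 23 TPS.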

Install the oz-channels and oz-relayer CLIs from the cli/ directory of relayer-channels-infra:

# From the root of relayer-channels-infra
cd cli
bun install
bun run build

# Link the CLIs globally
cd packages/oz-channels && bun link
cd ../oz-relayer && bun link

# Verify
oz-channels --help
oz-relayer --help

Requires the Bun runtime (Node.js 22+ compatible).

Create a profile and bootstrap:

oz-channels profile init prod-mainnet
# Prompts for: URL, API key, plugin ID (channels), admin secret, network

# Preview
oz-channels bootstrap --to 200 --dry-run -p prod-mainnet

# Provision
oz-channels bootstrap --to 200 -p prod-mainnet

Step 5.11: Verify End-to-End

# Health check
curl -sS https://channels.your-company.com/api/v1/health

# Generate an API key (if Cloudflare is enabled)
curl -X POST https://channels.your-company.com/gen

# Smoke test
oz-channels smoke run -p prod-mainnet

A healthy service returns {"status":"ok"} on the health check. The smoke test submits a test transaction end-to-end and polls for confirmation; on success it prints a confirmed transaction ID. If the smoke test times out without confirmation, check channel pool size (oz-channels channels list -p prod-mainnet) and fund account balance (oz-relayer relayer balance channels-fund -p prod-mainnet) before debugging further.


6. Configuration Reference

Reference for all environment variables and secrets the module manages automatically. See §11 for the full Terraform variable listing.

Module-Managed Container Environment Variables

The Terraform module sets these. Do not override them unless you have a specific reason.

| Env var | Set to | Source |
|---|---|---|
| HOST | 0.0.0.0 | Module |
| STELLAR_NETWORK | var.stellar_network | Module |
| FUND_RELAYER_ID | var.fund_relayer_id | Module |
| API_KEY_HEADER | x-consumer-key | Module, keyed to the Cloudflare Worker rewrite |
| REPOSITORY_STORAGE_TYPE | redis | Module |
| RESET_STORAGE_ON_START | false | Module |
| METRICS_ENABLED | true | Module |
| METRICS_PORT | 8081 | Module |
| LOG_FORMAT | json | Module |
| LOG_LEVEL | var.log_level | Module |
| REDIS_URL | redis://<memorystore-host>:<port> | Module, derived from Memorystore |
| REDIS_READER_URL | redis://<read-endpoint>:<port> | Module, falls back to primary on BASIC tier |
| GCP_PROJECT_ID | var.project_id | Module |
| GCP_REGION | var.region | Module |
| DISTRIBUTED_MODE | var.distributed_mode | Module |
| QUEUE_BACKEND | var.queue_backend (when distributed) | Module |
| PUBSUB_TOPIC_PREFIX | Auto-derived: relayer-{network}-{environment} | Module |
| PUBSUB_PROJECT_ID | var.project_id | Module |

Module-Managed Secrets (from Secret Manager)

| Container env var | Secret Manager ID | Required? | Notes |
|---|---|---|---|
| API_KEY | {app_name}-relayer-api-key | Yes | Authenticates all API requests to the relayer |
| PLUGIN_ADMIN_SECRET | {app_name}-channels-admin-secret | Yes | Required for channel management operations |
| WEBHOOK_SIGNING_KEY | {app_name}-webhook-signing-key | Optional | Only created when webhook_signing_key is set in tfvars. Required if you use webhook notifications, otherwise omit it. |
| STORAGE_ENCRYPTION_KEY | {app_name}-storage-encryption-key | Optional | Only created when storage_encryption_key is set in tfvars. Encrypts sensitive data at rest in Redis. Strongly recommended for production. Must be base64-encoded 32 bytes (openssl rand -base64 32). |

The lifecycle { ignore_changes = [secret_data] } on secret versions means that once a secret is created, Terraform will not overwrite the value if you rotate it through gcloud or the Console.

Rotation procedure:

# Update the secret
echo -n "new-value" | gcloud secrets versions add \
  relayer-channels-relayer-api-key --data-file=- \
  --project=your-project

# Force Cloud Run to pick up the new value
gcloud run services update relayer-channels-service \
  --region=us-east1 --project=your-project \
  --update-labels="redeploy=$(date +%s)"

Production Reference Values

If you are targeting OpenZeppelin's reference scale (about 2M+ tx/day), these are the env-var values to tune:

container_environment = [
  # Worker concurrency
  { name = "BACKGROUND_WORKER_TRANSACTION_REQUEST_CONCURRENCY",                 value = "200" },
  { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY",                  value = "200" },
  { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY",  value = "300" },
  { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_CONCURRENCY",          value = "1" },
  { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_EVM_CONCURRENCY",      value = "1" },
  { name = "BACKGROUND_WORKER_NOTIFICATION_SENDER_CONCURRENCY",                 value = "1" },
  { name = "BACKGROUND_WORKER_SOLANA_TOKEN_SWAP_REQUEST_CONCURRENCY",           value = "1" },
  { name = "BACKGROUND_WORKER_RELAYER_HEALTH_CHECK_CONCURRENCY",                value = "1" },

  # API + plugin concurrency
  { name = "RELAYER_CONCURRENCY_LIMIT",        value = "800" },
  { name = "PLUGIN_MAX_CONCURRENCY",           value = "8000" },
  { name = "MAX_CONNECTIONS",                   value = "4000" },

  # Timeouts
  { name = "REQUEST_TIMEOUT_SECONDS",           value = "60" },
  { name = "PLUGIN_POOL_REQUEST_TIMEOUT_SECS",  value = "60" },
  { name = "PLUGIN_GLOBAL_TIMEOUT_MS",          value = "55000" },
  { name = "PLUGIN_POLLING_TIMEOUT_MS",         value = "45000" },

  # Rate limits
  { name = "RATE_LIMIT_REQUESTS_PER_SECOND",    value = "400" },

  # Redis pools
  { name = "REDIS_POOL_MAX_SIZE",               value = "3000" },
  { name = "REDIS_READER_POOL_MAX_SIZE",        value = "3000" },

  # Transaction cleanup
  { name = "TRANSACTION_EXPIRATION_HOURS",      value = "0.1" },

  # Contract-level pool isolation
  { name = "LIMITED_CONTRACTS",                 value = "C<contract1>,C<contract2>" },
  { name = "CONTRACT_CAPACITY_RATIO",           value = "0.6" },
]

Environment-Based Defaults

| Setting | Production | Non-production |
|---|---|---|
| Min Cloud Run instances | 2 | 1 |
| Max Cloud Run instances | 10 | 4 |
| CPU always allocated | Yes | No |
| Redis tier | STANDARD_HA (failover) | BASIC |
| Redis memory | 5 GB | 1 GB |
| LB deletion protection | Enabled | Disabled |
| Log retention | 30 days | 7 days |

7. Operational Playbook

Day-2 operations: routine deploys, rollbacks, scaling, channel-pool management, and observability. For initial provisioning, see §5.

7.1 Deploys

Routine deploy (new container image):

  1. Build and push the new image to Artifact Registry (or update the remote repo tag).
  2. Update container_image in tfvars to the new tag.
  3. Run terraform apply. Cloud Run creates a new revision and routes traffic to it.

7.2 Rollbacks

Set container_image back to the previous tag and run terraform apply. Cloud Run keeps previous revisions available for instant rollback.

7.3 Scaling

Adjust in tfvars:

cpu                = "4"
memory             = "8Gi"
min_instance_count = 3
max_instance_count = 20

terraform apply rolls the change out as a new Cloud Run revision, without interrupting traffic.

7.4 Channel-Pool Management

# Add slots 201..400
oz-channels bootstrap --from 201 --to 400 -p prod-mainnet

# List current channels
oz-channels channels list -p prod-mainnet

# Add or remove individual channels
oz-channels channels add channel-0050 -p prod-mainnet
oz-channels channels remove channel-0050 -p prod-mainnet

7.5 Monitoring Pub/Sub

Check queue health in GCP Console > Pub/Sub > Subscriptions > Metrics tab:

| Metric | Watch for |
|---|---|
| num_undelivered_messages | A growing backlog means processing is falling behind |
| oldest_unacked_message_age | Above 60s sustained means workers may be stuck |
| Pull/Ack operations | Healthy when messages are consumed as fast as they arrive |

7.6 Monitoring Redis

Check in GCP Console > Memorystore > Instance > Monitoring tab:

| Metric | Watch for |
|---|---|
| CPU utilization | Sustained above 75% |
| Memory usage | Climbing past 70% |
| Connected clients | Approaching the connection limit |

7.7 Inspecting Transactions

oz-relayer tx show <tx-id> -r channels-fund -p prod-mainnet --json
oz-relayer tx list -r channels-fund --status pending -p prod-mainnet
oz-relayer relayer balance channels-fund -p prod-mainnet

7.8 Observability

The relayer emits structured JSON logs and Prometheus-format metrics. On GCP, these map to Cloud Logging and Cloud Monitoring.

Cloud Logging

Cloud Run streams stdout and stderr to Cloud Logging automatically. With LOG_FORMAT=json, the relayer produces structured entries with fields like level, target, span.tx_id, span.relayer_id, and span.request_id.

Viewing logs:

# Recent errors
# LOG_FORMAT=json entries land in jsonPayload, so print whole entries rather than textPayload
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service" AND severity>=ERROR' \
  --project=your-project --limit=20 --freshness=1h --format=json

# Filter by transaction ID (a bare quoted term searches every field of the entry)
gcloud logging read 'resource.type="cloud_run_revision" AND "<tx-id>"' \
  --project=your-project --limit=20 --freshness=1h

# Live tail
gcloud logging tail 'resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service"' \
  --project=your-project

In the Console: Cloud Logging > Logs Explorer, then filter by resource.type="cloud_run_revision" and resource.labels.service_name="<your-service>".

Cloud Monitoring Built-In Metrics

Cloud Run and Pub/Sub emit metrics to Cloud Monitoring automatically, with no agent required.

Cloud Run metrics (GCP Console > Cloud Run > Service > Metrics tab):

| Metric | What it tells you |
|---|---|
| run.googleapis.com/container/cpu/utilization | CPU usage per instance. Sustained above 80% means scale up. |
| run.googleapis.com/container/memory/utilization | Memory usage. Sustained above 70% risks OOM. |
| run.googleapis.com/request_count | Request throughput by response code. Watch for 5xx spikes. |
| run.googleapis.com/request_latencies | p50/p95/p99 latency. Watch for degradation. |
| run.googleapis.com/container/instance_count | Active instances. Confirms autoscaling behavior. |
| run.googleapis.com/container/startup_latencies | Cold-start time. High values affect first-request latency. |

Pub/Sub metrics (GCP Console > Pub/Sub > Subscription > Metrics tab):

| Metric | What it tells you |
|---|---|
| pubsub.googleapis.com/subscription/num_undelivered_messages | Queue depth. A growing backlog means processing is falling behind. |
| pubsub.googleapis.com/subscription/oldest_unacked_message_age | How long the oldest message has waited. Above 60s sustained means workers may be stuck. |
| pubsub.googleapis.com/subscription/pull_message_operation_count | Pull throughput. Confirms workers are active. |
| pubsub.googleapis.com/subscription/ack_message_operation_count | Ack throughput. Confirms messages are being processed. |

Memorystore metrics (GCP Console > Memorystore > Instance > Monitoring tab):

| Metric | What it tells you |
|---|---|
| redis.googleapis.com/stats/cpu_utilization | Redis CPU. Sustained utilization above 75% needs attention. |
| redis.googleapis.com/stats/memory/usage_ratio | Memory usage. Climbing past 70% means you should plan capacity. |
| redis.googleapis.com/stats/connected_clients | Connection count. Watch for approaching limits. |
| redis.googleapis.com/stats/commands_processed | Command throughput. Correlates with transaction volume. |

Log-Based Metrics

Create custom metrics from log patterns in Cloud Logging > Log-based Metrics > Create Metric:

| Metric name | Filter | Purpose |
|---|---|---|
| relayer/errors | resource.type="cloud_run_revision" AND severity>=ERROR | Total error rate |
| relayer/pool_capacity | textPayload:"POOL_CAPACITY" | Channel pool exhaustion events |
| relayer/provider_paused | textPayload:"provider paused" | RPC failover events |
| relayer/tx_confirmed | textPayload:"confirmed" | Transaction confirmation rate |

Or through gcloud:

gcloud logging metrics create relayer-errors \
  --project=your-project \
  --description="Relayer error count" \
  --log-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service" AND severity>=ERROR'

Alerting

Create alert policies in Cloud Monitoring > Alerting > Create Policy:

| Alert | Metric | Condition | Severity |
|---|---|---|---|
| High error rate | relayer/errors (log-based) | More than 50 errors in 5 min | Critical |
| Cloud Run high CPU | container/cpu/utilization | Above 80% for 10 min | Warning |
| Cloud Run high memory | container/memory/utilization | Above 70% for 10 min | Warning |
| Pub/Sub backlog growing | subscription/num_undelivered_messages | Above 5000 for 10 min | Warning |
| Pub/Sub old messages | subscription/oldest_unacked_message_age | Above 300s for 5 min | Critical |
| Pool exhaustion | relayer/pool_capacity (log-based) | Above 0 in 5 min | Critical |

Configure notification channels (email, Slack, PagerDuty) in Cloud Monitoring > Alerting > Notification Channels.

Prometheus Metrics

The relayer exposes Prometheus-format metrics on port 8081 at /debug/metrics/scrape (enabled by METRICS_ENABLED=true). When enable_prometheus = true, the Cloud Run service account is granted monitoring.metricWriter permissions so it can write to Google Cloud Managed Prometheus.

Options for collecting metrics:

  • Use Google Cloud Managed Prometheus with a sidecar collector.
  • Run a self-hosted Prometheus instance that scrapes the Cloud Run service.
  • Rely on the built-in Cloud Run metrics above for most operational needs.

7.9 Stellar-Side Monitoring

GCP metrics reflect service health. These signals reflect Stellar network health; monitor both.

Fund account balance:

oz-relayer relayer balance channels-fund -p prod-mainnet

Alert when the balance drops below 50 XLM. A depleted fund account fails all fee-bumps silently: transactions are submitted but cannot be paid for.

Ledger close time: Stellar closes a ledger roughly every 5 seconds under normal conditions. Sustained close times above 10 seconds indicate network stress; settlement latency will exceed the assumptions used in your channel pool sizing. Query Horizon to check:

curl -sS "https://horizon.stellar.org/ledgers?order=desc&limit=5" | jq '._embedded.records[] | {sequence, closed_at}'
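Once you have the closed_at values from the query above, the gaps between consecutive ledgers can be computed as in the following sketch. The function names are hypothetical, the timestamps are assumed to be the ISO 8601 UTC strings Horizon returns (for example 2024-01-01T00:00:16Z), and treating "every gap above the threshold" as "sustained" is a simplification.

```python
from datetime import datetime


def close_gaps_seconds(closed_at: list[str]) -> list[float]:
    """Seconds between consecutive ledger closes, given timestamps newest first."""
    times = [datetime.fromisoformat(t.replace("Z", "+00:00")) for t in closed_at]
    return [(newer - older).total_seconds() for newer, older in zip(times, times[1:])]


def network_stressed(closed_at: list[str], threshold_seconds: float = 10.0) -> bool:
    """True when every observed gap exceeds the threshold (a rough 'sustained' check)."""
    gaps = close_gaps_seconds(closed_at)
    return bool(gaps) and all(gap > threshold_seconds for gap in gaps)
```

Five records give four gaps; in normal conditions each should sit near 5 seconds.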

TRY_AGAIN_LATER in logs: Horizon is rejecting transactions due to fee competition. This is a Stellar congestion event, not a service failure. Raise MAX_FEE (see §10.7). If TRY_AGAIN_LATER appears alongside provider paused, check RPC provider health first, because an unresponsive provider can force retries against a congested fallback.

RPC provider health: Confirm both endpoints are reachable:

curl -sS -X POST <your-rpc-url> \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' | jq .

8. Debugging Guide

How to Think About Errors

Almost every failure in this system belongs to one of several layers, and the fastest way to debug is to decide which layer owns the symptom before you start reading logs. A request travels from the edge (Cloudflare) to the load balancer, to Cloud Run, into the Channels plugin. The plugin then talks to Redis, Pub/Sub, Cloud KMS, and the Stellar RPC. A 5xx returned at the edge is a different problem from a transaction that was accepted, queued, signed, and then rejected by Horizon.

So when something breaks, work in this order:

  1. Where did it fail? A request that never returns a tx_id failed before or during the synchronous path (edge, LB, auth, fee budget, enqueue). A request that returned a tx_id but never confirmed failed in the async path (channel acquisition, build/simulate, sign, fee-bump, submit, status poll).
  2. What layer owns that step? Match it to a component: auth and rate limits live at the edge and the relayer API, sequence and channel contention live in Redis and the plugin, signing lives in KMS, and the final accept or reject comes from the RPC and Horizon.
  3. Pull the logs for that layer using the entry points below, then match against the common patterns.

The point of this ordering is to avoid reading the wrong logs. Pool exhaustion, sequence drift, and an RPC throttle all look like "transactions are failing" from the outside, but each one lives in a different layer and has a different fix.

Entry Points

| You have | Start with |
| --- | --- |
| Transaction ID | oz-relayer tx show <tx-id> -r channels-fund --json -p <env> |
| Error message | Search Cloud Logging for the error pattern |
| Time window | gcloud logging read with --freshness |
| Stellar tx hash | Query Horizon, then work backwards to the relayer's tx record |
| "What's failing right now" | Filter logs by severity>=ERROR |

Common Log Patterns

| Pattern | What it means |
| --- | --- |
| provider paused | RPC failover triggered |
| sequence, counter | Sequence-number drift or contention |
| POOL_CAPACITY | Channel-account pool exhausted |
| LOCKED_CONFLICT | Two workers tried to acquire the same channel |
| TRY_AGAIN_LATER | Horizon-side throttling |
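The "Error message" and "What's failing right now" entry points both come down to a Cloud Logging filter, and the patterns above are what you match the results against. A minimal sketch follows, assuming the relayer's logs arrive under the standard Cloud Run resource type (cloud_run_revision) and that the service name is the cloud_run_service_name output. Both function names are hypothetical helpers, and the layer labels printed by classify_log_line are one reading of the triage order above, not output from the relayer itself.

```shell
# log_filter SERVICE MIN_SEVERITY [PATTERN]
# Print a Cloud Logging filter for one Cloud Run service, optionally narrowed
# to entries that contain PATTERN in any field.
log_filter() {
  printf 'resource.type="cloud_run_revision" AND resource.labels.service_name="%s" AND severity>=%s' "$1" "$2"
  if [ -n "${3:-}" ]; then
    printf ' AND "%s"' "$3"
  fi
  printf '\n'
}

# classify_log_line LINE
# Map a log line to a likely owning layer using the patterns in the table.
# The layer labels are assumptions for illustration. First match wins.
classify_log_line() {
  case "$1" in
    *"provider paused"*)  echo "rpc: RPC failover triggered" ;;
    *POOL_CAPACITY*)      echo "plugin: channel-account pool exhausted" ;;
    *LOCKED_CONFLICT*)    echo "plugin/redis: two workers tried to acquire the same channel" ;;
    *TRY_AGAIN_LATER*)    echo "stellar: Horizon-side throttling" ;;
    *sequence*|*counter*) echo "redis/plugin: sequence-number drift or contention" ;;
    *)                    echo "unknown" ;;
  esac
}
```

A typical call is gcloud logging read "$(log_filter <service-name> ERROR)" --freshness=1h --limit=50. Patterns such as provider paused may be logged below ERROR, so lower MIN_SEVERITY when a pattern search comes back empty.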

Redis Inspection

Connect from a VM in the same VPC:

redis-cli -h <redis_host> -p <redis_port>
KEYS *tx:*
GET "oz-relayer:relayer:channels-fund:tx:<tx-id>"
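KEYS walks the whole keyspace in a single blocking call, which can stall a busy production instance. The sketch below shows a gentler variant using redis-cli's --scan mode, which iterates with SCAN. It assumes the key layout matches the GET example above; tx_key and scan_tx_keys are hypothetical helper names.

```shell
# tx_key RELAYER_ID TX_ID
# Build the Redis key for a transaction record, following the key layout
# shown in the GET example.
tx_key() {
  printf 'oz-relayer:relayer:%s:tx:%s\n' "$1" "$2"
}

# scan_tx_keys HOST PORT
# List transaction keys with incremental SCAN calls instead of one
# blocking KEYS call.
scan_tx_keys() {
  redis-cli -h "$1" -p "$2" --scan --pattern '*tx:*'
}
```

For a single record, redis-cli -h <redis_host> -p <redis_port> GET "$(tx_key channels-fund <tx-id>)" is equivalent to the GET shown above.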

9. Security Model

Covers secrets handling, network isolation, IAM role assignments, TLS posture, and KMS key management. Review before modifying IAM bindings or network ingress settings.

9.1 Secrets Handling

All secrets are stored in Secret Manager. They are currently passed as plain environment variables to Cloud Run. See Known Issues for the plan to switch to secret_key_ref references.

9.2 Network Isolation

  • Cloud Run ingress: restricted to internal plus load balancer traffic (INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER in production, INGRESS_TRAFFIC_ALL for testing).
  • Cloud Run egress: a VPC Connector with PRIVATE_RANGES_ONLY. Private traffic goes through the VPC (to Memorystore), and public traffic (Stellar RPC, KMS API) goes direct.
  • Memorystore: reachable only through Private Service Access (VPC peering). No public IP.
  • Pub/Sub: IAM-scoped. Only the Cloud Run service account has publisher and subscriber access to the relayer's topics.

9.3 IAM Least-Privilege

The Cloud Run service account ({app_name}-run) has:

| Role | Scope | Purpose |
| --- | --- | --- |
| secretmanager.secretAccessor | Per-secret | Read secrets at startup |
| monitoring.metricWriter | Project | Write custom metrics |
| logging.logWriter | Project | Write application logs |
| monitoring.viewer | Project | Read Pub/Sub backlog depth |
| cloudkms.signerVerifier | Per-key | Sign transactions |
| cloudkms.publicKeyViewer | Per-key | Read the public key |
| pubsub.publisher | Per-topic | Publish job messages |
| pubsub.subscriber | Per-subscription | Pull and ack messages |
| artifactregistry.reader | Per-repository | Pull container images |

9.4 TLS Posture

  • Load balancer: Google-managed SSL certificate, HTTPS on 443, HTTP redirects to HTTPS.
  • Memorystore: transit encryption is disabled, since Private Service Access provides network-level isolation. Enable it if your compliance requirements call for it and the relayer binary supports TLS (see Known Issues).
  • Cloudflare to LB: set the Cloudflare zone SSL mode to "Full" for end-to-end TLS.

9.5 Cloud KMS for Stellar Signers

  • Key algorithm: EC_SIGN_ED25519 (the Stellar-compatible ED25519 curve).
  • Protection level: SOFTWARE. HSM is also supported but adds latency.
  • IAM: the Cloud Run SA has signerVerifier and publicKeyViewer on the key.
  • Rotation: provision a new key, register a new signer and relayer, fund the new on-chain account, drain the old one, then retire it.

10. Key Gotchas

Operational sharp edges encountered in production deployments. Each item describes a failure mode, its cause, and the fix.

10.1 Channel-Account Exhaustion (POOL_CAPACITY)

Sizing formula:

min_pool = ceil(target_TPS x avg_settlement_seconds x safety_factor)

At about 23 TPS sustained, with roughly 5s Stellar settlement and a 1.5x safety factor: 23 x 5 x 1.5 = 172.5, which rounds up to 173 channels minimum.
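The sizing formula is easy to script when you are comparing several load scenarios. This is a minimal sketch; min_pool is a hypothetical helper name, and the inputs are whatever TPS, settlement time, and safety factor you measure or choose for your own deployment.

```shell
# min_pool TARGET_TPS AVG_SETTLEMENT_SECONDS SAFETY_FACTOR
# Print ceil(tps * settlement * safety), the minimum channel-pool size.
min_pool() {
  awk -v tps="$1" -v settle="$2" -v safety="$3" 'BEGIN {
    need = tps * settle * safety
    pool = int(need)
    if (pool < need) pool++   # round up: a fractional channel needs a whole account
    print pool
  }'
}
```

With the numbers from the example, min_pool 23 5 1.5 prints 173.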

Recovery: oz-channels bootstrap --from <existing+1> --to <new-total>.

10.2 SSL Certificate Provisioning

Google-managed certificates need DNS to point at the LB IP before they provision. With Cloudflare enabled, you have to temporarily point DNS straight at the LB IP (bypassing the Cloudflare proxy), wait for the cert to become ACTIVE, then switch to the Cloudflare CNAME.

If the cert is stuck in FAILED_NOT_VISIBLE for more than 30 minutes, it usually needs to be recreated. Bump the cert name suffix in load-balancer.tf (for example -cert-v2 to -cert-v3) and re-apply. The create_before_destroy lifecycle provisions the new cert before removing the old one, so there is no downtime.

10.3 VPC Connector CIDR Overlap

If you run multiple environments (stg and prod) in the same VPC, each one needs a unique connector_ip_cidr_range (for example 10.8.0.0/28 for stg and 10.9.0.0/28 for prod).
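Before applying a second environment, it can help to check candidate ranges for overlap rather than discovering the conflict at apply time. The following is a sketch only: it handles well-formed IPv4 CIDRs without leading zeros, and ip_to_int, cidr_range, and cidr_overlap are hypothetical helper names.

```shell
# ip_to_int A.B.C.D
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  a=${1%%.*}; rest=${1#*.}
  b=${rest%%.*}; rest=${rest#*.}
  c=${rest%%.*}; d=${rest#*.}
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# cidr_range CIDR
# Print the first and last address of the range as two integers.
cidr_range() {
  ip=$(ip_to_int "${1%/*}")
  size=$(( 1 << (32 - ${1#*/}) ))
  start=$(( ip - ip % size ))   # align down to the network address
  echo "$start $(( start + size - 1 ))"
}

# cidr_overlap CIDR1 CIDR2
# Exit 0 if the two ranges share at least one address, 1 otherwise.
cidr_overlap() {
  # After set: $1/$2 are the first range, $3/$4 the second.
  set -- $(cidr_range "$1") $(cidr_range "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}
```

For example, cidr_overlap 10.8.0.0/28 10.9.0.0/28 returns a non-zero status (no overlap), so those two ranges are safe to use side by side.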

10.4 Private Service Access (Shared Connection)

A VPC can hold only one Private Service Access connection to servicenetworking.googleapis.com. If stg creates it first, prod's apply will fail unless update_on_creation_fail = true is set on the google_service_networking_connection resource. The module handles this.

10.5 Pub/Sub Topic Prefix and Image Compatibility

The PUBSUB_TOPIC_PREFIX env var has to match what the container image expects. Different image versions may or may not append a trailing dash to the prefix. If you see "topic does not exist" errors with double dashes (relayer-mainnet-prod--), remove the trailing dash from the prefix. If topics are missing entirely (no dash), add it back.

10.6 STORAGE_ENCRYPTION_KEY Format

The encryption key has to be base64-encoded 32 bytes (44 characters with = padding). Generate it with openssl rand -base64 32. Hex-encoded keys fail with "Invalid key length: expected 32 bytes, got 0"; the message does not mention encoding, so check the key format first when you see it.
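A quick local check catches a wrongly encoded key before it reaches Secret Manager. This sketch assumes a base64 command that accepts -d (GNU coreutils, BusyBox, and recent macOS do); is_valid_storage_key is a hypothetical helper, not part of the relayer.

```shell
# is_valid_storage_key KEY
# Exit 0 only if KEY is base64 that decodes to exactly 32 bytes.
is_valid_storage_key() {
  decoded_len=$(printf '%s' "$1" | base64 -d 2>/dev/null | wc -c | tr -d ' ')
  [ "$decoded_len" -eq 32 ]
}
```

A 64-character hex string fails this check because it decodes as base64 to 48 bytes rather than 32.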

10.7 Fee-Bump Tuning Under Congestion

Set this through the MAX_FEE env var (default 1000000 stroops, which is 0.1 XLM). Under network congestion, raise it to 10000000 (1 XLM). The Channels plugin uses static fees, so it does not dynamically bump on INSUFFICIENT_FEE.
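Fee values are easy to misread by a factor of ten, so it is worth converting before you change MAX_FEE. A minimal sketch, using the fixed Stellar ratio of 10,000,000 stroops per XLM; stroops_to_xlm is a hypothetical helper name.

```shell
# stroops_to_xlm STROOPS
# Convert stroops to XLM (1 XLM = 10,000,000 stroops), printed to 7 decimals,
# which is the smallest unit the network represents.
stroops_to_xlm() {
  awk -v s="$1" 'BEGIN { printf "%.7f\n", s / 10000000 }'
}
```

stroops_to_xlm 1000000 prints 0.1000000 (the default) and stroops_to_xlm 10000000 prints 1.0000000 (the suggested congestion value).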


11. Terraform Variables Reference

Complete listing of all module variables. Required variables must be set in terraform.tfvars; optional variables document their module defaults here.

Required

| Name | Type | Description |
| --- | --- | --- |
| project_id | string | GCP project ID |
| region | string | GCP region (for example us-east1) |
| environment | string | Deployment environment (prod, stg). 1 to 16 chars. |
| network | string | VPC network name or self_link |
| subnetwork | string | Subnet name or self_link |
| domain_name | string | FQDN for the service |
| container_image | string | Container image URI |
| relayer_api_key | string | Relayer API key (sensitive) |
| channels_admin_secret | string | Admin secret (sensitive) |

Optional, Core

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| app_name | string | "relayer-channels" | Resource name prefix |
| name_suffix_environment | bool | true | Append -{env} to names (auto-off for prod) |
| labels | map(string) | {} | Labels for all resources |

Optional, Networking

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| connector_machine_type | string | "e2-micro" | VPC connector machine type |
| connector_min_instances | number | 2 | Min connector instances |
| connector_max_instances | number | 3 | Max connector instances |
| connector_ip_cidr_range | string | "10.8.0.0/28" | CIDR for the VPC connector (/28, must not overlap) |

Optional, Container / Cloud Run

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| container_port | number | 8080 | Container port |
| cpu | string | "1" | CPU allocation ("1", "2", "4") |
| memory | string | "2Gi" | Memory allocation |
| min_instance_count | number | null | Min instances. Auto: 2 (prod), 1 (non-prod) |
| max_instance_count | number | null | Max instances. Auto: 10 (prod), 4 (non-prod) |
| cpu_always_allocated | bool | null | Always allocate CPU. Auto: true (prod) |
| health_check_path | string | "/api/v1/health" | Probe path |
| container_environment | list(object) | [] | Additional env vars (user overrides win) |

Optional, Application

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| stellar_network | string | "testnet" | mainnet or testnet |
| fund_relayer_id | string | "channels-fund" | Fund relayer ID |
| distributed_mode | bool | true | Enable distributed queue processing |
| queue_backend | string | "pubsub" | pubsub (recommended) or redis |
| log_level | string | "warn" | Application log level |

Optional, Secrets

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| webhook_signing_key | string | "" | Webhook signing key (sensitive). Only set it if you use webhook notifications, otherwise omit it. |
| storage_encryption_key | string | "" | Encrypts data at rest in Redis. Must be base64-encoded 32 bytes (sensitive). Strongly recommended for production. |

Optional, Redis

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| redis_tier | string | null | BASIC or STANDARD_HA. Auto per environment. |
| redis_memory_size_gb | number | null | Memory in GB. Auto: 5 (prod), 1 (non-prod). |
| redis_version | string | "REDIS_7_2" | Redis version |

Optional, Cloudflare

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| enable_cloudflare | bool | false | Enable the Cloudflare Workers gateway |
| cloudflare_zone_id | string | "" | Required when Cloudflare is enabled |
| cloudflare_account_id | string | "" | Required when Cloudflare is enabled |
| relayer_static_api_key | string | "" | Static API key injected by the Worker upstream (sensitive). Use the same value as relayer_api_key. |
| key_salt | string | "" | Salt for hashing user API keys before storing in KV (sensitive). Generate with openssl rand -base64 32. |
| gen_ip_rate_hour | number | 2 | Max /gen per IP per hour |
| relay_rpm_per_key | number | 60 | Max relay RPM per key |

Optional, Load Balancer

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| lb_deletion_protection | bool | null | Auto: true (prod), false (non-prod) |
| lb_log_sample_rate | number | 0 | Request log sampling (0 disables it) |

Outputs

| Name | Description |
| --- | --- |
| cloud_run_service_name | Cloud Run service name |
| cloud_run_service_uri | Cloud Run service URI (internal) |
| cloud_run_service_account_email | Cloud Run service account email |
| load_balancer_ip | Global static IP of the HTTPS LB |
| domain_name | Service domain name |
| redis_host / redis_port / redis_read_endpoint | Memorystore connection info |
| pubsub_topics / pubsub_subscriptions | Map of queue names to Pub/Sub resource names |
| secret_ids | Map of secret names to Secret Manager IDs |
| kms_key_ring_name / kms_signing_key_name / kms_signing_key_id | Cloud KMS key info |
| artifact_registry_repository / artifact_registry_url | Artifact Registry info |
| cloudflare_worker_name | Worker name (null if disabled) |

12. Known Issues

Tracked limitations with current workarounds. These are active constraints, not historical bugs.

Memorystore Redis TLS

Transit encryption is disabled because the relayer binary is not compiled with TLS support for Redis connections. This is acceptable because Memorystore is reachable only through Private Service Access (VPC peering), so traffic never leaves Google's network.

Secret Manager References

Secrets are currently passed as plain environment variables to Cloud Run instead of using secret_key_ref Secret Manager references. This is a workaround for a 0-byte issue hit during the initial deployment. The plan is to switch back to Secret Manager references for a better security posture.
