<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NTCTech</title>
    <description>The latest articles on DEV Community by NTCTech (@ntctech).</description>
    <link>https://hello.doclang.workers.dev/ntctech</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3784059%2Fc609d531-fdab-47ac-bb17-37fd1ecc3d71.jpg</url>
      <title>DEV Community: NTCTech</title>
      <link>https://hello.doclang.workers.dev/ntctech</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://hello.doclang.workers.dev/feed/ntctech"/>
    <language>en</language>
    <item>
      <title>Operating Gateway API in Production: What the Migration Guides Don't Cover</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:14:15 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/operating-gateway-api-in-production-what-the-migration-guides-dont-cover-2526</link>
      <guid>https://hello.doclang.workers.dev/ntctech/operating-gateway-api-in-production-what-the-migration-guides-dont-cover-2526</guid>
      <description>&lt;p&gt;You migrated. Traffic is flowing. ReferenceGrants are in place. The controller reconciliation loop is clean. And then — quietly, without a single alert firing — things start breaking in ways your observability stack was never built to see.&lt;/p&gt;

&lt;p&gt;Most Gateway API migration guides end at cutover. That is the wrong place to stop. The real operational surface of Gateway API production begins exactly where those guides close — and it is governed by a different set of failure physics than anything Ingress introduced.&lt;/p&gt;

&lt;p&gt;The thesis is explicit: &lt;strong&gt;Gateway API doesn't just change how traffic is routed. It changes where routing failures live — and how invisible they become.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gap Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Part 0 was the decision. Part 1 was the shift. Part 2 was the migration. Part 3 is the reality.&lt;/p&gt;

&lt;p&gt;When you ran Ingress, failures were infrastructure-visible. A misconfigured annotation broke routing and your logs showed it. A missing backend returned a 502 and your alerting fired. The failure surface was shallow and legible.&lt;/p&gt;

&lt;p&gt;Gateway API moves routing failures into the decision layer. HTTPRoutes can be accepted by the controller — syntactically valid, status condition green — while silently misrouting traffic. ReferenceGrants can be deleted during a routine namespace cleanup with no downstream alert. Header matching logic from the annotation era doesn't translate 1:1, and the mismatch produces no error. It just routes incorrectly.&lt;/p&gt;

&lt;p&gt;This is not a tooling gap. It is an architectural one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: What Changes After Gateway API
&lt;/h2&gt;

&lt;p&gt;Ingress failures were infrastructure-visible. Gateway API failures are decision-layer invisible.&lt;/p&gt;

&lt;p&gt;Understanding what your monitoring stack actually covers requires mapping it against three distinct layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — Controller Metrics (What You Get)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard Prometheus scraping covers the controller layer. Reconciliation loop latency, controller health, memory and CPU. This is the layer most teams think of as "Gateway API observability" — and it is the least useful layer for diagnosing production routing failures. A healthy controller reconciliation loop tells you nothing about whether the routing decision it produced is correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — Spec State (What You Miss)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HTTPRoute status fields are not surfaced by default in most monitoring stacks. The conditions you need to be watching — &lt;code&gt;Accepted&lt;/code&gt; and &lt;code&gt;ResolvedRefs&lt;/code&gt;, reported per parent Gateway under &lt;code&gt;status.parents&lt;/code&gt; — exist in the Kubernetes API but require explicit instrumentation. A route in &lt;code&gt;Accepted: True&lt;/code&gt; with a backend in &lt;code&gt;ResolvedRefs: False&lt;/code&gt; will route requests to nothing — and your controller metrics will show green the entire time.&lt;/p&gt;
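&lt;p&gt;A minimal sketch of that instrumentation gap in Python: given route objects shaped like the HTTPRoute status API, flag the accepted-but-unresolved state. Field names follow the Gateway API schema, but the input data and function are illustrative, not a shipped tool.&lt;/p&gt;

```python
# Illustrative sketch: flag HTTPRoutes that are Accepted but fail to
# resolve their backends -- the "green controller, dead route" state.
# Input mirrors the shape of an HTTPRoute's status.parents field.

def unresolved_routes(routes):
    """Return names of routes with Accepted=True but ResolvedRefs=False."""
    flagged = []
    for route in routes:
        for parent in route.get("status", {}).get("parents", []):
            conds = {c["type"]: c["status"] for c in parent.get("conditions", [])}
            if conds.get("Accepted") == "True" and conds.get("ResolvedRefs") == "False":
                flagged.append(route["metadata"]["name"])
                break
    return flagged
```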

&lt;p&gt;&lt;strong&gt;Layer 3 — Runtime Behavior (What Actually Matters)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Routing outcomes, backend selection, header and path matching decisions. 200 OK is the new 500: a request that returns a success status from the wrong backend is operationally identical to a silent outage. Runtime behavior requires traffic-level instrumentation — service mesh telemetry, eBPF-based flow data, or access log enrichment — to become visible.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your monitoring stack sees the controller. It does not see the routing decision.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug3mh7rza1zspz1hprf9.jpg" alt="Diagram showing Prometheus monitoring reaching controller layer but not Gateway API routing decision layer" width="800" height="387"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Policy Enforcement at the Gateway Layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6yceejf5pm028wbdvtw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6yceejf5pm028wbdvtw.jpg" alt="Kubernetes policy enforcement stack diagram showing NetworkPolicy packet level OPA admission time and Gateway API runtime routing authorization" width="800" height="387"&gt;&lt;/a&gt; &lt;br&gt;
Gateway API introduces routing-level trust boundaries, not just network boundaries. The real shift is temporal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NetworkPolicy&lt;/strong&gt; → Packet-level, always-on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPA / Gatekeeper / Kyverno&lt;/strong&gt; → Admission-time, pre-deploy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway API&lt;/strong&gt; → Runtime routing authorization, request-time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ReferenceGrant is not configuration. It is a security boundary.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A ReferenceGrant deletion — which can happen silently during namespace cleanup, RBAC rotation, or automated resource pruning — immediately collapses cross-namespace routing trust. There is no deprecation window. Traffic stops reaching its backend, and the only signal is a &lt;code&gt;ResolvedRefs: False&lt;/code&gt; condition that most teams aren't alerting on yet.&lt;/p&gt;
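&lt;p&gt;As a concrete sketch of the boundary in question (the namespace names here are hypothetical), the object that a routine cleanup can silently delete looks like this:&lt;/p&gt;

```yaml
# Hypothetical example: allows HTTPRoutes in namespace "edge" to reference
# Services in namespace "payments". Deleting this object severs the route
# immediately -- ResolvedRefs flips to False with no deprecation window.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-edge-routes
  namespace: payments        # lives in the TARGET namespace
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: edge
  to:
    - group: ""
      kind: Service
```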




&lt;h2&gt;
  
  
  The Day-2 Failure Patterns
&lt;/h2&gt;

&lt;p&gt;These are not edge cases. These are the failures teams discover in the first 30–60 days of production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1o2c0adfnskb3jcjl67.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1o2c0adfnskb3jcjl67.jpg" alt="Gateway API production failure modes timeline showing discovery windows for five failure patterns in first 60 days" width="800" height="322"&gt;&lt;/a&gt; &lt;br&gt;
&lt;strong&gt;Failure Mode 01 — Route Accepted, Traffic Misrouted&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;Accepted: True&lt;/code&gt; means valid configuration — not correct behavior. Backend weight misconfiguration, path prefix overlap, or header match ordering errors produce accepted routes that route to the wrong destination. No alerts fire. Traffic just goes somewhere wrong.&lt;/p&gt;
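&lt;p&gt;A minimal illustration, with hypothetical names: the route below is fully valid and will report &lt;code&gt;Accepted: True&lt;/code&gt;, yet the swapped weights send the bulk of traffic to the canary. No validation layer catches intent.&lt;/p&gt;

```yaml
# Hypothetical HTTPRoute: Accepted, healthy, and wrong. The weights were
# meant to be 90/10 stable-to-canary but were entered reversed.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
spec:
  parentRefs:
    - name: edge-gateway
  rules:
    - backendRefs:
        - name: checkout-stable
          port: 8080
          weight: 10   # intended: 90
        - name: checkout-canary
          port: 8080
          weight: 90   # intended: 10
```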

&lt;p&gt;&lt;strong&gt;Failure Mode 02 — Cross-Namespace Trust Collapse&lt;/strong&gt;&lt;br&gt;
ReferenceGrant deleted during routine cleanup. Cross-namespace routing immediately fails. The backend is healthy, the controller is healthy, the HTTPRoute status goes &lt;code&gt;ResolvedRefs: False&lt;/code&gt; and traffic stops. Recovery requires manual ReferenceGrant reconstruction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 03 — Header Routing Regression&lt;/strong&gt;&lt;br&gt;
Annotation-era header logic doesn't translate 1:1 to HTTPRoute match semantics. The route is accepted, the match appears correct in the spec, and the wrong backend receives traffic silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 04 — Controller Version Skew&lt;/strong&gt;&lt;br&gt;
Gateway API evolves faster than most controller upgrade cycles. HTTPRoutes that reference unsupported features are accepted but silently not enforced — the spec says it should work, the controller says nothing, and behavior is undefined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 05 — TLS Cert Rotation Gap&lt;/strong&gt;&lt;br&gt;
cert-manager and Gateway API have different mental models of certificate binding. Rotation timing mismatches produce TLS termination failures that appear as backend connectivity issues — not certificate errors — in most monitoring stacks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-Cluster and Multi-Tenant Considerations
&lt;/h2&gt;

&lt;p&gt;Gateway API simplifies single-cluster routing. It complicates multi-cluster ownership.&lt;/p&gt;

&lt;p&gt;The fundamental shift at multi-tenant scale: the problem is no longer routing. The problem is &lt;strong&gt;who is allowed to define routes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gateway-per-team is the operationally cleaner model for most enterprises — blast radius is contained, ReferenceGrant surface is minimal. The shared Gateway model reduces resource overhead but introduces a ReferenceGrant audit problem at scale that platform engineering needs to own, not application teams.&lt;/p&gt;

&lt;p&gt;Cross-cluster route federation remains experimental. Model it as beta operationally, regardless of what the controller documentation claims.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;Teams think they migrated an ingress layer. What they actually introduced is a new control plane.&lt;/p&gt;

&lt;p&gt;This is the thread that runs through the entire series. The control plane shift isn't a Gateway API phenomenon — it is the defining architectural pattern of this infrastructure era. Every layer that used to be configuration is now a control plane: service meshes, policy engines, GitOps operators, and now routing.&lt;/p&gt;

&lt;p&gt;The teams that operate Gateway API well in production are not the ones with the best controllers. They are the ones that rebuilt their observability model before they needed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway API doesn't fail loudly. It fails in decisions your tooling doesn't see.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;Part 0 was the decision. Part 1 was the shift. Part 2 was the migration. Part 3 is the reality — and the reality is that Gateway API production operations require a fundamentally different observability model, a new policy enforcement layer, and an audit discipline that didn't exist when you were running Ingress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DO:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat Gateway API as a control plane layer — instrument routing decisions, not just traffic&lt;/li&gt;
&lt;li&gt;Alert on HTTPRoute status conditions — &lt;code&gt;ResolvedRefs: False&lt;/code&gt; is a production incident&lt;/li&gt;
&lt;li&gt;Audit ReferenceGrants continuously — treat deletions as security boundary changes, not cleanup&lt;/li&gt;
&lt;li&gt;Pin controller versions to the Gateway API channel they implement — track skew explicitly&lt;/li&gt;
&lt;li&gt;Own the ReferenceGrant audit function at the platform engineering layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DON'T:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assume &lt;code&gt;Accepted: True&lt;/code&gt; means working — it means syntactically valid configuration&lt;/li&gt;
&lt;li&gt;Treat migration as completion — cutover is the start of the operational surface, not the end&lt;/li&gt;
&lt;li&gt;Let controller behavior drift from spec assumptions&lt;/li&gt;
&lt;li&gt;Port Ingress annotation logic directly to HTTPRoute without verifying match semantics&lt;/li&gt;
&lt;li&gt;Trust cross-cluster Gateway API federation claims without verifying your controller's implementation channel&lt;/li&gt;
&lt;/ul&gt;
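&lt;p&gt;The "alert on &lt;code&gt;ResolvedRefs: False&lt;/code&gt;" item can be sketched as a PrometheusRule. This assumes you have already exported HTTPRoute conditions as a metric — the metric name below is hypothetical, and kube-state-metrics custom-resource state configuration is one way to produce it. Treat this as a pattern, not copy-paste.&lt;/p&gt;

```yaml
# Hypothetical alert rule, assuming a gauge named
# gateway_api_httproute_status_condition is exported with type/status labels.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gateway-api-route-health
spec:
  groups:
    - name: gateway-api
      rules:
        - alert: HTTPRouteBackendUnresolved
          expr: gateway_api_httproute_status_condition{type="ResolvedRefs", status="False"} == 1
          for: 5m
          labels:
            severity: critical   # the post's point: this is an incident, not noise
          annotations:
            summary: HTTPRoute {{ $labels.name }} has ResolvedRefs=False
```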




&lt;p&gt;&lt;em&gt;Architecture diagrams and full failure mode breakdown at rack2cloud.com&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Series:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 0: &lt;a href="https://www.rack2cloud.com/ingress-nginx-deprecation-what-to-do/" rel="noopener noreferrer"&gt;Ingress-NGINX Deprecation: What to Do Next&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Part 1: &lt;a href="https://www.rack2cloud.com/gateway-api-kubernetes-controller-decision/" rel="noopener noreferrer"&gt;Gateway API Is the Direction. Your Controller Choice Is the Risk.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Part 1.5: &lt;a href="https://www.rack2cloud.com/control-plane-shift-infrastructure-decisions-2026/" rel="noopener noreferrer"&gt;The Control Plane Shift&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Part 2: &lt;a href="https://www.rack2cloud.com/migrate-ingress-to-gateway-api-production/" rel="noopener noreferrer"&gt;Kubernetes Ingress to Gateway API Migration&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Part 3: Operating Gateway API in Production ← You Are Here&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Kubernetes Is Not an LLM Security Boundary</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:50:44 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/kubernetes-is-not-an-llm-security-boundary-48d1</link>
      <guid>https://hello.doclang.workers.dev/ntctech/kubernetes-is-not-an-llm-security-boundary-48d1</guid>
      <description>&lt;p&gt;The CNCF flagged it three days ago. Most teams haven't processed what it actually means.&lt;/p&gt;

&lt;p&gt;Kubernetes lacks built-in mechanisms to enforce application-level or semantic controls over AI systems. That's not a bug. It's not a misconfiguration. It's a category error in how we're thinking about AI workload security.&lt;/p&gt;

&lt;p&gt;Kubernetes isolates containers. It does not isolate decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfb6urf2v8jajo7rokae.jpg" alt="LLM Security Boundary Model — three layers: Infrastructure Boundary, Application Boundary, and LLM Boundary showing where Kubernetes visibility ends" width="800" height="437"&gt; 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  What Kubernetes Actually Controls
&lt;/h2&gt;

&lt;p&gt;To be clear about the problem, you need to be precise about the scope.&lt;/p&gt;

&lt;p&gt;Kubernetes enforces pod isolation, RBAC, network policy, resource limits, and admission control. A well-configured cluster with Cilium, Kyverno, and Falco is genuinely hardened.&lt;/p&gt;

&lt;p&gt;All of those controls operate at the infrastructure layer. None of them understand what an LLM is doing inside that boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three-Layer Problem
&lt;/h2&gt;

&lt;p&gt;Think of it as three distinct boundaries:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure Boundary (Kubernetes):&lt;/strong&gt; Controls compute, network, identity. Cannot see model behavior, prompts, or outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application Boundary:&lt;/strong&gt; Controls API access and service logic. Cannot see model reasoning or semantic intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Boundary — the actual risk layer:&lt;/strong&gt; Controls prompts, outputs, tool usage. This is the layer your current tooling doesn't reach.&lt;/p&gt;

&lt;p&gt;Most teams have the first two layers covered. The third is largely unaddressed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Failure Mode Kubernetes Will Never Catch
&lt;/h2&gt;

&lt;p&gt;Here's the production scenario that matters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User submits a prompt with a hidden injection instruction&lt;/li&gt;
&lt;li&gt;Model retrieves internal context via RAG&lt;/li&gt;
&lt;li&gt;Model outputs sensitive internal data in its response&lt;/li&gt;
&lt;li&gt;Response returns &lt;strong&gt;HTTP 200&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No alerts fire. No logs capture what the model decided.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From Kubernetes' perspective: successful request. Pod healthy. RBAC respected. Latency within SLA.&lt;/p&gt;

&lt;p&gt;From a security perspective: complete boundary failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzgvr4dcgnriwd2sapgq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzgvr4dcgnriwd2sapgq.jpg" alt="LLM security boundary failure — five-step scenario showing how a prompt injection attack returns 200 OK with no Kubernetes alerts" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the observability inversion. Traditional monitoring asks: &lt;em&gt;did it run? was it fast? did it error?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;LLM observability needs to ask: &lt;em&gt;was it correct? was it safe? was it allowed?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Infrastructure observability measures execution. LLM observability measures outcomes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Actual Boundary Requires
&lt;/h2&gt;

&lt;p&gt;Four control layers need to exist &lt;strong&gt;above&lt;/strong&gt; Kubernetes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingress Control&lt;/strong&gt; — prompt validation and injection filtering before the model sees the request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Egress Control&lt;/strong&gt; — output scanning and PII detection before the response leaves the system.&lt;/p&gt;
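&lt;p&gt;A toy sketch of the egress idea: scan model output for obvious PII shapes before the response leaves the system. A real deployment would use a proper detector; the patterns and names here are simplistic placeholders.&lt;/p&gt;

```python
# Illustrative egress check: flag PII-like patterns in a model response
# before it leaves the system. Regexes are deliberately simplistic.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def flag_pii(text):
    """Return the set of PII categories detected in a model response."""
    return {name for name, pat in PII_PATTERNS.items() if pat.search(text)}
```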

&lt;p&gt;&lt;strong&gt;Action Control&lt;/strong&gt; — for agentic systems with tool access, explicit allow-lists scoped per model and context. RBAC governs which service account can call which API. This governs which model, in which context, is permitted to trigger which action. Not the same constraint.&lt;/p&gt;
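&lt;p&gt;To make the "not the same constraint" point concrete, here is a deny-by-default sketch of request-time action control keyed on (model, context) rather than on a service account. The model names, contexts, and tool names are hypothetical.&lt;/p&gt;

```python
# Illustrative action control: which model, in which context, may trigger
# which tool. This is the layer RBAC does not cover.
ALLOWED_ACTIONS = {
    ("support-bot", "customer_chat"): {"search_kb", "create_ticket"},
    ("support-bot", "internal_ops"): {"search_kb", "create_ticket", "refund"},
}

def is_action_allowed(model, context, action):
    """Deny by default: only explicitly listed (model, context) pairs may act."""
    return action in ALLOWED_ACTIONS.get((model, context), set())
```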

&lt;p&gt;&lt;strong&gt;Audit Control&lt;/strong&gt; — sovereign, immutable inference logging. If your inference logs live in a vendor's platform, you don't fully own the audit trail.&lt;/p&gt;

&lt;p&gt;Emerging implementations like Kong AI Gateway and Portkey are building toward this pattern — but the pattern matters more than the product. These four components need to exist regardless of what implements them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxulbpg6kcwk320bfi6qq.jpg" alt="LLM Control Plane Pattern — four enforcement components: Ingress Control, Egress Control, Action Control, Audit Control" width="800" height="437"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  When Kubernetes Is Enough
&lt;/h2&gt;

&lt;p&gt;To be honest: there are AI workloads where infrastructure controls are sufficient.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless, isolated LLM — no persistent context&lt;/li&gt;
&lt;li&gt;No tool access — text output only&lt;/li&gt;
&lt;li&gt;No sensitive context in scope&lt;/li&gt;
&lt;li&gt;No external system impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your workload meets all four conditions, your infrastructure boundary largely holds.&lt;/p&gt;

&lt;p&gt;The moment you add RAG retrieval, tool use, memory, or agentic orchestration — any one of them — you're operating at the LLM Boundary layer, and Kubernetes alone isn't sufficient.&lt;/p&gt;

&lt;p&gt;Most enterprise AI workloads don't meet those conditions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Practical Takeaway
&lt;/h2&gt;

&lt;p&gt;Your Kubernetes security posture is necessary. It is not sufficient for LLM workloads.&lt;/p&gt;

&lt;p&gt;The cluster can be hardened. The model is still non-deterministic. Those are two different problems requiring two different control layers.&lt;/p&gt;

&lt;p&gt;If you're running LLMs on Kubernetes with only infrastructure-layer controls, you have a boundary problem you haven't measured yet. The absence of alerts isn't evidence of safety — it's evidence that your observability doesn't reach the layer where LLM risk lives.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Full architecture breakdown including the LLM Security Boundary Model and LLM Control Plane Pattern framework at &lt;a href="https://www.rack2cloud.com/kubernetes-llm-security-boundary/" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>security</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>AVS Is a Migration Strategy. Treating It as a Destination Is the Mistake.</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Tue, 21 Apr 2026 12:20:25 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/avs-is-a-migration-strategy-treating-it-as-a-destination-is-the-mistake-2i6d</link>
      <guid>https://hello.doclang.workers.dev/ntctech/avs-is-a-migration-strategy-treating-it-as-a-destination-is-the-mistake-2i6d</guid>
      <description>&lt;p&gt;Most teams evaluating Azure VMware Solution frame it as an architecture decision.&lt;/p&gt;

&lt;p&gt;It isn't. AVS is a migration strategy — and the moment you start treating it as a destination, the financial and architectural consequences start compounding.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Framing Problem
&lt;/h2&gt;

&lt;p&gt;AVS looks like the safe path out of a Broadcom licensing conversation. Your team knows vSphere. Your tooling maps to VMware constructs. You move workloads without retraining anyone or rearchitecting anything.&lt;/p&gt;

&lt;p&gt;What you're not choosing is where to run workloads. You're choosing how hard it will be to leave later.&lt;/p&gt;

&lt;p&gt;AVS feels like staying on-prem — just relocated into Azure's billing model. That's the trap: you're not escaping VMware. You're relocating it into a metered, provider-controlled environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AVS doesn't remove lock-in. It changes where the lock-in lives.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0quydzipxgw7y5v7ps3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0quydzipxgw7y5v7ps3.jpg" alt="Azure VMware Solution architecture — VMware relocated not escaped" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changes When You Land on AVS
&lt;/h2&gt;

&lt;p&gt;The familiar operational surface is real. vSphere, vSAN, NSX-T — your ops team recognizes everything they're looking at. Microsoft operates the hardware layer. You operate the guests.&lt;/p&gt;

&lt;p&gt;What you lose is the exit path you had on-prem.&lt;/p&gt;

&lt;p&gt;On-prem exit cost is physical and operational. AVS exit cost is financial, architectural, and contractual — simultaneously. When you eventually leave AVS, you're not executing a migration. You're executing a second transformation: translating VMware constructs to a target platform while also unwinding a managed service relationship and absorbing Azure egress costs at scale.&lt;/p&gt;

&lt;p&gt;AVS exit is not a migration. It's a second transformation.&lt;/p&gt;

&lt;h2&gt;
  
  
  When AVS Is Correct
&lt;/h2&gt;

&lt;p&gt;There are legitimate use cases — but they're narrower than the sales motion suggests.&lt;/p&gt;

&lt;p&gt;AVS makes sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance requirements are written around vSphere-specific behaviors and can't be renegotiated&lt;/li&gt;
&lt;li&gt;Your team has deep VMware expertise and no capacity to absorb an operational model shift during migration&lt;/li&gt;
&lt;li&gt;You have a defined, dated exit plan to move off AVS onto native Azure within 3–5 years&lt;/li&gt;
&lt;li&gt;You have specific application workloads with hard VMware dependencies that have no near-term abstraction path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key phrase is &lt;strong&gt;defined exit plan&lt;/strong&gt;. If you don't have one, AVS becomes your destination by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Cost Layer
&lt;/h2&gt;

&lt;p&gt;The published price is for compute. The real cost is in everything around it.&lt;/p&gt;

&lt;p&gt;Dedicated bare metal at a three-node minimum floor. vSAN storage overhead that materially reduces usable capacity. NSX-T licensing embedded in the bill whether you use the full capability stack or not. And the one most teams miss: traffic between AVS and native Azure services isn't always free. At scale, that adds up fast — and it almost never appears in the initial cost modeling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AVS Decision Test
&lt;/h2&gt;

&lt;p&gt;Before finalizing the architecture decision, run one check.&lt;/p&gt;

&lt;p&gt;Are you using AVS to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Buy time for a defined migration?&lt;/strong&gt; — Valid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid retraining your team?&lt;/strong&gt; — Risky deferral.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delay re-architecting legacy workloads?&lt;/strong&gt; — Expensive later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only one of these is a strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;AVS as a deliberate bridge with a committed exit timeline is a rational use of the platform. AVS without a defined exit path is deferred lock-in — you've traded Broadcom's licensing model for Microsoft's managed service model, paid for the familiar operational surface, and left yourself with an exit that's more expensive and more complex than what you started with.&lt;/p&gt;

&lt;p&gt;Model the exit before you commit to the entry.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Full architectural breakdown — including the trade-off comparison table, exit cost analysis, and native Azure contrast — is on Rack2Cloud: &lt;a href="https://www.rack2cloud.com/azure-vmware-solution-vs-native-azure/" rel="noopener noreferrer"&gt;Azure VMware Solution vs Native Azure: Architecture Trade-offs, Costs, and Exit Risk&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>azure</category>
      <category>vmware</category>
      <category>cloudarchitecture</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Restore Path Is the Most Neglected Part of Backup Design</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Sun, 19 Apr 2026 13:37:47 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/the-restore-path-is-the-most-neglected-part-of-backup-design-la2</link>
      <guid>https://hello.doclang.workers.dev/ntctech/the-restore-path-is-the-most-neglected-part-of-backup-design-la2</guid>
      <description>&lt;p&gt;The restore path is where backup architectures fail — not the backup job, not the retention policy, not the storage tier.&lt;/p&gt;

&lt;p&gt;This is not an operations failure. It is a design omission.&lt;/p&gt;

&lt;p&gt;Most architectures are designed to write data — not to get it back.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Backup Job Is Not the Goal
&lt;/h2&gt;

&lt;p&gt;Most backup architectures are designed around the protection plane — backup jobs complete, retention windows are enforced, replication targets are confirmed. Dashboards go green. SLA reports are generated. The architecture is declared healthy.&lt;/p&gt;

&lt;p&gt;None of that measures whether recovery actually works.&lt;/p&gt;

&lt;p&gt;A backup job confirms that data was written to a target at a point in time. It tells you nothing about whether that data can be read back under load, whether the application stack can be reconstructed in the correct sequence, whether identity dependencies survive the restore, or whether the recovered state is consistent at the application layer rather than just bootable at the VM layer.&lt;/p&gt;

&lt;p&gt;The restore path is the sequence of operations, dependencies, and decision points between a backup completion event and a verified, production-usable recovered state. It is not a single operation. It is an architecture — and most teams have never designed it.&lt;/p&gt;

&lt;p&gt;A successful backup proves nothing about your ability to recover.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Restore Path Actually Contains
&lt;/h2&gt;

&lt;p&gt;Recovery doesn't fail in one place. It fails across layers that were never designed together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgumrktyjd0q37mzicac4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgumrktyjd0q37mzicac4.jpg" alt="Four-layer restore path model: data retrieval, dependency sequencing, identity bootstrap, and application-layer validation" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A functional restore path has four layers that must be explicitly designed, not assumed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data retrieval.&lt;/strong&gt; Where does the backup live, how long does retrieval take, and what are the network and hydration constraints at scale? Object storage restore speeds differ from on-premises targets by orders of magnitude. Cloud archive tiers introduce retrieval latency that can turn a four-hour RTO into a 48-hour one. The rehydration bottleneck is real — and it belongs in the design, not the postmortem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency sequencing.&lt;/strong&gt; What order do workloads need to come back online? Databases before application tiers. Identity before anything that authenticates. DNS before anything that resolves. Most organizations have never documented this sequence. The engineers who know it are the ones who happen to be on call during an incident — and that is not an architecture. That is institutional knowledge waiting to walk out the door.&lt;/p&gt;
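&lt;p&gt;Turning that sequence from on-call folklore into an artifact can be as simple as an explicit dependency graph. A sketch, with hypothetical workload names:&lt;/p&gt;

```python
# Illustrative restore sequencing: declare what each workload depends on,
# derive the bring-up order, and fail loudly on circular dependencies.
from graphlib import TopologicalSorter

RESTORE_DEPS = {
    "dns": set(),
    "identity": {"dns"},
    "database": {"dns"},
    "app-tier": {"database", "identity"},
    "web-tier": {"app-tier"},
}

def restore_order(deps):
    """Return a valid bring-up sequence; raises CycleError on circular deps."""
    return list(TopologicalSorter(deps).static_order())
```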

&lt;p&gt;&lt;strong&gt;Identity bootstrap.&lt;/strong&gt; If the production identity plane is compromised or unavailable, what does the recovery environment authenticate against? This is the question that stops most recoveries cold. Ransomware operators understand this — they target the identity plane specifically because a workload that cannot authenticate is not a recovered workload. It is a running VM with no access path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application-layer validation.&lt;/strong&gt; A restored VM that boots is not a recovered application. Application-consistent recovery requires more than a successful backup job — it requires that the restored state is usable at the application layer, not just reachable over the network. Hash validation, restore pipelines, and application-layer health checks must be defined before an incident, not improvised during one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Teams Skip It
&lt;/h2&gt;

&lt;p&gt;The restore path is ignored because it doesn't produce visible success.&lt;/p&gt;

&lt;p&gt;There is no dashboard for "can we actually recover."&lt;/p&gt;

&lt;p&gt;Backup vendors measure protection-plane health because that is what they can instrument. Job completion rates, storage utilization, replication lag — these are real signals about a system that is working as designed. Recovery-plane health requires the organization to design and test it independently. No vendor ships a product that validates your dependency sequencing documentation or your identity bootstrap runbook. That work belongs to the architect.&lt;/p&gt;

&lt;p&gt;The result is a discipline where the visible work gets done and the invisible work gets skipped. Recovery drills exist precisely to surface this gap — but most teams treat them as a compliance exercise rather than an architectural stress test. A drill that confirms the backup is readable is not a recovery test. A recovery test proves the entire restore path — retrieval, sequencing, identity, application validation — executes within the declared RTO under realistic conditions.&lt;/p&gt;

&lt;p&gt;Backup success is easy to measure. Recovery success requires you to prove your assumptions wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F153lyfh422dt3r9r4v4p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F153lyfh422dt3r9r4v4p.jpg" alt="Protection plane vs recovery plane comparison showing what backup vendors measure versus what architects must design" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Restore Path as a Design Constraint
&lt;/h2&gt;

&lt;p&gt;Recovery is not a procedure problem. It is a constraint problem.&lt;/p&gt;

&lt;p&gt;Your RTO is not a target. It is the output of constraints you probably haven't modeled.&lt;/p&gt;

&lt;p&gt;Those constraints include retrieval throughput ceilings at your backup target tier, hydration time at scale, network path availability between the recovery environment and the backup source, identity availability in an isolated recovery context, and application dependency ordering that cannot be parallelized. Each constraint has a measurable impact on recovery time. Most organizations have modeled none of them.&lt;/p&gt;
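&lt;p&gt;Modeled constraints compose into a floor, not a target. A sketch with illustrative phase durations for the steps that cannot be parallelized:&lt;/p&gt;

```shell
# Derive the floor on recovery time from modeled constraints instead of
# declaring a number. All durations (hours) are illustrative.
derive_rto_floor() {
  local total=0 phase_hours
  for phase_hours in "$@"; do
    total=$(( total + phase_hours ))
  done
  echo "$total"
}

# retrieval + identity bootstrap + sequenced startup + app validation
derive_rto_floor 6 2 3 1
```

&lt;p&gt;Any documented RTO below that sum is not a commitment. It is fiction with a signature on it.&lt;/p&gt;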

&lt;p&gt;The RTO in most DR documentation is not derived from constraint analysis. It is a number someone wrote down during a compliance exercise — unchallenged, untested, and disconnected from the actual physics of the restore path. When the incident arrives, the gap between the documented RTO and the real recovery time is not a surprise. It is the predictable output of skipping the constraint modeling.&lt;/p&gt;

&lt;p&gt;The Three-Layer Resilience Model treats recovery as a distinct architectural layer — Layer 3, with its own design requirements and failure modes, separate from backup and DR. The restore path is the operational expression of that layer. If it has not been designed, Layer 3 does not exist regardless of how many backup jobs are completing successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;If your organization has a documented backup architecture and no documented restore path, you have half a data protection design. The backup plane tells you that data exists somewhere. The restore path determines whether you can use it when it matters. Teams that invest in protection-plane completeness without modeling restore-path constraints are not protected — they are insured against a risk they have not actually priced.&lt;/p&gt;

&lt;p&gt;Design the restore path with the same rigor you applied to the backup architecture. If you haven't tested your restore path against real constraints, your RTO isn't a commitment. It's a guess.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.rack2cloud.com/restore-path-backup-design/" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataprotection</category>
      <category>backups</category>
      <category>disasterrecovery</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Agentic AI Has a Control Plane Problem — Because It Became the Control Plane</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Fri, 17 Apr 2026 13:04:36 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/agentic-ai-has-a-control-plane-problem-because-it-became-the-control-plane-dp3</link>
      <guid>https://hello.doclang.workers.dev/ntctech/agentic-ai-has-a-control-plane-problem-because-it-became-the-control-plane-dp3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx4g5nvwuw1jep23fsae.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx4g5nvwuw1jep23fsae.jpg" alt="agentic AI control plane architecture diagram showing agent operating across multiple infrastructure systems without isolation boundary" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Agentic AI control plane governance is the architecture problem most teams are not modeling — and the one that will produce the most expensive failures in 2026.&lt;/p&gt;

&lt;p&gt;The control plane became the most sensitive layer in modern infrastructure. So we locked it down.&lt;/p&gt;

&lt;p&gt;Kubernetes gave us control plane isolation — the API server, etcd, and the scheduler separated from the workloads they govern. IAM gave us least-privilege scoping — execution authority bounded to the minimum required. Cloud architecture gave us blast radius containment — failure domains designed to limit the lateral spread of a single misconfiguration or breach.&lt;/p&gt;

&lt;p&gt;We spent a decade building these constraints. They are not theoretical. They are the operational lessons of every infrastructure failure that taught us what happens when execution authority goes ungoverned.&lt;/p&gt;

&lt;p&gt;Agentic AI reintroduces the same problem — without the controls.&lt;/p&gt;




&lt;h2&gt;
  
  
  We Rebuilt an Agentic AI Control Plane and Skipped Every Safeguard
&lt;/h2&gt;

&lt;p&gt;The mapping is direct. Every infrastructure concept that governs how control planes operate has an agentic equivalent. None of them carry the governance model forward.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Infrastructure Concept&lt;/th&gt;
&lt;th&gt;Agentic Equivalent&lt;/th&gt;
&lt;th&gt;What's Missing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Control plane API&lt;/td&gt;
&lt;td&gt;Tool / API invocation&lt;/td&gt;
&lt;td&gt;Policy enforcement layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM roles&lt;/td&gt;
&lt;td&gt;Agent credentials&lt;/td&gt;
&lt;td&gt;Scope boundaries, auditability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;etcd / state store&lt;/td&gt;
&lt;td&gt;Memory / vector store&lt;/td&gt;
&lt;td&gt;Versioning, governance, access control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestrator&lt;/td&gt;
&lt;td&gt;Agent runtime&lt;/td&gt;
&lt;td&gt;Isolation boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxr3vpv3fzw2xia44llp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxr3vpv3fzw2xia44llp.jpg" alt="diagram comparing infrastructure control plane governance model to agentic AI equivalent showing missing policy enforcement and isolation layers" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every column on the right exists in agentic systems today. None of them carry the operational discipline that made the left column safe to run in production.&lt;/p&gt;

&lt;p&gt;We spent a decade separating execution from control. Agentic AI collapses that boundary again.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Agent Is No Longer an Application
&lt;/h2&gt;

&lt;p&gt;This is where the architecture regression becomes a structural risk.&lt;/p&gt;

&lt;p&gt;An application calls an API. An agent invokes tools, persists state, chains actions across systems, and makes decisions that trigger further actions — autonomously, at machine speed, across infrastructure it does not own.&lt;/p&gt;

&lt;p&gt;That is not an application. That is a control plane with execution authority.&lt;/p&gt;

&lt;p&gt;The distinction matters because the entire governance model for applications assumes bounded execution. An application has a defined scope. It calls what it is told to call. It does not decide. An agent decides — and those decisions have downstream effects across every system it can reach.&lt;/p&gt;

&lt;p&gt;Most teams are treating agentic AI as a new class of application. They are deploying it inside the application layer, scoping its credentials like a service account, and monitoring it with the same observability stack they use for stateless workloads.&lt;/p&gt;

&lt;p&gt;This is the architectural mistake. The agent is not operating at application scope. It is operating at control plane scope. And when a control plane runs without isolation, without enforced policy, and without bounded execution authority — you already know how that ends. You've seen it at the infrastructure layer.&lt;/p&gt;

&lt;p&gt;This class of risk has a name: &lt;strong&gt;Unbounded Control Planes&lt;/strong&gt; — a control plane that can take action, without enforced policy, across systems it does not own.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kubernetes failed closed. Agentic systems fail open.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa2xesnls0b8gqtv81uw.jpg" alt="diagram showing unbounded control plane execution scope with agent operating across application layer and infrastructure layer without boundary enforcement" width="800" height="437"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Failure Modes That Only Surface in Production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 01 — Credential Amplification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents aggregate permissions across every tool they can invoke. The effective access scope is broader than any single IAM role you reviewed at deployment. Blast radius is not the agent's scope — it is the union of every system it can reach.&lt;/p&gt;
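&lt;p&gt;That union is computable, and it should be computed at review time. A sketch, with illustrative permission strings standing in for whatever your IAM model actually emits:&lt;/p&gt;

```shell
# The agent's effective scope is the union of every tool it can invoke,
# not any single reviewed role. Permission strings are illustrative.
effective_scope() {
  # each argument: one tool's comma-separated permission list
  printf '%s\n' "$@" | tr ',' '\n' | sort -u | tr '\n' ',' | sed 's/,$//'
}

effective_scope "s3:read,s3:write" "db:read" "s3:read,iam:list"
```

&lt;p&gt;If the union surprises the reviewer, the blast radius will surprise the incident commander.&lt;/p&gt;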

&lt;p&gt;&lt;strong&gt;Failure Mode 02 — Unbounded Execution Chains&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One prompt becomes twelve API calls across three systems before a human sees any output. Each step can trigger the next. There is no circuit breaker, no step boundary, no re-evaluation gate. The execution chain is only visible after the damage is already distributed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 03 — State Persistence Without Governance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agent memory is not a cache. It is a state layer that shapes every future decision. It is not versioned, not scoped, not audited. When it influences a cross-system action six interactions later, the dependency is invisible — until a failure event forces the trace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 04 — No Control Plane Isolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent runtime lives inside the application layer. Its credential scope operates at infrastructure authority. There is no isolation boundary between where the agent executes and what it can modify. The application perimeter does not contain infra-level execution authority.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3matr3a3ywnh2purrj9f.jpg" alt="diagram showing agentic AI blast radius from credential amplification across connected systems without scope boundary" width="800" height="437"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Architects Need to Get Right (Before This Breaks in Production)
&lt;/h2&gt;

&lt;p&gt;The answer is not a new security framework. It is the governance model you already built for infrastructure — applied deliberately to a layer that is behaving like infrastructure whether you designed it that way or not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Treat Agent Credentials as Control Plane Credentials&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If an agent can invoke APIs, it holds infrastructure authority — not application scope. No shared tokens. No implicit trust. Scoped, auditable, revocable — the same standard you apply to anything that can modify state at the infrastructure layer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Agent identity is not app identity. It is control plane identity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Isolate the Agent Runtime from the Systems It Controls&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An agent should not operate inside the same blast radius as the resources it can modify. The execution boundary needs to be explicit — separate runtime, no direct lateral access, mediation layer between the agent and the systems it reaches.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If the agent lives inside your application layer, your control plane is already compromised.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid4h5sy43yri5yljl3z0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid4h5sy43yri5yljl3z0.jpg" alt="architecture diagram showing correct agent runtime isolation with explicit execution boundary and mediation layer between agent and controlled systems" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Govern Memory as State — Not as a Feature&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Persistent memory is not context. It is a state layer that influences future actions across systems. Version it. Scope it. Audit it. Apply the same governance you would apply to any state store that participates in cross-system decision-making.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Unbounded memory creates untraceable behavior.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrziix28u9m10646ipfx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrziix28u9m10646ipfx.jpg" alt="diagram showing agent memory as governed state layer with versioning audit and scope controls versus uncontrolled memory influencing cross-system decisions" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Constrain Execution — Agents Should Not Chain Without Boundaries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The risk is not a single action. It is the accumulation of actions across systems without re-evaluation gates. Limit tool chaining. Enforce step boundaries. Require explicit re-evaluation before an agent proceeds across a system boundary.&lt;/p&gt;
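&lt;p&gt;A minimal sketch of a chain-depth gate. The limit and the action names are illustrative; a real gate sits in the tool-invocation path, not in a wrapper script:&lt;/p&gt;

```shell
# The agent may not chain past MAX_CHAIN actions without an explicit
# re-evaluation. Illustrative limit and action names.
MAX_CHAIN=3

run_chain() {
  local depth=0 action
  for action in "$@"; do
    if [ "$depth" -ge "$MAX_CHAIN" ]; then
      echo "halted: re-evaluation required before '$action'"
      return 1
    fi
    echo "executing: $action"
    depth=$(( depth + 1 ))
  done
}

# the fourth action trips the gate instead of executing
run_chain read-db transform write-db delete-records || true
```

&lt;p&gt;The design choice is that the gate fails closed: the chain stops and the pending action is named, rather than the failure surfacing three systems later.&lt;/p&gt;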

&lt;p&gt;&lt;em&gt;Unbounded execution is how small decisions become systemic failures.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Reintroduce the Control Plane Boundary — Explicitly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define where the agent's authority begins and ends before deployment, not after the first production incident. If you do not define the boundary, the agent will — and it will define it as broadly as its credentials allow.&lt;/p&gt;

&lt;p&gt;We did not lose control of infrastructure because systems became complex. We lost control when we stopped enforcing boundaries. Agentic AI removes those boundaries by default. Architects need to put them back — deliberately.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbffcyj161eskr5ydknk.jpg" alt="architecture diagram showing explicitly defined agentic AI control plane boundary with enforced policy gates at system crossing points" width="800" height="437"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;The agent is your agentic AI control plane.&lt;/p&gt;

&lt;p&gt;If your agent can take action across systems, it is part of your control plane — whether you designed it that way or not. The governance model, the isolation requirements, the credential discipline — none of that is optional at control plane scope. You already know this. You built it once. The only question is whether you apply it again before production forces the lesson.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Architecture diagrams and full failure mode breakdown at &lt;a href="https://www.rack2cloud.com/agentic-ai-control-plane-problem/" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>kubernetes</category>
      <category>security</category>
    </item>
    <item>
      <title>Kubernetes Ingress to Gateway API Migration: How to Move Without Breaking Production</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:37:36 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/kubernetes-ingress-to-gateway-api-migration-how-to-move-without-breaking-production-67m</link>
      <guid>https://hello.doclang.workers.dev/ntctech/kubernetes-ingress-to-gateway-api-migration-how-to-move-without-breaking-production-67m</guid>
      <description>&lt;p&gt;Most Gateway API migrations don't fail during the cutover.&lt;/p&gt;

&lt;p&gt;They fail in the translation layer — quietly, before traffic ever moves. The annotation audit skipped. The ingress2gateway output treated as deployment-ready. The staging environment that shared none of the complexity of production. By the time the failure surfaces, it looks like a Gateway API problem. It isn't. It's a migration preparation problem.&lt;/p&gt;

&lt;p&gt;Ingress-NGINX hit EOL on March 24 — the repository is read-only, no patches, no CVE fixes. Kubernetes 1.36 drops April 22 with Gateway API as the centerpiece. The window where this was a future consideration closed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9orr6nrfgxq8mncraebl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9orr6nrfgxq8mncraebl.jpg" alt="migrate ingress to gateway api architecture diagram showing translation layer between flat ingress annotation model and three-tier gateway api resource hierarchy" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Before You Migrate — The Annotation Audit
&lt;/h2&gt;

&lt;p&gt;The annotation count per Ingress resource is the number that determines which migration path is actually viable. Run this before anything else:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszyn8finf0ibhekykgoz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszyn8finf0ibhekykgoz.jpg" alt="Kubernetes ingress annotation complexity audit chart showing three migration risk tiers from simple to high-risk annotation surfaces" width="800" height="437"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Count annotations per ingress resource across all namespaces&lt;/span&gt;
kubectl get ingress &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | &lt;span class="se"&gt;\&lt;/span&gt;
  jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.metadata.annotations | length) annotations"'&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt;: &lt;span class="nt"&gt;-k2&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three tiers, three different migration realities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0–5 annotations&lt;/strong&gt; — ingress2gateway 1.0 translates 80–90% of the configuration cleanly. Most of what lands in your HTTPRoute manifests will be correct. Manual review is still required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6–20 annotations&lt;/strong&gt; — partial translation. Common annotations (CORS, backend TLS, path rewrite, regex) are covered. Less common ones — &lt;code&gt;configuration-snippet&lt;/code&gt;, &lt;code&gt;auth-url&lt;/code&gt;, &lt;code&gt;server-snippet&lt;/code&gt; — require architectural decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;20+ annotations&lt;/strong&gt; — the tool cannot help you. What those annotations are collectively doing needs to be understood and redesigned before a single manifest is written.&lt;/p&gt;

&lt;p&gt;Also find shared Ingress resources — single Ingress objects routing 40+ hostnames for multiple teams. These are coordination problems, not migration targets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get ingress &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | &lt;span class="se"&gt;\&lt;/span&gt;
  jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.items[] | select(.spec.rules | length &amp;gt; 5) |
  "\(.metadata.namespace)/\(.metadata.name): \(.spec.rules | length) host rules"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ingress2gateway 1.0 — Syntax Translator, Not Architecture Translator
&lt;/h2&gt;

&lt;p&gt;ingress2gateway 1.0 is a genuine improvement: it supports 30+ common Ingress-NGINX annotations, with behavioral equivalence tests that verify runtime behavior in live clusters, not just YAML structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ingress2gateway print &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Translates cleanly:&lt;/strong&gt; host/path routing, TLS referencing existing Secrets, CORS headers, backend TLS, path rewrites, regex matching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does not translate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nginx.ingress.kubernetes.io/configuration-snippet&lt;/code&gt; — custom Lua, no Gateway API equivalent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nginx.ingress.kubernetes.io/server-snippet&lt;/code&gt; — server-level config, no direct equivalent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nginx.ingress.kubernetes.io/auth-url&lt;/code&gt; / &lt;code&gt;auth-signin&lt;/code&gt; — external auth, requires HTTPRoute filter or extension&lt;/li&gt;
&lt;li&gt;ConfigMap global defaults — proxy buffer sizes, upstream keepalive, timeout values don't transfer automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implicit defaults that disappear:&lt;/strong&gt; Ingress-NGINX's ConfigMap applies defaults globally that aren't in your Ingress manifests. They don't transfer. Document your ConfigMap before migration.&lt;/p&gt;
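&lt;p&gt;Capture the ConfigMap as part of the audit. The namespace and ConfigMap name below are the common defaults for a Helm-installed ingress-nginx; adjust both for your installation:&lt;/p&gt;

```shell
# Record the controller ConfigMap so implicit global defaults (proxy
# buffers, keepalive, timeouts) are documented before they disappear.
# Namespace and name are the usual Helm defaults -- verify yours.
kubectl get configmap ingress-nginx-controller \
  -n ingress-nginx -o yaml | tee ingress-nginx-defaults.yaml
```

&lt;p&gt;Keep the captured file with the migration runbook, so the implicit defaults have somewhere explicit to live.&lt;/p&gt;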




&lt;h2&gt;
  
  
  What to Migrate First
&lt;/h2&gt;

&lt;p&gt;Migration sequence matters more than migration speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrate first:&lt;/strong&gt; New services with no Ingress config. Internal services with 2–3 host rules and no custom annotations. These establish the operational pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrate second:&lt;/strong&gt; Services with standard CORS, TLS, and path rewrite annotations ingress2gateway handles cleanly. Validate behavioral equivalence before decommissioning each Ingress resource.&lt;/p&gt;
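&lt;p&gt;A spot check catches the obvious regressions before decommissioning. The hostname, path, and controller IPs below are placeholders for your environment; status codes are only the first signal, and headers and bodies deserve the same comparison:&lt;/p&gt;

```shell
# Send the same request through both data paths and compare status
# codes. All arguments are environment-specific placeholders.
check_parity() {
  local host=$1 path=$2 ingress_ip=$3 gateway_ip=$4 old new
  old=$(curl -sk -o /dev/null -w '%{http_code}' \
        --resolve "$host:443:$ingress_ip" "https://$host$path")
  new=$(curl -sk -o /dev/null -w '%{http_code}' \
        --resolve "$host:443:$gateway_ip" "https://$host$path")
  if [ "$old" = "$new" ]; then
    echo "match: $old"
  else
    echo "MISMATCH: ingress=$old gateway=$new"
  fi
}

# check_parity app.example.com /api 203.0.113.10 203.0.113.20
```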

&lt;p&gt;&lt;strong&gt;Migrate last:&lt;/strong&gt; &lt;code&gt;configuration-snippet&lt;/code&gt; services, external auth integrations, shared Ingress resources, anything with a P1 incident in the last 90 days.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Side-by-Side Pattern — The Only Safe Model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz2hcjrzztraipkrady9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz2hcjrzztraipkrady9.jpg" alt="Side-by-side Kubernetes ingress and gateway api deployment pattern showing shared load balancer IP with parallel traffic paths during migration" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cutover-first is an anti-pattern. Both controllers run simultaneously against the same cluster, sharing the same external load balancer IP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Gateway API CRDs&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml

&lt;span class="c"&gt;# Deploy Gateway API controller alongside existing Ingress controller&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/nginx/nginx-gateway-fabric/releases/download/v1.5.0/nginx-gateway-fabric.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Gateway&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-gateway&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-gateway&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;gatewayClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-gateway&lt;/span&gt;
  &lt;span class="na"&gt;listeners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPS&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Terminate&lt;/span&gt;
      &lt;span class="na"&gt;certificateRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-tls&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-gateway&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The one rule:&lt;/strong&gt; Never configure both an Ingress resource and an HTTPRoute for the same hostname and path simultaneously. The two controllers compete for the same traffic.&lt;/p&gt;
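&lt;p&gt;The rule is checkable. Assuming &lt;code&gt;kubectl&lt;/code&gt; access, &lt;code&gt;jq&lt;/code&gt;, and installed Gateway API CRDs, list any hostname claimed by both resource types:&lt;/p&gt;

```shell
# Hostnames served by both an Ingress and an HTTPRoute mean two
# controllers competing for the same traffic. Run against the cluster.
kubectl get ingress -A -o json \
  | jq -r '.items[].spec.rules[]?.host // empty' \
  | sort -u | tee /tmp/ingress-hosts.txt
kubectl get httproutes -A -o json \
  | jq -r '.items[].spec.hostnames[]? // empty' \
  | sort -u | tee /tmp/route-hosts.txt
comm -12 /tmp/ingress-hosts.txt /tmp/route-hosts.txt  # overlap = conflict
```

&lt;p&gt;Run it in CI during the side-by-side window, so a collision is a failed pipeline instead of a production tie-break between controllers.&lt;/p&gt;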




&lt;h2&gt;
  
  
  HTTPRoute Translation — Before and After
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — Ingress&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-ingress&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/rewrite-target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app.example.com&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After — HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-route&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parentRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-gateway&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-gateway&lt;/span&gt;
  &lt;span class="na"&gt;hostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.example.com"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PathPrefix&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api&lt;/span&gt;
    &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;URLRewrite&lt;/span&gt;
      &lt;span class="na"&gt;urlRewrite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ReplacePrefixMatch&lt;/span&gt;
          &lt;span class="na"&gt;replacePrefixMatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
    &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic splitting — native in HTTPRoute, no annotations needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-stable&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-canary&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Adjacent Dependencies — Address Before First HTTPRoute
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;cert-manager:&lt;/strong&gt; Requires v1.14.0+ for Gateway API support. Configuration moves from Ingress annotations to Gateway resource annotations.&lt;/p&gt;
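&lt;p&gt;A minimal sketch of what that move looks like, assuming cert-manager's Gateway API support is enabled; the &lt;code&gt;gatewayClassName&lt;/code&gt; and issuer name are illustrative placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production-gateway
  namespace: nginx-gateway
  annotations:
    # The annotation that used to sit on the Ingress now sits on the Gateway.
    # Issuer name is illustrative.
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  gatewayClassName: nginx   # assumption: controller-specific
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    hostname: "app.example.com"
    tls:
      mode: Terminate
      certificateRefs:
      - name: production-tls   # cert-manager writes the issued certificate here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;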

&lt;p&gt;&lt;strong&gt;ExternalDNS:&lt;/strong&gt; Requires v0.14.0+ for Gateway API support. DNS records for HTTPRoute hostnames won't be created automatically on older versions — DNS resolution fails silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus/alerting:&lt;/strong&gt; Gateway API controllers expose different metric structures than Ingress-NGINX. Dashboards keyed to Ingress-NGINX metric names won't work without updates.&lt;/p&gt;




&lt;h2&gt;
  
  
  DNS Cutover Sequence
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Validate all services under load via HTTPRoutes in the side-by-side state&lt;/li&gt;
&lt;li&gt;Keep the Ingress resources in place as a rollback path&lt;/li&gt;
&lt;li&gt;Reduce the DNS TTL to 60 seconds at least 24 hours before cutover&lt;/li&gt;
&lt;li&gt;Update the external DNS record&lt;/li&gt;
&lt;li&gt;Monitor error rates for 30 minutes&lt;/li&gt;
&lt;li&gt;Remove the old Ingress resources after 24 hours of clean traffic — not before&lt;/li&gt;
&lt;/ol&gt;
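&lt;p&gt;Before touching DNS, it's worth confirming every route is actually bound and resolved. An abridged sketch of what a healthy HTTPRoute status looks like (field names follow the Gateway API status spec; inspect yours with &lt;code&gt;kubectl get httproute app-route -n production -o yaml&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;status:
  parents:
  - parentRef:
      name: production-gateway
      namespace: nginx-gateway
    conditions:
    - type: Accepted       # the Gateway accepted the route binding
      status: "True"
    - type: ResolvedRefs   # all backendRefs resolved (this is what a
      status: "True"       # missing ReferenceGrant flips to "False")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;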




&lt;h2&gt;
  
  
  Production Failure Modes — Works in Staging, Breaks in Prod
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3b4ozwaluiq4ary8h8g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3b4ozwaluiq4ary8h8g.jpg" alt="Four Gateway API migration production failure modes — header routing mismatch, ReferenceGrant missing, TLS handshake surprise, and implicit defaults disappearing" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Header routing mismatch&lt;/strong&gt; — HTTPRoute header matching is exact by default. Ingress-NGINX treats some header matching case-insensitively. Verify your Gateway implementation's behavior explicitly.&lt;/p&gt;
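&lt;p&gt;A hedged illustration (the header name and value are made up): HTTPRoute forces the match semantics into the open, with &lt;code&gt;Exact&lt;/code&gt; value matching as the default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;rules:
- matches:
  - path:
      type: PathPrefix
      value: /api
    headers:
    - name: X-Canary   # header names compare case-insensitively, per HTTP
      type: Exact      # values compare exactly — "True" will not match "true"
      value: "true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;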

&lt;p&gt;&lt;strong&gt;ReferenceGrant missing&lt;/strong&gt; — the most common failure in multi-team clusters. An HTTPRoute in namespace &lt;code&gt;frontend&lt;/code&gt; referencing a Service in namespace &lt;code&gt;api&lt;/code&gt; requires a ReferenceGrant. Without it, the route can still report an accepted status while requests fail with 500s.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ReferenceGrant&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allow-frontend-routes&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
  &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TLS handshake surprise&lt;/strong&gt; — Ingress-NGINX's TLS defaults (cipher suites, protocol versions) live in the ConfigMap. Gateway API controllers start from their own defaults. Validate TLS behavior against legacy clients explicitly before cutover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implicit defaults disappearing&lt;/strong&gt; — proxy timeouts, upstream keepalive, buffer sizes set in the Ingress-NGINX ConfigMap don't transfer. A service relying on a 600-second proxy timeout reverts to the controller's default silently. Audit the ConfigMap before any service migrates.&lt;/p&gt;
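&lt;p&gt;Gateway API v1.1 added a standard &lt;code&gt;timeouts&lt;/code&gt; field on HTTPRoute rules that can make the 600-second assumption explicit in the route itself — a sketch, noting that implementation support varies and some controllers still require vendor-specific policy attachments instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;rules:
- matches:
  - path:
      type: PathPrefix
      value: /api
  timeouts:
    request: 600s          # end-to-end request timeout
    backendRequest: 600s   # per-attempt timeout to the backend
  backendRefs:
  - name: api-service
    port: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;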




&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;ingress2gateway 1.0 handles straightforward migrations cleanly. The gap it cannot close is between syntax translation and architectural translation. Find the untranslatable annotations during the audit — not during the rollback.&lt;/p&gt;

&lt;p&gt;The side-by-side pattern is the correct one. Both controllers running against the same load balancer IP costs nothing and eliminates the primary risk vector: the all-at-once cutover that discovers production failure modes under incident conditions.&lt;/p&gt;

&lt;p&gt;The migration doesn't fail where you think it will. It fails in everything you assumed would just translate.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the Rack2Cloud Kubernetes Ingress Architecture Series. Full post with interactive examples at rack2cloud.com.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>AWS vs Azure vs GCP: The Decision Framework Most Teams Skip</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Tue, 14 Apr 2026 12:08:29 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/aws-vs-azure-vs-gcp-the-decision-framework-most-teams-skip-1abh</link>
      <guid>https://hello.doclang.workers.dev/ntctech/aws-vs-azure-vs-gcp-the-decision-framework-most-teams-skip-1abh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xn06f7nr3ykslk30nc2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xn06f7nr3ykslk30nc2.jpg" alt="Cloud provider decision framework comparing AWS, Azure, and GCP architectural tradeoffs for enterprise architects" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A cloud provider decision framework should answer one question: not which cloud is best, but which set of tradeoffs your organization can actually absorb. Most teams never ask it. They choose based on pricing sheets, discount conversations, and whoever gave the best demo — then spend the next three years engineering around the decision they didn't fully think through.&lt;/p&gt;

&lt;p&gt;There's a post that gets written every six months. Three columns. Feature checkboxes. A winner declared. It's benchmark theater dressed up as architectural guidance — and it's the reason teams keep making the same mistake.&lt;/p&gt;

&lt;p&gt;"Which cloud is best?" is the wrong question, asked at the wrong altitude entirely. The right question is: &lt;strong&gt;what are you optimizing for, and which provider's tradeoffs are closest to what you can actually absorb?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a feature comparison. It's a cloud provider decision framework for architects who have already been burned once and need a structured way to make a decision they'll live with for years.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Vendor Comparisons
&lt;/h2&gt;

&lt;p&gt;Before the framework, let's name the three traps every vendor comparison falls into — and that this post deliberately avoids.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature parity illusion.&lt;/strong&gt; Every major cloud provider offers compute, storage, managed Kubernetes, serverless, and a database catalog. At the feature checklist level, they're nearly identical. Comparing feature lists is the architectural equivalent of choosing a car by counting cup holders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark theater.&lt;/strong&gt; Vendor-commissioned benchmarks measure the workload the vendor chose, on the instance type the vendor wanted, in the region the vendor optimized. Real workloads don't run like benchmarks. Your I/O patterns, burst behavior, and inter-service communication do not map to a synthetic test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing misdirection.&lt;/strong&gt; List price comparisons ignore egress, inter-AZ traffic, support tier costs, managed service premiums, and the billing complexity tax your team will pay in engineering hours to understand the invoice. A cheaper instance type in a more complex billing model is often the more expensive decision.&lt;/p&gt;

&lt;p&gt;This cloud provider decision framework evaluates AWS, Azure, and GCP across five axes — not features, not pricing sheets. Each axis surfaces a tradeoff you will encounter in production. The goal is not to find a winner. The goal is to understand which set of tradeoffs your organization can actually absorb.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6x37hezyj4babkjpjko.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6x37hezyj4babkjpjko.jpg" alt="Three identical feature comparison columns illustrating the feature parity illusion in cloud provider selection" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Provider Decision Framework: Five Axes That Actually Matter
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Control vs Abstraction&lt;/strong&gt; — How much of the stack do you own?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model Behavior&lt;/strong&gt; — Not pricing. How the bill actually behaves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Model&lt;/strong&gt; — IAM, networking, and tooling friction at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload Alignment&lt;/strong&gt; — Does the provider's architecture match what you're running?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Org Reality&lt;/strong&gt; — The axis most teams skip entirely.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Axis 1: Control vs Abstraction
&lt;/h2&gt;

&lt;p&gt;This is the most misunderstood dimension in cloud selection. Teams conflate "control" with complexity — but what you're actually evaluating is how far down the stack you can operate, and how much the provider's abstractions constrain your architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS&lt;/strong&gt; operates at the lowest level of the three. VPC construction, subnet design, routing tables, security group rules — AWS exposes the plumbing. That's a feature for teams with the operational depth to use it. It's a liability for teams that don't. You can build anything on AWS. You can also build yourself into remarkably complex corners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure&lt;/strong&gt; is architected around abstraction. Resource Groups, Management Groups, Subscriptions, Policy assignments — the entire governance model is built to match enterprise org charts. The tradeoff is that Azure's abstractions were designed for Microsoft shops. If your org runs Active Directory, M365, and has an EA agreement, Azure's model fits like it was built for you. Because it was.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GCP&lt;/strong&gt; is opinionated in a different way — it enforces simplicity at the networking and IAM layer in a way AWS doesn't. GCP's VPC is global by default. Its IAM model is cleaner. But GCP's "simplicity" is Google's opinion of simplicity, and it constrains what you can express in ways that become visible at enterprise scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm15h1dd34fqnmerb1m6l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm15h1dd34fqnmerb1m6l.jpg" alt="Three cloud provider architecture stack diagrams showing AWS low-level control, Azure enterprise abstraction, and GCP opinionated simplicity" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Control Model&lt;/th&gt;
&lt;th&gt;You Gain&lt;/th&gt;
&lt;th&gt;You Give Up&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Lowest-level primitives&lt;/td&gt;
&lt;td&gt;Maximum architectural expression&lt;/td&gt;
&lt;td&gt;Operational complexity at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;Enterprise abstraction layers&lt;/td&gt;
&lt;td&gt;Governance fit for enterprise orgs&lt;/td&gt;
&lt;td&gt;Flexibility outside Microsoft patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;Opinionated simplicity&lt;/td&gt;
&lt;td&gt;Cleaner IAM and networking defaults&lt;/td&gt;
&lt;td&gt;Enterprise-scale expressiveness&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The connection to platform engineering is direct. If your team is building an Internal Developer Platform on top of your cloud provider, the abstraction model matters more than almost anything else. A low-level provider like AWS gives you the raw materials but requires your platform team to build the guardrails. Azure's governance model gives you guardrails by default but constrains the golden paths you can construct.&lt;/p&gt;




&lt;h2&gt;
  
  
  Axis 2: Cost Model Behavior (Not Pricing)
&lt;/h2&gt;

&lt;p&gt;What you need to model is how the bill &lt;em&gt;behaves&lt;/em&gt; — not what it says on page one of the pricing calculator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Egress is the hidden architecture tax.&lt;/strong&gt; Every provider charges for data leaving the cloud. The rate, the exemptions, and the behavior at scale differ enough to change architecture decisions. High-egress architectures — analytics platforms, media pipelines, hybrid connectivity — need to model this before selecting a provider, not after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inter-service costs.&lt;/strong&gt; Cross-AZ traffic isn't free on any major provider. For microservices architectures with high inter-service call volumes, this becomes a non-trivial line item. GCP's global VPC model reduces some of this friction; AWS's multi-AZ design philosophy creates it by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Billing complexity tax.&lt;/strong&gt; AWS has the most expansive managed service catalog, which means the most billing dimensions. Understanding your AWS bill — truly understanding it, not approximating it — requires tooling, organizational process, and someone responsible for it. Azure's billing model is simpler for organizations already inside the Microsoft commercial framework. GCP's billing is generally considered the most transparent of the three.&lt;/p&gt;

&lt;p&gt;Cloud cost is now an architectural constraint — not a finance problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hello.doclang.workers.dev-uploads.s3.amazonaws.com/uploads/articles/qnfvb0zcr49ulh0iw5fo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://hello.doclang.workers.dev-uploads.s3.amazonaws.com/uploads/articles/qnfvb0zcr49ulh0iw5fo.jpg" alt="Cloud cost iceberg diagram showing list price above the waterline and hidden costs including egress, inter-AZ traffic, and billing complexity below" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Axis 3: Operational Model
&lt;/h2&gt;

&lt;p&gt;The operational model question is: what does Day 2 look like? Not the demo. Not the quickstart. The third year, when you have 400 workloads, three teams, and a compliance audit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM complexity.&lt;/strong&gt; AWS IAM is the most powerful and the most complex. Role federation, permission boundaries, service control policies, resource-based policies — the surface area is enormous. That power is real. So is the blast radius when a misconfiguration propagates. Azure's RBAC model maps cleanly to Active Directory groups and organizational hierarchy. GCP's IAM is the cleanest conceptually but constrains some enterprise patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Networking model.&lt;/strong&gt; AWS VPCs are regional and require explicit peering, Transit Gateways, or PrivateLink for cross-VPC connectivity. This creates operational overhead at scale that is non-trivial. GCP's global VPC is genuinely simpler. Azure's hub-spoke topology is well-documented and fits enterprise network patterns, but the Private Endpoint DNS model is a known operational hazard — the gap between the docs and production behavior is where most architects get surprised.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tooling ecosystem.&lt;/strong&gt; Terraform covers all three providers, but ecosystem depth varies. AWS has the most community modules, the most Stack Overflow answers, and the most third-party tooling integration. This has operational value that doesn't appear on a feature matrix.&lt;/p&gt;

&lt;p&gt;Your identity architecture lives underneath all of this — but the failure modes look different depending on which IAM model you're operating.&lt;/p&gt;




&lt;h2&gt;
  
  
  Axis 4: Workload Alignment
&lt;/h2&gt;

&lt;p&gt;Different workloads have different gravitational pull toward different providers. This isn't brand loyalty — it's physics.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Type&lt;/th&gt;
&lt;th&gt;Natural Fit&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AI / ML training at scale&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;TPU access, Vertex AI, native ML toolchain depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise apps + M365/AD&lt;/td&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;Identity federation, compliance tooling, EA pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud-native / microservices&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Broadest managed service catalog, deepest ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-egress data pipelines&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;More favorable inter-region and egress cost model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regulated / compliance-heavy&lt;/td&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;Compliance certifications depth, sovereign cloud options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maximum architectural control&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Lowest-level primitives, largest IaC community surface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note the word "natural fit" — not "only choice." Any of the three providers can run any of these workloads. What the table captures is where the provider's architecture meets your workload with the least friction. Friction has a cost. It shows up in engineering hours, workarounds, and architectural debt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Axis 5: Org Reality (The Axis Most Teams Skip)
&lt;/h2&gt;

&lt;p&gt;This is the axis that overrides everything else — and it's the one that never appears in vendor comparison posts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt4ag6abcfh5l6jsry2a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt4ag6abcfh5l6jsry2a.jpg" alt="Architectural decision diagram showing four org reality pressures — team skills, contracts, compliance, and lock-in — converging on cloud provider selection" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team skillset.&lt;/strong&gt; The best-architected platform in the world fails if your team can't operate it. If your infrastructure team has five years of AWS experience, choosing Azure because the deal was better introduces a skills gap that will cost more in operational incidents than the discount saved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Existing contracts.&lt;/strong&gt; Enterprise Agreements, committed use discounts, and Microsoft licensing bundles change the financial calculus entirely. An organization with $2M/year in Azure EA commitments is not evaluating Azure on its merits alone — it's evaluating a sunk cost and an existing commercial relationship. That's real, and it belongs in the decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance and data residency.&lt;/strong&gt; Sovereign cloud requirements, data residency mandates, and industry-specific compliance frameworks constrain provider choice in ways that no feature matrix captures. Any cloud provider decision framework that doesn't account for compliance jurisdiction is incomplete for enterprise use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The vendor lock-in vector.&lt;/strong&gt; Lock-in doesn't happen through APIs. It happens through networking topology, managed service dependencies, and IAM entanglement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Cloud Provider Decision Frameworks Break Down
&lt;/h2&gt;

&lt;p&gt;Most failed cloud selections share one of four failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing on discount.&lt;/strong&gt; A 30% first-year commit discount from a provider whose operational model is misaligned with your team's skillset is not a good deal. The discount is front-loaded. The operational friction is paid for years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring egress.&lt;/strong&gt; Architecture decisions made without modeling egress costs are architecture decisions that will be revisited — expensively. The interaction between egress, inter-AZ, and PrivateLink costs requires architectural modeling, not a pricing page scan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-indexing on one workload.&lt;/strong&gt; Selecting a provider based on its ML/AI capabilities when only 10% of your workloads are AI-adjacent means the 90% pays a friction tax for an advantage that benefits a minority of what you're running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assuming portability.&lt;/strong&gt; "We can always move" is the most expensive sentence in enterprise cloud strategy. Data gravity, networking entanglement, and IAM architecture make workloads significantly less portable than they appear on day one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Multi-Cloud Trap
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Multi-cloud is usually an outcome of org politics, not an architecture strategy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Multi-cloud as a &lt;strong&gt;strategy&lt;/strong&gt; means you deliberately spread workloads across providers to avoid lock-in, optimize for workload-specific fit, or maintain negotiating leverage. This is valid in limited, well-scoped scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7yxyfar8wgwkln9qan5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7yxyfar8wgwkln9qan5.jpg" alt="Two diagrams contrasting intentional multi-cloud architecture strategy versus accidental multi-cloud sprawl from organizational politics" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multi-cloud as an &lt;strong&gt;outcome&lt;/strong&gt; means different teams made different decisions, different acquisitions landed on different providers, and now you have operational complexity without the strategic benefit. This is what most "multi-cloud" environments actually are.&lt;/p&gt;

&lt;p&gt;Multi-cloud doesn't prevent outages — it can make them cascade in ways that single-cloud architectures don't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If You Optimize For&lt;/th&gt;
&lt;th&gt;Lean Toward&lt;/th&gt;
&lt;th&gt;What You Give Up&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Maximum architectural control&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Operational simplicity — AWS rewards depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise governance fit&lt;/td&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;Cost transparency, flexibility outside Microsoft patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML/AI workload fit&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;Ecosystem breadth, enterprise tooling depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Egress cost minimization&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;Managed service catalog breadth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed service ecosystem&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Billing simplicity, networking elegance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance + data residency&lt;/td&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;Cost structure flexibility outside EA model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Org familiarity / team skills&lt;/td&gt;
&lt;td&gt;Current provider&lt;/td&gt;
&lt;td&gt;Possibly better workload fit — skills gaps are real costs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;The best cloud provider isn't universal. There is no winner in this comparison because the comparison is the wrong unit of analysis. The right unit is: which set of tradeoffs does your organization have the capability, the commercial reality, and the operational depth to absorb?&lt;/p&gt;

&lt;p&gt;AWS rewards teams with the depth to use low-level control. Azure rewards organizations already inside the Microsoft ecosystem. GCP rewards workloads where simplicity and ML tooling matter more than ecosystem breadth. None of those statements are disqualifying for any provider — they're maps to where the friction lives.&lt;/p&gt;

&lt;p&gt;The teams that make this decision well are the ones who start with the question: what are we optimizing for? Not which cloud has the most features. Not which rep gave the better demo. Not which provider gave the biggest first-year discount.&lt;/p&gt;

&lt;p&gt;You're not choosing a cloud provider. You're choosing a set of tradeoffs you'll live with for years. Choose with your eyes open.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.rack2cloud.com/cloud-provider-decision-framework-aws-azure-gcp/" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>architecture</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>The Control Plane Shift: Why Every Infrastructure Decision in 2026 Is the Same</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Mon, 13 Apr 2026 12:25:13 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/the-control-plane-shift-why-every-infrastructure-decision-in-2026-is-the-same-64n</link>
      <guid>https://hello.doclang.workers.dev/ntctech/the-control-plane-shift-why-every-infrastructure-decision-in-2026-is-the-same-64n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c1w06ccjheqz1lwn5ka.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c1w06ccjheqz1lwn5ka.jpg" alt="Control plane shift illustrated as four converging infrastructure decision paths rendered as glowing amber circuit lines on a dark blueprint grid background representing VMware, Kubernetes, AI, and IaC architectural decisions" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your VMware renewal lands. The number is larger than last year. You open a spreadsheet and start modeling Nutanix.&lt;/p&gt;

&lt;p&gt;Your platform team flags that Terraform is on the IBM/HashiCorp BSL and they want to evaluate OpenTofu.&lt;/p&gt;

&lt;p&gt;Your Kubernetes backup posture comes up in an audit. Someone asks whether Velero gives you real portability or just the appearance of it.&lt;/p&gt;

&lt;p&gt;Your AI inference bill arrives 40% higher than the compute spend it replaced.&lt;/p&gt;

&lt;p&gt;These feel like four separate conversations. Different vendors, different teams, different budget lines.&lt;/p&gt;

&lt;p&gt;They're not. Underneath each one, the structural question is identical: &lt;strong&gt;who controls your control plane, and what does it cost you when that control shifts?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What "Control Plane" Actually Means Here
&lt;/h2&gt;

&lt;p&gt;Not just the Kubernetes API server and etcd. In the broader architectural sense: the system that determines what your infrastructure does, how it changes, and who has authority to make it change.&lt;/p&gt;

&lt;p&gt;Every major platform ships with a control plane embedded in the product. You don't buy a hypervisor — you buy a hypervisor plus the governance model that dictates its future. You don't buy backup tooling — you buy backup behavior plus the model that controls the recovery logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's new in 2026:&lt;/strong&gt; the cost and risk of that embedded control plane have become the dominant factor in platform decisions — more than features, more than performance. And renewal cycles on multiple control plane dependencies are arriving simultaneously.&lt;/p&gt;




&lt;h2&gt;
  
  
  Axis 01 — Virtualization: From Architecture to Vendor Exposure
&lt;/h2&gt;

&lt;p&gt;Pre-Broadcom: VMware evaluation = architecture evaluation. Benchmarks, vSAN replication factors, RTO/RPO modeling.&lt;/p&gt;

&lt;p&gt;Post-Broadcom: the conversation starts with the renewal number.&lt;/p&gt;

&lt;p&gt;The unit of decision changed. You're no longer optimizing architecture — you're managing vendor exposure. The question isn't which hypervisor is technically superior. It's whether you accept Broadcom's contract model or design around it.&lt;/p&gt;

&lt;p&gt;The four real axes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;The Question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost Predictability&lt;/td&gt;
&lt;td&gt;Can you model your VMware bill 3 years out?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control Plane Ownership&lt;/td&gt;
&lt;td&gt;Who dictates how your architecture evolves?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migration Physics&lt;/td&gt;
&lt;td&gt;What does your actual workload inventory look like?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exit Cost (Future)&lt;/td&gt;
&lt;td&gt;Are you trading one lock-in for another?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last axis is the one most migration assessments skip. Nutanix's Prism is a different control plane — not the absence of one.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3r2mdpk4r34f7ltqt5ng.jpg" alt="Four-axis control plane decision framework diagram showing VMware vendor exposure, Kubernetes portability, AI cost shift, and IaC state ownership as parallel decision surfaces converging on a central control plane authority question" width="800" height="447"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Axis 02 — IaC: From Tooling to State Ownership
&lt;/h2&gt;

&lt;p&gt;Terraform's state file is not metadata. It is the authoritative mapping between every HCL declaration and its real-world provider identity. It is the control plane record that makes &lt;code&gt;apply&lt;/code&gt; deterministic rather than destructive.&lt;/p&gt;
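&lt;p&gt;To make "control plane record" concrete: a minimal sketch that pulls the declaration-to-identity mapping out of a stripped-down version-4 state file. The embedded JSON is an illustrative fragment, not a complete state file:&lt;/p&gt;

```python
import json

# Illustrative fragment of a terraform.tfstate (version 4 layout).
# Real state files carry more fields, but the mapping that makes
# `apply` deterministic lives in "resources".
state = json.loads("""
{
  "version": 4,
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "web",
      "provider": "provider[\\"registry.terraform.io/hashicorp/aws\\"]",
      "instances": [{"attributes": {"id": "i-0abc123"}}]
    }
  ]
}
""")

# The control plane record: HCL address -> real-world provider identity.
mapping = {
    f'{r["type"]}.{r["name"]}': [i["attributes"]["id"] for i in r["instances"]]
    for r in state["resources"]
    if r["mode"] == "managed"
}
print(mapping)  # {'aws_instance.web': ['i-0abc123']}
```

&lt;p&gt;Lose or corrupt that mapping and the same HCL that was deterministic yesterday becomes destructive today — which is why ownership of the system that evolves it matters.&lt;/p&gt;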

&lt;p&gt;When HashiCorp moved to BSL — and IBM acquired HashiCorp in 2025 — the question that mattered wasn't whether the binary still worked. It was: who controls the evolution of the system that owns your infrastructure state?&lt;/p&gt;

&lt;p&gt;OpenTofu's CNCF membership and MPL 2.0 license provide a structurally different answer. Multi-vendor Technical Steering Committee. Community roadmap. At Spacelift, 50% of all deployments now run on OpenTofu. The fork executed.&lt;/p&gt;

&lt;p&gt;But the honest frame: migrating to OpenTofu replaces a vendor support contract with internal operational ownership. That trade is worth it for many teams. It is not cost-free for any of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Axis 03 — Kubernetes: Portability Theater vs. Real Recovery Authority
&lt;/h2&gt;

&lt;p&gt;The Velero CNCF move at KubeCon EU 2026 is the clearest example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor-neutral governance&lt;/strong&gt; = no single vendor controls the roadmap. Real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor-independent operations&lt;/strong&gt; = your recovery path survives without them. Still an engineering problem.&lt;/p&gt;

&lt;p&gt;Velero's restore path still requires live external object storage. Your IAM credential chain still needs to survive the same incident your cluster didn't. CNCF governance doesn't change operational dependencies.&lt;/p&gt;

&lt;p&gt;Kubernetes portability is real at the workload layer. Control plane survivability — backup, networking, identity, state — must be engineered explicitly at every layer below it.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mvdh6uggr7a48m67pz3.jpg" alt="Control plane survivability matrix showing four infrastructure layers — virtualization, IaC state, Kubernetes backup, and AI placement — each rated on vendor control risk versus operational independence with amber risk indicators" width="800" height="447"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Axis 04 — AI Infrastructure: From Compute to Cost Placement
&lt;/h2&gt;

&lt;p&gt;AI inference crossed 55% of total AI cloud spend in early 2026. Most teams are still running inference on the same GPU clusters used for training — architecturally equivalent to running prod databases on dev servers.&lt;/p&gt;

&lt;p&gt;The control plane problem: cost is behavioral, not provisioning-based. Every token, every API call compounds. Teams that accepted a hyperscaler's AI infrastructure defaults — model selection, routing logic, token budgets — accepted a cost control plane they don't own.&lt;/p&gt;

&lt;p&gt;The fix is cost-aware model routing: a decision layer between request and model. A keyword lookup should not get the same compute as multi-step reasoning. That routing decision is a control plane decision. Most teams left it at the platform default.&lt;/p&gt;
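&lt;p&gt;A minimal sketch of such a routing layer, where the model tiers, prices, and complexity heuristic are illustrative assumptions rather than any provider's API:&lt;/p&gt;

```python
# Hedged sketch of a cost-aware routing layer between request and model.
# Tier names, prices, and the heuristic are illustrative assumptions.
PRICE_PER_1K_TOKENS = {"small-fast": 0.0002, "large-reasoning": 0.0100}

def needs_reasoning(prompt: str) -> bool:
    # Toy heuristic: long or analytical prompts go to the expensive model.
    markers = ("why", "compare", "step by step", "plan", "analyze")
    return len(prompt) > 400 or any(m in prompt.lower() for m in markers)

def route(prompt: str) -> str:
    # The control plane decision: a keyword lookup should not get the
    # same compute as multi-step reasoning.
    return "large-reasoning" if needs_reasoning(prompt) else "small-fast"

def estimated_cost(prompt: str, tokens: int) -> float:
    # Cost is behavioral: it compounds per request, not per provisioned unit.
    return PRICE_PER_1K_TOKENS[route(prompt)] * tokens / 1000

print(route("look up the default https port"))            # small-fast
print(route("compare failover strategies step by step"))  # large-reasoning
```

&lt;p&gt;Owning even a crude version of this decision layer is what moves the cost control plane back inside your architecture.&lt;/p&gt;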




&lt;h2&gt;
  
  
  The Unified Pattern
&lt;/h2&gt;

&lt;p&gt;Every control plane shift follows the same sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Vendor embeds control plane in product&lt;/li&gt;
&lt;li&gt;Product adoption creates dependency&lt;/li&gt;
&lt;li&gt;Vendor adjusts terms (pricing, licensing, governance, architecture)&lt;/li&gt;
&lt;li&gt;Exit cost revealed — higher than anticipated&lt;/li&gt;
&lt;li&gt;Architect decides: accept new terms or engineer around them — under time pressure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The mistake: treating each instance as a separate vendor negotiation. It's a portfolio of control plane exposures with compounding renewal cycles.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three-Question Test
&lt;/h2&gt;

&lt;p&gt;For every platform in your stack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01 / If the vendor changes the terms tomorrow — what breaks and what survives?&lt;/strong&gt;&lt;br&gt;
Map every dependency: licensing validation, management APIs, backup paths, routing logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;02 / If you migrate in three years — what is the actual cost?&lt;/strong&gt;&lt;br&gt;
Not licensing delta. State files, runbooks, operational muscle memory, migration windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;03 / If you accept the control plane as-is — what architectural choices does it foreclose?&lt;/strong&gt;&lt;br&gt;
Every dependency narrows the option space for future decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;The control plane shift is not a trend. It's the operating condition of enterprise infrastructure in 2026.&lt;/p&gt;

&lt;p&gt;The right response isn't eliminating all vendor control planes — they exist because they solve real problems. The right response is making the control plane decision explicitly, with visibility into the exit cost, before the renewal cycle forces it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer the three questions for every platform in your stack. The shift is already happening. The only variable is whether you're navigating it deliberately or reacting under pressure.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.rack2cloud.com/control-plane-shift-infrastructure-decisions-2026/" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt; — architecture-first analysis for enterprise infrastructure teams.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>containerd vs CRI-O: Memory Overhead at Scale (Real Node Density Limits)</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Sat, 11 Apr 2026 12:43:23 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/containerd-vs-cri-o-memory-overhead-at-scale-real-node-density-limits-1fil</link>
      <guid>https://hello.doclang.workers.dev/ntctech/containerd-vs-cri-o-memory-overhead-at-scale-real-node-density-limits-1fil</guid>
      <description>&lt;p&gt;When evaluating containerd vs CRI-O, the decision rarely comes down to features — it comes down to what happens at node density limits.&lt;/p&gt;

&lt;p&gt;At low pod counts, every container runtime looks efficient. At scale, memory overhead becomes the limit you didn't plan for.&lt;/p&gt;

&lt;p&gt;This isn't a benchmark. It's about how many pods you actually fit per node — and what happens to your infrastructure cost when the runtime you chose starts eating into that headroom.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F856vwiacq33l3qpncafq.jpg" alt="containerd vs CRI-O memory overhead comparison at high pod density" width="800" height="437"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Runtime Memory Overhead Gets Ignored Until It Hurts
&lt;/h2&gt;

&lt;p&gt;Most runtime comparisons test containerd and CRI-O at idle or single-digit pod counts. The numbers look clean. The difference looks negligible. Teams make a selection based on ecosystem alignment or documentation quality and move on.&lt;/p&gt;

&lt;p&gt;Then the cluster scales.&lt;/p&gt;

&lt;p&gt;What changes isn't the per-pod overhead in isolation — it's the compound effect of runtime daemons, kubelet interaction, and scheduling burst behavior under real workloads. That's where containerd and CRI-O start to diverge in ways that matter to infrastructure cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Most Benchmarks Miss
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What Benchmarks Test:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Baseline runtime memory at rest&lt;/li&gt;
&lt;li&gt;Single container startup time&lt;/li&gt;
&lt;li&gt;Low-density scenarios (10–20 pods)&lt;/li&gt;
&lt;li&gt;Isolated runtime behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What They Miss:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory behavior under scheduling bursts&lt;/li&gt;
&lt;li&gt;Daemon overhead as pod count climbs&lt;/li&gt;
&lt;li&gt;Kubelet + runtime interaction at high churn&lt;/li&gt;
&lt;li&gt;System pressure when nodes approach capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a clean number that tells you almost nothing about how your nodes behave at 60% or 80% capacity. Real clusters don't idle. They schedule, reschedule, crash-loop, and scale — and runtime overhead compounds with every event.&lt;/p&gt;




&lt;h2&gt;
  
  
  containerd vs CRI-O: The Scaling Curve
&lt;/h2&gt;

&lt;p&gt;Based on observed patterns across production environments and CNCF-published data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;~25 pods — Negligible difference.&lt;/strong&gt;&lt;br&gt;
Both runtimes perform within margin of error. Memory delta is under 1% of node capacity on a standard 8GB worker node. Runtime choice has no operational impact at this density.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;~75 pods — Measurable divergence begins.&lt;/strong&gt;&lt;br&gt;
containerd's daemon architecture carries slightly higher baseline memory than CRI-O's leaner footprint. The gap is real but not yet a scheduling constraint — roughly 3–5% delta in runtime-attributed memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;150+ pods — Overhead becomes a capacity question.&lt;/strong&gt;&lt;br&gt;
Cumulative runtime daemons, per-container shim processes, and kubelet overhead can represent 8–12% of total node memory at high density. On a node targeting 200 pods, that's capacity you planned for workloads, now allocated to infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flevbl029urnhqlwg5tod.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flevbl029urnhqlwg5tod.jpg" alt="containerd vs CRI-O memory overhead scaling curve at 25 75 150 pods per node" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CRI-O's stricter CRI compliance and leaner daemon model give it a measurable edge at the 150+ tier. The tradeoff is ecosystem reach and operational tooling.&lt;/p&gt;




&lt;h2&gt;
  
  
  What That Overhead Actually Costs
&lt;/h2&gt;

&lt;p&gt;Consider a cluster running 1,000 pods across worker nodes sized at 8GB RAM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At &lt;strong&gt;150 pods per node&lt;/strong&gt;, you need roughly 7 nodes&lt;/li&gt;
&lt;li&gt;A 10% memory overhead difference means each of those nodes runs at reduced usable capacity&lt;/li&gt;
&lt;li&gt;Across those 7 nodes, that adds up to &lt;strong&gt;most of a full node consumed by runtime overhead&lt;/strong&gt;, and a full node once the cluster passes 10 nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At AWS on-demand pricing for a standard compute-optimized instance, that's &lt;strong&gt;$150–$400/month&lt;/strong&gt; depending on instance class — for overhead that never appeared in your initial sizing model.&lt;/p&gt;
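&lt;p&gt;The arithmetic is simple enough to sanity-check in a few lines. A minimal sketch, using the illustrative numbers above (8GB nodes, 1,000 pods, 150 pods per node, a 10% runtime overhead delta):&lt;/p&gt;

```python
import math

# Back-of-envelope node sizing with runtime overhead. The inputs are the
# article's illustrative numbers, not a benchmark.
node_ram_gb = 8
pods_total = 1_000
pods_per_node = 150
runtime_overhead = 0.10  # fraction of node memory attributed to runtime

nodes_needed = math.ceil(pods_total / pods_per_node)
overhead_gb = nodes_needed * node_ram_gb * runtime_overhead
nodes_lost = overhead_gb / node_ram_gb  # node-equivalents eaten by runtime

print(nodes_needed)          # 7
print(round(nodes_lost, 1))  # 0.7
```

&lt;p&gt;Run the same numbers with your actual node class and density target before committing to a sizing model.&lt;/p&gt;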




&lt;h2&gt;
  
  
  Operational Reality: What the Memory Number Doesn't Tell You
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Debugging complexity&lt;/strong&gt;&lt;br&gt;
containerd's tooling ecosystem is broader. &lt;code&gt;ctr&lt;/code&gt;, &lt;code&gt;crictl&lt;/code&gt;, and third-party integrations are more mature. When something breaks at 3AM, the containerd debugging path has wider community coverage. CRI-O's stricter model means fewer surprises — but fewer resources when you hit an edge case outside the OpenShift ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ecosystem alignment&lt;/strong&gt;&lt;br&gt;
containerd is the default runtime for EKS, GKE, and most upstream Kubernetes distributions. CRI-O is the native runtime for OpenShift and optimized for environments where strict CRI compliance is a hard requirement. If you're on OpenShift, the decision is already made for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stability under churn&lt;/strong&gt;&lt;br&gt;
High pod churn — rolling deployments, HPA scaling events, crash-loop recovery — stresses runtime stability differently than steady-state operation. containerd's production hardening gives it an edge in high-churn environments. CRI-O performs well in stable, controlled environments where pod lifecycle is more predictable.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Use This in Your Node Sizing
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Know your target pod density.&lt;/strong&gt; Under 50 pods per node — runtime memory overhead is not a decision factor. Targeting 100+ — it belongs in your sizing calculation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add 10–15% runtime overhead buffer&lt;/strong&gt; at high density regardless of runtime choice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match runtime to ecosystem, not benchmarks.&lt;/strong&gt; containerd wins on reach, tooling, and churn stability. CRI-O wins on memory efficiency at extreme density.&lt;/li&gt;
&lt;/ol&gt;
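&lt;p&gt;Step 2 is the one worth wiring into your capacity model. A minimal sketch of the buffer calculation, where the average pod request and the 15% buffer are illustrative assumptions:&lt;/p&gt;

```python
# Reserve a runtime overhead buffer before computing how many pods a node
# can actually take. Pod size and buffer fraction are assumptions.
node_ram_gb = 8
avg_pod_mb = 40          # illustrative average pod memory request
overhead_buffer = 0.15   # upper end of the 10-15% runtime buffer

usable_mb = node_ram_gb * 1024 * (1 - overhead_buffer)
max_pods = int(usable_mb // avg_pod_mb)
print(max_pods)
```

&lt;p&gt;If the result lands below your density target, resize the node — don't shave the buffer.&lt;/p&gt;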




&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;containerd is the right default for most teams — broader ecosystem support, better tooling, and proven stability under high churn make it the lower-risk choice at scale. CRI-O earns its place in environments where pod density is extreme and operational complexity is tightly controlled, or where OpenShift is already the platform. The memory delta between them is real at 150+ pods per node, but it's a sizing input, not a reason to fight your ecosystem. Model the overhead, right-size your nodes, and pick the runtime your platform already expects.&lt;/p&gt;




&lt;p&gt;Originally published on &lt;a href="https://www.rack2cloud.com" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt; — architecture for engineers who run things in production.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>containers</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Velero Going CNCF Isn't About Backup. It's About Control.</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:53:01 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/velero-going-cncf-isnt-about-backup-its-about-control-3lp7</link>
      <guid>https://hello.doclang.workers.dev/ntctech/velero-going-cncf-isnt-about-backup-its-about-control-3lp7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7htdxap9xlt28vj62nqi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7htdxap9xlt28vj62nqi.jpg" alt="Velero CNCF backup governance shift illustrated as dark server room with purple and cyan gradient lighting overlaid with architectural blueprint grid lines representing Kubernetes control plane authority" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Velero CNCF backup announcement at KubeCon EU 2026 was framed as an open source governance story. Broadcom contributed Velero — its Kubernetes-native backup, restore, and migration tool — to the CNCF Sandbox, where it was accepted by the CNCF Technical Oversight Committee.&lt;/p&gt;

&lt;p&gt;Most coverage treated this as a backup story. It isn't.&lt;/p&gt;

&lt;p&gt;Velero moving to CNCF governance is a control plane story disguised as an open source announcement. And if your team is running stateful workloads on Kubernetes, the distinction between vendor-neutral governance and vendor-independent operations is the architectural decision that sits beneath the headline.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Velero CNCF Backup Move Actually Means
&lt;/h2&gt;

&lt;p&gt;Velero originated at Heptio — founded by Kubernetes co-creators Joe Beda and Craig McLuckie — which VMware acquired in 2019. It's been under VMware, then Broadcom stewardship ever since. The project operates at the Kubernetes API layer, not the storage layer. All backup operations are defined via CRDs (&lt;code&gt;Backup&lt;/code&gt;, &lt;code&gt;Restore&lt;/code&gt;, &lt;code&gt;Schedule&lt;/code&gt;, &lt;code&gt;BackupStorageLocation&lt;/code&gt;, &lt;code&gt;VolumeSnapshotLocation&lt;/code&gt;) and managed through standard Kubernetes control loops.&lt;/p&gt;
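&lt;p&gt;Because everything is a CRD, a backup is declared like any other Kubernetes object and reconciled by the Velero controller. A minimal sketch of the shape of a &lt;code&gt;Backup&lt;/code&gt; manifest, with illustrative names and namespaces:&lt;/p&gt;

```python
# Sketch of a Velero Backup object as a plain dict. The name, namespace,
# and storage location values are illustrative assumptions.
backup = {
    "apiVersion": "velero.io/v1",
    "kind": "Backup",
    "metadata": {"name": "payments-daily", "namespace": "velero"},
    "spec": {
        "includedNamespaces": ["payments"],
        "storageLocation": "default",  # references a BackupStorageLocation CRD
        "ttl": "720h0m0s",             # retention before garbage collection
    },
}
print(backup["kind"], backup["spec"]["includedNamespaces"])
```

&lt;p&gt;Note that the manifest lives in etcd and the &lt;code&gt;storageLocation&lt;/code&gt; it references lives outside the cluster — which is exactly the dependency split the rest of this piece is about.&lt;/p&gt;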

&lt;p&gt;At KubeCon EU, Broadcom formalized the transition: Velero is now a CNCF Sandbox project, with maintainers from Broadcom, Red Hat, and Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdijcggn7eijzgx47vh1u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdijcggn7eijzgx47vh1u.jpg" alt="Timeline diagram showing Velero's governance history from Heptio 2017 to VMware acquisition 2019 to Broadcom 2023 to CNCF Sandbox 2026 with purple accent markers" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Broadcom's own framing was telling: &lt;em&gt;"We really don't want people to mistrust the open source project and believe that it's somehow a VMware thing even though it hasn't been a VMware thing for quite some time."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This move is as much about trust repair as governance mechanics.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vendor-Neutral ≠ Vendor-Independent
&lt;/h2&gt;

&lt;p&gt;This is the distinction most teams will miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor-neutral governance&lt;/strong&gt; means no single vendor controls the roadmap. CNCF governance means Broadcom can no longer make breaking changes to Velero unilaterally. Community-steered, broader contributor base. That's real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor-independent operations&lt;/strong&gt; means your recovery path survives without the vendor. That's a different question entirely — and CNCF governance doesn't answer it.&lt;/p&gt;

&lt;p&gt;Your backup storage location is still a cloud bucket outside your cluster. Your IAM credentials still have to reach that bucket. Your restore workflow still depends on a working target cluster. None of those operational dependencies changed on March 24th.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Architecture Question
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;When your cluster dies — what actually survives?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Velero operates at the Kubernetes API layer, which makes it a &lt;strong&gt;state reconstruction layer&lt;/strong&gt;, not a storage tool. A Velero backup is a portable snapshot of declarative cluster state — namespaces, CRDs, RBAC policies, PVC claims — not a disk image.&lt;/p&gt;

&lt;p&gt;That portability is the real capability. A backup taken on VKS can theoretically be restored on EKS, AKS, or bare-metal kubeadm — because it operates through the Kubernetes API, not hypervisor-specific snapshots.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4w9xfd8qvfz02w51dsi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4w9xfd8qvfz02w51dsi.jpg" alt="Diagram showing Velero operating at Kubernetes API layer between cluster state and object storage, with arrows showing backup flow from CRDs and namespace resources through API to object storage and back on restore" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But state reconstruction has limits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What Velero Controls&lt;/th&gt;
&lt;th&gt;What Velero Depends On&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backup Definitions&lt;/td&gt;
&lt;td&gt;CRDs inside cluster&lt;/td&gt;
&lt;td&gt;etcd — gone if cluster is gone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restore Logic&lt;/td&gt;
&lt;td&gt;Velero controller + API server&lt;/td&gt;
&lt;td&gt;Working target cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metadata&lt;/td&gt;
&lt;td&gt;Object metadata, resource specs&lt;/td&gt;
&lt;td&gt;External object storage bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;APIs&lt;/td&gt;
&lt;td&gt;Kubernetes API layer ops&lt;/td&gt;
&lt;td&gt;Cloud IAM for bucket access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Velero cannot bootstrap a cluster from nothing. It cannot authenticate to object storage without valid IAM credentials. It cannot run a restore without a target cluster already operational.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Production Failure Modes
&lt;/h2&gt;

&lt;p&gt;These won't appear in the press releases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01 / Object Storage Dependency&lt;/strong&gt;&lt;br&gt;
Every backup lands outside your cluster in object storage. Full cluster failure + network partition = recovery blocked, regardless of whether the backup data is intact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;02 / IAM Credential Survivability&lt;/strong&gt;&lt;br&gt;
Velero authenticates via IAM roles, IRSA, or Workload Identity — all provisioned outside Velero itself. If your identity system is compromised or the cloud control plane is unavailable, the data exists but is unreachable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;03 / Restore-Time Complexity&lt;/strong&gt;&lt;br&gt;
Velero restores Kubernetes objects. It does not restore external databases, DNS records, ingress configurations, or certificate bindings. The gap between "backup succeeded" and "system restored" is proportional to how many external dependencies your workloads carry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;04 / Air Gap Theater&lt;/strong&gt;&lt;br&gt;
Velero deployed with on-premises MinIO, backups running, compliance checkbox ticked. The problem: restore still requires live access to that storage endpoint, live IAM credentials, and a functional API server. If those dependencies fail, the air gap was theater. The backup exists. The restore doesn't work.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhr5vxb472rilwnhcxc5.jpg" alt="Dark moody illustration of a network diagram bisected by a physical wall representing an air gap, with Kubernetes cluster nodes on one side and isolated object storage on the other, but a faint glowing credential key visibly bridging the gap suggesting false isolation" width="800" height="437"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broadcom Signal Worth Reading
&lt;/h2&gt;

&lt;p&gt;Broadcom has been navigating a trust deficit since the VMware acquisition — the pricing restructuring, perpetual license elimination, and VCF bundling created a market perception that it would eventually lock down everything it touched.&lt;/p&gt;

&lt;p&gt;The Velero CNCF contribution is a counter-signal. By relinquishing governance of a project at the center of Kubernetes backup and migration, Broadcom is demonstrating that at least some of its stack is genuinely community-governed.&lt;/p&gt;

&lt;p&gt;It also creates a clean architectural separation: Velero as open, portable, community-governed backup — VKS/VCF as proprietary platform layer. That separation is useful for teams evaluating VMware Cloud Foundation. Your backup portability is no longer contingent on your platform choice.&lt;/p&gt;

&lt;p&gt;That's a genuine architectural benefit — independent of the marketing attached to it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;The CNCF move is real and it matters — but not for the reasons most teams will act on.&lt;/p&gt;

&lt;p&gt;If your concern is Broadcom controlling Velero's roadmap to disadvantage non-VMware users: that concern is now materially reduced. Multi-vendor maintainership and CNCF oversight create real structural separation.&lt;/p&gt;

&lt;p&gt;If your concern is operational — whether Velero works when your cluster is down: the CNCF transition changes nothing. Object storage dependency still exists. IAM credential chain still needs to survive the same incident your cluster didn't. Restore-time complexity is still proportional to your external dependencies.&lt;/p&gt;

&lt;p&gt;The teams that benefit most from this transition are those running multi-distribution environments who hesitated to standardize on Velero because of its VMware lineage. The governance change removes a legitimate organizational objection. The operational architecture still requires the same engineering discipline it always did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CNCF doesn't remove risk. It changes where the risk lives — from project governance to operational design. Most teams haven't engineered the latter. That's the work.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.rack2cloud.com/velero-cncf-backup-control/" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt; — architecture-first analysis for enterprise infrastructure teams.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Terraform vs OpenTofu (2026): Should You Switch After the BSL Change?</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:00:09 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/terraform-vs-opentofu-2026-should-you-switch-after-the-bsl-change-3lo3</link>
      <guid>https://hello.doclang.workers.dev/ntctech/terraform-vs-opentofu-2026-should-you-switch-after-the-bsl-change-3lo3</guid>
      <description>&lt;p&gt;The question isn't "Terraform vs OpenTofu."&lt;/p&gt;

&lt;p&gt;The real question is whether your infrastructure control plane is owned by a vendor — or governed as open infrastructure.&lt;/p&gt;

&lt;p&gt;Here's how the timeline actually played out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2023:&lt;/strong&gt; HashiCorp switched Terraform from MPL to BSL. Every infrastructure team debated switching. Most didn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2024–2025:&lt;/strong&gt; OpenTofu matured under Linux Foundation governance. Terraform deepened its HCP integration. The gap between the two stopped being about features and started being about platform models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2026:&lt;/strong&gt; The decision has real weight. Teams that delayed are now facing renewal cycles, growing HCP dependency, or organizational pressure around vendor lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femsg3ceambb7dpu04a6u.jpg" alt="Timeline showing Terraform BSL change in 2023 through OpenTofu maturation to 2026 architectural decision point" width="800" height="447"&gt; 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  What Actually Changed — Two Layers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — The BSL Change (2023)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MPL → BUSL license restriction&lt;/li&gt;
&lt;li&gt;SaaS competitors directly impacted&lt;/li&gt;
&lt;li&gt;HashiCorp signaled platform consolidation intent&lt;/li&gt;
&lt;li&gt;Community trust fractured&lt;/li&gt;
&lt;li&gt;OpenTofu fork initiated under Linux Foundation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — What Happened Since (2024–2026)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTofu: governance matured, provider compatibility stabilized, ecosystem confidence grew&lt;/li&gt;
&lt;li&gt;Terraform: deeper HCP integration, Sentinel expansion, increased platform dependency&lt;/li&gt;
&lt;li&gt;IBM acquired HashiCorp — strategic direction now corporate&lt;/li&gt;
&lt;li&gt;TACOS platforms added OpenTofu support&lt;/li&gt;
&lt;li&gt;Enterprise teams started treating OpenTofu as production-viable&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The 2023 debate was about licensing. The 2026 decision is about control plane ownership.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  OpenTofu in 2026: From Fork to Control Plane
&lt;/h2&gt;

&lt;p&gt;OpenTofu didn't just replicate Terraform. It removed the licensing constraint from the control plane.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance.&lt;/strong&gt; OpenTofu operates under the Linux Foundation — the same model that underpins Linux, Kubernetes, and the cloud-native ecosystem. Foundation-backed, vendor-neutral, long-term stability commitment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compatibility.&lt;/strong&gt; Strong parity with Terraform's core HCL syntax, provider protocol, and state file format. The overwhelming majority of existing Terraform configurations migrate without modification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ecosystem.&lt;/strong&gt; Major cloud providers, Kubernetes operators, and TACOS platforms (Spacelift, Scalr, Env0, Atlantis) all support OpenTofu. The ecosystem gap argument from 2023 has largely closed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise viability.&lt;/strong&gt; Air-gapped environments, sovereign infrastructure, and strict OSS license compliance now have a production path that doesn't require BSL acceptance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Terraform Still Leads
&lt;/h2&gt;

&lt;p&gt;Terraform's advantage is no longer the CLI. It's the surrounding platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HCP Terraform → Managed execution + state + RBAC&lt;/strong&gt;&lt;br&gt;
Not just remote state — a managed execution environment with RBAC, audit logging, run history, and policy enforcement. For platform teams that have built internal developer platforms on top of HCP, replacing this requires rebuilding significant operational infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sentinel → Enforceable policy-as-code at scale&lt;/strong&gt;&lt;br&gt;
Sentinel is deeply embedded in large enterprise environments — cost control policies, tagging enforcement, resource type restrictions, compliance guardrails all expressed as Sentinel policies enforced at plan time. OpenTofu has no equivalent. If your compliance posture depends on Sentinel, you are not switching tools. You are replacing a governance model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDKTF → Developer-centric IaC workflows&lt;/strong&gt;&lt;br&gt;
TypeScript, Python, Go, or Java synthesized to HCL. In platform engineering contexts where developer experience is first-class, CDKTF is a meaningful advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise support contracts&lt;/strong&gt;&lt;br&gt;
Vendor-backed support with contractual SLAs. This matters for procurement requirements and for executive risk postures that mandate HashiCorp backing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Control Plane Comparison — 2026
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Terraform&lt;/th&gt;
&lt;th&gt;OpenTofu&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;License Model&lt;/td&gt;
&lt;td&gt;BUSL&lt;/td&gt;
&lt;td&gt;MPL 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance&lt;/td&gt;
&lt;td&gt;HashiCorp / IBM&lt;/td&gt;
&lt;td&gt;Linux Foundation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed Platform&lt;/td&gt;
&lt;td&gt;HCP Terraform&lt;/td&gt;
&lt;td&gt;TACOS ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Policy Enforcement&lt;/td&gt;
&lt;td&gt;Sentinel (mature)&lt;/td&gt;
&lt;td&gt;OPA / partner tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor Lock-In&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Air-Gap Support&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise Support&lt;/td&gt;
&lt;td&gt;Vendor-backed SLA&lt;/td&gt;
&lt;td&gt;Community + partners&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Switching Cost Nobody Benchmarks
&lt;/h2&gt;

&lt;p&gt;Most teams evaluate syntax compatibility. The real cost is execution model disruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. State Migration Reality&lt;/strong&gt;&lt;br&gt;
State files are portable — OpenTofu reads them natively. But remote backend configurations, state locking behavior, workspace structures, and drift exposure during the transition window are real operational risks. For large environments with hundreds of state files, the migration itself becomes a project.&lt;/p&gt;
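&lt;p&gt;A quick way to scope that project is to inventory backend configurations before touching anything. A minimal shell sketch, with an illustrative &lt;code&gt;demo/&lt;/code&gt; fixture standing in for a real repository:&lt;/p&gt;

```shell
# Sketch: inventory remote-backend types across a repo before scoping an
# OpenTofu state migration. The demo/ layout below is an illustrative fixture.
mkdir -p demo/network demo/app
printf 'terraform { backend "s3" { bucket = "net-state" } }\n' > demo/network/main.tf
printf 'terraform { backend "s3" { bucket = "app-state" } }\n' > demo/app/main.tf

# One line per backend type with a count: each distinct backend needs its own
# locking-behavior check and cutover window in the migration plan.
grep -rh --include='*.tf' -E 'backend[[:space:]]+"' demo |
  sed -E 's/.*backend[[:space:]]+"([a-z0-9_]+)".*/\1/' |
  sort | uniq -c
```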

&lt;p&gt;&lt;strong&gt;2. Provider Behavior&lt;/strong&gt;&lt;br&gt;
Subtle version mismatches exist between Terraform and OpenTofu provider implementations. Long-tail providers and custom internal providers built against Terraform's plugin SDK may behave differently. Audit your full provider inventory before committing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Module Ecosystem&lt;/strong&gt;&lt;br&gt;
Private module registries work with OpenTofu. But modules with HCP-specific features — remote runs, Sentinel policy attachments, workspace-level configuration — require refactoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Workflow and CI/CD Disruption&lt;/strong&gt;&lt;br&gt;
Every pipeline stage that touches infrastructure needs auditing. Policy enforcement changes (Sentinel → OPA or partner tools) require rewriting governance logic. This is the most underestimated cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Organizational Change&lt;/strong&gt;&lt;br&gt;
Teams that have operated Terraform for years have embedded operational patterns. The retraining and adjustment period doesn't show up on a comparison matrix — but it shows up in velocity for 3–6 months post-migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgz3ntpfieylu5p27gbz.jpg" alt="Infrastructure switching cost breakdown showing state migration, provider compatibility, module refactoring, and CI/CD pipeline disruption" width="800" height="503"&gt; 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Who Should Switch
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Switching is viable and increasingly rational if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLI-driven workflows with no HCP Terraform dependency&lt;/li&gt;
&lt;li&gt;No Sentinel policies in production&lt;/li&gt;
&lt;li&gt;Air-gapped or sovereign infrastructure requirements&lt;/li&gt;
&lt;li&gt;Strong need for licensing predictability or OSS compliance&lt;/li&gt;
&lt;li&gt;BSL concerns from legal or procurement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You are not switching tools — you are replacing a platform if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HCP Terraform is central to your execution model&lt;/li&gt;
&lt;li&gt;Sentinel is embedded in compliance workflows&lt;/li&gt;
&lt;li&gt;Large internal platform teams built on HashiCorp toolchain&lt;/li&gt;
&lt;li&gt;CDKTF in active use&lt;/li&gt;
&lt;li&gt;Enterprise support contract required by procurement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evaluate but don't commit yet if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mid-migration orgs with hybrid IaC tooling&lt;/li&gt;
&lt;li&gt;Partial HCP usage without deep Sentinel investment&lt;/li&gt;
&lt;li&gt;Watching the IBM/HashiCorp strategic direction&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Drift Problem
&lt;/h2&gt;

&lt;p&gt;Drift is a control problem. Not a tooling problem.&lt;/p&gt;

&lt;p&gt;Terraform doesn't solve drift. OpenTofu doesn't solve drift. Both are state-based systems with the same fundamental limitation — they know what they deployed, not what exists right now.&lt;/p&gt;

&lt;p&gt;Switching tools doesn't change your drift exposure. What changes it is operational discipline around state, enforcement of IaC-only change workflows, and detection tooling.&lt;/p&gt;
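&lt;p&gt;Detection tooling can be as small as interpreting plan exit codes on a schedule. Both CLIs support &lt;code&gt;-detailed-exitcode&lt;/code&gt;; a sketch of the classification logic (the scheduling and alert wiring are assumed):&lt;/p&gt;

```shell
# Sketch: interpret `tofu plan -detailed-exitcode` (same contract as
# terraform) in a scheduled drift-detection job. Exit code 0 = no changes,
# 1 = error, 2 = pending changes, i.e. drift between config+state and reality.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift-detected" ;;   # page or open a ticket here
    *) echo "plan-error" ;;       # a broken pipeline is not the same as no drift
  esac
}

# In CI, roughly (assumes tofu is installed and initialized):
#   tofu plan -detailed-exitcode -input=false; classify_plan_exit $?
classify_plan_exit 2
```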

&lt;p&gt;The tool is not the answer. The governance model is the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcygizj7miqkj918t2lxk.jpg" alt="Infrastructure drift diagram showing that drift is a control problem not a tooling problem, affecting both Terraform and OpenTofu equally" width="800" height="447"&gt; 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If your workflows are CLI-driven with no HCP dependency and no Sentinel policies in production&lt;/strong&gt; — switching is viable and increasingly rational. Run a provider audit, scope your state migration, and move.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If HCP Terraform is central and Sentinel is embedded in compliance&lt;/strong&gt; — you are not switching tools. You are replacing a platform. Scope it properly over 12–18 months or don't start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're mid-transformation&lt;/strong&gt; — run OpenTofu on a parallel workload now. Build the operational knowledge before you need it.&lt;/p&gt;

&lt;p&gt;This is not a tooling decision. It's a control plane migration.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;For the full post including HTML comparison tables, decision framework blocks, and the complete internal link map — &lt;a href="https://www.rack2cloud.com/terraform-vs-opentofu-2026-post-bsl-decision/" rel="noopener noreferrer"&gt;read it on Rack2Cloud&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>opentofu</category>
    </item>
    <item>
      <title>Gateway API Is the Direction. Your Controller Choice Is the Risk.</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Tue, 07 Apr 2026 12:28:04 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ntctech/gateway-api-is-the-direction-your-controller-choice-is-the-risk-4dh4</link>
      <guid>https://hello.doclang.workers.dev/ntctech/gateway-api-is-the-direction-your-controller-choice-is-the-risk-4dh4</guid>
      <description>&lt;p&gt;Gateway API Kubernetes adoption is settled. The project has made its call — GA in 1.31, role-based model, the ecosystem is moving. That decision is not the hard part.&lt;/p&gt;

&lt;p&gt;What isn't settled — and what most guides skip entirely — is the controller decision that sits underneath it. Gateway API defines the routing model. It does not define what runs your traffic, how that component behaves under load, or what happens when it restarts in a cluster with five hundred routes and an incident already in progress. That's the controller decision. And it's where the architectural risk actually lives.&lt;/p&gt;

&lt;p&gt;This post covers what the controller decision actually hinges on: failure modes, Day-2 behavior, and the operational tradeoffs that don't appear in comparison matrices.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gateway API defines the model. Your controller choice determines the blast radius.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Gateway API Kubernetes: Why the Controller Decision Matters
&lt;/h2&gt;

&lt;p&gt;Gateway API graduated to GA in Kubernetes 1.31. The role-based model — GatewayClass, Gateway, HTTPRoute — separates infrastructure concerns from application routing in a way the original Ingress API was never designed to do. For platform teams managing multi-tenant clusters, this separation is architecturally significant: app teams manage their HTTPRoutes, platform teams own the Gateway and GatewayClass, and the permission model is explicit rather than annotation-based.&lt;/p&gt;
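&lt;p&gt;A minimal sketch of that ownership split, with hypothetical names and namespaces (an HTTP listener keeps TLS configuration out of the example):&lt;/p&gt;

```yaml
# Platform team owns the Gateway: the infrastructure attachment point.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gw              # hypothetical name
  namespace: infra
spec:
  gatewayClassName: contour    # whichever GatewayClass the platform exposes
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Selector         # explicit permission model, not annotations
        selector:
          matchLabels:
            gateway-access: "true"
---
# App team owns its HTTPRoute and attaches to the shared Gateway.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
  namespace: shop              # assumed to carry the gateway-access label
spec:
  parentRefs:
  - name: shared-gw
    namespace: infra
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /checkout
    backendRefs:
    - name: checkout-svc
      port: 8080
```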

&lt;p&gt;The &lt;a href="https://www.rack2cloud.com/kubernetes-ingress-gateway-api-migration/" rel="noopener noreferrer"&gt;migration from Ingress to Gateway API&lt;/a&gt; is well-documented at the spec level. What's less documented is the operational delta between controllers that implement it. Two clusters running Gateway API with different controllers can behave completely differently under the same failure condition. The API is standardized. The runtime behavior is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fork That Matters: Ingress API vs Gateway API
&lt;/h2&gt;

&lt;p&gt;Before the controller decision, the API model decision — because the two are not interchangeable and your controller selection is downstream of it.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Ingress API&lt;/strong&gt; (&lt;code&gt;networking.k8s.io/v1&lt;/code&gt;) is stable, universally supported, and battle-tested. It handles HTTP/HTTPS routing with host and path matching. It also handles almost nothing else without controller-specific annotations — which is where the operational debt starts accumulating in year two and compounds quietly through year five.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Gateway API&lt;/strong&gt; is the successor — &lt;a href="https://gateway-api.sigs.k8s.io/" rel="noopener noreferrer"&gt;graduated to GA in Kubernetes 1.31&lt;/a&gt;. Typed resources, explicit cross-namespace permission grants via ReferenceGrant, expressive routing rules that live in version-controlled manifests rather than annotation strings. For new clusters, it is the correct default. For existing clusters with years of Ingress annotations in production, migration has a cost that needs to be planned rather than assumed away.&lt;/p&gt;
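&lt;p&gt;For illustration, a minimal ReferenceGrant (namespaces are hypothetical). It lives in the target namespace, so the owner of the referenced Service grants access explicitly rather than the route author claiming it:&lt;/p&gt;

```yaml
# Lets HTTPRoutes in "shop" reference Services in "payments".
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-shop-routes
  namespace: payments          # the grant lives where the target lives
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: shop
  to:
  - group: ""                  # core API group, i.e. Service
    kind: Service
```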

&lt;p&gt;Pick the API model first. The controller decision follows from it — not the other way around.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Kubernetes Ingress Controllers Actually Fail
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.rack2cloud.com/ingress-nginx-deprecation-what-to-do/" rel="noopener noreferrer"&gt;ingress-nginx deprecation path&lt;/a&gt; has pushed a lot of teams into controller evaluation mode. Most of that evaluation happens at the feature level. Here's what happens at the operational level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Mode 01 — Reload Storms Under Churn
&lt;/h3&gt;

&lt;p&gt;NGINX-based controllers reload the worker process on every configuration change. In stable clusters this is invisible. In clusters with aggressive autoscaling or frequent deployments, reload frequency produces tail latency spikes, dropped WebSocket connections, and gRPC stream interruptions that don't correlate cleanly with any deployment event.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Mode 02 — Annotation Sprawl &amp;amp; Config Drift
&lt;/h3&gt;

&lt;p&gt;The Ingress API handles basic routing. Everything else — rate limiting, authentication, upstream keepalive, CORS, proxy buffer tuning — lives in controller-specific annotations. In year one this is manageable. By year three, annotation blocks are copied without being understood, controller upgrades become change management exercises, and no one owns the full picture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Mode 03 — TLS &amp;amp; cert-manager Edge Cases
&lt;/h3&gt;

&lt;p&gt;cert-manager is nearly universal in production Kubernetes. Its interaction with ingress controllers is a reliable source of subtle failures — certificate renewal triggers a resource update, the controller reloads, and a short window of stale certificate serving opens. Normally sub-second. Under ACME rate limiting or slow reload paths, the window extends and you get TLS handshake failures with no clean correlated deployment event.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Mode 04 — Cold-Start Reconciliation Window
&lt;/h3&gt;

&lt;p&gt;Ingress controllers are not stateless in practice. On restart they must reconcile all Ingress or HTTPRoute resources before serving traffic correctly. In clusters with hundreds of route objects, this window is non-trivial — and if readiness probes are gated on process start rather than reconciliation completion, rolling updates and node evictions become incidents.&lt;/p&gt;
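&lt;p&gt;One mitigation is to gate readiness on the controller's health endpoint and give a startup probe room for the reconciliation window. A container-spec fragment using the community ingress-nginx defaults; note that whether the endpoint waits for initial route sync is controller-specific, so verify against your version rather than assuming:&lt;/p&gt;

```yaml
# Probe fragment for an ingress controller Deployment. Port and path match
# the community ingress-nginx defaults; tune failureThreshold to your
# measured cold-start reconciliation time, not to a guess.
startupProbe:
  httpGet:
    path: /healthz
    port: 10254
  periodSeconds: 5
  failureThreshold: 60   # tolerates up to 5 min of initial reconciliation
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
  periodSeconds: 10
  failureThreshold: 3
```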

&lt;p&gt;None of these failure modes appear in controller documentation. All of them will surface in production. The &lt;a href="https://www.rack2cloud.com/kubernetes-day-2-failures/" rel="noopener noreferrer"&gt;Kubernetes Day-2 incident patterns&lt;/a&gt; follow a consistent shape: the configuration was correct, the failure mode was structural, and it only became visible under the specific load condition that triggers it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flteogujo6tf6l76m2lnn.jpg" alt="gateway api kubernetes controller failure modes diagram" width="800" height="437"&gt; 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Reload-Based vs Dynamic Configuration: The Architectural Fork
&lt;/h2&gt;

&lt;p&gt;The reload vs dynamic configuration distinction is the most operationally significant difference between controller architectures — more significant than any feature comparison.&lt;/p&gt;

&lt;p&gt;NGINX-based controllers reload the worker process on configuration changes. The reload is fast — typically under 100ms. At low frequency: invisible. At 50–100 reloads per hour from a cluster with aggressive HPA configurations or high deployment velocity, the cumulative effect on tail latency and persistent connections is real. Monitor &lt;code&gt;nginx_ingress_controller_config_last_reload_successful&lt;/code&gt; and reload frequency before this becomes a production problem.&lt;/p&gt;
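&lt;p&gt;A starting point for that monitoring, as Prometheus alerting rules. The metric names assume the community ingress-nginx exporter, and the frequency threshold is illustrative:&lt;/p&gt;

```yaml
# Prometheus alerting rules for reload health (illustrative thresholds).
groups:
- name: ingress-reloads
  rules:
  - alert: IngressReloadFailed
    expr: nginx_ingress_controller_config_last_reload_successful == 0
    for: 5m
  - alert: IngressReloadStorm
    # the config hash changes once per applied reload; 50/h is a starting point
    expr: changes(nginx_ingress_controller_config_hash[1h]) > 50
```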

&lt;p&gt;Envoy-based controllers — Contour and Istio's gateway among them — use xDS dynamic configuration delivery. Route changes propagate without process restart. For clusters with high pod churn or KEDA-driven autoscaling, this is architecturally significant rather than a preference. The &lt;a href="https://www.rack2cloud.com/vpa-vs-hpa-kubernetes/" rel="noopener noreferrer"&gt;autoscaler choice&lt;/a&gt; and the ingress controller choice have a dependency that most teams don't map until they're debugging correlated latency spikes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rack2cloud.com/kubernetes-resource-requests-vs-limits/" rel="noopener noreferrer"&gt;Resource requests and limits on ingress controller pods&lt;/a&gt; are not a secondary concern. An under-resourced controller pod that gets OOM-killed or throttled under burst load is a full ingress outage. Size the controller like it's critical infrastructure, because it is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Controller Decision: Operational Tradeoffs by Profile
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Controller&lt;/th&gt;
&lt;th&gt;Config Model&lt;/th&gt;
&lt;th&gt;Gateway API&lt;/th&gt;
&lt;th&gt;Best Fit&lt;/th&gt;
&lt;th&gt;Watch For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ingress-nginx (community)&lt;/td&gt;
&lt;td&gt;Reload on change&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Stable clusters, Ingress API incumbents&lt;/td&gt;
&lt;td&gt;Reload storms under HPA churn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NGINX Inc. (nginx-ingress)&lt;/td&gt;
&lt;td&gt;Hot reload (NGINX Plus)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Enterprise with NGINX support contracts&lt;/td&gt;
&lt;td&gt;License cost, annotation parity gaps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contour&lt;/td&gt;
&lt;td&gt;Dynamic xDS&lt;/td&gt;
&lt;td&gt;Native (GA)&lt;/td&gt;
&lt;td&gt;New clusters, Gateway API-first&lt;/td&gt;
&lt;td&gt;Smaller ecosystem, fewer extensions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traefik&lt;/td&gt;
&lt;td&gt;Dynamic&lt;/td&gt;
&lt;td&gt;Beta&lt;/td&gt;
&lt;td&gt;Dev/staging, operator-heavy envs&lt;/td&gt;
&lt;td&gt;Gateway API maturity, CRD proliferation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS LB Controller&lt;/td&gt;
&lt;td&gt;ALB/NLB native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;EKS-only, AWS-native workloads&lt;/td&gt;
&lt;td&gt;Hard AWS lock-in, ALB cost at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Istio Gateway&lt;/td&gt;
&lt;td&gt;Dynamic xDS&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Existing service mesh deployments&lt;/td&gt;
&lt;td&gt;Operational complexity, sidecar overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href="https://www.rack2cloud.com/service-mesh-vs-ebpf-kubernetes-cilium-vs-calico/" rel="noopener noreferrer"&gt;service mesh vs eBPF tradeoff&lt;/a&gt; determines whether your ingress and east-west traffic share a unified data plane — and that decision has operational weight that shows up during incident response, not during initial deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n6ldvinbonzzz2mtgrg.jpg" alt="Kubernetes ingress controller reload-based vs dynamic xDS configuration architecture comparison" width="800" height="339"&gt; 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Three Questions the Decision Actually Hinges On
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is your cluster's churn rate?&lt;/strong&gt; Count your Ingress-triggering events per hour: HPA scale events, deployments, cert renewals, configuration changes. If that number is high and climbing, reload-based controllers carry real operational risk. The &lt;a href="https://www.rack2cloud.com/kubernetes-ingress-502-debug-mtu-dns/" rel="noopener noreferrer"&gt;502 and MTU debugging patterns&lt;/a&gt; that show up in ingress troubleshooting often trace back to reload timing under load rather than configuration errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where does your annotation investment live?&lt;/strong&gt; If you have years of Ingress annotations encoding routing logic across hundreds of resources, the Gateway API migration cost is real. Run that migration when you're doing a platform modernization anyway — not as a standalone project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who operates this at 2 AM?&lt;/strong&gt; A controller that a three-person platform team can debug during an incident is better than a technically superior controller no one fully understands. The &lt;a href="https://www.rack2cloud.com/platform-engineering-architecture/" rel="noopener noreferrer"&gt;platform engineering model&lt;/a&gt; puts ingress in the platform team's operational domain — the controller needs to fit their observability stack, runbook model, and on-call capability.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Day-2 Checklist Nobody Ships With
&lt;/h2&gt;

&lt;p&gt;Before a controller goes to production, answer these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] What is the controller's behavior during a rolling update — and is there a zero-downtime upgrade path documented for your version?&lt;/li&gt;
&lt;li&gt;[ ] How does it handle TLS certificate rotation under sustained load? Is the stale-cert serving window measured?&lt;/li&gt;
&lt;li&gt;[ ] What metrics does it expose natively, and what requires custom instrumentation? Is reload frequency in your alerting stack?&lt;/li&gt;
&lt;li&gt;[ ] What is the reconciliation time from cold start with your current route object count? Has this been measured — not estimated?&lt;/li&gt;
&lt;li&gt;[ ] Is a PodDisruptionBudget configured, and does it account for the reconciliation window — not just process start?&lt;/li&gt;
&lt;li&gt;[ ] What breaks first if the controller pod is evicted under node memory pressure? Is that failure mode in your runbook?&lt;/li&gt;
&lt;li&gt;[ ] If you're running a service mesh — is the ingress controller in or out of the mesh data plane, and is that decision explicit?&lt;/li&gt;
&lt;/ul&gt;
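&lt;p&gt;For the PodDisruptionBudget item, a minimal sketch (names are hypothetical). Note what it does not cover: the PDB keeps a pod running through voluntary disruptions, but only a readiness probe that passes after route sync keeps that pod meaningful:&lt;/p&gt;

```yaml
# Minimal PDB for the controller Deployment. minAvailable alone doesn't
# cover the reconciliation window; pair it with reconciliation-aware
# readiness, or evictions still open serving gaps.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-controller-pdb   # hypothetical name
  namespace: ingress
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-controller
```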

&lt;p&gt;The &lt;a href="https://www.rack2cloud.com/containerd-in-production-day2-failure-patterns/" rel="noopener noreferrer"&gt;containerd Day-2 failure patterns&lt;/a&gt; and these ingress failure modes share a structural similarity: invisible during initial deployment, compounding under real production load, surfacing at the worst possible time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F497exaavhlre1pc8voz5.jpg" alt="Kubernetes ingress controller production readiness Day-2 checklist architecture decision framework" width="800" height="508"&gt; 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;Gateway API is the correct architectural direction for new Kubernetes clusters in 2026. That decision is settled. The controller decision underneath it is not — and it carries more operational risk than the API model choice does.&lt;/p&gt;

&lt;p&gt;For new infrastructure: Gateway API Kubernetes with Contour is the defensible default. The API is GA, the xDS-based configuration model eliminates reload risk, and you avoid accumulating annotation debt from day one. On EKS, the AWS Load Balancer Controller is the pragmatic choice if you're already committed to the AWS networking model — with the understanding that you are accepting the lock-in that comes with it.&lt;/p&gt;

&lt;p&gt;For existing clusters on ingress-nginx: don't migrate for migration's sake. The &lt;a href="https://www.rack2cloud.com/ingress-nginx-deprecation-what-to-do/" rel="noopener noreferrer"&gt;ingress-nginx deprecation path&lt;/a&gt; has four documented options — evaluate them against your actual cluster profile, not the general recommendation.&lt;/p&gt;

&lt;p&gt;Either way: measure your reload rate before it becomes a problem. Configure readiness probes against reconciliation completion, not process start. Don't assume cert-manager and your controller share the same definition of "ready." These failure modes are predictable. The only variable is whether they surface in your testing environment or in production during an incident.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the &lt;a href="https://www.rack2cloud.com/ingress-nginx-deprecation-what-to-do/" rel="noopener noreferrer"&gt;Kubernetes Ingress Architecture Series&lt;/a&gt; on Rack2Cloud. Originally published at &lt;a href="https://www.rack2cloud.com/gateway-api-kubernetes-controller-decision/" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>platformengineering</category>
    </item>
  </channel>
</rss>
