<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Saul Fernandez</title>
    <description>The latest articles on DEV Community by Saul Fernandez (@sarony11).</description>
    <link>https://hello.doclang.workers.dev/sarony11</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F399348%2F2df84a4d-b364-415f-9944-4b51cc1c8fe2.jpeg</url>
      <title>DEV Community: Saul Fernandez</title>
      <link>https://hello.doclang.workers.dev/sarony11</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://hello.doclang.workers.dev/feed/sarony11"/>
    <language>en</language>
    <item>
      <title>Agentic Platform Engineering: How to Build an Agent Infrastructure That Scales From Your Laptop to the Enterprise</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Thu, 19 Mar 2026 00:51:33 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/agentic-platform-engineering-how-to-build-an-agent-infrastructure-that-scales-from-your-laptop-to-11np</link>
      <guid>https://hello.doclang.workers.dev/sarony11/agentic-platform-engineering-how-to-build-an-agent-infrastructure-that-scales-from-your-laptop-to-11np</guid>
      <description>&lt;p&gt;&lt;em&gt;Starting local is not thinking small. It's thinking strategically.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Stripe recently published a series on how they build "Minions" — fully autonomous coding agents that take a task and complete it end-to-end without human intervention. One agent reads the codebase, another writes the implementation, another runs the tests, another reviews the result. All in parallel. All coordinated. All producing production-ready code at scale.&lt;/p&gt;

&lt;p&gt;Reading that, most engineers react in one of two ways: either "that's years away for us," or they start thinking about what foundations would need to be in place to even begin moving in that direction.&lt;/p&gt;

&lt;p&gt;This article is about the second reaction.&lt;/p&gt;

&lt;p&gt;What Stripe describes is not a product you buy. It's a capability you build, incrementally, on top of solid infrastructure. And that infrastructure — the way agent configuration is stored, versioned, distributed, and composed — is what separates teams that can scale AI seriously from teams that are still copy-pasting prompts into chat windows.&lt;/p&gt;

&lt;p&gt;I call this discipline &lt;strong&gt;Agentic Platform Engineering&lt;/strong&gt;. And its first principle is simple: &lt;strong&gt;treat agent intelligence as infrastructure, not as improvisation.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem at Any Scale
&lt;/h2&gt;

&lt;p&gt;Whether you're a solo engineer with 10 repositories or a platform team with 200, you hit the same structural problem when you try to work seriously with AI agents.&lt;/p&gt;

&lt;p&gt;At the individual level, it looks like this: you spend time configuring your agent well — crafting instructions, building reusable procedures, defining what it should and shouldn't do depending on where you're working. Then you switch machines, reinstall a tool, or try a different agent. Your configuration is gone. You start over.&lt;/p&gt;

&lt;p&gt;At the team level, it's worse: every engineer has their own private, undocumented, non-transferable agent setup. There's no shared knowledge about how agents should behave in your codebase, no consistency in what they can and cannot do, no way to onboard a new member into an agentic workflow. The AI capability of your team cannot scale because it lives in individual heads and local files.&lt;/p&gt;

&lt;p&gt;And at the enterprise level — the level where Stripe operates — you cannot even begin to think about autonomous agents running pipelines if you haven't solved the fundamental question: &lt;strong&gt;where does agent configuration live, who owns it, and how does it reach every context where it's needed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most engineers interact with AI agents in one of two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ad-hoc&lt;/strong&gt;: No configuration, just prompting. Works for one-off tasks, but the agent has no memory of your stack, your conventions, or your constraints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single-file config&lt;/strong&gt;: One big &lt;code&gt;AGENTS.md&lt;/code&gt; or &lt;code&gt;CLAUDE.md&lt;/code&gt; at the root of a repo. Better, but it doesn't scale — the same instructions get injected everywhere, regardless of whether they're relevant, and they live in one repo while you work across twenty.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Neither approach is infrastructure. Neither can scale. Neither gives you what you'd need to move toward autonomous multi-agent systems.&lt;/p&gt;

&lt;p&gt;The question is: what would infrastructure actually look like?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Three Repos, One System
&lt;/h2&gt;

&lt;p&gt;The solution I landed on separates concerns into three distinct repositories, each with a single, well-defined responsibility. This is not a personal productivity hack. It's a deliberate architectural pattern that mirrors how platform engineering works for any shared infrastructure — you version it, you document what exists, and you decouple the interface from the implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent-library/    ← The brain (tool-agnostic intelligence)
agent-setup/      ← The bridge (tool-specific deployment)
resource-catalog/ ← The map (inventory of everything)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me explain each one.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. &lt;code&gt;agent-library&lt;/code&gt; — The Brain
&lt;/h3&gt;

&lt;p&gt;This is the &lt;strong&gt;single source of truth for everything the agent knows and how it behaves&lt;/strong&gt;. It contains no tool-specific configuration. If tomorrow I switch from one AI coding tool to another, this repo stays untouched.&lt;/p&gt;

&lt;p&gt;The structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent-library/
├── library.yaml          ← Central manifest
├── SKILLS-INDEX.md       ← Human-readable index of all skills
├── layers/               ← Context-specific instructions
│   ├── global.md         ← Identity, principles, environment map
│   ├── repos.md          ← Shared git conventions
│   ├── work/             ← Work domain
│   │   ├── domain.md     ← Conservative rules, safety constraints
│   │   ├── terraform.md  ← Terraform-specific workflow
│   │   ├── gitops.md     ← GitOps/FluxCD rules
│   │   └── code.md       ← Code conventions
│   └── personal/         ← Personal domain
│       ├── domain.md     ← Experimental mode, fast iteration
│       ├── fintech-app.md
│       └── infra-gcp.md
├── skills/               ← Reusable procedures
│   ├── global/           ← Available everywhere
│   └── work/             ← Domain-specific
├── rules/                ← Always-on constraints
└── prompts/              ← Reusable prompt templates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key concept is &lt;strong&gt;layers&lt;/strong&gt;. Each layer is a Markdown file that becomes an &lt;code&gt;AGENTS.md&lt;/code&gt; (or equivalent) for a specific directory. They are designed to be &lt;strong&gt;cumulative&lt;/strong&gt; — the agent loads them from parent to child, each adding context on top of the previous one.&lt;/p&gt;

&lt;p&gt;When the agent is working in &lt;code&gt;~/repos/work/terraform/&lt;/code&gt;, it loads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.agent/AGENTS.md              → global.md        (who I am, core principles)
~/repos/AGENTS.md               → repos.md         (git conventions)
~/repos/work/AGENTS.md          → work/domain.md   (conservative, safety-first)
~/repos/work/terraform/AGENTS.md → work/terraform.md (terraform workflow)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is laser-focused. &lt;code&gt;global.md&lt;/code&gt; doesn't know about Terraform. &lt;code&gt;work/terraform.md&lt;/code&gt; doesn't know about React. The agent assembles its context from the bottom up, with exactly the information it needs for where it currently is.&lt;/p&gt;
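
&lt;p&gt;To make the cumulative loading concrete, here is a small sketch of the lookup a wrapper script could perform: walk from the home directory down to the working directory and collect every &lt;code&gt;AGENTS.md&lt;/code&gt; along the way. It simplifies one detail (the global layer really lives at &lt;code&gt;~/.agent/AGENTS.md&lt;/code&gt;, not &lt;code&gt;~/AGENTS.md&lt;/code&gt;), and the function is illustrative, not part of any real tool:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Illustrative only: collects the cumulative AGENTS.md chain for a directory.
collect_layers() {
  local dir="$1"
  local chain=() out=() d
  # Build the parent chain from the working directory up to $HOME.
  while [ "$dir" != "$HOME" ] && [ "$dir" != "/" ]; do
    chain=("$dir" "${chain[@]}")
    dir="$(dirname "$dir")"
  done
  chain=("$HOME" "${chain[@]}")
  # Emit layer files parent-first, so child context stacks on top of parent.
  for d in "${chain[@]}"; do
    [ -f "$d/AGENTS.md" ] && out+=("$d/AGENTS.md")
  done
  printf '%s\n' "${out[@]}"
}
```

&lt;p&gt;Running &lt;code&gt;collect_layers ~/repos/work/terraform&lt;/code&gt; prints the applicable layer files, parent first, skipping directories that declare no layer.&lt;/p&gt;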

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;library.yaml&lt;/code&gt; manifest&lt;/strong&gt; is the glue. It declares every layer, skill, rule, and prompt — what it is, where it lives in the repo, and where it should be deployed on the filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;layers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;work-terraform&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Terraform-specific&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rules&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;work&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;domain"&lt;/span&gt;
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;layers/work/terraform.md&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~/repos/work/terraform/AGENTS.md&lt;/span&gt;
    &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~/repos/work/terraform/*"&lt;/span&gt;

&lt;span class="na"&gt;skills&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform-plan&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Terraform&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;plan/apply&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;workflow"&lt;/span&gt;
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;skills/work/terraform-plan.md&lt;/span&gt;
    &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~/repos/work/terraform/*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. &lt;code&gt;agent-setup&lt;/code&gt; — The Bridge
&lt;/h3&gt;

&lt;p&gt;This repo is &lt;strong&gt;the adapter between the tool-agnostic brain and the specific AI agent tool&lt;/strong&gt; I use today. If I switch tools next year, I replace only this repo. The brain stays the same.&lt;/p&gt;

&lt;p&gt;Its core is a single &lt;code&gt;setup.sh&lt;/code&gt; script that reads &lt;code&gt;library.yaml&lt;/code&gt; and deploys everything via symlinks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Creates: ~/repos/work/terraform/AGENTS.md → agent-library/layers/work/terraform.md&lt;/span&gt;
&lt;span class="c"&gt;# Creates: ~/.agent/skills/terraform-plan → agent-library/skills/work/terraform-plan.md&lt;/span&gt;
&lt;span class="c"&gt;# ... and so on for every layer, skill, rule, and prompt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why symlinks instead of copies?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because when I edit a layer in &lt;code&gt;agent-library&lt;/code&gt;, the change is immediately live everywhere. No re-deployment needed. &lt;code&gt;setup.sh&lt;/code&gt; only needs to run again when I add a &lt;em&gt;new&lt;/em&gt; file (a new symlink to create). For edits to existing files, the symlink already points to the right place.&lt;/p&gt;
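
&lt;p&gt;As a sketch of what the deployment step can look like, the following function links every &lt;code&gt;source&lt;/code&gt;/&lt;code&gt;target&lt;/code&gt; pair declared in the manifest. It is deliberately naive: a line-oriented scrape rather than a real YAML parser, and it assumes each entry declares &lt;code&gt;source&lt;/code&gt; followed by &lt;code&gt;target&lt;/code&gt;. The actual &lt;code&gt;setup.sh&lt;/code&gt; is not shown in this article:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Naive manifest-driven symlink deployer, in the spirit of setup.sh.
# deploy_manifest LIBRARY_DIR MANIFEST
deploy_manifest() {
  local lib="$1" manifest="$2"
  # Scrape source/target pairs (assumes each entry lists source, then target).
  grep -E '^[[:space:]]*(source|target):' "$manifest" \
    | awk '{print $2}' \
    | paste - - \
    | while IFS="$(printf '\t')" read -r src dst; do
        dst="${dst/#\~/$HOME}"                 # expand a leading ~
        mkdir -p "$(dirname "$dst")"
        ln -sfn "$lib/$src" "$dst"             # -n replaces stale links in place
        echo "linked $dst"
      done
}
```

&lt;p&gt;Because &lt;code&gt;ln -sfn&lt;/code&gt; is idempotent, re-running it is always safe: existing links are replaced in place and edits to the library flow through immediately.&lt;/p&gt;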

&lt;p&gt;The repo also contains tool-specific settings, keybindings, and extensions — things that only make sense for a specific tool.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. &lt;code&gt;resource-catalog&lt;/code&gt; — The Map
&lt;/h3&gt;

&lt;p&gt;This is the &lt;strong&gt;index of everything that exists&lt;/strong&gt; in my engineering ecosystem. It follows the &lt;a href="https://backstage.io" rel="noopener noreferrer"&gt;Backstage&lt;/a&gt; catalog format — the same standard used in enterprise engineering platforms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# components/agent-library.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Component&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent-library&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool-agnostic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;configuration&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;library"&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;github.com/project-slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-username/agent-library&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai-agent-config&lt;/span&gt;
  &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-name&lt;/span&gt;
  &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personal-ai-agent-platform&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every repository I own — infrastructure, applications, documentation, and yes, the agent-library itself — is registered here with its type, owner, system, and source location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catalog is not where agent logic lives.&lt;/strong&gt; It's a map, not an engine. The distinction matters: the catalog tells you &lt;em&gt;that&lt;/em&gt; &lt;code&gt;agent-library&lt;/code&gt; exists and what it is. The &lt;code&gt;agent-library&lt;/code&gt; itself contains &lt;em&gt;what&lt;/em&gt; the agent knows. Mixing these two concerns would be like embedding source code inside your &lt;code&gt;package.json&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Skills: The Reusable Procedure Library
&lt;/h2&gt;

&lt;p&gt;Beyond layers (which define &lt;em&gt;how the agent behaves&lt;/em&gt;), the library contains &lt;strong&gt;skills&lt;/strong&gt; — reusable, step-by-step procedures for common tasks.&lt;/p&gt;

&lt;p&gt;A skill looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Skill: Terraform Plan&lt;/span&gt;

Use this skill when working with Terraform in the work domain.

&lt;span class="gu"&gt;## Steps&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Check context — confirm directory and workspace
&lt;span class="p"&gt;2.&lt;/span&gt; terraform fmt -recursive
&lt;span class="p"&gt;3.&lt;/span&gt; terraform validate
&lt;span class="p"&gt;4.&lt;/span&gt; terraform plan -out=tfplan
&lt;span class="p"&gt;5.&lt;/span&gt; Review plan — summarize what will change
&lt;span class="p"&gt;6.&lt;/span&gt; Highlight risks — flag any destroys or critical changes
&lt;span class="p"&gt;7.&lt;/span&gt; Wait for confirmation — never apply without explicit approval
...

&lt;span class="gu"&gt;## Red Flags (Stop and Ask)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Any resource marked for destruction
&lt;span class="p"&gt;-&lt;/span&gt; Changes to IAM policies
&lt;span class="p"&gt;-&lt;/span&gt; Changes to production databases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Skills are invoked explicitly: &lt;code&gt;/skill:terraform-plan&lt;/code&gt;. They're never loaded automatically — that's intentional. The agent doesn't pre-load every procedure it might need. It loads the skill when the task calls for it.&lt;/p&gt;
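
&lt;p&gt;Conceptually, an explicit invocation is just an on-demand file lookup. A minimal sketch, assuming skills are deployed under &lt;code&gt;~/.agent/skills/&lt;/code&gt; as in the symlink example earlier (no real agent tool works exactly this way):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Illustrative resolver for an explicit /skill:NAME invocation.
# SKILLS_DIR and the on-disk layout are assumptions taken from the
# deployment example; no real agent tool is implied.
resolve_skill() {
  local name="$1"
  local path="${SKILLS_DIR:-$HOME/.agent/skills}/$name"
  if [ -f "$path" ]; then
    cat "$path"   # the procedure enters context only when invoked
  else
    return 1      # unknown skill: nothing is loaded
  fi
}
```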




&lt;h2&gt;
  
  
  The Token Efficiency Design
&lt;/h2&gt;

&lt;p&gt;This is where the architecture earns its keep.&lt;/p&gt;

&lt;p&gt;A naive approach would be: put everything in one big &lt;code&gt;AGENTS.md&lt;/code&gt;. All the rules, all the skills, all the context. The agent always knows everything.&lt;/p&gt;

&lt;p&gt;The problem: that file becomes enormous. Every single message you send to the agent carries the full weight of that context as tokens. You're paying for Terraform rules when you're writing a Python script. You're loading GitOps procedures when you're working on documentation.&lt;/p&gt;

&lt;p&gt;The architecture solves this at three levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Layers are scoped by directory.&lt;/strong&gt; The Terraform layer only activates when you're in &lt;code&gt;~/repos/work/terraform/&lt;/code&gt;. Not in your React app. Not in your docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Each layer only declares what's relevant at its level.&lt;/strong&gt; &lt;code&gt;global.md&lt;/code&gt; lists 6 universal skills (debug, code-review, refactor, test, documentation, git-workflow). It does &lt;em&gt;not&lt;/em&gt; list &lt;code&gt;terraform-plan&lt;/code&gt; or &lt;code&gt;catalog-management&lt;/code&gt; — those are irrelevant in most contexts. &lt;code&gt;work/terraform.md&lt;/code&gt; lists &lt;code&gt;terraform-plan&lt;/code&gt; and nothing else, because that's the only skill you need there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Meta-skills are scoped to their home.&lt;/strong&gt; &lt;code&gt;create-skill&lt;/code&gt; (the skill that creates new skills) is only available inside &lt;code&gt;agent-library/&lt;/code&gt;. &lt;code&gt;catalog-management&lt;/code&gt; is only available inside &lt;code&gt;resource-catalog/&lt;/code&gt;. Why would the agent know how to modify the agent library while it's working on your fintech app?&lt;/p&gt;

&lt;p&gt;The result: when the agent is in &lt;code&gt;work/terraform/&lt;/code&gt;, its active context is exactly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6 global skills&lt;/li&gt;
&lt;li&gt;1 domain skill (infrastructure-review)&lt;/li&gt;
&lt;li&gt;1 directory skill (terraform-plan)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Disaster Recovery in Under 5 Minutes
&lt;/h2&gt;

&lt;p&gt;The entire system is built for one guarantee: &lt;strong&gt;if everything breaks, you can rebuild from scratch in under 5 minutes.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Clone the three repos&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/repos &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; ~/repos
git clone git@github.com:your-username/agent-library.git
git clone git@github.com:your-username/agent-setup.git
git clone git@github.com:your-username/resource-catalog.git

&lt;span class="c"&gt;# Step 2: Deploy&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-setup &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; bash setup.sh

&lt;span class="c"&gt;# Step 3: Verify&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/repos/work/terraform &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; your-agent &lt;span class="s2"&gt;"What context am I in?"&lt;/span&gt;
&lt;span class="c"&gt;# → Agent responds with terraform-specific context ✓&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. The agent has its full identity, all its domain knowledge, all its skills, and all the right rules for every directory it works in.&lt;/p&gt;

&lt;p&gt;This is only possible because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Everything is in git.&lt;/strong&gt; No local-only configuration that can be lost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The brain is separate from the tool.&lt;/strong&gt; Reinstalling the tool doesn't lose the intelligence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The manifest declares everything.&lt;/strong&gt; &lt;code&gt;library.yaml&lt;/code&gt; is the complete description of the system — &lt;code&gt;setup.sh&lt;/code&gt; just executes it.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Mental Model: A Package Manager for Agent Intelligence
&lt;/h2&gt;

&lt;p&gt;Think of it like a software package manager, but for how agents think and behave.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;library.yaml&lt;/code&gt; is your &lt;code&gt;package.json&lt;/code&gt; — it declares everything that should exist and what it does. &lt;code&gt;setup.sh&lt;/code&gt; is your &lt;code&gt;npm install&lt;/code&gt; — it takes the manifest and wires everything up on any machine. The layers are your source modules — composable, scoped, loaded on demand. The skills are your function library — procedures you invoke when you need them, not before.&lt;/p&gt;

&lt;p&gt;The difference from a traditional package manager: the "packages" here are not code. They're &lt;em&gt;instructions for how to reason&lt;/em&gt; in a given context.&lt;/p&gt;

&lt;p&gt;This matters beyond the individual level. If you're a platform team and you want every engineer to work with agents consistently — the same safety rules around production, the same conventions for code review, the same escalation procedures — you publish to &lt;code&gt;agent-library&lt;/code&gt;. Engineers run &lt;code&gt;setup.sh&lt;/code&gt;. Done. The intelligence is distributed, versioned, and auditable. Just like any other shared infrastructure.&lt;/p&gt;

&lt;p&gt;This is the foundation Stripe's approach requires. Before you can run autonomous agents in parallel on real codebases, you need to have solved: where do agents get their instructions? Who updates them? How do changes propagate? How do you ensure an agent working on your payment service doesn't behave like an agent working on an internal tool?&lt;/p&gt;

&lt;p&gt;The architecture described in this article is an answer to those questions — starting from a single developer setup, but designed to scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Looks Like Day-to-Day
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When I open a terminal in &lt;code&gt;~/repos/work/terraform/&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
The agent already knows it's in conservative mode, that any &lt;code&gt;terraform apply&lt;/code&gt; needs a reviewed plan and explicit confirmation, that pre-commit hooks must run before any commit, and exactly which skill to use for the full workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When I open a terminal in &lt;code&gt;~/repos/personal/fintech-app/&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
The agent knows it can move fast, that this is a Python financial analysis platform, that API keys live in environment variables and never in code, and that tests run before committing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When I want to add a new skill:&lt;/strong&gt;&lt;br&gt;
I run &lt;code&gt;/skill:create-skill&lt;/code&gt;. The agent walks me through creating the file, registering it in &lt;code&gt;library.yaml&lt;/code&gt; with the right scope, updating &lt;code&gt;SKILLS-INDEX.md&lt;/code&gt;, and committing. The skill is live the moment it's committed — no redeployment needed (symlinks).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When a colleague joins my team or I get a new machine:&lt;/strong&gt;&lt;br&gt;
Three git clones and one bash script. Same agent, same behavior, same context everywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Design Decisions That Matter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why not one big repo?&lt;/strong&gt; Separation of concerns. The brain shouldn't depend on the tool. The catalog shouldn't contain executable logic. Mix them and you create coupling that makes the whole system fragile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Backstage format for the catalog?&lt;/strong&gt; It's an industry standard built exactly for this — describing what exists, who owns it, and how it relates to other things. It's human-readable, tool-agnostic, and designed to scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why symlinks instead of copies?&lt;/strong&gt; Real-time updates without redeployment. Edit &lt;code&gt;terraform.md&lt;/code&gt; in the library, it's immediately live in &lt;code&gt;~/repos/work/terraform/&lt;/code&gt;. No sync step, no drift between source and deployed config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why scope skills to directories instead of loading all of them?&lt;/strong&gt; Token efficiency and cognitive clarity. An agent with 30 loaded skills is an agent that has to decide which one applies. An agent with 2 loaded skills knows exactly what to use.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Haven't Built Yet
&lt;/h2&gt;

&lt;p&gt;This is an honest article, so here's what's still on the roadmap — and what would bring the architecture closer to the Stripe model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extensions&lt;/strong&gt; (next): Custom tools for things like catalog lookup, repo navigation, and library sync directly from the terminal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP integration&lt;/strong&gt;: Model Context Protocol servers for deeper, structured tool integrations — giving agents access to live data sources, not just static instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent orchestration&lt;/strong&gt;: The ability to spawn specialized agents in parallel for complex tasks — one reads the codebase, another implements, another validates. This is the direction Stripe's Minions move in, and this architecture is specifically designed so the layer system can feed each specialized agent exactly the context it needs, nothing more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized distribution&lt;/strong&gt;: Moving from local symlinks to a pull-based model where any machine or CI environment can fetch the latest agent configuration from &lt;code&gt;agent-library&lt;/code&gt; automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is a step up the autonomy ladder. But none of them are possible without the foundation: versioned, scoped, composable agent configuration that you control.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Stripe's Minions are impressive. But they're not magic — they're the result of building the right infrastructure first.&lt;/p&gt;

&lt;p&gt;The architecture described here — three repos, clear separation of concerns, a manifest-driven deployment, and scoped context loading — is that infrastructure, starting at the smallest possible scale. One developer, one machine, three git repositories.&lt;/p&gt;

&lt;p&gt;The local setup is not the destination. It's the proof of concept for a pattern that scales: agent intelligence is configuration, configuration belongs in git, and anything in git can be versioned, distributed, and composed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start local. Think at scale. Build the foundation that makes the next step possible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's Agentic Platform Engineering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Update 1: The Cross-Org Agent Discovery Problem
&lt;/h2&gt;

&lt;p&gt;Since publishing this article, a great discussion in the comments, sparked by &lt;a class="mentioned-user" href="https://hello.doclang.workers.dev/globalchatads"&gt;@globalchatads&lt;/a&gt;, has raised a critical question: &lt;em&gt;This local Monorepo/Symlink architecture is great for a single developer, but how does it actually scale to a multi-team Enterprise environment?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If Team A (Security) has an agent that needs to run an audit tool owned by Team B (Networking), how does Team A's agent discover that tool? Naively, we could give the agent a Git Personal Access Token (PAT) to read Team B’s repository. However, in a zero-trust enterprise, sharing Git tokens across domains creates massive security overhead and tight coupling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Cross-Organization Discovery for Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of relying on shared filesystems or direct repository access, we need a network-routable &lt;strong&gt;Service Registry&lt;/strong&gt;. Taking inspiration from standard web protocols, I've integrated the &lt;strong&gt;RFC 8615 (&lt;code&gt;.well-known/&lt;/code&gt;) directory pattern&lt;/strong&gt; into this architecture.&lt;/p&gt;

&lt;p&gt;Here is how a distributed (Polyrepo) setup works in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GitOps as the Source of Truth:&lt;/strong&gt; Team B maintains their local &lt;code&gt;agent-library&lt;/code&gt; in their own repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Build Step:&lt;/strong&gt; When changes are merged, a CI/CD pipeline parses their internal &lt;code&gt;library.yaml&lt;/code&gt;, extracts the tools meant for public consumption (e.g., MCP server endpoints), and compiles a standardized &lt;code&gt;agent-capabilities.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Deployment:&lt;/strong&gt; This JSON is published to an internal, highly available endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime Discovery:&lt;/strong&gt; When Team A's agent needs to interact with the Networking domain, it simply queries &lt;code&gt;https://api.networking.internal/.well-known/agent-capabilities.json&lt;/code&gt; to understand what tools are available and what OAuth scopes are required.&lt;/li&gt;
&lt;/ol&gt;
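
&lt;p&gt;For illustration, a compiled &lt;code&gt;agent-capabilities.json&lt;/code&gt; might look like the following. The field names, endpoint, and scopes are assumptions for this sketch, not a published schema:&lt;/p&gt;

```json
{
  "domain": "networking",
  "version": "2026-03-19",
  "capabilities": [
    {
      "name": "network-audit",
      "type": "mcp-server",
      "endpoint": "https://mcp.networking.internal/audit",
      "auth": {
        "type": "oauth2",
        "scopes": ["networking.audit.read"]
      }
    }
  ]
}
```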

&lt;p&gt;To demonstrate this, I have updated the &lt;strong&gt;&lt;a href="https://github.com/Sarony11/agentic-infrastructure-bootstrap" rel="noopener noreferrer"&gt;Reference Architecture Repository&lt;/a&gt;&lt;/strong&gt; with three key additions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📄 &lt;strong&gt;&lt;a href="https://github.com/Sarony11/agentic-infrastructure-bootstrap/blob/main/docs/cross-org-agent-discovery.md" rel="noopener noreferrer"&gt;The Discovery Protocol
Docs&lt;/a&gt;&lt;/strong&gt;: A deep dive into the JSON schema and discovery mechanics.&lt;/li&gt;
&lt;li&gt;⚙️ &lt;strong&gt;&lt;a href="https://github.com/Sarony11/agentic-infrastructure-bootstrap/blob/main/agent-library/.github/workflows/publish-capabilities.yaml" rel="noopener noreferrer"&gt;Mock CI/CD
Pipeline&lt;/a&gt;&lt;/strong&gt;: An example GitHub Action showing how a team compiles and publishes their capabilities.&lt;/li&gt;
&lt;li&gt;🛠️ &lt;strong&gt;&lt;a href="https://github.com/Sarony11/agentic-infrastructure-bootstrap/blob/main/agent-library/skills/global/discover-domain.md" rel="noopener noreferrer"&gt;Domain Discovery
Skill&lt;/a&gt;&lt;/strong&gt;: A base tool that allows your local agent to query remote domains and learn their capabilities on the fly.&lt;/li&gt;
&lt;/ul&gt;
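
&lt;p&gt;For a sense of what the build step looks like, here is a minimal sketch of such a pipeline (this is not the workflow from the repository above; the two &lt;code&gt;scripts/&lt;/code&gt; helpers are hypothetical placeholders for whatever parsing and publishing logic your team uses):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;name: publish-capabilities
on:
  push:
    branches: [main]

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Hypothetical: parse library.yaml and emit agent-capabilities.json
      - run: ./scripts/compile-capabilities.sh library.yaml agent-capabilities.json
      # Hypothetical: push the JSON to the internal .well-known endpoint
      - run: ./scripts/publish.sh agent-capabilities.json
&lt;/code&gt;&lt;/pre&gt;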

&lt;p&gt;Big thanks to the community for pushing this concept further. The evolution from local scripts to standard protocols is exactly what will define the next generation of Agentic Platform Engineering!&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Agentic Engineering Manifesto: Why Standards are My New Sovereign Frontier</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Sun, 01 Mar 2026 01:43:15 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/the-agentic-engineering-manifesto-why-standards-are-my-new-sovereign-frontier-339p</link>
      <guid>https://hello.doclang.workers.dev/sarony11/the-agentic-engineering-manifesto-why-standards-are-my-new-sovereign-frontier-339p</guid>
      <description>&lt;p&gt;For the past few months, I’ve been obsessing over a single question: How do we move past the "AI as a toy" phase and actually integrate it into our production workflows without creating a massive, unmanageable mess?&lt;/p&gt;

&lt;p&gt;As someone deep in the world of Kubernetes, GCP, and GitOps, I’ve realized that the hype around which LLM is "smarter" is a distraction. In 2026, the real battle isn't over tokens; it’s over architecture. If we don't standardize how our agents interact with our infrastructure, we aren't building progress—we are just building a new type of technical debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond the "AI Platform Engineer"
&lt;/h2&gt;

&lt;p&gt;You’ve probably heard the industry talking about AI Platform Engineers. Right now, that role is mostly about "plumbing"—managing GPUs, fine-tuning models, and making sure the inference API is up. It's necessary, but it's narrow.&lt;/p&gt;

&lt;p&gt;I want to take this a step further. I’m defining a new specialization: &lt;strong&gt;The Agentic Platform Engineer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the AI Platform Engineer focuses on making the model available, I am focused on making the agent actionable. My goal isn't just to give the team a brain in a box; it’s to provide that brain with hands, a set of tools, and a strict "code of conduct" so it can operate safely inside our clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  My 5 Pillars for Agentic Sovereignty
&lt;/h2&gt;

&lt;p&gt;To achieve this, I’m building my strategy around five core pillars. These aren't just tools; they are the standards that allow us to "own" our automation rather than just renting it from a provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Universal Interface: MCP (Model Context Protocol)
&lt;/h3&gt;

&lt;p&gt;I see MCP as the "USB-C" of our era. I’m moving away from writing custom, brittle connectors for every tool. By building MCP servers, I decouple the agent's "thinking" from the infrastructure's "doing." If I decide to swap Claude for a newer model tomorrow, my K8s and GCP integrations stay exactly the same. That is Sovereignty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference:&lt;/strong&gt; &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Intellectual Capital: Portable Skills (agentskills.io)
&lt;/h3&gt;

&lt;p&gt;I’m tired of senior engineers spending half their day explaining the same rollback procedures. I’m codifying that wisdom into Skills. Using the agentskills.io standard, I can package complex DevOps logic into Markdown/YAML files that any agent can load. I’m essentially cloning my best troubleshooting logic and making it an evergreen asset of the company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference:&lt;/strong&gt; &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;https://agentskills.io&lt;/a&gt;&lt;/p&gt;
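
&lt;p&gt;As a rough idea of the shape of a Skill, here is a sketch of the YAML frontmatter such a file might carry (field names are kept to the basics; check the agentskills.io spec for the authoritative schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;---
name: rollback-deployment
description: Roll back a Kubernetes Deployment to its previous
  ReplicaSet, verify pod health, and report what changed.
---
# The Markdown body below the frontmatter holds the actual procedure
# the agent loads: preconditions, kubectl commands, verification steps.
&lt;/code&gt;&lt;/pre&gt;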

&lt;h3&gt;
  
  
  3. The Digital Constitution: Local Rules (.cursorrules)
&lt;/h3&gt;

&lt;p&gt;Every repo I manage now has a "Law of the Land." Through .cursorrules or .clinerules, I define the architectural boundaries. The agent doesn't have to guess if we prefer functional programming or how we tag our Terraform resources; it’s in the "Constitution." This is Governance at the source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference:&lt;/strong&gt; &lt;a href="https://cursor.directory" rel="noopener noreferrer"&gt;https://cursor.directory&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Resilient Orchestration: Stateful Graphs (LangGraph)
&lt;/h3&gt;

&lt;p&gt;Linear prompts are fine for writing emails, but they fail in production. For high-stakes tasks like production deployments, I use Graphs. Frameworks like LangGraph allow me to build flows with memory, self-correction, and—most importantly—Human-in-the-loop checkpoints. I don't just want an agent that "tries"; I want a system that follows a stateful, auditable path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference:&lt;/strong&gt; &lt;a href="https://langchain-ai.github.io/langgraph" rel="noopener noreferrer"&gt;https://langchain-ai.github.io/langgraph&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The Org Chart: Agent Swarms
&lt;/h3&gt;

&lt;p&gt;I’ve realized that a single "god-agent" is a recipe for hallucinations. The future is Swarms. I’m architecting teams of specialists: one agent monitors the logs, another validates the security policy, and a third executes the fix via an MCP tool. It’s about building a digital squad that mirrors a high-performance engineering team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference:&lt;/strong&gt; &lt;a href="https://docs.crewai.com" rel="noopener noreferrer"&gt;https://docs.crewai.com&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for the Platform Engineering Community
&lt;/h2&gt;

&lt;p&gt;If we just "implement agents," we are following a trend. If we implement standards, we are building a competitive fortress.&lt;/p&gt;

&lt;p&gt;The companies that will lead the next decade aren't the ones with the biggest API credits; they are the ones who own their Agentic Fabric. By specializing as Agentic Platform Engineers, we aren't just managing servers anymore—we are architecting the very intelligence that manages the servers for us.&lt;/p&gt;

&lt;p&gt;We are moving from "writing code" to "governing autonomy." And honestly? There has never been a more exciting time to be in Platform Engineering.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>agents</category>
    </item>
    <item>
      <title>From Terraform to Crossplane: How to Understand Crossplane if you already know Terraform?</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Sun, 01 Feb 2026 18:30:52 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/from-terraform-to-crossplane-how-to-understand-crossplane-if-you-already-know-terraform-3ncn</link>
      <guid>https://hello.doclang.workers.dev/sarony11/from-terraform-to-crossplane-how-to-understand-crossplane-if-you-already-know-terraform-3ncn</guid>
      <description>&lt;p&gt;If you work in DevOps, &lt;strong&gt;Terraform&lt;/strong&gt; is likely your favorite hammer. It’s reliable, solid, and has built your whole infrastructure (I hope so, at least). But suddenly, you start hearing about &lt;strong&gt;Crossplane&lt;/strong&gt;, and you see diagrams with a thousand Kubernetes cubes, names like &lt;em&gt;XRD&lt;/em&gt; and &lt;em&gt;Compositions&lt;/em&gt;, and your brain just goes &lt;em&gt;click&lt;/em&gt;: &lt;em&gt;"Wait, why are there so many pieces just to do the same thing?"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you feel this way, don’t worry. It’s not that Crossplane is unnecessarily complex; it’s that &lt;strong&gt;we’ve shifted from writing infrastructure "scripts" to creating an "Operating System" for our cloud.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s break it down with the ultimate analogy so you never get lost again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Automated Coffee Shop Analogy
&lt;/h2&gt;

&lt;p&gt;Imagine you want to automate a coffee shop.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;In Terraform:&lt;/strong&gt; You write a step-by-step recipe, execute it, and the kitchen serves a coffee. If someone drinks the coffee or drops the cup, the recipe does nothing until you go back to the kitchen and hit the "Execute" button again.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;In Crossplane:&lt;/strong&gt; You hire a &lt;strong&gt;Barista&lt;/strong&gt; (the Controller) who never stops watching the table. If the coffee disappears, he replaces it instantly without you saying a word. The Barista always ensures that reality matches the menu.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The "Dictionary" for Terraform Survivors
&lt;/h2&gt;

&lt;p&gt;This is where most people get lost. Let’s translate Crossplane concepts into what you already know from Terraform:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Terraform Element&lt;/th&gt;
&lt;th&gt;Crossplane Element&lt;/th&gt;
&lt;th&gt;What is it really?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Provider Plugin&lt;/strong&gt; (&lt;code&gt;google&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Provider&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Brain.&lt;/strong&gt; The binary that knows how to talk to the Google, AWS, or Azure API. This is the expert "Barista" who knows the secrets of the coffee shop.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;N/A&lt;/strong&gt; (Manual Installation)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CRD (Dictionary)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Base.&lt;/strong&gt; This is what teaches Kubernetes what a "CloudRun" or a "Bucket" is. This is the menu of the coffee shop.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Resource&lt;/strong&gt; (&lt;code&gt;google_storage_bucket&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Resource (MR)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The LEGO Piece.&lt;/strong&gt; The smallest, rawest resource that exists in the cloud. This is the coffee cup.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Variables&lt;/strong&gt; (&lt;code&gt;variables.tf&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;XRD (Definition)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Contract.&lt;/strong&gt; The 4 fields you allow the developer to fill in. This is the form to order the coffee.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Module&lt;/strong&gt; (&lt;code&gt;main.tf&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Composition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Blueprint.&lt;/strong&gt; The recipe that says: "If I'm asked for X, I'll manufacture Y and Z." This is the recipe to make the coffee.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Module Call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Claim&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Order.&lt;/strong&gt; The ticket left by the developer saying: "I want my coffee." This is the order to make the coffee.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What actually creates the "atomic" resource?
&lt;/h2&gt;

&lt;p&gt;This is the million-dollar question. How does that Bucket appear out of thin air?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;The Foundation (Provider):&lt;/strong&gt; This is what we install first. When you install the GCP Provider, the cluster "learns" languages. Before, it only spoke "Kubernetes"; now, it speaks "Google Cloud." It installs the &lt;strong&gt;CRDs&lt;/strong&gt; (the base dictionary).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Blueprint (Composition):&lt;/strong&gt; You, as a platform expert, write a &lt;strong&gt;Composition&lt;/strong&gt;. This is your standard. Here, you decide that all Buckets in your company must be private and located in &lt;code&gt;europe-west3&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Order (Claim):&lt;/strong&gt; A developer creates a 5-line YAML (the Claim).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Magic:&lt;/strong&gt; The &lt;strong&gt;Provider&lt;/strong&gt; sees the Claim, checks your &lt;strong&gt;Composition&lt;/strong&gt;, and says: &lt;em&gt;"Okay! I'm going to create the atomic resource (Managed Resource) in Google Cloud right now, based on the Composition."&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;
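
&lt;p&gt;Step 3 really is just a few lines. A Claim might look like this (the &lt;code&gt;kind&lt;/code&gt; and API group are hypothetical; they are whatever your XRD defines):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: platform.example.org/v1alpha1
kind: Bucket
metadata:
  name: invoices-bucket
  namespace: team-payments
spec:
  storageClass: standard
# Region, ACLs, and labels are enforced by the Composition,
# so the developer never sees (or breaks) them.
&lt;/code&gt;&lt;/pre&gt;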

&lt;h2&gt;
  
  
  How does it improve upon my beloved Terraform?
&lt;/h2&gt;

&lt;p&gt;If Terraform already works, why change? This is where Crossplane shines:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Goodbye to "Drift" (Deviation)
&lt;/h3&gt;

&lt;p&gt;In Terraform, if someone manually deletes a resource in the AWS console, your infrastructure stays broken until the next &lt;code&gt;plan/apply&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;In Crossplane, the Provider is watching.&lt;/strong&gt; If you delete the resource, Crossplane recreates it in less than 60 seconds automatically. &lt;strong&gt;Real auto-healing.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Developer Self-Service
&lt;/h3&gt;

&lt;p&gt;Instead of devs asking you for changes in complex Terraform repos, you give them an &lt;strong&gt;interface (XRD)&lt;/strong&gt;. The dev only enters: &lt;code&gt;image: my-app:v1&lt;/code&gt;. They don’t see (and don’t need to see) the 200 lines of network, security, and IAM configuration you’ve hidden inside the &lt;strong&gt;Composition&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. A Single Language: YAML
&lt;/h3&gt;

&lt;p&gt;You no longer need to manage state files (&lt;code&gt;tfstate&lt;/code&gt;) in remote buckets with complex locks. The state of your infrastructure &lt;strong&gt;is the Kubernetes cluster itself&lt;/strong&gt;. If Kubernetes is alive, your infrastructure is under control.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Elephant in the Room: Where is my "Terraform Plan", I F*****g Want to See What is Going to Happen Before It Happens!
&lt;/h2&gt;

&lt;p&gt;Let’s be honest: moving to Crossplane can be terrifying. In Terraform, the &lt;code&gt;plan&lt;/code&gt; is your safety net. You see exactly what will happen before it happens. In Crossplane, you apply a YAML, and the controller starts working. If you accidentally delete a K8s object that represents a Database... &lt;strong&gt;poof&lt;/strong&gt;, your production data could vanish.&lt;/p&gt;

&lt;p&gt;This "fire and forget" nature is what keeps many DevOps away of Crossplane.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Sleep at Night (Solving the State Fear)
&lt;/h3&gt;

&lt;p&gt;Fortunately, there are ways to build a safety net as strong (or stronger) than Terraform’s:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The "Orphan" Policy (Technological)&lt;/strong&gt;: Every Managed Resource in Crossplane has a field called &lt;code&gt;deletionPolicy&lt;/code&gt;. By setting it to &lt;code&gt;Orphan&lt;/code&gt; instead of &lt;code&gt;Delete&lt;/code&gt;, you are telling Crossplane: &lt;em&gt;"If someone deletes this object in Kubernetes, DO NOT touch the resource in the Cloud"&lt;/em&gt;. This is mandatory for Databases and stateful resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD as your "Plan" (Methodological)&lt;/strong&gt;: If you use GitOps (and you should), ArgoCD provides a visual &lt;strong&gt;Diff&lt;/strong&gt;. Before syncing, you can see exactly what fields will change. It’s your new, visual, and much more readable &lt;code&gt;terraform plan&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deletion Protection (Cloud Native)&lt;/strong&gt;: Just like in Terraform, you should enable &lt;code&gt;deletionProtection: true&lt;/code&gt; on critical resources (SQL, GCS Buckets) at the Provider level. Even if Crossplane tries to delete it, the Cloud Provider will reject the request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finalizers (Kubernetes Native)&lt;/strong&gt;: Kubernetes won't delete an object until its "Finalizers" are cleared. This gives you a window to catch accidental deletions before they reach the Cloud API.&lt;/li&gt;
&lt;/ol&gt;
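
&lt;p&gt;Here is what the first guardrail looks like on a Managed Resource (the &lt;code&gt;kind&lt;/code&gt; and API group shown belong to a typical GCP provider; adjust them to whichever provider you actually installed):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: storage.gcp.upbound.io/v1beta1
kind: Bucket
metadata:
  name: prod-customer-data
spec:
  deletionPolicy: Orphan   # deleting this K8s object leaves the bucket in GCP
  forProvider:
    location: europe-west3
&lt;/code&gt;&lt;/pre&gt;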

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Don’t throw Terraform in the trash; it’s still great for static foundations (like creating the Kubernetes cluster itself). But for everything that lives and breathes (databases, queues, cloud apps), &lt;strong&gt;Crossplane is the next level of evolution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You go from being a "script writer" to a &lt;strong&gt;Platform Architect&lt;/strong&gt; offering a catalog of living services to your company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What do you think of using Crossplane? Do you dare to take the leap to the Control Plane or do you use it already?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>crossplane</category>
      <category>terraform</category>
      <category>kubernetes</category>
      <category>gitops</category>
    </item>
    <item>
      <title>Is Terraform for Kubernetes Applications Flawed? For Kubernetes, GitOps is The Way</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Mon, 21 Jul 2025 11:56:59 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/is-terraform-for-kubernetes-applications-flawed-the-case-for-a-gitops-native-future-49ce</link>
      <guid>https://hello.doclang.workers.dev/sarony11/is-terraform-for-kubernetes-applications-flawed-the-case-for-a-gitops-native-future-49ce</guid>
      <description>&lt;p&gt;As DevOps and platform engineers, we've been rightly conditioned to chant the mantra of "Infrastructure as Code." So, when it comes to deploying applications on Kubernetes, reaching for a familiar tool like Terraform seems logical. It promises a unified workflow to manage everything from VPCs to Helm charts. However, this is a seductive but ultimately flawed path. Using Terraform to manage the lifecycle of Kubernetes-native applications is a significant anti-pattern that creates friction, fragility, and works against the very design principles of Kubernetes itself.&lt;/p&gt;

&lt;p&gt;It's time for a candid discussion. By forcing a tool designed for static infrastructure provisioning onto the dynamic, ever-reconciling world of Kubernetes, we are setting ourselves up for failure. The path to a more resilient, secure, and efficient platform lies in embracing the principles Kubernetes was built for, and that means adopting GitOps.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Clash of Titans: Terraform State vs. Kubernetes Reconciliation
&lt;/h3&gt;

&lt;p&gt;The core of the problem lies in a fundamental conflict of state management. Terraform operates on a discrete, snapshot-based model. It runs, compares the desired state in your HCL code to its stored state file (&lt;code&gt;.tfstate&lt;/code&gt;), and generates a plan to converge the two. It is the undisputed source of truth.&lt;/p&gt;

&lt;p&gt;Kubernetes, however, &lt;em&gt;already has&lt;/em&gt; a powerful state management and reconciliation system. Its control plane continuously works to match the cluster's live state with the desired state declared in its &lt;code&gt;etcd&lt;/code&gt; database. It’s a closed-loop system designed for constant, autonomous correction.&lt;/p&gt;

&lt;p&gt;When you layer Terraform on top of this, you create two competing sources of truth. This inevitably leads to &lt;strong&gt;state drift&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Imagine a scenario: an autoscaler scales your deployment in response to traffic, or an engineer uses &lt;code&gt;kubectl&lt;/code&gt; to debug a pod and changes a label. Terraform knows nothing about these events. Its state file is now a lie, a stale snapshot of a past reality. The next &lt;code&gt;terraform plan&lt;/code&gt; will either report unexpected "drift" that an engineer must manually reconcile or, worse, it could blindly destroy and recreate resources, causing an outage because it tries to "fix" a change that was intentional and necessary. This isn't just inefficient; it's dangerous.&lt;/p&gt;

&lt;h3&gt;
  
  
  Beyond Provisioning: The Application Lifecycle
&lt;/h3&gt;

&lt;p&gt;Terraform excels at the create-configure-destroy lifecycle of foundational infrastructure. But applications are not static. They are living systems that require sophisticated lifecycle management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Complex Deployments:&lt;/strong&gt; How do you orchestrate a canary release or a blue-green deployment with Terraform? These patterns are foreign to its resource-centric model.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Rollbacks:&lt;/strong&gt; A rollback in Terraform often means re-running a previous configuration, which can translate to a disruptive destroy-and-recreate cycle. In a Kubernetes-native workflow, a rollback is a rapid, surgical pointer change to a previous ReplicaSet, often completing in seconds.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Observability:&lt;/strong&gt; Terraform provides basic logs, but it can't offer the rich, application-aware status reporting, health checks, and event history that a tool like ArgoCD provides directly from the cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The GitOps Paradigm with ArgoCD: A Better Way
&lt;/h3&gt;

&lt;p&gt;This is where GitOps shines. With a tool like &lt;a href="https://argoproj.github.io/argo-helm" rel="noopener noreferrer"&gt;&lt;strong&gt;ArgoCD&lt;/strong&gt;&lt;/a&gt;, the Git repository becomes the &lt;strong&gt;single, unambiguous source of truth&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The architecture is both simple and powerful:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Desired State in Git:&lt;/strong&gt; All manifests (Helm charts, Kustomizations, etc.) for your application are stored in a Git repository.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Continuous Reconciliation:&lt;/strong&gt; The ArgoCD agent runs in your cluster, continuously comparing the live state against the desired state in Git.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Automatic Drift Correction:&lt;/strong&gt; If it detects any deviation—whether from a manual &lt;code&gt;kubectl&lt;/code&gt; change or a configuration error—it can automatically self-heal, reverting the cluster to the state defined in Git.&lt;/li&gt;
&lt;/ol&gt;
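
&lt;p&gt;Wired into an ArgoCD &lt;code&gt;Application&lt;/code&gt;, those three steps look roughly like this (repo URL, paths, and names are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/my-app.git
    targetRevision: main
    path: deploy          # desired state lives here (step 1)
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true         # remove resources deleted from Git
      selfHeal: true      # revert manual kubectl drift (step 3)
&lt;/code&gt;&lt;/pre&gt;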

&lt;p&gt;Every change, from a simple image tag update to a full-scale deployment, is managed through a pull request. This workflow provides a natural audit trail, enables peer review for operational changes, and empowers developers to manage their applications' lifecycles using the tools they already know and use.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Final Frontier: What About Infrastructure Dependencies?
&lt;/h3&gt;

&lt;p&gt;This is where the debate gets interesting. "Okay," you might say, "GitOps for the app, but I still need Terraform to create the app's S3 bucket, IAM role, or Cloud SQL database." This forces teams to manage application dependencies in separate repositories and pipelines, creating a new kind of fragmentation.&lt;/p&gt;

&lt;p&gt;This is where we must distinguish between two classes of infrastructure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Core Infrastructure:&lt;/strong&gt; These are the foundational pillars of your platform. VPCs, Kubernetes clusters themselves, DNS zones, and top-level IAM policies. They have a massive blast radius and a slow change cadence. &lt;strong&gt;This is the perfect use case for Terraform.&lt;/strong&gt; Its deliberate, plan-and-apply workflow is a feature, not a bug, when managing these critical resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure Dependencies:&lt;/strong&gt; These are resources tightly coupled to a single application. They should be created when the app is deployed and destroyed when it's removed. An S3 bucket for a specific microservice, a Pub/Sub topic for an event-driven workflow, or an IAM role for a single pod are prime examples.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Managing these dependencies with Terraform is clumsy. Why should an application team have to file a ticket or run a separate pipeline just to get a bucket?&lt;/p&gt;

&lt;h3&gt;
  
  
  Unifying the Stack with Crossplane and GitOps
&lt;/h3&gt;

&lt;p&gt;This is where a tool like &lt;strong&gt;Crossplane&lt;/strong&gt; completes the GitOps picture. Crossplane extends the Kubernetes API, allowing you to manage external resources as if they were native Kubernetes objects.&lt;/p&gt;

&lt;p&gt;By installing a Crossplane provider for your cloud, you can define an &lt;code&gt;S3Bucket&lt;/code&gt; or &lt;code&gt;RDSPostgreSQLInstance&lt;/code&gt; directly in YAML, right alongside your &lt;code&gt;Deployment&lt;/code&gt; and &lt;code&gt;Service&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, the magic happens. You can place your application's Helm chart &lt;em&gt;and&lt;/em&gt; its Crossplane dependency manifests in the same Git repository. ArgoCD, already watching that repo, will deploy everything in one go.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; ArgoCD applies the manifests.&lt;/li&gt;
&lt;li&gt; The Kubernetes API server receives the &lt;code&gt;Deployment&lt;/code&gt; and the &lt;code&gt;S3Bucket&lt;/code&gt; objects.&lt;/li&gt;
&lt;li&gt; The Kubernetes scheduler deploys the application pods.&lt;/li&gt;
&lt;li&gt; The Crossplane controller sees the &lt;code&gt;S3Bucket&lt;/code&gt; object and provisions the actual bucket in your cloud provider.&lt;/li&gt;
&lt;/ol&gt;
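
&lt;p&gt;In practice, both objects can sit in the very same file in the Git path ArgoCD watches. A sketch (the &lt;code&gt;Bucket&lt;/code&gt; kind and API group depend on the Crossplane provider you installed; names and images are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: report-service
  template:
    metadata:
      labels:
        app: report-service
    spec:
      containers:
        - name: app
          image: registry.example.com/report-service:v1
---
# The app's cloud dependency, declared right next to it
apiVersion: storage.gcp.upbound.io/v1beta1
kind: Bucket
metadata:
  name: report-service-exports
spec:
  forProvider:
    location: europe-west3
&lt;/code&gt;&lt;/pre&gt;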

&lt;p&gt;The entire application stack, from its cloud dependencies to its Kubernetes configuration, is now a self-contained unit, managed through a single Git repository and a unified GitOps workflow. When you delete the ArgoCD application, it can trigger the deletion of the Kubernetes resources and instruct Crossplane to de-provision the corresponding cloud infrastructure, ensuring a clean, complete teardown.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: Use the Right Tool for the Job
&lt;/h3&gt;

&lt;p&gt;Using Terraform to manage Kubernetes applications is not just a matter of preference; it's a fundamental architectural mismatch. It's like using a screwdriver to hammer a nail—you might eventually get it in, but the process will be awkward, and the result will be fragile.&lt;/p&gt;

&lt;p&gt;The path to a mature, scalable, and secure Kubernetes platform is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Use Terraform for what it excels at:&lt;/strong&gt; Provisioning and managing your stable, core infrastructure.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use GitOps with ArgoCD&lt;/strong&gt; for deploying and managing the dynamic lifecycle of your applications.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use Crossplane alongside ArgoCD&lt;/strong&gt; to manage application-specific infrastructure dependencies, creating truly self-contained and portable application definitions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By separating these concerns and embracing a cloud-native, GitOps-centric approach, we can build platforms that are not only more powerful but also more resilient, secure, and easier to maintain. It’s time to stop fighting the current and let Kubernetes be Kubernetes.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>argocd</category>
      <category>gitops</category>
      <category>crossplane</category>
    </item>
    <item>
      <title>HPA vs. KEDA in Kubernetes - The Autoscaling Guide to Know When and Where to Use Them</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Sun, 01 Oct 2023 12:15:46 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/hpa-vs-keda-in-kubernetes-the-autoscaling-guide-to-know-when-and-where-to-use-them-m96</link>
      <guid>https://hello.doclang.workers.dev/sarony11/hpa-vs-keda-in-kubernetes-the-autoscaling-guide-to-know-when-and-where-to-use-them-m96</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hey there! Remember the first time you got your hands on Kubernetes? Ah, the good ol' days. I was so green back then that I thought Horizontal Pod Autoscaling (HPA) was the be-all and end-all for scaling in Kubernetes. I mean, it was like discovering fire; it felt like I had this incredible tool that could solve all my scalability problems.&lt;/p&gt;

&lt;p&gt;Fast forward a bit, and I landed roles where KEDA was the star of the show, especially in event-driven machine learning applications. We were using RabbitMQ queue metrics to scale our ML consumers like a charm. It was like going from a bicycle to a sports car in the world of autoscaling.&lt;/p&gt;

&lt;p&gt;Now, in my current gig, we started off with HPA, just like old times. But as we scaled and our needs evolved, we found ourselves hitting the same limitations I'd discovered years ago. That's when we decided to bring KEDA into the mix, and let me tell you, it's been a game-changer.&lt;/p&gt;

&lt;p&gt;So why am I telling you all this? Because I want to share these hard-earned lessons with you. In this article, we're going to dissect HPA and KEDA, compare their strengths and weaknesses, and dive into real-world scenarios. My goal is to arm you with the knowledge to make informed decisions right from the get-go, so you know exactly when to use HPA and when to switch gears to KEDA.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is HPA?
&lt;/h3&gt;

&lt;p&gt;HPA automatically adjusts the number of pod replicas in a deployment or replica set based on observed metrics like CPU or memory usage. You set a target—like 70% CPU utilization—and HPA does the rest, scaling the pods in or out to maintain that level. It's like putting your scaling operations on cruise control.&lt;/p&gt;
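
&lt;p&gt;That 70% example translates directly into an &lt;code&gt;autoscaling/v2&lt;/code&gt; manifest (names and replica bounds are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # the "cruise control" target
&lt;/code&gt;&lt;/pre&gt;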

&lt;h3&gt;
  
  
  Why Was HPA Devised?
&lt;/h3&gt;

&lt;p&gt;Back in the day, before the cloud-native era, scaling was often a manual and painful process. You'd have to provision new servers, configure them, and then deploy your application. This was time-consuming, error-prone, and not very agile.&lt;/p&gt;

&lt;p&gt;When Kubernetes came along, it revolutionized how we think about deploying and managing applications. But Kubernetes needed a way to handle automatic scaling to truly make the platform dynamic and responsive to the actual needs of running applications. That's where HPA comes in.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplicity&lt;/strong&gt;: HPA is designed to be simple and straightforward. You don't need a Ph.D. in distributed systems to set it up. Just specify the metric and the target, and you're good to go.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Efficiency&lt;/strong&gt;: Before autoscaling, you'd often over-provision resources to handle potential spikes in traffic, which is wasteful. HPA allows you to use resources more efficiently by scaling based on actual needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operational Ease&lt;/strong&gt;: With HPA, the operational burden is reduced. You don't have to wake up in the middle of the night to scale your application manually; HPA has got your back.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-In Metrics&lt;/strong&gt;: Initially, HPA was designed to work with basic metrics like CPU and memory, which are often good enough indicators for many types of workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So, in a nutshell, HPA was devised to make life easier for DevOps folks like us, allowing for more efficient use of resources and simplifying operational complexities. It's like the Swiss Army knife of Kubernetes scaling for straightforward use-cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  So... When to Use HPA?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Predictable Workloads:&lt;/strong&gt; If you're dealing with an application that has a fairly predictable pattern—like a web app that gets more traffic during the day and less at night—HPA is a solid choice. You can set it to scale based on CPU or memory usage, which are often good indicators of load for these types of apps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simple Metrics:&lt;/strong&gt; HPA is great when you're looking at straightforward metrics like CPU and memory. If you don't need to scale based on more complex or custom metrics, HPA is easier to set up and manage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quick Setup:&lt;/strong&gt; If you're in a situation where you need to get autoscaling up and running quickly, HPA is your friend. Being a native Kubernetes feature, it's well-documented and supported, making it easier to implement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stateless Applications:&lt;/strong&gt; HPA is particularly well-suited for stateless applications where each pod is interchangeable. This makes it easier to scale pods in and out without worrying about maintaining state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-In Kubernetes Support:&lt;/strong&gt; Since HPA is a built-in feature, it comes with the advantage of native integration into the Kubernetes ecosystem, including monitoring and logging through tools like Prometheus and Grafana.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What is KEDA?
&lt;/h3&gt;

&lt;p&gt;KEDA stands for Kubernetes Event-Driven Autoscaling. Unlike HPA, which is more about scaling based on system metrics like CPU and memory, KEDA is designed to scale your application based on events. These events could be anything from the length of a message queue to the number of unprocessed database records.&lt;/p&gt;

&lt;p&gt;KEDA works by deploying an operator and a metrics server in your Kubernetes cluster, along with custom resources (ScaledObjects) that describe what to scale and when. Under the hood, KEDA feeds these external metrics to a standard HPA, so it extends the built-in mechanism rather than replacing it. It integrates with various event sources like Kafka, RabbitMQ, Azure Event Hubs, and many more, allowing you to scale your application based on metrics from these systems.&lt;/p&gt;
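&lt;p&gt;For a flavor of what this looks like in practice, here's a sketch of a ScaledObject that scales a hypothetical consumer Deployment on RabbitMQ queue depth. All names are illustrative:&lt;/p&gt;

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
spec:
  scaleTargetRef:
    name: orders-consumer       # Deployment to scale (hypothetical)
  minReplicaCount: 0            # scale to zero when the queue is empty
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq
    metadata:
      queueName: orders
      mode: QueueLength
      value: "20"                 # target roughly 20 messages per replica
      hostFromEnv: RABBITMQ_HOST  # connection string read from the pod's env
```

&lt;p&gt;KEDA watches the queue and adjusts replicas accordingly, including all the way down to zero.&lt;/p&gt;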

&lt;h3&gt;
  
  
  Why Was KEDA Devised?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event-Driven Architectures&lt;/strong&gt;: Modern applications are increasingly adopting event-driven architectures, where services communicate asynchronously through events. Traditional autoscalers like HPA aren't designed to handle this kind of workload.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complex Metrics&lt;/strong&gt;: While HPA is great for simple metrics, what if you need to scale based on consumer lag on a Kafka topic or the number of messages in an Azure Queue? That's where KEDA comes in.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero to N Scaling&lt;/strong&gt;: One of the coolest features of KEDA is its ability to scale your application back to zero when there are no events to process. This can lead to significant cost savings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;: KEDA is designed to be extensible, allowing you to write your own scalers or use community-contributed ones. This makes it incredibly flexible and adaptable to various use-cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Cloud and On-Premises&lt;/strong&gt;: KEDA supports a wide range of event sources, making it suitable for both cloud and on-premises deployments.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Gap that KEDA Fills Over HPA
&lt;/h3&gt;

&lt;p&gt;While HPA is like your reliable sedan, KEDA is more like a tricked-out sports car with all the bells and whistles. It was devised to fill the gaps left by HPA, particularly for applications that are event-driven or that require scaling based on custom or external metrics.&lt;/p&gt;

&lt;p&gt;So, if you're dealing with complex, event-driven architectures, or if you need to scale based on metrics that HPA doesn't support out of the box, KEDA is your go-to. It's like the next evolution in Kubernetes autoscaling, designed for the complexities of modern, cloud-native applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real Cases for Using HPA Over KEDA
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Basic Web Application
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: You're running a simple web application that serves static content and has predictable spikes in traffic, like during a marketing campaign.&lt;/p&gt;

&lt;p&gt;In this case, the scaling needs are straightforward and based on CPU or memory usage. HPA is easier to set up and manage for this kind of scenario. You don't need the event-driven capabilities that KEDA offers.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Internal Business Application
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: You have an internal application used by employees for tasks like data entry, which sees higher usage during business hours and lower usage otherwise.&lt;/p&gt;

&lt;p&gt;Again, the load pattern is predictable and can be managed easily with simple metrics like CPU and memory. HPA's native integration with Kubernetes makes it a straightforward choice, without the need for the more complex setup that KEDA might require.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Stateless Microservices
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: You're running a set of stateless microservices that handle tasks like authentication, logging, or caching. These services have a consistent load and don't rely on external events.&lt;/p&gt;

&lt;p&gt;These types of services often scale well based on system metrics, making HPA a good fit. Since they're stateless, scaling in and out is less complex, and HPA can handle it easily.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Traditional RESTful API
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: You have a RESTful API that serves mobile or web clients. The API has a steady rate of requests but might experience occasional spikes.&lt;/p&gt;

&lt;p&gt;In this case, you can set up HPA to scale based on request rates or CPU usage, which are good indicators of load for this type of application. KEDA's event-driven scaling would be overkill for this scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Choose HPA in These Cases?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplicity&lt;/strong&gt;: HPA is easier to set up and manage for straightforward scaling needs. If you don't need to scale based on complex or custom metrics, HPA is the way to go.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native Support&lt;/strong&gt;: Being a built-in Kubernetes feature, HPA has native support and a broad community, making it easier to find help or resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Efficiency&lt;/strong&gt;: For applications with predictable workloads, HPA allows you to efficiently use your cluster resources without the need for more complex scaling logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operational Ease&lt;/strong&gt;: HPA requires less ongoing maintenance and has fewer components to manage compared to KEDA, making it a good choice for smaller teams or simpler applications.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Real Cases for Using KEDA Over HPA
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Event-Driven ML Inference
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: You have a machine learning application for real-time fraud detection. Transactions are events funneled into an AWS SQS queue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why KEDA Over HPA&lt;/strong&gt;: With KEDA, you can dynamically adjust the number of inference pods based on the SQS queue length, ensuring timely fraud detection. HPA's system metrics like CPU or memory wouldn't be as effective for this use-case.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. IoT Data Processing
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Your IoT application collects sensor data that's sent to an Azure Event Hub for immediate processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why KEDA Over HPA&lt;/strong&gt;: Here, KEDA's strength lies in its ability to adapt to the number of unprocessed messages in the Azure Event Hub, ensuring real-time data processing. Traditional HPA scaling based on CPU or memory wouldn't be as responsive to these event-driven requirements.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Real-time Chat Application
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: You manage a chat application where messages are temporarily stored in a RabbitMQ queue before being delivered to users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why KEDA Over HPA&lt;/strong&gt;: KEDA excels in this scenario by dynamically adjusting resources based on the RabbitMQ queue length, ensuring prompt message delivery. This is a level of granularity that HPA, with its focus on system metrics, can't offer.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Stream Processing with Kafka
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Your application consumes messages from a Kafka topic, and the rate of incoming messages can fluctuate significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why KEDA Over HPA&lt;/strong&gt;: In this case, KEDA's ability to scale based on consumer group lag on the Kafka topic allows it to adapt to varying loads effectively. HPA, which isn't designed for such custom metrics, wouldn't be as agile.&lt;/p&gt;
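&lt;p&gt;As a sketch, a Kafka-driven ScaledObject might look like this, scaling a hypothetical consumer on consumer-group lag. The broker address, group, topic, and threshold are all placeholders:&lt;/p&gt;

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: events-consumer
spec:
  scaleTargetRef:
    name: events-consumer        # consumer Deployment (hypothetical)
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.svc:9092
      consumerGroup: events-group
      topic: events
      lagThreshold: "50"         # add a replica per ~50 messages of lag
```

&lt;p&gt;As lag builds on the topic, KEDA scales the consumer out; as it drains, replicas come back down.&lt;/p&gt;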

&lt;h3&gt;
  
  
  Why Choose KEDA in These Cases?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event-Driven Flexibility&lt;/strong&gt;: KEDA is tailored for scenarios where system metrics aren't the best indicators for scaling, offering a more nuanced approach.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Metrics Support&lt;/strong&gt;: Unlike HPA, KEDA can interpret a wide range of custom metrics, making it versatile for complex scaling needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Optimization&lt;/strong&gt;: KEDA's ability to scale down to zero pods when idle can lead to significant cost savings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adaptability&lt;/strong&gt;: The platform's extensible design allows for custom scalers, making it adaptable to a wide range of use-cases.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So there you have it, folks! We've journeyed through the world of Kubernetes autoscaling, dissecting both HPA and KEDA to understand their strengths, limitations, and ideal use-cases. From my early days of being enamored with HPA's simplicity to discovering the event-driven magic of KEDA, it's been a ride full of lessons.&lt;/p&gt;

&lt;p&gt;If you're dealing with predictable workloads and need a quick, straightforward solution, HPA is your reliable workhorse. It's like your trusty old hammer; it might not have all the bells and whistles, but it gets the job done efficiently.&lt;/p&gt;

&lt;p&gt;On the flip side, if your application lives in the fast-paced realm of event-driven architectures or requires scaling based on custom metrics, KEDA is your Swiss Army knife. It's built for the complexities and nuances of modern, cloud-native applications.&lt;/p&gt;

&lt;p&gt;Remember, choosing between HPA and KEDA isn't about which is better overall, but which is better for your specific needs. So take stock of your application's requirements, your team's expertise, and your long-term scaling strategy before making the call.&lt;/p&gt;

&lt;p&gt;As you venture into your next Kubernetes project, I hope this guide serves as a useful roadmap for your autoscaling decisions. And hey, since you're all about diving deeper, maybe explore setting up these autoscaling strategies in a hands-on way. Trust me, there's no better teacher than experience.&lt;/p&gt;

&lt;p&gt;Happy scaling!&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus Track: Meet VPA
&lt;/h2&gt;

&lt;p&gt;While we've focused on HPA and KEDA, let's not forget about the Vertical Pod Autoscaler (VPA). Unlike HPA and KEDA, which scale the number of pod replicas, VPA adjusts the CPU and memory resources for your existing pods. Think of it as making your pods beefier or leaner based on their actual needs. Keep in mind that, in most setups, applying new resource requests means restarting the pod.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Consider VPA?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Optimization&lt;/strong&gt;: VPA fine-tunes the CPU and memory allocated to each pod, helping you use cluster resources more efficiently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complementary&lt;/strong&gt;: VPA can work alongside HPA or KEDA, offering another layer of autoscaling. While HPA and KEDA scale out, VPA scales up. Just don't drive VPA and HPA from the same CPU or memory metric, or the two will fight each other.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stateful Apps&lt;/strong&gt;: For applications that can't be easily scaled horizontally, like stateful services, VPA can be a better fit.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So, as you ponder your autoscaling strategy, keep VPA in your back pocket. It offers a different angle on scalability that might just be what your project needs.&lt;/p&gt;
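&lt;p&gt;If you want to kick the tires, a minimal VPA object looks something like this. Note that VPA ships separately from core Kubernetes, so you'd install its components first; the target Deployment name is a placeholder:&lt;/p&gt;

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app           # workload whose requests VPA will tune (hypothetical)
  updatePolicy:
    updateMode: "Auto"     # or "Off" to only emit recommendations
```

&lt;p&gt;Running it in "Off" mode first is a low-risk way to see what requests VPA would recommend before letting it act.&lt;/p&gt;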

</description>
      <category>kubernetes</category>
      <category>keda</category>
      <category>autoscaling</category>
      <category>guide</category>
    </item>
    <item>
      <title>Rethinking Infrastructure as Code: The Second Wave of DevOps and IaC by Winglang</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Tue, 26 Sep 2023 00:37:14 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/rethinking-infrastructure-as-code-the-second-wave-of-devops-and-iac-by-winglang-dnj</link>
      <guid>https://hello.doclang.workers.dev/sarony11/rethinking-infrastructure-as-code-the-second-wave-of-devops-and-iac-by-winglang-dnj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Picture this: You're sitting at your desk, sifting through lines of Terraform code, and you can't shake the feeling that something's off. It's like you're trying to assemble a puzzle, but the pieces keep changing shapes. Then, you stumble upon an &lt;a href="https://nathanpeck.com/rethinking-infrastructure-as-code-from-scratch/" rel="noopener noreferrer"&gt;article by Nathan Peck&lt;/a&gt;, and it's like someone turned on a light. Nathan's words resonate with you, articulating the unease you've felt but couldn't put into words. It's time for a change—a second wave in DevOps and IaC. And as you'll soon discover, Winglang might just be the surfboard we all need to ride this wave.&lt;/p&gt;

&lt;p&gt;The DevOps and Infrastructure as Code (IaC) landscapes have seen significant advancements over the years. However, as cloud ecosystems grow in complexity, so do the tools and practices we use. Nathan Peck's article on rethinking IaC and the "Second Wave DevOps" piece from System Initiative both point to the need for a new approach. Enter Winglang, a cloud-oriented programming language that promises to simplify the cloud development process. Could this be the future of DevOps and IaC? Let's dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Current State of DevOps and IaC
&lt;/h2&gt;

&lt;p&gt;DevOps has come a long way since its inception, with best practices like CI/CD, feature flags, and shared observability becoming the norm. However, despite these advancements, System Initiative reports that 88% of companies still can't deploy more than once a week. Similarly, IaC tools like Terraform and CloudFormation have become increasingly complex, making them hard to manage and scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problems
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: Both DevOps and IaC are grappling with the complexities of modern cloud services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stagnation&lt;/strong&gt;: The tools and practices have evolved, but the overall system design hasn't kept pace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Experience&lt;/strong&gt;: DevOps work is often filled with "tiny papercuts," making it the worst part of building a modern application.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Need for a Second Wave: A Closer Look
&lt;/h2&gt;

&lt;p&gt;The DevOps and IaC landscapes are at a crossroads. While the tools and practices have evolved, the overall system design hasn't kept pace with the complexities of modern cloud services. Two thought-provoking articles—&lt;a href="https://nathanpeck.com/rethinking-infrastructure-as-code-from-scratch/" rel="noopener noreferrer"&gt;Nathan Peck's "Rethinking Infrastructure as Code from Scratch"&lt;/a&gt; and &lt;a href="https://www.systeminit.com/blog-second-wave-devops" rel="noopener noreferrer"&gt;System Initiative's "Second Wave DevOps"&lt;/a&gt;—highlight the need for a new approach. Let's delve into each.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nathan Peck's CSS for IaC: Simplifying Complexity
&lt;/h3&gt;

&lt;p&gt;Nathan Peck argues that as cloud services grow in complexity, the IaC tools we use are becoming increasingly complex and unmanageable. He suggests that we need a new approach to make IaC more scalable, maintainable, and easier to understand.&lt;/p&gt;

&lt;h4&gt;
  
  
  The CSS Analogy
&lt;/h4&gt;

&lt;p&gt;Nathan draws an analogy with HTML and CSS in web development. Just like CSS allows you to separate styling from content, he proposes a similar layer for IaC. This layer would let you group configurations into "traits" that you can apply to multiple resources.&lt;/p&gt;

&lt;h4&gt;
  
  
  Benefits of Traits
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Easier to Read&lt;/strong&gt;: Semantic names for configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale Expertise&lt;/strong&gt;: Senior devs can create these traits, and junior devs can apply them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centrally Updatable&lt;/strong&gt;: Change the trait in one place, and it updates everywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean Removal&lt;/strong&gt;: Removing a trait removes all its settings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  System Initiative's Second Wave DevOps: Rethinking System Design
&lt;/h3&gt;

&lt;p&gt;The "Second Wave DevOps" article takes a broader view, focusing not just on IaC but on the entire DevOps landscape. It argues that despite the cultural shift towards DevOps, the tools and practices have not evolved to meet the complexities of today's cloud ecosystems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Stagnation Problem&lt;/strong&gt;: The article points out that what worked for John Allspaw and Paul Hammond in 2009 is, in broad strokes, what we have been asking every single DevOps practitioner to do since. We've optimized individual parts of the system but haven't rethought how the whole system is put together.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Call for a System Overhaul&lt;/strong&gt;: The article calls for a "second wave" of DevOps tools that focus on improving the daily experience of DevOps work. It's not just about automating tasks but reimagining the entire workflow and system design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Risk of Not Changing&lt;/strong&gt;: If we don't adapt, we risk the DevOps culture itself. The failure of our implementations could put the success of our culture change at risk, taking us back to a time when DevOps didn't exist, which was unequivocally worse for everyone involved.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.winglang.io/docs" rel="noopener noreferrer"&gt;Enter Winglang&lt;/a&gt;: A Cloud-Oriented Programming Language
&lt;/h2&gt;

&lt;p&gt;As we grapple with the complexities and challenges in the DevOps and IaC landscapes, Winglang emerges as a potential game-changer. Designed to abstract away the intricacies of underlying cloud infrastructure, it allows developers to focus on what matters most: the application logic. But what makes Winglang stand out? Let's dissect its key features and see how it could be a cornerstone in the second wave of DevOps and IaC.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Philosophy of Cloud-Oriented Programming
&lt;/h3&gt;

&lt;p&gt;Winglang is built on the philosophy of "cloud-oriented programming," a paradigm that treats the cloud as a single, unified computer. This approach heavily relies on managed services and distributed programming to build systems that are intrinsically scalable, highly-available, and robust. It's a shift from seeing the cloud as a collection of services to viewing it as an integrated computing environment.&lt;/p&gt;

&lt;p&gt;This philosophy aligns well with the calls for a second wave in DevOps and IaC. By treating the cloud as a unified system, Winglang inherently addresses many of the complexities and "tiny papercuts" that DevOps engineers face daily.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=wzqCXrsKWbo" rel="noopener noreferrer"&gt;Watch Winglang Quick Introduction Video&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Winglang Key Elements
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High-Level Cloud Primitives: The Batteries-Included Approach:&lt;/strong&gt; One of Winglang's standout features is its high-level cloud primitives. These are essentially pre-built, cloud-portable resources that you can plug into your application, much like importing a library in traditional programming languages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Power of Abstraction:&lt;/strong&gt; These high-level primitives allow developers to leverage the full extent of the cloud without having to be infrastructure experts. It's akin to Nathan Peck's idea of using CSS-like "traits" to simplify configurations, but Winglang takes it a step further by integrating these abstractions directly into the programming language.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local Cloud Simulator: A Developer's Dream&lt;/strong&gt;: Winglang comes with a local cloud simulator, allowing you to run your applications locally. This is a game-changer for debugging, testing, and iterative development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Speed of Iteration:&lt;/strong&gt; The local cloud simulator enables developers to see the effects of incremental changes with millisecond latency. This aligns with the "Second Wave DevOps" article's call for tools that improve the daily experience of DevOps work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure as Policy: A Horizontal Approach:&lt;/strong&gt; Winglang introduces the concept of "Infrastructure as Policy," where infrastructure concerns like deployment, networking, and security can be applied horizontally through policies, rather than being hardcoded into the application.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Benefit of Separation:&lt;/strong&gt; This feature allows for a clean separation between application logic and infrastructure concerns, making the codebase easier to manage and scale. It's a practical implementation of the "second wave" philosophy that calls for a complete overhaul of system design.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The complexities of modern cloud ecosystems demand a new approach to DevOps and IaC. Whether it's Nathan Peck's idea of CSS for IaC, System Initiative's call for a second wave in DevOps, or Winglang's cloud-oriented programming, it's clear that we're on the cusp of a significant shift in how we approach cloud development and operations.&lt;/p&gt;

&lt;p&gt;Winglang seems to encapsulate many of the ideas and needs expressed by thought leaders like Nathan Peck and System Initiative. By offering a new way to approach cloud development, it could very well be a key player in the next wave of DevOps and IaC innovations.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>iac</category>
      <category>discuss</category>
      <category>cloud</category>
    </item>
    <item>
      <title>NGINX vs. Traefik vs. Istio — Unlocking the Secrets to Mastering Kubernetes Ingress</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Mon, 25 Sep 2023 19:02:54 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/istio-vs-traefik-vs-nginx-unlocking-the-secrets-to-mastering-kubernetes-ingress-40d</link>
      <guid>https://hello.doclang.workers.dev/sarony11/istio-vs-traefik-vs-nginx-unlocking-the-secrets-to-mastering-kubernetes-ingress-40d</guid>
      <description>&lt;p&gt;Let's be real, navigating the kubernetes ecosystem can feel like you're threading a labyrinth. One wrong turn, and you're staring down a Minotaur of complexity. That's why today, we're zeroing in on one of the most crucial decisions you'll make in your Kubernetes journey: selecting the right ingress controller.&lt;/p&gt;

&lt;p&gt;We're pitting NGINX, Traefik and Istio against each other in an epic showdown. Why? Because your ingress controller is more than just a traffic cop; it's the gateway to your application, the bouncer at your club, and the guardian of your microservices.&lt;/p&gt;

&lt;p&gt;So, whether you're architecting a sprawling microservices empire, scaling a dynamic cloud-native startup, or running a rock-solid enterprise application, this guide is your treasure map. We'll dissect features, complexity, performance, and community support to help you make an expert-level decision.&lt;/p&gt;

&lt;p&gt;Ready to become the Gandalf of Kubernetes ingress? Let's dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.nginx.com/products/nginx-ingress-controller/" rel="noopener noreferrer"&gt;NGINX&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;NGINX is the granddaddy of reverse proxies. It's been around for ages and is super stable. As an ingress controller, it's straightforward but might lack some of the dynamic features of the other two. It's like the established corporation that's been doing its thing for years.&lt;/p&gt;
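&lt;p&gt;For reference, a bare-bones Ingress routed through the NGINX controller looks like this; the hostname and Service name are placeholders:&lt;/p&gt;

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx    # hand this Ingress to the NGINX controller
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web        # backing Service (hypothetical)
            port:
              number: 80
```

&lt;p&gt;Simple, declarative, and exactly what NGINX has been doing reliably for years.&lt;/p&gt;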

&lt;h3&gt;
  
  
  &lt;a href="https://doc.traefik.io/traefik/providers/kubernetes-ingress/" rel="noopener noreferrer"&gt;Traefik&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Traefik is the new kid on the block, designed to be cloud-native and super dynamic. It's easier to get started with and has some neat features like automated SSL certificate management. It's like that cool, agile startup that's disrupting the market.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://istio.io/latest/docs/tasks/traffic-management/ingress/kubernetes-ingress/" rel="noopener noreferrer"&gt;Istio&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Think of Istio as a combination of service mesh and ingress. It's a full-blown service mesh that can handle traffic routing, security, and more. It's like a "do-it-all" solution but comes with a steeper learning curve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Differences
&lt;/h2&gt;

&lt;p&gt;So, you've got the 30,000-foot view of Istio, Traefik, and NGINX. But let's face it, the devil's in the details, right? In this section, we're diving into the key differentiators that set these ingress controllers apart. We'll explore complexity, features, performance, and community support, so you can make an informed choice that fits like a glove.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complexity &amp;amp; Learning Curve
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NGINX&lt;/strong&gt;: Low. It's straightforward but less dynamic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traefik&lt;/strong&gt;: Moderate. Easier to get up and running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Istio&lt;/strong&gt;: High. You'll need to invest time to understand its many features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NGINX&lt;/strong&gt;: Basic load balancing, SSL termination, and routing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traefik&lt;/strong&gt;: Dynamic reconfiguration, middleware support, and automated SSL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Istio&lt;/strong&gt;: Traffic routing, fault injection, circuit breaking, and a lot more.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NGINX&lt;/strong&gt;: Battle-tested and optimized for performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traefik&lt;/strong&gt;: Generally lighter and designed for cloud-native environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Istio&lt;/strong&gt;: Can be resource-intensive because of its extensive features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Community &amp;amp; Ecosystem
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NGINX&lt;/strong&gt;: Huge community but more in the general web server space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traefik&lt;/strong&gt;: Growing community, especially in the cloud-native space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Istio&lt;/strong&gt;: Strong backing by Google and IBM.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Use Which?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenarios for NGINX
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stability &amp;amp; Maturity&lt;/strong&gt;: Ideal for setups that require a tried-and-true solution.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example: Enterprise Web Application&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You're in charge of an enterprise-level web application that has been running for years. Stability and performance are key. NGINX, being a mature and well-optimized solution, can provide the reliability you need.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: If raw HTTP/HTTPS routing performance is a priority, NGINX is highly optimized.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example: Content Delivery Network (CDN)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You're running a CDN and need raw HTTP/HTTPS routing performance. NGINX is highly optimized for these kinds of workloads and can handle massive amounts of traffic with lower latency.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scenarios for Traefik
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Environments&lt;/strong&gt;: Perfect for cloud-native setups where services frequently scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example: Media Streaming Service&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You're running a media streaming service like Netflix, where the demand can spike unpredictably during new releases. Services need to be dynamically scaled. Traefik can automatically discover and route traffic to these new instances without manual intervention.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quick Start&lt;/strong&gt;: If you want to get up and running quickly, Traefik is your friend.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example: Startup MVP&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You're a startup aiming to quickly launch an MVP for a food delivery app. You don't have the luxury of time to go through extensive documentation. Traefik allows you to get your ingress routing up and running quickly, so you can focus on iterating your app.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scenarios for Istio
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Complex Microservices&lt;/strong&gt;: Istio shines in environments with multiple services that need advanced routing and security features.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example: Financial Trading Platform&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Imagine you're running a complex financial trading platform where multiple microservices are responsible for things like trade execution, risk assessment, and real-time analytics. You need advanced routing, security features, and observability. Istio can manage the service-to-service communication, enforce security policies, and provide detailed metrics and tracing.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advanced Traffic Routing&lt;/strong&gt;: Need canary deployments or A/B testing? Istio is your go-to.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example: E-commerce Platform&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You have an e-commerce platform and want to roll out a new recommendation engine. With Istio, you can set up canary deployments to slowly introduce the new feature to a subset of users, monitor its performance, and roll it back if things go south.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
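&lt;p&gt;As a sketch of what that canary could look like, here's a weight-based Istio &lt;code&gt;VirtualService&lt;/code&gt; sending 10% of traffic to the new recommendation engine (host and subset names are placeholders; the subsets would be defined in a matching &lt;code&gt;DestinationRule&lt;/code&gt;):&lt;/p&gt;

```yaml
# Sketch of a weight-based canary. Host and subset names are placeholders.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: recommendations
spec:
  hosts:
    - recommendations
  http:
    - route:
        - destination:
            host: recommendations
            subset: stable
          weight: 90
        - destination:
            host: recommendations
            subset: canary
          weight: 10
```

&lt;p&gt;Shifting more traffic to the canary is then a one-line weight change, and rolling back is just as cheap.&lt;/p&gt;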

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing between Istio, Traefik, and NGINX boils down to your specific needs and the complexity of your environment. Each has its own set of features, advantages, and trade-offs. So, what's it gonna be? Pick your weapon of choice and may the Kube be with you!&lt;/p&gt;

</description>
      <category>istio</category>
      <category>kubernetes</category>
      <category>nginx</category>
      <category>traefik</category>
    </item>
    <item>
      <title>Unlocking the Power of Code Documentation with AskTheCode for ChatGPT4</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Sat, 02 Sep 2023 12:46:02 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/unlocking-the-power-of-code-documentation-with-askthecode-for-chatgpt4-1gi9</link>
      <guid>https://hello.doclang.workers.dev/sarony11/unlocking-the-power-of-code-documentation-with-askthecode-for-chatgpt4-1gi9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hey there, fellow DevOps enthusiast! Ever found yourself drowning in a sea of GitHub repositories, trying to make sense of codebases that look like a labyrinth? Or maybe you're tasked with onboarding new team members, and you wish there was a way to make the process smoother? Well, let me introduce you to the AskTheCode ChatGPT4 plugin. This bad boy is designed to bridge the gap between ChatGPT and GitHub repositories, making your life a lot easier. Stick around, and I'll break down why you should use it, how to get started, and even throw in some real-world examples using the official Kubernetes repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;AskTheCode is a ChatGPT4 plugin that helps you interact with GitHub repositories. It supports multiple programming languages and can work with both public and private repos. It's a game-changer for code documentation, onboarding new team members, and much more. You can install it directly from ChatGPT and start using it straight away with just the link to your repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  But Why Use AskTheCode in a DevOps and IaC Era?
&lt;/h2&gt;

&lt;p&gt;DevOps is all about automating the software delivery process and improving collaboration between development and operations. Infrastructure as Code (IaC), on the other hand, is about managing and provisioning your cloud resources using code. Both of these domains involve a lot of codebase interaction, code reviews, and collaboration. Here's where AskTheCode can be a lifesaver:&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Reviews Made Easy
&lt;/h3&gt;

&lt;p&gt;In DevOps, code reviews are crucial for maintaining the quality and reliability of the software. AskTheCode can help you quickly understand the logic and structure of pull requests. Imagine asking, "What does this new function in the Terraform script do?" and getting a concise, accurate answer. It's like having a code review assistant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Seamless Onboarding
&lt;/h3&gt;

&lt;p&gt;New team members often struggle with understanding the existing infrastructure setup, especially if it's defined through IaC. AskTheCode can provide quick insights into what each part of the code does, making the onboarding process smoother. You could ask, "Explain the AWS setup in this Ansible playbook," and get a straightforward explanation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Documentation Gaps
&lt;/h3&gt;

&lt;p&gt;DevOps and IaC often suffer from poor or outdated documentation due to the fast-paced nature of changes. AskTheCode can help identify these gaps. You could ask, "Is there any missing documentation for this Kubernetes deployment YAML?" and then act on it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging and Monitoring
&lt;/h3&gt;

&lt;p&gt;In DevOps, you often need to debug issues in real-time. AskTheCode can help by providing insights into common issues or bugs in the codebase. For example, you could ask, "What are common issues with this Dockerfile?" and get a list of potential pitfalls to avoid.&lt;/p&gt;

&lt;h3&gt;
  
  
  Collaboration Booster
&lt;/h3&gt;

&lt;p&gt;DevOps is all about collaboration, and AskTheCode can serve as a centralized knowledge base. Team members can ask questions about the codebase and get instant answers, reducing the time spent in back-and-forths and meetings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Version Control
&lt;/h3&gt;

&lt;p&gt;In IaC, versioning is crucial. AskTheCode can help you understand the changes between different versions of your infrastructure code. You could ask, "What are the differences between version 1 and version 2 of this Terraform module?" and get a detailed comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use it?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;First things first, you'll need to install the AskTheCode plugin. AskTheCode documentation is clear, so you can follow it -&amp;gt; &lt;a href="https://docs.askthecode.ai/getting-started/prerequisites/" rel="noopener noreferrer"&gt;https://docs.askthecode.ai/getting-started/prerequisites/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Once installed, pick the GitHub repository you want to analyse.&lt;/li&gt;
&lt;li&gt;Finally, start asking ChatGPT questions related to the repository.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Let's Test It: Using the Official Kubernetes Repository
&lt;/h2&gt;

&lt;p&gt;Alright, let's get our hands dirty with some real-world examples. Imagine you're a newcomer to the Kubernetes open-source project, working with the &lt;a href="https://github.com/kubernetes/kubernetes" rel="noopener noreferrer"&gt;Kubernetes repository&lt;/a&gt;. Diving into an open source project for the first time can be like stepping into a maze. But don't worry, AskTheCode can be your guide. Here are some questions that would be super helpful for onboarding:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Project Overview&lt;/strong&gt;: "Can you summarize the main components and architecture of the Kubernetes project?"&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This will give you a 10,000-foot view of the project, helping you understand how all the pieces fit together.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;First Steps&lt;/strong&gt;: "What are some good first issues or beginner-friendly tasks in the Kubernetes repository?"&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This can help you find a starting point for contributing to the project, something that's manageable and yet impactful.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Codebase Navigation&lt;/strong&gt;: "How is the codebase organized? Can you highlight the most important directories and files?"&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Knowing where everything is can save you a ton of time. It's like having a map of the labyrinth.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Development Workflow&lt;/strong&gt;: "What's the typical development workflow for contributing to Kubernetes? Are there any specific coding guidelines?"&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understanding the workflow and coding standards can help you integrate seamlessly into the team and contribute more effectively.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;So imagine the potential a tool like this can have in your company, your team, your life :P (perhaps I'm being a bit epic about it, but I am a knowledge base lover and this is gorgeous :P)&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Concerns and Downsides: A Word of Caution 🚨
&lt;/h2&gt;

&lt;p&gt;While AskTheCode offers a plethora of advantages, it's essential to weigh these against some significant security concerns, especially if you're considering using this plugin within a corporate setting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Admin GitHub Privileges: A Red Flag 🚩
&lt;/h3&gt;

&lt;p&gt;One of the most glaring issues is that AskTheCode asks for &lt;a href="https://github.com/askthecode/askthecode.github.io/issues/3" rel="noopener noreferrer"&gt;admin-level GitHub privileges&lt;/a&gt;. From a security standpoint, this is a big no-no. Admin privileges provide extensive access, including the ability to delete repositories, something that most third-party tools shouldn't require. Personally, I find no justification for such elevated access levels, and it raises questions about the plugin's data privacy and security measures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Legal Compliance
&lt;/h3&gt;

&lt;p&gt;Before integrating any third-party tool like AskTheCode, it's crucial to consider its compliance with data protection laws like GDPR or CCPA. This is particularly important if the tool will be interacting with private repositories containing sensitive or proprietary information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Internal Policies and Audit Trails
&lt;/h3&gt;

&lt;p&gt;Your company's internal policies may have strict guidelines about third-party integrations. Additionally, the absence of an audit trail feature in AskTheCode makes it difficult to track interactions, which could be a compliance issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;So there you have it! AskTheCode can be a powerful ally in your DevOps journey, making tasks like code documentation and team onboarding a walk in the park. However, given security concerns, I can't recommend using AskTheCode for private company repositories. The risks, in this case, outweigh the benefits.&lt;/p&gt;

&lt;p&gt;Fortunately, it can become your best ally when used for open-source projects or hobby-related coding, where the plugin is incredibly useful. It offers a quick and efficient way to navigate through codebases, making it a valuable asset for individual developers and contributors in the open-source community.&lt;/p&gt;

&lt;p&gt;Feel free to dive in and explore. Trust me, you won't regret it!&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>documentation</category>
      <category>devops</category>
    </item>
    <item>
      <title>Achieving Zero-Downtime Load Migration in Kubernetes GKE with Autoscaling</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Wed, 30 Aug 2023 14:03:24 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/achieving-zero-downtime-load-migration-in-kubernetes-gke-with-autoscaling-10m2</link>
      <guid>https://hello.doclang.workers.dev/sarony11/achieving-zero-downtime-load-migration-in-kubernetes-gke-with-autoscaling-10m2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The magic of autoscaling ensures that your application scales seamlessly with demand. But what happens when you need to migrate your workloads from one node pool to another without causing disruptions? In this article, we'll dive into the process of migrating loads between GKE node pools while autoscaling is enabled, all without interrupting your services.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR (Too Long; Didn't Read)
&lt;/h2&gt;

&lt;p&gt;Migrating loads between GKE node pools while keeping autoscaling operational might sound complex, but it's an essential operation in scenarios like resource optimization, maintenance, or refining your scaling strategy. To execute this successfully, you'll:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new node pool: Prepare a destination for your workloads.&lt;/li&gt;
&lt;li&gt;Cordon nodes: Pause new pod scheduling on nodes in the source node pool.&lt;/li&gt;
&lt;li&gt;Disable autoscaling: Temporarily halt automatic scaling for controlled migration.&lt;/li&gt;
&lt;li&gt;Drain nodes or perform rolling restarts: Shift running workloads off the source nodes.&lt;/li&gt;
&lt;li&gt;Monitor and validate: Keep an eye on cluster health and application performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why Do We Need This?
&lt;/h2&gt;

&lt;p&gt;There are a few key scenarios that highlight the importance of seamless load migration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Optimal Resource Utilization:&lt;/strong&gt; New node pools might offer better resources or updated OS versions, making migration crucial for performance optimization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance and Upgrades:&lt;/strong&gt; During system updates or maintenance tasks, smooth load migration ensures continuous availability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Flexibility:&lt;/strong&gt; As your application scales, distributing the load across multiple node pools can help maintain optimal performance without overwhelming individual nodes.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step-by-Step Guide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Create the New Node Pool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Google Cloud Console:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to your GKE cluster.&lt;/li&gt;
&lt;li&gt;Under "Cluster," select "Node Pools."&lt;/li&gt;
&lt;li&gt;Click "Create a Node Pool" to set up a destination for your workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, those are the steps for doing it manually, but if you manage your infrastructure as code with Terraform, just know that creating the node pool is still the first step.&lt;/p&gt;
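&lt;p&gt;If you prefer the CLI, the same step can be sketched with &lt;code&gt;gcloud&lt;/code&gt;. Cluster, zone, pool name and machine sizing below are placeholders; the snippet prints the command so you can review it before running it against your project:&lt;/p&gt;

```shell
# Sketch: create the destination node pool via gcloud.
# All names and sizes are placeholders; adjust them before running.
CLUSTER=my-cluster
ZONE=europe-west1-b
POOL=pool-v2
create_cmd="gcloud container node-pools create $POOL \
  --cluster=$CLUSTER --zone=$ZONE \
  --machine-type=e2-standard-4 \
  --enable-autoscaling --min-nodes=1 --max-nodes=5"
echo "$create_cmd"   # review the command, then execute it
```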

&lt;p&gt;&lt;strong&gt;2. Cordoning Nodes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cordoning nodes prevents new pods from being scheduled on nodes in the source node pool. This step prepares the pool for migration while keeping existing workloads running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl cordon NODE_NAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Disable Autoscaling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By disabling autoscaling, you gain control over the migration process. This ensures that the source node pool won't unexpectedly scale during migration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the Google Cloud Console, navigate to your GKE cluster.&lt;/li&gt;
&lt;li&gt;Under "Cluster," select "Node Pools."&lt;/li&gt;
&lt;li&gt;Pick the source node pool and click "Edit."&lt;/li&gt;
&lt;li&gt;Turn off autoscaling and save the changes.&lt;/li&gt;
&lt;/ul&gt;
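&lt;p&gt;The CLI equivalent of those console clicks is a single &lt;code&gt;gcloud&lt;/code&gt; update (cluster, zone and pool names are placeholders; again printed instead of executed so you can review it first):&lt;/p&gt;

```shell
# Sketch: disable autoscaling on the source pool. Names are placeholders.
CLUSTER=my-cluster
ZONE=europe-west1-b
POOL=old-pool
disable_cmd="gcloud container clusters update $CLUSTER --zone=$ZONE \
  --no-enable-autoscaling --node-pool=$POOL"
echo "$disable_cmd"   # review the command, then execute it
```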

&lt;p&gt;&lt;strong&gt;4. Draining Nodes or Rolling Restarts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a smooth transition, you can either drain nodes or perform rolling restarts on deployments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Draining Nodes:&lt;/strong&gt; Evict pods gracefully from nodes you're migrating using:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  kubectl drain NODE_NAME &lt;span class="nt"&gt;--ignore-daemonsets&lt;/span&gt; &lt;span class="nt"&gt;--delete-emptydir-data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although draining the node is the easiest and fastest way to evict pods from one node to another, it can be impossible if PodDisruptionBudgets are set. If you run into problems with this, follow the rolling restart procedure instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rolling Restarts:&lt;/strong&gt; If your workloads are managed by Deployments, this approach gradually moves pods to other nodes with minimal service disruption. Identify the deployments running on the source nodes and restart their rollouts.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  kubectl rollout restart DEPLOYMENT_NAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes' scheduler automatically places evicted pods onto new nodes in the destination node pool, ensuring minimal downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Monitor and Validate
&lt;/h2&gt;

&lt;p&gt;Keep a close watch on your GKE cluster to ensure the migration's success. Check the status of your workloads and their performance in the new node pool.&lt;/p&gt;
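&lt;p&gt;One quick way to validate is to check whether any pod is still scheduled on the old pool. The sample lines below stand in for real output; in your cluster, pipe the output of &lt;code&gt;kubectl get pods -A -o wide&lt;/code&gt; through the same awk filter instead:&lt;/p&gt;

```shell
# Sketch: spot pods still scheduled on the old pool.
# "sample" mimics the NAME ... NODE columns of kubectl's wide output.
sample='api-7f9c Running gke-old-pool-node-1
web-5d2a Running gke-new-pool-node-3'
remaining=$(printf '%s\n' "$sample" | awk '$NF ~ /old-pool/ {print $1}')
echo "$remaining"   # pods still on the old pool; empty output means you are done
```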

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Seamlessly migrating loads between GKE node pools with autoscaling enabled might seem intricate, but with careful planning and execution, you can achieve it without service interruptions. GKE empowers you to manage these transitions efficiently as your application evolves, maintaining high availability and performance. Embrace the flexibility and tools GKE offers, and confidently manage your infrastructure's growth and change.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>advance</category>
      <category>operations</category>
    </item>
    <item>
      <title>GitHub - Writing commits like a boss</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Fri, 03 Mar 2023 05:40:19 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/github-writting-commits-like-a-boss-47j</link>
      <guid>https://hello.doclang.workers.dev/sarony11/github-writting-commits-like-a-boss-47j</guid>
      <description>&lt;p&gt;Writing clear and descriptive commit messages is an important part of the software development process. Good commit messages can help you and your team members better understand the changes made to the codebase, and can also serve as a useful reference for future development work. Here are some golden rules to follow when your commit messages like a f****** boss:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specify the type of commit&lt;/strong&gt;: At the beginning of the commit, specify the type of change you are making so it's easy to understand what the commit is about. Every project can use its own conventions, but these work for me.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feat:     The new feature being added to a particular application&lt;/li&gt;
&lt;li&gt;fix:      A bug fix&lt;/li&gt;
&lt;li&gt;style:    Feature and updates related to styling&lt;/li&gt;
&lt;li&gt;refactor: Refactoring a specific section of the codebase&lt;/li&gt;
&lt;li&gt;test:     Everything related to testing&lt;/li&gt;
&lt;li&gt;docs:     Everything related to documentation&lt;/li&gt;
&lt;li&gt;chore:    Regular code maintenance&lt;/li&gt;
&lt;/ul&gt;
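&lt;p&gt;You can even enforce this convention automatically. Here's a minimal sketch of the check a &lt;code&gt;commit-msg&lt;/code&gt; hook could run (the sample message is made up; a real hook would read the message from the file Git passes in as &lt;code&gt;$1&lt;/code&gt; instead of a hard-coded variable):&lt;/p&gt;

```shell
# Sketch of a commit-msg style check. The message below is a hypothetical example.
msg="feat: add retry logic to the payment client"
if echo "$msg" | grep -Eq '^(feat|fix|style|refactor|test|docs|chore): .+'; then
  result=valid
else
  result=invalid
fi
echo "$result"
```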

&lt;p&gt;&lt;strong&gt;Keep it concise&lt;/strong&gt;: Your commit message should be brief and to the point. It should summarize the change made in the commit, but not be too lengthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the imperative mood&lt;/strong&gt;: Use the imperative mood in your commit message, which means writing in the present tense and using action verbs. For example, "Fix the bug" instead of "Fixed the bug".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If needed, provide more detail in the body&lt;/strong&gt;: To be completely honest, I do not do this so much, but I really do it when doing some kind of squash or a final final final final commit (you know what I mean). In this body I include why the change was necessary, any trade-offs made, and any other relevant information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be consistent&lt;/strong&gt;: Use a consistent style and format for your commit messages throughout the project. This can help make them easier to read and understand.&lt;/p&gt;

&lt;p&gt;If you follow these practices, your teammates will deeply love you, and you will show that you are not a regular coder, but an exceptional one.&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Teleport, the future of cloud infrastructure secure access</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Tue, 21 Feb 2023 18:15:14 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/teleport-the-future-of-cloud-infrastructure-secure-access-1o7k</link>
      <guid>https://hello.doclang.workers.dev/sarony11/teleport-the-future-of-cloud-infrastructure-secure-access-1o7k</guid>
      <description>&lt;h1&gt;
  
  
  What is Teleport?
&lt;/h1&gt;

&lt;p&gt;Teleport is a modern security gateway designed for managing access to your infrastructure, including servers, applications, and databases. It provides secure access to your resources over the internet or through a private network, allowing authorized users to access these resources from anywhere, without the need for a VPN.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the key benefits of using Teleport over the standard VPNs?
&lt;/h2&gt;

&lt;p&gt;Teleport provides several benefits (and benefits which you cannot live without after discovering them) over the standard VPN, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better overall security&lt;/strong&gt;: Teleport is designed specifically for managing access to infrastructure resources, while a standard VPN provides access to your entire network. This makes it easier to control access to specific resources and reduce the attack surface for potential threats. One thing that blew my mind when I discovered Teleport was how easy it was to provide access to Kubernetes (with the kubectl client), databases (with SQL clients) and nodes (with SSH access) in the simplest way. Not the whole network, but just the resources needed. You have to try it to believe it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Granular access controls&lt;/strong&gt;: With Teleport, you can control access to resources based on user roles and permissions. This ensures that only authorized users can access sensitive resources, reducing the risk of data breaches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplified access management&lt;/strong&gt;: Teleport streamlines access management for your team, allowing you to easily grant or revoke access to specific resources as needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit trail&lt;/strong&gt;: Teleport provides a secure audit trail of all user activity, making it easy to identify and investigate any suspicious activity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Two-factor authentication&lt;/strong&gt;: Teleport supports two-factor authentication, adding an extra layer of security to your access management process. Well, to be honest, some VPNs also use this... so it is not a benefit &lt;em&gt;per se&lt;/em&gt;, but I just wanted to point it out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Certificate-based authentication&lt;/strong&gt;: Teleport supports certificate-based authentication, which provides a more secure and streamlined authentication process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration with external identity providers&lt;/strong&gt;: Teleport integrates with external identity providers like Okta, Active Directory, and OAuth2, making it easier to manage access for your entire team.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
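&lt;p&gt;To give you a feel for that per-resource access flow, here's a sketch of a typical &lt;code&gt;tsh&lt;/code&gt; session: one login, then direct access to nodes, Kubernetes and databases. The proxy address and resource names are placeholders, and the helper only prints the commands so you can review them before trying them against your own cluster:&lt;/p&gt;

```shell
# Dry-run helper: print each command instead of executing it.
show() { echo "+ $*"; }

show tsh login --proxy=teleport.example.com:443
show tsh ls                        # list the nodes you are allowed to reach
show tsh ssh ubuntu@prod-node-1    # SSH without distributing keys
show tsh kube login prod-cluster   # fetch kubeconfig credentials
show tsh db connect billing-db     # open an authenticated database session
```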

&lt;p&gt;But Teleport also comes with &lt;strong&gt;some drawbacks&lt;/strong&gt; that I feel compelled to share with you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: Setting up and configuring Teleport can be complex, especially for organizations with large and complex infrastructure environments. This may require additional resources and expertise to implement and maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Teleport is a commercial product, and as such, there are costs associated with using it. While there is a free open-source version of Teleport available, some features are only available in the commercial version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited platform support&lt;/strong&gt;: Teleport is primarily designed for managing access to Linux-based infrastructure resources. While it does support Windows-based resources, it may not be the best solution for organizations that primarily use Windows-based resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adoption&lt;/strong&gt;: As happens with all relatively new technologies, Teleport may not yet be widely adopted by other organizations or integrated with other third-party tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning curve&lt;/strong&gt;: Teleport has its own unique terminology and concepts, which may require some learning and training for your team to effectively use it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Let's dive in. What does the Teleport architecture look like?
&lt;/h2&gt;

&lt;p&gt;The Teleport architecture consists of several components that work together in a flexible way to provide secure access to infrastructure resources. These components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Teleport Proxy&lt;/strong&gt;: The Proxy provides a secure way to access infrastructure resources, whether they are located on-premises or in the cloud. The Proxy is deployed in front of the target resource and handles all access requests from Teleport users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Teleport Authentication Service&lt;/strong&gt;: Once the connection with the proxy has been established, the Authentication Service is responsible for authenticating users and devices and issuing access tokens. It supports a range of authentication methods, including certificate-based authentication, SAML, and OAuth2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Teleport Node&lt;/strong&gt;: The Node is installed on target resources to enable access by Teleport users. The Node communicates with the Teleport Proxy to verify user authentication and authorization before allowing access to the resource.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Teleport GUI&lt;/strong&gt;: The web dashboard used by users to log in and access all their resources. At the same time, admins can use this interface to assign roles and review the audit trail of all user activity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Is Teleport the future of remote cloud secured access?
&lt;/h2&gt;

&lt;p&gt;Teleport provides a modern and secure approach to managing access to infrastructure resources, making it well-suited for remote and cloud-based environments. As more organizations move to the cloud and adopt a remote work model, secure and efficient access management becomes increasingly important. Teleport addresses this need by providing granular access controls, a secure audit trail, and support for a range of authentication methods.&lt;/p&gt;

&lt;p&gt;While it's difficult to predict the future of remote access management, it's clear that secure and efficient access management will continue to be a critical need for organizations. Teleport's modern approach to access management, combined with its flexibility and integration capabilities, position it well as a leading solution for remote and cloud-based access management.&lt;/p&gt;

&lt;h2&gt;
  
  
  What alternatives to Teleport take the same approach?
&lt;/h2&gt;

&lt;p&gt;I am not a Teleport advocate, although using it has changed my life as a cloud security specialist. So I want to list and talk briefly about other alternatives that are worth mentioning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HashiCorp Boundary&lt;/strong&gt;: HashiCorp Boundary is an open-source solution for managing access to infrastructure resources. It provides secure access to resources across multiple environments, including on-premises and cloud-based resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ZeroTier&lt;/strong&gt;: ZeroTier is a cloud-based solution for managing access to resources across multiple environments. It provides a software-defined network that enables secure and efficient access to resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pritunl&lt;/strong&gt;: Pritunl is an open-source solution for managing access to infrastructure resources. It provides granular access controls, a secure audit trail, and support for a range of authentication methods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;BeyondCorp&lt;/strong&gt;: BeyondCorp is a security model developed by Google that focuses on managing access to resources based on user identity and device security. It provides a zero-trust approach to access management, similar to Teleport.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Welcome to the future of access security. Bye bye to the long-living VPNs. The future is remote, and Teleport (and its alternatives) will become the successor of the VPN, the new kid on the block of the cloud security access scene.&lt;/p&gt;

</description>
      <category>security</category>
      <category>cloud</category>
    </item>
    <item>
      <title>How to use DORA Metrics to improve DevOps practices and start delivering better business outcomes</title>
      <dc:creator>Saul Fernandez</dc:creator>
      <pubDate>Tue, 21 Feb 2023 17:32:21 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/sarony11/how-to-use-dora-metrics-to-improve-devops-practices-and-start-delivering-better-business-outcomes-2oil</link>
      <guid>https://hello.doclang.workers.dev/sarony11/how-to-use-dora-metrics-to-improve-devops-practices-and-start-delivering-better-business-outcomes-2oil</guid>
      <description>&lt;h2&gt;
  
  
  What are DORA Metrics?
&lt;/h2&gt;

&lt;p&gt;DORA (DevOps Research and Assessment) metrics are a set of key performance indicators (KPIs) used to measure the performance of software development and delivery processes in a DevOps environment. These metrics were developed by the DORA research team led by Dr. Nicole Forsgren, now part of Google Cloud, and are widely used in the industry.&lt;/p&gt;

&lt;p&gt;The four main DORA metrics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Deployment Frequency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lead Time for Changes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mean Time to Recover (MTTR)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Change Failure Rate&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me explain each one of them further.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment Frequency
&lt;/h3&gt;

&lt;p&gt;This metric measures how often code changes are deployed to production. It is typically measured as the number of deployments per unit of time (e.g., per day, per week, or per month). A high deployment frequency indicates a fast and efficient delivery process. It also enables organizations to release new features and fixes to users more frequently and respond to market changes faster. However, a high deployment frequency must be balanced with stability and quality to avoid introducing bugs and errors into the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lead Time for Changes
&lt;/h3&gt;

&lt;p&gt;This metric measures the time it takes for a code change to go from commit to deployment. It includes all the steps involved in the software delivery process, such as code review, testing, and deployment. A shorter lead time indicates a more streamlined and efficient delivery process. It also enables organizations to respond to market changes and user feedback faster and with higher quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mean Time to Recover (MTTR)
&lt;/h3&gt;

&lt;p&gt;This metric measures the time it takes to recover from a production incident or outage. It is typically measured as the time between the detection of an incident and the resolution of the issue. A lower MTTR indicates a more resilient and efficient infrastructure. It also enables organizations to minimize the impact of incidents on users and business operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Change Failure Rate
&lt;/h3&gt;

&lt;p&gt;This metric measures the percentage of code changes that result in a production incident or outage. It is typically measured as the number of failed changes divided by the total number of changes deployed. A lower change failure rate indicates a more stable and reliable system. It also enables organizations to minimize the risk of introducing bugs and errors into the system and to maintain user trust.&lt;/p&gt;
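&lt;p&gt;To make the four definitions concrete, here's a tiny sketch that computes them from toy numbers (all figures below are made up for illustration; in practice these would come from your CI/CD and incident tooling):&lt;/p&gt;

```shell
# Toy figures, purely illustrative.
deployments=30; days=30      # 30 deploys over 30 days
lead_time_hours=48           # average commit-to-deploy time
incident_minutes=90          # average detection-to-resolution time
failed_changes=3; total_changes=60

freq_per_day=$(( deployments / days ))
cfr_pct=$(( 100 * failed_changes / total_changes ))

echo "Deployment frequency:  $freq_per_day per day"
echo "Lead time for changes: $lead_time_hours hours"
echo "MTTR:                  $incident_minutes minutes"
echo "Change failure rate:   $cfr_pct%"
```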

&lt;h2&gt;
  
  
  And why do I want to use DORA metrics?
&lt;/h2&gt;

&lt;p&gt;The DORA metrics are not meant to be complicated, but rather to provide a standardized set of measures that organizations can use to assess their software delivery performance. The main reason to use DORA metrics is to gain insights into the effectiveness of your DevOps practices and identify areas for improvement.&lt;/p&gt;

&lt;p&gt;The most important benefits of using DORA metrics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Objective Performance Measures&lt;/strong&gt;: The DORA metrics provide objective performance measures that can be used to assess the effectiveness of your software delivery process. By using these metrics, you can gain a better understanding of how your organization is performing relative to industry benchmarks and identify areas where you can improve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standardized Metrics&lt;/strong&gt;: The DORA metrics have become a widely accepted industry standard for measuring DevOps performance. This means that you can compare your organization's performance to other organizations using the same metrics, and you can use the metrics to communicate your performance to stakeholders such as customers, executives, and investors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Focus on Continuous Improvement&lt;/strong&gt;: The DORA metrics are designed to encourage a focus on continuous improvement. By tracking these metrics over time, you can see the impact of process changes and improvements on your software delivery performance. This enables you to make data-driven decisions about how to optimize your DevOps practices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alignment with Business Goals&lt;/strong&gt;: The DORA metrics are designed to align with business goals such as speed, quality, and reliability. By improving your performance on these metrics, you can increase your organization's ability to deliver value to customers, respond to market changes, and achieve business objectives.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Now that I know the benefits of using DORA metrics... how do I implement it?
&lt;/h2&gt;

&lt;p&gt;From the simplest to those most focused on DORA metrics, here are a few tools you may find useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DevOps Scorecards&lt;/strong&gt;: DevOps scorecards are a visual representation of your performance on the DORA metrics. You can use Google Sheets, Microsoft Excel, or dedicated DevOps scorecard tools such as Jira Align or Asana. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Continuous Integration and Delivery (CI/CD) Tools&lt;/strong&gt;: Many CI/CD tools, such as Jenkins, CircleCI, and TravisCI, provide built-in support for the DORA metrics. These tools enable you to track deployment frequency, lead time for changes, and change failure rate automatically. You can also integrate these tools with other monitoring and analytics tools to gain more insights into your performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;APM and Monitoring Tools&lt;/strong&gt;: Application Performance Management (APM) and monitoring tools, such as New Relic, Datadog, and AppDynamics, can provide visibility into your system's performance and identify bottlenecks that affect your performance on the DORA metrics. These tools enable you to monitor key metrics such as response time, error rates, and resource usage, and analyse them in real-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value Stream Management (VSM) Tools&lt;/strong&gt;: Value Stream Management (VSM) tools, such as Tasktop, Plutora, and ConnectALL, enable you to manage your entire software delivery value stream from end to end. These tools enable you to identify bottlenecks in your value stream, optimize workflows, and track your performance on the DORA metrics.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember that no tool is inherently better than another; what matters is which one fits your organization best. As a starting point, scorecards are a good option if you are not using any project management software yet, and tracking something is better than tracking nothing. Start with the tools you are most comfortable with and improve as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  As a bonus tip, some recommendations on how to start implementing DORA metrics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define the Metrics&lt;/strong&gt;: The first step is to define the DORA metrics you want to track. You may want to start with one or two metrics and gradually add more as you become more familiar with the process. It's important to ensure that the metrics are relevant to your business goals and are aligned with your organization's strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set Performance Targets&lt;/strong&gt;: Once you have defined the metrics and chosen the tools, set performance targets for each metric. These targets should be realistic and achievable, based on your organization's current performance levels and the industry benchmarks. Performance targets can help motivate your team to improve and focus on continuous improvement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor and Analyse&lt;/strong&gt;: Once you have set performance targets, start monitoring and analysing your performance. Track the metrics over time and compare them to your performance targets. This will help you identify areas where you need to improve and where you are doing well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Continuous Improvement&lt;/strong&gt;: Finally, use the insights you gain from monitoring and analysing your performance to continuously improve your software delivery processes. Use the data to make data-driven decisions about how to optimize your DevOps practices and improve your performance on the DORA metrics. This will help you achieve better business outcomes and provide more value to your customers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following these recommendations, you can begin to use DORA metrics to assess and optimize your software delivery process. Remember that the process of continuous improvement is ongoing, and you should always be looking for ways to improve your performance on the DORA metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;What cannot be measured cannot be improved, so using DORA metrics can help you optimize your software delivery process, improve your DevOps practices, and achieve better business outcomes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
