Jeff Nelson

Senior Site Reliability Engineer with 20+ years of infrastructure experience, including 3+ years hands-on with AWS EKS, Kubernetes, and GitOps at production scale. Grounded in enterprise datacenter operations — VMware, SAN, multi-DC management — and now contributing to cloud-native platform engineering for a ~100-engineer organization. Strong execution engineer with depth in Kubernetes troubleshooting, incident response, cluster lifecycle management, and infrastructure as code. Comfortable taking ownership of hard problems and delivering results independently.

Senior Site Reliability Engineer

Wiser (acquired RW3 Technologies), Austin, TX (Hybrid)

2021 – Present

2021–2022: Post-acquisition — bridged 3 active datacenters while onboarding into AWS cloud infrastructure alongside the SRE team.

2022–Present: Full SRE sprint work — EKS, Terraform, ArgoCD, GitOps, incident response. Scope expanded when SRE and DevEx teams merged under one manager.

Led VMware environment decommission (2025) — owned the infrastructure side end to end: coordinated equipment sale, shipping, and physical removal. Executed Chicago DC site-to-site VPN teardown via Terraform/Atlantis across prod and test accounts. Dev teams handled workload migration. Recovered six figures in capital.
Independently led EKS Hybrid Nodes PoC — sole engineer on proof of concept extending EKS to on-premises Chicago DC hardware using AWS Hybrid Nodes (new AWS feature, 2024), SSM activation, nodeadm, and Cilium CNI for cross-environment pod networking. Designed networking path through site-to-site VPN to private EKS API endpoint. Findings informed team decision to decommission the DC rather than maintain hybrid infrastructure long-term.
Authored first v2 Crossplane Workload XRD and Composition — implemented the initial Workload kind against the team's new v2 platform standard, establishing the working pattern for all subsequent team workload deployments. Deployed ngri-analytics-ai as the first consumer of the new standard.
Diagnosed and resolved production auth service outage (SRE-2478) — service returning mixed 200/404/502 responses, dev team fully blocked. Identified an overly broad Kubernetes Service selector that was pulling pods from three different services into its endpoints. Cleared the resulting ArgoCD sync blockage, caused by a conflict on the immutable spec.selector field, by deleting and recreating the affected Deployments. Flagged the chart design flaw to engineering so a pre-merge guardrail could be added.
Owned CI/CD pipeline incident response (SRE-2448) — traced failed GitHub Actions npm publish workflow to expired PAT from a suspended account. Restored pipeline, updated secrets, documented blast radius across three affected teams, and recommended migration to org-level secret to prevent recurrence.
Independently investigated cost cleanup discrepancy (STACK-1856) — when expected AWS savings didn't materialize after a large cleanup effort, traced the gap to two causes: DLM policies regenerating EBS snapshots within hours of deletion, and a scanning tool inflating estimates 15-20x by using logical volume size instead of incremental storage. Delivered corrected projections and actionable recommendations to management.
Contributed to Spotinst Ocean → Karpenter + nOps migration — removed Spotinst nodeSelectors and scale-down annotations across online and polaris K8s workloads. Led organization-wide AWS resource tagging rollout (EBS, EFS, EC2, and more) required for nOps FinOps platform to function across teams and accounts.
Executed EKS core-services upgrades — as part of 1.31→1.35 cluster upgrade initiative, owned Traefik v2→v3 Helm chart migration (resolved breaking schema changes and CrashLoopBackOff on first deploy) and kube-prometheus-stack upgrade in sandbox. Executed full core-services upgrade sequence on eu-prod-euw1 production cluster, validating each service individually via ArgoCD.
Contributed to unused resource cleanup across AWS accounts — audited assigned accounts as part of org-wide cost reduction initiative, cross-referencing flagged resources against Terraform state, K8s workloads, and AWS API before deletion.
Supported Coralogix observability and PagerDuty operations — maintained OTEL integration across clusters, supported log-based alerting pipelines. Contributed to PagerDuty escalation policy setup. Evaluated FireHydrant, Incident.io, and Better Stack as incident management platforms and contributed findings to leadership decision.

Cloud & IaC	AWS (EKS, EBS, EFS, S3, ECR, RDS, ALB/NLB, TGW, Route53, VPC, IAM), Terraform, Terragrunt, Atlantis, Crossplane v2, Docker
Kubernetes	EKS, ArgoCD, Helm, Karpenter, KEDA, Kyverno, cert-manager, external-secrets, Traefik, Cilium, node-local-dns
CI/CD	GitHub Actions, GitOps workflows, GitHub Packages, secrets management
Observability	PagerDuty, Coralogix, OTEL, kube-prometheus-stack, nOps, FireHydrant, Incident.io, Better Stack, CheckMK, Nagios
Networking	AWS VPC/TGW/VPN, Cilium, Traefik, OpenVPN, DNS, load balancers (ALB/NLB)
Legacy / Other	Linux, Windows Server, VMware vSphere, SAN Storage (Nimble, EqualLogic, Compellent), MongoDB, HashiCorp Vault

Professional Summary

Core Competencies

Professional Experience

Technical Skills