Site Reliability Engineer

Reliability as an engineering discipline: SLOs, observability, incident management, and systematically stable production operations.

  • SRE
  • Site Reliability Engineering
  • Monitoring
  • Alerting
  • SLO
  • SLA
  • Incident Response
  • Observability
  • Chaos Engineering
  • Reliability

SREs treat operational problems as engineering problems. Reactive firefighting gives way to systematic, measurable reliability processes that evolve alongside the product’s requirements.

Anyone who runs systems knows that 100% uptime is not a realistic goal. An SRE defines what actually matters, makes it measurable through SLIs and SLOs, and uses error budgets so that failures stay within agreed limits instead of turning into outages.
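As a rough illustration of what "measurable" and "controlled amounts" can mean in practice, the sketch below computes a request-based error budget from an SLO target. The class name `ErrorBudget`, the helper `allowed_downtime_minutes`, and all numbers are hypothetical and chosen only for this example; real SLOs depend on the service and on what its users actually need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative only)."""

    slo_target: float     # assumed target, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that violated the SLI in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched, 0.0 means it is used up,
        # and a negative value means the SLO has been breached.
        allowed = self.allowed_failures
        if allowed == 0:
            return 1.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / allowed


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based view: minutes of full outage the SLO tolerates per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures:   {budget.allowed_failures:.0f}")
    print(f"budget remaining:   {budget.remaining_fraction:.1%}")
    print(f"downtime per 30 d:  {allowed_downtime_minutes(0.999):.1f} min")
```

With these assumed numbers, a 99.9% target over one million requests tolerates about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget. A team might use a figure like this to decide whether to keep shipping changes or to prioritise stability work.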

Core Skills

SLO / SLA / Error Budgets, Observability, Prometheus, Grafana, Loki, OpenTelemetry, Alertmanager, PagerDuty / Opsgenie, Incident Management, Runbooks, Blameless Postmortems, On-Call Processes, Chaos Engineering, SLI Definition, Toil Reduction, Kubernetes, Linux

Common skills are foundational skills shared by all our experts.
  • Git / GitHub / GitLab
  • Jira / Confluence
  • Slack / Microsoft Teams
  • Scrum / Kanban
  • Agile Methods
  • CI/CD
  • Code Review
  • REST APIs
  • Docker
  • Linux
  • Technical Documentation
  • German / English
  • Remote Work

Technology Stack

  • Prometheus / Thanos
  • Grafana / Loki
  • OpenTelemetry
  • PagerDuty / Opsgenie
  • Kubernetes
  • Chaos Engineering Tools
  • Runbooks / Playbooks
