
SRE | Permanent | London, Hybrid, AWS
London, Greater London, South East, England
Apply by 2 Apr 2026
£90,000 per annum
Job Ref.: BH-56925
Job Description
- Role: Site Reliability Engineer
- Type: Full-time permanent role
- Location: Hybrid, London City - 3 days per week on-site
- Salary: £90,000 per annum
- Industry: Technology - Gaming Platforms
The role
You will help shape and drive how the firm builds and operates reliable, observable, secure, and cost-efficient systems on AWS. Working closely with development, platform, and incident management teams, you will define reliability in measurable terms and build the tooling and processes to achieve it, improving platform speed, stability, and scalability.
Key responsibilities
- Partner with engineering teams to define, measure, and manage SLOs/SLIs, using error budgets to guide delivery decisions.
- Enhance observability across services (metrics, logs, traces) to detect and resolve issues proactively.
- Lead cost optimisation: monitor spend, right-size workloads, tune autoscaling, and improve infrastructure efficiency.
- Improve production readiness via pre-deployment checks, post-release validation, and robust platform guardrails.
- Introduce and run chaos engineering experiments to strengthen resilience and recovery.
- Automate operational processes to reduce manual intervention and toil across the stack.
- Support major incident response, root-cause analysis, and continual improvement actions.
- Collaborate cross-functionally to raise standards for stability, security, performance, and compliance.
Requirements
- 3 years’ experience in SRE, Platform, or DevOps roles within production environments.
- Strong Kubernetes operational experience (on-prem and AWS EKS).
- Hands-on experience defining and operating SLOs/SLIs, alerting, and incident workflows.
- Deep understanding of observability and telemetry (monitoring, logging, tracing).
- Infrastructure as Code with Terraform; experience with GitOps workflows and CI/CD.
- Scripting proficiency in Python, Bash, or Go.
- Proven ability to balance cost efficiency with reliability and performance.
- Excellent communication skills and the ability to work effectively across multiple teams.
Nice to have
- Experience running chaos engineering experiments.
- Exposure to high-throughput, low-latency systems.
- FinOps knowledge or cost management practices.
- AWS certifications (e.g., Solutions Architect, DevOps Engineer).