Job Description
About the company
We provide enterprise support and consulting for open‑source analytics and data infrastructure platforms such as Apache Druid, Apache Flink, StarRocks, and other emerging technologies.
Our customers run mission‑critical, high‑volume systems and rely on us to keep them fast, stable, and available. We’re a small, remote‑first team of world‑class experts working across multiple time zones (US, Brazil, Europe, India, Philippines), supporting 100+ customer environments with SLAs ranging from advisory support to 24/7 incident coverage.
About the role
We’re looking for an experienced Service Delivery Manager to take ownership of our service operations:
SLAs and incident processes
on‑call and skills coverage
SOPs and first‑line/SRE enablement
configuration management
SLA metrics and reporting
and coordination between customers and our engineering teams.
This is a hands‑on role, not a pure governance role. You will be close to real incidents, engineers, and customers, and you’ll be expected to bring in practices you’ve already used successfully in previous service or managed‑services environments.
What you’ll do
1. Service operations, on‑call & incidents
Design and maintain an on‑call and coverage plan that ensures all critical skills are available when needed (initially weekdays, evolving to full 24/7 where required).
Own the incident management process for your accounts: priorities, roles, communication cadence, escalations, and post‑incident reviews.
Define and monitor key service metrics (e.g., MTTA, MTTR, SLA compliance, backlog health) and drive improvements based on them.
Act as incident lead/coordinator during major incidents, keeping engineers focused and customers informed.
2. SOPs, runbooks & first‑line enablement
Create and maintain SOPs, runbooks, and triage guides for SRE engineers, covering common incident types and operational tasks.
Train and coach first‑line/SRE teams so they can confidently handle initial triage, basic troubleshooting, and clear communication, escalating only when needed.
Continuously refine documentation based on real incident experience and feedback.
3. Configuration management & readiness
Establish and run a configuration management process that keeps track of each customer’s environment (platforms in use, clusters, regions, configs, access, monitoring, key contacts).
Proactively close information gaps by working directly with customers and engineers.
Ensure configuration information is available and trustworthy during incidents and for onboarding new engineers.
4. Customer communication & governance
Be the primary operational contact for a set of enterprise customers.
Lead regular service reviews and status calls, presenting SLA performance, key incidents, risks, and improvement actions.
Present and agree on the incident management process with customers (channels, priorities, escalation paths, expectations).
Work closely with Account Management / Sales on renewals, expansions, and expectation management.
5. Commercial & delivery management
Clarify what is in scope vs. out of scope and work with customers and Sales to shape paid change requests when additional work is needed.
Monitor effort vs. contract, help protect margins, and flag risks early (under‑scoped contracts, chronic over‑use, under‑utilized capacity).
Work in a matrix environment, coordinating with different technical teams (e.g., database engineering, DevOps, SRE) to staff and deliver engagements effectively.
6. Onboarding & training
Design and maintain onboarding paths for new engineers joining support/delivery (shadowing, training on SOPs, environment overviews, “certification” on certain incident types).
Ensure new team members reach a productive, independent state quickly and safely.
What success looks like in 6–12 months
On‑call coverage is clear, predictable, and sustainable; engineers know when they’re on and what’s expected.
First‑line/SREs handle a meaningful share of incidents without escalation, using well‑maintained runbooks.
You can open a customer’s configuration, see an accurate picture, and use it during incidents and planning.
SLA and incident metrics are tracked, reported, and discussed regularly with customers and internally.
Customers have a clear understanding of how incidents are handled and feel confident in the process.
New engineers ramp up faster thanks to structured onboarding and training.
You’ll be a great fit if you have
5+ years in a Service Delivery, Managed Services, IT Operations, or Enterprise Support role serving external customers (not only internal IT).
Experience with 24/7 or extended‑hours operations, including on‑call or follow‑the‑sun setups.
Hands‑on experience with incident management and ITSM practices (incident/problem/change), ideally in an ITIL‑inspired environment.
A track record of creating or improving SOPs/runbooks and training first‑line/SRE teams.
Experience maintaining configuration/environment data for customer systems.
Comfort discussing technical topics with engineers (cloud, distributed systems, data platforms) and explaining them in clear business terms to customers.
Experience in commercial delivery: scope boundaries, change requests, effort vs. revenue, working alongside Sales / Account Management.
Strong communication skills in English, both written and spoken.
Nice to have
Background with data, analytics, or streaming platforms (e.g., Druid, Kafka, Flink, StarRocks, ClickHouse, TiDB, Hadoop, cloud data warehouses).
Experience working in small, fast‑moving, remote teams.
Location & working style
Remote‑first: we collaborate online across multiple time zones.
The role requires regular overlap with European and North American business hours.
We are flexible on contract structure (direct employment, employment via a global payroll partner, or contractor/B2B), depending on your location and preference.