Job Type
Work Type
Location
Experience
Define TASMU 2.0 operational architecture and service management model across capabilities (ITIL-aligned where applicable).
Establish observability standards: metrics/logs/traces/audits, OpenTelemetry instrumentation, dashboarding, alerting, and anomaly detection.
Define SLOs/SLAs/OLAs, error budgets, and operational KPIs; ensure vendors deliver evidence and meet acceptance gates.
Design incident management workflows (triage, escalation, RCA), integrate with ITSM, and standardize runbooks/playbooks.
Define change and release management practices (CAB inputs, deployment rings, canary/rollback, feature-flag coordination).
Establish resiliency and DR requirements: backup/restore patterns, RPO/RTO targets, DR testing cadence, and failover runbooks.
Define capacity, performance, and availability engineering processes (load testing, scaling policies, GPU/TPU capacity planning).
Implement security operations integration: SIEM/SOAR alignment, Defender/Sentinel alert routing, vulnerability/patch management SLAs.
Define FinOps operational controls: tagging standards, showback/chargeback, budgets, anomaly detection, cost optimization playbooks.
Lead operational readiness and handover: L1/L2/L3 training, reverse-shadowing, SOPs, and post-go-live stabilization plans.