A leading Abu Dhabi-based holding group is hiring an AI/ML/DevOps Engineer to architect, operate, and continuously improve the end-to-end MLOps and LLMOps platform for a flagship enterprise AI programme. You'll be the technical authority reviewing, governing, and signing off CI/CD, data and model pipelines, infrastructure, deployment, security, and observability — ensuring secure, scalable, and compliant delivery across environments. Reports to the AI Product Manager within the AI Excellence Centre.
What you'll own:
Own the end-to-end MLOps/LLMOps reference architecture: ingestion → validation → feature and embedding pipelines → training and fine-tuning → evaluation → registry → deployment → monitoring — including RAG and agentic workflows.
Architect, review, and approve CI/CD for ML and LLM systems: code, data, prompt, and model artifact versioning; build and release pipelines (Azure DevOps / GitHub Actions); automated unit, integration, and contract testing; and promotion/rollback (blue-green / canary) across dev, test, and production.
Define and govern AI platform foundations on Azure: IaC (Bicep/Terraform), Azure Machine Learning (AML) workspaces, AKS GPU node pools and scheduling, private networking (VNet integration / Private Link), identity (Managed Identities / PIM), secrets (Key Vault), and encryption and data residency controls.
Review and approve production deployment patterns for model and LLM serving (AKS / KServe / AML online endpoints), including containerization, inference optimization (batching, quantization where applicable), API management, autoscaling, resiliency, and RAG runtime components (vector store, retriever, re-ranker, cache).
Own observability and reliability for AI services: OpenTelemetry tracing, prompt and inference logs (with PII controls), latency/throughput/cost metrics, SLOs/SLIs, model performance monitoring, data and model drift detection, and LLM evaluations (quality, hallucination checks, toxicity and safety guardrails), all backed by incident playbooks.
Establish and enforce MLOps/LLMOps governance: dataset lineage, data quality validation (schema and tests), feature store and model registry standards, artifact provenance (SBOM/SLSA), vulnerability scanning, approval gates for model and prompt releases, and compliance-aligned documentation for model risk (intended use, limitations, evaluation results).
Enable delivery squads — including the primary delivery partner — with "golden path" templates (AML pipelines, RAG blueprints, evaluation harnesses), reusable IaC modules, and coding standards; run deep technical design and architecture reviews and sign off production readiness (capacity, security, observability, DR) for all AI releases.
Support the Run & Operate model by enabling issue triage and minor enhancement workflows (ticket intake → fix → controlled release), ensuring changes follow the same release governance and quality gates.
Own the Operational Acceptance Gate: no production release without runbooks, monitoring dashboards, incident playbooks, access model, and DR test evidence.
Scope clarity: you provide platform standards, review, and sign-off — you do not replace the delivery partner's engineering, but you enforce the "golden path" and production readiness bar.