Your Team Responsibilities
Our mission is to embed AI-driven automation, telemetry, and observability into MSCI's production environments, enabling the Quality Center of Excellence team to deliver on its objectives of operational excellence, risk reduction, and quality at scale. We serve as the engineering backbone of production governance, ensuring that systems are reliable, efficient, and continuously improving through data-driven insights.
Your Key Responsibilities
- AI Tooling & Framework Development
- Build AI-driven tools for anomaly detection, incident triage, and root-cause analysis in production systems.
- Develop and deploy automation frameworks to support Level 1 / Level 2 support teams and streamline repetitive operational tasks.
- Create self-healing and predictive monitoring capabilities using ML models.
- Telemetry & Observability
- Implement telemetry pipelines in GCP (e.g., Stackdriver/Cloud Monitoring, BigQuery, Pub/Sub) and Azure (e.g., Application Insights, Log Analytics, Monitor).
- Build dashboards, automated alerting, and intelligent log/metric analysis frameworks across hybrid cloud environments.
- Use distributed tracing and logging frameworks to provide end-to-end visibility across systems.
- Incident Management Automation
- Design AI-assisted runbooks to support incident triage and resolution.
- Automate classification and escalation of incidents using ML and rule-based systems.
- Integrate AI-powered insights with existing incident management platforms (e.g., ServiceNow, PagerDuty, Opsgenie).
- Collaboration
- Work closely with production teams, SREs, system test engineers, and incident managers to integrate AI solutions into live environments.
- Partner with cloud engineering teams to ensure solutions are scalable, secure, and compliant.
- Provide technical knowledge transfer and training on AI-enabled tools for support engineers.
Your skills and experience that will help you excel
- Bachelor's or Master's degree in Computer Science, Data Engineering, or a related field.
- 11+ years of hands-on engineering experience in production support, SRE, or AI/ML platform development.
- Strong programming skills in Python (preferred) and experience with AI/ML frameworks (PyTorch, TensorFlow, Scikit-learn).
- Hands-on expertise with GCP (BigQuery, Pub/Sub, Cloud Monitoring, Vertex AI) and Azure (Application Insights, Log Analytics, Azure ML, Azure Monitor).
- Experience with observability tools (Prometheus, Grafana, ELK, Splunk, Datadog).
- Proficiency with cloud-native infrastructure (Kubernetes, Docker, Terraform, CI/CD pipelines).
- Strong understanding of incident management and ITIL practices.
- Experience implementing AIOps solutions in hybrid cloud environments.
- Knowledge of MLOps best practices (model deployment, monitoring, retraining).