We are seeking a highly skilled Ceph Cluster Development & Operations Engineer with strong expertise in C++ systems programming to design, extend, and maintain enterprise-scale Ceph distributed storage clusters. The role involves deep development in Ceph core subsystems (RADOS, OSD, RGW, MDS), performance optimization, and operational excellence across multi-site, multi-zone architectures.
You will work closely with system architects, SREs, and cloud infrastructure teams to ensure the reliability, scalability, and security of mission-critical storage systems deployed across multiple data centers and Kubernetes environments.
Key Responsibilities
- Design, build, and operate large-scale Ceph clusters, including RADOS, RGW, and RBD.
- Contribute to or extend Ceph core components written in C++ (e.g., OSD, RGW, librados, BlueStore, MGR modules).
- Profile and optimize performance across network, disk I/O, and replication layers (PG placement, CRUSH rules, BlueStore tuning).
- Develop automation and tooling for cluster lifecycle management (deployment, upgrades, scaling, failover, and recovery).
- Integrate Ceph with Kubernetes (via Rook-Ceph, CSI drivers) and CI/CD pipelines for continuous delivery.
- Implement and validate multi-site replication and disaster recovery architectures for high availability.
- Develop and maintain secure storage solutions using dm-crypt, KMS integration, and CephX authentication.
- Build observability pipelines using Prometheus, Grafana, and custom exporters for metrics and health analytics.
- Write and maintain SOPs, automation scripts, and system documentation to support production-grade operations.
- Collaborate with the upstream Ceph community or maintain in-house forks for feature development and bug fixes.
Qualifications
Required Skills
- Strong proficiency in C++ (C++11 or later), with experience in large-scale distributed systems or kernel-adjacent development.
- Deep understanding of Ceph architecture and its core components: MON, OSD, MGR, RGW, MDS, and CRUSH maps.
- Proficiency in Linux systems programming, debugging (gdb, perf, valgrind), and performance profiling.
- Experience with Python or Go for tooling and automation.
- Strong foundation in data replication, erasure coding, and consistency models in distributed storage.
- Hands-on experience with Kubernetes, Rook-Ceph, Helm, Ansible, and related DevOps tools.
- Familiarity with TCP/IP, HTTP/S3 APIs, block storage (RBD/iSCSI), and object storage semantics.
- Ability to conduct root-cause analysis and lead performance investigations in production environments.
Preferred Skills
- Contributions to the Ceph open-source project or prior experience modifying Ceph source code.
- Experience with multi-site replication, object versioning, compliance retention, or legal hold features.
- Background in distributed storage systems, file systems, or cloud storage platforms.
- Familiarity with containerized environments, network virtualization, and cloud-native observability stacks.
- Excellent technical documentation and communication skills in English.