We are seeking a highly skilled Ceph Cluster Development & Operations Engineer with strong expertise in C++ systems programming to design, extend, and maintain enterprise-scale Ceph distributed storage clusters. The role involves deep development in Ceph core subsystems (RADOS, OSD, RGW, MDS), performance optimization, and operational excellence across multi-site, multi-zone architectures.

You will work closely with system architects, SREs, and cloud infrastructure teams to ensure the reliability, scalability, and security of mission-critical storage systems deployed across multiple data centers and Kubernetes environments.

Key Responsibilities

Design, build, and operate large-scale Ceph clusters, including RADOS, RGW, and RBD.
Contribute to or extend Ceph core components written in C++ (e.g., OSD, RGW, librados, BlueStore, MGR modules).
Profile and optimize performance across network, disk I/O, and replication layers (PG placement, CRUSH rules, BlueStore tuning).
Develop automation and tooling for cluster lifecycle management (deployment, upgrades, scaling, failover, and recovery).
Integrate Ceph with Kubernetes (via Rook-Ceph, CSI drivers) and CI/CD pipelines for continuous delivery.
Implement and validate multi-site replication and disaster recovery architectures for high availability.
Develop and maintain secure storage solutions using dm-crypt, KMS integration, and CephX authentication.
Build observability pipelines using Prometheus, Grafana, and custom exporters for metrics and health analytics.
Write and maintain SOPs, automation scripts, and system documentation to support production-grade operations.
Collaborate with the upstream Ceph community or maintain in-house forks for feature development and bug fixes.

Qualifications

Required Skills

Strong proficiency in C++ (C++11 or later), with experience in large-scale distributed systems or kernel-adjacent development.
Deep understanding of Ceph architecture and its core components: MON, OSD, MGR, RGW, MDS, and CRUSH maps.
Proficient in Linux systems programming, debugging (gdb, perf, valgrind), and performance profiling.
Experience with Python or Go for tooling and automation.
Strong foundation in data replication, erasure coding, and consistency models in distributed storage.
Hands-on experience with Kubernetes, Rook-Ceph, Helm, Ansible, and related DevOps tools.
Familiarity with TCP/IP, HTTP/S3 APIs, block storage (RBD/iSCSI), and object storage semantics.
Ability to conduct root-cause analysis and lead performance investigations in production environments.

Preferred Skills

Contributions to the Ceph open-source project or prior experience modifying Ceph source code.
Experience with multi-site replication, object versioning, compliance retention, or legal hold features.
Background in distributed storage systems, file systems, or cloud storage platforms.
Familiarity with containerized environments, network virtualization, and cloud-native observability stacks.
Excellent technical documentation and communication skills in English.

Ceph Cluster Development Engineer (C++ Focus)

Fortinet
