
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. At Goldman Sachs, SRE is responsible for the availability and reliability of our firm's most critical platform services, and ensures they meet the requirements of our internal and external users. We look for engineers who are motivated to collaborate with our businesses to build and run sustainable production systems that can evolve and adapt to changes in our fast-paced, global business environment.
The SRE team develops and maintains platforms that enable GS Engineering teams to adhere to Observability requirements and SLA Management. It is part of SRE Platforms, which is responsible for designing, developing, and operating distributed systems that provide observability for Goldman's mission-critical applications and platform services. These systems span on-premises datacentres and multiple public cloud environments. We design and build highly scalable tools that provide these observability functions to our global engineering teams.
The products and services we provide to our internal customers are used by thousands of engineers every day. We believe that reliability is the most important feature of any system, and we are devoted to giving our engineers the tools they need to build and operate reliable products.
As a developer in the SRE team, you will work with internal customers, vendors, product owners, and SREs to design and develop a large-scale distributed system that handles alert generation, metrics collection, log collection, and trace events. You will run a production environment spanning cloud and on-prem datacentres. You will define observability features and drive their implementation.