The SRE Engineer is a key role on our Observability & Service Management team, which is responsible for implementing our observability tools, alerts, and automated healing/runbooks. You will work closely with the SRE teams for each product to ensure we have the proper monitoring, metrics, KPIs, and SLIs/SLOs/SLAs with error budgets. You will build scripts and automation to help ensure the proper operation of our cloud environments. As part of the role, you will be involved in monthly patch management processes and annual DR testing practices. The candidate must have solid problem-solving skills, experience supporting large server farms and transitioning manual tasks to DevOps practices, and an understanding of highly available and fault-tolerant architectures. The candidate must also be able to maintain 24x7, highly available, mission-critical, traffic-intensive web infrastructure and be familiar with commonly used server, storage, and virtualization technologies.
- Ensure high availability of a SaaS platform built on .NET and Java-based microservices and monolithic applications.
- Write, debug, and troubleshoot internally developed scripts and jobs.
- Collaborate with development/SRE and other technology teams on requirements definition, capacity planning, and process refinement.
- Use data from a variety of performance and health management tools to deliver a continuous assessment of application performance and reliability.
- Adhere to existing operational processes and maintain up-to-date operational documentation.
- Monitor industry trends. Research, design, develop and implement solutions for fault tolerance, performance and capacity management.
- Participate in a 24/7 on-call rotation and support major incidents as they occur.
- Monitor and manage the ticketing queue (JIRA), and use what you learn from daily work to develop projects that continually improve it.
- Maintain documentation, runbooks, and KPIs for SRE applications and systems.
- Intermediate-level understanding of build/test/deploy automation tools and concepts.
- Very familiar with principles of Continuous Integration and Continuous Delivery.
- Experience with common Microsoft .NET build, test, packaging, and deployment tools and techniques or similar Java technologies.
- Exposure to Kubernetes and containerized runtime environments.
- Experience with scripted provisioning of servers, applications, and/or infrastructure in a production environment.
- Familiarity with automated configuration management tools such as Chef, Puppet, or CFEngine.
- Solid foundation in programming fundamentals (variables, control structures, boolean logic, OOP concepts).
- Experience in at least one of the following programming languages: C#, PowerShell, Ruby, Java.
- Experience with modern software development workflows, including code reviews, revision control, and test-driven development.
- Experience with cloud (IaaS and PaaS) solutions on Google Cloud, Azure, and/or AWS.
- Intermediate-level understanding of design principles for high availability systems and software.
- Experience with troubleshooting distributed web applications.
- Comfortable writing basic SQL queries.
- BS degree in Information Technology, Business or related field or equivalent experience.
- 3+ years related experience.
- A minimum of three years of experience in a 24x7 operations organization.
- Working knowledge of cloud platforms (AWS, GCP and/or Azure).
- Familiarity with container technologies (Kubernetes, Docker, Rancher, etc.).
- Experience with CI/CD pipelines.