Manager, Site Reliability Engineering - 13499
Bromsgrove, United Kingdom
We are seeking an experienced Site Reliability Engineering Manager to lead the team overseeing the operation, performance and reliability of the multi-tenant BlackLine Accounts Receivable SaaS products.
These are hosted in Microsoft Azure datacentres using serverless technologies for PaaS, IaaS and SaaS components.
This position plays a key role in ensuring that BlackLine’s Accounts Receivable products, services, infrastructure and public cloud are carefully planned and deployed in a time, place, and configuration that is ideal for serving BlackLine’s users. Your role encompasses aspects of capacity planning, technical project execution, performance monitoring, site reliability, security and software engineering. You must be equally at home explaining analyses and project recommendations to senior management as you are discussing technical findings with engineers or building tools to automate and scale their impact.
You will manage a 24/7 team of SRE staff responsible for day-to-day operations and monitoring, incident engagement, and disaster recovery activities. The candidate must possess solid critical thinking skills, have experience supporting 24x7 high-availability, mission-critical, traffic-intensive web infrastructures, and be familiar with public cloud hosting.
Roles and Responsibilities
The Site Reliability Engineering Manager will lead a dedicated team of Site Reliability Engineers solving real-life problems in a high-performance, high-traffic environment, including:
- Improves the BlackLine SaaS service experience by discovering and highlighting optimization opportunities with existing code or architectural design to address application availability, performance, observability, efficiency, and security challenges.
- Develops tools and systems to automate the identification, analysis, and remediation of application events, infrastructure issues, or requests.
- Manages incident response and delivers root cause analyses.
- Manages Production Operations, including day-to-day administration of running processes, security and vulnerability management, and 24x7 initial response to system alerts and requests from the Customer team.
- Adheres to change management and other established processes and procedures.
- Supports our continued certification to ISO 27001, ISO 9001 and SOC 2 standards.
- Advocates for change across the organization. Ensures the implementation of change with appropriate communications, goals, resources, metrics, and reviews.
- Partners with internal organizations and vendors to develop multi-year roadmaps influencing the direction and evolution of the operating environment and support protocols.
- Establishes and maintains Key Performance Indicators for the overall health of the service and builds tools to exercise and evaluate whether these KPIs are being met.
- Leads cross-functionally with other teams to surface common pain points, architect solutions, establish conventions, and evangelize application development and operations best practices.
- Maintains and evolves the BlackLine trust site to include real-time availability and performance information.
- Monitors and plans for capacity and growth.
- Maintains documentation and the operational knowledge base.
Years of Experience in Related Field: 8+ years of industry experience. 3+ years of leadership experience.
Education: Bachelor's degree in Information Technology, Business or related field, or equivalent experience.
Technical/Specialized Knowledge, Skills, and Abilities:
- Expertise in reliable and repeatable web application deployment and architecture.
- Someone energized by a fast-paced, iterative approach.
- An ability to balance urgent needs with long-term strategy.
- Strong ownership, pride of work, and ability to take things across the finish line.
- Of particular interest is a specialty in one or more of the following: Multi-page web apps, API integrations, monitoring/alerting, Public Cloud infrastructure management, distributed systems, cloud networking, or application security.
- Hands-on problem-solving and root cause analysis skills, with technical leadership and mentoring qualities.
- Strong written and oral communication skills.
- Manage end-to-end availability and performance of mission critical services and build automation to prevent problem recurrence.
- Lead by example, care for your team, and establish credibility with the quality of the team's technical execution.
- Participate in and manage the on-call rotation for the SRE team.
- Design, write and deliver software to improve the availability, scalability, latency, and efficiency of BlackLine’s services.
- Cross-system and full-stack architecture experience and awareness.
- Ability to communicate well with business owners, executives and technical staff, each at the appropriate level.
- Prior C#, ASP.NET, Ruby, Go or Java development experience, preferably in an agile SaaS environment.
- Working knowledge of cloud platforms (Microsoft Azure strongly preferred).
- Experience in recruiting and managing a team of experienced Engineers.
- Skill in managing and prioritizing the troubleshooting of enterprise services with complex interactions between applications, operating systems, network protocols, and client configurations.
- Capable of technical deep-dives into code, networking, operating systems and storage, yet verbally and cognitively agile enough to hold your own in a strategy discussion with the leadership team.
- Experience with software development processes and methodologies.
- 5+ years supporting a critical, revenue-generating SaaS or hosting environment.
- 3+ years of direct supervisory/management responsibility.
- 3+ years of experience working in a strict change-controlled, 24/7 environment.
- Proven data center management experience.
- Empathy for working with support teams to identify and remedy pain points.
- Strong intra-team and cross-functional collaboration skills, working with individuals at all levels across the organization.
- Strong quantitative and qualitative reasoning skills.
- Strong interpersonal, presentation and communication skills.
- Strong organizational skills and attention to detail.
- Experience with compliance activities associated with ISO 27001 and SOC 2.
- Travel as needed to remote office locations for training, implementation, and/or planning.
- Understanding of ITIL concepts. Certificate in ITIL Foundations or greater is preferred.
- Bachelor's degree in Computer Science or a related discipline, or equivalent experience.
- Relevant Microsoft Certifications – Azure, SQL, Azure DevOps