The Site Reliability Engineer (SRE) plays a pivotal role in ensuring that BlackLine’s services and infrastructure are carefully planned and deployed at a time and place, and in a configuration, that are ideal for serving BlackLine’s clients. The SRE role sits at the nexus of capacity planning, technical project execution, product planning, business analysis, site reliability, and software engineering.
The Site Reliability Engineer is responsible for assessing, testing, tracking, predicting, and reporting on the performance, responsiveness, capacity, and availability of a suite of production applications.
- Improve and maintain a continuous metrics framework that observes, records, and trends real-time availability data for all of our clients
- Develop and maintain on-premises and cloud capacity plans that ensure we are delivering a BlackLine service that is performant and cost-effective
- Collaborate with development and other technology teams on requirements definition, observability standards, capacity planning, and process refinement
- Improve the BlackLine SaaS service experience by discovering and highlighting optimization opportunities in existing code to address application availability, performance, observability, efficiency, and security challenges.
- Develop tools and systems to automate the identification, analysis, and remediation of application events, infrastructure issues, or requests.
- Establish and maintain Key Performance Indicators for the overall health of the service and build tools to exercise and evaluate whether these KPIs are being met.
- Work cross-functionally with other teams to surface common pain points, architect solutions, establish conventions, and evangelize application development and operations best practices.
- Transform discoveries into requests to others or action items for you and your team.
- Regularly learn new systems and tools as the BlackLine platform and ecosystem evolve.
- Own and evolve the BlackLine Trust site to include real-time availability and performance information
- Contribute knowledge, skills, and personal qualities to a dedicated team of top engineers solving real-life problems in a bleeding-edge, high-performance, and high-traffic environment.
- Publish performance findings, conclusions, and recommendations
- Create second-tier analyses of capacity constraints and performance, and discuss them with development and infrastructure teams
- Support integration of performance data into customer experience analytics tools and reporting
- Ensure application and infrastructure capacity management efforts have verifiable capacity data to support business cases
- Monitor industry trends and keep abreast of new tools and technologies.
- Participate in our on-call rotation, act as crisis manager/tier 3 technical support for major incidents, and conduct incident reviews
- Other duties as assigned
- Intermediate to advanced knowledge of at least one of the following programming languages: C#, Visual Basic, PowerShell, Java, Go, Linux Shell, Ruby.
- Knowledge of software development best practices, SDLC, CI/CD.
- Experience deploying high-availability systems and software.
- Experience with troubleshooting distributed web applications in a production environment.
- Intermediate-level knowledge of IIS and Windows Server or Linux and Apache.
- Experience with infrastructure as code and platform as a service.
- Experience with configuration management tools (e.g., Chef, Ansible, Puppet).
- Must possess the ability to handle multiple goals concurrently and function in a fast-paced, demanding, ever-changing, high-growth environment
- Must maintain the highest level of integrity, courtesy and respect while interacting with internal customers, employees and business contacts
- Excellent oral and written communication skills
- Ability to interface with internal technical experts using professional interpersonal skills
- Experience analyzing datasets to draw conclusions and graphing the data that supports those conclusions
- Exhibit creative problem-solving, logical troubleshooting and analytical skills
- Basic proficiency in application load balancing methods (F5 LTM, Windows NLB, etc.)
- Working knowledge of TCP/IP and networking concepts
- Proficiency with statistical concepts: confidence intervals, hypothesis testing, sampling
- Knowledge of operating system concepts such as CPU, memory, and disk queues, and experience graphing and analyzing these metrics over time
- Must possess strong organizational skills and be able to work with minimal oversight
- Ability to understand new technologies quickly and adapt these into daily work and goals
- BS or MS in Computer Science (or equivalent diploma and/or certifications) with 7–10 years of related experience.
- Prior C#, ASP.NET, Ruby, Go or Java development experience, preferably in an agile SaaS environment.
- Significant experience with open source platforms and technologies.
- Experience with software development processes and methodologies.
- Track record of architecting, developing, and implementing robust, distributed online solutions.