Staff Site Reliability Engineer - 13583

Woodland Hills, United States

  • Create and maintain a continuous testing framework that observes, records, and trends real-time availability data for all of our clients
  • Develop and maintain on-premise and cloud capacity plans that ensure we are delivering a BlackLine service that is performant and cost-effective
  • Improve the BlackLine SaaS service experience by discovering and highlighting optimization opportunities with existing code to address application availability, performance, observability, efficiency, and security challenges
  • Lead the development of requirement definitions, capacity planning, and process refinement
  • Develop tools and systems to automate the identification, analysis, and remediation of application events, infrastructure issues, or requests
  • Establish and maintain Key Performance Indicators for the overall health of the service and build tools to exercise and evaluate if these KPIs are being met
  • Work cross-functionally to surface common pain points, architect solutions, establish conventions, and evangelize application development and operations best practices
  • Regularly learn new systems and tools as the BlackLine platform and ecosystem evolve
  • Contribute knowledge, skills, and personal qualities to a dedicated team of top engineers through mentorship and training, solving real-life problems in a bleeding-edge, high-performance, and high-traffic environment
  • Assess, test, track, predict, and report on the performance, responsiveness, capacity, and availability of a suite of production applications
  • Serve as technical lead for large projects, determining objectives and approaches to critical assignments, and potentially overseeing multiple projects concurrently
  • Publish performance result findings, conclusions, and recommendations
  • Support integration of performance data into customer experience analytics tools and reporting
  • Participate in our on-call rotation and conduct incident reviews
  • Other duties as assigned
  • BS or MS in Computer Science (or equivalent diploma and/or certifications) with 7+ years of related experience
  • Advanced knowledge of at least two of the following programming languages: C#, Visual Basic, PowerShell, Java, Go, Linux Shell, Ruby
  • Demonstrated history of developing or operating production web applications and a solid understanding of HTTP(S), HTML, JavaScript, CSS, and XML
  • Significant experience in a lead role on a software development team
  • Baseline understanding of project management processes and procedures, both agile and waterfall, with experience managing one or more small to medium projects
  • Experience deploying high availability systems and software
  • Experience troubleshooting distributed web applications in a production environment
  • Advanced level knowledge of IIS and Windows Server or Linux and Apache
  • Experience with infrastructure as code and platform as a service
  • Experience with configuration management tools (e.g., Chef, Ansible, Puppet) or container orchestration platforms such as Kubernetes or Docker Swarm
  • Advanced knowledge of deploying and managing open source observability tools such as Prometheus, Grafana, and Jaeger, or commercial equivalents
  • Capable of producing clean, readable code in a multi-developer team environment
  • Extensive knowledge of managing cloud platforms and cloud native tools
  • Must possess the ability to handle multiple goals concurrently and function in a fast-paced, demanding, ever-changing, high-growth environment
  • Must maintain the highest level of integrity, courtesy, and respect while interacting with internal customers, employees, and business contacts
  • Ability to communicate effectively (orally and in writing) in all business relationships and with various levels of management in a clear, direct manner
  • Ability to interface with internal technical experts using professional interpersonal skills
  • Experience analyzing datasets to draw conclusions and graphing datasets to support those conclusions
  • Intermediate level proficiency in application load balancing methods (F5 LTM, Windows NLB, etc.)
  • Working knowledge of TCP/IP and networking concepts
  • Proficiency with statistical concepts: confidence intervals, hypothesis testing, sampling
  • Understanding of operating system concepts such as CPU, memory, and disk queues, and of graphing and analyzing these metrics over time
  • Must possess strong organizational skills and be able to work with minimal oversight
  • Ability to understand new technologies quickly and adapt these into daily work and goals