Job Title: Site Reliability Engineer (SRE)
Location: San Diego, CA
Rate: $60-65/hr
We are seeking an experienced and proactive Site Reliability Engineer (SRE) to join our team in San Diego, CA. This role will be responsible for ensuring high availability, reliability, and recoverability of platforms, leveraging best practices and innovative solutions to manage and optimize our systems. As an SRE, you will work with cross-functional teams to maintain and improve the performance and resiliency of our cloud infrastructure, while also driving key initiatives for incident management, monitoring, and disaster recovery.
Key Responsibilities:
High Availability & Reliability:
Apply best practices to ensure and improve the availability, reliability, and recoverability of platforms and services.
Work with proprietary tools that address weaknesses in incident management and software delivery.
Disaster Recovery & Business Continuity:
Design, implement, and automate disaster recovery (DR) processes and business continuity plans.
Perform regular DR trials and ensure system recovery procedures are optimized.
Capacity Management:
Develop and enforce capacity management practices to ensure scalability and optimal resource allocation for systems.
Service Level Indicators (SLIs) & Reliability:
Evaluate and re-architect SLIs to dynamically account for growth, ensuring that reliability is properly represented.
Cloud Observability & Monitoring:
Develop, maintain, and configure cloud observability systems (e.g., DataDog, GCP Logging, RUM, APM) to provide real-time monitoring and insights.
Build flexible monitoring and alerting systems to proactively identify and address issues before they impact production environments.
Performance Optimization:
Develop a framework to assess and optimize system performance, implementing improvements where appropriate.
Partner with development teams to establish application production readiness through thorough testing and release procedures.
Incident Management:
Participate in on-call rotations for incident response, providing timely resolutions to system issues.
Lead postmortem investigations and ensure that root cause analysis is performed for continuous improvement.
Training & Knowledge Sharing:
Participate in ongoing training and knowledge-sharing activities within the engineering teams to ensure a consistent focus on resiliency and system reliability.
Develop documentation focused on resiliency practices so that knowledge is recorded and shared across teams.
Proactive Resiliency Improvements:
Demonstrate a proactive approach to identifying areas within systems and processes where resiliency improvements can be made.
Reusable Monitor Design:
Design a tiered system for reusable monitors across various environments, utilizing configurations maintained in source control for scalability and consistency.
Compliance-Driven Monitoring:
Design and propose monitoring strategies for both production and non-production environments, ensuring financial responsibility and compliance with standards such as GDPR and HIPAA.
Qualifications:
3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or similar role.
Strong experience with cloud platforms (e.g., Google Cloud Platform, AWS, Azure).
Hands-on experience with observability tools such as DataDog, GCP Logging, APM, and RUM.
Familiarity with disaster recovery and business continuity planning.
Expertise in designing, implementing, and optimizing monitoring and alerting systems.
Knowledge of compliance standards (GDPR, HIPAA, etc.) and their impact on cloud infrastructure.
Excellent communication and collaboration skills, with the ability to work across teams and with stakeholders.
Proven ability to optimize system performance and drive improvements proactively.
Experience in developing capacity management and service reliability processes.
Ability to participate in on-call rotations and postmortem investigations.
Experience with software development or application production readiness procedures.
Familiarity with configuration management tools (e.g., Terraform, Ansible).
Experience with version control systems (e.g., Git) for managing configurations and code.