Job Description

Job Title: Site Reliability Engineer (SRE)
Location: San Diego, CA
Rate: $60-65/hr

We are seeking an experienced, proactive Site Reliability Engineer (SRE) to join our team in San Diego, CA. In this role, you will be responsible for the high availability, reliability, and recoverability of our platforms, applying best practices to manage and optimize our systems. As an SRE, you will work with cross-functional teams to maintain and improve the performance and resiliency of our cloud infrastructure, and you will drive key initiatives in incident management, monitoring, and disaster recovery.

Key Responsibilities:

High Availability & Reliability:
- Apply best practices to maintain and improve the availability, reliability, and recoverability of platforms and services.
- Work with proprietary tools that address weaknesses in incident management and software delivery.
Disaster Recovery & Business Continuity:
- Design, implement, and automate disaster recovery (DR) processes and business continuity plans.
- Run regular DR tests and keep system recovery procedures optimized.
Capacity Management:
- Develop and enforce capacity management practices to ensure scalability and optimal resource allocation for systems.
Service Level Indicators (SLIs) & Reliability:
- Evaluate and redesign SLIs so that they account for growth and accurately represent reliability.
Cloud Observability & Monitoring:
- Develop, maintain, and configure cloud observability systems (e.g., Datadog, GCP Logging, RUM, APM) to provide real-time monitoring and insights.
- Build flexible monitoring and alerting systems to proactively identify and address issues before they impact production environments.
Performance Optimization:
- Develop a framework to assess and optimize system performance, implementing improvements where appropriate.
- Partner with development teams to establish application production readiness through thorough testing and release procedures.
Incident Management:
- Participate in on-call rotations for incident response, providing timely resolutions to system issues.
- Lead postmortem investigations and ensure that root cause analysis is performed for continuous improvement.
Training & Knowledge Sharing:
- Participate in ongoing training and knowledge-sharing activities within the engineering teams to ensure a consistent focus on resiliency and system reliability.
- Develop documentation focused on resiliency practices so that knowledge is shared across teams.
Proactive Resiliency Improvements:
- Demonstrate a proactive approach to identifying areas within systems and processes where resiliency improvements can be made.

Observability as Code:

Reusable Monitors Design:
- Design a tiered system for reusable monitors across various environments, utilizing configurations maintained in source control for scalability and consistency.
Compliance-Driven Monitoring:
- Design and propose monitoring strategies for both production and non-production environments, ensuring financial responsibility and compliance with standards such as GDPR and HIPAA.

Qualifications:

3+ years of experience as a Site Reliability Engineer or DevOps Engineer, or in a similar role.
Strong experience with cloud platforms (e.g., Google Cloud Platform, AWS, Azure).
Hands-on experience with observability tools such as Datadog, GCP Logging, APM, and RUM.
Familiarity with disaster recovery and business continuity planning.
Expertise in designing, implementing, and optimizing monitoring and alerting systems.
Knowledge of compliance standards (GDPR, HIPAA, etc.) and their impact on cloud infrastructure.
Excellent communication and collaboration skills, with the ability to work across teams and with stakeholders.
Proven ability to optimize system performance and drive improvements proactively.
Experience in developing capacity management and service reliability processes.
Ability to participate in on-call rotations and postmortem investigations.

Preferred Qualifications:

Experience with software development or application production readiness procedures.
Familiarity with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible).
Experience with version control systems (e.g., Git) for managing configurations and code.

Job Tags

Flexible hours

Site Reliability Engineer (SRE) Job at Innova software Services Inc, San Diego, CA
