The DevOps Manager's main responsibilities are to build and foster a collaborative team, champion state-of-the-art patterns such as Infrastructure as Code and Immutable Infrastructure, and pave the path for the future of our systems. The ideal candidate has the technical chops to dive into the code, but also has the proven ability to provide technical leadership for projects while inspiring a team of DevOps engineers.
This is a technical and managerial position working across our entire stack of AWS services and infrastructure, CI/CD pipelines, and observability tools, while also partnering closely with our Engineering Managers to ensure we are providing the appropriate complements to enable their success.
About the DevOps team
We are an automation-first, cloud-based DevOps team. Our goals include maximizing developer productivity through fine-tuned CI/CD pipelines, automated and scalable infrastructure, and robust systems monitoring. We promote a highly collaborative atmosphere both within our team and across the other teams within the business. We believe continuous learning and adherence to the DevOps philosophy are the keys to success.
Your Main Tasks
Strategy and leadership - Collaborate on building a vision for where we need to be for DevOps and infrastructure, and assist in managing a plan for how to move towards the vision.
DevOps - Support the engineering teams with infrastructure and tools for automatically building, deploying, and running applications, with a goal of 100% automation and observability.
AWS - Build, evolve, and manage the AWS infrastructure and services that run our SaaS product.
Security - Ensure our SaaS product and infrastructure are secure and up to date with the latest security measures.
Platform reliability and production support - Lead and manage the incident management and escalation process to provide 24/7 support for our SaaS application.
What you bring to the table
Expert knowledge of cloud computing and AWS with experience building environments that meet high availability and reliability criteria
Hands-on experience with continuous integration, continuous delivery, and continuous deployment; experience with GitLab a plus
Experience with container architecture and systems (such as Docker) and container orchestration tools (such as Nomad or Kubernetes)
Strong knowledge of microservices and supporting patterns (service discovery with Consul, service mesh with Envoy, circuit breakers, etc.)
7+ years Linux system administration
5+ years scripting experience with a focus on Python, Go, and Bash
4+ years leading an operations team and managing DevOps engineers
3+ years experience deploying infrastructure using Terraform or CloudFormation
1+ years of experience managing people
Experience with the full HashiCorp stack a plus (Nomad, Consul, Vault, Terraform, Vagrant)
Excellent written and verbal communication skills
Exceptional problem solving and analytical skills
A leading wellness technology household name is looking for a Principal SRE to join the team. You would build and maintain monitorable, performant, reliable, and highly scalable software systems. Join a fast-paced, growing team of engineers tackling challenging problems at scale, based in a brand-new headquarters in the heart of Manhattan.
Evangelize best practices for building and operating highly reliable systems
Serve as subject matter expert in observability and monitoring
Consult in system design to meet reliability and capacity requirements
Automate infrastructure and configuration management
Conduct timely post-mortems of production infrastructure incidents
Assist with all aspects of operational security and compliance
Work with Amazon Web Services, Chef, Python, Ubuntu, Nginx, Jenkins, Terraform, Akamai, Elemental
Know when to triage and when to dive down into a root-cause analysis
Experience developing and monitoring mission-critical systems
Substantial experience with a programming language such as Python, Go, Java, or C
Working knowledge of a centralized configuration tool such as Chef, Puppet, or Ansible
Experience with or interest in learning about streaming applications and media servers
Bonus: experience configuring and monitoring CDNs. We use Akamai, CloudFront, and Cloudflare
Senior Software Engineer, Site Reliability
A leading AI-in-fintech company that produces scalable, robust technology to solve the challenges of one of the world’s largest, most successful financial institutions.
Are you a senior SRE and a thoughtful, collaborative, and dynamic technologist who loves building the infrastructure that helps others do their jobs more effectively and efficiently?
The Senior Software Engineer, Site Reliability (SRE) will ensure that the client's services, both internally critical and customer-facing, deliver the reliability and uptime that users expect. You will work closely with our team of Infrastructure and Application Engineers to come up with scalable solutions.
What You'll Do
You will run and stabilize production services that support critical financial applications and backend processes.
You will monitor, maintain, and help scale services.
You will manage end-to-end availability and performance of critical services and build automation to prevent problem recurrence.
You will design and build advanced automated operational and deployment frameworks alongside tooling and infrastructure to help engineering teams measure and increase their velocity.
Technologies You’ll Use
Kubernetes, HAProxy, Jenkins, Git, Docker, Kafka, Prometheus, Kibana, Elasticsearch, Grafana, Postgres
Understanding development team needs and evangelizing appropriate open-source and proprietary tools to push the boundaries of automation and productivity
Liaising with developers and QA teams to implement efficient and robust frameworks supporting development, test, and release workflows for critical trading applications
Scaling up the CI/CD infrastructure using our hybrid cloud while monitoring its health and effectiveness; driving fixes to underlying problems and optimizations to improve efficiency
Quickly detecting, debugging, and resolving build and process failures that have non-code-related causes
A Bachelor’s degree in Computer Science, Math, or Physics from a top-tier college or university and at least 6 years of programming experience
Proficiency with Python, Bash and/or Go (expertise in at least 2 of them).
Thorough understanding of Linux fundamentals and the C++ compiling/linking/loading process. Tower is primarily a Linux shop.
Deep knowledge of Git, Git branching, and Git workflows
Experience with CI/CD frameworks such as Jenkins, Concourse, Travis, or CircleCI
Experience with CMake, Conda, GitLab/GitHub, and Docker/Kubernetes (preferred)
SRE, Cloud Platform
NYC, CHI, LDN
Specifically, our engineering solutions empower the firm with the large-scale systems necessary to pursue a breadth and depth of investment strategies. Site Reliability Engineering is an engineering discipline that combines software and systems engineering to build and run large-scale, distributed, and fault-tolerant systems. We focus on practices that drive iterative improvement to our solutions.
• Design, develop, and deploy elegant software solutions across the firm
• Partner with business leaders to define priorities and deliver custom solutions
• Receive structured learning on technical and quantitative skills
• Develop under the direct sponsorship of our firm’s CTO and engage with other C-level leadership
• Passion for technology and software development
• Proficiency with one or more object-oriented languages (e.g., Java, C++, Python)
• Deep knowledge of distributed service-oriented architecture, relational databases, and machine learning/deep learning
• Experience building high-performance, highly available, and scalable systems
• Experience building complex software systems that have been successfully adopted by customers over extended time periods
• Ability to deliver short-term results while investing in long-term strategic solutions
• Strong written and verbal communication skills
• Bachelor’s, Master’s or PhD degree in Computer Science or equivalent experience
Site Reliability Engineer
Seeking elite Site Reliability Engineers for one of the world's most successful buy-side trading businesses
What you will do:
Build and maintain the world’s most advanced compute, network and storage infrastructure
Build automation to manage and monitor infrastructure services and applications
Ensure and improve the reliability, availability and performance of infrastructure and infrastructure-related applications
What we're looking for:
1-4+ years of industry experience building and managing compute, network, and/or storage systems and/or components
Proficient in Python
Experience using Linux
Experience in automating routine tasks
Experience in configuration management and infrastructure provisioning (e.g. Chef, Puppet, Ansible, Terraform)
Knowledge of networking concepts (e.g., DNS, HTTP, TCP, UDP, IP)
Site Reliability Engineer
An IoT powerhouse with a stronghold in the fleet management space and significant investment in a growing platform. As an early member of the Site Reliability Team, you will play a crucial role in helping design, scale, and manage a growing AWS-backed infrastructure.
What You'll Do:
Automate the provisioning, scaling, and management of our infrastructure using Configuration as Code and configuration management
Create deployment pipelines that take code from Git to production
Continuously improve the monitoring and alerting capabilities of our platform, enabling us to be proactive instead of reactive
What We're Looking For:
4+ years of professional SRE/DevOps experience and a demonstrated ability to work on high-volume production systems
Experience with infrastructure as code and configuration management (Terraform, Nix, Ansible, CloudFormation, Chef, etc.), and with build managers such as Bazel, Pants, or Buck
Knowledge of Python, Ruby, or Go, and an understanding of relational and NoSQL databases (PostgreSQL a plus)
Experience with a container orchestration framework such as Kubernetes or Docker Swarm
Site Reliability Engineer
The SRE team owns the entire production and test infrastructure to help protect hundreds of millions of dollars under custody. The Senior Site Reliability Engineer will be a key player in building, deploying, and scaling the infrastructure that our other engineering teams use.
What you’ll do:
- Scale and build out our infrastructure as we build more products that rely on it
- You will build observability into our environment and applications to help us monitor and self-heal when problems come up.
- Own infrastructure projects end-to-end that span multiple teams.
- You will make the right trade-off between reliability and product feature speed – come up with metrics that define the trade-off, get buy-in from stakeholders, and measure against those
Terraform, Puppet, Helm
• You are able to understand and articulate the design and application of the architecture of the entire system
• You have worked with distributed systems, cloud-native applications, and system design
Site Reliability Engineer - Observability
Site Reliability Engineer - Observability required for a world-class fixed income trading platform business.
As a senior member of the engineering org, you will be key to the technical strategy of the firm. You will be heavily involved in hardening the OS, developing deployment tooling, and optimizing the platform for lower latency. They are looking for someone with significant experience in Linux platform engineering, including regular system administration, kernel optimization, storage, and the Linux networking stack.
They have a strong bias towards Infrastructure as Code, and you will be expected to automate the systems using Python and work on container orchestration (K8s).