Open Positions


DevOps/SRE Manager

NYC

The DevOps Manager’s main responsibilities are to build and foster a collaborative team, champion state-of-the-art patterns such as Infrastructure as Code and Immutable Infrastructure, and pave the path for the future of our systems. The ideal candidate has the technical chops to dive into the code and a proven ability to provide technical leadership for projects while inspiring a team of DevOps engineers.

This is a technical and managerial position working across our entire stack of AWS services and infrastructure, CI/CD pipelines, and observability tools, while also partnering closely with our Engineering Managers to ensure we are providing the appropriate complements to enable their success.

About the DevOps team

We are an automation-first, cloud-based DevOps team. Our goals include maximizing developer productivity through fine-tuned CI/CD pipelines, automated and scalable infrastructure, and robust systems monitoring. We promote a highly collaborative atmosphere both within our team and across the other teams in the business. We believe continuous learning and adherence to the DevOps philosophy are the keys to success.

Your Main Tasks

  • Strategy and leadership - Collaborate on building a vision for where we need to be for DevOps and infrastructure, and assist in managing a plan for how to move towards the vision.

  • DevOps - Support the engineering teams with infrastructure and tools for automatically building, deploying and running applications with a goal of 100% automation and observability

  • AWS - Build, evolve and manage the AWS infrastructure and services that run our SaaS product

  • Security - Ensure our SaaS product and infrastructure are secure and up to date with the latest security measures

  • Platform reliability and production support - Lead and manage the incident management and escalation process to provide 24/7 support of our SaaS application

What you bring to the table

  • Expert knowledge of cloud computing and AWS with experience building environments that meet high availability and reliability criteria

  • Hands-on experience with continuous integration, continuous delivery and continuous deployment; experience with GitLab a plus

  • Experience with container architecture and systems (such as Docker) and container orchestration tools (such as Nomad or Kubernetes)

  • Strong knowledge of microservices and supporting patterns (Service Discovery - Consul, Service Mesh - Envoy, Circuit Breakers, etc.)

  • 7+ years Linux system administration

  • 5+ years scripting experience with a focus on Python, Go, and Bash

  • 4+ years leading an operations team and managing DevOps engineers

  • 3+ years experience deploying infrastructure using Terraform or CloudFormation

  • 1+ years of experience managing people

  • Experience with the full HashiCorp stack a plus (Nomad, Consul, Vault, Terraform, Vagrant)

  • Excellent written and verbal communication skills

  • Exceptional problem-solving and analytical skills


Principal SRE

New York 

A leading wellness technology company and household name is looking for a Principal SRE to join the team. You would build and maintain monitorable, performant, reliable and highly scalable software systems. Join a fast-paced, growing team of engineers tackling challenging problems at scale, based in a brand-new headquarters in the heart of Manhattan.


THE ROLE:

  • Evangelize best practices for building and operating highly reliable systems

  • Serve as subject matter expert in observability and monitoring

  • Consult in system design to meet reliability and capacity requirements

  • Automate infrastructure and configuration management

  • Conduct timely post-mortems of production infrastructure incidents

  • Assist with all aspects of operational security and compliance

  • Work with Amazon Web Services, Chef, Python, Ubuntu, Nginx, Jenkins, Terraform, Akamai, Elemental

CANDIDATE REQUIREMENTS:

  • Know when to triage and when to dive down into a root-cause analysis

  • Experience developing and monitoring mission-critical systems

  • Substantial experience with a programming language like Python, Golang, Java, C

  • Working knowledge of a centralized configuration tool like Chef, Puppet, or Ansible

  • Experience with or interest in learning about streaming applications and media servers

  • Bonus: experience configuring and monitoring CDNs. We use Akamai, CloudFront, and Cloudflare


Senior Software Engineer, Site Reliability

New York

A leading AI-in-fintech company that produces technology that is scalable and robust, and that solves the challenges of one of the world’s largest, most successful financial institutions.


Are you a senior SRE who is a thoughtful, collaborative, and dynamic technologist who loves building the infrastructure that helps others do their jobs more effectively and efficiently?

The Senior Software Engineer, Site Reliability (SRE) will ensure that the client's services, both internally critical and customer-facing, have reliability and uptime that meet users' expectations. You will work closely with our team of Infrastructure and Application Engineers to come up with scalable solutions.

What You'll Do

    • You will run and stabilize production services that support critical financial applications and backend processes.

    • You will monitor, maintain and help scale services.

    • You will manage end-to-end availability and performance of critical services and build automation to prevent problem recurrence. 

    • You will design and build advanced automated operational and deployment frameworks alongside tooling and infrastructure to help engineering teams measure and increase their velocity.


Technologies You’ll Use

    • Kubernetes, HAProxy, Jenkins, Git, Docker, Kafka, Prometheus, Kibana, Elasticsearch, Grafana, Postgres


SRE, Tools

New Jersey

Responsibilities

  • Understanding development team needs and evangelizing appropriate open-source and proprietary tools to push the boundaries of automation and productivity

  • Liaising with developers and QA teams to implement efficient and robust frameworks supporting development, test and release workflows for critical trading applications

  • Scaling up the CI/CD infrastructure using our hybrid cloud while monitoring its health and effectiveness; drive fixes to underlying problems and optimizations to improve efficiency

  • Quickly detecting, debugging, and resolving build and process failures that have non-code-related causes

Qualifications

  • A Bachelor’s degree in Computer Science, Math, or Physics from a top-tier college or university and at least 6 years of programming experience

  • Proficiency with Python, Bash and/or Go (expertise in at least 2 of them)

  • Thorough understanding of Linux fundamentals and C++ compiling/linking/loading process. Tower is primarily a Linux shop.

  • Deep knowledge of git, git branching, and git workflows

  • Experience with CI/CD frameworks such as Jenkins, Concourse, Travis, or CircleCI

  • Experience with CMake, Conda, GitLab/GitHub and Docker/Kubernetes (preferred)


SRE, Cloud Platform

NYC, CHI, LDN

Our engineering solutions empower the firm with the large-scale systems necessary to pursue a breadth and depth of investment strategies. Site Reliability Engineering is an engineering discipline that combines software and systems engineering to build and run large-scale, distributed, and fault-tolerant systems. We focus on practices that drive iterative improvement to our solutions.

• Design, develop, and deploy elegant software solutions across the firm

• Partner with business leaders to define priorities and deliver custom solutions

• Receive structured learning on technical and quantitative skills

• Develop under the direct sponsorship of our firm’s CTO and engage with other C-level leadership


• Passion for technology and software development

• Proficiency with one or more object-oriented languages (e.g., Java, C++, Python)

• Deep knowledge of distributed service-oriented architecture, relational databases, machine learning/deep learning

• Experience building high-performance, highly available and scalable systems

• Experience building complex software systems that have been successfully adopted by customers over extended time periods

• Ability to deliver short-term results while investing in long-term strategic solutions

• Strong written and verbal communication skills

• Bachelor’s, Master’s or PhD degree in Computer Science or equivalent experience


Site Reliability Engineer

NYC, Chicago

Seeking elite Site Reliability Engineers for one of the world’s most successful buy-side trading businesses

What you will do:

  • Build and maintain the world’s most advanced compute, network and storage infrastructure

  • Build automation to manage and monitor infrastructure services and applications

  • Ensure and improve the reliability, availability and performance of infrastructure and infrastructure-related applications


What we're looking for:

  • 1-4+ years of industry experience in building and managing Compute, Network, and/or Storage systems and/or components

  • Proficient in Python

  • Experience using Linux

  • Experience in automating routine tasks

  • Experience in configuration management and infrastructure provisioning (e.g. Chef, Puppet, Ansible, Terraform)

  • Understanding of networking concepts (e.g., DNS, HTTP, TCP, UDP, IP)


Site Reliability Engineer

New York

IoT powerhouse with a stronghold in the fleet management space and significant investment in a growing platform. As an early member of the Site Reliability Team, your role will be crucial in helping design, scale, and manage a growing AWS-backed infrastructure.

What You'll Do:

  • Automate the provisioning, scaling, and management of our infrastructure using Configuration As Code and Configuration Management

  • Create deployment pipelines; take code from git to production

  • Continuously improve the monitoring and alerting capabilities of our platform, enabling us to be proactive instead of reactive


What We're Looking For:

  • 4+ years of professional SRE/DevOps experience, and a demonstrated ability to work on high-volume production systems

  • Experience with infrastructure as code and configuration management (Terraform, Nix, Ansible, CloudFormation, Chef, etc.), and with build managers such as Bazel, Pants, or Buck

  • Knowledge of Python, Ruby, or Go, and an understanding of relational and NoSQL databases (PostgreSQL a plus)

  • Experience with a container orchestration framework such as Kubernetes or Docker Swarm


Site Reliability Engineer

New York

The SRE team owns the entire production and test infrastructure to help protect hundreds of millions of dollars under custody. The Senior Site Reliability Engineer will be a key player in building, deploying, and scaling the infrastructure that our other engineering teams use.


What you’ll do:

- Scale and build our infrastructure as we build more products that rely on this infrastructure

- You will build observability into our environment and applications that help us monitor and self-heal when problems come up.

- Own infrastructure projects end-to-end that span multiple teams.

- You will make the right trade-off between reliability and product feature speed: come up with metrics that define the trade-off, get buy-in from stakeholders and measure against those.

- Terraform, Puppet, Helm

- You are able to understand and articulate the design and application of the architecture of the entire system

- You have worked with distributed systems, cloud native applications and system design


Site Reliability Engineer - Observability

New Jersey

Site Reliability Engineer - Observability required for a world-class Fixed Income Trading platform business.


As a senior member of the engineering org, you will be key to the technical strategy of the firm. You will be heavily involved in the hardening of the OS, developing deployment tooling, and optimizing the platform for better latency. They are looking for someone with significant experience in Linux platform engineering, including regular system administration, kernel optimization, storage and the Linux networking stack.


They have a strong bias towards Infrastructure as Code, and you will be expected to automate the systems using Python and work on container orchestration (K8s).