Member of Technical Staff (Inference)

hcompany · April 10, 2026

🌍 Remote · Worldwide · Full-time

💰 $80,000 – $130,000/yr

GPU Programming · CUDA · LLM Inference · Model Optimization · Python · Machine Learning · Distributed Systems

Job Description

About H

H exists to push the boundaries of superintelligence with agentic AI. By automating complex, multi-step tasks typically performed by humans, AI agents will help unlock full human potential. H is hiring the world's best AI talent, seeking those who are dedicated as much to building safely and responsibly as to advancing disruptive agentic capabilities. We promote a mindset of openness, learning, and collaboration, where everyone has something to contribute.

About the Team

The Inference team develops and enhances the inference stack for serving H-models that power our agent technology. The team focuses on optimizing hardware utilization to reach high throughput, low latency, and cost efficiency in order to deliver a seamless user experience. As a Member of Technical Staff, you will work alongside world-class engineers and researchers to build the infrastructure that powers next-generation AI agents.

Key Responsibilities

Develop scalable, low-latency, and cost-effective inference pipelines
Optimize model performance, including memory usage, throughput, and latency, using techniques such as distributed computing, model compression, quantization, and caching
Develop specialized GPU kernels for performance-critical tasks such as attention mechanisms and matrix multiplications
Collaborate with H research teams on model architectures to enhance efficiency during inference
Review state-of-the-art papers and research to improve memory usage, throughput, and latency (Flash Attention, Paged Attention, Continuous Batching, etc.)
Prioritize cutting-edge inference techniques and integrate them into production systems

Required Qualifications

Technical Skills:

MS or PhD in Computer Science, Machine Learning, or a related field
Proficient in at least one of the following programming languages: Python, Rust, or C/C++
Hands-on experience in GPU programming such as CUDA, OpenAI Triton, or Metal
Demonstrated experience with model compression and quantization techniques

Soft Skills:

Collaborative mindset, thriving in dynamic, multidisciplinary teams
Strong communication and presentation skills
Eager to explore and tackle new technical challenges

Bonus Qualifications

Experience with LLM serving frameworks such as vLLM, TensorRT-LLM, SGLang, or llama.cpp
Experience with CUDA kernel programming and NCCL
Experience with deep learning inference frameworks (PyTorch/ExecuTorch, ONNX Runtime, GGML, etc.)

Location: Remote, Worldwide

💰 Compensation not publicly listed. Market estimate for similar roles: $80K–$130K USD annually, varying by experience and location.