Artificial Intelligence
AI Performance Engineer
AI Performance Engineers optimize the speed, throughput, and resource efficiency of machine learning models from training to production inference. They sit at the intersection of systems engineering, hardware architecture, and ML research — profiling where compute is wasted, redesigning pipelines to eliminate bottlenecks, and making large models fast enough to serve millions of requests at acceptable cost. The role has become critical as enterprises discover that a model that runs in the lab rarely runs economically at scale.
Role at a glance
- Typical education
- Bachelor's or Master's in Computer Science, Computer Engineering, or Electrical Engineering
- Typical experience
- 4-8 years in systems, HPC, or ML engineering
- Key certifications
- None formally required; NVIDIA Deep Learning Institute credentials and CUDA programming portfolios serve as practical signals
- Top employer types
- Hyperscalers (Google, Microsoft, AWS, Meta), AI-native labs (OpenAI, Anthropic), AI chip companies (Nvidia, AMD), large enterprise AI teams
- Growth outlook
- Faster than the ~25% BLS projection for software developers; AI inference optimization postings have grown substantially year over year through 2025
- AI impact (through 2030)
- Strong tailwind with mixed augmentation — AI-assisted auto-tuning tools (TVM, MLIR backends) automate parts of the optimization search, amplifying experienced engineers' output while raising the baseline expectation; practitioners who can direct and extend automated optimization stacks will outpace those doing only manual tuning.
Duties and responsibilities
- Profile training and inference workloads on GPU and TPU clusters to identify compute, memory, and communication bottlenecks
- Apply quantization techniques — INT8, FP8, GPTQ, AWQ — to reduce model size and inference latency without unacceptable accuracy loss
- Implement and tune kernel-level optimizations using CUDA, Triton, or vendor-provided libraries such as cuDNN and cuBLAS
- Configure and benchmark inference runtimes including TensorRT, vLLM, DeepSpeed-Inference, and ONNX Runtime across target hardware
- Design and optimize batching strategies, KV-cache management, and speculative decoding for large language model serving
- Analyze roofline models and Nsight profiles to guide architectural decisions on attention mechanisms and feed-forward layer configurations
- Collaborate with ML researchers to validate that optimization techniques preserve model accuracy on downstream evaluation benchmarks
- Benchmark model serving systems for throughput, latency percentiles (P50/P95/P99), and cost-per-token across cloud and on-premise hardware
- Build automated regression pipelines that detect performance regressions in latency or memory footprint before changes reach production
- Document optimization findings, hardware-specific tuning parameters, and benchmark results for platform and research teams
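Quantization, listed among the duties above, reduces weight precision while bounding the numerical error introduced. A minimal sketch of symmetric per-tensor INT8 post-training quantization (illustrative only; production PTQ methods like GPTQ or AWQ are considerably more involved):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: scale maps max |w| to 127."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, s = quantize_int8(w)
# Round-to-nearest bounds the per-weight error by half a quantization step
err = np.max(np.abs(w - dequantize(q, s)))
print(f"max abs error: {err:.6f}, half-step bound: {s / 2:.6f}")
```

The half-step error bound is what makes the accuracy-validation step tractable: the question is never whether quantization perturbs the weights, but whether perturbations of that magnitude change downstream task behavior.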
Overview
AI Performance Engineers solve one of the most expensive problems in applied machine learning: models that are accurate in the lab but too slow or too costly to run in production at scale. As language models have grown from millions to hundreds of billions of parameters, the gap between what a model can theoretically do and what it can economically do has become a critical business constraint. This role exists to close that gap.
The work spans the full stack from hardware to serving infrastructure. On a given day, an AI Performance Engineer might start by pulling Nsight profiles from an overnight training run to understand why GPU utilization dropped below 70% during the backward pass. The diagnosis could point toward a suboptimal all-reduce communication pattern in the distributed training setup, a poorly fused attention kernel, or memory fragmentation in the KV-cache on the inference side. Each diagnosis leads to a different intervention — adjusting the tensor parallelism topology, swapping in a FlashAttention-2 kernel, or tuning the paged attention block size in vLLM.
Inference optimization is where much of the commercial pressure concentrates. Serving a 70-billion-parameter model at acceptable latency and cost-per-token requires quantization (typically INT8 or FP8 for inference), batching strategies that maximize GPU utilization without violating latency SLAs, and sometimes model architecture changes — speculative decoding, mixture-of-experts routing, or draft model approaches — that require close coordination with researchers. The performance engineer validates that these techniques don't degrade downstream task accuracy beyond acceptable tolerances.
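Much of the batching decision above comes down to KV-cache arithmetic. A back-of-envelope sizing sketch, assuming a Llama-2-70B-style configuration with grouped-query attention (80 layers, 8 KV heads, head dimension 128, FP16 cache; illustrative numbers, not a vendor spec):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV-cache footprint: 2x for the K and V tensors, per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed 70B-class config with grouped-query attention (illustrative)
per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch=1)
full_batch = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=32)
print(f"{per_token / 2**10:.0f} KiB per token")       # 320 KiB
print(f"{full_batch / 2**30:.0f} GiB for 32 x 4K ctx")  # 40 GiB
```

At roughly 320 KiB per cached token, 32 concurrent sequences at 4K context consume about 40 GiB of cache alone, which is exactly why paged KV-cache management in systems like vLLM matters as much as the quantization of the weights themselves.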
Benchmarking is not an afterthought — it's foundational. AI Performance Engineers build and maintain benchmark suites that measure throughput (tokens/second), latency distributions (P50, P95, P99), and cost metrics (tokens per dollar) across hardware configurations. Without rigorous benchmarking, optimization claims are anecdotal. With it, the team can make hardware procurement decisions, capacity plans, and serving architecture choices on real data.
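The latency-distribution measurement described above can be sketched in a few lines. The nearest-rank percentile and single-stream throughput model below are simplifications of what a real harness would do (no warmup, no concurrency, synthetic latencies):

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

random.seed(0)
# Synthetic per-request latencies; real numbers come from the serving stack
latencies_ms = [random.lognormvariate(3.0, 0.4) for _ in range(10_000)]
tokens_per_request = 64  # assumed output length for the throughput estimate

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p):.1f} ms")

# Simplistic single-stream model: total tokens / total wall time
throughput = tokens_per_request * len(latencies_ms) / (sum(latencies_ms) / 1000)
print(f"throughput: {throughput:.0f} tokens/s")
```

The tail percentiles, not the mean, are what latency SLAs are written against; a batching change that improves P50 while degrading P99 can still fail the benchmark.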
The role requires genuine comfort with ambiguity. Profiling a large model often surfaces multiple overlapping bottlenecks, and the order of operations for addressing them matters — fixing the wrong one first can mask the real constraint. Practitioners who can build a mental model of where compute is actually going, rather than where it seems to be going, are the ones who find the meaningful wins.
Qualifications
Education:
- Bachelor's or Master's in Computer Science, Computer Engineering, Electrical Engineering, or Applied Mathematics
- PhD valued in research-adjacent roles (hyperscaler AI infra teams, chip companies) but not required for most production-focused positions
- Strong self-taught backgrounds with demonstrated GPU programming projects are credible at AI-native companies that prioritize portfolio over credentials
Experience benchmarks:
- 4–8 years of systems, HPC, or ML engineering experience for mid-level roles
- 2+ years specifically focused on model optimization, GPU programming, or inference serving for senior roles
- Demonstrated end-to-end work: taking a model from unoptimized baseline to production-grade serving configuration
Core technical skills:
- Profiling: Nsight Systems, Nsight Compute, PyTorch Profiler, TensorBoard — reading traces and attributing latency to operations
- Quantization: Post-training quantization (PTQ) with GPTQ, AWQ, SmoothQuant; quantization-aware training (QAT) workflows
- Inference runtimes: TensorRT, vLLM, DeepSpeed-Inference, ONNX Runtime, Hugging Face Text Generation Inference (TGI)
- Kernel programming: CUDA C++, Triton — writing and benchmarking custom fused kernels
- Distributed training: Megatron-LM, DeepSpeed ZeRO, FSDP — understanding tensor, pipeline, and data parallelism tradeoffs
- Frameworks: PyTorch (primary), JAX for TPU-heavy environments, TensorFlow for legacy serving systems
Hardware knowledge:
- NVIDIA GPU architecture: Ampere, Hopper — SM counts, memory bandwidth, NVLink topology
- AMD Instinct series, AWS Trainium/Inferentia, Google TPU v4/v5 — enough to benchmark and compare
- Roofline modeling: identifying whether a kernel is compute-bound or memory-bandwidth-bound
Soft skills that matter:
- Systematic debugging instinct — performance work is detective work, and hypotheses must be tested, not assumed
- Ability to communicate hardware-level findings to ML researchers who don't think in CUDA terms
- Enough accuracy-side ML knowledge to assess whether an optimization changes model behavior in ways that matter
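The roofline check listed under hardware knowledge reduces to comparing a kernel's arithmetic intensity against the machine's balance point. A sketch using H100-SXM-style peak numbers (assumed figures of roughly 990 TFLOP/s dense FP16 and 3.35 TB/s HBM bandwidth; check the vendor datasheet for real values):

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Classify a kernel as compute- or memory-bound via arithmetic intensity."""
    intensity = flops / bytes_moved          # FLOP per byte moved
    ridge = peak_flops / peak_bw             # machine balance point (FLOP/byte)
    attainable = min(peak_flops, intensity * peak_bw)
    kind = "compute-bound" if intensity >= ridge else "memory-bound"
    return kind, attainable

# Assumed H100-SXM-class peaks (illustrative, not a datasheet)
peak_flops, peak_bw = 990e12, 3.35e12

# GEMV-like op: ~2*N^2 FLOPs over ~2*N^2 bytes of FP16 weights -> intensity ~1,
# far below the ~295 FLOP/byte ridge point, so it is memory-bandwidth-bound
kind, attainable = roofline_bound(2 * 8192**2, 2 * 8192**2, peak_flops, peak_bw)
print(kind, f"{attainable / 1e12:.1f} TFLOP/s attainable")
```

This is why decode-phase LLM inference, which is dominated by GEMV-shaped work, gains so much from quantization: shrinking the bytes moved raises arithmetic intensity directly.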
Career outlook
The demand for AI Performance Engineers is being driven by a collision of forces that are not slowing down: model sizes are growing, inference costs are under scrutiny from CFOs who approved GPU budgets that turned out to be larger than expected, and the hardware landscape is fragmenting in ways that require expert navigation rather than plug-and-play deployment.
On the cost side, inference has become the dominant expense for mature AI deployments. Training a large model is a one-time or periodic cost; serving it millions of times per day is continuous. A 20% reduction in inference cost-per-token at a major AI company can translate to nine-figure annual savings. That math means AI Performance Engineers are not a cost center — they are a profit lever, which is why compensation at leading AI labs and hyperscalers is aggressive.
The hardware market is adding pressure and opportunity simultaneously. NVIDIA's H100 and H200 GPUs remain the dominant training and inference accelerators, but AMD, AWS, Google, and a roster of inference-specialized startups are offering credible alternatives for specific workloads. Each new chip requires new optimization work — kernels that run efficiently on Hopper architecture don't automatically transfer to AMD CDNA3 or AWS Inferentia. AI Performance Engineers who can evaluate and optimize across hardware targets are increasingly valuable to companies running multi-cloud or hybrid inference stacks.
The model architecture side is also generating continuous work. The shift from dense transformers to mixture-of-experts models, the adoption of speculative decoding, and the emergence of state-space models like Mamba as transformer alternatives each require a fresh look at the optimization stack. What worked for a dense 7B parameter model doesn't necessarily generalize to a 141B MoE model with sparse routing.
Job growth in this specific title is difficult to measure because the role is new enough that it doesn't appear in BLS occupational categories, but proxies are instructive. AI and ML engineer job postings that include terms like inference optimization, model quantization, and GPU kernel development have grown substantially year over year through 2025, and every major AI lab has expanded its infrastructure and performance teams. The Bureau of Labor Statistics projects software developer and related occupations to grow around 25% through 2032 — AI Performance Engineering, as a premium specialization within that category, is growing faster.
For people currently in HPC, compiler engineering, or embedded systems, this role represents a natural and high-paying career transition. The underlying skills — thinking in memory hierarchies, understanding parallelism, debugging hardware utilization — transfer directly. For ML practitioners who want to go deeper into systems, the investment in CUDA fundamentals and profiling tools pays off quickly in both impact and compensation.
Sample cover letter
Dear Hiring Manager,
I'm applying for the AI Performance Engineer position at [Company]. My background is in GPU systems programming — I spent three years at [Company] building HPC kernels for scientific computing before moving into inference optimization work for large language models 18 months ago.
In my current role I own inference performance for a production serving stack that handles 40 million tokens per day across a 13B and a 70B parameter model. Over the past year I cut cost-per-token by 34% across both models through a combination of GPTQ INT8 quantization on the 70B model, a batching strategy rewrite that improved GPU utilization from 58% to 81% under mixed-priority traffic, and a custom Triton kernel for the attention layer that eliminated a memory bandwidth bottleneck Nsight Compute showed was costing us roughly 12% of throughput.
I've also done the accuracy validation work on both sides — I don't hand an optimized model off to the research team without running it through the evaluation benchmark suite we maintain internally. The quantization work on the 70B model initially showed a 1.8% drop on our hardest reasoning tasks; I worked with the research team on a targeted QAT pass on those task distributions before shipping.
What I'm looking for now is a team working on models at a scale where the hardware topology itself — NVLink bandwidth, inter-node communication, KV-cache memory management under concurrent sessions — is a first-class optimization problem. The serving infrastructure challenges at [Company]'s scale look like exactly that environment.
I'd welcome the chance to talk through the technical details of what your team is working on.
[Your Name]
Frequently asked questions
- What distinguishes an AI Performance Engineer from an ML Engineer?
- ML Engineers typically focus on model development, training pipelines, feature engineering, and getting models to production. AI Performance Engineers specialize in what happens after a model works — making it fast, cheap, and reliable at scale. The job requires deeper familiarity with hardware architecture, CUDA programming, and systems profiling than most ML Engineer roles demand.
- Is CUDA programming required for this role?
- Custom CUDA kernel development is expected at hyperscalers and AI chip companies where squeezing the last 10% of hardware utilization matters. At most enterprise AI teams, fluency with higher-level tools — Triton, TensorRT, vLLM, and framework-level optimizations — is sufficient. Candidates who can drop into CUDA when needed are more competitive for senior roles.
- How is AI hardware diversity affecting this job?
- The market has moved beyond NVIDIA-only deployments. AMD Instinct GPUs, AWS Trainium and Inferentia, Google TPUs, and emerging inference accelerators from Groq and Cerebras each require different optimization strategies. AI Performance Engineers are increasingly expected to benchmark across hardware targets and recommend the right chip for a given model and workload, not just tune for a single vendor.
- What background do most people in this role come from?
- The most common paths are high-performance computing (HPC), compiler engineering, and systems-level ML research. Many practitioners started in GPU programming for scientific computing or graphics before pivoting to AI. A smaller group came through ML research with a focus on efficient model architectures — sparse transformers, mixture-of-experts, quantization-aware training.
- How is AI changing the AI Performance Engineer role itself?
- AI-assisted profiling and auto-tuning tools — including compiler backends like TVM and MLIR-based systems — automate optimization search that was previously done by hand, compressing the time from profiling to deployment. This amplifies the output of experienced engineers but raises the baseline expectation: practitioners who only do manual tuning are losing ground to those who can direct and extend automated optimization stacks.
More in Artificial Intelligence
- AI Operations Manager ($115K–$185K)
AI Operations Managers oversee the deployment, monitoring, and continuous reliability of machine learning models and AI systems running in production. They bridge the gap between data science teams who build models and engineering teams who maintain infrastructure, ensuring AI systems perform accurately, scale predictably, and comply with governance requirements. The role owns the operational health of an organization's AI portfolio from initial deployment through deprecation.
- AI Policy Analyst ($78K–$135K)
AI Policy Analysts research, develop, and communicate policy positions on artificial intelligence regulation, ethics, and governance — advising technology companies, government agencies, think tanks, and advocacy organizations on how AI systems should be built, deployed, and overseen. They sit at the intersection of technical understanding and public policy, translating complex AI capabilities and risks into frameworks legislators, regulators, and executives can act on.
- AI Integration Specialist ($95K–$155K)
AI Integration Specialists design, implement, and maintain the technical bridges between an organization's existing software infrastructure and AI/ML services, APIs, and models. They work at the intersection of software engineering, data architecture, and machine learning operations — translating business requirements into working AI-powered features while ensuring reliability, security, and scalability across production systems.
- AI Privacy Engineer ($125K–$210K)
AI Privacy Engineers design and implement technical safeguards that protect personal data throughout the machine learning lifecycle — from data ingestion and model training to inference and deployment. They sit at the intersection of privacy law, cryptography, and ML engineering, translating regulatory requirements like GDPR and CCPA into code, architectural patterns, and governance controls that let organizations build AI systems without exposing sensitive information.
- AI Solutions Engineer ($115K–$195K)
AI Solutions Engineers bridge the gap between cutting-edge machine learning research and production-grade customer deployments. They work alongside sales, product, and data science teams to scope AI use cases, design integration architectures, build proof-of-concept demos, and guide enterprise customers through implementation. The role demands both deep technical fluency in ML frameworks and APIs and the communication skills to translate model behavior into business outcomes for non-technical stakeholders.
- LLM Engineer ($135K–$220K)
LLM Engineers design, fine-tune, evaluate, and deploy large language models into production systems that power chatbots, copilots, document processing pipelines, and autonomous agents. They sit between research and software engineering — translating model capabilities into reliable, cost-efficient product features while managing inference infrastructure, prompt engineering, and evaluation frameworks at scale.