
AI Performance Engineer


AI Performance Engineers optimize the speed, throughput, and resource efficiency of machine learning models from training to production inference. They sit at the intersection of systems engineering, hardware architecture, and ML research — profiling where compute is wasted, redesigning pipelines to eliminate bottlenecks, and making large models fast enough to serve millions of requests at acceptable cost. The role has become critical as enterprises discover that a model that runs in the lab rarely runs economically at scale.

Role at a glance

Typical education: Bachelor's or Master's in Computer Science, Computer Engineering, or Electrical Engineering
Typical experience: 4–8 years in systems, HPC, or ML engineering
Key certifications: None formally required; NVIDIA Deep Learning Institute credentials and CUDA programming portfolios serve as practical signals
Top employer types: Hyperscalers (Google, Microsoft, AWS, Meta), AI-native labs (OpenAI, Anthropic), AI chip companies (NVIDIA, AMD), large enterprise AI teams
Growth outlook: Faster than the ~25% BLS projection for software developers; AI inference optimization postings have grown substantially year over year through 2025
AI impact (through 2030): Strong tailwind with mixed augmentation — AI-assisted auto-tuning tools (TVM, MLIR backends) automate parts of the optimization search, amplifying experienced engineers' output while raising the baseline expectation; practitioners who can direct and extend automated optimization stacks will outpace those doing only manual tuning.

Duties and responsibilities

  • Profile training and inference workloads on GPU and TPU clusters to identify compute, memory, and communication bottlenecks
  • Apply quantization techniques — INT8 and FP8 formats, GPTQ and AWQ algorithms — to reduce model size and inference latency without unacceptable accuracy loss
  • Implement and tune kernel-level optimizations using CUDA, Triton, or vendor-provided libraries such as cuDNN and cuBLAS
  • Configure and benchmark inference runtimes including TensorRT, vLLM, DeepSpeed-Inference, and ONNX Runtime across target hardware
  • Design and optimize batching strategies, KV-cache management, and speculative decoding for large language model serving
  • Analyze roofline models and Nsight profiles to guide architectural decisions on attention mechanisms and feed-forward layer configurations
  • Collaborate with ML researchers to validate that optimization techniques preserve model accuracy on downstream evaluation benchmarks
  • Benchmark model serving systems for throughput, latency percentiles (P50/P95/P99), and cost-per-token across cloud and on-premise hardware
  • Build automated regression pipelines that catch latency or memory-footprint regressions before changes reach production (see the sketch after this list)
  • Document optimization findings, hardware-specific tuning parameters, and benchmark results for platform and research teams
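
As a concrete illustration of the regression-pipeline duty above, the sketch below compares a candidate build's latency percentiles and peak memory against a stored baseline and blocks a deploy if either regresses past a tolerance. The file layout, thresholds, and function names are assumptions made for this example, not a standard tool.

```python
import json
import statistics

# Illustrative regression gate: compare candidate latency samples (seconds)
# and peak memory against a stored baseline, and fail the pipeline if either
# regresses beyond a tolerance. Thresholds and file layout are assumptions
# made for this sketch.

P95_TOLERANCE = 1.05      # allow up to 5% P95 latency regression
MEMORY_TOLERANCE = 1.02   # allow up to 2% peak-memory regression

def p95(samples):
    # statistics.quantiles with n=100 returns the 1st..99th percentiles
    return statistics.quantiles(samples, n=100)[94]

def check_regression(baseline_path, candidate_path):
    with open(baseline_path) as f:
        baseline = json.load(f)   # expected shape: {"latencies": [...], "peak_mem_gb": ...}
    with open(candidate_path) as f:
        candidate = json.load(f)

    failures = []
    if p95(candidate["latencies"]) > p95(baseline["latencies"]) * P95_TOLERANCE:
        failures.append("P95 latency regression")
    if candidate["peak_mem_gb"] > baseline["peak_mem_gb"] * MEMORY_TOLERANCE:
        failures.append("peak memory regression")
    return failures

if __name__ == "__main__":
    problems = check_regression("baseline.json", "candidate.json")
    if problems:
        raise SystemExit("Blocking deploy: " + ", ".join(problems))
    print("No performance regressions detected")
```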

Overview

AI Performance Engineers solve one of the most expensive problems in applied machine learning: models that are accurate in the lab but too slow or too costly to run in production at scale. As language models have grown from millions to hundreds of billions of parameters, the gap between what a model can theoretically do and what it can economically do has become a critical business constraint. This role exists to close that gap.

The work spans the full stack from hardware to serving infrastructure. On a given day, an AI Performance Engineer might start by pulling Nsight profiles from an overnight training run to understand why GPU utilization dropped below 70% during the backward pass. The diagnosis could point toward a suboptimal all-reduce communication pattern in the distributed training setup, a poorly fused attention kernel, or memory fragmentation in the KV-cache on the inference side. Each diagnosis leads to a different intervention — adjusting the tensor parallelism topology, swapping in a FlashAttention-2 kernel, or tuning the paged attention block size in vLLM.
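
A first pass at that kind of diagnosis often happens in the framework profiler before dropping down to Nsight. Below is a minimal sketch using the PyTorch profiler; the model and batch are stand-ins for whatever workload is actually under investigation.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Minimal profiling pass: attribute wall time to individual CUDA ops for one
# forward+backward step. The layer and batch below are placeholders for the
# real workload being diagnosed.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()
batch = torch.randn(8, 512, 1024, device="cuda", requires_grad=True)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    out = model(batch)
    out.sum().backward()

# Sort by GPU time to see which ops dominate the step; a long tail of small
# kernels or one unexpectedly expensive op is the usual starting hypothesis.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```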

Inference optimization is where much of the commercial pressure concentrates. Serving a 70-billion-parameter model at acceptable latency and cost-per-token requires quantization (typically INT8 or FP8 for inference), batching strategies that maximize GPU utilization without violating latency SLAs, and sometimes changes to decoding or architecture — speculative decoding with a smaller draft model, or mixture-of-experts routing — that require close coordination with researchers. The performance engineer validates that these techniques don't degrade downstream task accuracy beyond acceptable tolerances.
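
The core mechanics behind weight quantization are worth seeing concretely. The toy sketch below performs symmetric per-channel INT8 weight quantization in plain PyTorch; it is not any library's implementation, and production tools such as GPTQ, AWQ, and SmoothQuant layer calibration data and error compensation on top of this basic scale-and-round step.

```python
import torch

def quantize_int8_per_channel(weight: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a weight matrix.

    Returns the INT8 tensor plus the per-row scales needed to dequantize.
    Real PTQ tools (GPTQ, AWQ, SmoothQuant) add calibration data and error
    compensation on top of this basic scale-and-round step.
    """
    # One scale per output channel (row), chosen so the largest |w| maps to 127.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)             # stand-in for one linear layer's weights
q, scale = quantize_int8_per_channel(w)
err = (dequantize(q, scale) - w).abs().mean()
# This round-trip error is the quantization noise that the accuracy-validation
# step has to show is tolerable on downstream benchmarks.
print(f"mean absolute round-trip error: {err:.6f}")
```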

Benchmarking is not an afterthought — it's foundational. AI Performance Engineers build and maintain benchmark suites that measure throughput (tokens/second), latency distributions (P50, P95, P99), and cost metrics (tokens per dollar) across hardware configurations. Without rigorous benchmarking, optimization claims are anecdotal. With it, the team can make hardware procurement decisions, capacity plans, and serving architecture choices on real data.
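
A benchmark harness does not need to be elaborate to be rigorous; the essentials are careful timing, enough samples, and consistent percentile math. The minimal sketch below makes simplifying assumptions: the generate callable stands in for the serving stack under test, the GPU price is an illustrative figure, and requests are issued serially (a real harness adds concurrent clients to exercise batching).

```python
import random
import statistics
import time

def benchmark(generate, prompts, gpu_hourly_cost=2.50):
    """Measure per-request latency and aggregate throughput for a serving stack.

    `generate` is whatever callable fronts the system under test and returns
    the number of tokens produced. The GPU hourly cost is an assumed figure
    used only to derive an illustrative cost metric.
    """
    latencies, total_tokens = [], 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        total_tokens += generate(prompt)
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start

    pct = statistics.quantiles(latencies, n=100)   # 1st..99th percentiles
    tokens_per_s = total_tokens / wall
    return {
        "p50_s": pct[49],
        "p95_s": pct[94],
        "p99_s": pct[98],
        "throughput_tok_per_s": tokens_per_s,
        "cost_per_1k_tokens": (gpu_hourly_cost / 3600) / tokens_per_s * 1000,
    }

def _dummy_generate(prompt):
    # Stand-in workload so the sketch runs standalone; swap in the real client.
    time.sleep(random.uniform(0.01, 0.05))
    return 128

if __name__ == "__main__":
    print(benchmark(_dummy_generate, prompts=["hello"] * 50))
```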

The role requires genuine comfort with ambiguity. Profiling a large model often surfaces multiple overlapping bottlenecks, and the order of operations for addressing them matters — fixing the wrong one first can mask the real constraint. Practitioners who can build a mental model of where compute is actually going, rather than where it seems to be going, are the ones who find the meaningful wins.

Qualifications

Education:

  • Bachelor's or Master's in Computer Science, Computer Engineering, Electrical Engineering, or Applied Mathematics
  • PhD valued for research-adjacent roles (hyperscaler AI infra teams, chip companies) but not required for most production-focused positions
  • Strong self-taught backgrounds with demonstrated GPU programming projects are credible at AI-native companies that prioritize portfolio over credentials

Experience benchmarks:

  • 4–8 years of systems, HPC, or ML engineering experience for mid-level roles
  • 2+ years specifically focused on model optimization, GPU programming, or inference serving on top of that for senior roles
  • Demonstrated end-to-end work: taking a model from unoptimized baseline to production-grade serving configuration

Core technical skills:

  • Profiling: Nsight Systems, Nsight Compute, PyTorch Profiler, TensorBoard — reading traces and attributing latency to operations
  • Quantization: Post-training quantization (PTQ) with GPTQ, AWQ, SmoothQuant; quantization-aware training (QAT) workflows
  • Inference runtimes: TensorRT, vLLM, DeepSpeed-Inference, ONNX Runtime, Hugging Face Text Generation Inference (TGI)
  • Kernel programming: CUDA C++, Triton — writing and benchmarking custom fused kernels (see the Triton sketch after this list)
  • Distributed training: Megatron-LM, DeepSpeed ZeRO, FSDP — understanding tensor, pipeline, and data parallelism tradeoffs
  • Frameworks: PyTorch (primary), JAX for TPU-heavy environments, TensorFlow for legacy serving systems
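
The Triton item above deserves a concrete picture. Below is a deliberately small fused kernel, bias-add plus ReLU in a single pass over memory, written against Triton's public API; the shapes and block size are arbitrary choices for the sketch, and a real kernel would be benchmarked against the framework's existing fused paths before adoption.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, n_cols,
                           BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    # Row-major layout: column index of a flat offset is offset % n_cols.
    bias = tl.load(bias_ptr + (offsets % n_cols), mask=mask)
    # Fused bias-add + ReLU: one read and one write per element.
    tl.store(out_ptr + offsets, tl.maximum(x + bias, 0.0), mask=mask)

def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_bias_relu_kernel[grid](x, bias, out, n, x.shape[-1], BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
# Sanity-check the fused kernel against the eager two-pass version.
torch.testing.assert_close(fused_bias_relu(x, bias), torch.relu(x + bias))
```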

Hardware knowledge:

  • NVIDIA GPU architecture: Ampere, Hopper — SM counts, memory bandwidth, NVLink topology
  • AMD Instinct series, AWS Trainium/Inferentia, Google TPU v4/v5 — enough to benchmark and compare
  • Roofline modeling: identifying whether a kernel is compute-bound or memory-bandwidth-bound (a worked example follows this list)
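
In practice the roofline question reduces to comparing a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's ratio of peak compute to peak memory bandwidth. The worked sketch below uses round, illustrative hardware numbers rather than any specific GPU's datasheet values.

```python
# Roofline check for an FP16 matrix multiply of shape (M, K) x (K, N).
# Peak compute and bandwidth are round illustrative numbers (roughly
# A100-class), not exact datasheet values.
PEAK_TFLOPS = 300          # peak FP16 tensor-core throughput, TFLOP/s
PEAK_BW_TBPS = 2.0         # HBM bandwidth, TB/s
MACHINE_BALANCE = PEAK_TFLOPS * 1e12 / (PEAK_BW_TBPS * 1e12)   # FLOPs per byte

def gemm_arithmetic_intensity(M, K, N, bytes_per_elem=2):
    flops = 2 * M * K * N                                    # multiply-accumulates
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A and B, write C
    return flops / bytes_moved

for name, shape in [("large square GEMM", (8192, 8192, 8192)),
                    ("decode-step GEMV (batch 1)", (1, 8192, 8192))]:
    ai = gemm_arithmetic_intensity(*shape)
    bound = "compute-bound" if ai > MACHINE_BALANCE else "memory-bandwidth-bound"
    print(f"{name}: {ai:.1f} FLOP/B vs balance {MACHINE_BALANCE:.0f} -> {bound}")
```

Run on these illustrative numbers, the square GEMM lands far above the machine balance (compute-bound), while the batch-1 decode step lands near 1 FLOP per byte (memory-bandwidth-bound), which is why single-stream LLM decoding is dominated by memory traffic rather than arithmetic.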

Soft skills that matter:

  • Systematic debugging instinct — performance work is detective work, and hypotheses need to be tested, not assumed
  • Ability to communicate hardware-level findings to ML researchers who don't think in CUDA terms
  • Enough accuracy-side ML knowledge to assess whether an optimization changes model behavior in ways that matter

Career outlook

The demand for AI Performance Engineers is being driven by a collision of forces that are not slowing down: model sizes are growing, inference costs are under scrutiny from CFOs who approved GPU budgets that turned out to be larger than expected, and the hardware landscape is fragmenting in ways that require expert navigation rather than plug-and-play deployment.

On the cost side, inference has become the dominant expense for mature AI deployments. Training a large model is a one-time or periodic cost; serving it millions of times per day is continuous. A 20% reduction in inference cost-per-token at a major AI company can translate to nine-figure annual savings. That math means AI Performance Engineers are not a cost center — they are a profit lever, which is why compensation at leading AI labs and hyperscalers is aggressive.

The hardware market is adding pressure and opportunity simultaneously. NVIDIA's H100 and H200 GPUs remain the dominant training and inference accelerators, but AMD, AWS, Google, and a roster of inference-specialized startups are offering credible alternatives for specific workloads. Each new chip requires new optimization work — kernels that run efficiently on Hopper architecture don't automatically transfer to AMD CDNA3 or AWS Inferentia. AI Performance Engineers who can evaluate and optimize across hardware targets are increasingly valuable to companies running multi-cloud or hybrid inference stacks.

The model architecture side is also generating continuous work. The shift from dense transformers to mixture-of-experts models, the adoption of speculative decoding, and the emergence of state-space models like Mamba as transformer alternatives each require a fresh look at the optimization stack. What worked for a dense 7B parameter model doesn't necessarily generalize to a 141B MoE model with sparse routing.

Job growth in this specific title is difficult to measure because the role is new enough that it doesn't appear in BLS occupational categories, but proxies are instructive. AI and ML engineer job postings that include terms like inference optimization, model quantization, and GPU kernel development have grown substantially year over year through 2025, and every major AI lab has expanded its infrastructure and performance teams. The Bureau of Labor Statistics projects software developer and related occupations to grow around 25% through 2032 — AI Performance Engineering, as a premium specialization within that category, is growing faster.

For people currently in HPC, compiler engineering, or embedded systems, this role represents a natural and high-paying career transition. The underlying skills — thinking in memory hierarchies, understanding parallelism, debugging hardware utilization — transfer directly. For ML practitioners who want to go deeper into systems, the investment in CUDA fundamentals and profiling tools pays off quickly in both impact and compensation.

Sample cover letter

Dear Hiring Manager,

I'm applying for the AI Performance Engineer position at [Company]. My background is in GPU systems programming — I spent three years at [Previous Company] building HPC kernels for scientific computing before moving into inference optimization work for large language models 18 months ago.

In my current role I own inference performance for a production serving stack that handles 40 million tokens per day across a 13B and a 70B parameter model. Over the past year I cut cost-per-token by 34% across both models through a combination of GPTQ INT8 quantization on the 70B model, a batching strategy rewrite that improved GPU utilization from 58% to 81% under mixed-priority traffic, and a custom Triton kernel for the attention layer that eliminated a memory bandwidth bottleneck Nsight Compute showed was costing us roughly 12% of throughput.

I've also done the accuracy validation work on both sides — I don't hand an optimized model off to the research team without running it through the evaluation benchmark suite we maintain internally. The quantization work on the 70B model initially showed a 1.8% drop on our hardest reasoning tasks; I worked with the research team on a targeted QAT pass on those task distributions before shipping.

What I'm looking for now is a team working on models at a scale where the hardware topology itself — NVLink bandwidth, inter-node communication, KV-cache memory management under concurrent sessions — is a first-class optimization problem. The serving infrastructure challenges at [Company]'s scale look like exactly that environment.

I'd welcome the chance to talk through the technical details of what your team is working on.

[Your Name]

Frequently asked questions

What distinguishes an AI Performance Engineer from an ML Engineer?
ML Engineers typically focus on model development, training pipelines, feature engineering, and getting models to production. AI Performance Engineers specialize in what happens after a model works — making it fast, cheap, and reliable at scale. The job requires deeper familiarity with hardware architecture, CUDA programming, and systems profiling than most ML Engineer roles demand.
Is CUDA programming required for this role?
Custom CUDA kernel development is expected at hyperscalers and AI chip companies where squeezing the last 10% of hardware utilization matters. At most enterprise AI teams, fluency with higher-level tools — Triton, TensorRT, vLLM, and framework-level optimizations — is sufficient. Candidates who can drop into CUDA when needed are more competitive for senior roles.
How is AI hardware diversity affecting this job?
The market has moved beyond NVIDIA-only deployments. AMD Instinct GPUs, AWS Trainium and Inferentia, Google TPUs, and emerging inference accelerators from Groq and Cerebras each require different optimization strategies. AI Performance Engineers are increasingly expected to benchmark across hardware targets and recommend the right chip for a given model and workload, not just tune for a single vendor.
What background do most people in this role come from?
The most common paths are high-performance computing (HPC), compiler engineering, and systems-level ML research. Many practitioners started in GPU programming for scientific computing or graphics before pivoting to AI. A smaller group came through ML research with a focus on efficient model architectures — sparse transformers, mixture-of-experts, quantization-aware training.
How is AI changing the AI Performance Engineer role itself?
AI-assisted profiling and auto-tuning tools — including compiler backends like TVM and MLIR-based systems — automate optimization search that was previously done by hand, compressing the time from profiling to deployment. This amplifies the output of experienced engineers but raises the baseline expectation: practitioners who only do manual tuning are losing ground to those who can direct and extend automated optimization stacks.