Haojie Ye 叶皓桀
Deep Learning Performance Architect · NVIDIA
I work at the intersection of GPU architecture and large-scale AI systems as a Deep Learning Performance Architect at NVIDIA. My team drives performance modeling and architecture pathfinding for next-generation GPUs — shaping the hardware that will power the world's most demanding AI workloads before a single chip is manufactured.
I hold a Ph.D. in Computer Science and Engineering (GPA 4.0/4.0) from the University of Michigan, Ann Arbor, advised by Prof. Trevor Mudge. My doctoral research focused on computer architecture for AI: building hardware-software co-designs for recommendation systems, graph neural networks, and sparse linear algebra that dramatically reduce inference latency and energy cost. I also hold a B.S.E. in Electrical & Computer Engineering from Shanghai Jiao Tong University.
My work has appeared at top computer architecture and systems venues, including HPCA, ASPLOS, MICRO, ISCA, VLDB, and PACT.
I specialize in understanding how transformer-based AI models map onto modern GPU hardware — identifying bottlenecks across compute, memory bandwidth, capacity, and interconnect — and translating those findings into concrete architectural guidance for future silicon.
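To make that kind of bottleneck reasoning concrete, here is a minimal first-order roofline sketch in Python. It is an illustration only, not NVIDIA's internal methodology, and every hardware number and GEMM shape in it is a hypothetical placeholder.

```python
# Illustrative roofline-style check (a sketch, not production modeling code):
# a kernel is compute-bound when its arithmetic intensity (FLOPs per byte moved)
# exceeds the machine balance point (peak FLOP/s divided by peak bandwidth).
# All hardware numbers below are hypothetical placeholders.

def bound_by(flops: float, bytes_moved: float,
             peak_flops: float, peak_bw: float) -> str:
    intensity = flops / bytes_moved          # FLOPs per byte
    machine_balance = peak_flops / peak_bw   # FLOPs per byte at the roofline knee
    return "compute-bound" if intensity >= machine_balance else "memory-bound"

# FP16 GEMMs (2 bytes/element), assuming A and B are read once and C written once.
shapes = {"prefill-like GEMM": (8192, 8192, 8192),   # large batched matmul
          "decode-like GEMV": (1, 8192, 8192)}       # single-token matrix-vector
for name, (M, N, K) in shapes.items():
    flops = 2 * M * N * K
    bytes_moved = 2 * (M * K + K * N + M * N)
    verdict = bound_by(flops, bytes_moved,
                       peak_flops=1.0e15,            # hypothetical 1 PFLOP/s
                       peak_bw=3.0e12)               # hypothetical 3 TB/s
    print(f"{name}: {verdict}")
```

In this toy example the prefill-like GEMM lands on the compute side of the roofline while the single-token decode GEMV is firmly bandwidth-bound, which is exactly the kind of distinction that separates compute-oriented from memory-system tradeoffs.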
NVIDIA — Deep Learning Performance Architect
Performance Modeling & Architecture Pathfinding · Santa Clara, CA
- GPU Architecture Pathfinding (Rubin & Feynman): Serve as a primary contributor to the performance modeling and hardware specification process for NVIDIA's upcoming Rubin and Feynman GPU generations. Deliver actionable tradeoff analyses — across compute, memory hierarchy, and interconnect — that directly influence architectural decisions before tape-out.
- AI Workload Coverage: Model state-of-the-art LLMs and AI workloads from NVIDIA's largest enterprise customers — including xAI (Grok), Alibaba (Qwen series), AWS, Microsoft Azure, Meta (Llama 4), and ByteDance — on next-generation pre-silicon GPU simulators to surface bottlenecks and guide optimization priorities.
- Model Coverage: Llama 4, DeepSeek, Qwen3, Kimi K2, Diffusion Transformers (Wan, LTX), and Vision Foundation Models — spanning prefill, decode, and multi-modal inference scenarios at datacenter scale.
- RTX PRO 6000 Blackwell: Contributed to performance modeling and optimization for the NVIDIA RTX PRO 6000 Blackwell Workstation GPU, a flagship professional GPU for AI-intensive workstation deployments.
- Keynote Contributions: Supported GPU performance projections and architectural narratives for NVIDIA's high-profile keynotes, including GTC 2025 and Computex 2025.
NVIDIA — Deep Learning Performance Architect Intern
Performance Modeling & Projection · Santa Clara, CA
- Independent R&D in modeling large language models on future GPU architectures.
- Developed a performance projection methodology that was later adopted into production modeling infrastructure.
Micron Technology — Advanced Hardware Development Intern
Enhanced In-Memory Function Research · Allen, TX
- Independent research on enhanced in-memory compute functions for Micron's next-generation memory modules.
- Contributed to one Micron internal journal publication and two U.S. patents.
My academic work builds hardware-software co-designs that close the gap between algorithm complexity and hardware efficiency for AI-scale workloads. Key themes include recommendation system acceleration, sparse GEMM on systolic arrays, graph pattern mining hardware, and LLM fine-tuning cost modeling. Full list on Google Scholar.
Palermo: Improving the Performance of Oblivious Memory using Protocol-Hardware Co-Design
Palermo addresses the severe performance penalty of Oblivious RAM (ORAM) — a cryptographic primitive for privacy-preserving computation — through a co-designed protocol and hardware that dramatically reduces memory access overhead, enabling practical secure computation on real hardware.
GRACE: A Scalable Graph-Based Approach to Accelerating Recommendation Model Inference
Industrial recommendation models (powering feeds at Meta, TikTok, and similar platforms) are dominated by sparse embedding lookups into multi-terabyte tables. GRACE exploits the graph structure of user-item interactions to accurately predict and prefetch embeddings, cutting inference latency by up to 2.4× with negligible area overhead — a direct path to faster, cheaper recommendation serving.
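A minimal sketch of the general idea behind graph-guided embedding prefetching follows. It is not the GRACE design; the cache policy, class, and names are hypothetical. The point is simply that neighbors of the current item in the interaction graph are likely to be looked up soon, so their embedding rows can be staged into a small fast cache ahead of time.

```python
# Toy illustration of graph-guided embedding prefetching (hypothetical names,
# not the GRACE mechanism). Neighbor rows are warmed into a small cache so
# later lookups avoid the slow path to the full multi-terabyte table.
from collections import OrderedDict

class EmbeddingCache:
    def __init__(self, table, capacity=1024):
        self.table = table             # full embedding table (slow tier)
        self.cache = OrderedDict()     # small fast tier (e.g., HBM-resident)
        self.capacity = capacity

    def _admit(self, idx):
        if idx not in self.cache:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict the oldest entry
            self.cache[idx] = self.table[idx]

    def prefetch_neighbors(self, idx, graph):
        for nbr in graph.get(idx, []):           # items co-accessed with idx
            self._admit(nbr)

    def lookup(self, idx):
        if idx not in self.cache:                # miss: fall back to slow tier
            self._admit(idx)
        return self.cache[idx]

table = {i: [0.1 * i] * 4 for i in range(10)}    # toy embedding table
graph = {3: [5, 7]}                              # co-access graph for item 3
cache = EmbeddingCache(table, capacity=4)
cache.prefetch_neighbors(3, graph)               # warm rows 5 and 7
print(cache.lookup(5))                           # now served from the cache
```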
NDMiner: Accelerating Graph Pattern Mining Using Near Data Processing
Graph pattern mining (subgraph isomorphism, motif counting) is a cornerstone of drug discovery, fraud detection, and knowledge graph reasoning. NDMiner moves computation to where data lives — inside memory — eliminating the memory bandwidth bottleneck that cripples conventional GPU approaches on large irregular graphs.
Mint: An Accelerator for Mining Temporal Motifs
Temporal motif mining — discovering recurring time-ordered interaction patterns in dynamic graphs — is foundational to fraud detection, social network analysis, and biological pathway discovery. Mint introduces the first purpose-built hardware accelerator for this problem, achieving orders-of-magnitude speedup over CPU and GPU baselines.
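To make the term "temporal motif" concrete, here is a tiny brute-force example (not Mint's algorithm or hardware) that counts one 2-edge pattern: an edge from u to v followed by an edge from v to some other node within a time window delta. The edge list and window are invented.

```python
# Brute-force count of a 2-edge temporal motif: u->v followed by v->w
# within `delta` time units. Purely illustrative.

def count_chain_motifs(edges, delta):
    """edges: list of (src, dst, timestamp), sorted by timestamp."""
    count = 0
    for i, (u, v, t1) in enumerate(edges):
        for src, dst, t2 in edges[i + 1:]:
            if t2 - t1 > delta:
                break                    # remaining edges are even later
            if src == v and dst != u:
                count += 1               # found u -> v -> w inside the window
    return count

edges = [("a", "b", 0), ("b", "c", 3), ("c", "a", 4), ("b", "d", 9)]
print(count_chain_motifs(edges, delta=5))   # -> 2 (a->b->c and b->c->a)
```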
Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices
Systolic array accelerators (TPUs) are highly efficient for dense GEMM but waste most of their compute budget on zero-valued elements in sparse networks. Sparse-TPU augments systolic arrays with lightweight sparsity-aware dataflow, recovering significant throughput on pruned and sparse neural network models without expensive hardware redesigns.
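The snippet below only illustrates the waste that sparsity awareness removes; it is a plain CSR sparse matrix-vector product in software, not the Sparse-TPU dataflow. A dense inner loop would spend a multiply-accumulate on every zero, while the compressed form visits nonzeros only.

```python
# CSR (compressed sparse row) matrix-vector product: only nonzero entries
# contribute work, in contrast to a dense loop over every element.

def csr_matvec(values, col_idx, row_ptr, x):
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]   # zeros are never touched
    return y

# A = [[0, 2, 0],
#      [1, 0, 3]]  stored as three parallel CSR arrays
values, col_idx, row_ptr = [2.0, 1.0, 3.0], [1, 0, 2], [0, 1, 3]
print(csr_matvec(values, col_idx, row_ptr, x=[1.0, 1.0, 1.0]))   # -> [2.0, 4.0]
```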
Everest: GPU-Accelerated System for Mining Temporal Motifs
Everest brings temporal motif mining to GPU scale, building on the algorithmic insights of Mint and delivering a production-grade GPU-accelerated system capable of processing billion-edge temporal graphs that are far beyond the reach of prior CPU-based tools.
Understanding the Performance and Estimating the Cost of LLM Fine-Tuning
Fine-tuning large language models is notoriously expensive and opaque to practitioners. This work provides a systematic characterization of LLM fine-tuning workloads across hardware platforms, offering actionable cost models and bottleneck analyses that help teams make informed infrastructure decisions — directly relevant to any organization training or adapting LLMs.
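In the same back-of-envelope spirit (a sketch, not the paper's cost model), the widely used heuristic of roughly 6 FLOPs per parameter per training token gives a quick GPU-hour estimate once a sustained utilization is assumed; the peak throughput and MFU figures below are hypothetical placeholders.

```python
# Back-of-envelope fine-tuning cost estimate (illustrative only).
# Uses the common ~6 * parameters * tokens FLOP heuristic for forward + backward.

def finetune_gpu_hours(params, tokens, peak_flops_per_gpu, mfu=0.4):
    flops = 6.0 * params * tokens               # total training FLOPs (heuristic)
    sustained = peak_flops_per_gpu * mfu        # assumed model FLOPs utilization
    return flops / sustained / 3600.0           # seconds -> hours

# Example: 8B-parameter model, 2B fine-tuning tokens, hypothetical 1 PFLOP/s peak.
print(f"{finetune_gpu_hours(8e9, 2e9, peak_flops_per_gpu=1e15):.1f} GPU-hours")
```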
Locality-Aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems
As AI models outgrow single-GPU memory, multi-GPU systems with NVLink/NVSwitch become essential. This work characterizes and mitigates remote memory access latency in multi-GPU environments through locality-aware data placement and access pattern optimization — directly applicable to large-model inference and training across GPU clusters.
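As a toy illustration of locality-aware placement (not the paper's technique), the sketch below homes every page on the GPU that touches it most, so most accesses stay local rather than crossing NVLink; the page names and access counts are invented.

```python
# Greedy locality-aware placement: give each page to the GPU that accesses it
# most often, then measure how much traffic still has to go remote.

def place_pages(access_counts):
    """access_counts[page][gpu] -> access count; returns page -> home GPU."""
    return {page: max(per_gpu, key=per_gpu.get)
            for page, per_gpu in access_counts.items()}

def remote_fraction(access_counts, placement):
    total = remote = 0
    for page, per_gpu in access_counts.items():
        for gpu, count in per_gpu.items():
            total += count
            if gpu != placement[page]:
                remote += count            # access crosses the interconnect
    return remote / total

counts = {"p0": {"gpu0": 90, "gpu1": 10},
          "p1": {"gpu0": 5,  "gpu1": 95}}
placement = place_pages(counts)
print(placement, f"remote traffic: {remote_fraction(counts, placement):.1%}")
```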
Ph.D., Computer Science & Engineering
University of Michigan, Ann Arbor · Advised by Prof. Trevor Mudge
GPA 4.0/4.0 · Focus: Computer Architecture for AI workloads
B.S.E., Computer Engineering
B.S.E., Electrical & Computer Engineering
Shanghai Jiao Tong University
- Piano player and electronic music producer — Vaporwave & ambient experiments
- Experimental film enthusiast — Tarkovsky, Lynch, early Kubrick
- Philosophy of mind and consciousness — from Descartes to Chalmers