Haojie Ye 叶皓桀

Deep Learning Performance Architect · NVIDIA

I work at the intersection of GPU architecture and large-scale AI systems as a Deep Learning Performance Architect at NVIDIA. My team drives performance modeling and architecture pathfinding for next-generation GPUs — shaping the hardware that will power the world's most demanding AI workloads before a single chip is manufactured.

I hold a Ph.D. in Computer Science and Engineering (GPA 4.0/4.0) from the University of Michigan, Ann Arbor, advised by Prof. Trevor Mudge. My doctoral research focused on computer architecture for AI: building hardware-software co-designs for recommendation systems, graph neural networks, and sparse linear algebra that dramatically reduce inference latency and energy cost. I also hold a B.S.E. in Electrical & Computer Engineering from Shanghai Jiao Tong University.

My work has appeared at top computer architecture and systems venues, including HPCA, ASPLOS, MICRO, ISCA, VLDB, and PACT.

Recent Updates
2025/05 Event Contributed to the NVIDIA COMPUTEX 2025 keynote on GPU architecture and AI performance.
2025/03 Event Contributed to the NVIDIA GTC 2025 keynote on next-generation GPU architecture and the AI roadmap.
2025/01 Launch Contributed to the debut of the NVIDIA RTX PRO 6000 Blackwell Workstation GPU — performance modeling and architecture optimization for professional AI workloads.
2025/01 HPCA'25 Paper accepted at HPCA 2025: "Palermo: Improving the Performance of Oblivious Memory using Protocol-Hardware Co-Design."
2024/10 New Role Joined NVIDIA (Santa Clara) as a full-time Deep Learning Performance Architect on the Performance Modeling & Architecture Pathfinding team.
2024/09 PhD Defended my Ph.D. dissertation at the University of Michigan on computer architecture for emerging AI workloads.
Industry

I specialize in understanding how transformer-based AI models map onto modern GPU hardware — identifying bottlenecks across compute, memory bandwidth, capacity, and interconnect — and translating those findings into concrete architectural guidance for future silicon.
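The bottleneck triage described above can be illustrated with a toy roofline-style check: compare a kernel's arithmetic intensity (FLOPs per byte of memory traffic) against the machine balance of the hardware. All hardware numbers below are hypothetical placeholders, not any real GPU's specification, and the helper names are my own; this is a sketch of the general method, not NVIDIA's internal modeling.

```python
# Toy roofline-style bottleneck check (illustrative only; the TFLOPS and
# HBM bandwidth figures below are hypothetical, not a real GPU's specs).

def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an M x K @ K x N matmul with FP16 operands."""
    flops = 2 * m * n * k                                # multiply-accumulates
    traffic = bytes_per_elem * (m * k + k * n + m * n)   # read A, B; write C
    return flops / traffic

def bound(intensity, peak_tflops, hbm_tbps):
    """Classify a kernel as compute- or memory-bound on the roofline."""
    machine_balance = (peak_tflops * 1e12) / (hbm_tbps * 1e12)  # FLOPs/byte
    return "compute-bound" if intensity >= machine_balance else "memory-bound"

# Large prefill-style GEMM: high arithmetic intensity.
prefill = gemm_arithmetic_intensity(4096, 4096, 4096)
# Decode-style GEMV (batch 1): intensity near 1 FLOP/byte.
decode = gemm_arithmetic_intensity(1, 4096, 4096)

print(bound(prefill, peak_tflops=1000, hbm_tbps=3))  # compute-bound
print(bound(decode, peak_tflops=1000, hbm_tbps=3))   # memory-bound
```

The same two-line comparison explains why LLM prefill tends to stress compute while decode stresses memory bandwidth, which is the kind of distinction that drives the architectural tradeoffs discussed below.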

NVIDIA — Deep Learning Performance Architect

Performance Modeling & Architecture Pathfinding · Santa Clara, CA

Oct 2024 – Present
  • GPU Architecture Pathfinding (Rubin & Feynman): Primary contributor to the performance modeling and hardware specification process for NVIDIA's upcoming Rubin and Feynman GPU generations. Deliver actionable tradeoff analyses across compute, memory hierarchy, and interconnect that directly influence architectural decisions before tape-out.
  • AI Workload Coverage: Model state-of-the-art LLMs and AI workloads from NVIDIA's largest enterprise customers — including xAI (Grok), Alibaba (Qwen series), AWS, Microsoft Azure, Meta (Llama 4), and ByteDance — on next-generation pre-silicon GPU simulators to surface bottlenecks and guide optimization priorities.
  • Model Coverage: Llama 4, DeepSeek, Qwen3, Kimi K2, Diffusion Transformers (Wan, LTX), and Vision Foundation Models, spanning prefill, decode, and multi-modal inference scenarios at datacenter scale.
  • RTX PRO 6000 Blackwell: Contributed to performance modeling and optimization for the NVIDIA RTX PRO 6000 Blackwell Workstation GPU, a flagship professional GPU for AI-intensive workstation deployments.
  • Keynote Contributions: Supported GPU performance projections and architectural narratives for NVIDIA's high-profile developer events, including GTC 2025 and COMPUTEX 2025.
Pre-Silicon Simulation LLM Inference Transformer Architecture GPU Microarchitecture Memory Systems Performance Modeling CUDA HBM / Interconnect

NVIDIA — Deep Learning Performance Architect Intern

Performance Modeling & Projection · Santa Clara, CA

May 2023 – Aug 2023
  • Conducted independent R&D on modeling large language models on future GPU architectures.
  • Developed performance projection methodology later adopted into production modeling infrastructure.

Micron Technology — Advanced Hardware Development Intern

Enhanced In-Memory Function Research · Allen, TX

May 2022 – Aug 2022
  • Independent research on enhanced in-memory compute functions for Micron's next-generation memory modules.
  • Contributed to one Micron internal journal article and two U.S. patents.
Research

My academic work builds hardware-software co-designs that close the gap between algorithm complexity and hardware efficiency for AI-scale workloads. Key themes include: recommendation system acceleration, sparse GEMM on systolic arrays, graph pattern mining hardware, and LLM fine-tuning cost modeling. Full list on Google Scholar.

HPCA'25 Protocol-Hardware Co-Design · Oblivious Memory

Palermo: Improving the Performance of Oblivious Memory using Protocol-Hardware Co-Design

H. Ye, Y. Xia, Y. Chen, K. Chen, Y. Yuan, S. Deng, B. Kasikci, T. Mudge, N. Talati
HPCA 2025 · IEEE Intl. Symposium on High-Performance Computer Architecture

Palermo addresses the severe performance penalty of Oblivious RAM (ORAM), a cryptographic primitive for privacy-preserving computation, through protocol-hardware co-design that dramatically reduces memory access overhead, enabling practical secure computation on real hardware.

ASPLOS'23 Graph-Based Recommendation · Embedding Prefetch

GRACE: A Scalable Graph-Based Approach to Accelerating Recommendation Model Inference

H. Ye, S. Vedula, Y. Chen, Y. Yang, A. Bronstein, R. Dreslinski, T. Mudge, N. Talati
ASPLOS 2023 · ACM Intl. Conference on Architectural Support for Programming Languages and Operating Systems

Industrial recommendation models (powering feeds at Meta, TikTok, and similar platforms) are dominated by sparse embedding lookups into multi-terabyte tables. GRACE exploits the graph structure of user-item interactions to accurately predict and prefetch embeddings, cutting inference latency by up to 2.4× with negligible area overhead — a direct path to faster, cheaper recommendation serving.
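The general idea behind graph-based prefetching can be sketched with a toy cache: when an item's embedding is fetched, also pull in the embeddings of its graph neighbors, since related items are likely to be requested soon. This is an illustrative sketch only, not GRACE's actual mechanism; the graph, cache policy, and all names here are made up.

```python
# Toy neighbor-based embedding prefetch (illustrative; not GRACE's actual
# mechanism -- the graph, capacity, and LRU policy are hypothetical).
from collections import OrderedDict

ITEM_GRAPH = {  # hypothetical co-interaction graph: item -> related items
    "i1": ["i2", "i3"],
    "i2": ["i1", "i4"],
    "i3": ["i1"],
    "i4": ["i2"],
}

EMBEDDING_TABLE = {i: [0.0] * 4 for i in ITEM_GRAPH}  # stand-in for a huge table

class PrefetchCache:
    """Tiny LRU cache that prefetches graph neighbors of each served item."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def _fill(self, item):
        self.cache[item] = EMBEDDING_TABLE[item]
        self.cache.move_to_end(item)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

    def lookup(self, item):
        if item in self.cache:
            self.hits += 1
            self.cache.move_to_end(item)
        else:
            self.misses += 1
            self._fill(item)
        # Prefetch likely-next embeddings: graph neighbors of this item.
        for nbr in ITEM_GRAPH.get(item, []):
            if nbr not in self.cache:
                self._fill(nbr)
        return self.cache[item]

cache = PrefetchCache()
for req in ["i1", "i2", "i3", "i1"]:
    cache.lookup(req)
print(cache.hits, cache.misses)  # 3 1 -- prefetch turns misses into hits
```

In this toy trace only the first request misses; the prefetched neighbors absorb the rest, which is the latency-hiding effect the paper exploits at production scale.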

ISCA'22 Near-Data Processing · Graph Pattern Mining

NDMiner: Accelerating Graph Pattern Mining Using Near Data Processing

N. Talati, H. Ye, Y. Yang, L. Belayneh, K. Chen, D. Blaauw, T. Mudge, R. Dreslinski
ISCA 2022 · ACM/IEEE Intl. Symposium on Computer Architecture

Graph pattern mining (subgraph isomorphism, motif counting) is a cornerstone of drug discovery, fraud detection, and knowledge graph reasoning. NDMiner moves computation to where data lives — inside memory — eliminating the memory bandwidth bottleneck that cripples conventional GPU approaches on large irregular graphs.

MICRO'22 Temporal Graph Mining · Hardware Accelerator

Mint: An Accelerator for Mining Temporal Motifs

N. Talati, H. Ye, S. Vedula, K. Chen, Y. Chen, D. Liu, D. Blaauw, A. Bronstein, T. Mudge, R. Dreslinski
MICRO 2022 · IEEE/ACM Intl. Symposium on Microarchitecture

Temporal motif mining — discovering recurring time-ordered interaction patterns in dynamic graphs — is foundational to fraud detection, social network analysis, and biological pathway discovery. Mint introduces the first purpose-built hardware accelerator for this problem, achieving orders-of-magnitude speedup over CPU and GPU baselines.

ICS'20 Sparse GEMM · Systolic Arrays · TPU

Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices

X. He, S. Pal, A. Amarnath, S. Feng, D. Park, A. Rovinski, H. Ye, Y. Chen, R. Dreslinski, T. Mudge
ICS 2020 · ACM Intl. Conference on Supercomputing

Systolic array accelerators (TPUs) are highly efficient for dense GEMM but waste most of their compute budget on zero-valued elements in sparse networks. Sparse-TPU augments systolic arrays with lightweight sparsity-aware dataflow, recovering significant throughput on pruned and sparse neural network models without expensive hardware redesigns.

VLDB'24 GPU-Accelerated · Temporal Motif Mining

Everest: GPU-Accelerated System for Mining Temporal Motifs

Y. Yuan, H. Ye, S. Vedula, W. Kaza, N. Talati
VLDB 2024 · Proceedings of the VLDB Endowment

Everest brings temporal motif mining to GPU scale, building on the algorithmic insights of Mint and delivering a production-grade GPU-accelerated system capable of processing billion-edge temporal graphs that are far beyond the reach of prior CPU-based tools.

IISWC'24 LLM Fine-Tuning · Cost Modeling · Performance

Understanding the Performance and Estimating the Cost of LLM Fine-Tuning

Y. Xia, J. Kim, Y. Chen, H. Ye, S. Kundu, C. Hao, N. Talati
IISWC 2024 · IEEE Intl. Symposium on Workload Characterization

Fine-tuning large language models is notoriously expensive and opaque to practitioners. This work provides a systematic characterization of LLM fine-tuning workloads across hardware platforms, offering actionable cost models and bottleneck analyses that help teams make informed infrastructure decisions — directly relevant to any organization training or adapting LLMs.
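A back-of-envelope version of such a cost estimate can be written in a few lines using the common heuristic that training costs roughly 6 FLOPs per parameter per token. This is a rule-of-thumb sketch, not the paper's cost model; the function name, MFU, throughput, and price figures are all hypothetical.

```python
# Back-of-envelope fine-tuning cost estimate (rule-of-thumb only; uses the
# common ~6 * params * tokens training-FLOPs heuristic, not the paper's
# model; all hardware and pricing numbers below are hypothetical).
def finetune_cost_usd(params, tokens, gpu_tflops, mfu, gpu_hourly_usd, n_gpus):
    total_flops = 6 * params * tokens                # forward + backward heuristic
    effective = gpu_tflops * 1e12 * mfu * n_gpus     # sustained FLOP/s at given MFU
    hours = total_flops / effective / 3600
    return hours * n_gpus * gpu_hourly_usd, hours

# Example: full fine-tune of a 7B-parameter model on 1B tokens.
cost, wall_hours = finetune_cost_usd(
    params=7e9, tokens=1e9, gpu_tflops=989, mfu=0.4, gpu_hourly_usd=3.0, n_gpus=8
)
print(f"~${cost:.0f} over ~{wall_hours:.1f} hours")
```

Even this crude estimate makes the main levers visible: achieved MFU and token count dominate cost, which is why workload characterization of the kind in the paper matters for budgeting.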

PACT'22 Multi-GPU · Remote Memory · Latency Optimization

Locality-Aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems

L. Belayneh, H. Ye, K. Chen, D. Blaauw, T. Mudge, R. Dreslinski, N. Talati
PACT 2022 · ACM Intl. Conference on Parallel Architectures and Compilation Techniques

As AI models outgrow single-GPU memory, multi-GPU systems with NVLink/NVSwitch become essential. This work characterizes and mitigates remote memory access latency in multi-GPU environments through locality-aware data placement and access pattern optimization — directly applicable to large-model inference and training across GPU clusters.

Education

Ph.D., Computer Science & Engineering

University of Michigan, Ann Arbor · Advised by Prof. Trevor Mudge

GPA 4.0/4.0 · Focus: Computer Architecture for AI workloads

2019 – 2024

M.S.E., Computer Science & Engineering

University of Michigan, Ann Arbor

GPA 4.0/4.0

2019 – 2021

B.S.E., Computer Engineering

University of Michigan, Ann Arbor

2017 – 2019

B.S.E., Electrical & Computer Engineering

Shanghai Jiao Tong University

2015 – 2019
Misc
  • Piano player and electronic music producer — Vaporwave & ambient experiments
  • Experimental film enthusiast — Tarkovsky, Lynch, early Kubrick
  • Philosophy of mind and consciousness — from Descartes to Chalmers