Vima Gupta

Ph.D. student, College of Computing, Georgia Tech

About Me

I am a third-year CS Ph.D. student at Georgia Tech, advised by Dr. Anand Iyer in the NEXS group. My research focuses on building efficient systems for inference and post-training of large models. My projects span runtime scheduling, workload-agnostic pattern analysis, and CUDA kernel optimization.

I work on production-scale challenges: multi-model serving, mixture-of-experts inference, and adaptive systems that respond to real workload patterns. I’ve interned at Microsoft Research (2024, 2025) and Cerebras Systems.

I am honored to have received the Adobe Research Women in Technology Scholarship and the EPFL EDIC Fellowship.

CV · Google Scholar · GitHub · LinkedIn

Research

Adaptive Model Merging for Multi-Model LLM Serving
Building systems to merge fine-tuned LLMs at runtime, reducing memory footprint while preserving accuracy for agentic and multi-tenant workloads.
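For intuition only (this is not the system above), merging two fine-tunes of a shared base model can be as simple as interpolating their weights. The sketch below is a generic weight-space merge in the spirit of model soups; the `alpha` blend and the checkpoint names in the usage comment are illustrative assumptions:

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Blend two fine-tuned checkpoints that share one base
    architecture: merged = alpha * A + (1 - alpha) * B."""
    merged = {}
    for name, w_a in sd_a.items():
        w_b = sd_b[name]
        assert w_a.shape == w_b.shape, f"shape mismatch at {name}"
        merged[name] = alpha * w_a + (1.0 - alpha) * w_b
    return merged

# Serve one merged set of weights instead of keeping both fine-tunes
# resident in GPU memory (ft_math / ft_code are hypothetical models):
# model.load_state_dict(
#     merge_state_dicts(ft_math.state_dict(), ft_code.state_dict()))
```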

LYNX: Efficient Mixture-of-Experts Inference
The first system to optimize MoE inference through dynamic expert reduction: workload-agnostic, accuracy-preserving, and built on custom CUDA kernels. [arXiv] (Under submission)
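As a rough picture of what expert reduction means, here is a toy top-k MoE forward pass where routing can be restricted to a subset of experts. The router, the `active` mask, and the dispatch loop are illustrative assumptions, not LYNX's algorithm or kernels:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, experts, router_w, k=2, active=None):
    """Toy mixture-of-experts layer: route each token to its top-k
    experts; `active` optionally limits routing to a reduced expert
    subset (the expert-reduction idea, in miniature)."""
    logits = x @ router_w                        # [tokens, num_experts]
    if active is not None:
        mask = torch.full_like(logits, float("-inf"))
        mask[:, active] = 0.0                    # pruned experts get -inf
        logits = logits + mask
    probs = F.softmax(logits, dim=-1)
    weights, idx = torch.topk(probs, k, dim=-1)  # [tokens, k]
    weights = weights / weights.sum(dim=-1, keepdim=True)
    out = torch.zeros_like(x)
    for slot in range(k):                        # batch tokens per expert
        for e in idx[:, slot].unique():
            sel = idx[:, slot] == int(e)
            out[sel] += weights[sel, slot:slot + 1] * experts[int(e)](x[sel])
    return out
```

Shrinking `active` is the memory and compute lever; the interesting systems work is deciding that subset adaptively and keeping the dispatch fast on GPU.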

Publications

VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination
M. Hu, A. Gupta, J. Yuan, V. Gupta, T. Kim, X. Xu, J. Kulkarni, O. Dekel, V. Adve, C. Mendis
OSDI 2026 (to appear)

LYNX: Efficient MoE Inference for Workload-Agnostic LLM Serving
V. Gupta, J.H. Ju, K. Sinha, A. Gavrilovska, A. Iyer
(Under submission) [arXiv]

Understanding Infinity: Neural Network Models of Becoming a Cardinal Principle Knower
V. Gupta, S. Varma
CogSci 2024 [Paper]

Learning to Count: A Neural Network Model of the Successor Function
V. Gupta, S. Varma
CogSci 2022 [Poster]

Performance Analysis of a Visible Light V2V Wireless Communication System
V. Gupta, R. Singhal
IMICPW 2019 (Best Paper Award) [Paper]

Experience

Microsoft Research — Research Intern, Systems Research Group (Summer 2025)
Adaptive KV-aware scheduling for multi-instance, multi-region LLM serving.

Microsoft Research — Research Intern, AI Frameworks Team (Summer 2024)
Adaptive CUDA kernel dispatch for FlashAttention; up to 30% latency reduction on H100/A100 clusters.
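The general shape of such a dispatcher, sketched with a made-up threshold and PyTorch's stock attention ops rather than the actual kernels from that project:

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Unfused reference path: explicit softmax(QK^T / sqrt(d)) @ V.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def dispatch_attention(q, k, v, fused_min_len=512):
    """Pick a kernel per input shape: fused path for long CUDA
    sequences, reference path otherwise. The 512 cutoff is a
    placeholder; a real dispatcher would tune it per GPU."""
    if q.is_cuda and q.shape[-2] >= fused_min_len:
        # PyTorch selects a flash-style fused kernel when eligible.
        return F.scaled_dot_product_attention(q, k, v)
    return naive_attention(q, k, v)
```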

Cerebras Systems — ML Compilers Intern (Summer 2022)
MLIR-based graph transformations for sparse attention acceleration.

Arm — Physical Design Engineer (2018–2020)
Synthesis and place-and-route (PnR) for Armv7–v9 cores at 4nm, 5nm, and 11nm nodes.

Awards

Adobe Research Women in Technology Scholarship
EPFL EDIC Fellowship
Best Paper Award, IMICPW 2019

Earlier Work

During my master's, I worked on qubit mapping for trapped-ion quantum computers with Dr. Tom Conte (master's thesis). I also co-founded PACE through CREATE-X, building pose-detection systems for remote physical therapy.

Beyond Research

I’m an avid quizzer (placed 3rd nationally as a kid), take Bachata classes, and collect books on mythology and medieval history. I enjoy catching details in films and behind-the-scenes takes on TV shows. Always happy to recommend brunch spots near Georgia Tech.