NVIDIA DGX Spark Case Study:

RAG Inference by Day, Fine-tuning by Night

Authors:

Ryan Shrout

Ken Addison

June 23, 2026

The rapid maturation of large language models has created a new class of infrastructure challenge for small and mid-size organizations. Cloud-hosted inference APIs offer convenience but raise concerns around data sovereignty, recurring costs, and latency. Meanwhile, conventional desktop GPUs can lack the memory and compute to run production-quality models at the scale these organizations need. The NVIDIA DGX Spark™ occupies a deliberate middle ground as a desktop-form-factor AI supercomputer built on the Grace-Blackwell architecture with 128 GB of unified memory, designed to bring data-center-class AI capability into an office environment.

The DGX Spark addresses these challenges in a single desktop system a small team can own and operate. The dual-use model is central to that case. The same hardware that serves the team through the workday is put to fine-tuning overnight, so an expensive asset earns its keep around the clock instead of sitting idle outside business hours. For a four-to-eight-person group that cannot justify a dedicated MLOps function or a rack of discrete GPUs, DGX Spark is best understood not as a smaller version of rack-scale infrastructure but as an owned local AI computer sized for small-team deployment.

Key Takeaways

One DGX Spark serves up to 8 users on a 30B model by day and fine-tunes a model by night, with no data leaving the building.
It supports up to 4 users at a 7.9 s time to first token (TTFT) and ~39 tokens/s.
Overnight LoRA training held ~425 tokens/s, finishing a 500-step run in about 45 minutes.
Spread across 8 users, DGX Spark works out to <$600 per seat as a one-time fixed cost.
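
The takeaways above imply a few secondary figures that are useful for sanity-checking. The sketch below derives them with simple arithmetic. It assumes the ~425 tokens/s training throughput held steady for the full ~45 minutes (no warm-up or checkpointing overhead), and it treats the <$600 per-seat figure as an upper bound, so the results are rough estimates rather than measured values.

```python
# Back-of-the-envelope figures implied by the Key Takeaways.
# All inputs are the approximate values quoted in the takeaways.
TRAIN_TOKENS_PER_S = 425    # reported overnight LoRA throughput
TRAIN_STEPS = 500           # reported run length in optimizer steps
TRAIN_MINUTES = 45          # reported wall-clock time
SEATS = 8                   # users the cost is spread across
COST_PER_SEAT_BOUND = 600   # reported upper bound, USD per seat

# Assumption: throughput is constant over the whole run.
train_seconds = TRAIN_MINUTES * 60
seconds_per_step = train_seconds / TRAIN_STEPS
total_tokens = TRAIN_TOKENS_PER_S * train_seconds
tokens_per_step = total_tokens / TRAIN_STEPS

# Upper bound on system cost implied by the per-seat figure.
system_cost_bound = SEATS * COST_PER_SEAT_BOUND

print(f"Seconds per training step: {seconds_per_step:.1f}")
print(f"Tokens processed in the run: {total_tokens:,}")
print(f"Tokens per step: {tokens_per_step:,.0f}")
print(f"Implied system cost: under ${system_cost_bound:,}")
```

Under these assumptions, the run works out to roughly 5.4 seconds and about 2,300 tokens per step, or about 1.15 million tokens in total, and the per-seat figure implies a system cost under $4,800.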

This paper evaluates the DGX Spark as a dual-use AI workstation through a practical workflow we call the day/night cycle. During business hours, the system serves as a retrieval-augmented generation (RAG) chatbot, fielding concurrent queries from a small team against an ingested knowledge base. Overnight and on weekends, the same hardware pivots to fine-tuning a model with LoRA, incorporating new organizational knowledge and improving response quality over time. The two workloads never overlap; they share the same 128 GB memory envelope but use it for fundamentally different purposes.
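
As a minimal sketch of how the day/night split described above might be decided in software, the function below picks a workload from the current time. The business-hours window (08:00 to 18:00, Monday to Friday) and the names `workload_for`, `rag_inference`, and `lora_finetune` are hypothetical, chosen only for illustration; they are not taken from the setup evaluated here.

```python
from datetime import datetime

# Hypothetical business hours; the case study does not specify them.
DAY_START_HOUR = 8
DAY_END_HOUR = 18


def workload_for(now: datetime) -> str:
    """Return which workload should own the machine at time `now`."""
    is_weekend = now.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    in_business_hours = DAY_START_HOUR <= now.hour < DAY_END_HOUR
    if in_business_hours and not is_weekend:
        return "rag_inference"   # daytime: serve RAG queries
    return "lora_finetune"       # nights and weekends: fine-tune


if __name__ == "__main__":
    print(workload_for(datetime.now()))
```

Because the two workloads never overlap, a single exclusive decision like this is enough. A real deployment would also need to stop the inference server and release its memory before training starts, which this sketch does not cover.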

Our goal is to characterize end-to-end performance at each stage of this cycle: how many concurrent users can the system serve with acceptable latency during RAG inference, and how quickly can it complete a meaningful fine-tuning run overnight? We present concrete benchmarks on both workloads using the Nemotron-3-Nano-30B-A3B model, document the practical setup requirements (including several non-obvious sharp edges), and assess the viability of the DGX Spark as a self-contained AI platform for teams of roughly four to eight people.

Research commissioned by:

NVIDIA