# Spark TensorRT-LLM Migration

- Article slug: `spark-trtllm-migration-may-2026`
- Before model: `qwen3:8b`
- After model: `Qwen/Qwen3-8B via TensorRT-LLM 1.3.0rc12.post1 (NGC release container)`

## Comparison

| Concurrency | Before avg ms | After avg ms | Before tok/s | After tok/s |
|---|---:|---:|---:|---:|
| 1 | 2541.67 | 6997.54 | 37.76 | 13.72 |
| 2 | 4667.74 | 6797.99 | 38.56 | 28.20 |
| 4 | 8738.93 | 6815.54 | 38.47 | 56.29 |

## Pros
- Spark text is now served through a dedicated TensorRT-LLM endpoint instead of sharing the same Ollama lane used for legacy text workflows.
- The final cutover uses NVIDIA’s official release container on Spark, which is a cleaner and more supportable path than a hand-built pip environment.
- TensorRT-LLM exposes OpenAI-compatible `/v1/chat/completions`, `/health`, and `/v1/models`, which gives the public stack a cleaner operator surface.
- The Spark image path is unchanged, so the text migration does not disturb the existing hosted-image workflow.
- The working path on Spark stayed inside NVIDIA’s current Blackwell-oriented TensorRT-LLM release container instead of the unsupported pip combinations that failed earlier.
- At concurrency 4, aggregate completion throughput climbed to 56.29 tok/s.

## Cons
- The Spark text lane is now a second runtime to operate: separate container, sidecar port, model hydration path, and health checks all add surface area.
- The first-time startup path is heavier than Ollama because it must hydrate a large Hugging Face checkpoint before the API becomes ready.
- The direct pip path was not production-safe on Spark: newer 1.2.x wheels failed to import cleanly, and older fallback builds exposed unsupported-kernel failures on GB10 before the container path succeeded.
- The public Spark text catalog narrowed from 5 models to 1 models during the first cutover.