> ## Documentation Index
> Fetch the complete documentation index at: https://tensorfuse.io/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Deploy Deepseek R1 671B on Serverless GPUs

> Deploy Deepseek R1 671B param model using Tensorfuse

Deepseek-R1 is an advanced large language model designed to handle a wide range of conversational and generative tasks. It
has proven capabilities in various benchmarks and excels in complex reasoning. In this guide, we will walk you through deploying
the **Deepseek-R1 671B parameter** model on your cloud account using Tensorfuse. We will be using H100 GPUs for this example, however
it is super easy to deploy on other GPUs as well (see Tip).

<Tip>
  Although this guide focuses on deploying the 671B param model, you can easily adapt the instructions to deploy *any distilled version*
  of Deepseek-R1, including 2B, 7B, and 70B param models. See the table at the end of the guide or visit [the github repo](https://github.com/tensorfuse/tensorfuse-examples/tree/main/deepseek_r1).
</Tip>

## Why Build with Deepseek-R1

Deepseek-R1 offers:

* **High Performance on Evaluations**: Achieves strong results on industry-standard benchmarks.
* **Advanced Reasoning**: Handles multi-step logical reasoning tasks with minimal context.
* **Multilingual Support**: Pretrained on diverse linguistic data, making it adept at multilingual understanding.
* **Scalable Distilled Models**: Smaller distilled variants (2B, 7B, 32B, 70B) offer cheaper options without compromising on cost.

Below is a quick snapshot of benchmark scores for Deepseek-R1:

| **Benchmark**               | **Deepseek-R1 (671B)** | **Remarks**                          |
| --------------------------- | ---------------------- | ------------------------------------ |
| MMLU                        | 90.8%                  | Near state-of-the-art                |
| AIME 2024 (Pass\@1)         | 79.8%                  | Mathematical and reasoning abilities |
| LiveCodeBench (Pass\@1-COT) | 65.9%                  | Excels at multi-step reasoning       |

The combination of these strengths makes Deepseek-R1 an excellent choice for production-ready applications, from chatbots to enterprise-level data analytics.

***

## Prerequisites

Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven't done that yet, follow the [Getting Started](/concepts/getting_started_tensorkube) guide.

## Deploying Deepseek-R1-671B with Tensorfuse

Each Tensorkube deployment requires:

1. **Your code** (in this example, vLLM API server code is used from the Docker image).
2. **Your environment** (as a Dockerfile).
3. **A deployment configuration** (`deployment.yaml`).

We will also add **token-based authentication** to our service, compatible with OpenAI client libraries. We will store the authentication token (`VLLM_API_KEY`) as a [Tensorfuse secret](/concepts/secrets). Unlike some other models, **Deepseek-R1 671B** does not require a separate Hugging Face token, so we can skip that step.

### Step 1: Set your API authentication token

Generate a random string that will be used as your API authentication token. Store it as a secret in Tensorfuse using the command below. For the purpose of this demo, we will be using `vllm-key` as your API key.

```bash theme={null}
tensorkube secret create vllm-token VLLM_API_KEY=vllm-key --env default
```

Ensure that in production you use a randomly generated token. You can quickly generate one
using `openssl rand -base64 32` and remember to keep it safe as [Tensorfuse secrets](/concepts/secrets) are opaque.

### Step 2: Prepare the Dockerfile

We will use the official vLLM Openai image as our base image. This image comes with all the necessary
dependencies to run vLLM. The image is present on DockerHub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).

```dockerfile Dockerfile theme={null}

# Dockerfile for Deepseek-R1-671B

FROM tensorfuse/vllm-openai:v0.8.4-patched

# Enable HF Hub Transfer
ENV HF_HUB_ENABLE_HF_TRANSFER 1

# Expose port 80
EXPOSE 80

# Entrypoint with API key
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --dtype bfloat16 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --max-model-len 4096 \
  --port 80 \
  --cpu-offload-gb 80 \
  --gpu-memory-utilization 0.95 \
  --api-key ${VLLM_API_KEY}
# DeepSeek-R1 model configuration
# - Using deepseek-ai/DeepSeek-R1 model with bfloat16 dtype (~1400GB GPU memory required)
# - Running on 8 GPUs with tensor parallelism
# - Max 4096 tokens to avoid OOM errors
# - CPU offload of 80GB needed as 8 H100s are not sufficient
# - Using 95% GPU memory utilization
# - Server runs on port 80
# - API key from environment variable for authentication
```

We’ve configured the vLLM server with numerous CLI flags tailored to our specific use case. [A comprehensive list](https://docs.vllm.ai/en/v0.4.0.post1/serving/openai_compatible_server.html#command-line-arguments-for-the-server) of all
other vLLM flags is available for further reference, and if you have questions about selecting flags for production, the [Tensorfuse Community](https://join.slack.com/t/tensorfusecommunity/shared_invite/zt-30r6ik3dz-Rf7nS76vWKOu6DoKh5Cs5w) is an excellent place to seek guidance.

### Step 3: Deployment config

Although you can deploy tensorfuse apps [using command line](/reference/cli_reference/tensorkube_deploy), it is always recommended to have a config file so
that you can follow a [GitOps approach](https://about.gitlab.com/topics/gitops/) to deployment.

```yaml deployment.yaml theme={null}
# deployment.yaml for Deepseek-R1-671B

gpus: 8
gpu_type: h100
secret:
  - vllm-token
min_scale: 1
readiness:
  httpGet:
    path: /health
    port: 80
```

Don't forget the `readiness` endpoint in your config. Tensorfuse uses this endpoint to ensure that your service is healthy.

<Warning>
  If no `readiness` endpoint is configured, Tensorfuse tries the `/readiness` path on port 80 by default which can cause issues if your app is not listening on that path.
</Warning>

Now you can deploy your service using the following command:

```bash theme={null}
tensorkube deploy --config-file ./deployment.yaml
```

### Step 4: Accessing the deployed app

<Icon icon="rocket" /> Voila! Your **autoscaling** production LLM service is ready. Only authenticated requests will be served by your endpoint.

Once the deployment is successful, you can see the status of your app by running:

```bash theme={null}
tensorkube deployment list
```

And that's it! You have successfully deployed the **world's strongest Open Source Reasoning Model**

<Note>
  Remember to configure a TLS endpoint with a [custom domain](/concepts/custom_domains_with_tls) before going to production.
</Note>

To test it out, replace `YOUR_APP_URL` with the endpoint shown in the output of the above command and run:

```bash theme={null}
curl --request POST \
  --url YOUR_APP_URL/v1/completions \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer vllm-key' \
  --data '{
    "model": "deepseek-ai/DeepSeek-R1",
    "prompt": "Earth to Robotland. What's up?",
    "max_tokens": 200
}'
```

Because vllm is compatible with the OpenAI API, you can use[OpenAI’s client libraries](https://platform.openai.com/docs/api-reference/completions/create)
as well. Here’s a sample snippet using Python:

```python theme={null}
import openai

# Replace with your actual URL and token
base_url = "YOUR_APP_URL/v1"
api_key = "vllm-key"

openai.api_base = base_url
openai.api_key = api_key

response = openai.Completion.create(
    model="deepseek-ai/DeepSeek-R1",
    prompt="Hello, Deepseek R1! How are you today?",
    max_tokens=200
)

print(response)
```

## Deploying other versions of Deepseek-R1

Although this guide has focused on Deepseek-R1 671B, there are smaller distilled variants available. Each variant changes primarily in:
•	Model name in the `Dockerfile` (`--model` flag).
•	GPU resources in `deployment.yaml`.
•	(Optional) `--tensor-parallel-size` depending on your hardware.

Below is a table summarizing the key changes for each variant:

| **Model Variant** | **Dockerfile Model Name**                 | **GPU Type** | **Num GPUs / Tensor parallel size** |
| ----------------- | ----------------------------------------- | ------------ | ----------------------------------- |
| DeepSeek-R1 2B    | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | A10G         | 1                                   |
| DeepSeek-R1 7B    | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B   | A10G         | 1                                   |
| DeepSeek-R1 8B    | deepseek-ai/DeepSeek-R1-Distill-Llama-8B  | A10G         | 1                                   |
| DeepSeek-R1 14B   | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B  | L40S         | 1                                   |
| DeepSeek-R1 32B   | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B  | L4           | 4                                   |
| DeepSeek-R1 70B   | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | L40S         | 4                                   |
| DeepSeek-R1 671B  | deepseek-ai/DeepSeek-R1                   | H100         | 8                                   |

[Click here](https://app.tensorfuse.io/) to get started with Tensorfuse.

You can also directly use the [Deepseek-R1 GitHub repository](https://github.com/tensorfuse/tensorfuse-examples/tree/main/deepseek_r1) for more details and updates on these Dockerfiles.
