# Deploy GGUF quants of Deepseek R1 671B on Serverless GPUs

> Deploy GGUF quants of Deepseek R1 671B param model using Tensorfuse

DeepSeek-R1 is an advanced large language model designed to handle a wide range of conversational and generative tasks. It
performs strongly on a variety of benchmarks and excels at complex reasoning. DeepSeek R1 was trained in fp8 precision and has
671B parameters. In fp8 precision each parameter takes 8 bits (one byte), which brings the memory required to run the model to \~700 GB.

On a single node, 700 GB of GPU memory is only available with 8xH200 GPUs. Even an 8xH100 node has a combined GPU memory of only
640 GB. Hence, GPU memory becomes the bottleneck for deploying the full-precision model.
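
The arithmetic behind these numbers can be sketched in a few lines of Python. This is a rough estimate of the weights alone, ignoring KV cache and runtime overhead. The per-GPU capacities (80 GB for H100, 141 GB for H200) are assumptions based on the commonly quoted specifications, and `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 671e9  # parameter count stated above

fp8_gb = weight_memory_gb(PARAMS, 8)  # weights only, before runtime overhead
h100_node_gb = 8 * 80    # assumed 80 GB per H100
h200_node_gb = 8 * 141   # assumed 141 GB per H200

print(f"fp8 weights: {fp8_gb:.0f} GB")
print(f"fits on 8xH100 ({h100_node_gb} GB): {fp8_gb < h100_node_gb}")
print(f"fits on 8xH200 ({h200_node_gb} GB): {fp8_gb < h200_node_gb}")
```

The weights alone already exceed the 640 GB of an 8xH100 node, before any memory is set aside for the KV cache.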

Our friends at [Unsloth AI](https://unsloth.ai/) released GGUF quants of the DeepSeek model that require only 140 GB of GPU memory. Their quants
average 1.58 bits per parameter (some parameters are stored in 8 bits, some in 4 bits, and some in 1 bit). This lets us deploy the model on a variety of GPU
combinations that fit on a single node. In this guide, we will walk you through deploying GGUF quants of DeepSeek R1 671B on your cloud account using Tensorfuse.

We will deploy the 140 GB model on a combination of 4xL40S GPUs. This combination has a combined GPU memory of 192 GB, giving us enough space to run
the model with some headroom left for the KV cache and a reasonably long context length.
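
As a quick sanity check, the headroom can be worked out as follows. This is a minimal sketch: the 48 GB per L40S is an assumption from the commonly quoted specification, and `headroom_gb` is a hypothetical helper introduced only for illustration.

```python
def headroom_gb(num_gpus: int, gb_per_gpu: float, model_gb: float) -> float:
    """GPU memory left over after loading the weights, in GB."""
    return num_gpus * gb_per_gpu - model_gb


# Back-of-the-envelope size of 671B parameters at an average of 1.58 bits each.
estimate_gb = 671e9 * 1.58 / 8 / 1e9

# 4 x L40S (assumed 48 GB each) with the 140 GB figure quoted above.
l40s_headroom = headroom_gb(4, 48, 140)

print(f"1.58-bit estimate: {estimate_gb:.1f} GB")
print(f"4xL40S headroom:   {l40s_headroom} GB")
```

The estimate lands close to the 140 GB quoted for the quant, and the remaining memory is what the KV cache and compute buffers have to fit into.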

<Tip>
  You can deploy other GGUF quant models by modifying the `entrypoint.sh` script below. You can also tinker with the number of GPUs and GPU type to deploy on other GPU combinations.
  If you have more than 192 GB of GPU memory, we also recommend experimenting with the `--ctx-size` parameter.
</Tip>

## Prerequisites

Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven't done that yet, follow the [Getting Started](/concepts/getting_started_tensorkube) guide.

## Deploying Deepseek-R1-671B with Tensorfuse

Each Tensorkube deployment requires:

1. **Your environment** (as a Dockerfile).
2. **Your code** (in this example, the entrypoint.sh script).
3. **A deployment configuration** (`deployment.yaml`).

### Step 1: Prepare the Dockerfile

We will use the official llama.cpp image as our base image. This image comes with all the necessary
dependencies to run llama.cpp. The image is available on the GitHub Container Registry as [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/pkgs/container/llama.cpp).

We will then set the environment variables required to run the model and install the necessary huggingface dependencies to download the model.

We will then copy our code and set the permissions for the entrypoint script.

```dockerfile Dockerfile theme={null}
FROM ghcr.io/ggerganov/llama.cpp:full-cuda

# Set environment variables
ENV CUDA_VISIBLE_DEVICES=0,1,2,3
ENV GGML_CUDA_MAX_STREAMS=16
ENV GGML_CUDA_MMQ_Y=1
ENV HF_HUB_ENABLE_HF_TRANSFER=1
WORKDIR /app

# Install dependencies
RUN apt-get update && \
    apt-get install -y python3-pip && \
    pip3 install huggingface_hub hf-transfer

# Copy and set permissions
COPY entrypoint.sh .
RUN chmod +x /app/entrypoint.sh

EXPOSE 8080

ENTRYPOINT ["/app/entrypoint.sh"]
```

### Step 2: Prepare the entrypoint script

In this step, we first download the GGUF model using `snapshot_download` from `huggingface_hub`. We then start `llama-server` with the flags needed to run the model.

<Tip>
  You can deploy other GGUF quant models by modifying the `entrypoint.sh` script below. You will have to change the `repo_id`, `local_dir`, and `allow_patterns` arguments
  passed to `snapshot_download` and change the `--model` flag in the `llama-server` command.
</Tip>

```bash entrypoint.sh theme={null}
#!/bin/bash
set -e

# Download model shards if missing
if [ ! -d "/app/DeepSeek-R1-GGUF" ]; then
  echo "Downloading model..."
  python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
  repo_id='unsloth/DeepSeek-R1-GGUF',
  local_dir='DeepSeek-R1-GGUF',
  allow_patterns=['*UD-IQ1_S*']
)"
fi

echo "Downloading model finished. Now waiting to start the llama server with optimisations for one batch latency"

# Start server with single-request optimizations
./llama-server \
  --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 62 \
  --parallel 4 \
  --ctx-size 5120 \
  --mlock \
  --threads 42 \
  --tensor-split 1,1,1,1 \
  --no-mmap \
  --rope-freq-base 1000000 \
  --rope-freq-scale 0.25 \
  --metrics
```
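
When switching to another quant, the download arguments and the `--model` path have to stay in sync. The sketch below derives both from the same inputs. The helper names are hypothetical, and the directory layout (`<local_dir>/<model>-<quant>/<model>-<quant>-00001-of-0000N.gguf`) is simply read off the script above, so other repositories may be organised differently.

```python
def download_kwargs(repo_id: str, quant: str) -> dict:
    """Keyword arguments for huggingface_hub.snapshot_download for one quant."""
    return {
        "repo_id": repo_id,
        "local_dir": repo_id.split("/")[-1],  # e.g. "DeepSeek-R1-GGUF"
        "allow_patterns": [f"*{quant}*"],     # only fetch files for this quant
    }


def first_shard_path(local_dir: str, model_name: str, quant: str, num_shards: int) -> str:
    """Path of the first GGUF shard, which is what --model points at."""
    stem = f"{model_name}-{quant}"
    return f"{local_dir}/{stem}/{stem}-00001-of-{num_shards:05d}.gguf"


kwargs = download_kwargs("unsloth/DeepSeek-R1-GGUF", "UD-IQ1_S")
model_path = first_shard_path(kwargs["local_dir"], "DeepSeek-R1", "UD-IQ1_S", 3)
print(model_path)

# To actually download (needs huggingface_hub installed and network access):
# from huggingface_hub import snapshot_download
# snapshot_download(**kwargs)
```

For the values used in this guide, the printed path matches the `--model` argument in `entrypoint.sh`.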

We’ve configured `llama-server` with a number of CLI flags tailored to our specific use case. [A comprehensive list](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) of all
other `llama-server` flags is available for further reference, and if you have questions about selecting flags for production, the [Tensorfuse Community](https://join.slack.com/t/tensorfusecommunity/shared_invite/zt-30r6ik3dz-Rf7nS76vWKOu6DoKh5Cs5w) is an excellent place to seek guidance.
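
One interaction worth keeping in mind when tuning these flags is between `--ctx-size` and `--parallel`. The sketch below assumes that `llama-server` splits the total context evenly across its parallel slots, which is how we understand the versions this guide was written against; newer releases may behave differently, so check the README linked above. The helper names are hypothetical.

```python
def ctx_per_slot(ctx_size: int, parallel: int) -> int:
    """Context tokens available to each request, assuming an even split."""
    return ctx_size // parallel


def ctx_size_for(target_per_request: int, parallel: int) -> int:
    """Total --ctx-size needed so every slot gets the target context."""
    return target_per_request * parallel


# Values from entrypoint.sh: --ctx-size 5120 --parallel 4
print(ctx_per_slot(5120, 4))   # tokens per concurrent request
print(ctx_size_for(4096, 4))   # total context for 4096 tokens per request
```

Under that assumption, each of the four concurrent requests gets a quarter of the configured context, so longer prompts need either a larger `--ctx-size` (and more GPU memory for the KV cache) or a smaller `--parallel`.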

### Step 3: Deployment config

Although you can deploy Tensorfuse apps [using the command line](/reference/cli_reference/tensorkube_deploy), we recommend keeping a config file so
that you can follow a [GitOps approach](https://about.gitlab.com/topics/gitops/) to deployment.

```yaml deployment.yaml theme={null}
# deployment.yaml for DeepSeek R1 GGUF quants

gpus: 4
gpu_type: l40s
port: 8080
readiness:
    httpGet:
        path: /health
        port: 8080
```

Don't forget the `readiness` endpoint in your config. Tensorfuse uses this endpoint to ensure that your service is healthy.

`llama-server` exposes readiness by default on the `/health` endpoint. We set `port` to `8080` in `deployment.yaml` because `llama-server`
listens on that port and the Dockerfile exposes `8080`.

<Warning>
  If no `readiness` endpoint is configured, Tensorfuse tries the `/readiness` path on port 80 by default, which can cause issues if your app is not listening on that path.
</Warning>
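
To get a feel for what the readiness probe does, here is a minimal Python sketch that polls `/health` until it answers with HTTP 200. It is an illustration rather than how Tensorfuse implements its probe. The function names, retry count, and delay are assumptions, and the base URL is a placeholder.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_healthy(base_url, attempts=60, delay_s=10.0,
                       check=is_healthy, sleep=time.sleep) -> bool:
    """Poll the health check until it passes or the attempts run out."""
    for _ in range(attempts):
        if check(base_url):
            return True
        sleep(delay_s)
    return False


# Example (placeholder URL):
# wait_until_healthy("http://localhost:8080")
```

Loading a model of this size takes a while, so expect the check to fail until the weights are fully loaded.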

Now you can deploy your service using the following command:

```bash theme={null}
tensorkube deploy --config-file ./deployment.yaml
```

### Step 4: Accessing the deployed app

<Icon icon="rocket" /> Voila! Your **autoscaling** production llama.cpp service is ready.

Once the deployment is successful, you can see the status of your app by running:

```bash theme={null}
tensorkube deployment list
```

And that's it! You have successfully deployed the **world's strongest Open Source Reasoning Model** in a quantised format.

<Note>
  Remember to configure a TLS endpoint with a [custom domain](/concepts/custom_domains_with_tls) before going to production.
</Note>

To test it out, replace `YOUR_APP_URL` with the endpoint shown in the output of the above command and run:

```bash theme={null}
curl --request POST \
  --url YOUR_APP_URL/v1/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "deepseek-ai/DeepSeek-R1",
    "prompt": "Earth to Robotland. What is up?",
    "max_tokens": 200
}'
```

Because `llama-server` is compatible with the OpenAI API, you can use [OpenAI’s client libraries](https://platform.openai.com/docs/api-reference/completions/create)
as well. Here’s a sample snippet using Python:

```python theme={null}
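from openai import OpenAI

# Replace YOUR_APP_URL with your actual URL. The client requires a non-empty
# API key; llama-server only checks it if started with --api-key.
client = OpenAI(base_url="YOUR_APP_URL/v1", api_key="not-needed")

response = client.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    prompt="Hello, Deepseek R1! How are you today?",
    max_tokens=200,
)

print(response.choices[0].text)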
```

[Click here](https://app.tensorfuse.io/) to get started with Tensorfuse.

You can also directly use the [Deepseek-R1 GitHub repository](https://github.com/tensorfuse/tensorfuse-examples/tree/main/deepseek_r1) for more details and updates on these Dockerfiles.
