# Deploying FLUX.1-dev on Serverless GPUs

> Deploy serverless GPU applications on your AWS account

FLUX.1 \[dev] is a 12-billion-parameter rectified flow transformer that generates images from text. This guide shows how to
deploy the **FLUX.1-dev** model on your cloud account using Tensorfuse, running on a single L40S GPU.

We will serve the model with NVIDIA Triton Inference Server and protect the service with **token-based authentication**. The authentication token (`FLUX_API_KEY`) is stored as a [Tensorfuse secret](/concepts/secrets).

## Prerequisites

Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven't done that yet, follow the [Getting Started](/concepts/getting_started_tensorkube) guide.

## Deploying FLUX.1-dev with Tensorfuse

Each Tensorkube deployment requires:

1. **Your environment** (as a Dockerfile).
2. **Your code** (in this example, the models directory).
3. **A deployment configuration** (`deployment.yaml`).

### Step 1: Prepare the Dockerfile

We will use the official NVIDIA Triton Inference Server image as our base image. This image comes with Triton and its
dependencies preinstalled. Available image tags are listed in the NVIDIA [container catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags).

In addition to the base image, we install a few Python packages, set an additional environment variable, and copy the models directory into the Docker image.

```dockerfile Dockerfile theme={null}
# Use NVIDIA Triton Inference Server as base image
FROM nvcr.io/nvidia/tritonserver:25.01-pyt-python-py3

RUN pip install --no-cache-dir \
    torch \
    diffusers \
    transformers \
    accelerate \
    safetensors \
    Pillow \
    hf_transfer \
    protobuf \
    bitsandbytes \
    sentencepiece \
    numpy


RUN mkdir -p /models/flux/1

COPY models/flux/1/model.py /models/flux/1/model.py
COPY models/flux/config.pbtxt /models/flux/config.pbtxt


# Set environment variables
ENV HF_HUB_ENABLE_HF_TRANSFER=1

# Expose Triton HTTP (8000), gRPC (8001), and metrics (8002) ports
EXPOSE 8000
EXPOSE 8001
EXPOSE 8002

# Start Triton Server
CMD ["tritonserver", "--model-repository=/models", "--allow-gpu-metrics=false", "--allow-metrics=false", "--metrics-port=0", "--http-restricted-api=inference:API_KEY=r2JmQNuD"]
```

We've configured the Triton server with a couple of CLI flags tailored to this use case: metrics are disabled, and an authentication key is required for inference requests. For more details on authentication, refer to the [Triton docs](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md#limit-endpoint-access-beta).
If you have questions about selecting flags for production, reach out to the [Tensorfuse Community](https://join.slack.com/t/tensorfusecommunity/shared_invite/zt-30r6ik3dz-Rf7nS76vWKOu6DoKh5Cs5w).
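
To make the shape of the `--http-restricted-api` value concrete, here is a minimal Python sketch that splits it into the protected API groups and the header a client must send. The helper name `parse_restricted_api` is hypothetical and exists only for illustration; Triton parses this flag itself.

```python
def parse_restricted_api(value: str) -> tuple[list[str], str, str]:
    """Split '<api>[,<api>...]:<header-name>=<header-value>' into its parts.

    Hypothetical helper for illustration only; it is not part of Triton.
    """
    apis, _, key_value = value.partition(":")
    header_name, _, header_value = key_value.partition("=")
    if not apis or not header_name or not header_value:
        raise ValueError(f"malformed restricted-api value: {value!r}")
    return apis.split(","), header_name, header_value


# Same value as in the Dockerfile CMD line.
apis, header_name, header_value = parse_restricted_api("inference:API_KEY=r2JmQNuD")

# Requests to the listed API groups must carry this header.
required_header = {header_name: header_value}
```

In this guide only the `inference` group is restricted, so inference requests need the `API_KEY` header.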

### Step 2: Prepare the models directory

We will use Triton's Python backend to serve the model. Create a models directory and add the `model.py` and `config.pbtxt` files to it. For more details about the Triton Python backend, refer to the [Triton docs](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/python_backend/README.html#).

```sh theme={null}
mkdir -p models/flux/1
```

```python models/flux/1/model.py theme={null}
import triton_python_backend_utils as pb_utils
import numpy as np
import torch
from diffusers import FluxPipeline
from io import BytesIO

class TritonPythonModel:
    def initialize(self, args):
        """Load the FLUX.1-dev pipeline onto the GPU"""
        self.logger = pb_utils.Logger
        self.model_id = "black-forest-labs/FLUX.1-dev"

        try:
            # Load the pipeline in bfloat16 precision
            self.pipeline = FluxPipeline.from_pretrained(
                self.model_id,
                torch_dtype=torch.bfloat16,
            ).to("cuda")
            self.logger.log_info("Successfully loaded FLUX.1-dev model")

        except Exception as e:
            self.logger.log_error(f"Error initializing model: {str(e)}")
            raise

    def execute(self, requests):
        """Process requests and generate images"""
        responses = []

        for request in requests:
            try:
                # Get input prompt
                prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT")
                prompt_str = prompt.as_numpy()[0].decode()

                # Generate image
                image = self.pipeline(
                    prompt=prompt_str,
                    num_inference_steps=25,
                    guidance_scale=7.5,
                    height=512,
                    width=512,
                ).images[0]

                # Convert image to byte array
                img_byte_arr = BytesIO()
                image.save(img_byte_arr, format="PNG")
                img_np = np.frombuffer(img_byte_arr.getvalue(), dtype=np.uint8)

                # Create output tensor
                output_tensor = pb_utils.Tensor(
                    "GENERATED_IMAGE",
                    img_np
                )

                responses.append(pb_utils.InferenceResponse([output_tensor]))
                self.logger.log_info("Successfully generated image")

            except Exception as e:
                self.logger.log_error(f"Error processing request: {str(e)}")
                responses.append(pb_utils.InferenceResponse(error=str(e)))

        return responses

    def finalize(self):
        """Cleanup resources"""
        self.pipeline = None
        torch.cuda.empty_cache()
```

```sh models/flux/config.pbtxt theme={null}
name: "flux"
backend: "python"
max_batch_size: 0

input [
  {
    name: "PROMPT"
    data_type: TYPE_STRING
    dims: [1]
  }
]

output [
  {
    name: "GENERATED_IMAGE"
    data_type: TYPE_UINT8
    dims: [-1]
  }
]
```
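
The `config.pbtxt` above determines what an HTTP inference request must look like: one input named `PROMPT` with shape `[1]`. In Triton's HTTP/JSON protocol, a `TYPE_STRING` tensor is sent with datatype `BYTES`. The following is a minimal sketch of building such a request body. The helper name `build_infer_request` is hypothetical.

```python
import json


def build_infer_request(prompt: str) -> dict:
    """Build a JSON body for POST /v2/models/flux/versions/1/infer."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {
        "inputs": [
            {
                "name": "PROMPT",     # must match the input name in config.pbtxt
                "shape": [1],         # matches dims: [1]
                "datatype": "BYTES",  # wire name for TYPE_STRING
                "data": [prompt],
            }
        ]
    }


body = build_infer_request("a golden retriever at sunset")
payload = json.dumps(body)  # what is actually sent over the wire
```

Rejecting empty prompts on the client side is a design choice of this sketch, not something the model configuration enforces.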

### Step 3: Create Secrets

We will create a secret to store the authentication token. We will use this token to authenticate the inference requests.

```sh theme={null}
tensorkube secret create flux-secret FLUX_API_KEY=r2JmQNuD # this token must match the one used in the Dockerfile
```

We also need to create a Hugging Face secret so that the model can be downloaded from the Hugging Face Hub.

```sh theme={null}
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=your_token
```
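
Code that depends on these values can check for them early. The sketch below assumes that the values are available as environment variables wherever the code runs (the sample client in this guide reads `FLUX_API_KEY` that way). Whether Tensorfuse secrets appear under the same names inside the container should be confirmed against the secrets documentation. The helper name `require_env` is hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Demo only: set a placeholder so that this example runs on its own.
os.environ.setdefault("FLUX_API_KEY", "placeholder-token")
api_key = require_env("FLUX_API_KEY")
```

Failing fast with a clear message is usually easier to debug than an authentication error returned by the server later on.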

### Step 4: Deployment config

Although you can deploy Tensorfuse apps [using the command line](/reference/cli_reference/tensorkube_deploy), we recommend keeping a config file so
that you can follow a [GitOps approach](https://about.gitlab.com/topics/gitops/) to deployment.

```yaml deployment.yaml theme={null}
# deployment.yaml for FLUX.1-dev
gpus: 1 # Number of GPUs
gpu_type: l40s # GPU Type
port: 8000 # Port to expose the service
min_scale: 0
max_scale: 1
secret:
  - hugging-face-secret
  - flux-secret
readiness:
  httpGet:
    path: /v2/health/ready # readiness endpoint for triton server
    port: 8000
```

Now you can deploy your service using the following command:

```sh theme={null}
tensorkube deploy --config deployment.yaml
```

### Step 5: Accessing the deployed app

<Icon icon="rocket" /> Voila! Your **autoscaling** production text-to-image service using FLUX.1-dev is ready.

Once the deployment is successful, you can see the status of your app by running:

```bash theme={null}
tensorkube deployment list
```

And that's it! You have successfully deployed the **FLUX.1-dev** model.

<Note>
  Remember to configure a TLS endpoint with a [custom domain](/concepts/custom_domains_with_tls) before going to production.
</Note>
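
With `min_scale: 0`, the deployment can scale down to zero replicas, so the first request after an idle period may have to wait for a container to start and the model to load. One way to handle this is to poll the readiness endpoint from `deployment.yaml` (`/v2/health/ready`) before sending inference requests. The sketch below uses only the standard library. The function names, the 600-second timeout, and the 5-second interval are assumptions chosen for illustration.

```python
import time
import urllib.request
from typing import Callable


def http_ready(base_url: str) -> bool:
    """Return True if Triton's readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=10) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def wait_until_ready(
    base_url: str,
    probe: Callable[[str], bool] = http_ready,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll the probe until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until_ready("<DEPLOYMENT_URL>")` returns `True` once the server reports ready and `False` if the timeout passes first. The `probe` parameter exists so that the polling logic can be exercised without a network connection.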

To test it out, use the sample `client.py` file below. Replace `<DEPLOYMENT_URL>` in the code with your deployment URL and set `FLUX_API_KEY` as an environment variable before running it.

```python client.py theme={null}
import os
import sys
from io import BytesIO

import numpy as np
import requests
from PIL import Image

deployment_url = "<DEPLOYMENT_URL>"  # replace with your deployment URL, without a trailing slash
api_key = os.getenv("FLUX_API_KEY")
if not api_key:
    sys.exit("Set the FLUX_API_KEY environment variable before running this script")
inference_endpoint = f"{deployment_url}/v2/models/flux/versions/1/infer"

request_data = {
    "inputs": [
      {
        "name": "PROMPT",
        "shape": [1],
        "datatype": "BYTES",
        "data": ["Generate a golden retriever with a sunset background"]
      }
    ]
}

# The header name must match the key configured via --http-restricted-api
headers = {"Content-Type": "application/json", "API_KEY": api_key}

# Send POST request
response = requests.post(inference_endpoint, headers=headers, json=request_data)
if response.status_code != 200:
    print(f"Request to {inference_endpoint} failed with status {response.status_code}")
    print(f"Response: {response.text}")
    sys.exit(1)

# The image comes back as a flat list of uint8 values holding the PNG bytes
response_data = response.json()
image_data = response_data["outputs"][0]["data"]
img_np = np.array(image_data, dtype=np.uint8)
byte_data = img_np.tobytes()

# Wrap the bytes in a BytesIO stream
byte_io = BytesIO(byte_data)

# Save the generated image
generated_image = Image.open(byte_io)
generated_image.save("generated_image.png")
```

Don't forget to install the required Python packages before running `client.py`:

```requirements.txt theme={null}
pillow
numpy
requests
```

```sh theme={null}
pip install -r requirements.txt
```

```sh theme={null}
python client.py
```

Once you run `client.py`, you will see a `generated_image.png` file in your directory. That's it, you have successfully generated an image using the FLUX.1-dev model.
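
If the client fails while opening the image, a quick check is whether the returned values actually form a PNG file. The server sends the image as a flat list of integers in the range 0 to 255, and every PNG file starts with the same 8-byte signature. The following standard-library sketch performs that check on a stand-in response. The helper name `png_bytes_from_response` and the fake response are hypothetical.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def png_bytes_from_response(response_data: dict) -> bytes:
    """Turn the first output tensor of a Triton JSON response into PNG bytes."""
    outputs = response_data.get("outputs") or []
    if not outputs:
        raise ValueError("response contains no outputs")
    raw = bytes(outputs[0]["data"])  # list of ints in 0..255 -> bytes
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("output does not start with the PNG signature")
    return raw


# Stand-in response for illustration: the signature plus a few dummy bytes.
fake_response = {
    "outputs": [
        {
            "name": "GENERATED_IMAGE",
            "datatype": "UINT8",
            "data": list(PNG_SIGNATURE) + [0, 1, 2],
        }
    ]
}
png = png_bytes_from_response(fake_response)
```

With a real response, the returned bytes can be written straight to a `.png` file or wrapped in `BytesIO` as `client.py` does.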

To get started with Tensorfuse, [click here](https://app.tensorfuse.io/).

You can also refer to the [Tensorfuse GitHub repository](https://github.com/tensorfuse/tensorfuse-examples/tree/main/text-to-image/flux-1-dev) for more details and updates on these Dockerfiles.
