# Finetune Llama 3 70B on your AWS account

> Fine-tune LoRA adapters for popular models using Axolotl-style declarative configs

# Fine-tuning Guide for Tensorfuse

This guide explains how to fine-tune Llama models using Tensorfuse's QLoRA implementation.

## Supported Models

| Model         | GPU Requirements      |
| ------------- | --------------------- |
| Llama 3.1 70B | 4x L40S (Recommended) |
| Llama 3.1 8B  | 1-2x A10G             |

## Dataset Preparation

Tensorfuse accepts datasets in JSONL format, where each line contains a valid JSON object.

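The following example shows one record of a conversational dataset in the ChatML-style `messages` format. It is pretty-printed here for readability; in the JSONL file, each record must occupy a single line: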

```json theme={null}
{
  "messages":
  [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris."
    }
  ]
}
```
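As a minimal sketch of how such a file could be produced, the script below validates each record against the layout shown above and writes one compact JSON object per line. The helper names (`validate_record`, `write_jsonl`) and the set of allowed roles are assumptions made for illustration, taken from the example record rather than from a published Tensorfuse schema.

```python
import json
from pathlib import Path

# Assumption: only the roles that appear in the example record are allowed.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict) -> None:
    """Raise ValueError if a record does not follow the messages layout."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("record needs a non-empty 'messages' list")
    for message in messages:
        if not isinstance(message, dict):
            raise ValueError("each message must be a JSON object")
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("each message needs a string 'content'")


def write_jsonl(records: list[dict], path: str) -> int:
    """Validate records and write one JSON object per line; return the count."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            validate_record(record)
            # json.dumps emits no newlines by default, so each record stays on one line.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


records = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "The capital of France is Paris."},
        ]
    }
]

write_jsonl(records, "data.jsonl")
```

The resulting `data.jsonl` can then be passed to the dataset commands in the next section.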

## Dataset Commands

```bash theme={null}
# Create dataset
tensorkube datasets create --dataset-id my_dataset --path data.jsonl

# List datasets
tensorkube datasets list

# Delete dataset
tensorkube datasets delete --dataset-id my_dataset
```

Once you have created your dataset, you can start fine-tuning your model. Before that, you need to create an access token on Hugging Face.

## Authenticating Huggingface and W\&B

Create the required secrets. Tensorkube uses Kubernetes Event-driven Autoscaling (KEDA) under the hood to scale and schedule training runs, so you need to create your
secrets in the `keda` environment:

<Steps>
  <Step title="Access to Llama 3.1">
    Llama 3.1 requires a license agreement. Visit the [Llama 3.1 huggingface repo](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) to ensure that
    you have signed the agreement and have access to the model.
  </Step>

  <Step title="Set huggingface token">
    Get a `WRITE` token from your [huggingface profile](https://huggingface.co/settings/tokens) and store it as a secret in Tensorfuse using the command below.

    ```bash theme={null}
    tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_EkXXrzzZsuoZubXhDQ --env keda
    ```

    Ensure that the key for your secret is `HUGGING_FACE_HUB_TOKEN`, as Tensorfuse expects this exact name.

    <Note>
      If you don't wish to upload your models to your huggingface account, you can use a `READ` token instead.
    </Note>
  </Step>

  <Step title="Set your W&B authentication token">
    Weights and Biases (W\&B) is used for logging and monitoring training runs. You need to create a W\&B account and get an API key. Store it as a secret in Tensorfuse using the command below.

    ```bash theme={null}
    tensorkube secret create wb-secret WANDB_API_KEY=7xxxxxxxxxxx4 --env keda
    ```
  </Step>
</Steps>

## Programmatic Access

Tensorfuse allows you to interact with the TensorKube cluster using the Python SDK, which provides a straightforward interface for creating fine-tuning jobs.

### Authentication

First, you need to create access keys, which are required to authenticate with the TensorKube cluster deployed in your cloud.

<Note>
  You can skip this step if you are launching training runs from your local machine, as your default user already has sufficient permissions.
</Note>

Run the following command:

```bash theme={null}
tensorkube train create-user --name <user-name>
```

This will create a new user and provide you with access keys.

Next, export the AWS keys as environment variables where you will be running the Python code:

```bash theme={null}
export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<AWS_SECRET>
```
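Before calling the SDK from a remote environment, it can help to confirm that both variables are visible to the Python process. The sketch below is only a convenience check; the helper name `missing_aws_credentials` is hypothetical and not part of the tensorkube SDK. It reports missing variables instead of raising, since a local setup may not need them.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_credentials()
if missing:
    print("Not set in this process:", ", ".join(missing))
else:
    print("AWS access keys found in the environment.")
```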

The following code demonstrates how to create a fine-tuning job using the Python SDK. The `create_fine_tuning_job` call below fine-tunes the Llama 3.1 70B Instruct model on 4 L40S GPUs.

```python theme={null}
from tensorkube import create_fine_tuning_job

create_fine_tuning_job( # creates a fine-tuning job
    job_name="fine-tuning-job", # Job Name. Required
    job_id="unique_id", # Unique Job ID. Required
    gpus=4, # Number of GPUs. Required
    gpu_type="l40s", # GPU Type. Required
    max_scale=1, # Maximum Scale. Required
    base_model='meta-llama/Llama-3.1-70B-Instruct', # Base Model from hugging face. Required
    dataset='dataset-id', # Dataset ID. Required
    epochs=10, # Number of epochs. Required
    secrets=["hugging-face-secret", "wb-secret"], # List of secrets
    micro_batch_size=8, # Micro Batch Size. Optional, default is 8
    lora_r=4, # LoRA rank. Optional, default is 4
    learning_rate=0.00002, # Learning Rate. Optional, default is 0.00002
    val_set_size=0.1, # Validation Set Size. Optional, default is 0.1
    wandb_entity="ORG_NAME_HERE", # W&B Organisation / account name. Default is None
    hf_org_id="ORG_ID_HERE" # Hugging Face Organisation ID. Default is None
)
```

The `create_fine_tuning_job` function also accepts additional keyword arguments (`**kwargs`) that align with the [Axolotl config schema](https://axolotl-ai-cloud.github.io/axolotl/docs/config.html). This means you can pass any Axolotl-supported training parameter directly into the function to customize fine-tuning. For example, to set `gradient_accumulation_steps`, `save_strategy`, or `lr_scheduler_type`, include them as additional arguments:

```python theme={null}
from tensorkube import create_fine_tuning_job
create_fine_tuning_job(
    job_name="fine-tuning-job",
    job_id="unique_id",
    gpus=4,
    gpu_type="l40s",
    max_scale=1,
    base_model='meta-llama/Llama-3.1-70B-Instruct',
    dataset='dataset-id', # Dataset ID. Required
    epochs=10, # Number of epochs. Required
    secrets=["hugging-face-secret", "wb-secret"], # List of secrets
    micro_batch_size=8, # Micro Batch Size. Optional, default is 8
    lora_r=4, # LoRA rank. Optional, default is 4
    learning_rate=0.00002, # Learning Rate. Optional, default is 0.00002
    val_set_size=0.1, # Validation Set Size. Optional, default is 0.1
    wandb_entity="ORG_NAME_HERE", # W&B Organisation / account name. Default is None
    hf_org_id="ORG_ID_HERE", # Hugging Face Organisation ID. Default is None
    gradient_accumulation_steps=2,        # From Axolotl config
    peft_use_rslora=True,                 # From Axolotl config
    save_strategy="epoch",                # Save model after each epoch
    lr_scheduler_type="linear"  # Learning rate scheduler
)
```

<Note>
  The LoRA adapter weights stored in the S3 bucket are in `float32` format. To store the adapter weights in `bfloat16` format instead, set the `store_weights_as_bf16` flag to `True`.
</Note>

To check the status of a job, use the `get_job_status` function. It returns the status of the job as `QUEUED`, `PROCESSING`, `COMPLETED`, or `FAILED`.

```python theme={null}
from tensorkube import get_job_status
status = get_job_status( # gets the status of the job
  job_name="fine-tuning-job", # Job Name. Required
  job_id="unique_id" # Unique Job ID. Required
)
```
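Since a job moves through `QUEUED` and `PROCESSING` before it finishes, a script usually needs to poll until it sees `COMPLETED` or `FAILED`. The following is a rough sketch of such a loop. The helper `wait_for_job`, the polling interval, and the timeout are all hypothetical choices, and the sketch assumes the status comes back as one of the four strings listed above. It accepts any zero-argument callable so it can be tried without cluster access.

```python
import time
from typing import Callable

# The two terminal states named in this guide.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 6 * 60 * 60,
) -> str:
    """Call fetch_status until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        # Assumption: the status is (or converts cleanly to) a plain string.
        status = str(fetch_status()).upper()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Usage with the SDK (requires tensorkube and access to the cluster):
#
# from tensorkube import get_job_status
# final_status = wait_for_job(
#     lambda: get_job_status(job_name="fine-tuning-job", job_id="unique_id")
# )
```

Passing a callable keeps the loop independent of the SDK, which makes it easy to dry-run with a stub before pointing it at a real job.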

Once the job is completed, the adapter is uploaded to S3.
To locate your adapters in the S3 console:

* Find the S3 bucket whose name starts with `tensorkube-keda-train-bucket`. All your trained LoRA adapters reside here. The adapter path is built from your job name and job ID, so your adapter URLs look like this:
  `s3://<bucket-name>/lora-adapter/<job_name>/<job_id>`

Below is an example adapter URL for a job with job name `fine-tuning-job` and job ID `unique_id`, trained on 4 GPUs of type `l40s`:

```
s3://tensorkube-keda-train-bucket-d473253e-d692-4a15/lora-adapter/fine-tuning-job/unique_id
```
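To fetch the adapter from a script rather than the console, the path layout above can be reproduced in a few lines. This is a sketch, not part of the tensorkube SDK: the helper names are hypothetical, the file names inside the adapter folder are not assumed, and `download_adapter` relies on `boto3` being installed with credentials that can read the bucket.

```python
from pathlib import Path


def adapter_s3_uri(bucket: str, job_name: str, job_id: str) -> str:
    """Build the adapter location using the layout described in this guide."""
    return f"s3://{bucket}/lora-adapter/{job_name}/{job_id}"


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix


def download_adapter(uri: str, dest: str) -> list[str]:
    """Download every object under the adapter prefix into dest."""
    import boto3  # imported lazily so the helpers above work without it

    bucket, prefix = split_s3_uri(uri)
    prefix = prefix.rstrip("/") + "/"  # avoid matching sibling prefixes
    s3 = boto3.client("s3")
    downloaded = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            relative = obj["Key"][len(prefix):]
            if not relative or relative.endswith("/"):
                continue  # skip folder placeholder objects
            target = Path(dest) / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj["Key"], str(target))
            downloaded.append(str(target))
    return downloaded


uri = adapter_s3_uri(
    "tensorkube-keda-train-bucket-d473253e-d692-4a15", "fine-tuning-job", "unique_id"
)
print(uri)
# download_adapter(uri, "./adapter")  # uncomment once AWS credentials are configured
```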

### Uploading the Adapter to Huggingface

To automatically upload the adapter to Huggingface, make sure that:

* You use a `WRITE` token as the `HUGGING_FACE_HUB_TOKEN` secret.
* You are passing in the `hf_org_id` parameter in the `create_fine_tuning_job` function.

Tensorfuse automatically uploads the adapter to Huggingface once the training is completed *in addition to* uploading it to S3.
Tensorfuse creates an adapter repo that follows the `{HF_ORG_ID}/{job_name}_{job_id}` format. For the example above, the adapter
is uploaded to `ORG_ID_HERE/fine-tuning-job_unique_id`. The repo is `private` by default.

## Model Deployment

1. Clone the Tensorfuse `vllm` repository:

```bash theme={null}
git clone https://github.com/tensorfuse/vllm
cd vllm/llama_70b_lora
```

2. Use the following command to deploy:

<Note>
  The deploy command below deploys a vLLM instance in the default environment. Make sure you have created the `hugging-face-secret` in the default environment. You can do so by adding the `--env default` flag to the secret creation command.
</Note>

```bash theme={null}
tensorkube deploy --gpus 4 --gpu-type L40S --secret hugging-face-secret
```

This will deploy the base model with the vLLM server.

3. Get your deployment url using `tensorkube deployment list`.

## Inference

You can now use the deployment URL to make inference requests. Here is an example using `curl`. This will query the base model without any adapters.

```bash theme={null}
curl ${ENDPOINT}/generate -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "[INST] Your prompt here [/INST]",
    "parameters": {
      "max_new_tokens": 64
    }
  }'
```

To query the adapter, send a request like the following:

```bash theme={null}
curl --request POST -v \
  --url <endpoint>/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "<ADAPTER_HF_ID>",
  "messages": [
    {
      "role": "user",
      "content": "hello"
    }
  ]
}'
```
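The same adapter request can be sent from Python. The sketch below uses only the standard library and mirrors the `curl` call above. It assumes that `<ADAPTER_HF_ID>` is the Hugging Face repo ID produced by the `{HF_ORG_ID}/{job_name}_{job_id}` naming described in the upload section; the helper names are hypothetical, and the shape of the response is not documented here, so the parsed JSON is returned as-is.

```python
import json
import urllib.request


def adapter_repo_id(hf_org_id: str, job_name: str, job_id: str) -> str:
    """Assumed adapter ID, following the repo naming from the upload section."""
    return f"{hf_org_id}/{job_name}_{job_id}"


def build_chat_request(endpoint: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for the /v1/chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(endpoint: str, model: str, prompt: str, timeout: float = 120.0) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_chat_request(endpoint, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


model_id = adapter_repo_id("ORG_ID_HERE", "fine-tuning-job", "unique_id")
print(model_id)
# response = chat("<endpoint>", model_id, "hello")  # replace <endpoint> with your deployment URL
```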