# API Overview
Towhee Trainer is designed for training or fine-tuning models and [towhee operators](https://towhee.io/tasks/operator). We use a dummy `FakeData` dataset to take a quick look at the Trainer API and see how to use it.

## Train a PyTorch model


```python
import torch
import torchvision.models as models
from towhee.trainer.trainer import Trainer
from towhee.trainer.training_config import TrainingConfig
from torchvision import datasets, transforms
import warnings
warnings.filterwarnings("ignore")

# initialize model
model = models.resnet18()

# define dataset and config
fake_transform = transforms.Compose([transforms.ToTensor()])
train_data = datasets.FakeData(size=2, transform=fake_transform)
eval_data = datasets.FakeData(size=1, transform=fake_transform)

training_config = TrainingConfig(
    output_dir="./train_dummy_torch",
    epoch_num=2,
    batch_size=1,
    print_steps=1,
    tensorboard={
        'log_dir': 'mylogdir'
    }
)

# initialize Trainer
trainer = Trainer(model, training_config, train_dataset=train_data, eval_dataset=eval_data)

# start training
trainer.train()
```

    2023-02-17 11:08:13,383 - 140662514063168 - trainer.py-trainer:274 - WARNING: TrainingConfig(output_dir='./train_dummy_torch', overwrite_output_dir=True, eval_strategy='epoch', eval_steps=None, batch_size=1, val_batch_size=-1, seed=42, epoch_num=2, dataloader_pin_memory=True, dataloader_drop_last=True, dataloader_num_workers=0, lr=5e-05, metric='Accuracy', print_steps=1, load_best_model_at_end=False, early_stopping={'monitor': 'eval_epoch_metric', 'patience': 4, 'mode': 'max'}, model_checkpoint={'every_n_epoch': 1}, tensorboard={'log_dir': 'mylogdir'}, loss='CrossEntropyLoss', optimizer='Adam', lr_scheduler_type='linear', warmup_ratio=0.0, warmup_steps=0, device_str=None, freeze_bn=False)


    epoch=1/2, global_step=1, epoch_loss=6.658931255340576, epoch_metric=0.0
    epoch=1/2, global_step=2, epoch_loss=6.841584205627441, epoch_metric=0.0
    epoch=1/2, eval_global_step=0, eval_epoch_loss=6.466732025146484, eval_epoch_metric=0.0
    epoch=2/2, global_step=3, epoch_loss=5.717026710510254, epoch_metric=0.0
    epoch=2/2, global_step=4, epoch_loss=6.077897071838379, epoch_metric=0.0
    epoch=2/2, eval_global_step=1, eval_epoch_loss=5.8604044914245605, eval_epoch_metric=0.0


## Train a towhee operator


```python
import towhee
warnings.filterwarnings("ignore") 

# initialize towhee operator
op = towhee.ops.image_embedding.timm(model_name='resnet18', num_classes=10).get_op()

# define dataset and config
fake_transform = transforms.Compose([transforms.ToTensor()])
train_data = datasets.FakeData(size=16, transform=fake_transform)
eval_data = datasets.FakeData(size=8, transform=fake_transform)

training_config = TrainingConfig(
    batch_size=8,
    epoch_num=2,
    output_dir='./train_dummy_operator',
    print_steps=1
)

# start training
op.train(
    training_config,
    train_dataset=train_data,
    eval_dataset=eval_data
)
```

    2023-02-17 11:08:15,096 - 140662514063168 - trainer.py-trainer:274 - WARNING: TrainingConfig(output_dir='./train_dummy_operator', overwrite_output_dir=True, eval_strategy='epoch', eval_steps=None, batch_size=8, val_batch_size=-1, seed=42, epoch_num=2, dataloader_pin_memory=True, dataloader_drop_last=True, dataloader_num_workers=0, lr=5e-05, metric='Accuracy', print_steps=1, load_best_model_at_end=False, early_stopping={'monitor': 'eval_epoch_metric', 'patience': 4, 'mode': 'max'}, model_checkpoint={'every_n_epoch': 1}, tensorboard={'log_dir': None, 'comment': ''}, loss='CrossEntropyLoss', optimizer='Adam', lr_scheduler_type='linear', warmup_ratio=0.0, warmup_steps=0, device_str=None, freeze_bn=False)


    epoch=1/2, global_step=1, epoch_loss=2.447136402130127, epoch_metric=0.125
    epoch=1/2, global_step=2, epoch_loss=2.6387686729431152, epoch_metric=0.0625
    epoch=1/2, eval_global_step=0, eval_epoch_loss=2.1911351680755615, eval_epoch_metric=0.25
    epoch=2/2, global_step=3, epoch_loss=1.5445729494094849, epoch_metric=0.0
    epoch=2/2, global_step=4, epoch_loss=1.4133632183074951, epoch_metric=0.125
    epoch=2/2, eval_global_step=1, eval_epoch_loss=1.175207495689392, eval_epoch_metric=0.25


Trainer does not appear explicitly in this script because it is used inside the `train()` method of the timm operator class. We only need to pass the arguments to `train()`, which is very convenient.

Towhee provides a variety of operators. Inside an operator's `train()` method, training can be done with the Trainer provided by Towhee, with another framework, or with a native PyTorch training script. Either way, the user can call the operator's `train()` interface directly without writing a training script by hand.

![](class_operator.png)

# Towhee Trainer framework
Before training, you pass the training config, the datasets, and various behavior settings to Towhee Trainer. During training, you can monitor various indicators. After training, you can save the trained model and analyze it with visualization tools.

![](towhee_trainer_framework.png)

# Training Config

The training config is used to configure the settings of the training process, including:
- Common training hyperparameters such as batch size, epoch, output directory, etc.
- Training device.
- Log mode.
- Some parameters in the learning and optimization process, such as learning rate, optimizer, etc.
- Various callbacks
- Metrics

In most cases, you only need to write
```python
training_configs = TrainingConfig(
    xxx='some_value_xxx',
    yyy='some_value_yyy'
)
```
to instantiate a config, which you then pass to `Trainer()`.  
You can set up training configs directly in Python scripts or with a YAML file.
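
As a rough illustration of how the categories above map to concrete settings, the sketch below groups a few fields by category. The field names and values are copied from the `TrainingConfig(...)` line in the training log earlier in this guide; the grouping and the `settings` dict are only illustrative. The constructor call is left commented out so that the snippet runs without Towhee installed.

```python
# Field names and values come from the TrainingConfig repr printed in the
# training log earlier in this guide; the grouping is only illustrative.
hyperparameters = {"output_dir": "./train_dummy_torch", "epoch_num": 2, "batch_size": 1}
device = {"device_str": None}
log_settings = {"print_steps": 1, "tensorboard": {"log_dir": "mylogdir"}}
optimization = {"lr": 5e-05, "optimizer": "Adam", "loss": "CrossEntropyLoss"}
callbacks = {
    "early_stopping": {"monitor": "eval_epoch_metric", "patience": 4, "mode": "max"},
    "model_checkpoint": {"every_n_epoch": 1},
}
metrics = {"metric": "Accuracy"}

# Merge the groups into one flat dict of keyword arguments.
settings = {
    **hyperparameters,
    **device,
    **log_settings,
    **optimization,
    **callbacks,
    **metrics,
}
print(sorted(settings))

# With Towhee installed, the merged dict can be unpacked into the config:
# from towhee.trainer.training_config import TrainingConfig
# training_config = TrainingConfig(**settings)
```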

For specific parameters, please refer to the [training config guide](https://github.com/towhee-io/examples/blob/main/fine_tune/4_training_configs.ipynb).

# Select training device

`TrainingConfig` has a `device_str` parameter that specifies the training device. The options are:
- `None` -> Use `cuda:0` if CUDA is available on the machine, otherwise the CPU.
- `"cpu"` -> Use the CPU only.
- `"cuda:2"` -> Use GPU No. 2; other indices work the same way.
- `"cuda"` -> Use all available GPUs with data parallelism. To run on a specific subset of GPUs, set the environment variable `CUDA_VISIBLE_DEVICES` to the indices of those GPUs (for example `0,2`) before running your training script.
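
The selection rules listed above can be sketched as a small function. This is only a restatement of the documented behavior, not Towhee's actual implementation: `resolve_devices` is a hypothetical helper, and its `gpu_count` argument stands in for the number of visible CUDA devices so that the sketch runs without PyTorch.

```python
from typing import List, Optional


def resolve_devices(device_str: Optional[str], gpu_count: int) -> List[str]:
    """Return the devices that a given `device_str` would select.

    Hypothetical helper that mirrors the documented rules only.
    `gpu_count` is the number of visible CUDA devices.
    """
    if device_str is None:
        # Default: first GPU if there is one, otherwise the CPU.
        return ["cuda:0"] if gpu_count > 0 else ["cpu"]
    if device_str == "cpu":
        return ["cpu"]
    if device_str == "cuda":
        # All visible GPUs (data parallel).
        return [f"cuda:{i}" for i in range(gpu_count)]
    # A specific device such as "cuda:2".
    return [device_str]


print(resolve_devices(None, gpu_count=0))
print(resolve_devices(None, gpu_count=4))
print(resolve_devices("cuda:2", gpu_count=4))
print(resolve_devices("cuda", gpu_count=2))
```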

# Save and Load
To resume training with a trainer instance, pass the `resume_checkpoint_path` parameter to `train()`.
To save the model, use the `save()` method; to load it, use the `load()` method. Normally, when `train()` finishes, the model parameters and training states are automatically saved to the `output_dir` configured in the training config.


```python
trainer.train(resume_checkpoint_path="./train_dummy_torch/epoch_1")
trainer.save(path="./another_save_path")
trainer.load(path="./another_save_path")
print(trainer.epoch)
```

    2023-02-17 11:08:16,415 - 140662514063168 - trainer.py-trainer:274 - WARNING: TrainingConfig(output_dir='./train_dummy_torch', overwrite_output_dir=True, eval_strategy='epoch', eval_steps=None, batch_size=1, val_batch_size=-1, seed=42, epoch_num=2, dataloader_pin_memory=True, dataloader_drop_last=True, dataloader_num_workers=0, lr=5e-05, metric='Accuracy', print_steps=1, load_best_model_at_end=False, early_stopping={'monitor': 'eval_epoch_metric', 'patience': 4, 'mode': 'max'}, model_checkpoint={'every_n_epoch': 1}, tensorboard={'log_dir': 'mylogdir'}, loss='CrossEntropyLoss', optimizer='Adam', lr_scheduler_type='linear', warmup_ratio=0.0, warmup_steps=0, device_str=None, freeze_bn=False)


    epoch=2/2, global_step=1, epoch_loss=5.717026710510254, epoch_metric=0.0
    epoch=2/2, global_step=2, epoch_loss=6.157863140106201, epoch_metric=0.0
    epoch=2/2, eval_global_step=0, eval_epoch_loss=5.87011194229126, eval_epoch_metric=0.0
    2
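
The resume path in the example above has the form `<output_dir>/epoch_<n>`. Assuming checkpoints follow that naming (an inference from this one path, not a documented guarantee), a small helper can pick the most recent one to pass as `resume_checkpoint_path`. `latest_epoch_checkpoint` is a hypothetical name and is not part of Towhee.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_epoch_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the `epoch_<n>` entry with the largest n, or None if there is none.

    Assumes the `epoch_<n>` naming seen in the example path; compares n
    numerically so that epoch_10 ranks above epoch_2.
    """
    best = None
    for path in Path(output_dir).glob("epoch_*"):
        suffix = path.name[len("epoch_"):]
        if suffix.isdigit() and (best is None or int(suffix) > best[0]):
            best = (int(suffix), path)
    return best[1] if best else None


# Demo on a temporary directory with fake checkpoint folders.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 10):
        (Path(tmp) / f"epoch_{n}").mkdir()
    latest_name = latest_epoch_checkpoint(tmp).name

print(latest_name)

# With a real trainer, the result could then be used like this:
# trainer.train(resume_checkpoint_path=str(latest_epoch_checkpoint("./train_dummy_torch")))
```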


If you need to freeze some layers of the model before resuming training, you can use `LayerFreezer`, via either its `by_idx()` or `by_names()` method.


```python
from towhee.trainer.utils.layer_freezer import LayerFreezer
from towhee.models import vit
my_model = vit.create_model()
my_freezer = LayerFreezer(my_model)
my_freezer.show_frozen_layers()
```


    []


```python
my_freezer.by_names(['head'])
my_freezer.show_frozen_layers()
```


    ['head']


# Monitor
## Step print or Progressbar
If `print_steps` is not `None`, the loss and metric are printed to the screen every `print_steps` steps. Otherwise, a progress bar is displayed for each epoch instead of printed lines.
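
The rule can be sketched in a few lines of plain Python. This assumes that a line is printed whenever the global step is a multiple of `print_steps`, which matches the logs in this guide for `print_steps=1` but is otherwise an assumption; `printed_steps` is a hypothetical helper, not a Towhee function.

```python
from typing import List, Optional


def printed_steps(total_steps: int, print_steps: Optional[int]) -> List[int]:
    """Global steps at which a log line would be printed.

    Returns an empty list when `print_steps` is None, which corresponds to
    progress-bar mode.
    """
    if print_steps is None:
        return []
    return [step for step in range(1, total_steps + 1) if step % print_steps == 0]


print(printed_steps(4, 1))     # [1, 2, 3, 4]
print(printed_steps(4, 2))     # [2, 4]
print(printed_steps(4, None))  # []
```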


## Using tensorboard
TensorBoard is a commonly used tool for monitoring indicators during training. Make sure `tensorboard` is installed in your environment first; otherwise this feature cannot be used.  
You can specify the TensorBoard `log_dir` in the config so that your training is recorded in that directory.


```python
training_config.tensorboard={'log_dir': 'your_log_dir'}
```

Run the following command, then open http://localhost:6006/ in your browser to view the TensorBoard page.
```
tensorboard --logdir your_log_dir
```

![](tensorboard.png)

If you want to disable TensorBoard monitoring, set `training_config.tensorboard=None`.

# Using callbacks
You can set custom callbacks to control various aspects of the training process.
To do so, inherit from the `Callback` class and override the methods you need. The methods that can be overridden are: `on_batch_begin()`, `on_batch_end()`, `on_epoch_begin()`, `on_epoch_end()`, `on_train_begin()`, `on_train_end()`, `on_train_batch_begin()`, `on_train_batch_end()`, `on_eval_batch_begin()`, `on_eval_batch_end()`, `on_eval_begin()`, `on_eval_end()`.


```python
from towhee.trainer.callback import Callback

class CustomCallback(Callback):
    def on_batch_begin(self, batch, logs) -> None:
        print('on_batch_begin')
        
trainer.add_callback(CustomCallback())
trainer.train(resume_checkpoint_path="./train_dummy_torch/epoch_1")
```

    2023-02-17 11:08:18,367 - 140662514063168 - trainer.py-trainer:274 - WARNING: TrainingConfig(output_dir='./train_dummy_torch', overwrite_output_dir=True, eval_strategy='epoch', eval_steps=None, batch_size=1, val_batch_size=-1, seed=42, epoch_num=2, dataloader_pin_memory=True, dataloader_drop_last=True, dataloader_num_workers=0, lr=5e-05, metric='Accuracy', print_steps=1, load_best_model_at_end=False, early_stopping={'monitor': 'eval_epoch_metric', 'patience': 4, 'mode': 'max'}, model_checkpoint={'every_n_epoch': 1}, tensorboard={'log_dir': 'mylogdir'}, loss='CrossEntropyLoss', optimizer='Adam', lr_scheduler_type='linear', warmup_ratio=0.0, warmup_steps=0, device_str=None, freeze_bn=False)


    on_batch_begin
    epoch=2/2, global_step=1, epoch_loss=5.717026710510254, epoch_metric=0.0
    on_batch_begin
    epoch=2/2, global_step=2, epoch_loss=6.15791130065918, epoch_metric=0.0
    on_batch_begin
    epoch=2/2, eval_global_step=0, eval_epoch_loss=5.870237827301025, eval_epoch_metric=0.0


There are also some built-in callbacks, such as `EarlyStoppingCallback` and `ModelCheckpointCallback`. Some of their parameters can be set in the training config to achieve the corresponding control.
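
The default config printed in the logs contains `early_stopping={'monitor': 'eval_epoch_metric', 'patience': 4, 'mode': 'max'}`. The sketch below shows what such settings conventionally mean: stop once the monitored value has failed to improve for `patience` consecutive epochs. It is a framework-free illustration under that assumption; `EarlyStopper` is a hypothetical class, the metric history is made up, and Towhee's `EarlyStoppingCallback` may differ in detail.

```python
class EarlyStopper:
    """Patience-based early stopping on a monitored value (illustrative only)."""

    def __init__(self, patience: int = 4, mode: str = "max"):
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.patience = patience
        self.mode = mode
        self.best = None
        self.bad_epochs = 0

    def should_stop(self, value: float) -> bool:
        """Record one epoch's monitored value; return True once patience runs out."""
        improved = (
            self.best is None
            or (self.mode == "max" and value > self.best)
            or (self.mode == "min" and value < self.best)
        )
        if improved:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up evaluation metrics, one per epoch.
history = [0.25, 0.30, 0.30, 0.28, 0.35]
stopper = EarlyStopper(patience=2, mode="max")
stopped_at = None
for epoch, metric in enumerate(history, start=1):
    if stopper.should_stop(metric):
        stopped_at = epoch
        break

print(stopped_at)
```

With `patience=2`, the loop stops at the second consecutive epoch without improvement, so the last value in `history` is never seen.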

# Set metric
Evaluation can also be performed during training, which requires a specific metric. We use [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/) 0.7.0 to compute metrics. You can set the metric in the training config; the available metrics can be listed with `TMMetrics.get_tm_avaliable_metrics()`.


```python
training_config.metric = 'Accuracy'
```


```python
from towhee.trainer.metrics import TMMetrics
TMMetrics.get_tm_avaliable_metrics()
```


    ['CatMetric',
     'MaxMetric',
     'MeanMetric',
     'MinMetric',
     'SumMetric',
     'PIT',
     'SDR',
     'SI_SDR',
     'SI_SNR',
     'SNR',
     'PermutationInvariantTraining',
     'ScaleInvariantSignalDistortionRatio',
     'ScaleInvariantSignalNoiseRatio',
     'SignalDistortionRatio',
     'SignalNoiseRatio',
     'AUC',
     'AUROC',
     'F1',
     'ROC',
     'Accuracy',
     'AveragePrecision',
     'BinnedAveragePrecision',
     'BinnedPrecisionRecallCurve',
     'BinnedRecallAtFixedPrecision',
     'CalibrationError',
     'CohenKappa',
     'ConfusionMatrix',
     'F1Score',
     'FBeta',
     'FBetaScore',
     'HammingDistance',
     'Hinge',
     'HingeLoss',
     'IoU',
     'JaccardIndex',
     'KLDivergence',
     'MatthewsCorrcoef',
     'MatthewsCorrCoef',
     'Precision',
     'PrecisionRecallCurve',
     'Recall',
     'Specificity',
     'PSNR',
     'SSIM',
     'MultiScaleStructuralSimilarityIndexMeasure',
     'PeakSignalNoiseRatio',
     'StructuralSimilarityIndexMeasure',
     'CosineSimilarity',
     'ExplainedVariance',
     'MeanAbsoluteError',
     'MeanAbsolutePercentageError',
     'MeanSquaredError',
     'MeanSquaredLogError',
     'PearsonCorrcoef',
     'PearsonCorrCoef',
     'R2Score',
     'SpearmanCorrcoef',
     'SpearmanCorrCoef',
     'SymmetricMeanAbsolutePercentageError',
     'TweedieDevianceScore',
     'RetrievalFallOut',
     'RetrievalHitRate',
     'RetrievalMAP',
     'RetrievalMRR',
     'RetrievalNormalizedDCG',
     'RetrievalPrecision',
     'RetrievalRecall',
     'RetrievalRPrecision',
     'WER',
     'BLEUScore',
     'CharErrorRate',
     'CHRFScore',
     'ExtendedEditDistance',
     'MatchErrorRate',
     'SacreBLEUScore',
     'SQuAD',
     'TranslationEditRate',
     'WordErrorRate',
     'WordInfoLost',
     'WordInfoPreserved',
     'MinMaxMetric',
     'MeanAveragePrecision']
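
The `epoch_metric` values in the operator log earlier in this guide (0.125 after the first batch of 8, then 0.0625 after the second) are consistent with a metric that is accumulated over the epoch rather than computed per batch. The sketch below shows that accumulate-then-compute pattern in plain Python. `RunningAccuracy` is a hypothetical stand-in, not part of Towhee or torchmetrics, and the predictions are made up.

```python
class RunningAccuracy:
    """Accuracy accumulated over all batches seen so far (illustrative only)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, targets):
        """Add one batch of predicted and true labels to the running counts."""
        for predicted, target in zip(predictions, targets):
            self.correct += int(predicted == target)
            self.total += 1

    def compute(self) -> float:
        """Return the accuracy over everything accumulated so far."""
        return self.correct / self.total if self.total else 0.0


metric = RunningAccuracy()
targets = [0] * 8

# Batch 1: one of eight predictions is correct.
metric.update([0, 1, 1, 1, 1, 1, 1, 1], targets)
after_first = metric.compute()

# Batch 2: none of eight predictions is correct.
metric.update([1] * 8, targets)
after_second = metric.compute()

print(after_first, after_second)  # 0.125 0.0625
```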


# Custom optimizer and loss
In most cases, you can select an optimizer by specifying the name of the optimizer class in the config. If you implement a custom optimizer, use the `set_optimizer()` method to set it. A custom loss is set the same way with `set_loss()`.


```python
from torch import optim

class MyOptimizer(optim.Optimizer):
    def step(self, closure=None):
        print('my step...')
        
my_optimizer = MyOptimizer(model.parameters(), defaults={})
trainer.set_optimizer(my_optimizer)
type(trainer.optimizer)
```


    __main__.MyOptimizer


```python
class MyTripletLossFunc(torch.nn.Module):
    def forward(self):
        print('forward...')
        return 0

my_loss = MyTripletLossFunc()
trainer.set_loss(my_loss)
type(trainer.loss)
```


    __main__.MyTripletLossFunc


# Custom training step
In some special cases, you may want to customize the training step or use a custom loss calculation. You can do so by subclassing the `Trainer` class and overriding the `train_step()` or `compute_loss()` method.

When overriding, make sure the input and output formats of the method follow those of the parent class.

`train_step()` receives `model` and `inputs` and returns the `step_logs` dict. The items in `step_logs` are the monitored values printed on the screen or displayed in the progress bar.


```python
class SubTrainer1(Trainer):
    def train_step(self, model, inputs) -> dict:
        # self.optimizer.zero_grad()
        # y = model(inputs)
        # loss = compute_custom_loss(y)
        # loss.backward()
        # self.optimizer.step()
        # self.lr_scheduler.step()
        step_logs = {"step_loss": ..., "epoch_loss": ..., "epoch_metric": ...}
        return step_logs
    
# train1 = SubTrainer1(...)
# train1.train()
```

`compute_loss()` receives `model` and `inputs` and returns the loss, which is a PyTorch tensor with a `grad_fn`.


```python
class SubTrainer2(Trainer):
    def compute_loss(self, model, inputs):
        input1, input2, input3 = inputs
        outputs = model(input1, input2)
        loss = ... # loss_function(outputs, input3)
        return loss

# train2 = SubTrainer2(...)
# train2.train()
```

This kind of overriding after inheritance works for many methods of Trainer, such as `train_step()`, `compute_loss()`, `evaluate_step()`, `compute_metric()`, `train()` or `evaluate()`. [Here](https://towhee.io/audio-embedding/nnfp/src/branch/main/train_nnfp.py) is an example which overrides the `train_step()` method.

# More examples
For more practical examples, you can refer to [fine-tune examples](https://github.com/towhee-io/examples/tree/main/fine_tune). They include the actual use of the trainer, and the introduction of fine-tuning of some important operators.