Advanced Training¶
This page covers advanced training topics including resuming training, early stopping, multi-GPU training, and logging with external services.
Resume Training¶
You can resume training from a previously saved checkpoint by passing the path to the checkpoint.pth file using the resume argument. This is useful when training is interrupted or you want to continue fine-tuning an already partially trained model.
The training loop will automatically load:

- Model weights
- Optimizer state
- Learning rate scheduler state
- Training epoch number
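For example, resuming a run might look like this (a minimal sketch, assuming the `RFDETRBase` model class; substitute your model variant):

```python
from rfdetr import RFDETRBase

model = RFDETRBase()
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    resume="checkpoint.pth",  # restores weights, optimizer, scheduler, and epoch counter
)
```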
Resume vs Pretrain Weights

- Use `resume="checkpoint.pth"` to continue training with optimizer state
- Use `pretrain_weights="checkpoint_best_total.pth"` when initializing a model to start fresh training from those weights
Early Stopping¶
Early stopping monitors validation mAP and halts training if improvements remain below a threshold for a set number of epochs. This prevents wasted computation once the model has converged.
Basic Usage¶
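A minimal sketch, assuming the `RFDETRBase` model class and the `early_stopping` flag described below:

```python
from rfdetr import RFDETRBase

model = RFDETRBase()
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    early_stopping=True,  # stop once validation mAP stops improving
)
```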
Configuration Options¶
| Parameter | Default | Description |
|---|---|---|
| `early_stopping_patience` | 10 | Number of epochs without improvement before stopping |
| `early_stopping_min_delta` | 0.001 | Minimum mAP change to count as improvement |
| `early_stopping_use_ema` | False | Use the EMA model's mAP for comparisons |
Advanced Example¶
```python
model.train(
    dataset_dir="path/to/dataset",
    epochs=200,
    early_stopping=True,
    early_stopping_patience=15,      # Wait 15 epochs before stopping
    early_stopping_min_delta=0.005,  # Require 0.5% mAP improvement
    early_stopping_use_ema=True      # Track EMA model performance
)
```
How It Works¶
- After each epoch, validation mAP is computed
- If mAP improves by at least `min_delta`, the patience counter resets
- If mAP doesn't improve, the patience counter increments
- When the patience counter reaches `patience`, training stops
- The best checkpoint is already saved as `checkpoint_best_total.pth`
For example, with the default patience of 10:

```
Epoch 10: mAP = 0.450 (best: 0.450) - counter: 0
Epoch 11: mAP = 0.455 (best: 0.455) - counter: 0 (improved)
Epoch 12: mAP = 0.454 (best: 0.455) - counter: 1 (no improvement)
Epoch 13: mAP = 0.453 (best: 0.455) - counter: 2
...
Epoch 22: mAP = 0.452 (best: 0.455) - counter: 10 → STOP
```
Multi-GPU Training¶
You can fine-tune RF-DETR on multiple GPUs using PyTorch's Distributed Data Parallel (DDP). This splits the workload across GPUs for faster training.
Setup¶
- Create a training script (`main.py`):
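A minimal sketch of `main.py`, assuming the `RFDETRBase` class and the `train()` arguments used elsewhere on this page:

```python
# main.py -- minimal training script for a DDP launch (illustrative sketch)
from rfdetr import RFDETRBase

if __name__ == "__main__":
    model = RFDETRBase()
    model.train(
        dataset_dir="path/to/dataset",
        epochs=100,
        batch_size=2,        # per-GPU batch size (see the table below)
        grad_accum_steps=1,  # with 8 GPUs this gives an effective batch size of 16
    )
```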
- Run with `torch.distributed.launch`:
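For example, launching across 8 GPUs (a sketch; the exact flags can vary between PyTorch versions, and newer versions prefer `torchrun`):

```bash
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py
```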
Replace `8` with the number of GPUs you want to use.
Batch Size with Multiple GPUs¶
When using multiple GPUs, your effective batch size is multiplied by the number of GPUs: effective batch size = `batch_size` × `grad_accum_steps` × number of GPUs.
Example configurations for effective batch size of 16:
| GPUs | `batch_size` | `grad_accum_steps` | Effective batch size |
|---|---|---|---|
| 1 | 4 | 4 | 16 |
| 2 | 4 | 2 | 16 |
| 4 | 4 | 1 | 16 |
| 8 | 2 | 1 | 16 |
Adjust for GPU count
When switching between single and multi-GPU training, remember to adjust batch_size and grad_accum_steps to maintain the same effective batch size.
Multi-Node Training¶
For training across multiple machines, use `torchrun`:

```bash
torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.1" \
    --master_port=1234 \
    main.py
```
Run this command on each node, changing `--node_rank` accordingly.
Logging with TensorBoard¶
TensorBoard is a powerful toolkit for visualizing and tracking training metrics.
Setup¶
- Install the required packages:
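For example (a sketch; check the RF-DETR installation docs for the exact extras your version expects):

```bash
pip install tensorboard
```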
- Enable TensorBoard logging in your training:
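A sketch using the `tensorboard` flag shown in the combined example later on this page:

```python
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    tensorboard=True,  # write metrics for TensorBoard
)
```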
Viewing Logs¶
Local environment:
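A hedged sketch: point the TensorBoard CLI at your training output directory (the path here is a placeholder):

```bash
tensorboard --logdir path/to/output
```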
Then open http://localhost:6006/ in your browser.
Google Colab:
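In a notebook you can use the TensorBoard magics instead (again assuming logs live under your output directory):

```
%load_ext tensorboard
%tensorboard --logdir path/to/output
```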
Logged Metrics¶
TensorBoard will track:
- Training loss (total and per-component)
- Validation mAP
- Learning rate schedule
- EMA model metrics (when enabled)
Logging with Weights and Biases¶
Weights and Biases (W&B) is a cloud-based platform for experiment tracking and visualization.
Setup¶
- Install the required packages:
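For example (a sketch; your exact dependencies may differ):

```bash
pip install wandb
```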
- Log in to W&B:
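Using the standard W&B CLI:

```bash
wandb login
```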
You can retrieve your API key at wandb.ai/authorize.
- Enable W&B logging in your training:
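A sketch mirroring the combined example later on this page:

```python
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    wandb=True,
    project="my-project",  # groups related experiments
    run="experiment-001",  # names this training session
)
```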
W&B Organization¶
| Parameter | Description |
|---|---|
| `project` | Groups related experiments together |
| `run` | Identifies individual training sessions |
If you don't specify a run name, W&B assigns a random one automatically.
Viewing Results¶
Access your runs at wandb.ai. W&B provides:
- Real-time metric visualization
- Experiment comparison
- Hyperparameter tracking
- System metrics (GPU usage, memory)
- Training config logging
Using Both TensorBoard and W&B¶
You can enable both logging systems simultaneously:
```python
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    tensorboard=True,
    wandb=True,
    project="my-project",
    run="experiment-001"
)
```
Memory Optimization¶
Gradient Checkpointing¶
For large models or high resolutions, enable gradient checkpointing to trade compute for memory:
```python
model.train(
    dataset_dir="path/to/dataset",
    gradient_checkpointing=True,
    batch_size=2,  # May be able to increase with checkpointing
)
```
This re-computes activations during the backward pass instead of storing them, reducing memory usage by ~30-40% at the cost of ~20% slower training.
Memory-Efficient Configurations¶
| Memory Level | Configuration |
|---|---|
| Very Low (8GB) | batch_size=1, grad_accum_steps=16, gradient_checkpointing=True, resolution=560 |
| Low (12GB) | batch_size=2, grad_accum_steps=8, gradient_checkpointing=True |
| Medium (16GB) | batch_size=4, grad_accum_steps=4 |
| High (24GB) | batch_size=8, grad_accum_steps=2 |
| Very High (40GB+) | batch_size=16, grad_accum_steps=1, resolution=784 |
Training Tips¶
Learning Rate Tuning¶
- Fine-tuning from COCO weights (default): use the default learning rates (`lr=1e-4`, `lr_encoder=1.5e-4`)
- Small dataset (<1000 images): consider a lower `lr` (e.g., `5e-5`) to prevent overfitting, as in the sketch below
- Large dataset (>10000 images): may benefit from a higher `lr` (e.g., `2e-4`)
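For instance, a hedged sketch of the small-dataset guidance above:

```python
model.train(
    dataset_dir="path/to/dataset",
    epochs=150,
    lr=5e-5,              # lower than the 1e-4 default to reduce overfitting
    early_stopping=True,  # stop once validation mAP plateaus
)
```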
Epoch Count¶
| Dataset Size | Recommended Epochs |
|---|---|
| < 500 images | 100-200 |
| 500-2000 images | 50-100 |
| 2000-10000 images | 30-50 |
| > 10000 images | 20-30 |
Use early stopping to automatically determine the optimal stopping point.
Data Augmentation¶
RF-DETR applies built-in augmentations during training:
- Random resizing
- Random cropping
- Color jittering
- Horizontal flipping
These are automatically configured and don't require manual setup.
Troubleshooting¶
Out of Memory (OOM)¶
If you encounter CUDA out of memory errors:
- Reduce `batch_size`
- Enable `gradient_checkpointing=True`
- Reduce `resolution`
- Increase `grad_accum_steps` to maintain the effective batch size (a combined example follows below)
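For example, a low-memory configuration based on the table above (a sketch, not a guaranteed fit for any specific GPU):

```python
model.train(
    dataset_dir="path/to/dataset",
    batch_size=2,                 # smaller per-step batch
    grad_accum_steps=8,           # keeps the effective batch size at 16
    gradient_checkpointing=True,  # trade compute for memory
)
```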
Training Too Slow¶
- Increase `batch_size` (if memory allows)
- Use multiple GPUs with DDP
- Ensure you're using a GPU (check `device="cuda"`)
- Consider using a smaller model (e.g., `RFDETRSmall` instead of `RFDETRLarge`)
Loss Not Decreasing¶
- Check that your dataset is correctly formatted
- Verify annotations are correct (bounding boxes in correct format)
- Try reducing the learning rate
- Check for class imbalance in your dataset