Using a supported or custom learning rate scheduler
The learning rate and the loss function both affect the training of machine learning models:

- The learning rate directly affects how the model's parameters are updated during the optimization process.
- Parameter updates impact the value of the loss function.
For example, during each iteration of the training process, the model computes the gradient of the loss function with respect to its parameters. The parameters are then adjusted in the direction that reduces the loss (the negative gradient direction), and the learning rate determines the size of that step.
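As a minimal, framework-agnostic illustration (plain gradient descent, not SambaFlow's optimizer), a single update step scales the gradient by the learning rate:

```python
def sgd_step(params, grads, learning_rate):
    # Move each parameter against its gradient; the learning rate sets the step size.
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Example: with a larger learning rate, the same gradients produce a larger update.
print(sgd_step([1.0, 2.0], [0.5, -0.25], learning_rate=0.1))   # [0.95, 2.025]
print(sgd_step([1.0, 2.0], [0.5, -0.25], learning_rate=0.01))  # [0.995, 2.0025]
```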
SambaFlow supports most of the popular learning rate schedulers out of the box. This document discusses the supported schedulers, which work for most use cases, and briefly explores using a custom scheduler.
Supported schedulers
By default, SambaFlow supports the following schedulers (a sketch of their typical behavior follows the list):
- `cosine_schedule_with_warmup` — Adjusts the learning rate according to the `--warmup_steps` and `--max_steps` arguments: the rate warms up over `--warmup_steps`, then follows a cosine decay until `--max_steps`.
- `polynomial_decay_schedule_with_warmup` — Similar to the cosine schedule, but uses a linear decay instead of a cosine decay (the "polynomial" part of the name is misleading; the decay is always linear).
- `fixed_lr` — No learning rate schedule at all; the learning rate remains at the value specified by `--learning_rate` throughout training.
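The following is a rough sketch of how these three schedules typically compute the learning rate at a given step. It is an approximation for intuition, not SambaFlow's actual implementation; the argument names simply mirror the command-line flags described below.

```python
import math

def lr_at_step(step, schedule, learning_rate, warmup_steps, max_steps, end_lr_ratio):
    if schedule == "fixed_lr":
        # No schedule: the learning rate never changes.
        return learning_rate

    # Both warmup schedules increase the LR linearly from 0 to the peak value.
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)

    # Fraction of the decay phase that has elapsed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    end_lr = learning_rate * end_lr_ratio

    if schedule == "cosine_schedule_with_warmup":
        # Cosine decay from the peak LR down to end_lr.
        return end_lr + (learning_rate - end_lr) * 0.5 * (1 + math.cos(math.pi * progress))

    if schedule == "polynomial_decay_schedule_with_warmup":
        # Despite the name, the decay is linear from the peak LR down to end_lr.
        return end_lr + (learning_rate - end_lr) * (1 - progress)

    raise ValueError(f"unknown schedule: {schedule}")

# Example: peak LR 1e-4, 100 warmup steps, decay over 1000 steps to 0.1x the peak.
print(lr_at_step(500, "cosine_schedule_with_warmup", 1e-4, 100, 1000, 0.1))
```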
Parameters that affect learning rate and schedulers
Here's a list of the command-line arguments that control the learning rate schedule. Not all models support all arguments; run `python <model> run --help` to see which arguments your model supports. An example invocation follows the list.
- `--learning_rate`. The peak learning rate when using a schedule with warmup, or the flat learning rate when using `fixed_lr`.
- `--lr_schedule`. One of `cosine_schedule_with_warmup`, `polynomial_decay_schedule_with_warmup`, or `fixed_lr`.
- `--warmup_steps`. For the schedulers with warmup, the number of steps over which to linearly increase the learning rate up to the value specified by `--learning_rate`. Set this to 0 for no warmup at all.
- `--max_steps`. The number of steps over which to decay the learning rate; this is also the number of training steps at which the model finishes training. To have the LR schedule decay over a different number of steps than you actually train for, specify both `--max_steps` and `--steps_this_run`: use `--max_steps` to control the learning rate schedule and `--steps_this_run` to control the number of training steps actually taken.
- `--end_lr_ratio`. The fraction of the peak LR at which decay ends. The default differs by schedule: `cosine_schedule_with_warmup` ends at 0.1x the peak rate, while `polynomial_decay_schedule_with_warmup` ends at 0.0.
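For example, a hypothetical invocation (placeholder values; exact flag support varies by model, so check `--help`) might combine these arguments as follows:

```sh
python <model> run \
  --learning_rate 1e-4 \
  --lr_schedule cosine_schedule_with_warmup \
  --warmup_steps 200 \
  --max_steps 10000 \
  --steps_this_run 5000
```

Here the learning rate warms up linearly for 200 steps and decays toward its end value at step 10,000, but training itself stops after 5,000 steps.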
Using a custom scheduler
If you specifically want to experiment with the LR schedule as an independent variable, it likely makes sense to edit `learning_rate_schedule.py` to define your own schedule and then import it into your model. For example, for our Transformers model, you'd import your learning rate schedule into `gpt2_task.py`.
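As a minimal sketch, assuming you add your own schedule function to `learning_rate_schedule.py` (the function below and its name are hypothetical, not part of SambaFlow), a custom schedule can be expressed as a function from step number to learning rate:

```python
import math

def inverse_sqrt_schedule_with_warmup(step, learning_rate, warmup_steps):
    """Hypothetical custom schedule: linear warmup, then 1/sqrt(step) decay."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return learning_rate * step / max(1, warmup_steps)
    # After warmup, decay proportionally to the inverse square root of the step.
    return learning_rate * math.sqrt(max(1, warmup_steps)) / math.sqrt(max(1, step))
```

You would then import this function into your model's task file (for example, `gpt2_task.py` for the Transformers model) and use it in place of one of the built-in schedules.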
Learn more
In our model conversion documentation, we explain how to use a custom loss function.