Using a supported or custom learning rate scheduler

The learning rate and the loss function both affect the training of machine learning models.

  • The learning rate directly affects how the model’s parameters are updated during the optimization process.

  • Parameter updates impact the value of the loss function.

For example, during each iteration of the training process, the model computes the gradient of the loss function with respect to its parameters. The gradient indicates the direction in which the parameters should be adjusted to minimize the loss. The learning rate determines the step size taken in the direction of the gradient.
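
The following sketch illustrates that relationship with a single, framework-agnostic gradient-descent update. The function and variable names are purely illustrative and are not part of the SambaFlow API.

    # One plain gradient-descent update (illustrative only).
    # learning_rate scales the step taken opposite the gradient direction.
    def sgd_step(params, grads, learning_rate):
        return [p - learning_rate * g for p, g in zip(params, grads)]

    # A larger learning rate takes a larger step toward (or past) the minimum.
    params = [0.5, -1.2]
    grads = [0.1, -0.4]
    print(sgd_step(params, grads, learning_rate=0.01))  # approximately [0.499, -1.196]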

SambaFlow supports most of the popular learning rate schedulers out of the box. This document discusses the supported schedulers, which work for most use cases, and briefly explores using a custom scheduler.

Supported schedulers

By default, SambaFlow supports the following schedulers:

  • cosine_schedule_with_warmup — Increases the learning rate linearly over --warmup_steps, then decays it along a cosine curve until --max_steps (see the sketch after this list).

  • polynomial_decay_schedule_with_warmup — Uses the same warmup as the cosine scheduler, but decays linearly instead of along a cosine curve (despite the “polynomial” in the name, the decay is always linear).

  • fixed_lr — No learning rate schedule at all; the learning rate remains at the value specified by --learning_rate throughout training.
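
The shapes of these three schedules can be summarized with the sketch below. This is not SambaFlow’s implementation, only an illustration of how the warmup and decay phases relate to the parameters described in the next section; end_lr_ratio is the fraction of the peak rate at which decay ends.

    import math

    # Illustration of the three schedule shapes (not SambaFlow's implementation).
    def lr_at_step(step, peak_lr, warmup_steps, max_steps,
                   schedule="cosine_schedule_with_warmup", end_lr_ratio=0.0):
        if schedule == "fixed_lr":
            return peak_lr                                # constant throughout training
        if step < warmup_steps:                           # linear warmup up to peak_lr
            return peak_lr * step / max(1, warmup_steps)
        # Fraction of the decay phase completed, in [0, 1].
        progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
        if schedule == "cosine_schedule_with_warmup":
            decay = 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine from 1 down to 0
        else:  # polynomial_decay_schedule_with_warmup: linear decay
            decay = 1.0 - progress
        end_lr = peak_lr * end_lr_ratio
        return end_lr + (peak_lr - end_lr) * decay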

Parameters that affect learning rate and schedulers

Here’s a list of the command-line arguments that control the learning rate schedule. An example invocation follows the list.

Not all models support all arguments. Run python <model> run --help to see which arguments your model supports.
  • --learning_rate. Peak learning rate when employing a schedule with warmup, or the flat learning rate when using fixed_lr.

  • --lr_schedule. One of cosine_schedule_with_warmup, polynomial_decay_schedule_with_warmup, or fixed_lr.

  • --warmup_steps. When using one of the schedulers with warmup, the number of steps over which to linearly increase the learning rate up to the value specified by --learning_rate. Set this to 0 for no warmup at all.

  • --max_steps. The number of steps over which to decay the learning rate. This is also the number of training steps at which the model finishes training. To have the LR scheduler decay over a different number of steps than you actually train for, specify both --max_steps and --steps_this_run: use --max_steps to control the learning rate and --steps_this_run to control the number of training steps actually taken.

  • --end_lr_ratio. The fraction of the peak LR at which to end the decay. The default differs by scheduler: cosine_schedule_with_warmup ends at 0.1x the peak rate, while polynomial_decay_schedule_with_warmup ends at 0.0.
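
Putting these together, an invocation along the following lines (placeholder model script, illustrative values, and other required arguments such as data paths omitted) warms the learning rate up to 1e-4 over the first 100 steps, then decays it along a cosine curve down to 1e-5 (0.1x the peak, which is the cosine default shown here explicitly) at step 1000:

    python <model> run \
        --lr_schedule cosine_schedule_with_warmup \
        --learning_rate 1e-4 \
        --warmup_steps 100 \
        --max_steps 1000 \
        --end_lr_ratio 0.1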

Using a custom scheduler

If you specifically want to experiment with the LR schedule as an independent variable, it likely makes sense to edit learning_rate_schedule.py and then import your own schedule. For example, for our Transformers model, you’d import your learning rate schedule into gpt2_task.py.
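
As a rough sketch only (the exact interface expected by learning_rate_schedule.py and gpt2_task.py may differ in your SambaFlow version), a custom schedule can be expressed as a standard PyTorch LambdaLR factory and then imported into the task file:

    import math
    from torch.optim.lr_scheduler import LambdaLR

    # Hypothetical custom schedule: linear warmup, then inverse-square-root decay.
    # Function and argument names are illustrative; match whatever conventions
    # learning_rate_schedule.py uses in your SambaFlow release.
    def get_inverse_sqrt_schedule_with_warmup(optimizer, warmup_steps):
        def lr_lambda(step):
            if step < warmup_steps:
                return step / max(1, warmup_steps)                  # linear warmup
            return math.sqrt(max(1, warmup_steps) / max(1, step))   # ~1/sqrt(step) decay
        return LambdaLR(optimizer, lr_lambda)

The multiplier returned by lr_lambda scales the optimizer’s base learning rate, so the peak rate is still whatever you pass with --learning_rate.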

Learn more

In our model conversion documentation, we explain how to use a custom loss function.