samba.optim#

AdamW#

class AdamW(params: Iterator[SambaParameter], lr: float = 0.001, betas: Tuple[float] = (0.9, 0.997), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, max_grad_norm: float | None = None, fp32_params: bool = False, argin_region_name='')#

Implements the AdamW algorithm.

Parameters:
  • params – iterable of parameters to optimize or dicts defining parameter groups

  • lr – learning rate. Defaults to 1e-3.

  • betas – coefficients used for computing running averages of the gradient and its square. Defaults to (0.9, 0.997).

  • eps – term added to the denominator to improve numerical stability. Defaults to 1e-8.

  • weight_decay – weight decay coefficient. This value prevents weights from growing too large, i.e. it regularizes the weights. Defaults to 0.0.

  • amsgrad – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Defaults to False.

  • max_grad_norm – max norm of the gradients. Defaults to None.

  • fp32_params – create a copy of the parameters in FP32 for use on the RDU. Defaults to False.

  • argin_region_name – prefix for the hyperparameters for this optimizer. Allows setting different hyperparameters for different optimizer instances. Defaults to the empty string (“”).

See also

For details, see torch.optim.AdamW.
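The per-parameter update that AdamW performs can be sketched in pure Python. This is an illustrative scalar version of the math, not the samba.optim implementation; the default hyperparameter values are taken from the signature above.

```python
import math

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.997,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a scalar parameter p. Returns (p, m, v)."""
    # Update biased first- and second-moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction (t is the 1-based step count).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the weight itself, not the gradient.
    p = p - lr * weight_decay * p
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# One step from p=1.0 with gradient 0.5:
p, m, v = adamw_step(1.0, 0.5, 0.0, 0.0, t=1)
```

The decoupled decay term is what distinguishes AdamW from Adam with L2 regularization: the decay is not rescaled by the adaptive learning rate.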

add_param_group(param_group)#

Add a param group to the Optimizer's param_groups.

This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters:
  • param_group (dict) – Specifies what Tensors should be optimized along with group-specific optimization options.

cpu(inplace: bool = False)#

For each parameter with a gradient, transfers the optimizer state (momentum) from the RDU to the host.

Parameters:

inplace – whether to reuse the CPU memory for the optimizer state when transferring from the RDU to the host. Defaults to False.

load_state_dict(state_dict)#

Loads the optimizer state.

Parameters:

state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

rdu()#

For each parameter with a gradient, transfers the optimizer state (momentum) from the host to the RDU.

state_dict()#

Returns the state of the optimizer as a dict.

It contains two entries:

  • state – a dict holding current optimization state. Its content differs between optimizer classes.

  • param_groups – a list containing all parameter groups, where each parameter group is a dict.
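The shape of the returned dict can be illustrated with plain Python. The field names below follow torch.optim conventions; the exact contents of the per-parameter "state" entries (and any extra keys) vary by optimizer class, so treat this as an assumed example, not the exact samba.optim layout.

```python
# Illustrative shape of an optimizer state_dict (torch.optim conventions).
example_state_dict = {
    "state": {
        # Keyed by parameter index; contents differ between optimizer classes.
        0: {"step": 10, "exp_avg": [0.1, 0.2], "exp_avg_sq": [0.01, 0.04]},
    },
    "param_groups": [
        {"lr": 1e-3, "betas": (0.9, 0.997), "eps": 1e-8,
         "weight_decay": 0.0, "params": [0]},
    ],
}

def check_state_dict(sd):
    """Sanity-check the two required top-level entries."""
    return set(sd) == {"state", "param_groups"} and isinstance(sd["param_groups"], list)
```

A dict of this shape is what load_state_dict() expects to receive back when restoring from a checkpoint.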

step(closure=None)#

Performs a single optimization step.

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

zero_grad(set_to_none: bool = False)#

Sets the gradients of all optimized torch.Tensor objects to zero.

Parameters:

set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have a lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:

  1. When the user tries to access a gradient and perform manual ops on it, a None attribute and a Tensor full of 0s will behave differently.

  2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.

  3. torch.optim optimizers behave differently when the gradient is 0 versus None (in one case the step is performed with a gradient of 0; in the other, the step is skipped altogether).
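The difference between zeroing grads and dropping them can be shown with a minimal stand-in class. Param and zero_grad below are hypothetical names used only for illustration; they mimic the documented behavior, not the library's internals.

```python
class Param:
    """Tiny stand-in for a parameter with a .grad attribute (illustrative only)."""
    def __init__(self, grad):
        self.grad = grad

def zero_grad(params, set_to_none=False):
    # Mirrors the documented behavior: either zero the grads in place,
    # or drop them entirely so their memory can be reclaimed.
    for p in params:
        if p.grad is None:
            continue
        if set_to_none:
            p.grad = None
        else:
            p.grad = [0.0] * len(p.grad)

params = [Param([0.3, -0.7]), Param(None)]
zero_grad(params)                    # grads become zero "tensors"
zeroed = params[0].grad
zero_grad(params, set_to_none=True)  # grads become None
```

Downstream code that does arithmetic on .grad must handle the None case when set_to_none=True is used.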

class SparseAdamW(params: List[Tensor], lr: float = 0.001, betas: Tuple[float] = (0.9, 0.997), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False)#

Implements the SparseAdamW algorithm. SparseAdamW implements the same math as AdamW but is tuned for performance with embedding weights: it updates momentum and velocity only for the index rows that are actually used in a given iteration.

Parameters:
  • params – iterable of parameters to optimize or dicts that define parameter groups

  • lr – learning rate. Defaults to 1e-3.

  • betas – coefficients used for computing running averages of gradient and its square. Defaults to (0.9, 0.997).

  • eps – term added to the denominator to improve numerical stability. Defaults to 1e-8.

  • weight_decay – weight decay coefficient. This value prevents weights from growing too large, i.e. it regularizes the weights. Defaults to 0.0.

  • amsgrad – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Defaults to False.

See also

For details, see torch.optim.AdamW
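The row-wise update that makes SparseAdamW cheap for embeddings can be sketched as follows. This is an assumed, simplified pure-Python model (the function name and data layout are invented for illustration): only rows whose indices appear in the batch get their momentum and velocity updated, which is the trade-off described above.

```python
import math

def sparse_adamw_rows(table, grads, rows, m, v, t,
                      lr=1e-3, beta1=0.9, beta2=0.997,
                      eps=1e-8, weight_decay=0.0):
    """Update only the embedding rows touched this iteration.

    table, m, v: lists of per-row vectors; rows: indices seen in the batch;
    grads: {row_index: gradient vector}. Untouched rows keep stale momentum.
    """
    for r in rows:
        for j, g in enumerate(grads[r]):
            m[r][j] = beta1 * m[r][j] + (1 - beta1) * g
            v[r][j] = beta2 * v[r][j] + (1 - beta2) * g * g
            m_hat = m[r][j] / (1 - beta1 ** t)
            v_hat = v[r][j] / (1 - beta2 ** t)
            table[r][j] -= lr * (weight_decay * table[r][j]
                                 + m_hat / (math.sqrt(v_hat) + eps))
    return table, m, v

# Row 0 is used this step; row 1 is untouched and keeps its state.
table = [[1.0, 1.0], [1.0, 1.0]]
m = [[0.0, 0.0], [0.0, 0.0]]
v = [[0.0, 0.0], [0.0, 0.0]]
table, m, v = sparse_adamw_rows(table, {0: [0.5, 0.5]}, [0], m, v, t=1)
```

For a large vocabulary where each batch touches only a few rows, skipping the untouched rows avoids a full pass over the embedding table per step.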

add_param_group(param_group)#

Add a param group to the Optimizer's param_groups.

This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters:
  • param_group (dict) – Specifies what Tensors should be optimized along with group-specific optimization options.

cpu(inplace: bool = False)#

For each parameter with a gradient, transfers the optimizer state (momentum) from the RDU to the host.

Parameters:

inplace – whether to reuse the CPU memory for the optimizer state when transferring from the RDU to the host. Defaults to False.

load_state_dict(state_dict)#

Loads the optimizer state.

Parameters:

state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

rdu()#

For each parameter with a gradient, transfers the optimizer state (momentum) from the host to the RDU.

state_dict()#

Returns the state of the optimizer as a dict.

It contains two entries:

  • state – a dict holding current optimization state. Its content differs between optimizer classes.

  • param_groups – a list containing all parameter groups, where each parameter group is a dict.

step(closure=None)#

Performs a single optimization step.

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

zero_grad(set_to_none: bool = False)#

Sets the gradients of all optimized torch.Tensor objects to zero.

Parameters:

set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have a lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:

  1. When the user tries to access a gradient and perform manual ops on it, a None attribute and a Tensor full of 0s will behave differently.

  2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.

  3. torch.optim optimizers behave differently when the gradient is 0 versus None (in one case the step is performed with a gradient of 0; in the other, the step is skipped altogether).

SGD#

class SGD(params, lr, *args, momentum=0.0, weight_decay=0.0, argin_region_name='', **kwargs)#

Implements stochastic gradient descent (optionally with momentum).

Parameters:
  • params – iterable of parameters to optimize or dicts defining parameter groups

  • lr – learning rate

  • momentum – momentum factor. This value helps accelerate SGD in the relevant direction and dampens oscillations. A typical value for momentum is 0.9. Defaults to 0.0.

  • weight_decay – weight decay. This value prevents weights from growing too large, i.e. it regularizes the weights. Defaults to 0.0.

  • argin_region_name – prefix for the hyperparameters for this optimizer. Allows setting different hyperparameters for different optimizer instances. Defaults to the empty string (“”).

See also

For details see torch.optim.SGD.
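The SGD-with-momentum update can likewise be sketched in pure Python. This is an illustrative scalar version following the torch.optim.SGD convention referenced above (weight decay folded into the gradient, momentum buffer scaled each step), not the samba.optim implementation.

```python
def sgd_step(p, grad, buf, lr=0.1, momentum=0.9, weight_decay=0.0):
    """One SGD step for a scalar parameter p. Returns (p, buf)."""
    g = grad + weight_decay * p   # L2 weight decay folded into the gradient
    buf = momentum * buf + g      # momentum buffer accumulates past gradients
    p = p - lr * buf
    return p, buf

# Three steps with a constant gradient: the momentum buffer grows,
# so successive steps move the parameter further each time.
p, buf = 1.0, 0.0
for _ in range(3):
    p, buf = sgd_step(p, grad=0.2, buf=buf)
```

With momentum=0.0 the buffer equals the current gradient and this reduces to vanilla SGD.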

add_param_group(param_group)#

Add a param group to the Optimizer's param_groups.

This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters:
  • param_group (dict) – Specifies what Tensors should be optimized along with group-specific optimization options.

cpu(inplace: bool = False)#

For each parameter with a gradient, transfers the optimizer state (momentum) from the RDU to the host.

Parameters:

inplace – whether to reuse the CPU memory for the optimizer state when transferring from the RDU to the host. Defaults to False.

load_state_dict(state_dict)#

Loads the optimizer state.

Parameters:

state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

rdu()#

For each parameter with a gradient, transfers the optimizer state (momentum) from the host to the RDU.

state_dict()#

Returns the state of the optimizer as a dict.

It contains two entries:

  • state – a dict holding current optimization state. Its content differs between optimizer classes.

  • param_groups – a list containing all parameter groups, where each parameter group is a dict.

step(closure: Callable | None = None)#

Performs a single optimization step.

Parameters:

closure – A closure that reevaluates the model and returns the loss.

zero_grad(set_to_none: bool = False)#

Sets the gradients of all optimized torch.Tensor objects to zero.

Parameters:

set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have a lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:

  1. When the user tries to access a gradient and perform manual ops on it, a None attribute and a Tensor full of 0s will behave differently.

  2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.

  3. torch.optim optimizers behave differently when the gradient is 0 versus None (in one case the step is performed with a gradient of 0; in the other, the step is skipped altogether).