samba.optim
AdamW
- class AdamW(params: Iterator[SambaParameter], lr: float = 0.001, betas: Tuple[float] = (0.9, 0.997), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, max_grad_norm: float | None = None, fp32_params: bool = False, argin_region_name='')
Implements the AdamW algorithm.
- Parameters:
params – iterable of parameters to optimize or dicts defining parameter groups
lr – learning rate. Defaults to 1e-3.
betas – coefficients used for computing running averages of the gradient and its square. Defaults to (0.9, 0.997).
eps – term added to the denominator to improve numerical stability. Defaults to 1e-8.
weight_decay – weight decay coefficient. This value prevents weights from growing too large, i.e. it regularizes the weights. Defaults to 0.0.
amsgrad – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Defaults to False.
max_grad_norm – max norm of the gradients. Defaults to None.
fp32_params – create a copy of the parameters in FP32 for use on the RDU. Defaults to False.
argin_region_name – prefix for the hyperparameters for this optimizer. Allows setting different hyperparameters for different optimizer instances. Defaults to the empty string ("").
See also
For details, see torch.optim.AdamW.
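The roles of lr, betas, eps, and weight_decay can be illustrated with a minimal pure-Python sketch of one AdamW update on a single scalar. It follows the standard decoupled-weight-decay formulation and assumes that samba.optim.AdamW matches it. The function name adamw_step and the scalar representation are illustrative and not part of the samba API. AMSGrad and gradient-norm clipping are omitted.

```python
import math


def adamw_step(p, grad, m, v, t, lr=1e-3, betas=(0.9, 0.997),
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    # Decoupled weight decay: shrink the weight directly, independent of the gradient.
    p = p * (1.0 - lr * weight_decay)
    # Running averages of the gradient (m) and of its square (v).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero (t counts from 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # eps keeps the denominator away from zero.
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


# First step from a zero-initialized optimizer state.
p, m, v = adamw_step(p=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly lr regardless of the gradient's magnitude.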
- add_param_group(param_group)
Add a param group to the Optimizer's param_groups.
This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
- Parameters:
param_group (dict) – specifies which Tensors should be optimized, along with group-specific optimization options.
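The fine-tuning pattern described above can be sketched with a toy stand-in that mimics only the param-group bookkeeping. ToyOptimizer, the parameter names, and the fallback-to-defaults behavior are assumptions modeled on the torch.optim convention. They are not taken from the samba implementation.

```python
class ToyOptimizer:
    """Hypothetical stand-in that only tracks parameter groups."""

    def __init__(self, params, lr):
        self.defaults = {"lr": lr}
        self.param_groups = []
        self.add_param_group({"params": list(params)})

    def add_param_group(self, param_group):
        group = dict(param_group)
        # Options missing from the group fall back to the optimizer defaults.
        for name, value in self.defaults.items():
            group.setdefault(name, value)
        self.param_groups.append(group)


head = ["head.weight", "head.bias"]   # trainable from the start
backbone = ["backbone.weight"]        # initially frozen

opt = ToyOptimizer(head, lr=1e-3)
# Later in fine-tuning: unfreeze the backbone with a smaller learning rate.
opt.add_param_group({"params": backbone, "lr": 1e-4})
```

Each group carries its own options, so the newly unfrozen layers can train with a different learning rate than the head.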
- cpu(inplace: bool = False)
For each parameter with a gradient, transfers the optimizer state (momentum) from the RDU to the host.
- Parameters:
inplace – whether to reuse the CPU memory for the optimizer state when transferring from the RDU to the host. Defaults to False.
- load_state_dict(state_dict)
Loads the optimizer state.
- Parameters:
state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().
- rdu()
For each parameter with a gradient, transfers the optimizer state (momentum) from the host to the RDU.
- state_dict()
Returns the state of the optimizer as a dict.
It contains two entries:
- state - a dict holding the current optimization state. Its content differs between optimizer classes.
- param_groups - a list containing all parameter groups, where each parameter group is a dict.
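As a rough picture of the two entries, the literal below shows what such a dict might look like after one step on a single parameter. The key names inside state (step, exp_avg, exp_avg_sq) and the use of integer parameter indices follow the torch.optim convention and are assumptions here. The actual contents returned by samba.optim may differ.

```python
# Hypothetical snapshot; the values are made up for illustration.
example_state_dict = {
    "state": {
        0: {"step": 1, "exp_avg": [0.05], "exp_avg_sq": [0.00075]},
    },
    "param_groups": [
        {"lr": 1e-3, "betas": (0.9, 0.997), "eps": 1e-8,
         "weight_decay": 0.0, "amsgrad": False, "params": [0]},
    ],
}

# Each parameter index listed in a group has a matching entry in "state".
for group in example_state_dict["param_groups"]:
    for index in group["params"]:
        assert index in example_state_dict["state"]
```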
- step(closure=None)
Performs a single optimization step.
- Parameters:
closure (callable, optional) – A closure that reevaluates the model and returns the loss.
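The closure contract can be sketched with a toy optimizer: step calls the closure to recompute the loss and gradient, applies the update, and returns the loss. ClosureDemoOptimizer and its plain gradient-descent update are hypothetical. Only the shape of the closure (no arguments, returns the loss) is taken from the description above.

```python
class ClosureDemoOptimizer:
    """Hypothetical optimizer holding one weight, using plain gradient descent."""

    def __init__(self, lr):
        self.lr = lr
        self.weight = 0.0
        self.grad = 0.0

    def step(self, closure=None):
        loss = None
        if closure is not None:
            # Re-evaluate the model so the gradient is fresh before updating.
            loss = closure()
        self.weight -= self.lr * self.grad
        return loss


opt = ClosureDemoOptimizer(lr=0.1)


def closure():
    # Toy "model": loss (w - 3)^2 with gradient 2 * (w - 3).
    opt.grad = 2.0 * (opt.weight - 3.0)
    return (opt.weight - 3.0) ** 2


loss = opt.step(closure)
```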
- zero_grad(set_to_none: bool = False)
Sets the gradients of all optimized torch.Tensors to zero.
- Parameters:
set_to_none (bool) – instead of setting the gradients to zero, set them to None. This generally has a lower memory footprint and can modestly improve performance. However, it changes certain behaviors. For example:
1. When the user accesses a gradient and performs manual operations on it, a None attribute and a Tensor full of 0s behave differently.
2. If the user calls zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for parameters that did not receive a gradient.
3. torch.optim optimizers behave differently depending on whether the gradient is 0 or None: with a gradient of 0 they perform the step, and with None they skip the step altogether.
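The difference between a zero gradient and a None gradient matters most when weight decay is active. The sketch below uses a toy Param class and a toy SGD-style step, both hypothetical, to show that a parameter with a zero gradient is still decayed while a parameter with a None gradient is skipped.

```python
class Param:
    """Hypothetical parameter holding a value and an optional gradient."""

    def __init__(self, value):
        self.value = value
        self.grad = None


def zero_grad(params, set_to_none=False):
    for p in params:
        if set_to_none:
            p.grad = None
        elif p.grad is not None:
            p.grad = 0.0


def sgd_step(params, lr, weight_decay):
    for p in params:
        if p.grad is None:
            continue  # a None gradient skips the parameter entirely
        p.value -= lr * (p.grad + weight_decay * p.value)


a, b = Param(1.0), Param(1.0)
a.grad = b.grad = 0.5
zero_grad([a], set_to_none=False)  # a.grad becomes 0.0
zero_grad([b], set_to_none=True)   # b.grad becomes None
sgd_step([a, b], lr=0.1, weight_decay=0.1)
```

After the step, a has shrunk slightly because of weight decay, while b is unchanged.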
- class SparseAdamW(params: List[Tensor], lr: float = 0.001, betas: Tuple[float] = (0.9, 0.997), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False)
Implements the SparseAdamW algorithm. SparseAdamW implements the same math as AdamW and exists mainly as a performance optimization for embedding weights. It updates momentum and velocity only for the index rows that are used in a given iteration.
- Parameters:
params – iterable of parameters to optimize or dicts that define parameter groups
lr – learning rate. Defaults to 1e-3.
betas – coefficients used for computing running averages of gradient and its square. Defaults to (0.9, 0.997).
eps – term added to the denominator to improve numerical stability. Defaults to 1e-8.
weight_decay – weight decay coefficient. This value prevents weights from growing too large, i.e. it regularizes the weights. Defaults to 0.0.
amsgrad – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Defaults to False.
See also
For details, see torch.optim.AdamW.
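The row-wise behavior can be sketched in pure Python for a small embedding table. This is a simplified illustration under stated assumptions: gradients arrive as a dict from row index to gradient list, a single global step count t is used for bias correction, and weight decay is applied only to the touched rows. None of these details are confirmed for the samba implementation, and sparse_adamw_step is a hypothetical name.

```python
import math


def sparse_adamw_step(table, row_grads, m, v, t, lr=1e-3,
                      betas=(0.9, 0.997), eps=1e-8, weight_decay=0.0):
    """Update only the embedding rows that appear in row_grads."""
    beta1, beta2 = betas
    for row, grad in row_grads.items():
        for j, g in enumerate(grad):
            # Momentum (m) and velocity (v) change only for rows used this step.
            m[row][j] = beta1 * m[row][j] + (1.0 - beta1) * g
            v[row][j] = beta2 * v[row][j] + (1.0 - beta2) * g * g
            m_hat = m[row][j] / (1.0 - beta1 ** t)
            v_hat = v[row][j] / (1.0 - beta2 ** t)
            w = table[row][j] * (1.0 - lr * weight_decay)
            table[row][j] = w - lr * m_hat / (math.sqrt(v_hat) + eps)


table = [[1.0, 1.0] for _ in range(4)]
m = [[0.0, 0.0] for _ in range(4)]
v = [[0.0, 0.0] for _ in range(4)]

# Only row 2 was looked up in this iteration.
sparse_adamw_step(table, {2: [0.5, -0.5]}, m, v, t=1)
```

Rows 0, 1, and 3 keep their weights and their zero-initialized state, which is where the performance benefit for large embedding tables comes from.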
- add_param_group(param_group)
Add a param group to the Optimizer's param_groups.
This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
- Parameters:
param_group (dict) – specifies which Tensors should be optimized, along with group-specific optimization options.
- cpu(inplace: bool = False)
For each parameter with a gradient, transfers the optimizer state (momentum) from the RDU to the host.
- Parameters:
inplace – whether to reuse the CPU memory for the optimizer state when transferring from the RDU to the host. Defaults to False.
- load_state_dict(state_dict)
Loads the optimizer state.
- Parameters:
state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().
- rdu()
For each parameter with a gradient, transfers the optimizer state (momentum) from the host to the RDU.
- state_dict()
Returns the state of the optimizer as a dict.
It contains two entries:
- state - a dict holding the current optimization state. Its content differs between optimizer classes.
- param_groups - a list containing all parameter groups, where each parameter group is a dict.
- step(closure=None)
Performs a single optimization step.
- Parameters:
closure (callable, optional) – A closure that reevaluates the model and returns the loss.
- zero_grad(set_to_none: bool = False)
Sets the gradients of all optimized torch.Tensors to zero.
- Parameters:
set_to_none (bool) – instead of setting the gradients to zero, set them to None. This generally has a lower memory footprint and can modestly improve performance. However, it changes certain behaviors. For example:
1. When the user accesses a gradient and performs manual operations on it, a None attribute and a Tensor full of 0s behave differently.
2. If the user calls zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for parameters that did not receive a gradient.
3. torch.optim optimizers behave differently depending on whether the gradient is 0 or None: with a gradient of 0 they perform the step, and with None they skip the step altogether.
SGD
- class SGD(params, lr, *args, momentum=0.0, weight_decay=0.0, argin_region_name='', **kwargs)
Implements stochastic gradient descent (optionally with momentum).
- Parameters:
params – iterable of parameters to optimize or dicts defining parameter groups
lr – learning rate
momentum – momentum factor. This value helps accelerate SGD in the relevant direction and dampens oscillations. A typical value for momentum is 0.9. Defaults to 0.0.
weight_decay – weight decay. This value prevents weights from growing too large, i.e. it regularizes the weights. Defaults to 0.0.
argin_region_name – prefix for the hyperparameters for this optimizer. Allows setting different hyperparameters for different optimizer instances. Defaults to the empty string ("").
See also
For details, see torch.optim.SGD.
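How momentum and weight_decay enter the update can be sketched for a single scalar. The sketch follows the formulation used by torch.optim.SGD (weight decay added to the gradient, no dampening, no Nesterov momentum) and assumes that samba.optim.SGD behaves the same way. The function name sgd_step is illustrative.

```python
def sgd_step(p, grad, buf, lr, momentum=0.0, weight_decay=0.0):
    """One SGD update for a single scalar parameter (illustrative only)."""
    # L2-style weight decay is folded into the gradient.
    grad = grad + weight_decay * p
    # The momentum buffer accumulates past gradients.
    buf = momentum * buf + grad
    p = p - lr * buf
    return p, buf


p, buf = 1.0, 0.0
for _ in range(2):
    p, buf = sgd_step(p, grad=1.0, buf=buf, lr=0.1, momentum=0.9)
```

With a constant gradient, the buffer grows from 1.0 to 1.9 on the second step, so the second update is almost twice as large as the first. This is the acceleration in a consistent direction that the momentum parameter describes.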
- add_param_group(param_group)
Add a param group to the Optimizer's param_groups.
This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
- Parameters:
param_group (dict) – specifies which Tensors should be optimized, along with group-specific optimization options.
- cpu(inplace: bool = False)
For each parameter with a gradient, transfers the optimizer state (momentum) from the RDU to the host.
- Parameters:
inplace – whether to reuse the CPU memory for the optimizer state when transferring from the RDU to the host. Defaults to False.
- load_state_dict(state_dict)
Loads the optimizer state.
- Parameters:
state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().
- rdu()
For each parameter with a gradient, transfers the optimizer state (momentum) from the host to the RDU.
- state_dict()
Returns the state of the optimizer as a dict.
It contains two entries:
- state - a dict holding the current optimization state. Its content differs between optimizer classes.
- param_groups - a list containing all parameter groups, where each parameter group is a dict.
- step(closure: Callable | None = None)
Performs a single optimization step.
- Parameters:
closure – A closure that reevaluates the model and returns the loss.
- zero_grad(set_to_none: bool = False)
Sets the gradients of all optimized torch.Tensors to zero.
- Parameters:
set_to_none (bool) – instead of setting the gradients to zero, set them to None. This generally has a lower memory footprint and can modestly improve performance. However, it changes certain behaviors. For example:
1. When the user accesses a gradient and performs manual operations on it, a None attribute and a Tensor full of 0s behave differently.
2. If the user calls zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for parameters that did not receive a gradient.
3. torch.optim optimizers behave differently depending on whether the gradient is 0 or None: with a gradient of 0 they perform the step, and with None they skip the step altogether.