Optimizer that implements the FTRL algorithm.
Inherits From: Optimizer
tf.keras.optimizers.Ftrl(
learning_rate=0.001, learning_rate_power=-0.5, initial_accumulator_value=0.1,
l1_regularization_strength=0.0, l2_regularization_strength=0.0,
name='Ftrl', l2_shrinkage_regularization_strength=0.0, beta=0.0,
**kwargs
)
See Algorithm 1 of the FTRL paper (McMahan et al., 2013, "Ad Click Prediction: a View from the Trenches"). This version supports both online L2 (the L2 penalty given in the paper above) and shrinkage-type L2 (the addition of an L2 penalty to the loss function).
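A minimal usage sketch (the hyperparameter values below are arbitrary, not recommendations):

```python
import tensorflow as tf

opt = tf.keras.optimizers.Ftrl(
    learning_rate=0.001,
    l1_regularization_strength=0.01,
    l2_regularization_strength=0.01,
)
var = tf.Variable([1.0, 2.0])
loss = lambda: tf.reduce_sum(tf.square(var))  # d(loss)/d(var) = 2 * var
opt.minimize(loss, var_list=[var])  # runs one FTRL update on `var`
```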
Initialization:
$$t = 0$$
$$n_{0} = 0$$
$$\sigma_{0} = 0$$
$$z_{0} = 0$$
Update ($$i$$ is the variable index, $$\alpha$$ is the learning rate):
$$t = t + 1$$
$$n_{t,i} = n_{t-1,i} + g_{t,i}^{2}$$
$$\sigma_{t,i} = (\sqrt{n_{t,i}} - \sqrt{n_{t-1,i}}) / \alpha$$
$$z_{t,i} = z_{t-1,i} + g_{t,i} - \sigma_{t,i} w_{t,i}$$
$$w_{t,i} = \begin{cases} -\left(\frac{\beta + \sqrt{n_{t,i}}}{\alpha} + 2\lambda_{2}\right)^{-1} \left(z_{t,i} - \mathrm{sgn}(z_{t,i})\,\lambda_{1}\right) & \text{if } |z_{t,i}| > \lambda_{1} \\ 0 & \text{otherwise} \end{cases}$$
Check the documentation for the `l2_shrinkage_regularization_strength` parameter for more details when shrinkage is enabled, in which case the gradient is replaced with `gradient_with_shrinkage`.
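To make the update concrete, here is a plain-NumPy sketch of one per-coordinate FTRL step following the equations above (illustrative only; `n`, `z`, and `w` mirror the symbols, and the shrinkage variant, which would replace the gradient before this step, is omitted):

```python
import numpy as np

def ftrl_step(w, n, z, g, alpha=0.001, beta=0.0, lam1=0.0, lam2=0.0):
    """One FTRL-Proximal update for weights w, given gradient g."""
    n_new = n + g ** 2                             # accumulate squared gradients
    sigma = (np.sqrt(n_new) - np.sqrt(n)) / alpha  # per-coordinate rate change
    z_new = z + g - sigma * w                      # linear term of the proximal problem
    # Closed-form per-coordinate solution with L1/L2 regularization:
    # weights are zero wherever |z| <= lambda_1, which induces sparsity.
    w_new = np.where(
        np.abs(z_new) > lam1,
        -(z_new - np.sign(z_new) * lam1)
        / ((beta + np.sqrt(n_new)) / alpha + 2 * lam2),
        0.0,
    )
    return w_new, n_new, z_new
```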
Args | |
---|---|
`learning_rate` | A `Tensor`, floating point value, or a schedule that is a `tf.keras.optimizers.schedules.LearningRateSchedule`. The learning rate. |
`learning_rate_power` | A float value; must be less than or equal to zero. Controls how the learning rate decreases during training. Use zero for a fixed learning rate. |
`initial_accumulator_value` | The starting value for accumulators. Only zero or positive values are allowed. |
`l1_regularization_strength` | A float value; must be greater than or equal to zero. |
`l2_regularization_strength` | A float value; must be greater than or equal to zero. |
`name` | Optional name prefix for the operations created when applying gradients. Defaults to `"Ftrl"`. |
`l2_shrinkage_regularization_strength` | A float value; must be greater than or equal to zero. This differs from L2 above in that the L2 above is a stabilization penalty, whereas this L2 shrinkage is a magnitude penalty. When input is sparse, shrinkage will only happen on the active weights. |
`beta` | A float value, representing the beta value from the paper. |
`**kwargs` | Keyword arguments. Allowed to be one of `"clipnorm"` or `"clipvalue"`. `"clipnorm"` (float) clips gradients by norm; `"clipvalue"` (float) clips gradients by value. |
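For example, gradient clipping can be enabled at construction time via these keyword arguments (a sketch; the values are arbitrary):

```python
import tensorflow as tf

# clipnorm rescales each gradient so its norm is at most 1.0
# before the FTRL update is applied.
opt = tf.keras.optimizers.Ftrl(learning_rate=0.001, clipnorm=1.0)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=opt, loss="mse")
```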
Reference:
- McMahan et al., 2013, "Ad Click Prediction: a View from the Trenches" (https://research.google.com/pubs/archive/41159.pdf)
Raises | |
---|---|
`ValueError` | In case of any invalid argument. |