Skipping L2 regularization on sparse biases
Summary:
# Motivation
As explained in [this Stack Exchange answer](https://stats.stackexchange.com/questions/86991/reason-for-not-shrinking-the-bias-intercept-term-in-regression/161689#161689), regularizing bias terms can mis-calibrate the predicted probabilities.
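For intuition, in the usual penalized-objective formulation (standard ridge-style notation, not taken from this diff), the penalty is applied to the weights only, leaving the bias free to absorb the label base rate:

```latex
\min_{w,\,b} \; \sum_{i=1}^{n} \ell\!\left(y_i,\; w^{\top} x_i + b\right) \;+\; \lambda \lVert w \rVert_2^2
```

Including b in the penalty would shrink it toward zero, systematically biasing the predicted probabilities whenever the label base rate differs from what a zero bias implies.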
In SparseNN, the unary processor may use 1d embedding tables for the sparse features to serve as biases.
This diff automatically skips the regularization term for 1d sparse parameters, so that these biases are not regularized.
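As a minimal sketch of the skip (PyTorch-style illustration only; `named_parameters` and the shape check are assumptions for the sketch, not the actual DPER/SparseNN code path), the change amounts to filtering the 1d tables out of the L2 term:

```python
import torch

def sparse_l2_penalty(named_params, weight_decay):
    """Sum an L2 penalty over parameters, automatically skipping 1d
    embedding tables (sketch only; not the actual DPER implementation)."""
    penalty = torch.zeros(())
    for name, param in named_params:
        # An embedding table with embedding dimension 1 acts as a
        # per-feature bias; leave it out of the regularizer so the
        # predicted probabilities stay calibrated.
        if param.dim() == 2 and param.size(1) == 1:
            continue
        penalty = penalty + weight_decay * param.pow(2).sum()
    return penalty

# Hypothetical usage: add the penalty to the task loss.
# loss = bce_loss + sparse_l2_penalty(model.named_parameters(), 1e-5)
```

The same effect could be achieved by placing the 1d tables in an optimizer parameter group with `weight_decay=0`; the shape check above just makes the skip automatic.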
# Experiments
Experiments were conducted to verify that skipping the regularization on 1d sparse parameters has no significant impact on NE.
Baseline.1 (no L2 regularization): f193105372
Baseline.2 (L2 regularization in prod): f193105522
Treatment (skipping L2 regularization on 1d sparse params): f193105708
{F239859690}
Test Plan:
Experiments were run with the canary package `aml.dper2.canary:9efc576b35b24361bb600dcbf94d31ea` to verify that skipping the regularization on 1d sparse parameters has no significant impact on NE.
Baseline.1 (no L2 regularization): f193105372
Baseline.2 (L2 regularization in prod): f193105522
Treatment (skipping L2 regularization on 1d sparse params): f193105708
Reviewed By: zhongyx12
Differential Revision: D21757902
fbshipit-source-id: ced126e1eab270669b9981c9ecc287dfc9dee995