Click through rate CTR prediction is a crucial task in online displayadvertising and the key part is to learn important feature interactions. Themainstream models are embedding based neural networks that provide end to endtraining by incorporating hybrid components to model both low order andhigh order feature interactions. These models, however, slow down theprediction inference by at least hundreds of times due to the deep neuralnetwork DNN component. Considering the challenge of deploying embedding basedneural networks for online advertising, we propose to prune the redundantparameters for the first time to accelerate the inference and reduce therun time memory usage. Most notably, we can accelerate the inference by 46X onCriteo dataset and 27X on Avazu dataset without loss on the predictionaccuracy.
In addition, the deep model acceleration makes an efficient modelensemble possible with low latency and significant gains on the performance. Generalized linear models, such as linear1; linear2, are scalable and interpretable and have achieved great successes. However, they are limited in their prediction power due to the lack of mechanisms to learn feature interactions. There is considerable work done towards learning feature interactions and they can be roughly grouped into two categories, namely, shallow models and embedding based neural networks. Shallow models include Factorization Machine FM FM, Field aware Factorization Machine FFM FFM and Field weighted Factorization Machine FwFM fwfm. Factorization machine FM FM models quadratic feature interactions by matrix decomposition.
Although theoretically possible to model any orders of feature interactions, FM is mainly used in modeling linear and quadratic feature interactions in practice. Field aware Factorization Machine FFM FFMidentified the importance of fields and proposed to learn several latent vectors for a given feature to model its different interaction effects with other features from different fields. This greatly improves the prediction performance, but the number of parameters is also significantly increased. Field weighted Factorization Machine FwFM fwfm was proposed to model different field interactions in a much more memory efficient way. As a consequence, the prediction performance is as good as FFM while the number of parameters is much smaller.
The embedding based neural networks provide a more powerful non linear modeling by using DNNs. Wide and Deep deepwide proposed to train a joint network that combines a linear model and a DNN model to learn both low order and high order feature interactions. However, the cross features in the linear model still require expertise feature engineering and cannot be easily adapted to new datasets. DeepFM deepfm handled this issue by modeling low order feature interactions through the FM component instead of the linear model. Since then, various embedding based neural networks have been proposed to learn high order feature interactions: Deep and Cross Network DCN deepcross models cross features of bounded degrees in terms of layer depth; Neural Factorization Machines NFM NFM divises a bilinear interaction pooling to connect the embedding vectors with the DNN component; eXtreme Deep Factorization Machine XDeepFM xdeepfm incorporates a Compressed Interaction Network CIN and a DNN to automatically learn high order feature interactions in both explicit and implicit manners. Other relevant work includes PNN; deepcrossing; autoint.
Despite the advances of DNN components in the embedding based neural networks, the prediction inference is slowed by hundreds of times compared to the shallow models, leading to unrealistic latency for the real time ad serving system. To handle this issue, we propose a novel field weighted embedding based neural network DeepFwFM that is particularly suitable for fast and accurate inference. The model itself combines a FwFM component and a vanilla DNN component into a unified model, which is effective to learn both low order and high order feature interactions. More importantly, DeepFwFM shows the unique advantage in structural pruning to greatly reduce the inference time using such a combination, while the other structures may fail in either deep model accelerations or accurate predictions. We support our statement through extensive pruning experiments, which shows that DeepFwFM obtains the best performance not only in predictions, but also in terms of deep model accelerations. Moreover, we observe that a moderate sparsity improves the state of the art result by applying a compact and sufficient structure.
In addition, we can achieve 46X speed ups on Criteo dataset and 27X speed ups on Avazu dataset without loss on AUC. The deep model accelerations enables the fast predictions in large scale ad serving systems and also makes deep model ensemble possible within a limited prediction time. Consequently, we can further improve the performance through integrating the predictions of several sparse DeepFwFM models, which still outperform the corresponding baselines due to the powerful prediction performance and a low latency. We have made code available at ayneDW/sDeepFwFM. Clearly, the FM component is replaced with the FwFM component in DeepFwFM.
DeepFM attempts to model the low order feature interactions through the weight 1 connections, which is however inefficient due to the incapability to adapt to the local geometry. The inclusion of the field matrix R in DeepFwFM resembles an adaptive preconditioned algorithm to speed up the convergence by approximating the Hessian matrix Li16. In addition, it also shows the potential for further structural pruning. With respect to the updates in the linear terms, the joint structure reduces the number of parameters from m to nk, which speeds up the training and increases the robustness of the model. Regarding the modeling of low order feature interactions, the state of the art XDeepFM model proposes to include a compressed interaction network CIN to improve the learning power by explicitly approximating a fixed order polynomial in high dimensions. However, the major drawback is that the CIN has an even larger time complexity than the DNN component as discussed in xdeepfm, resulting in expensive computations in large scale ad systems.
Moreover, inspired from the successful second order optimization algorithms, such as Newton’s method, Quasi Newton method and L BFGS Liu89onthe, we argue that it is sufficient to model low order feature interactions using the second order FwFM. Further advancements in the learning of low order terms in the embedding based neural networks may lead to too much cost and only show marginal improvements.