User response prediction is a crucial component of personalized information retrieval and filtering scenarios, such as recommender systems and web search. The data in user response prediction is mostly in a multi-field categorical format and is transformed into sparse representations via one-hot encoding. Due to the sparsity problems in representation and optimization, most research focuses on feature engineering and shallow modeling. Recently, deep neural networks have attracted research attention on this problem for their high capacity and end-to-end training scheme.
In this paper, we study user response prediction in the scenario of click prediction. We first analyze a coupled gradient issue in latent vector-based models and propose kernel product to learn field-aware feature interactions. Then we discuss an insensitive gradient issue in DNN-based models and propose Product-based Neural Network (PNN), which adopts a feature extractor to explore feature interactions. Generalizing the kernel product to a net-in-net architecture, we further propose Product-network In Network (PIN), which can generalize previous models. Extensive experiments on 4 industrial datasets and 1 contest dataset demonstrate that our models consistently outperform 8 baselines on both AUC and log loss. Besides, PIN achieves a relative CTR improvement of 34.67% in an online A/B test.
The core of personalized service is to estimate the probability that a user will “like”, “click”, or “purchase” an item, given features about the user, the item, and the context (Menon et al., 2011). This probability indicates the user’s interest in the specific item and influences subsequent decision making such as item ranking (Xue et al., 2004) and ad bidding (Zhang et al., 2014b).
Taking online advertising as an example, the estimated Click-Through Rate (CTR) is used to calculate a bid price in an ad auction to improve the advertisers’ budget efficiency and the platform revenue (Zhang et al., 2014b; Perlich et al., 2012; Ren et al., 2016). Hence, it is highly desirable to obtain accurate predictions, not only to improve the user experience but also to boost the volume and profit for the service providers.
Each field is represented as a binary vector, of which only the one entry corresponding to the input is set to 1 while the others are 0. The dimension of a vector is determined by its field size, i.e., the number of unique categories in that field. [Footnote 1: For clarity, we use “category” instead of “feature” to represent a certain value in a categorical field. For consistency with previous literature, we preserve “feature” in some terminologies, e.g., feature combination, feature interaction, and feature representation.] The one-hot vectors of these fields are then concatenated together in a predefined order. Without loss of generality, user response prediction can be regarded as a binary classification problem, and 1/0 are used to denote positive/negative responses respectively (Richardson et al., 2007; Graepel et al., 2010). A main challenge of such a problem is sparsity. Parametric models usually convert the sparse binary input into dense representations (e.g., weights, latent vectors) and then search for a separating hyperplane. Fig. 1 shows the model decomposition. In this paper, we mainly focus on modeling and training; thus we exclude preliminary feature engineering (Cui et al., 2011).
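The one-hot encoding of multi-field categorical data described above can be sketched as follows. This is a minimal illustration only: the field names follow the WEEKDAY/GENDER/CITY example used in this paper, while the category lists and the helper names `one_hot_field`, `encode_instance`, and `active_indices` are assumptions made here.

```python
# Hypothetical field vocabularies in a predefined order.
# The category lists are invented for illustration.
FIELDS = [
    ("WEEKDAY", ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]),
    ("GENDER", ["Male", "Female"]),
    ("CITY", ["London", "Paris", "Shanghai"]),
]


def one_hot_field(categories, value):
    """Binary vector of length len(categories) with a single 1 at `value`."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec


def encode_instance(instance):
    """Concatenate the per-field one-hot vectors in the predefined field order."""
    encoded = []
    for name, categories in FIELDS:
        encoded.extend(one_hot_field(categories, instance[name]))
    return encoded


def active_indices(instance):
    """Sparse form: global index of the single active entry of each field."""
    indices, offset = [], 0
    for name, categories in FIELDS:
        indices.append(offset + categories.index(instance[name]))
        offset += len(categories)
    return indices


example = {"WEEKDAY": "Tue", "GENDER": "Male", "CITY": "London"}
x = encode_instance(example)
```

The dense vector has one dimension per category across all fields, but only one entry per field is non-zero, which is why implementations typically keep just the active indices.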
Many machine learning models have been leveraged or proposed for this problem, including linear models, latent vector-based models, tree models, and DNN-based models. Linear models, such as Logistic Regression (LR) (Lee et al., 2012) and Bayesian Probit Regression (Graepel et al., 2010), are easy to implement and highly efficient. A typical latent vector-based model is Factorization Machine (FM) (Rendle, 2010).
FM uses weights and latent vectors to represent categories. According to their parametric representations, LR has a linear feature extractor, and FM has a bi-linear feature extractor. [Footnote 2: Although FM has higher-order formulations (Rendle, 2010), due to efficiency and practical performance, FM is usually implemented with second-order interactions.] The predictions of LR and FM are simply based on the sum over weights; thus their classifiers are linear. FM works well on sparse data and has inspired many extensions, including Field-aware FM (FFM) (Juan et al., 2016). FFM introduces field-aware latent vectors, which give FFM higher capacity and better performance. However, FFM is restricted by space complexity. Inspired by FFM, we find a coupled gradient issue of latent vector-based models and refine feature interactions as field-aware feature interactions. [Footnote 3: In (Rendle, 2010), the cross features learned by FM are called feature interactions.] To solve this issue as well as save memory, we propose kernel product methods and derive Kernel FM (KFM) and Network in FM (NIFM).
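The difference between FM and FFM interactions, and the memory cost of the latter, can be illustrated with a small sketch. It assumes the standard second-order formulations of FM and FFM; the latent dimension `K`, the random initialization, and all function and variable names are hypothetical choices made for this example.

```python
import random

random.seed(0)
K = 4  # latent dimension (assumed)
FIELDS = ["WEEKDAY", "GENDER", "CITY"]
ACTIVE = {"WEEKDAY": "Tue", "GENDER": "Male", "CITY": "London"}


def rand_vec():
    return [random.uniform(-0.1, 0.1) for _ in range(K)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# FM: one latent vector per category.
fm_vecs = {cat: rand_vec() for cat in ACTIVE.values()}

# FFM: one latent vector per (category, partner field) pair.
ffm_vecs = {
    cat: {f: rand_vec() for f in FIELDS if f != field}
    for field, cat in ACTIVE.items()
}


def fm_interaction_sum(active, vecs):
    """Sum of <v_i, v_j> over all pairs of active categories."""
    items = list(active.items())
    total = 0.0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            total += dot(vecs[items[i][1]], vecs[items[j][1]])
    return total


def ffm_interaction_sum(active, vecs):
    """Sum of <v_{i,field(j)}, v_{j,field(i)}> over all pairs of active categories."""
    items = list(active.items())
    total = 0.0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (fi, ci), (fj, cj) = items[i], items[j]
            total += dot(vecs[ci][fj], vecs[cj][fi])
    return total


fm_params = sum(len(v) for v in fm_vecs.values())
ffm_params = sum(len(v) for per_field in ffm_vecs.values() for v in per_field.values())
```

With 3 fields, each category stores 2 vectors in FFM instead of 1, so the latent parameters grow with the number of fields; this is the space restriction mentioned above.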
Trees and DNNs are potent function approximators. Tree models, such as Gradient Boosting Decision Tree (GBDT) (Chen and Guestrin, 2016), are popular in various data science contests as well as industrial applications. GBDT explores very high-order feature combinations in a non-parametric way, yet its exploration ability is restricted when the feature space becomes extremely high-dimensional and sparse. DNN has also been preliminarily studied in information system literature (Zhang et al., 2016; Covington et al., 2016; Shan et al., 2016; Qu et al., 2016). In (Zhang et al., 2016), FM-supported Neural Network (FNN) is proposed. FNN has a pre-trained embedding layer and several fully connected layers. [Footnote 4: We use “latent vector” in shallow models, and “embedding vector” in DNN models.]
Since the embedding layer in effect performs a linear transformation, FNN mainly extracts linear information from the input. Inspired by (Shalev-Shwartz et al., 2017), we find an insensitive gradient issue: fully connected layers cannot fit such target functions perfectly. From the model decomposition perspective, the above models are restricted by inferior feature extractors or weak classifiers. Incorporating product operations into DNN, we propose Product-based Neural Network (PNN). PNN consists of an embedding layer, a product layer, and a DNN classifier.
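As a rough illustration of this three-part structure, the sketch below wires an embedding lookup, a pairwise inner-product layer, and a one-hidden-layer classifier into a single forward pass. It is not the authors' implementation: the layer sizes, the random weights, the choice to concatenate the embeddings with the products, and every name in the code are assumptions made for this example.

```python
import math
import random

random.seed(0)
K = 4        # embedding size (assumed)
HIDDEN = 8   # hidden units of the classifier (assumed)
VOCAB = {"WEEKDAY": 7, "GENDER": 2, "CITY": 3}  # hypothetical field sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Embedding layer: one table per field, one K-dimensional row per category.
embeddings = {field: rand_matrix(size, K) for field, size in VOCAB.items()}


def product_layer(vectors):
    """Inner products between every pair of field embeddings."""
    return [
        sum(a * b for a, b in zip(vectors[i], vectors[j]))
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]


def dense(x, weights, bias):
    """Fully connected layer without activation."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


n_fields = len(VOCAB)
n_pairs = n_fields * (n_fields - 1) // 2
in_dim = n_fields * K + n_pairs
W1, b1 = rand_matrix(HIDDEN, in_dim), [0.0] * HIDDEN
W2, b2 = rand_matrix(1, HIDDEN), [0.0]


def predict(instance):
    """`instance` maps each field name to a category index."""
    vectors = [embeddings[field][instance[field]] for field in VOCAB]
    # Assumption: the classifier sees the embeddings plus their pairwise products.
    features = [v for vec in vectors for v in vec] + product_layer(vectors)
    hidden = [max(0.0, h) for h in dense(features, W1, b1)]  # ReLU
    logit = dense(hidden, W2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid gives a click probability


p = predict({"WEEKDAY": 1, "GENDER": 0, "CITY": 0})
```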
The product layer serves as the feature extractor, which can make up for the deficiency of DNN in modeling feature interactions. We take FM, KFM, and NIFM as feature extractors in PNN, leading to Inner Product-based Neural Network (IPNN), Kernel Product-based Neural Network (KPNN), and Product-network In Network (PIN). CTR estimation is a fundamental task in personalized advertising and recommender systems, and we take CTR estimation as the working example to evaluate our models. Extensive experiments on 4 large-scale real-world datasets and 1 contest dataset demonstrate the consistent superiority of our models over 8 baselines (Lee et al., 2012; Rendle, 2010; Liu et al., 2015; Zhang et al., 2016; Juan et al., 2016; Guo et al., 2017; Chen and Guestrin, 2016; Xiao et al., 2017) on both AUC and log loss. Besides, PIN achieves a CTR improvement of 34.67% in an online A/B test. To sum up, our contributions can be highlighted as follows:
From the modeling perspective, linear Logistic Regression (LR) (Lee et al., 2012; Ren et al., 2016), bi-linear Factorization Machine (FM) (Rendle, 2010), and Gradient Boosting Decision Tree (GBDT) (He et al., 2014) are widely used in industrial applications. As illustrated in Fig. 2, LR extracts linear information from the input, FM further extracts bi-linear information, while GBDT explores feature combinations in a non-parametric way. From the training perspective, many adaptive optimization algorithms can speed up training on sparse data, including Follow the Regularized Leader (FTRL) (McMahan et al., 2013), Adaptive Moment Estimation (Adam) (Kingma and Ba, 2014), etc. These algorithms follow a per-coordinate learning rate scheme, making them converge much faster than stochastic gradient descent (SGD). From the representation perspective, latent vectors are expressive in representing categorical data.
In FM, the side information and user/item identifiers are represented by low-dimensional latent vectors, and the feature interactions are modeled as the inner product of latent vectors. As an extension of FM, Field-aware FM (FFM) (Juan et al., 2016) enables each category to have multiple latent vectors. From the classification perspective, powerful function approximators like GBDT and DNN are more suitable for continuous input. Therefore, in many contests, the winning solutions take FM/FFM as feature extractors to process discrete data, and use the latent vectors or interactions as the input of successive classifiers (e.g., GBDT, DNN).
According to model decomposition (Fig. 1), latent vector-based models make predictions simply based on the sum of interactions. This weakness motivates the DNN variants of latent vector-based models. The input to DNN is usually dense and numerical, while the case of multi-field categorical data has not been well studied.
FM-supported Neural Network (FNN) (Zhang et al., 2016) (Fig. 3a) has an embedding layer and a DNN classifier. Besides, FNN uses FM to pre-train the embedding layer. Other models use DNN to improve FM, e.g., Neural Collaborative Filtering (NCF) (He et al., 2017), Neural FM (NFM) (He and Chua, 2017), and Attentional FM (AFM) (Xiao et al., 2017). NCF uses DNN to solve the collaborative filtering problem. NFM extends NCF to more general recommendation scenarios. Based on NFM, AFM uses an attention mechanism to improve feature interactions, and becomes a state-of-the-art model.
Wide and Deep Learning (WDL) (Cheng et al., 2016) trains a wide model and a deep model jointly. The wide part uses LR to “memorize”, while the deep part uses DNN to “generalize”. Compared with single models, WDL achieves better AUC in offline evaluations and higher CTR in online A/B tests. WDL requires human effort for feature engineering on the input to the wide part, and thus is not end-to-end. DeepFM (Guo et al., 2017), as illustrated in Fig. 3c, can both utilize the strengths of WDL and avoid the need for expertise in feature engineering. It replaces the wide part of WDL with FM. Besides, the FM component and the deep component share the same embedding parameters. DeepFM is regarded as a state-of-the-art model in user response prediction.
Network In Network (NIN) (Lin et al., 2013) was originally proposed for CNN. NIN builds micro neural networks between convolutional layers to abstract the data within the receptive field. The multilayer perceptron, as a potent function approximator, is used in the micro neural networks of NIN. GoogLeNet (Szegedy et al., 2015) makes use of the micro neural networks suggested in (Lin et al., 2013) and achieves great success. NIN is powerful in modeling local dependencies. In this paper, we borrow the idea of NIN, and propose to explore inter-field feature interactions with flexible micro networks.
In user response prediction, the input data contains multiple fields, e.g., WEEKDAY, GENDER, CITY.
A field contains multiple categories and takes one category in each data instance. Table 1 shows 4 data examples, each of which contains 3 fields, and each field takes a single value. For example, a Male customer located in London buys some beer on Tuesday. From this record we can extract a useful feature combination: “Male and London and Tuesday implies True”. The efficacy of feature combinations (a.k.a. cross features) has already been proved (Menon et al., 2011; Rendle, 2010). In FM, the 2nd-order combinations are called feature interactions.
FM makes an implicit assumption that a field interacts with different fields in the same manner, which may not be realistic. Conversely, the latent vector $v_{\text{Tue}}$ is updated in the direction of $v_{\text{Male}}$. To summarize, FM uses the same latent vectors in different types of inter-field interactions, which is an over-simplification and degrades the model capacity. We call this problem a coupled gradient issue.
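To make the coupling concrete, the gradient can be written out under the standard second-order FM formulation (Rendle, 2010). The symbols $\phi$, $w_i$, and $x_i$, and the choice of an instance whose active categories are Male, London, and Tue, are notation assumed here for illustration.

```latex
% Second-order FM score for a one-hot input x, with x_i \in \{0, 1\}
\phi(x) = w_0 + \sum_{i} w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle \, x_i x_j

% Gradient of the score with respect to a single latent vector
\frac{\partial \phi}{\partial v_i} = x_i \sum_{j \neq i} x_j v_j

% Instance with active categories Male, London, Tue
\frac{\partial \phi}{\partial v_{\text{Male}}} = v_{\text{London}} + v_{\text{Tue}},
\qquad
\frac{\partial \phi}{\partial v_{\text{Tue}}} = v_{\text{Male}} + v_{\text{London}}
```

Because $v_{\text{Male}}$ is a single vector shared by its interactions with both CITY and WEEKDAY, its update mixes the latent vectors of both partner fields, and the same holds for $v_{\text{Tue}}$.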