BY-NC-ND 3.0 license Open Access Published by De Gruyter July 20, 2017

Analysis of the Use of Background Distribution for Naive Bayes Classifiers

  • Daniel Andrade, Akihiro Tamura and Masaaki Tsuchida

Abstract

The naive Bayes classifier is a popular classifier, as it is easy to train, requires no cross-validation for parameter tuning, and can be easily extended due to its generative model. Moreover, it was recently shown that the word probabilities (background distribution) estimated from large unlabeled corpora can be used to improve the parameter estimation of naive Bayes. However, previous methods do not explicitly allow controlling how much the background distribution can influence the estimation of the naive Bayes parameters. In contrast, we investigate an extension of the graphical model of naive Bayes such that a word is either generated from a background distribution or from a class-specific word distribution. We theoretically analyze this model and show the connection to Jelinek-Mercer smoothing. Experiments using four standard text classification data sets show that the proposed method can statistically significantly outperform previous methods that use the same background distribution.

MSC 2010: 62P99

1 Introduction

The naive Bayes classifier is still a popular method for classification, especially in text classification [23]. One advantage of the naive Bayes classifier is that it has the interpretation of a generative model that can be easily extended to model more complex relations [3] and can easily be extended for semisupervised learning [21].

The work in Ref. [20] showed that a background distribution (word uni-gram distribution) estimated from large unlabeled data can help to improve the naive Bayes classifier. However, as their method does not explicitly allow controlling how much the background distribution can influence the estimation of the naive Bayes parameters, their usage of the background distribution might be suboptimal.

In this work, we propose a different approach that explicitly models the background distribution by extending the generative model of naive Bayes. The proposed model assumes that any word in the document is sampled either from a class word distribution θz or from a background distribution γ. To decide whether a word is sampled from the distribution θz or from the distribution γ, we introduce a binary indicator variable d, one for each word in the document. Despite the simplicity of our model, our experiments show that our method can statistically significantly outperform previous methods such as [20] when the training data are small.

Furthermore, we theoretically interpret and analyze several properties of our model. One property of our proposed model suggests that our model’s usage of the background distribution has the effect of down-weighting high-frequency words like “a” or “the” (stop-words) and other irrelevant words (see Section 5.1). In particular, if the number of training data instances is small, these words might by chance occur more often in one class than in the other. As a consequence, irrelevant words are not spread evenly over all classes z and can degrade a classifier’s performance. Another property of our model shows that the background distribution becomes more necessary when the number of labeled data is small (see Section 5.2).[1]

The remainder of the paper is structured as follows. The next section describes related work, followed by Section 3, which briefly reviews naive Bayes. Next, in Sections 4 and 4.1, we explain our proposed model and how its hyperparameter can be learned using empirical Bayes. In Section 5, we prove some theoretical properties of our model. Finally, in Section 6, we present our experiments, and we conclude our work in Section 7.

2 Related Work

Our work is related to Jelinek-Mercer smoothing [as we will show in Equation (5)], which was proposed in the context of information retrieval [29]. In contrast, we theoretically analyze and experimentally evaluate the smoothing in the context of text classification. Furthermore, the work in Ref. [29] did not exploit the underlying probabilistic model for finding parameter estimates.

Feature weighting for naive Bayes can also be achieved by replacing the maximum likelihood (ML) estimate of the class conditional word probabilities by weighted frequency counts (tf-idf) and correlation-based feature selection [14, 16, 26]. Such ad hoc weighting schemes can be incorporated in our framework by using a word prior that is proportional to the desired weight.

The work in Ref. [21] proposed an extension of naive Bayes that models each class by a mixture of several word distributions (components) rather than only one word distribution θz. They assign to each class exclusively a fixed number of components. This way, the assumption that the documents of one class are generated from only one multinomial distribution is relaxed, and it is possible to model subtopics within one class. However, as one component is assigned to only one class, their model does not allow sharing word distributions across different classes. As a consequence, their model is not able to model words that occur frequently in documents independently of their class. The method in Ref. [21] forms one of our baselines.

The idea of sharing word distributions across different classes (topics) is realized in latent Dirichlet allocation (LDA) [1]. However, it is known that this standard topic model is negatively influenced by high-frequency words (common words), which are therefore filtered out manually or down-weighted with the use of priors [25].

Several works proposed to extend the standard topic model to explicitly model common words. To this end, as in our model, they use binary “switching variables” to include a background distribution. The extended model is applied to information retrieval [2], topic-aspect modeling [22], summarization [5], template extraction [19], and also text classification with labeled words instead of labeled documents [9]. However, to the best of our knowledge, we are the first to analyze the effects of combining this idea with a naive Bayes classifier and to compare it to other proposed extensions of naive Bayes.

The work in Ref. [8] proposed a different extension of naive Bayes that also makes use of a background distribution. Instead of modeling the word occurrence probability using a multinomial distribution, they model it by applying the soft-max function (exponential normalization) to adjusted log-word frequencies. For each class c, they estimate the log-frequency deviation from the log-background frequency distribution. Furthermore, to enforce sparsity, they place a Laplace prior on the log-frequency deviations. In Section 6, we will compare their method to our proposed method.

More recently, Su et al. [24] and Lucas and Downey [20] also proposed two semisupervised naive Bayes methods that use the background distribution. Both approaches reestimate the conditional word probabilities p(w|c) for a class c by using the background distribution p(w) estimated from unlabeled documents. Su et al. [24] estimated p(w|c) by using p(c|w) estimated only from the labeled documents and p(w) estimated from the unlabeled documents. In contrast, Lucas and Downey [20] optimized p(w|c) subject to the constraint that p(w) should equal p(w|c1)p(c1) + p(w|c2)p(c2), where p(w|c1) and p(w|c2) are the probabilities that a word comes from a document that has label c1 and label c2, respectively. Their experiments showed that their method is superior to the method in Ref. [24]. We therefore adapt the method in Ref. [20] as another baseline for our experiments in Section 6.

The naive Bayes classifier is a generative model (a Bayesian network) that makes the strong assumption that the features (words) are conditionally independent given the class of the document. A natural direction for improvement is therefore to relax these independence assumptions by considering also class conditional bi-gram probabilities as, for example, in Ref. [15] or learning a binary decision tree [27]. In contrast, our focus here is on extending the generative model to allow the incorporation of a background distribution estimated from unlabeled documents.

Departing from the generative model of the naive Bayes classifier, several other improvements to the naive Bayes classifier have been proposed: sample weighting [13], feature selection [30], and feature weighting [12, 31]. The latter is most relevant, as our proposed generative model also has the effect of feature weighting. In particular, in Ref. [31], the authors proposed the gain ratio weighted naive Bayes text classifier (GRWNB). [2] Their experiments show that GRWNB (combined with the multinomial naive Bayes classifier) is statistically significantly better than other state-of-the-art feature weighting approaches such as χ2 feature weighting. We therefore choose the GRWNB (with the binomial naive Bayes classifier) as another baseline in our experiments (see Section 6). However, we emphasize that, in contrast to all other baselines and our proposed method, GRWNB is not a valid generative model anymore.

3 Naive Bayes Model

Given the class z of a document, the naive Bayes model assumes that each word w in the document is independently generated from a distribution θw|z . A popular choice for this distribution is the categorical distribution. [3] Let us represent a text document t as (w1, …, wk) where wj is the word in the jth position of the document. Under this model, the joint probability of the document with given class z is

(1)  $p(w_1, \ldots, w_k \mid z, \theta_{\cdot|z}) = \prod_{j=1}^{k} \theta_{w_j|z},$

where $\theta_{\cdot|z}$ is the parameter vector of the categorical distribution, with $\sum_{w} \theta_{w|z} = 1$.

Let us denote by θ all parameters $\{\theta_{w|z}\}_{w,z}$, for all words w and all classes z. Given a collection of texts with known classes D = {(t1, z1), …, (tn, zn)}, we can estimate the parameters θw|z by

$\arg\max_{\theta} p(\theta \mid D) = \arg\max_{\theta} p(\theta)\, p(D \mid \theta) = \arg\max_{\theta} p(\theta) \prod_{i=1}^{n} p(t_i, z_i \mid \theta) = \arg\max_{\theta} p(\theta) \prod_{i=1}^{n} p(t_i \mid z_i, \theta),$

using the usual iid-assumption, and that zi is independent from θ. Furthermore, using Equation (1), we get [4]

$\arg\max_{\theta} p(\theta \mid D) = \arg\max_{\theta} p(\theta) \prod_{i=1}^{n} \prod_{j=1}^{k} \theta_{w_j|z_i}.$

For simplicity, let us assume that p(θ) is constant, and then the above expression is maximized by

(2)  $\theta_{w|z} = \dfrac{N_{w,z}}{\sum_{w \in V} N_{w,z}},$

where we define $N_{w,z}$ as the number of times word w occurs in documents belonging to topic z, and V denotes the set of words in the corpus. Finally, assuming some class probability p(z), a new document is classified using $\arg\max_z p(z \mid w_1, \ldots, w_k)$. The class probability p(z) is set using the ML estimate.
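As an illustration only, the estimation and classification steps above can be sketched in a few lines of Python (hypothetical function and variable names; no smoothing, so unseen words receive probability zero):

```python
# Minimal sketch of the estimates in Section 3 (hypothetical names; no smoothing).
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels."""
    word_counts = defaultdict(Counter)            # N_{w,z}
    class_counts = Counter(labels)
    for tokens, z in zip(docs, labels):
        word_counts[z].update(tokens)
    p_z = {z: c / len(docs) for z, c in class_counts.items()}   # ML estimate of p(z)
    theta = {}
    for z, counts in word_counts.items():
        total = sum(counts.values())              # sum_w N_{w,z}
        theta[z] = {w: c / total for w, c in counts.items()}    # Equation (2)
    return p_z, theta

def classify_nb(tokens, p_z, theta):
    """argmax_z p(z) * prod_j theta_{w_j|z}, computed in log space."""
    best, best_score = None, float("-inf")
    for z in p_z:
        score = math.log(p_z[z])
        for w in tokens:
            pw = theta[z].get(w, 0.0)             # unseen words get probability 0 here
            if pw == 0.0:
                score = float("-inf")
                break
            score += math.log(pw)
        if score > best_score:
            best, best_score = z, score
    return best
```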

4 Proposed Extension of Naive Bayes Generative Model

We now describe our extension of the naive Bayes model as displayed in Figure 1. Under the proposed model, the joint probability of the text document with words w1, …, wk, hidden variables d1, …, dk and class z is

(3)  $p(w_1, \ldots, w_k, d_1, \ldots, d_k, z \mid \delta, \gamma, \theta) = p(z) \prod_{j=1}^{k} p(w_j \mid z, d_j, \gamma, \theta)\, p(d_j \mid \delta),$

where the word probability p(w|z,d,γ,θ) is defined as follows:

$p(w \mid z, d, \gamma, \theta) = \begin{cases} \theta_{w|z} & \text{if } d = 1, \\ \gamma_w & \text{if } d = 0. \end{cases}$

Figure 1: Proposed Model in Plate Notation.

The variables dj are binary random variables that indicate whether the word wj is drawn from the class word distribution θw|z or from the background distribution γ. The distributions θw|z and γ are two categorical distributions with $\sum_{w \in V} \theta_{w|z} = 1$ and $\sum_{w \in V} \gamma_w = 1$. The variables dj are hidden variables that cannot be observed from the training documents. To acquire the probability of a training document (w1, …, wk, z), we sum over all values for d1, …, dk, leading to

$p(w_1, \ldots, w_k, z) = \sum_{d_1, \ldots, d_k} p(z) \prod_{j=1}^{k} p(w_j \mid z, d_j)\, p(d_j) = p(z) \prod_{j=1}^{k} \sum_{d_j} p(w_j \mid z, d_j)\, p(d_j),$

where, to make the notation more compact, we skipped the conditioning of p(wj|z,dj) on γ and θ and p(dj) on δ, respectively.

We assume that the prior probability p(dj) is independent from the class of the document and independent from the word position j. Therefore, we define δ:=p(dj=1), which is constant for all words. This way, the joint probability of the document with class z, i.e. p(w1, …, wk, z), can be expressed as follows:

(4)  $p(z) \prod_{j=1}^{k} \bigl((1-\delta)\gamma_{w_j} + \delta\,\theta_{w_j|z}\bigr).$

For a class z, we estimate the word distribution θw|z as before using Equation (2). The class prior p(z) is also estimated in the same way as for the naive Bayes classifier.

Finally, to classify a new document w1, …, wk, we use

$p(z \mid w_1, \ldots, w_k) \propto p(z) \prod_{j=1}^{k} \bigl((1-\delta)\gamma_{w_j} + \delta\,\theta_{w_j|z}\bigr).$

We set the background distribution γ to the word frequency distribution of the whole corpus (labeled+unlabeled documents).

We note that, with this choice of the background distribution, the result is actually identical to a naive Bayes model in which the class-dependent word probabilities $\theta_{w|z}$ are replaced by

(5)  $\theta'_{w|z} := (1-\delta)\gamma_w + \delta\,\theta_{w|z},$

which corresponds to Jelinek-Mercer smoothing [29].
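For illustration, a minimal sketch of the resulting classification rule in Equations (4) and (5) might look as follows (hypothetical names; `gamma` is the corpus-wide word frequency distribution and `delta` a value in [0, 1]; we assume every word of the document occurs somewhere in the corpus, so the mixture probability is positive):

```python
import math

def classify_smoothed(tokens, p_z, theta, gamma, delta):
    """Sketch of Equation (4): argmax_z p(z) * prod_j ((1-delta)*gamma_w + delta*theta_{w|z})."""
    best, best_score = None, float("-inf")
    for z in p_z:
        score = math.log(p_z[z])
        for w in tokens:
            # Jelinek-Mercer style mixture of Equation (5); assumes gamma_w > 0 for corpus words.
            pw = (1.0 - delta) * gamma.get(w, 0.0) + delta * theta[z].get(w, 0.0)
            score += math.log(pw)
        if score > best_score:
            best, best_score = z, score
    return best
```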

In the next section, we will discuss how the hyperparameter δ can be learned.

4.1 Learning δ

Note that δ lies in the interval [0, 1]. If δ is 1, the model reduces to the original naive Bayes classifier. If δ is 0, the model does not use the class word distribution θ at all.

Let D′ = {t1, …, tn} be the collection of all documents. [5] We suggest setting δ such that, if there are many words in D′ that can be better explained by the background distribution than by any θw|z, the parameter δ is closer to 0. We can achieve this by choosing the parameter δ* that maximizes the data probability p(D′) under our proposed model for fixed parameters θw|z and γ. This means that $\delta^* := \arg\max_\delta p(D' \mid \delta, \gamma)$, which equals

$\arg\max_{\delta} \prod_{i=1}^{n} \sum_{z_i} p(z_i) \prod_{j=1}^{k_i} \bigl((1-\delta)\gamma_{w_j} + \delta\,\theta_{w_j|z_i}\bigr).$

To find an approximate solution to this problem, we use the EM algorithm, considering all class labels zi and all indicator variables dj as unobserved. Instead of maximizing directly the log likelihood log p(D′|δ,γ), the EM algorithm maximizes in each step the following function:

$E_q[\log p(D', h \mid \delta, \gamma)] = \sum_{h} q(h) \log p(D', h \mid \delta, \gamma),$

where h corresponds to the hidden variables zi and dj, for all documents i and all words j; q is a probability distribution over these hidden variables. Starting from an initial δ, for example 0.5, we iterate the following two steps:

  • E-step: Setting q to p(h|D′, δ, γ).

  • M-step: Finding a new δ*, which maximizes Eq[log p(D′, h|δ, γ)].

The process repeats with the E-step, setting δ to δ*. In each step, δ* is guaranteed to increase the objective function log p(D′|δ) until convergence [6]. The δ* that maximizes Eq[log p(D′, h|δ, γ)] is given by

(6)  $\delta^* = \dfrac{\sum_{i=1}^{n} \sum_{j=1}^{k_i} q_j(d_j = 1)}{\sum_{i=1}^{n} \bigl(\sum_{j=1}^{k_i} q_j(d_j = 0) + \sum_{j=1}^{k_i} q_j(d_j = 1)\bigr)},$

where the denominator equals N′, the total number of words (or, more precisely, word occurrences) in the corpus. For the E-step, we need to calculate qj, which is

(7)  $q_j(d_j) \propto \sum_{z} p(z)\, p(d_j \mid \delta)\, p(w_j \mid d_j, z, \gamma) \prod_{j' \neq j} \sum_{d_{j'}} p(d_{j'} \mid \delta)\, p(w_{j'} \mid d_{j'}, z, \gamma),$

where j′ ≠ j denotes all word indices from 1 to k, except j. Note that, if the document is a labeled document (training document) with class z′, we do not need to sum over all z but instead use only z′ in the above equation. Equations (6) and (7) are derived in Appendix A. In practice, the above estimate will depend on the ratio of labeled and unlabeled documents. In the extreme case, where there are no unlabeled documents, the estimate of δ will become 1, as we will show in Section 5.3. Conversely, we found experimentally that, if the number of unlabeled documents is much larger than the labeled data, the estimate of δ will tend to 0. To make the estimate less dependent on the ratio, we therefore replace Equation (6) by

$\delta^* = \dfrac{\sum_{i=1}^{n} \lambda(i) \sum_{j=1}^{k_i} q_j(d_j = 1)}{\sum_{i=1}^{n} \lambda(i)\,\bigl(\sum_{j=1}^{k_i} q_j(d_j = 0) + \sum_{j=1}^{k_i} q_j(d_j = 1)\bigr)},$

where λ(i) is set to $N_l / N_u$ if document i is unlabeled and to 1 if document i is labeled; $N_l$ is the total number of words in the labeled data and $N_u$ is the total number of words in the unlabeled data.
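A minimal sketch of this EM procedure, including the λ(i) reweighting, might look as follows (hypothetical names; it assumes γw > 0 for every corpus word and smoothed class word distributions, so no mixture term vanishes). For the E-step, the per-word posterior qj(dj = 1) implied by Equation (7) is computed by weighting the per-class ratio δθw|z / ((1−δ)γw + δθw|z) with the class responsibilities of the document (a point mass on the known class for labeled documents):

```python
import math

def learn_delta(labeled, unlabeled, p_z, theta, gamma, delta=0.5, tol=1e-4):
    """EM sketch for the mixing weight delta (hypothetical names).
    labeled: list of (tokens, class) pairs; unlabeled: list of token lists.
    Assumes gamma[w] > 0 for every corpus word and smoothed theta (or delta < 1)."""
    n_l = sum(len(t) for t, _ in labeled)            # N_l: words in the labeled data
    n_u = sum(len(t) for t in unlabeled) or 1        # N_u: words in the unlabeled data
    docs = ([(t, z, 1.0) for t, z in labeled]        # lambda(i) = 1 for labeled documents
            + [(t, None, n_l / n_u) for t in unlabeled])   # lambda(i) = N_l / N_u otherwise
    while True:
        num = den = 0.0
        for tokens, label, lam in docs:
            if label is not None:                    # class responsibilities r_z
                resp = {label: 1.0}
            else:                                    # class posterior via Equation (4)
                logs = {}
                for z in p_z:
                    s = math.log(p_z[z])
                    for w in tokens:
                        s += math.log((1 - delta) * gamma.get(w, 0.0)
                                      + delta * theta[z].get(w, 0.0))
                    logs[z] = s
                m = max(logs.values())
                tot = sum(math.exp(v - m) for v in logs.values())
                resp = {z: math.exp(v - m) / tot for z, v in logs.items()}
            for w in tokens:                         # E-step: q_j(d_j = 1) per word
                q1 = sum(r * delta * theta[z].get(w, 0.0)
                         / ((1 - delta) * gamma.get(w, 0.0) + delta * theta[z].get(w, 0.0))
                         for z, r in resp.items())
                num += lam * q1
                den += lam
        new_delta = num / den                        # M-step: weighted form of Equation (6)
        if abs(new_delta - delta) < tol:
            return new_delta
        delta = new_delta
```

With no unlabeled documents the update drives δ toward 1, matching the observation above and Section 5.3.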

5 Theoretical Analysis

In this section, we investigate the effect of the background distribution γ (see Section 5.1), the divergence of the background distribution estimated from the labeled and unlabeled data (see Section 5.2), and the effect of our model on the (labeled) training data likelihood (see Section 5.3).

5.1 Inverse Feature Weighting by Background Distribution

We argue that the background distribution γ has the effect of a feature weighting, where a low probability under the background distribution corresponds to a high feature weight.

First, recall that we used $\arg\max_z p(z \mid t)$ as our decision criterion for classifying a text document t. To choose class z1, it is only important that p(z1|t) is larger than p(z2|t) for any other class z2. The decision criterion is therefore whether the log ratio

$\log \dfrac{p(z_1 \mid t)}{p(z_2 \mid t)}$

is larger than 0 for any other class z2. Using Equation (4), we can rewrite this ratio as

$\log \dfrac{p(z_1) \prod_{j=1}^{k} \bigl((1-\delta)\gamma_{w_j} + \delta\,\theta_{w_j|z_1}\bigr)}{p(z_2) \prod_{j=1}^{k} \bigl((1-\delta)\gamma_{w_j} + \delta\,\theta_{w_j|z_2}\bigr)} = \log \dfrac{p(z_1)}{p(z_2)} + \sum_{j=1}^{k} \log g_{w_j},$

where we define $g_w := \dfrac{(1-\delta)\gamma_w + \delta\,\theta_{w|z_1}}{(1-\delta)\gamma_w + \delta\,\theta_{w|z_2}}$. We can see that the weight of a word w for deciding between class z1 and class z2 is $|\log g_w|$. The weight of word w is a function of $\gamma_w$; to make this explicit, we write $|\log g_w(\gamma_w)|$.

For the case where $\theta_{w|z_1} \neq \theta_{w|z_2}$, we can prove (see Appendix B) that

(8)  $|\log g_w(b)| > |\log g_w(a)| \;\Leftrightarrow\; b < a,$

where a and b are two different values for γw. This shows that increasing γw results in decreasing the word’s weight for classification.
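This monotonic behavior is easy to check numerically; a small sketch with made-up values for δ, θw|z1, and θw|z2 (all hypothetical):

```python
import math

def weight(gamma_w, delta=0.5, theta1=0.02, theta2=0.001):
    """|log g_w(gamma_w)| from Section 5.1, for made-up values of delta and theta_{w|z1}, theta_{w|z2}."""
    g = ((1 - delta) * gamma_w + delta * theta1) / ((1 - delta) * gamma_w + delta * theta2)
    return abs(math.log(g))

# Increasing the background probability gamma_w decreases the word's weight, as stated in (8):
for gamma_w in (0.0001, 0.001, 0.01, 0.1):
    print(gamma_w, round(weight(gamma_w), 3))
```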

5.2 Divergence of Estimated Background Distributions

As before, we assume that the background distribution γw was estimated from the whole corpus (labeled+unlabeled documents). Let us denote by p˜(w) the estimate of γw that only uses the labeled documents. In this subsection, we show that the larger the difference between p˜(w) and γw is, the higher (in expectation) is the probability that a word will be drawn from γw rather than from any class word distribution θw|z .

Formally, we show that

(9)  $D_{KL}(\gamma \,\|\, \tilde{p}_A) < D_{KL}(\gamma \,\|\, \tilde{p}_B) \;\Leftrightarrow\; E_\gamma\!\left[\log \dfrac{p(d_j = 0 \mid w_j, \tilde{\theta}_z^A)}{p(d_j = 1 \mid w_j, \tilde{\theta}_z^A)}\right] < E_\gamma\!\left[\log \dfrac{p(d_j = 0 \mid w_j, \tilde{\theta}_z^B)}{p(d_j = 1 \mid w_j, \tilde{\theta}_z^B)}\right],$

where $\tilde{\theta}_{w|z}^A$ and $\tilde{\theta}_{w|z}^B$ are the two estimates of p(w|z) based on two different (labeled) training data sets A and B; $\tilde{p}_A(w)$ and $\tilde{p}_B(w)$ are the corresponding estimates of p(w), i.e.

$\tilde{p}_A(w) = \sum_{z} \tilde{\theta}_{w|z}^A\, p(z)$, and $\tilde{p}_B(w) = \sum_{z} \tilde{\theta}_{w|z}^B\, p(z)$.

The proof is deferred to Appendix C. We can think of $\tilde{p}_A$ and $\tilde{p}_B$ as two estimates of γw that use (labeled) training data sets of different sizes. In general, the estimate from the larger training set will be closer to γw. Therefore, we can interpret statement (9) as saying that, for larger training sets, we can expect the need for the background distribution to drop.

This also draws an interesting parallel to the work in Refs. [24] and [20], which both try to use γw to make up for the bad estimate of p˜(w), when the training data are small.

5.3 No Lower Training Data Likelihood

It is interesting to see that the background distribution in our model does not help to better explain the (labeled) training data, in the sense that our model does not achieve a higher training data likelihood than naive Bayes. Therefore, like naive Bayes, our model does not additionally overfit the training data, and the contribution of the background distribution comes solely from being able to better explain the unlabeled data. To prove this, we consider the δ that is optimal with respect to the log likelihood of the training data, i.e. $\delta^* = \arg\max_\delta \log p(D \mid \delta)$, which equals

$\arg\max_{\delta} \sum_{i=1}^{n} \sum_{j=1}^{k_i} \log\bigl(\gamma_{w_j} + \delta\,(\theta_{w_j|z_i} - \gamma_{w_j})\bigr).$

The first derivative of the above term with respect to δ is

$A(\delta) := \sum_{i=1}^{n} \sum_{j=1}^{k_i} \dfrac{\theta_{w_j|z_i} - \gamma_{w_j}}{\gamma_{w_j} + \delta\,(\theta_{w_j|z_i} - \gamma_{w_j})}.$

Furthermore, assuming that θw|z is not identical to γw, we can see that the second derivative of this objective is strictly negative. Therefore, the objective is concave in δ, and there is exactly one value δ* at which it reaches its maximum, characterized by A(δ*) = 0. As we show in Appendix D, A(δ) = 0 holds for δ = 1, so the maximum is reached at δ* = 1. Recall that our model is identical to naive Bayes if δ equals 1.

6 Experiments

For our experiments, we used four standard corpora, 20 Newsgroups, Ohsumed, RCV1, and Reuters-21578, written in English. For the corpus 20 Newsgroups, we used the standard split of training and test data as suggested in http://people.csail.mit.edu/jrennie/20newsgroups/ with a total of 18,846 documents. For the corpus Ohsumed [10], we also used the standard split of training and test data with a total of 13,929 documents. RCV1 refers to the RCV1 corpus, with a total of 806,422 documents, as described in Ref. [18]. The Reuters-21578, described in Ref. [28], contains a total of 10,369 documents. [6] For 20 Newsgroups, we used all 20 classes, with the class proportions ranging from 3.3% to 5.3%. For Ohsumed, we used all 23 classes, with the class proportions ranging from 1.0% to 28.6%. For RCV1 and Reuters-21578, we used the top 5 classes as shown in Table 1.

Table 1:

Class Label Proportion for the Top 5 Classes in RCV1 (Left) and Reuters-21578 (Right).

RCV1                | Reuters-21578
Class | Proportion  | Class    | Proportion
CCAT  | 46.4%       | Earn     | 36.4%
GCAT  | 29.1%       | Acq      | 21.3%
MCAT  | 24.8%       | Money-fx | 6.6%
ECAT  | 14.6%       | Crude    | 5.4%
GPOL  | 7.1%        | Trade    | 5.0%

We preprocessed the corpora using tokenization and stemming with Senna [4].

For 20 Newsgroups and Ohsumed, we used the whole test data. For RCV1 and Reuters-21578, we used as test data a random sample of 4000 documents of the whole corpus. For all four corpora, we used as training data a random sample of 50, 100, 300, 500, and 1000 documents of the remaining corpus.[7] We ensure that the randomly sampled set of training data contains at least one document of each class.

We did not use a stop-word list, as it is partly domain dependent and therefore needs to be adjusted manually. We used five baseline methods: feature marginals-naive Bayes (FM-NB) [20], SAGE-naive Bayes (SAGE-NB) [8], EM-naive Bayes (EM-NB) [21], naive Bayes, which is identical to our method when δ is set to 1.0, and GRWNB [31].

For GRWNB, we used Algorithm 1 of Ref. [31] with the binomial naive Bayes classifier.

For EM-NB, the hyperparameter is determined using two-fold cross-validation on the training data. For FM-NB, we also tried replacing the Laplace smoothing with the Dirichlet prior but did not get any improvement. [8] For SAGE-NB, we used the original implementation. [9]

For our experimental setting, we assume that we have access only to the class label information from the training data and that we know all the test instances but without class label information. This experimental setting is the same as the one used in the original work on FM-NB [20]. [10] The estimation of the background distribution γw, which is used by our method and also by the baseline methods FM-NB and SAGE-NB, is calculated using the whole corpus. For our proposed method and the naive Bayes baseline, we placed a uniform Dirichlet prior over θ to prevent zero word probabilities. Prior counts for each word, corresponding to the Dirichlet prior, are set to $1/|V'|$, where V′ is the vocabulary of the whole corpus. For our proposed method, we learn the hyperparameter δ using EM as described in Section 4.1 and stop if the change of δ is less than 0.0001.
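The uniform Dirichlet prior mentioned above amounts to adding a pseudo-count of 1/|V′| to every word count before normalizing; a small sketch (hypothetical names, not the exact implementation used in the experiments):

```python
def smoothed_theta(word_counts, vocab):
    """Sketch: class word distributions with a uniform Dirichlet prior of 1/|V'| per word.
    word_counts[z] maps words to counts N_{w,z}; vocab is V', the corpus vocabulary."""
    prior = 1.0 / len(vocab)
    theta = {}
    for z, counts in word_counts.items():
        total = sum(counts.values()) + 1.0   # the prior adds 1/|V'| per word, i.e. 1 in total
        theta[z] = {w: (counts.get(w, 0) + prior) / total for w in vocab}
    return theta
```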

For evaluation, we used the break-even point of precision and recall as defined in Ref. [17].[11] The results of the (macro-averaged) break-even points are shown in Table 2.

Table 2:

(Macro-averaged) Break-even Point for the Proposed Method and Previous Methods using Four Different Corpora Trained with 50, 100, 300, 500, and 1000 Training Documents. Bold Font Marks Highest Break-even Point. Asterisk Marks Statistical Significance using Pairwise Comparison.

Method      | Size 50 | Size 100 | Size 300 | Size 500 | Size 1000
20 Newsgroups
Proposed    | 0.218*  | 0.294*   | 0.468    | 0.530    | 0.544
FM-NB       | 0.092   | 0.121    | 0.245    | 0.319    | 0.409
SAGE-NB     | 0.120   | 0.190    | 0.289    | 0.342    | 0.426
EM-NB       | 0.096   | 0.128    | 0.308    | 0.438    | 0.569
Naive Bayes | 0.097   | 0.137    | 0.255    | 0.334    | 0.431
GRWNB       | 0.167   | 0.272    | 0.481*   | 0.532    | 0.597
RCV1
Proposed    | 0.569*  | 0.609*   | 0.716    | 0.743    | 0.770
FM-NB       | 0.529   | 0.548    | 0.704    | 0.730    | 0.756
SAGE-NB     | 0.545   | 0.572    | 0.710    | 0.712    | 0.724
EM-NB       | 0.467   | 0.493    | 0.667    | 0.712    | 0.772
Naive Bayes | 0.465   | 0.481    | 0.587    | 0.618    | 0.686
GRWNB       | 0.523   | 0.505    | 0.669    | 0.695    | 0.728
Reuters-21578
Proposed    | 0.622*  | 0.713*   | 0.757    | 0.759    | 0.765
FM-NB       | 0.572   | 0.683    | 0.760    | 0.768    | 0.787
SAGE-NB     | 0.571   | 0.682    | 0.763    | 0.724    | 0.740
EM-NB       | 0.441   | 0.575    | 0.733    | 0.744    | 0.793
Naive Bayes | 0.436   | 0.504    | 0.584    | 0.678    | 0.720
GRWNB       | 0.603   | 0.681    | 0.722    | 0.746    | 0.739
Ohsumed
Proposed    | 0.194   | 0.260    | 0.362    | 0.400    | 0.467
FM-NB       | 0.135   | 0.163    | 0.217    | 0.263    | 0.319
SAGE-NB     | 0.188   | 0.240    | 0.351    | 0.404*   | 0.479*
EM-NB       | 0.189   | 0.236    | 0.319    | 0.363    | 0.437
Naive Bayes | 0.132   | 0.150    | 0.189    | 0.228    | 0.288
GRWNB       | 0.196*  | 0.257    | 0.336    | 0.384    | 0.456

Furthermore, we performed a pairwise comparison between all methods using the micro sign-test (see, for example, Ref. [28]) with the decision boundary induced by the break-even point. If method A was better than every other method B with p<0.01, we considered the result statistically significant and marked it with an asterisk in Table 2.

As we can see in Table 2, our proposed method statistically significantly improves over other proposed semisupervised naive Bayes methods when the training data are small (≤100).

In addition to the micro sign-test for one classification task, we also test for statistical significance across all classification tasks (i.e. different class labels and different corpora). We follow the guidelines in Ref. [7] to test whether the proposed method performs statistically significantly better than the other NB methods across different classification tasks. First, we use the variation of the Friedman test proposed by Iman and Davenport [11] to check the null hypothesis that all classifiers perform equally. We use the Friedman test to compare the average rankings with respect to the break-even points, which are listed in Table 3 for different training data sizes. The resulting test statistics for training data sizes 50, 100, 300, 500, and 1000 are 36.4, 51.3, 49.1, 36.6, and 25.8, respectively. The test statistic is F-distributed with 6−1=5 and (6−1)·(53−1)=260 degrees of freedom. For all training data sizes (50, 100, 300, 500, and 1000), the test statistic is larger than the critical value of the corresponding F-distribution at significance level 0.05. We can therefore reject the null hypothesis. As a post hoc test, we proceed with the Bonferroni-Dunn test with one control. This can be tested by comparing the average rank differences as explained in Ref. [7]. The critical difference (CD) for significance at p<0.05 is 0.94, and for p<0.10 it is 0.85. Overall, we can conclude that our proposed method improves classification performance for small training data sizes (50–300). In particular, our proposed method improves statistically significantly at p<0.05 over all other generative naive Bayes methods that use the background distribution (i.e. FM-NB and SAGE-NB).

Table 3:

Rankings of the Proposed Method and All Baselines for the Five Experimental Settings (Training Data Size in {50, 100, 300, 500, 1000}) Over All Corpora and All Binary Classifications Tasks. The CD Values for the Post Hoc Bonferroni-Dunn Test are 0.94 (p<0.05) and 0.85 (p<0.10). Bold Font Marks Highest Ranking. Asterisk Marks the Statistical Significance at p<0.10 (for Details, see Text).

Method      | Size 50 | Size 100 | Size 300 | Size 500 | Size 1000
Proposed    | 2.11    | 1.88     | 1.89*    | 2.07     | 2.49
FM-NB       | 4.66    | 4.90     | 4.71     | 4.56     | 4.52
SAGE-NB     | 3.61    | 3.58     | 3.08     | 3.25     | 3.19
EM-NB       | 2.57    | 2.66     | 3.18     | 3.09     | 2.91
Naive Bayes | 5.14    | 5.20     | 5.40     | 5.30     | 5.15
GRWNB       | 2.91    | 2.78     | 2.75     | 2.74     | 2.75
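For reference, a sketch of the Friedman/Iman-Davenport test statistic used above (assuming SciPy is available; `beps` is a hypothetical tasks-by-methods array of break-even points, which is not reproduced here):

```python
from scipy import stats

def iman_davenport(beps):
    """beps: 2-D array (n_tasks x n_methods) of break-even points, one column per method.
    Returns the Iman-Davenport F statistic and its degrees of freedom."""
    n, k = beps.shape
    # Friedman chi-square computed from the per-task rankings of the k methods.
    chi2, _ = stats.friedmanchisquare(*[beps[:, j] for j in range(k)])
    # Iman-Davenport correction: F-distributed with (k-1) and (k-1)*(n-1) degrees of freedom.
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return f_stat, (k - 1, (k - 1) * (n - 1))
```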

Finally, we inspect some of the δ’s that were actually learned by our proposed method. Recall that the hyperparameter δ denotes the probability that a word is generated from a class word distribution θw|z rather than from the background distribution γw. In general, we should expect that, for small labeled training data, the distribution p˜(w) that is induced by θw|z gives a bad estimate of γw. Consequently, as we have proven in Section 5.2, we expect that, for small labeled training data, the use of the background distribution γ becomes more important, leading to a small value for δ. Indeed, inspecting some of the values for δ in Table 4, we can experimentally confirm this.

Table 4:

Actual δ Values that were Learned with Empirical Bayes when Using Labeled Training Data of Size 100 and 1000, Respectively, Together with all Unlabeled Documents.

Corpus (class)           | Size 100 | Size 1000
Newsgroups (alt.atheism) | 0.568    | 0.577
RCV1 (CCAT)              | 0.603    | 0.711
Reuters-21578 (acq)      | 0.635    | 0.777
Ohsumed (C01)            | 0.526    | 0.532

7 Conclusions

We analyzed a generative model for classification that models a word w being generated either from a class word distribution θw|z or from a background distribution γw. The decision whether a word is sampled from γw or θw|z is performed individually for each word by a Bernoulli trial with parameter δ. We explained that the hyperparameter δ can be learned using empirical Bayes (see Section 4.1).

Although similar models have been proposed in the past (see Section 2), we are the first to analyze and evaluate its usage in the context of text classification. Our theoretical and experimental analysis suggests that γw helps to model the misfit of the unlabeled data to each class word distribution and can also be interpreted as a kind of feature weighting (see Section 5.1).

Furthermore, our experiments in Section 6, using four standard text classification corpora, showed that the proposed method can statistically significantly outperform previous NB extensions such as [20] and [8] that also incorporate the background distribution. Based on these results, we therefore think that there is strong evidence that a generative model for text classification should incorporate the background distribution as suggested in this paper.

Acknowledgments

We would like to thank the four anonymous reviewers for their helpful comments and corrections.

Appendix

A Derivation of EM Equations

In this appendix, we derive Equations (6) and (7). Let f(δ):=Eq[log p(D′, h|δ,γ)], where q is set to p(h|D′,δ,γ). Then, we have

$f(\delta) = E_q\!\left[\sum_{i=1}^{n} \log p(t_i, h_i \mid \delta, \gamma)\right] = \sum_{i=1}^{n} E_q[\log p(t_i, h_i \mid \delta, \gamma)] = \sum_{i=1}^{n} E_{q_i}[\log p(t_i, h_i \mid \delta, \gamma)],$

where hi are the hidden variables for document i, and qi corresponds to the marginal distribution p(hi|D′, δ, γ). Note that the first equation holds, as we assume that each document is generated independently given all model parameters. Next, using $p(t_i, h_i \mid \delta, \gamma) = p(z_i) \prod_{j=1}^{k_i} p(d_j \mid \delta)\, p(w_j \mid d_j, z_i, \gamma)$, we get

$f(\delta) = \sum_{i=1}^{n} E_{q_i}\!\left[\log p(z_i) + \log \prod_{j=1}^{k_i} p(d_j)\, p(w_j \mid d_j, z_i)\right]$
$= \sum_{i=1}^{n} E_{q_i}[\log p(z_i)] + \sum_{i=1}^{n} E_{q_i}\!\left[\log \prod_{j=1}^{k_i} p(d_j)\, p(w_j \mid d_j, z_i)\right]$
$= C + \sum_{i=1}^{n} E_{q_i}\!\left[\log \prod_{j=1}^{k_i} p(d_j)\, p(w_j \mid d_j, z_i)\right]$
$= C + \sum_{i=1}^{n} E_{q_i}\!\left[\sum_{j=1}^{k_i} \log p(d_j) + \sum_{j=1}^{k_i} \log p(w_j \mid d_j, z_i)\right]$
$= C + \sum_{i=1}^{n} E_{q_i}\!\left[\sum_{j=1}^{k_i} \log p(d_j)\right] + \sum_{i=1}^{n} E_{q_i}\!\left[\sum_{j=1}^{k_i} \log p(w_j \mid d_j, z_i)\right].$

The first term, as well as the last term, is a constant, as we consider q and all other parameters, except δ fixed. Therefore, we have for some constant C′:

$f(\delta) = \sum_{i=1}^{n} E_{q_i}\!\left[\sum_{j=1}^{k_i} \log p(d_j)\right] + C'$
$= \sum_{i=1}^{n} \sum_{j=1}^{k_i} E_{q_i}[\log p(d_j)] + C'$
$= \sum_{i=1}^{n} \sum_{j=1}^{k_i} E_{q_j}[\log p(d_j)] + C'$
$= \sum_{i=1}^{n} \sum_{j=1}^{k_i} \sum_{d_j} q_j(d_j) \log p(d_j) + C'$
$= \sum_{i=1}^{n} \sum_{j=1}^{k_i} \bigl(q_j(d_j = 0) \log(1-\delta) + q_j(d_j = 1) \log \delta\bigr) + C',$

where we note the change from $q_i$ to $q_j$ in the third line; $q_j$ corresponds to the marginal distribution $q(d_j)$. In the final line, we used that $\delta = p(d_j = 1)$ for all j. Setting $\frac{d f(\delta)}{d\delta}$ to zero, we get that $f(\delta)$ is maximized by

$\delta^* = \dfrac{\sum_{i=1}^{n} \sum_{j=1}^{k_i} q_j(d_j = 1)}{\sum_{i=1}^{n} \bigl(\sum_{j=1}^{k_i} q_j(d_j = 0) + \sum_{j=1}^{k_i} q_j(d_j = 1)\bigr)},$

where the denominator equals N′, the total number of words (or more precisely, word occurrences) in the corpus. For the E-step, we need to calculate qj, which is

$q_j(d_j) = p(d_j \mid D', \delta, \gamma) = p(d_j \mid w_1, \ldots, w_k, \delta, \gamma) \propto p(d_j, w_1, \ldots, w_k \mid \delta, \gamma)$
$= \sum_{z} \sum_{d_1} \cdots \sum_{d_{j-1}} \sum_{d_{j+1}} \cdots \sum_{d_k} p(z) \prod_{j'=1}^{k} p(d_{j'} \mid \delta)\, p(w_{j'} \mid d_{j'}, z, \gamma)$
$= \sum_{z} p(z)\, p(d_j \mid \delta)\, p(w_j \mid d_j, z, \gamma) \sum_{d_1} \cdots \sum_{d_{j-1}} \sum_{d_{j+1}} \cdots \sum_{d_k} \prod_{j' \neq j} p(d_{j'} \mid \delta)\, p(w_{j'} \mid d_{j'}, z, \gamma)$
$= \sum_{z} p(z)\, p(d_j \mid \delta)\, p(w_j \mid d_j, z, \gamma) \prod_{j' \neq j} \sum_{d_{j'}} p(d_{j'} \mid \delta)\, p(w_{j'} \mid d_{j'}, z, \gamma),$

where j′≠j denotes all word indices from 1 to k, except j. Note that, if the document is a labeled document (training document) with class z′, we do not need to sum over all z but instead use only z′ in the above equation.

B Proof of Feature Weighting

In this appendix, we prove Equation (8).

Proof. Let a and b be two different values for γw such that b<a. We now need to prove that |log gw(b)|>|log gw(a)|.

In the first case, let us assume that $\theta_{w|z_1} > \theta_{w|z_2}$. We then have

$\theta_{w|z_1} > \theta_{w|z_2}$
$\Rightarrow\; \theta_{w|z_1}(a - b) > \theta_{w|z_2}(a - b)$
$\Rightarrow\; \dfrac{(1-\delta)b + \delta\,\theta_{w|z_1}}{(1-\delta)b + \delta\,\theta_{w|z_2}} > \dfrac{(1-\delta)a + \delta\,\theta_{w|z_1}}{(1-\delta)a + \delta\,\theta_{w|z_2}}$
$\Rightarrow\; g_w(b) > g_w(a)$
$\Rightarrow\; |\log g_w(b)| > |\log g_w(a)|.$

In the above chain, the step from line 4 to line 5 follows from $\theta_{w|z_1} > \theta_{w|z_2}$, which implies that $g_w(b)$ and $g_w(a)$ are larger than 1. Furthermore, the step from line 2 to line 3 follows by first defining $a' := (1-\delta)a$, $b' := (1-\delta)b$, $\theta_1 := \delta\,\theta_{w|z_1}$, and $\theta_2 := \delta\,\theta_{w|z_2}$. Then, line 3 can be written as

$\dfrac{b' + \theta_1}{b' + \theta_2} > \dfrac{a' + \theta_1}{a' + \theta_2}$
$\Leftrightarrow\; b'\theta_2 + a'\theta_1 > a'\theta_2 + b'\theta_1$
$\Leftrightarrow\; b\,\theta_{w|z_2} + a\,\theta_{w|z_1} > a\,\theta_{w|z_2} + b\,\theta_{w|z_1}$
$\Leftrightarrow\; \theta_{w|z_1}(a - b) > \theta_{w|z_2}(a - b),$

where we factored out (1–δ)δ and used that 0<δ<1.

The second case, $\theta_{w|z_1} < \theta_{w|z_2}$, is analogous to the first; here we have

$\theta_{w|z_1} < \theta_{w|z_2}$
$\Rightarrow\; g_w(b) < g_w(a)$
$\Rightarrow\; |\log g_w(b)| > |\log g_w(a)|.$

In the above chain, the step from line 2 to line 3 follows from $\theta_{w|z_1} < \theta_{w|z_2}$, which implies that $g_w(b)$ and $g_w(a)$ are smaller than 1.

Finally, noting that the case for a<b is analogous, we can conclude that

$b < a \;\Leftrightarrow\; |\log g_w(b)| > |\log g_w(a)|.$

C Proof of Divergence from Background Distribution

In this section, we prove the statement of Equation (9). To shorten notation, let us write $\tilde{\theta}_z$ for the word probabilities $\{\tilde{\theta}_{w|z}\}_{w \in V}$.

Proof. First, using Equation (3), we get that

$p(d_j, w_j \mid \tilde{\theta}_z) = \sum_{w_{\neq j}} \sum_{d_{\neq j}} \sum_{z} p(z) \prod_{j'=1}^{k} p(w_{j'} \mid z, d_{j'}, \tilde{\theta}_z)\, p(d_{j'})$
$= \sum_{z} p(z)\, p(w_j \mid z, d_j, \tilde{\theta}_z)\, p(d_j) \sum_{w_{\neq j}} \sum_{d_{\neq j}} \prod_{j'=1, j' \neq j}^{k} p(w_{j'} \mid z, d_{j'}, \tilde{\theta}_z)\, p(d_{j'})$
$= \sum_{z} p(z)\, p(w_j \mid z, d_j, \tilde{\theta}_z)\, p(d_j) \prod_{j'=1, j' \neq j}^{k} \sum_{w_{j'}} \sum_{d_{j'}} p(w_{j'} \mid z, d_{j'}, \tilde{\theta}_z)\, p(d_{j'})$
$= \sum_{z} p(z)\, p(w_j \mid z, d_j, \tilde{\theta}_z)\, p(d_j),$

where a variable with index “≠j” denotes all the variables with indices 1 to k, except index j. [12] Next, we inspect the two cases dj=1 and dj=0:

$p(d_j = 1, w_j \mid \tilde{\theta}_z) = \sum_{z} p(z)\, p(w_j \mid z, d_j = 1, \tilde{\theta}_z)\, p(d_j = 1) = \sum_{z} p(z)\, \tilde{\theta}_{w_j|z}\, \delta = \delta\, \tilde{p}(w_j),$

and analogously, we have

$p(d_j = 0, w_j \mid \tilde{\theta}_z) = \sum_{z} p(z)\, \gamma_{w_j}\, (1-\delta) = (1-\delta)\gamma_{w_j}.$

Putting these results together, we get that

$E_\gamma\!\left[\log \dfrac{p(d_j = 0 \mid w_j, \tilde{\theta}_z)}{p(d_j = 1 \mid w_j, \tilde{\theta}_z)}\right] = E_\gamma\!\left[\log \dfrac{p(d_j = 0, w_j \mid \tilde{\theta}_z)}{p(d_j = 1, w_j \mid \tilde{\theta}_z)}\right] = \sum_{w} \gamma_w \log \dfrac{p(d_j = 0, w_j = w \mid \tilde{\theta}_z)}{p(d_j = 1, w_j = w \mid \tilde{\theta}_z)}$
$= \sum_{w} \gamma_w \log \dfrac{\gamma_w}{\tilde{p}(w)} + \log \dfrac{1-\delta}{\delta} = D_{KL}(\gamma \,\|\, \tilde{p}) + \log \dfrac{1-\delta}{\delta}.$

D ML Estimate of δ Using only Labeled Training Data

In this section, we continue the proof of Section 5.3 and show that the ML estimate of δ* is 1.

Proof. Let δ be set to 1, then we have

$A(1) = \sum_{i=1}^{n} \sum_{j=1}^{k_i} \left(1 - \dfrac{\gamma_{w_j}}{\theta_{w_j|z_i}}\right) = \sum_{w} \sum_{z} N_{w,z} \left(1 - \dfrac{\gamma_w}{\theta_{w|z}}\right) = \sum_{w} \sum_{z} N_{w,z} - \sum_{w} \sum_{z} N_{w,z} \dfrac{\gamma_w}{\theta_{w|z}} = N - \sum_{w} \sum_{z} N_{w,z} \dfrac{\gamma_w}{\theta_{w|z}},$

where we defined Nw,z as the number of times word w occurs in topic z, and N is the total number of words. Next, we use that θ is the ML estimate of naive Bayes, i.e.

$\theta_{w|z} = \dfrac{N_{w,z}}{N_z},$

where Nz is the total number of words occurring in class z. Therefore, we get

$A(1) = N - \sum_{w} \sum_{z} N_z\, \gamma_w = N - \sum_{z} N_z = 0.$

Bibliography

[1] D. M. Blei, A. Y. Ng and M. I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003), 993–1022.

[2] C. Chemudugunta, P. Smyth and M. Steyvers, Modeling general and specific aspects of documents with a probabilistic topic model, in: NIPS, 19, pp. 241–248, 2006.

[3] J. Cheng and R. Greiner, Comparing Bayesian network classifiers, in: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, Inc., pp. 101–108, 1999.

[4] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa, Natural language processing (almost) from scratch, J. Mach. Learn. Res. 12 (2011), 2493–2537.

[5] H. Daumé III and D. Marcu, Bayesian query-focused summarization, in: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 305–312, 2006. doi:10.3115/1220175.1220214.

[6] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. 39 (1977), 1–38. doi:10.1111/j.2517-6161.1977.tb01600.x.

[7] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006), 1–30.

[8] J. Eisenstein, A. Ahmed and E. P. Xing, Sparse additive generative models of text, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1041–1048, 2011.

[9] E. C. Y. He and K. L. J. Zhao, A weakly supervised Bayesian model for violence detection in social media, in: International Joint Conference on Natural Language Processing (IJCNLP), 2013.

[10] W. Hersh, C. Buckley, T. J. Leone and D. Hickam, OHSUMED: an interactive retrieval evaluation and new large test collection for research, in: SIGIR, Springer, pp. 192–201, 1994. doi:10.1007/978-1-4471-2099-5_20.

[11] R. L. Iman and J. M. Davenport, Approximations of the critical region of the Friedman statistic, Commun. Stat. Theory Methods 9 (1980), 571–595. doi:10.1080/03610928008827904.

[12] L. Jiang, D. Wang and Z. Cai, Discriminatively weighted naive Bayes and its application in text classification, Int. J. Artif. Intell. Tools 21 (2012), 1250007. doi:10.1142/S0218213011004770.

[13] L. Jiang, Z. Cai, H. Zhang and D. Wang, Naive Bayes text classifiers: a locally weighted learning approach, J. Exp. Theor. Artif. Intell. 25 (2013), 273–286. doi:10.1080/0952813X.2012.721010.

[14] L. Jiang, C. Li, S. Wang and L. Zhang, Deep feature weighting for naive Bayes and its application to text classification, Eng. Appl. Artif. Intell. 52 (2016), 26–39. doi:10.1016/j.engappai.2016.02.002.

[15] L. Jiang, S. Wang, C. Li and L. Zhang, Structure extended multinomial naive Bayes, Inf. Sci. 329 (2016), 346–356. doi:10.1016/j.ins.2015.09.037.

[16] Q. Jiang, W. Wang, X. Han, S. Zhang, X. Wang and C. Wang, Deep feature weighting in naive Bayes for Chinese text classification, in: Cloud Computing and Intelligence Systems (CCIS), 2016 4th International Conference on, IEEE, pp. 160–164, 2016. doi:10.1109/CCIS.2016.7790245.

[17] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: European Conference on Machine Learning, 1998. doi:10.1007/BFb0026683.

[18] D. D. Lewis, Y. Yang, T. G. Rose and F. Li, RCV1: a new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (2004), 361–397.

[19] P. Li, J. Jiang and Y. Wang, Generating templates of entity summaries with an entity-aspect model and pattern mining, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 640–649, 2010.

[20] M. R. Lucas and D. Downey, Scaling semi-supervised naive Bayes with feature marginals, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 343–351, 2013.

[21] K. Nigam, A. K. McCallum, S. Thrun and T. Mitchell, Text classification from labeled and unlabeled documents using EM, Mach. Learn. 39 (2000), 103–134. doi:10.1023/A:1007692713085.

[22] M. Paul and R. Girju, A two-dimensional topic-aspect model for discovering multi-faceted topics, in: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2010. doi:10.1609/aaai.v24i1.7669.

[23] J. D. Rennie, L. Shih, J. Teevan and D. R. Karger, Tackling the poor assumptions of naive Bayes text classifiers, in: Proceedings of the International Conference on Machine Learning, 3, pp. 616–623, 2003.

[24] J. Su, J. S. Shirab and S. Matwin, Large scale text classification using semi-supervised multinomial naive Bayes, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 97–104, 2011.

[25] H. M. Wallach, D. M. Mimno and A. McCallum, Rethinking LDA: why priors matter, in: NIPS, 22, pp. 1973–1981, 2009.

[26] S. Wang, L. Jiang and C. Li, A CFS-based feature weighting approach to naive Bayes text classifiers, Springer International Publishing, Cham, pp. 555–562, 2014. doi:10.1007/978-3-319-11179-7_70.

[27] S. Wang, L. Jiang and C. Li, Adapting naive Bayes tree for text classification, Knowl. Inf. Syst. 44 (2015), 77–89. doi:10.1007/s10115-014-0746-y.

[28] Y. Yang and X. Liu, A re-examination of text categorization methods, in: ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49, 1999. doi:10.1145/312624.312647.

[29] C. Zhai and J. Lafferty, A study of smoothing methods for language models applied to ad hoc information retrieval, in: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 334–342, 2001. doi:10.1145/383952.384019.

[30] L. Zhang, L. Jiang and C. Li, A new feature selection approach to naive Bayes text classifiers, Int. J. Pattern Recogn. Artif. Intell. 30 (2016), 1650003. doi:10.1142/S0218001416500038.

[31] L. Zhang, L. Jiang, C. Li and G. Kong, Two feature weighting approaches for naive Bayes text classifiers, Knowl. Based Syst. 100 (2016), 137–144. doi:10.1016/j.knosys.2016.02.017.

Received: 2017-01-27
Published Online: 2017-07-20
Published in Print: 2019-04-24

©2019 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
