Abstract

In this paper, we study the problem of protecting privacy in recommender systems. We focus on protecting the items rated by users and propose a novel privacy-preserving matrix factorization algorithm. In our algorithm, the user submits a fake gradient so that the central server cannot distinguish which items the user has rated. We make the Kullback–Leibler distance between the real and fake gradient distributions small, so that the two are hard to distinguish. Through theory and experiments, we show that our algorithm can be reduced to a time-delay SGD, which provably converges well, so the accuracy does not decline. Our algorithm therefore achieves a good tradeoff between privacy and accuracy.

1. Introduction

Recommender systems, which help electronic commerce websites give more useful suggestions, are becoming increasingly important. However, to provide users with appropriate options, the server must collect users’ data, which includes a great deal of sensitive information.

Data in electronic commerce, economics, supply chains, financial systems [1–10], etc., are generally very sensitive. In the electronic commerce case, many studies, such as [11, 12], show that user data in recommender systems, shopping records, movies a user has watched, and ratings for restaurants contain a lot of very private information, such as political attitudes and sexual orientation. In this paper, we study the privacy protection problem for electronic commerce data. Privacy has been an important issue for a long time, not only in recommender systems but also in almost all data mining and machine learning algorithms.

Differential privacy [13] is a popular method to protect privacy in machine learning algorithms. For recommender systems, there are many works applying differential privacy, such as [14–16]. Differentially private matrix factorization algorithms are introduced in [17, 18], among others. The traditional differential privacy setting is centralized, in other words, it relies on a trustworthy data collector. When we want the central server itself to be unable to obtain private information, local differential privacy (LDP) should be used: every user adds noise to their private data on their own device before it is submitted to the central server. Recommender systems with LDP are studied in [19–21]. LDP has been used in Google’s Chrome browser [22] and Apple iOS 10 [23] to collect user data.

In local differential privacy, there are two important things to protect. The first is which items a user has rated, and the second is the user’s ratings. In some situations, which items have been rated is much more sensitive than the ratings themselves. For example, a shopping record contains a lot of private information, whereas the ratings mostly reflect the quality of the goods. The work in [19] can only protect the ratings, not both. Shin et al. [17] proposed a novel LDP matrix factorization algorithm that protects both kinds of private information, based on the work in [24]. Their method lets the user submit a noisy gradient that takes one of two fixed values. The algorithm is $\epsilon$-LDP in each round of the training process, and since the output is binary, the adversary cannot learn which items are rated from a single iteration.

However, if the adversary can collect noisy gradients over multiple iterations, then, since the noisy gradients of unrated items obey a Bernoulli-type distribution with mean 0, the items which have not been rated can be identified by a statistical test. The intensity of the privacy protection for the ratings and items after multiple iterations can be guaranteed by composition theorems for LDP [25, 26]: if every iteration is $\epsilon$-LDP, then after $k$ iterations the final algorithm is at most $k\epsilon$-LDP. But these analyses are not a direct guarantee for protecting the items rated by the users. We can take a new perspective on this question. After performing $k$ iterations, consider a sequence of length $k$ denoted by $G = (g_1, \dots, g_k)$, where $g_i$ is the gradient submitted in iteration $i$; let $P(G)$ be the probability that $G$ is a real gradient sequence and let $Q(G)$ be the probability that $G$ is a fake one. Using these two probabilities, we can consider testing two hypotheses: the sequence is real, and the sequence is fake. So the question becomes: how can we make it difficult to distinguish the two situations?

In order to improve the protection of privacy, we want the probability of error in this hypothesis test to be large. Note that the average negative log probability of error is given by the well-known Chernoff–Stein lemma.

Theorem 1 (Theorem 11.8.3 in [27]). Let $X$ be a random variable and consider the hypothesis test between two alternatives, $X \sim P_1$ and $X \sim P_2$, where the K-L distance $D(P_1 \| P_2)$ is finite. Then the average negative log probability of error of this hypothesis test is $D(P_1 \| P_2)$.

Using this result, although we cannot obtain the distribution of the real sequence, we will show in Section 4 that, for the Gaussian-noise-based differential privacy algorithm, we can estimate the mean value of the K-L distance and optimize the value of the fake gradient to make the two distributions difficult to distinguish.

In this paper, we propose a novel algorithm in which, if an item has not been rated by the user, the user submits a fake gradient; otherwise, the user submits the real one, and in both cases noise is added to the submitted data. The paper is organized as follows. In Section 2, we briefly introduce differential privacy as preliminaries. In Section 3, we introduce the framework of the general differentially private matrix factorization algorithm. In Section 4, we show that our algorithm reduces the average K-L distance between the fake and real gradient distributions, which strengthens the protection of the rated items. Meanwhile, we prove that our algorithm has the form of SGD with time delay, whose convergence guarantees imply that the accuracy of the model is not reduced by our updating rules, so that our algorithm achieves a tradeoff between accuracy and privacy. In Section 5, we use experiments to show the effectiveness of our algorithm. Related work is reviewed in Section 6. In the final section, we conclude.

2. Preliminaries

The notations used in this paper are listed in Table 1.

2.1. Differential Privacy

Differential privacy was first introduced by Dwork et al. [13]; its aim is to make it difficult for an attacker to extract private information from the output data by adding noise.

Definition 1. A randomized algorithm $\mathcal{M}$ with domain $D$ and range $R$ is $\epsilon$-differentially private if, for any two adjacent datasets $d, d' \in D$ and for any subset $S$ of the range $R$, it holds that
$$\Pr[\mathcal{M}(d) \in S] \le e^{\epsilon}\, \Pr[\mathcal{M}(d') \in S].$$

Note that this definition compares the two probabilities. If $\Pr[\mathcal{M}(d') \in S] > 0$, it can be expressed as
$$\frac{\Pr[\mathcal{M}(d) \in S]}{\Pr[\mathcal{M}(d') \in S]} \le e^{\epsilon}.$$

If $\epsilon$ is small, it is hard to distinguish whether the output data come from $d$ or $d'$. As in [28], one can link differential privacy with mutual information.

Another way to describe differential privacy is through the distance between distributions. We say a randomized algorithm $\mathcal{M}$ is $(\alpha, \epsilon)$-Rényi differentially private if for all neighboring $d$ and $d'$ we have
$$D_{\alpha}\bigl(\mathcal{M}(d)\,\|\,\mathcal{M}(d')\bigr) \le \epsilon.$$

When $\alpha = 1$, $D_{\alpha}$ is the Kullback–Leibler distance, and when $\alpha = \infty$, Rényi differential privacy is equal to $\epsilon$-differential privacy. So we can see that differential privacy makes the output distributions under different inputs indistinguishable (the distributions have small distances).

One may ask how to achieve $(\epsilon, \delta)$-differential privacy in a machine learning process. A basic paradigm is to examine the $\ell_2$-sensitivity of a query [29].

Definition 2. Let $f$ be a map from the data in the dataset to a vector. The $\ell_2$-sensitivity of $f$ is
$$\Delta_2 f = \max_{d, d' \text{ adjacent}} \|f(d) - f(d')\|_2.$$

Using this definition, we have the following theorem in [29].

Theorem 2. Let $f$ be a map from $D$ to $\mathbb{R}^m$. Then the randomized algorithm
$$\mathcal{M}(d) = f(d) + \mathcal{N}\bigl(0, \sigma^2 I_m\bigr), \qquad \sigma \ge \frac{\sqrt{2\ln(1.25/\delta)}\,\Delta_2 f}{\epsilon},$$
achieves $(\epsilon, \delta)$-differential privacy.

This theorem provides a basic method for achieving differentially private machine learning.
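As an illustration (not taken from [29]), the following Python sketch shows how a vector-valued query could be released with Gaussian noise calibrated to its $\ell_2$-sensitivity, assuming Theorem 2 refers to the standard Gaussian mechanism; the function name gaussian_mechanism and the toy query are our own.

import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng):
    # Noise scale of the standard Gaussian mechanism (assumed form of Theorem 2).
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Toy usage (hypothetical data): release the mean rating vector of 3 users.
rng = np.random.default_rng(0)
ratings = np.array([[4.0, 3.0], [5.0, 2.0], [3.0, 4.0]])
mean_query = ratings.mean(axis=0)
# Ratings lie in [1, 5]; replacing one user changes each coordinate of the mean
# by at most 4/3, so the l2-sensitivity of this 2-dimensional query is 4*sqrt(2)/3.
sensitivity = 4.0 * np.sqrt(2.0) / ratings.shape[0]
print(gaussian_mechanism(mean_query, sensitivity, epsilon=1.0, delta=1e-5, rng=rng))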

3. The Framework of the Perturbed Matrix Factorization Algorithm

Matrix factorization algorithms with privacy protection have been studied by many authors, such as [17, 19].

When minimizing the cost function
$$L(U, V) = \sum_{i, j} M_{ij}\bigl(r_{ij} - u_i^{T} v_j\bigr)^2 + \lambda_u \sum_i \|u_i\|^2 + \lambda_v \sum_j \|v_j\|^2,$$
we can use gradient descent:
$$u_i \leftarrow u_i - \gamma\, \nabla_{u_i} L, \qquad v_j \leftarrow v_j - \gamma\, \nabla_{v_j} L.$$

The vector $u_i$ is the user profile vector for user $i$, and $v_j$ is the item profile vector for item $j$.

Note that we have
$$\nabla_{v_j} L = -2 \sum_i M_{ij}\bigl(r_{ij} - u_i^{T} v_j\bigr) u_i + 2\lambda_v v_j,$$
where $M_{ij} = 1$ if user $i$ has rated item $j$ and $M_{ij} = 0$ otherwise.
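For concreteness, here is a minimal NumPy sketch of these gradients (our own illustration; U, V, R, M, lambda_u, and lambda_v are assumed variable names for the profile matrices, ratings, rating indicators, and regularization parameters).

import numpy as np

def item_gradient(j, U, V, R, M, lambda_v):
    # Gradient of the regularized squared loss with respect to item vector v_j.
    # U: (num_users, d), V: (num_items, d), R and M: (num_users, num_items).
    residual = M[:, j] * (R[:, j] - U @ V[j])
    return -2.0 * (residual @ U) + 2.0 * lambda_v * V[j]

def user_gradient(i, U, V, R, M, lambda_u):
    # Gradient with respect to user vector u_i (kept on the user's device).
    residual = M[i, :] * (R[i, :] - V @ U[i])
    return -2.0 * (residual @ V) + 2.0 * lambda_u * U[i]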

In this type of scheme, the user profile vectors $u_i$ are stored and updated on the users’ own devices. As for the item profile vectors, every user sends a gradient to the central server, and each user perturbs this gradient with a random mechanism $\mathcal{M}$ before sending it. The central server then sums all these gradients to update the item profile vectors $v_j$. Using this random perturbation, differential privacy can be achieved by adjusting the distribution of the noise.

The whole process is shown in Algorithm 1.

Input: Random mechanism $\mathcal{M}$, learning rate $\gamma$, and predefined iteration number $k$
Output: Item profile matrix $V$
Randomly initialize $u_i$ and $v_j$ for all $i$ and $j$.
for $t = 1, 2, \dots, k$ do
 Initialize $\mathrm{grad}_j = 0$ for all $j$ in the central server.
for each user $i$ do
  On user $i$: sample $j$ uniformly
  from $\{1, 2, \dots, n\}$.
  if $M_{ij} = 1$ then
   Compute the real gradient $g_{ij}$ of $v_j$.
   Perturb it: $\tilde{g}_{ij} = \mathcal{M}(g_{ij})$.
   Send $\tilde{g}_{ij}$ to the central server.
  end
  else
   Generate a fake gradient $F_j$ of $v_j$.
   Set $\tilde{g}_{ij} = \mathcal{M}(F_j)$.
   Send $\tilde{g}_{ij}$ to the central server.
  end
   $\mathrm{grad}_j \leftarrow \mathrm{grad}_j + \tilde{g}_{ij}$ for all $j$.
end
 For all $j$: $v_j \leftarrow v_j - \gamma\, \mathrm{grad}_j$.
  for each user $i$ do
  Update $u_i$ on a local device by gradient descent.
end
end
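A hedged Python sketch of the per-user step of Algorithm 1 is given below; the helper fake_gradient and the Gaussian perturbation stand in for the abstract mechanism $\mathcal{M}$ and are our own simplification, not the paper's exact procedure.

import numpy as np

def user_submit(i, j, U, V, R, M, sigma, fake_gradient, rng):
    # One user-side step: perturb the real gradient if the item was rated,
    # otherwise perturb a fake gradient, so the server cannot tell the two apart.
    if M[i, j] == 1:
        residual = R[i, j] - U[i] @ V[j]
        grad = -2.0 * residual * U[i]    # per-user part of the gradient of v_j
    else:
        grad = fake_gradient(j)          # e.g. zeros, or the previous mean gradient
    noise = rng.normal(0.0, sigma, size=grad.shape)
    return grad + noise                  # this is what gets sent to the server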

Note that there are two types of private information: one is the users’ ratings and the other is the set of items that have been rated by the users.

In order to protect the items, one way is to use the Random Response mechanism introduced in Section 4.1 of [17]. In this method, we generate a perturbed indicator $\tilde{M}_{ij}$ that equals $M_{ij}$ with probability $p$; if the original $M_{ij} = 0$ but $\tilde{M}_{ij} = 1$, we set a fake rating, so the fake gradient is computed by (8), and Gaussian noise is added to the final gradient sent to the central server to protect the users’ ratings.

However, it is shown in the discussion of Section 4.1 of [17] that the error caused by these fake ratings is not small, which affects the final model accuracy. The main reason is that there are many fake gradients, which lead to a large error in the expectation of the sum of gradients.

One way to solve this problem is to make the fake gradient zero-mean. If $M_{ij} = 0$, the user sends a random variable $X$ to the central server. This method is used in [17], where $X$ is a binary (Bernoulli-type) random variable with mean 0. However, the disadvantage of this method is that the distribution of gradients in the $M_{ij} = 0$ case is very different from the distribution of the real gradient. For example, we can collect the values of $X$ sent by user $i$ for item $j$ over several iterations and use a statistical test to check whether these data obey the given zero-mean distribution; then we can know whether $M_{ij} = 0$.
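To make this weakness concrete, the following sketch (our own illustration, using a one-sample t-test from SciPy and Gaussian noise instead of the binary mechanism of [17]) shows how a server that stores one user's submissions for one item over many iterations could test whether they are zero-mean, and hence infer whether the item was rated.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, sigma = 200, 1.0

# Hypothetical submissions for one gradient coordinate over k iterations.
unrated = rng.normal(0.0, sigma, size=k)        # zero-mean fake submissions
rated = 0.3 + rng.normal(0.0, sigma, size=k)    # real gradients drift away from 0

# One-sample t-test against mean 0: a small p-value exposes a rated item.
print(stats.ttest_1samp(unrated, 0.0).pvalue)   # typically large
print(stats.ttest_1samp(rated, 0.0).pvalue)     # typically small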

All in all, we need to strike a balance between privacy and accuracy. We need to provide a fake gradient that does not greatly affect the accuracy and that makes these two distributions, the fake one and the real one, as statistically indistinguishable as possible.

4. The Main Results

In this paper, since we are concerned with the users’ items, we focus on the statistical distance between the real-gradient and fake-gradient distributions. We propose a novel algorithm to protect the users’ items. In our algorithm, the user submits a noise-added fake gradient in the $M_{ij} = 0$ case. The K-L distance between the real and fake distributions is small, so that they are hard to distinguish. On the other hand, we study how the fake gradients influence the model accuracy. We show that in our algorithm the updating rules can be reduced to a time-delay SGD, which does not harm the accuracy.

In our algorithm, the random mechanism we choose is the Gaussian mechanism, $\mathcal{M}(g) = g + \mathcal{N}(0, \sigma^2 I)$. One of its advantages is that there is a very good composition theorem [26], which gives a much tighter estimate of the multi-iteration privacy loss for Gaussian-mechanism-based differentially private gradient descent.

Theorem 3 (Theorem 1 in [26]). Let $C$ be the gradient norm bound in private gradient descent. There exist two constants $c_1$ and $c_2$ such that, after $k$ iterations, the Gaussian noisy private gradient descent algorithm is $(\epsilon, \delta)$-differentially private for any $\epsilon < c_1 k$ and $\delta > 0$ if we choose
$$\sigma \ge c_2\, \frac{C\sqrt{k \log(1/\delta)}}{\epsilon}.$$

Generally, $C$ is chosen to be an a priori bound on the gradient norm, so we do not write it explicitly in the algorithm description.

In the case of the Gaussian mechanism, it is easy to calculate the K-L distance between distributions. In the following subsection, we show that we can find a good choice of the fake gradient.

4.1. Estimating the K-L Distance between Two Distributions

Given a gradient sequence $G = (g_1, g_2, \dots, g_k)$ of length $k$, its probability can be represented in the following form:
$$P(G) = \prod_{t=1}^{k} P\bigl(g_t \mid g_1, \dots, g_{t-1}\bigr).$$

Using this form, we can calculate the K-L distance. Given two probability measures $P$ and $Q$ on the space of length-$k$ sequences, we have
$$D(P \,\|\, Q) = \sum_{t=1}^{k} \mathbb{E}_{(g_1, \dots, g_{t-1}) \sim P}\Bigl[ D\bigl(P(\,\cdot \mid g_1, \dots, g_{t-1}) \,\|\, Q(\,\cdot \mid g_1, \dots, g_{t-1})\bigr) \Bigr].$$

In each iteration, the user sends a perturbed gradient to the central server, which has one of the following forms:
$$\tilde{g}_t = g_t + \mathcal{N}(0, \sigma^2 I) \ \ \text{if } M_{ij} = 1, \qquad \tilde{g}_t = F + \mathcal{N}(0, \sigma^2 I) \ \ \text{if } M_{ij} = 0.$$

Then, for each iteration, the conditional K-L distance is
$$D\bigl(\mathcal{N}(g_t, \sigma^2 I)\,\|\,\mathcal{N}(F, \sigma^2 I)\bigr) = \frac{\|g_t - F\|^2}{2\sigma^2},$$
where $g_t$ is the gradient calculated from the current $u_i$ and $v_j$.

This is the K-L distance between two Gaussian distributions with the same $\sigma$, and it can be computed in closed form as above.

From equation (11), if we want to optimize the K-L distance, we need to consider the mean value
$$\frac{1}{|S_j|}\sum_{i \in S_j} \frac{\|g_{ij} - F\|^2}{2\sigma^2}.$$

Although we do not know the distribution of the real gradients, this mean value can be estimated by sampling. Let $S_j$ be the set of users $i$ such that $M_{ij} = 1$.

In our algorithm, for a given item $j$, all the users use the same fake gradient $F$; in other words, $F$ is independent of $i$. Then the above expression is a quadratic function of $F$.

In order to minimize this K-L distance, we should set $F$ to be the mean of the real gradients, $F = \frac{1}{|S_j|}\sum_{i \in S_j} g_{ij}$. However, at time $t$, user $i$ cannot obtain the current gradients of the other users. In the following subsection, we will show that in our algorithm we can estimate this value from the gradients of the previous iteration.
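The following toy computation (our own sketch, with synthetic gradients) checks numerically that the average per-iteration K-L term $\frac{1}{|S_j|}\sum_{i \in S_j} \|g_{ij} - F\|^2 / (2\sigma^2)$ is indeed minimized by taking $F$ to be the mean of the real gradients.

import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5
real_grads = rng.normal(0.0, 1.0, size=(50, 8))   # gradients g_ij of the users in S_j

def avg_kl(F):
    # Average K-L distance between N(g_ij, sigma^2 I) and N(F, sigma^2 I).
    return np.mean(np.sum((real_grads - F) ** 2, axis=1)) / (2.0 * sigma ** 2)

F_mean = real_grads.mean(axis=0)
print(avg_kl(F_mean), avg_kl(np.zeros(8)))   # the mean of the real gradients wins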

4.2. Algorithmic Description

In Algorithm 1 with the Gaussian mechanism, the central server receives the gradients submitted by the users, whose summation is as follows:
$$\sum_{i} \tilde{g}_{ij} = \sum_{i:\, M_{ij} = 1} g_{ij} + \sum_{i:\, M_{ij} = 0} F_j + \mathcal{N}\bigl(0, n\sigma^2 I\bigr).$$

Suppose $F_j = 0$; then this sum is just a Langevin stochastic gradient [30] whose expected value is the total gradient. When $F_j \neq 0$, using this sum to update the parameters will generally harm the accuracy of the model. One way to solve this problem is to subtract the contribution of the fake gradients in the central server.

In order to determine the number of fake gradients, so as to make this bias term small, we can use the Random Response mechanism.

The Random Response mechanism [31] is a well-known method for obtaining statistical information on sensitive issues, e.g., the proportion of people suffering from AIDS. In our algorithm, we use the Random Response mechanism to count the number of rated items, which is used by the central server to correct the sum of the gradients.

The procedure of the Random Response mechanism is that the responder gives the true answer with probability $p$, and with probability $1 - p$ gives the opposite answer.

Theorem 4 (Warner, 1965, in [31]). Suppose the number of “yes” answers is $n_1$ and the total number of responders is $n$. If $p \ne 1/2$, then
$$\hat{\pi} = \frac{n_1/n - (1 - p)}{2p - 1}$$
is an unbiased estimate of the ratio $\pi$ with variance
$$\operatorname{Var}(\hat{\pi}) = \frac{\pi(1 - \pi)}{n} + \frac{p(1 - p)}{n\,(2p - 1)^2},$$
where $\pi$ is the real ratio of rated items.

The variance is $O(1/n)$, so if the total number of users is large enough, then with high probability $\hat{\pi}$ is close to the true ratio $\pi$.
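A minimal sketch of Warner's estimator follows (our own illustration; the 20% true ratio and $p = 0.55$ are arbitrary toy values).

import numpy as np

def randomized_response(truth, p, rng):
    # Each user reports the true bit with probability p, the flipped bit otherwise.
    flip = rng.random(truth.shape) > p
    return np.where(flip, 1 - truth, truth)

def estimate_ratio(answers, p):
    # Warner's unbiased estimator of the true proportion of ones (needs p != 1/2).
    return (answers.mean() - (1.0 - p)) / (2.0 * p - 1.0)

rng = np.random.default_rng(3)
truth = (rng.random(100000) < 0.2).astype(int)   # suppose 20% of users rated item j
answers = randomized_response(truth, p=0.55, rng=rng)
print(estimate_ratio(answers, p=0.55))           # close to 0.2 for large n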

The whole process is shown in Algorithm 2.

Input: Predefined iteration number $k$, learning rate $\gamma$, probability $p$ for Random Response, and standard deviation $\sigma$ of the Gaussian distribution
Output: Item profile matrix $V$
For all items $j$, use the probability-$p$ Random Response method to estimate the ratio of users with $M_{ij} = 1$ as $\hat{\pi}_j$.
Randomly initialize $u_i$ and $v_j$ for all $i$ and $j$.
for $t = 1, 2, \dots, k$ do
 Initialize $\mathrm{grad}_j = 0$, $c_j = 0$ for all $j = 1, 2, \dots, n$ in the central server.
for each user $i$ do
  On user $i$: sample $B$ items uniformly from $\{1, 2, \dots, n\}$.
  for each sampled item $j$ do
   $c_j \leftarrow c_j + 1$.
   if $M_{ij} = 1$ then
    Compute the real gradient $g_{ij}$ of $v_j$.
    Draw $\xi \sim \mathcal{N}(0, \sigma^2 I)$ and set $\tilde{g}_{ij} = g_{ij} + \xi$.
   end
   else
    if $t = 1$ then
      Set the fake gradient $F_j = 0$.
     end
     else
     Set the fake gradient $F_j = \bar{g}_j^{(t-1)}$.
    end
    Set $g_{ij} = F_j$.
    Draw $\xi \sim \mathcal{N}(0, \sigma^2 I)$ and set $\tilde{g}_{ij} = g_{ij} + \xi$.
   end
  end
  Send the $\tilde{g}_{ij}$ of the sampled items to the central server: $\mathrm{grad}_j \leftarrow \mathrm{grad}_j + \tilde{g}_{ij}$.
end
for each item $j$ do
  if $t = 1$ then
  $\bar{g}_j^{(t)} = \mathrm{grad}_j / (\hat{\pi}_j c_j)$
  $v_j \leftarrow v_j - \gamma\, \bar{g}_j^{(t)}$
  end
  else
   $\mathrm{grad}_j \leftarrow \mathrm{grad}_j - (1 - \hat{\pi}_j)\, c_j\, \bar{g}_j^{(t-1)}$
   $\bar{g}_j^{(t)} = \mathrm{grad}_j / (\hat{\pi}_j c_j)$
   $v_j \leftarrow v_j - \gamma\, \bar{g}_j^{(t)}$
   Store $\bar{g}_j^{(t)}$ as the fake gradient for iteration $t + 1$.
  end
end
for each user $i$ do
  Update $u_i$ on the local device by gradient descent.
end
end
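The server-side part of Algorithm 2 can be summarized by the following sketch (our own simplified reading of the algorithm; the variable names grad_sum, count, pi_hat, and prev_mean are assumptions, and the exact normalization used in the paper may differ).

import numpy as np

def server_update(grad_sum, count, pi_hat, prev_mean, v_j, lr):
    # grad_sum : sum of the noisy gradients received for item j this round
    # count    : number of users who sampled item j this round
    # pi_hat   : Random Response estimate of the fraction of users with M_ij = 1
    # prev_mean: mean real gradient of the previous round (used as the fake gradient)
    corrected = grad_sum - (1.0 - pi_hat) * count * prev_mean   # remove the fake part
    mean_real = corrected / max(pi_hat * count, 1.0)            # ~ unbiased mean gradient
    v_j_new = v_j - lr * mean_real
    return v_j_new, mean_real   # mean_real becomes the fake gradient of the next round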

It is easy to see that, in the central server, the update process has the following form:
$$v_j^{(t+1)} = v_j^{(t)} - \gamma\, \bar{g}_j^{(t)}, \qquad \bar{g}_j^{(t)} = \frac{1}{\hat{\pi}_j c_j}\Bigl(\sum_{i} \tilde{g}_{ij} - (1 - \hat{\pi}_j)\, c_j\, \bar{g}_j^{(t-1)}\Bigr),$$
where $\bar{g}_j^{(t)}$ is the sampled stochastic gradient and $c_j$ is the number of terms in the sample.

As for $\bar{g}_j^{(t-1)}$, we know that, up to the added Gaussian noise, it is the average of the real gradients computed at iteration $t - 1$.

Note that, since the regularization terms bound the norms of the matrices $U$ and $V$, there exists a small constant $\kappa$ such that the loss function $L(u, v)$ is $\kappa$-smooth, that is to say,
$$\|\nabla L(x) - \nabla L(y)\| \le \kappa\, \|x - y\|.$$

Since the parameters change by only $O(\gamma)$ between consecutive iterations, $\bar{g}_j^{(t-1)}$ is a good approximation of the current mean real gradient $\frac{1}{|S_j|}\sum_{i \in S_j} g_{ij}^{(t)}$.

So we have the following: the fake gradient used at iteration $t$ is approximately the minimizer of the K-L distance derived in Section 4.1, and the corrected sum is approximately an unbiased estimate of the real gradient sum.

One can easily prove that the variances of all these estimations are $O(1/n)$.

4.3. The Influence on Model Accuracy

We can see that the updating rule (21) has the form of stochastic gradient descent with time delay. It can be shown that, even if the delay is not small, time-delay SGD still has good convergence.

The convergence of SGD with time delay is proved in [32], where Lian et al. prove the convergence of asynchronous stochastic gradient descent, which has the same form as equation (21).

Theorem 5 (Theorem 1 in [32]). Assume the loss function $f$ is $L$-smooth, $\gamma$ is the learning rate, $B$ is the batch size, and $T$ is the time delay. If $\gamma$ is chosen small enough that $\gamma L (T + 1) = O(1)$, then after $K$ iterations we have, with high probability,
$$\frac{1}{K}\sum_{t=1}^{K} \|\nabla f(x_t)\|^2 \le O\!\left(\sqrt{\frac{\bigl(f(x_1) - f^{*}\bigr)\, L\, \sigma^2}{B K}}\right),$$

where $f^{*}$ is the global minimum of $f$ and $\sigma$ is the standard deviation of the stochastic gradients.

Proof of Theorem 5. In this case, the stochastic gradient sent by a node at time $t$ can be written as $g_{t - \tau} + \xi_t$, where $\tau$ is the time delay of the gradient and $\xi_t$ is the noise (including the noise from the stochastic gradients and the Gaussian noise we add). In our case, $\xi_t$ is a sub-Gaussian random variable; to simplify the description, we assume $\xi_t$ is $\sigma$-sub-Gaussian. In order to bound the accumulated noise, we can use the concentration lemmas in [33]; in particular, Lemma 30 in [33] bounds, with high probability, the norm of a sum of sub-Gaussian vectors.
Combining these high-probability bounds with the analysis of asynchronous SGD in [32] and taking a union bound over the $K$ iterations, the theorem follows.
This theorem has the same form as the convergence theorem for standard SGD, and in our case the time delay is $T = 1$, since the fake gradient uses the mean gradient of the previous iteration. So we can show that this time delay does not affect the convergence.
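As a toy illustration (ours, not part of the proof), the following sketch runs gradient descent on a simple quadratic with noisy gradients delayed by $T$ steps and shows that the iterates still approach the minimizer for a small enough learning rate.

import numpy as np

rng = np.random.default_rng(4)
A = np.diag([1.0, 2.0, 4.0])      # quadratic objective f(x) = 0.5 * x^T A x
x = np.ones(3)
lr, T, steps = 0.02, 5, 600
history = [x.copy()]

for t in range(steps):
    x_old = history[max(0, t - T)]                   # T-step-old iterate
    grad = A @ x_old + rng.normal(0.0, 0.1, size=3)  # noisy, delayed gradient
    x = x - lr * grad
    history.append(x.copy())

print(np.linalg.norm(x))          # small: the iterates approach the minimizer 0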

4.4. Privacy Loss in the Random Response Mechanism

At the start of our algorithm, we need to use the Random response mechanism to estimate the ratio of , which will cause a privacy loss. However, we can show that since we need a large number of iterations in the machine learning algorithm, the initial privacy loss is insignificant.

It is easy to prove that the Random Response mechanism is $\ln\frac{p}{1-p}$-differentially private. We know from Theorem 3 that after $k$ iterations the privacy loss of the gradient perturbation grows on the order of $\sqrt{k}$ for a fixed $\sigma$. If $n$ is large enough, we can choose $p$ near 0.5, and when $k$ is large, $\ln\frac{p}{1-p}$ will be much less than the privacy loss of the iterative part.
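For example, with the illustrative choice $p = 0.55$ (our own numbers), the one-shot Random Response cost is
$$\epsilon_{\mathrm{RR}} = \ln\frac{p}{1 - p} = \ln\frac{0.55}{0.45} \approx 0.20,$$
while by Theorem 3 the budget of the iterative part grows like $\sqrt{k \log(1/\delta)}/\sigma$ up to constants, which already dominates $0.20$ for moderate $k$.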

Noting that the K-L distance for a length-$k$ sequence grows as $O(k)$, the same argument applies to the discussion of the K-L distance.

5. Experiments

We now show the performance of our algorithm. We evaluate three types of private gradient descent algorithms: (i) Algorithm 1, noisy gradient descent with a zero fake gradient ($F = 0$): a user submits only the noise $\xi$ if $M_{ij} = 0$, where $\xi$ is a Gaussian random variable; (ii) Algorithm 2, noisy gradient descent with fake ratings; (iii) Algorithm 3, our algorithm in this paper.

In the $F = 0$ case, the only noise in the total gradient is the Gaussian noise added on the users’ devices. This algorithm is accurate but has no ability to protect the items’ privacy. We will show that the performance of our algorithm is very close to the $F = 0$ case and much better than that of the algorithm using fake ratings.

We test on the MovieLens 100K dataset [34]. This version contains 100k ratings of 1682 movies submitted by 943 users. The dataset is very sparse. In order to test the performance under different levels of sparsity, for every user we choose a set of items that provide fake gradients. We consider cases with 50% fake gradient density and 75% fake gradient density. We set the profile vector dimension $d$, the regularization parameters $\lambda_u$ and $\lambda_v$, and the learning rate $\gamma$, and use AdaDelta to optimize. The test RMSE is shown in Figures 1 and 2.

After 400 iterations, the test RMSE is listed in Table 2.

We see that as the density of fake ratings increases, the test RMSE of the fake-rating algorithm grows rapidly, while the performance of our algorithm is very close to that of the zero-mean fake gradient algorithm.

6. Related Work

Differential privacy, introduced by Dwork [13], provides a very strong guarantee for protecting privacy. The original version of differential privacy considers a trusted server that provides data to queriers, and the aim is to prevent queries from revealing user privacy.

Local differential privacy algorithms, such as RAPPOR [22], aim to ensure that the central server cannot access the users’ private data. The main technique is to add noise before submitting the data to the server. In the Chrome browser, Google uses a randomized response mechanism to collect data on users’ clicks. There are also many works that use local differential privacy in machine learning algorithms. For example, Google uses locally differentially private Federated Learning [35] to learn a language model in order to improve the performance of its input method.

One of the difficulties in differentially private machine learning is that, when training a model over many iterations, the privacy guarantee degrades rapidly. Differential privacy over multiple iterations is studied in [25, 26], where much tighter composition theorems are given.

Private recommender systems have been studied by many authors, such as [17–20, 36, 37]. References [17, 18] are based on matrix factorization recommender systems; the algorithms add noise locally on the users’ devices to protect privacy. The algorithm in [17] can protect both the ratings and the items of the user. Their work is based on [24], which proposes a new randomization mechanism and shows that it performs better when the dimension of the data is large.

7. Conclusion

In this paper, we propose a novel private matrix factorization algorithm. In our algorithm, we use the Random Response method to estimate the selection ratios of the items, and then we use the average value of the gradients from the previous iteration as the fake gradient sent to the central server. Using our method, we improve the indistinguishability of the real and fake gradient distributions and thereby improve the protection of the users’ private items. Meanwhile, we show that our algorithm does not reduce the accuracy of the model, since the updating rule can be reduced to SGD with time delay, which can be proved to converge to a stationary point.

Data Availability

The MovieLens 100K dataset is available at http://files.grouplens.org/datasets/movielens/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work has been supported by the Fundamental Research Funds for the Central Universities (grant number: 2020JBM002) and the National Key Research and Development Program of China (grant no. 2018YFC0831703).