Abstract

The explosion of multiomics data poses new challenges to existing data mining methods. Joint analysis of multiomics data can make the best of the complementary information provided by different types of data and can therefore explore the biological mechanisms of diseases more accurately. In this article, two forms of joint nonnegative matrix factorization based on sparse and graph Laplacian regularization (SG-jNMF) are proposed. In this method, the graph regularization constraint preserves the local geometric structure of the data, and $L_{2,1}$-norm regularization enhances the sparsity among the rows and removes redundant features in the data. First, SG-jNMF1 projects multiomics data into a common subspace and applies the multiomics fusion characteristic matrix to mine the important information closely related to diseases. Second, multiomics data of the same disease are mapped into a common sample space by SG-jNMF2, and the cluster structures are detected clearly. Experimental results show that SG-jNMF achieves significant improvement in sample clustering compared with existing joint analysis frameworks. SG-jNMF also effectively integrates multiomics data to identify co-differentially expressed genes (Co-DEGs). SG-jNMF provides an efficient integrative analysis method for mining the biological information hidden in heterogeneous multiomics data.

1. Introduction

With the development of state-of-the-art sequencing technologies, a large quantity of effective experimental data has been collected. These data may imply unknown molecular mechanisms, and bioinformatics is faced with the task of analyzing such massive omics data. The Cancer Genome Atlas (TCGA, https://tcgadata.nci.nih.gov/tcga/) includes gene expression profile data (GE), DNA methylation data (DM), copy number variation data (CNV), protein expression data, and drug sensitivity data. These data come from approximately 15,000 clinical samples of more than 30 kinds of cancers [1]. These massive data enable researchers to study the mechanisms of cancer development, diagnosis, and treatment at different biological levels.

The joint analysis of multiomics data can make up for missing or unreliable information in single-omics data. In recent years, scientists have performed considerable research on cancer mechanisms based on the joint analysis of cancer multiomics data. For example, Christina et al. integrated the gene expression data and copy number variations of breast cancer, identified possible pathogenic genes, and discovered new subtypes of breast cancer [2]. Wang and Wang used similarity network fusion to jointly analyze mRNA, DM, and microRNA (miRNA) data and further identify cancer subtypes [3]. Among the existing joint analysis methods, those based on matrix decomposition are remarkable. Liu et al. integrated mRNA, somatic mutation, DNA methylation, and copy number variation data and established a block constraint-based RPCA model to identify differentially expressed genes (DEGs) [4]. Integration and analysis of these heterogeneous multiomics data provide an in-depth understanding of the pathogenesis of cancer and promote the development of precision medicine. Recently, unsupervised integrative methods based on matrix decomposition have attracted considerable attention. Zhang et al. constructed a joint matrix factorization framework (jNMF) to discover multidimensional modules of genomic data [5]. Yang and Michailidis introduced a new method named integrative NMF (iNMF) for heterogeneous multiomics data [6]. Strazar et al. incorporated orthogonality regularization into iNMF (iONMF) to integrate and analyze multiple data sources [7]. Joint nonnegative matrix factorization meta-analysis (jNMFMA) [8], multiomics factor analysis (MOFA) [9], and Bayesian joint analysis [10] have also been successfully applied to the integration and analysis of cancer omics data. To avoid the influence of redundant information, many sparse modeling methods have been proposed. Typical applications are as follows: the weighted sparse representation classifier (WSRC) model combined with global encoding (GE) [11] was used to predict interactions between proteins based on protein sequence information; the network-regularized sparse logistic regression model (NSLR) [12] was used to predict survival risk and discover biomarkers; and sparse coregularized matrix decomposition was used to find mutated driver genes [13].

In recent years, graph/network-based analysis, as a powerful data representation tool, has been applied to the modeling and analysis of complex systems [14–17]. In general, entities can be regarded as nodes, and the interactions between entities can be regarded as edges in the graph. Graph-based approaches can explore the local subspace structure and obtain low-dimensional representations of high-dimensional data. Zhang and Ma proposed a graph-based subspace clustering algorithm to detect common modules highly correlated with cancer by jointly analyzing gene expression and protein interaction networks [18]. Mixed-norm Laplacian-regularized low-rank representation (MLLRR) was used to cluster samples [19]. Cui proposed an improved graph-based method to predict drug-target interactions [20]. Liu et al. introduced the contributions of deep neural networks, deep graph embedding, and graph neural networks, along with the opportunities and challenges they face [21]. Wu et al. proposed a multigraph learning algorithm called gMGFL that searches for and chooses a group of decision subgraphs as features to transfer bags and bag labels to the instance level [22].

Recently, sparse regularization has played a very important role in data analysis. The $L_0$-norm, $L_1$-norm, and $L_{2,1}$-norm are all typical sparse regularization methods. Among these sparse constraints, $L_{2,1}$-norm regularization stands out in terms of computational time and performance. The $L_{2,1}$-norm can obtain a projection matrix that is sparse in rows so as to learn discriminative features in the subspace. Zhang used the $L_{2,1}$-norm constraint on the coefficients to ensure that they are sparse in rows [23]. The $L_{2,1}$-norm was also applied to the predictor to ensure that it is robust to noise and outliers [24].

Considering the roles of graph regularization and $L_{2,1}$-norm constraints in matrix factorization, we propose joint nonnegative matrix factorization based on sparse and graph Laplacian regularization (SG-jNMF). SG-jNMF can make the best of the potential associations and complementary information among multiomics data. The main highlights of this approach are as follows.

(1) Graph regularization is incorporated into the joint nonnegative matrix factorization model, and undirected graphs are constructed for the input data. Local graph regularization preserves the local geometrical structure of the data space. Therefore, SG-jNMF can use the low-dimensional characteristics of the observed data to find intrinsic laws and improve the performance of the integrated analysis.

(2) $L_{2,1}$-norm regularization treats each row of the matrix as a whole and enhances the sparsity among the rows. Therefore, involving the $L_{2,1}$-norm can remove redundant features and noise in the data and further reveal a clear cluster structure.

(3) Two forms of SG-jNMF are proposed. SG-jNMF1 projects multiomics data into a fusion feature space; the fusion matrix contains the complementary and differential information provided by multiomics data, so that more accurate results can be obtained when identifying Co-DEGs. SG-jNMF2 projects multiomics data into a common sample space, which results in more accurate clustering.

The rest of this paper is organized as follows. In Section 2, we start with a brief review of jNMF; we then introduce the SG-jNMF method, its optimization process, and a computational complexity analysis. Section 3 presents the experimental results of clustering and feature selection. Finally, we summarize the paper and give some suggestions for future work in Section 4.

2. Materials and Methods

2.1. Joint Nonnegative Matrix Factorization

The jNMF method was first proposed by Zhang et al. [5]. It projects multiple input data matrices into a common subspace so that the information in each input dataset can be integrated for analysis. Each type of genomic data, as original data, is denoted as $X_i$ ($i = 1, 2, \ldots, I$), where $I$ is the number of input data matrices. $W$ is the common basis matrix, and $H_i$ is the corresponding coefficient matrix. The objective function of jNMF can be written as

$$\min_{W, H_i} \sum_{i=1}^{I} \left\| X_i - W H_i \right\|_F^2, \quad \text{s.t. } W \ge 0,\ H_i \ge 0,$$

where $\|\cdot\|_F$ denotes the Frobenius norm.

Obviously, jNMF is the same as NMF when $I = 1$. Therefore, jNMF is a generalization of NMF to multiple input datasets. Similar to NMF, multiplicative update rules are used to minimize the objective function. $W$ and $H_i$ are iteratively updated according to the following rules:

$$W_{jk} \leftarrow W_{jk} \frac{\left(\sum_{i=1}^{I} X_i H_i^{T}\right)_{jk}}{\left(W \sum_{i=1}^{I} H_i H_i^{T}\right)_{jk}}, \qquad (H_i)_{jk} \leftarrow (H_i)_{jk} \frac{\left(W^{T} X_i\right)_{jk}}{\left(W^{T} W H_i\right)_{jk}}.$$

The jNMF method can thus be used to integrate and analyze multiomics data. It simultaneously decomposes the multiomics data matrices into multiple independent coefficient matrices and a common fusion matrix, projecting high-dimensional omics data into a low-dimensional space. Therefore, the abundant differential and complementary information in cancer multiomics data can be used efficiently, and multiple omics datasets are analyzed simultaneously to reveal hidden information with biological significance.
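To make the factorization above concrete, the following is a minimal NumPy sketch of the standard jNMF multiplicative updates reviewed in this section; the function name `jnmf`, its argument names, and the fixed iteration count are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def jnmf(X_list, rank, n_iter=200, eps=1e-10, seed=0):
    """Minimal jNMF sketch: factor each X_i ~ W @ H_i with a shared basis W.

    X_list : list of nonnegative arrays, all with the same number of rows.
    rank   : size of the common low-dimensional space (K).
    """
    rng = np.random.default_rng(seed)
    m = X_list[0].shape[0]
    W = rng.random((m, rank)) + eps
    Hs = [rng.random((rank, X.shape[1])) + eps for X in X_list]

    for _ in range(n_iter):
        # Update the shared basis W using all data matrices jointly.
        num = sum(X @ H.T for X, H in zip(X_list, Hs))
        den = W @ sum(H @ H.T for H in Hs) + eps
        W *= num / den
        # Update each coefficient matrix H_i with W fixed.
        for i, X in enumerate(X_list):
            Hs[i] *= (W.T @ X) / (W.T @ W @ Hs[i] + eps)
    return W, Hs
```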

2.2. Joint Nonnegative Matrix Factorization Based on Sparse and Graph Laplacian Regularization

Manifold learning has become a popular research topic in information science since it was first proposed in Science in 2000 [25, 26]. Assuming that the data are uniformly sampled from a low-dimensional manifold embedded in a high-dimensional space, manifold learning can find the low-dimensional structure in the high-dimensional space and obtain the corresponding embedding mapping. Manifold learning looks for the essence of things from observed phenomena and finds the internal laws of the data. The manifold assumption states that data points that are geometrically adjacent usually have similar characteristics. Therefore, an undirected weighted network/graph $G = (V, E, S)$ is constructed, where $V$ is the vertex set, $E$ is the edge set, and $S$ is the weight set. An edge weight $S_{jk}$ is associated with each edge $(v_j, v_k)$ in $E$. The graph regularization with respect to a low-dimensional representation $H = [h_1, h_2, \ldots, h_n]$ is as follows:

$$R = \frac{1}{2} \sum_{j,k=1}^{n} \left\| h_j - h_k \right\|_2^2 S_{jk} = \operatorname{tr}\!\left(H L H^{T}\right),$$

where $\operatorname{tr}(\cdot)$ is the trace of a matrix, $L$ is the graph Laplacian matrix, and $L = D - S$. $D$ is a diagonal matrix and $D_{jj} = \sum_{k} S_{jk}$. Intuitively, the smaller the value of $\|h_j - h_k\|_2^2$ is, the closer the two data points are. By minimizing $R$, we can obtain a sufficiently smooth mapping function on the data manifold.
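As an illustration of this construction, the sketch below builds a $p$-nearest-neighbor graph and the corresponding Laplacian $L = D - S$ with scikit-learn; a simple 0/1 connectivity weighting is assumed here because the text does not specify the edge-weighting scheme, and the helper name `knn_laplacian` is our own.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_laplacian(X, p=3):
    """Build a symmetric p-nearest-neighbor graph on the rows of X and
    return (S, D, L) with L = D - S, as used in the graph regularization term."""
    S = kneighbors_graph(X, n_neighbors=p, mode='connectivity',
                         include_self=False).toarray()
    S = np.maximum(S, S.T)        # symmetrize: keep an edge if either point selects it
    D = np.diag(S.sum(axis=1))    # degree matrix, D_jj = sum_k S_jk
    L = D - S                     # unnormalized graph Laplacian
    return S, D, L
```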

To decrease the influence of noise and outliers in real data, sparse regularization is usually used to penalize the coefficient matrix. The $L_0$-norm, $L_1$-norm, and $L_{2,1}$-norm are all typical sparse regularization methods. The solution of the $L_0$-norm is an NP-hard problem. The $L_1$-norm is widely used because it has better optimization characteristics than the $L_0$-norm. The $L_1$-norm tends to retain a small number of features, while the coefficients of the other features are all 0; therefore, it can be used for feature selection. However, $L_1$-norm regularization is usually time-consuming. $L_{2,1}$-norm regularization on the coefficient matrix can generate a row-sparse result, and the calculation of the $L_{2,1}$-norm is simple and convenient [23]. In this article, the $L_{2,1}$ penalty is incorporated into SG-jNMF [27]. The $L_{2,1}$-norm of a matrix $W \in \mathbb{R}^{m \times k}$ is defined as

$$\left\| W \right\|_{2,1} = \sum_{j=1}^{m} \sqrt{\sum_{l=1}^{k} W_{jl}^2} = \sum_{j=1}^{m} \left\| w_{j\cdot} \right\|_2,$$

where $w_{j\cdot}$ denotes the $j$th row of $W$.
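For reference, the row-wise sparsity penalty can be evaluated in a couple of lines; this small helper simply computes the definition above and is not tied to any particular implementation of SG-jNMF.

```python
import numpy as np

def l21_norm(W):
    """L2,1 norm of W: the sum of the Euclidean norms of its rows."""
    return np.sqrt((np.asarray(W) ** 2).sum(axis=1)).sum()
```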

2.2.1. SG-jNMF1

There are two forms of the SG-jNMF method in this article. As shown in Figure 1, the SG-jNMF1 method projects multiomics data into a common feature space. Graph regularization and a sparse penalty are applied to the fusion feature matrix. The feature matrix is constrained by graph regularization so that as much of the intrinsic geometric information of the original multiomics data is preserved as possible. The $L_{2,1}$-norm is used to constrain the feature matrix to reduce the influence of outliers and noise, and the objective function of the joint nonnegative matrix factorization is constructed accordingly. The optimization problem can be expressed as

$$\min_{W, H_i} \sum_{i=1}^{I} \left\| X_i - W H_i \right\|_F^2 + \lambda_1 \operatorname{tr}\!\left(W^{T} L W\right) + \lambda_2 \left\| W \right\|_{2,1}, \quad \text{s.t. } W \ge 0,\ H_i \ge 0,$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters and $L$ is the Laplacian matrix, $L = D - S$, in which $S$ is the symmetric weight matrix constructed in the graph regularization. $D$ is a diagonal matrix, and its diagonal elements are equal to the sums of the corresponding rows (or columns) of $S$; i.e., $D_{jj} = \sum_{k} S_{jk}$.

With randomly initialized positive matrices $W$ and $H_i$, the following update rules are executed until the algorithm converges:

$$W_{jk} \leftarrow W_{jk} \frac{\left(\sum_{i=1}^{I} X_i H_i^{T} + \lambda_1 S W\right)_{jk}}{\left(W \sum_{i=1}^{I} H_i H_i^{T} + \lambda_1 D W + \lambda_2 Q W\right)_{jk}}, \qquad (H_i)_{jk} \leftarrow (H_i)_{jk} \frac{\left(W^{T} X_i\right)_{jk}}{\left(W^{T} W H_i\right)_{jk}},$$

where $Q$ is a diagonal matrix, its diagonal elements are $Q_{jj} = 1 / \left(2 \left\| w_{j\cdot} \right\|_2 + \epsilon\right)$ with $w_{j\cdot}$ the $j$th row of $W$, and $\epsilon$ is an infinitesimal positive number.
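The updates can be organized into a single alternating loop. The sketch below follows the update rules as reconstructed above (graph and $L_{2,1}$ penalties on the shared matrix $W$) and reuses the `knn_laplacian` helper from Section 2.2; parameter names such as `lam1` and `lam2` and the fixed iteration count are assumptions for illustration, not the authors' reference code. The SG-jNMF2 form is symmetric, with the penalties moved to the shared sample matrix.

```python
import numpy as np

def sg_jnmf1(X_list, graph, rank, lam1=1.0, lam2=0.1, n_iter=200,
             eps=1e-10, seed=0):
    """Sketch of SG-jNMF1: shared feature matrix W with graph + L2,1 penalties.

    X_list : nonnegative arrays with the same number of rows (features).
    graph  : (S, D, L) matrices over the features, e.g. from knn_laplacian.
    """
    S, D, _ = graph
    rng = np.random.default_rng(seed)
    m = X_list[0].shape[0]
    W = rng.random((m, rank)) + eps
    Hs = [rng.random((rank, X.shape[1])) + eps for X in X_list]

    for _ in range(n_iter):
        # Row-wise reweighting that realizes the L2,1 penalty on W.
        q = 1.0 / (2.0 * np.sqrt((W ** 2).sum(axis=1)) + eps)
        num = sum(X @ H.T for X, H in zip(X_list, Hs)) + lam1 * (S @ W)
        den = (W @ sum(H @ H.T for H in Hs)
               + lam1 * (D @ W) + lam2 * (q[:, None] * W) + eps)
        W *= num / den
        for i, X in enumerate(X_list):
            Hs[i] *= (W.T @ X) / (W.T @ W @ Hs[i] + eps)
    return W, Hs
```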

2.2.2. SG-jNMF2

As seen from Figure 1, the SG-jNMF2 method projects multiomics data into a common sample space, and the constraints are enforced on the common sample matrix. This form can be used to cluster multiomics samples. The model can be expressed as

$$\min_{W_i, H} \sum_{i=1}^{I} \left\| X_i - W_i H \right\|_F^2 + \lambda_1 \operatorname{tr}\!\left(H L H^{T}\right) + \lambda_2 \left\| H \right\|_{2,1}, \quad \text{s.t. } W_i \ge 0,\ H \ge 0,$$

where $H$ is the common sample matrix and $W_i$ is the basis matrix corresponding to the $i$th omics dataset.

Similarly, the algorithm iterates until it converges according to the following rules:

$$(W_i)_{jk} \leftarrow (W_i)_{jk} \frac{\left(X_i H^{T}\right)_{jk}}{\left(W_i H H^{T}\right)_{jk}}, \qquad H_{jk} \leftarrow H_{jk} \frac{\left(\sum_{i=1}^{I} W_i^{T} X_i + \lambda_1 H S\right)_{jk}}{\left(\sum_{i=1}^{I} W_i^{T} W_i H + \lambda_1 H D + \lambda_2 Q H\right)_{jk}},$$

where $L$ is the Laplacian matrix, $L = D - S$, $S$ is the symmetric weight matrix constructed in the graph regularization, and $D$ is a diagonal matrix whose diagonal elements are equal to the sums of the corresponding rows (or columns) of $S$; i.e., $D_{jj} = \sum_{k} S_{jk}$. $Q$ is a diagonal matrix, and its diagonal elements are $Q_{jj} = 1 / \left(2 \left\| h_{j\cdot} \right\|_2 + \epsilon\right)$. Obviously, the objective functions of the two forms of SG-jNMF are both nonconvex, so we obtain locally optimal solutions by minimizing the objective functions with alternating updates. The optimization process is shown as follows.

2.3. Optimization of SG-jNMF

Since the optimization processes of the two forms of SG-jNMF are very similar, we only present that of the first form (SG-jNMF1). We use alternating multiplicative update rules to solve the optimization problem. Specifically, the following update steps are repeated until the algorithm converges.

2.3.1. Optimization of $W$

When $H_i$ is fixed, the optimization of $W$ is performed by minimizing the following objective function:

$$\mathcal{O}(W) = \sum_{i=1}^{I} \left\| X_i - W H_i \right\|_F^2 + \lambda_1 \operatorname{tr}\!\left(W^{T} L W\right) + \lambda_2 \left\| W \right\|_{2,1}, \quad \text{s.t. } W \ge 0.$$

The corresponding Lagrangian function is as follows:

$$\mathcal{L} = \sum_{i=1}^{I} \left\| X_i - W H_i \right\|_F^2 + \lambda_1 \operatorname{tr}\!\left(W^{T} L W\right) + \lambda_2 \operatorname{tr}\!\left(W^{T} Q W\right) + \operatorname{tr}\!\left(\Psi W^{T}\right) + \sum_{i=1}^{I} \operatorname{tr}\!\left(\Phi_i H_i^{T}\right),$$

where $\Psi$ and $\Phi_i$ are the Lagrangian multipliers of the constraints $W \ge 0$ and $H_i \ge 0$, respectively, and the $L_{2,1}$ term is rewritten as $\operatorname{tr}(W^{T} Q W)$ with the diagonal matrix $Q$ defined above. Next, we take the first partial derivative of this Lagrangian function with respect to $W$:

$$\frac{\partial \mathcal{L}}{\partial W} = -2 \sum_{i=1}^{I} X_i H_i^{T} + 2 W \sum_{i=1}^{I} H_i H_i^{T} + 2 \lambda_1 L W + 2 \lambda_2 Q W + \Psi.$$

According to the KKT conditions [28], $\Psi_{jk} W_{jk} = 0$, and the following updating rule can be obtained:

$$W_{jk} \leftarrow W_{jk} \frac{\left(\sum_{i=1}^{I} X_i H_i^{T} + \lambda_1 S W\right)_{jk}}{\left(W \sum_{i=1}^{I} H_i H_i^{T} + \lambda_1 D W + \lambda_2 Q W\right)_{jk}}.$$

2.3.2. Optimization of $H_i$

When $W$ is fixed, the optimization of $H_i$ is performed by minimizing the following objective function:

$$\mathcal{O}(H_i) = \left\| X_i - W H_i \right\|_F^2, \quad \text{s.t. } H_i \ge 0.$$

The corresponding Lagrangian function is as follows:

$$\mathcal{L}(H_i) = \left\| X_i - W H_i \right\|_F^2 + \operatorname{tr}\!\left(\Phi_i H_i^{T}\right),$$

and $H_i$ runs to convergence according to the following formula:

$$(H_i)_{jk} \leftarrow (H_i)_{jk} \frac{\left(W^{T} X_i\right)_{jk}}{\left(W^{T} W H_i\right)_{jk}}.$$

2.4. Convergence and Running Time

In this paper, we also demonstrate the convergence of the method through experiments. Taking the pancreatic adenocarcinoma (PAAD) dataset as an example, the convergence curves of the five methods are shown in Figure 2. The error function used in this article is defined as follows:

$$\mathrm{error} = \sum_{i=1}^{I} \left\| X_i - W H_i \right\|_F^2.$$

Compared with the other four methods, SG-jNMF converges to the smallest error value at the fastest rate.

In addition, we tested the running time of the above methods on the PAAD dataset. The mean running times of these five methods over 10 runs on a PC are shown in Table 1. As seen in Table 1, iGMFNA has the shortest running time, followed by SG-jNMF. This is due to the introduction of sparse constraints in SG-jNMF. The running times of the iNMF, iGMFNA, jNMF, and SG-jNMF methods are all satisfactory.

2.5. Computational Complexity Analysis

In this part, we discuss the extra computational complexity of SG-jNMF compared with jNMF. We use the big $O$ notation to represent the computational complexity of an algorithm. On the basis of the multiplicative update rules of jNMF, we can easily count the arithmetic operations of each iteration. Obviously, the cost of each iteration of jNMF is $O(IMNK)$, where $I$ is the number of input data matrices, $M$ and $N$ are the numbers of rows and columns of each data matrix, and $K$ is the factorization rank. It should be noted that $S$ is a sparse matrix for SG-jNMF. In addition to the multiplicative updates, constructing a $p$-nearest-neighbor graph requires $O(N^2 M)$ operations [28]. Assume that the updates stop after $T$ iterations; then the overall cost of jNMF is $O(TIMNK)$, and the overall cost of SG-jNMF is $O(TIMNK + N^2 M)$.

3. Results and Discussion

3.1. Data Processing

The TCGA project includes gene expression profile data, DNA methylation data, copy number variation data, protein expression data, drug sensitivity data, and so on. In-depth study of these data can help us understand the mechanisms of cancer occurrence and development and provide technical support for the prevention, diagnosis, and treatment of cancer. In this article, four cancer datasets downloaded from TCGA (https://tcgadata.nci.nih.gov/tcga/), namely, PAAD, esophageal carcinoma (ESCA), cholangiocarcinoma (CHOL), and colon adenocarcinoma (COAD), are used in the experiments. Details are listed in Table 2. To avoid dimension mismatches during algorithm execution, the number of genes in the four datasets is aligned to 19,876. First, RPCA is used to reduce the effects of noise and redundant information [29]. Second, the same numbers of samples and features are retained for the multiomics data of the same kind of cancer. Then, the matrices are normalized according to the standard deviation of the data such that each element of the matrix lies between 0 and 1.
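The normalization step can be sketched as follows. Because the text only loosely describes it ("according to the standard deviation ... between 0 and 1"), this helper standardizes each gene by its standard deviation and then applies a min-max rescaling so every element lies in [0, 1]; both the function name and this two-step reading are assumptions.

```python
import numpy as np

def normalize_omics(X):
    """Scale an omics matrix (genes x samples) so that all entries lie in [0, 1].

    Each gene (row) is first divided by its standard deviation, and the whole
    matrix is then min-max rescaled to the unit interval (an assumed reading
    of the normalization described in the text).
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=1, keepdims=True)
    X = X / np.where(std > 0, std, 1.0)          # avoid division by zero
    return (X - X.min()) / (X.max() - X.min() + 1e-12)
```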

3.2. Clustering

Because the SG-jNMF2 method projects multiomics data into a common sample space, the resulting sample matrix contains all the sample information provided by the input multiomics data. To assess the clustering performance of this method, SG-jNMF2 is used to cluster the tumor samples in the CHOL, PAAD, COAD, and ESCA datasets. Four comparison methods (iNMF, iONMF, iGMFNA, and jNMF) perform the same experiments on the same datasets.

3.2.1. Selection of Parameters

For SG-jNMF2, the clustering performance is affected by the regularization parameters. In this experiment, we empirically set the same parameter values for the different omics data from the same cancer [30]. Therefore, there are three parameters, $\lambda_1$, $\lambda_2$, and $p$, that need to be adjusted: $\lambda_1$ is the graph regularization parameter, $\lambda_2$ controls the sparsity of the factorization, and $p$ is the number of nearest neighbors used to construct the undirected graph on the data manifold. From Figure 3, when $p$ is set to 3, the accuracy on the four datasets reaches a maximum. As seen from Figure 4, $\lambda_1$ should be set to 1,000 on PAAD. When $\lambda_1$ is equal to 0.1, the accuracy on COAD achieves its maximum. The accuracy on CHOL achieves its maximum at several values of $\lambda_1$, including 1, and the accuracy on ESCA also achieves its maximum at a particular value of $\lambda_1$. From Figure 5, the accuracy on PAAD reaches its maximum over a range of $\lambda_2$ values, and suitable ranges also exist for ESCA and COAD; for CHOL, the value of $\lambda_2$ does not matter much.

3.2.2. Evaluation Indicators

Several indicators are used to evaluate the clustering performance of SG-jNMF2: accuracy, recall, precision, and F1-score. Accuracy is defined as

$$\mathrm{ACC} = \frac{\sum_{i=1}^{N} \delta\!\left(y_i, \operatorname{map}(c_i)\right)}{N},$$

where $N$ is the total number of samples in the dataset and $\delta(x, y)$ is an indicator function: when $x$ is equal to $y$, the value of the function is equal to 1; otherwise, it is equal to 0. $\operatorname{map}(\cdot)$ maps the clustering label $c_i$ to the real label $y_i$. The other three indicators used to evaluate the clustering performance are defined as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ denotes the number of false negatives.
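The label mapping $\operatorname{map}(\cdot)$ can be computed with the Hungarian algorithm, a common choice for clustering accuracy even though the paper does not name the assignment procedure; the sketch below makes that assumption explicit and uses SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Clustering accuracy: map predicted cluster labels to true labels with the
    Hungarian algorithm, then count the matched samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # Contingency table: rows = predicted clusters, columns = true classes.
    table = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            table[i, j] = np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(-table)   # maximize the total agreement
    return table[row, col].sum() / len(y_true)
```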

3.2.3. Results

In this experiment, each algorithm was run fifty times to reduce the impact of random initialization on the clustering results. We compared the accuracy, recall, precision, and F1-score of the four comparison methods with those of SG-jNMF2. The means and variances of the results are shown in Table 3. As seen in Table 3, SG-jNMF2 achieves the highest values on the four indicators mentioned above, except for the recall value on the ESCA dataset. The contributions of the sparse and graph regularization constraints are listed in Table 4. Performance improvements are measured by the relative change $(v_{\mathrm{SG}} - v_{\mathrm{cmp}}) / v_{\mathrm{cmp}}$, where $v_{\mathrm{SG}}$ is the indicator value of SG-jNMF and $v_{\mathrm{cmp}}$ is that of the comparison method. In particular, the sparse constraints improve the accuracy by 49.70%, and the sparse and graph regularization constraints together improve the accuracy by 78.87% on the PAAD dataset. Recall and F1-score achieve improvements of more than 50% on the CHOL dataset. When the sparse constraints are introduced, only the recall on ESCA is reduced, by 0.53%; the results on the other datasets improve to varying degrees. In summary, the performance of integrative NMF in analyzing multiomics data is greatly improved by introducing sparse and graph regularization constraints.

3.3. Identifying Co-DEGs

First, three matrices (the DM, GE, and CNV data of PAAD) are input into the SG-jNMF1 model and projected into a common feature space. Second, we sum the common feature matrix across its rows. Finally, we sort the elements of the resulting vector in descending order and select the top 100 genes as Co-DEGs. These 100 genes are compared with the pancreatic cancer genes exported from GeneCards (http://www.genecards.org). Co-DEGs with relevance scores above 4 are listed in Table 5. CDKN2A is frequently mutated or deleted in many tumors and plays an important role as a tumor suppressor gene. Studies have shown that mutation of CDKN2A is closely related to the development of familial pancreatic cancer [31]. It is frequently seen in many tumors that mutation and overexpression of CCND1 can alter the progression of the cell cycle. Wang et al. identified pancreatitis-associated genes and found that CCND1 is involved in the pancreatic cancer pathway [32]. Research on transcriptome sequencing shows that PTF1A maintains the expression of genes in all cellular processes, and deletion of PTF1A leads to an imbalance, cell damage, and acinar metaplasia, which is directly related to the development of pancreatic cancer [33]. Scientists have also explored the effects of GRP on human intestinal and pancreatic peptides. Therefore, SG-jNMF1 can effectively integrate the information of multiomics data to identify Co-DEGs closely related to the disease.
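The three-step ranking described above amounts to a row-wise sum of the fusion feature matrix followed by a sort; the helper below sketches it, assuming `gene_names` is a list aligned with the rows of the matrix $W$ returned by the SG-jNMF1 factorization.

```python
import numpy as np

def top_co_degs(W, gene_names, top_n=100):
    """Rank genes by the row sums of the fusion feature matrix W and
    return the top_n candidates as Co-DEGs."""
    scores = np.asarray(W).sum(axis=1)   # aggregate each gene's loadings over all factors
    order = np.argsort(scores)[::-1]     # descending order of the summed loadings
    return [gene_names[i] for i in order[:top_n]]
```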

We also use SG-jNMF1 to integrate three gene expression datasets from ESCA, CHOL, and COAD to identify Co-DEGs associated with all three diseases. Some of the identified Co-DEGs and their relevance scores with ESCA, CHOL, and COAD are shown in Table 6. The relevance score of CHEK2 with ESCA is as high as 77.66, and allelic variation in CHEK2 has a strong relationship with the risk of esophageal cancer [34]. The relevance score of CHEK2 with COAD is 29.65, and germline variation in CHEK2 is also closely related to the risk of colorectal cancer [35]. Frequent mutations in BRPA have been widely reported in human malignancies, including esophageal cancer, cholangiocarcinoma, and colon cancer [36–38]. This provides a computational method for the study of Co-DEGs shared by multiple diseases.

4. Conclusions

In this paper, we propose an integrative matrix factorization method (SG-jNMF) for analyzing heterogeneous multiomics data. The method jointly projects multiomics data matrices into a common low-dimensional space. Two forms of SG-jNMF enable multiomics data to be analyzed from both the sample and feature perspectives. This integrative analysis method takes the local associations of the data into account and decreases the interference of noise and redundant information in heterogeneous multiomics data. Experimental results show that the new method is superior to existing methods in analyzing heterogeneous multiomics data. Another significant advantage of SG-jNMF is that it can flexibly handle multiple input datasets of various types: the input data can be different types of data (GE, DM, CNV, etc.) for the same disease or the same type of data for different diseases. We can use this method to identify Co-DEGs associated with a particular disease and to detect common Co-DEGs associated with several diseases. This provides an efficient computational method for biological and medical research. Next, we will use the correlations among Co-DEGs to build a gene coexpression network and further study the functions of gene modules and related pathways.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the grants from the National Natural Science Foundation of China, nos. 61902215 and 61702299.