Abstract

Gene expression network is also a type of complex network. It is challenging to analyze the gene expression network through relevant knowledge and algorithms of a complex network. In this paper, the existing characteristics of genes are analyzed from various indexes of the gene expression network to analyze key genes and TOP genes. Firstly, gene chip data are screened, gene data with obvious characteristics are selected, and relevant clustering characteristics are analyzed. Then, the complex gene network structure is established, and gene networks with different threshold shapes and different sizes are selected. Finally, the relevant indexes and PR values after the PageRank algorithm are analyzed for complex networks under different thresholds, thus establishing the TOP gene and PR sequence.

1. Introduction

With the development of gene chip and second-generation sequencing and the emergence of new technologies, the analysis of the human genetic structure has become a reality, the association information between genes is analyzed, being able to express genes related to key genes, and it is possible to analyze the correlation between genes and diseases more effectively. Establishing a biological network can obviously discover and analyze the correlation and influence between expression elements in various biological systems. It makes it possible to study the characteristics of organisms at a certain level [1]. Unlike the research on inherent single molecules, the information at the whole system level can be displayed by establishing networks.

At present, molecular biology network analysis tools in this field mainly include metabolic network [2], protein-protein interaction network [3], gene coexpression network [4], gene regulatory network [5], and signal transmission network [6]. In order to expand the scope and depth of the current research, the transition from a single molecule to a network system level becomes the next major development direction of interactive networks because of, in general, the complex biological phenomena and pathology that cannot be caused by only one factor. Therefore, as a tool to explore complex pathological and life phenomena, the interactive network provides an effective analysis method, to perfect the comprehensive research. These networks have great research value, among them; the gene coexpression network has irreplaceable research characteristics in some aspects. Whether the amount of gene expression can establish the correlation of related diseases, the expression network can establish the complex network, and it is challenging to analyze the gene correlation with the related technologies of the complex network.

The reconstruction algorithm of the gene regulatory network and the exploration of a series of related models began in the middle of the 20th century. Raterproposed the system and characteristics to control the interaction and interaction between all genes in prokaryotic cells exploring the mutual influences and interactions between genes in prokaryotic cells, as well as system relationships and characteristics. In 1969, Kauffman discovered and described the exploration research with epoch-making significance, using popular binary logic rules to stipulate the basic construction model and reconstruction algorithm of gene network [79]. D’Haeseleer et al. [10] were the first to use ordinary differential equations to do some research on time series data. Gardner et al. [11] were the first to prove that, by using their multivariate regression network identification (NIR) algorithm, the steady state measurements can be used to infer the network structure. They studied a data set, overexpression of specific genes in bacterial models with plasmids, when the gene expression level reaches a new steady state value, using its measured values. This method has been suggested for transcriptase data sets including siRNA knockout experiments [12]. In the past decades, numerous computational methods have been proposed to infer gene regulatory networks. These methods can be roughly divided into coexpression-based methods [13], supervised learning method [14], the model-based method [15], and the information theory-based method [16, 17] with lower computational complexity. However, direct correlation and model system dynamics cannot be inferred. Based on the supervised learning method, we mainly use the known rules to infer gene regulatory networks on genome-wide data, such as SEREND [14], GENIES [15], and SIRENE [18], but all need additional information of regulatory relationships to cultivate models. By guiding reasoning from prior information of known rules, we can achieve higher accuracy and be superior to many other methods.

Constructing complex gene expression networks and using large-scale gene expression data sets for network analysis are effective methods to reveal new biological knowledge. However, methods for gene association in the construction of these coexpression networks have not been thoroughly evaluated. Because different methods lead to different coexpression networks with different structures and provide different information, it is very important to select the best gene association method. Zhang et al. [19] proposed the identification of protein complexes in protein interaction network (IPC-RPIN), which effectively fused topological structure, gene expression profile, and GO functional annotation information to achieve gene-protein expression complex. Hua et al. [20] proposed the fusion research of three commonly used reasoning algorithms to establish a genome-scale and high-quality gene coexpression network. After applying this expression network to monocotyledonous plant rice, the network quality has been verified and evaluated through the selected gene function association data sets, which is obviously superior to other methods. Li et al. [21] predicted subcellular location information by integrating time-history gene expression data with the spatial and temporal active protein interaction network (ST-APIN). In order to evaluate the efficiency of the proposed method, the commonly used classical clustering algorithm has obvious advantages in identifying protein complexes in ST-APIN and the other three dynamic PIN.

3.1. Gene Degree

Degree is an important index to describe the attributes of nodes. The degree of nodes refers to the measurement that genes are associated with genes. In directed complex networks, the degree of nodes can be divided into exit degree and entry degree, which indicates the number of genes pointing to other genes and the number of other genes pointing to the gene. The average degree of a complex network reflects the density program of the network, and the degree distribution can describe the importance of different genes through the number of genes and gene connections. When the scale of gene expression network is very large, the degree distribution of genes can fully display the distribution law of genes and can identify and distribute different types of networks.

The distribution of gene complex network degree is described as follows:(A)Gene distribution function : It indicates the proportion of gene x in the gene expression network in the whole expression network.(B)Cumulative degree distribution function: The description degree is not less than the probability distribution of gene x, and the distribution function is as follows:

If the gene degree of the gene expression complex network obeys the distribution function of the power index , it is said that the gene complex network obeys the scale-free network of power-law distribution, while the accumulation degree distribution function obeys the power-law distribution of power .

3.2. Gene Network Density

Gene network density is used to describe the density of connection edges between genes in the gene network and is defined as the ratio of the actual number of gene connections in the gene network to the upper limit of the number of gene connections that can be accommodated. Dense programs and dynamic evolution laws used to measure the relationship between gene expression networks in gene complex networks. There are N gene nodes and L edges in the gene complex network, and the density formula (2) of the gene complex network is defined as follows:

The density range of the gene complex network is [0, 1]. When , the network is all connected. When , there are no connections in the network. In the actual gene network, the network with density 1 does not exist. In most networks, it is around 0.5. The density of large-scale networks is smaller than that of small-scale networks, and the density of networks of different sizes cannot be directly compared. The absolute density formula (3) can be used to compare the network densities of different sizes:

In the above, D represents the diameter of the network, R represents the radius, and S represents the circumference of the network.

3.3. Aggregation Coefficient of Gene Network

Gene network clustering coefficient is used to describe the degree of node neighbors associated with gene nodes in a gene network. In the gene network, the node and the gene network aggregation coefficient represent the probability of interconnection with its adjacent gene nodes. is used to represent the number of neighbors interconnected by gene nodes , and is used to represent the number of interconnected neighbors existing in . represents the maximum number of interconnections, and equation (4) represents the aggregation coefficient of gene node :

The aggregation coefficient of gene nodes is vividly described in the gene complex network, which can be understood as the correlation between one gene node and another gene node, the probability of connection between this gene node and the neighbor of another gene also exists, and the probability comparison of correlation exists, so it has strong aggregation in the gene complex network. The aggregation of the whole gene network is evaluated by the average aggregation coefficient. The average aggregation coefficient of the gene complex network is defined as the average aggregation coefficient of all gene nodes in the gene network. Equation (5) is described as follows:where represents the number of gene nodes, represents the aggregation coefficient of gene node , and its value range is [0, 1]. When , there are no connected gene nodes in the gene network; when, the gene network node is all connected.

The average aggregation coefficient in the gene expression network describes the probability of connection between any associated genes in the gene network and reflects the closeness of the node relationship of the whole gene.

4. Key Gene Determination Method Based on PageRank Algorithm

4.1. Construction of Complex Network for Gene Expression

Gene chip expression matrix reflects different gene expression levels of different sequencing samples, which can be described by matrix, where represents the expression level of the nth gene in m samples:

In this paper, the WGCNA algorithm [22] is used to construct the gene expression network. The WGCNA algorithm is based on a scale-free network. Whether there is a correlation between the two genes can be expressed by the correlation coefficient, forming the following formula: represents the similarity of gene i and gene j in the expression amount of samples in the chip and M represents the similarity matrix in [0, 1]. In the construction of the gene expression matrix, the similarity is calculated through the expression of different genes in all samples. The similarity is an important measurement index to measure the strength between genes and is also the basis for the construction of complex networks. In the WGCNA algorithm, is measured by weighting using a soft threshold, and is used to represent the soft threshold, while the general network structure adopts a hard threshold. The gene expression network using a soft threshold is more prominent, and its strong correlation is more prominent. Equation (8) is as follows:

The weighting coefficient is defined according to scale network, and the correlation between of network node number K and of node occurrence probability in a scale-free network is required to reach more than 0.8. Considering the correlation between genes and genes, the correlation between the individual genes and genes is expressed by the adjacency matrix, and topological overlap measure (TOM) is expressed by the strong relationship between genes and genes:

Then, set the correlation threshold (generally referring to the hard threshold) and compare it with the value of TOM matrix. If the matrix value in TOM is greater than , it indicates that there is a correlation between gene i and gene j (which can be expressed by adjacency matrix ). If the matrix value in TOM is less than , it means that there is no correlation between gene i and gene j (which can be expressed by adjacency matrix ). The generation of gene coexpression network according to the above principles is also the basis of this study and provides the basic conditions for analyzing core genes.

4.2. Convergence of PageRank Algorithm

In the complex network of genes, there is a correlation between each gene and the genes related to it, but the influence of genes related to genes should also be considered. Through the correlation between genes, the PageRank algorithm is used to realize this correlation between genes, thus determining key genes.

Feature vector centrality and its variants are widely used. For example, the most famous PageRank algorithm [23] in the field of web page sorting is the core algorithm of Google search engine. At the initial time, each gene is given the same PR value, and then iteration is carried out, and the current PR value of each gene is divided equally to all genes it points to in each step. The new PR value of each gene is the sum of the PR values it obtains, so that the PR value of the node Pi at time t is defined as the following equation:where is the output of node , gene expresses the initial matrix of the correlation network, and iterates until the value of each node reaches stability.

The PageRank algorithm is applied in many fields. Through continuous iteration, a relatively stable value is reached. As the number of iterations increases, the error of the values generated before and after is expressed. When a reasonable error value is reached, the PageRank algorithm ends, as expressed by the following equation:

Theoretically, is determined according to the user’s experimental results. When the error comparison is required to be small, e comparison is set. When i− > o, e = 0.

The relevant literature of the PageRank algorithm does not give the relevant convergence proof process, only the iterative process is used to explain its convergence, and no relevant mathematical methods are used to explain it. The following author gives the relevant proof process.

Theorem 1. If the sum of the values of all columns of a gene correlation matrix is 1, then the matrix is a convergence matrix.
Proof: Let in which; it is sufficient to prove that in formula (12) converges.
Let . The eigenvalue of M is ,in which the sum of each column is 1 and, which is proved.
According to the disk theorem, has N disks whose N eigenvalues all fall on the plane:in which,.
In the matrix M, N disks of M given according to equation (14) areThe N eigenvalues of M′ all fall in the union set of n disks on the complex plane. The disk eigenvalues are seen as shown in Figure 1.
Therefore, when m = t is a convergence matrix, the proof is complete.
is in the white disk of Figure 1; we can look at N eigenvalues of M′, according to equation (13).
, where , so thatTherefore, when is a convergence matrix, the proof is complete.

4.3. Disk Analysis of Improved PageRank Algorithm

The above PageRank algorithm is based on the traditional method to prove convergence. Because the traditional l PageRank algorithm does not satisfy the characteristics of strong connectivity and traps, that is, some web pages do not have links and loops pointing to other web pages, there are links pointing to them or links not pointing to other web pages. The improved PageRank algorithm is

M satisfies the normalized matrix of the traditional PageRank algorithm. After takes 0.8, the form of A matrix is as follows:

At this time, the convergence of matrix A is proved with as the center of the circle, as shown in Figure 2.

As can be seen from Figure 2, the center of the improved PageRank algorithm is , and the radius is or , so that the disk can pass through the point of 1, but the disk does not pass through the point of −1, so the maximum eigenvalue of the improved PageRank transfer matrix takes 1 and the minimum value cannot take −1. Therefore, when is easier to converge.

4.4. Calculation of PR Value When

Through the above proof, the PageRank algorithm is convergent. How to find the PR value is a problem studied by many scholars, especially in the improved PageRank algorithm, there are a large amount of literature, and some scholars have done relevant work on how to quickly find the PR value and accelerate convergence. This section proposes a PR value calculation method when . When , the PR value is the actual value. Iterative calculation of the PR value through other PageRank algorithms is an approximate value. The accuracy cannot be compared with the PR value when . The following calculation process is given. The formula of the PageRank algorithm is as follows:

Let , when , , the formula became

Namely,

Theorem 2. When , the polynomials are linearly correlated.

Proof. The sum of the former n−1, .
When ,From the above, it can be seen that polynomial holds, and it is deduced that any N−1 correlation is equal to another term, so the above polynomial is linearly correlated.

Theorem 3. When , polynomials are linearly independent.

Proof. Assume that makes the first term represent the term; that is,According to Theorem 1, it is concluded that is equal on both sides:According to Theorem 1, it is concluded that is equal on both sides:n equal to equation (11).According to Theorem 1, the conclusion is .
Similarly, it is possible to obtain that all multiple terms are equal, which is contradictory to the problem, so are linearly independent.
According to the properties of the PageRank algorithm, takes any polynomial n−1 term and combines it with the above:After terms are added, the above N terms are linearly independent. Equation (27) can be transformed into matrix is reversible, and the value of R is

5. Experimental Analysis

5.1. Data Selection

The SVM-REF algorithm was used to select 2105 genes and 265 samples (198 TNBC, 67 no-TNBC). By establishing decision trees, we can see the distribution of decision trees under 2105 genes for each sample, as shown in Figure 3.

As can be seen from Figure 3, GSM1974605 has outliers under 2105 genes, and the GSM1974605 sample can be deleted. After deleting outlier samples, it can be seen that they are roughly divided into two categories, and the distinction is obvious, indicating that 2105 gene selection is reasonably classified under 197 samples.

Selecting soft threshold B plays a very important role in forming TOM gene association matrix. In this paper, the optimal threshold is obtained through continuous iteration of the threshold. Since the scale-free network atlas structure can reach 0.8 or the average connectivity can reach below 100 when , if the above requirements cannot be met, the experimental results will be greatly affected. Figure 4 is the distribution diagram of the threshold value and average connectivity.

As can be seen from Figure 4, at that time, the network map structure could reach more than 0.9, indicating that the threshold value of this interval is ideal and the optimal value is the most 8. However, the average connectivity is less than 100 at 3, so meets the requirements of the above conditions.

The TOM matrix is established. The TOM matrix is an index value representing the correlation between genes and is converted into a connection matrix through the TOM matrix. The relevant thresholds of TOM matrix are screened and the thresholds are introduced to form different complex gene networks after continuous changes. As that threshold value = 0. 2186, a complex network of gene is shown in Figure 5; as that threshold value = 0.2664, a complex network of gene is shown in Figure 6.

As can be seen from Figures 5 and 6, when the threshold B increases, the complex network diagram gradually decreases, the network structure becomes smaller, and the network community will appear. Smaller outlier gene networks will also appear so as B increases to a certain value, the network structure changes from a large-scale complex network to a small-scale network structure.

5.2. Analysis of Complex Network Structure

The analysis of complex network structure analyzes the structural changes of the gene complex network from the indexes of centrality, aggregation coefficient, central potential, and so forth. By introducing threshold to change the structure of the gene complex network, the network structure can be seen to change continuously by increasing the threshold , and the network changes from a large-scale structural model to a small-scale structural model. The specific effect is shown in Figures 710.

From Figures 710, it can be seen that, with the increase of , the structure of the gene network has changed greatly. With the increase of the performance of the network structure, the scale of the whole network is continuously decreasing, resulting in the continuous changes of various complex network parameters, the overall scale is small, and the network community is increasing, resulting in the continuous changes of key gene nodes. Then, PageRank is used to calculate network structures of different sizes, and PageRank values of genes are sorted. Under different threshold settings, the PageRank values of each gene node are shown in Tables 14.

Tables 14 show the PR values of genes in the gene complex network under different values. With the continuous increase of , the number of genes in the gene complex network is continuously decreasing, and the expressed gene network structure is also in different states. The values of gene PR values in different networks are also constantly changing, and the ranking of gene PR values is also constantly changing. In each threshold, it is the calculation of different screening networks, which can be determined according to the size of the network or the number of genes. When the number of selected genes is large, you can choose λ = 0.2345 or λ = 0.2505. If λ is relatively small, the number of selected genes will be relatively large, and there will be more key genes. When λ is larger, the relative number is small, but the key genes are few. Therefore, when choosing, you can choose several suitable values of λ for comparison. For example, a certain gene appears in several tables consecutively.

In Table 1 to Table 4, several TOP gene tables are ranked by PR value, and the top gene is the key gene. The PR values are given in the table, and the sum of the PR values of all genes is 1.

Figure 11 shows the changes of several gene values (CDK13, DSPP, HLA.G, LINC00304, TPX2, FOXM1) and screens several genes with a higher ranking.

As can be seen from Figure 11, as the value and PR value change continuously, the whole PR value increases continuously. As the network scale becomes smaller, the PR value of gene nodes increases continuously. Some genes become outliers because the network scale is constantly changing, and gene nodes will not have new PR values in the later screening process, resulting in PR values of 0. Therefore, different gene complex network structure models will produce different gene structures, and selecting an appropriate gene structure network will obtain different key genes.

The key genes can be analyzed from the above, and the expression significance of the key genes screened will be explained by biological significance. The expression significance of some key genes is shown in Table 5.

There are many key genes that can be regulated, and only some are listed in Table 5. Other regulated genes can be determined according to the ranking of PR values in Table 1 to Table 4.

6. Conclusion

In this paper, by establishing a complex gene network, statistical analysis is carried out through gene nodes, and then PR values are calculated for each gene node and ranked statistically to obtain gene key nodes. Under different thresholds, the PR value changes of different gene nodes are taken, and the appropriate gene network is selected for screening. The method proposed in this paper can transform the expression amount of gene expression network into a complex gene network and then into PR value, thus obtaining key genes. The next step of research work is to identify key genes from RNA-Seq analysis of expression in the second-generation sequencing and to identify key genes in combination with SNP, InDel, and other variants. It can also consider using other methods [24, 25] to solve the key genes.

Data Availability

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Conflicts of Interest

The authors declared that they have no conflicts of interest regarding this work.

Authors’ Contributions

Guobin Chen and Jun Qi contributed equally to this work.

Acknowledgments

This work was supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant no. KJZD-K201902101) and the Open Fund of Chongqing Key Laboratory of Spatial Data Mining, and Big Data Integration for Ecology and Environment, Humanities and Social Sciences Project of Rongzhi College of Chongqing Technology and Business University (Grant no. 20197004). The work was also supported by the project (no. cstc2018jxjl10001) from the Natural Science Foundation of Chongqing, the Project Platform Enhancement of Radiation and Cancer Biology Laboratory from Special Funds for Guiding Local Scientific and Technological Development by the Central Government of China, the Project Integrated Innovation and Application of Key Technologies for Precise Prevention and Treatment of Primary Lung Cancer (no. 2019ZX002) from Chongqing Municipal Health Committee, and the Project Technology Platform Construction of Next Generation Sequencing and Research on Clinical Translation from Chongqing Cancer Institute. The funders only provided financial support and did not have any additional role in the study design, data collection and analysis, decision to publish, or manuscript preparation.