Abstract

Fractal coding techniques are an effective tool for describing image textures. Existing image super-resolution (SR) methods perform poorly at large scale factors and reconstruct texture details incompletely. To address these shortcomings, we propose an SR method based on error compensation and fractal coding. First, quadtree coding is performed on the image, and the similarity between range blocks and domain blocks is established to determine the fractal code. Then, through this similarity relationship, the attractor is reconstructed by super-resolution fractal decoding to obtain an interpolated image. Finally, the fractal error of the fractal code is estimated by a deep residual network, and the estimated error image is added as an error compensation term to the interpolated image to obtain the final reconstructed image. The network consists of a deep network and a shallow network that are trained jointly. Residual learning is introduced to improve the convergence speed and reconstruction accuracy of the network. Comparisons with other state-of-the-art methods on the benchmark datasets Set5, Set14, B100, and Urban100 show that our algorithm achieves competitive performance quantitatively and qualitatively, with subtle edges and vivid textures. Images at large scale factors are also reconstructed better.

1. Introduction

The imaging process is often affected by factors such as downsampling, noise, and blur. Image SR reconstruction aims to recover a high-resolution (HR) image close to the real image from a poor-quality low-resolution (LR) image. SR is widely used in various fields, such as video surveillance, remote sensing, and medical imaging. Due to the uncertainty of the image degradation model and the nonuniqueness of the reconstruction constraints, SR is essentially an ill-posed problem [1]. There are three main approaches to solving the SR problem: interpolation-based, reconstruction-based, and learning-based approaches.

Interpolation-based methods, such as bilinear and bicubic interpolation, estimate unknown pixels in the HR image from known neighboring pixels. They are the most widely used because of their high speed and simplicity. However, they are prone to blur and jagged details. To compensate for the shortcomings of traditional methods, an edge-directed interpolation method [2] was proposed. Although the edge structure is reconstructed, noise and distortion are prone to occur [3]. Reconstruction-based approaches address this problem with prior knowledge, including edge priors [4], gradient priors [5], total variation, and Bayesian models. However, because they depend heavily on these priors, they become ineffective at large magnification factors. The basic idea of learning-based methods is to learn a mapping from LR images to HR images from training samples and then use it to predict HR images. The training samples can come either from within the LR image itself [6–8] or from external images [9–13]. A large number of learning algorithms based on Markov networks [9], locally linear embedding [10], sparse coding [11–13], and anchored neighbors [14, 15] have been proposed. However, these methods have limited ability to extract and express features during the learning process. In recent years, deep learning has been applied in many fields [16, 17], and many researchers have introduced convolutional neural networks (CNNs) into SR reconstruction to learn an end-to-end mapping between LR and HR images from external datasets; for example, SRCNN [18], VDSR [19], and EDSR [20] show superior performance. In particular, residual networks make the information flow between layers smoother, which allows the network depth to increase and improves the reconstruction. However, these methods still have considerable drawbacks. For example, when the image content does not match the sample dataset, edge irregularities are observed in the results [21].
Moreover, this approach tends to produce blurred and overly smooth outputs that lack detailed information. Current methods are only suitable for a few small, specific scale factors, and reconstruction at large scale factors suffers from blur and a lack of detail. To make the predicted image match the input image itself as closely as possible, a self-similarity-driven SR algorithm (SelfExSR) [7] was proposed, which extends the search space of internal image blocks through geometric transformations and thus improves the visual effect. Many researchers have also begun to combine multiple methods for SR reconstruction; for example, Dong et al. [22] apply machine learning methods to image interpolation and embed nonlocal autoregressive modeling (NARM) into sparse representation models. The visual effect is further improved, but as the magnification scale increases, the texture and edge details are not well reconstructed.

Fractals are an effective tool for describing image textures and are widely used in image segmentation [23], classification [24], SR, and other fields. Fractal-based SR methods exploit the local self-similarity and transitivity within a single image to search for similar image blocks. By analogy, the information of the LR image blocks is merged to reconstruct the HR image blocks [25]. The similarity can be considered a contractive fractal transformation operator that performs shrinking and gray-level modifying operations on the image. By the collage theorem, the attractive fixed point of this operator provides an estimate of the target image. Therefore, only the transformation needs to be stored in the computer, which requires less storage space than the original image. Wee and Shin [26] proposed a fast fractal super-resolution algorithm using a feature-type orthogonal fractal coding method, which produced good details at low computational complexity but lacked flexibility and adaptability. Later, the combination of fractal technology and other technologies produced good results. Xu et al. [27] proposed a texture enhancement algorithm that uses local fractal analysis to improve the SR effect of images and addresses image SR and enhancement together. Yu et al. [28] combined fractal technology with an instance-based approach to propose a super-resolution algorithm that preserves vivid texture details. However, fractal features (such as the fractal scale factor and fractal dimension) and image similarity have not been well considered, and some unexpected artifacts may occur. In recent years, only a limited number of publications have addressed SR based on fractal technology, all of which study how to preserve texture and structure information. Zhang et al. [3] proposed an SR reconstruction method based on rational fractal interpolation.
First, the image is divided into texture regions and nontexture regions, and then the regions are interpolated according to the characteristics of their local structures. Yao et al. [29] proposed an adaptive rational fractal interpolation model that reconstructs the relationship between the fractal dimension and the vertical scale factor. These methods improve the recovery of edges to some extent, but the fractal dimension does not accurately represent the texture details.

To improve the reconstruction of texture and edge regions and the reconstruction performance at large scale factors, we apply fractal geometry and CNNs to the study of SR. In this paper, we propose an SR method based on fractal coding and residual networks. Fractal image coding can use both the spatial information and the self-similar structural information of an image to achieve super-resolution. The basic idea is to estimate the fractal code of the original image from its degraded version and decode it at a higher resolution. Natural images generally exhibit self-similarity [27], but not strictly, so the fractal code is only an approximate description of the original low-resolution image. Using fractal codes for image SR therefore inevitably introduces errors and leads to blockiness and the loss of some detail. To mitigate this, we feed the error image between the original image and its encoded-then-decoded version into a deep residual network for estimation and use the result as a compensation term to correct the interpolated image, further improving the reconstruction accuracy. Since the error images consist mostly of high-frequency details lost after fractal interpolation, we train a convolutional neural network to learn these details. We construct a deep residual network in which a deep network and a shallow network are trained jointly, increasing the width of the network and learning different features. The introduction of residual blocks increases the depth of the network, improves the flow of information and gradients, facilitates the recovery of high-frequency information, and improves the reconstruction performance. Through the combination of fractal technology and the residual network, the texture of the SR results is richer and more realistic.

Our main contributions can be summarized as follows. (1) We propose a fractal coding method based on error compensation. (2) We propose a method to estimate the fractal error of fractal coding using the CNN method. (3) Compared with the state-of-the-art methods, our method does not lead to excessive smoothing and artifacts and exhibits superior performance as the scale factor increases.

The rest of this paper is organized as follows. Section 2 describes the related work. In Section 3, our algorithm is presented. Experiments and discussions are used to evaluate the effectiveness of the algorithm in Section 4. Finally, Section 5 summarizes this article.

2. Related Work

Fractal image coding was originally used for image compression and was later applied to image denoising, classification, super-resolution, and so on. Since the process of fractal decoding is independent of resolution and the fractal code contains not only spatial information but also self-similar structural information of the image, fractal image coding can realize image super-resolution reconstruction [24]. Before discussing the proposed method in depth, the basic principles of traditional fractal image coding are briefly introduced.

2.1. Fractal Image Coding

The process of fractal image compression coding is based on the collage theorem. For a given image, a set of contractive maps is sought such that the attractor of the resulting iterated function system approximates the given image, and the corresponding parameters are then recorded. Let $(X, t)$ be the metric space formed by digital images, where $t$ measures the distance between two images in the space. Assume there is a contractive map $W: X \to X$ and an image $F \in X$. If

$$W(F) = F, \tag{1}$$

then $F$ is the only attractor determined by $W$. For any given initial image $F_0 \in X$, $F$ can be iteratively solved by the following equation:

$$F = \lim_{i \to \infty} W^{i}(F_0). \tag{2}$$

The distance between the image $W^{i}(F_0)$ obtained after the $i$th iteration and $F$ satisfies

$$t\left(W^{i}(F_0), F\right) \le S^{i}\, t(F_0, F), \tag{3}$$

where $S$ is the contraction factor of the contractive map, $0 \le S < 1$.
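The convergence behaviour described above can be checked numerically. The following is a minimal sketch rather than the authors' implementation: it assumes a hypothetical pixel-wise affine map `w(x) = S*x + O` with contraction factor `S = 0.5` acting on a tiny three-pixel image, and it uses the maximum absolute pixel difference as the distance t. All names and values are illustrative.

```python
def apply_map(image, s, o):
    """Pixel-wise affine map w(x) = s*x + o; contractive when |s| < 1."""
    return [s * p + o for p in image]


def distance(a, b):
    """Distance t between two images: maximum absolute pixel difference."""
    return max(abs(x - y) for x, y in zip(a, b))


S, O = 0.5, 3.0                      # illustrative contraction factor and offset
attractor = [O / (1 - S)] * 3        # fixed point of w: x = S*x + O, i.e. x = 6
image = [100.0, -20.0, 0.0]          # arbitrary initial image F_0

initial_error = distance(image, attractor)
errors = []
for i in range(1, 21):
    image = apply_map(image, S, O)   # F_i = W(F_{i-1})
    errors.append(distance(image, attractor))
```

After each iteration the distance to the attractor shrinks by at least the factor `S`, so the choice of initial image only affects how many iterations are needed, not the limit.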

Let $F$ denote a given image of size $N \times N$ to be encoded, where $N$ is usually a power of two. First, $F$ is divided into nonoverlapping sub-blocks of size $B \times B$, which are called range blocks $R_m$. Thus, image $F$ can be expressed as

$$F = \bigcup_{m=1}^{N_R} R_m, \qquad R_m \cap R_n = \emptyset \ (m \ne n), \tag{4}$$

where $N_R = (N/B)^2$ is the number of range blocks, and all range blocks form the range $R$. Then, $F$ is divided into a group of domain blocks $D_i$ of size $D \times D$ by sliding a window with a certain step size $\delta$ along the vertical and horizontal directions. The step size is generally 1, 2, or 3, so the domain blocks can overlap their neighbors and do not need to cover the entire image $F$. Usually, the size of the domain block is twice the size of the range block, that is, $D = 2B$; therefore, the number of domain blocks is $\left(\lfloor (N - 2B)/\delta \rfloor + 1\right)^2$, and all domain blocks constitute the domain.
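To make the block bookkeeping concrete, the sketch below enumerates the top-left corners of the range and domain blocks for a square image. The function names and the example sizes (a 256 x 256 image, 8 x 8 range blocks, 16 x 16 domain blocks, step size 2) are assumptions chosen for illustration, not values taken from the paper.

```python
def range_block_origins(n, b):
    """Top-left corners of the nonoverlapping b x b range blocks of an n x n image."""
    return [(y, x) for y in range(0, n, b) for x in range(0, n, b)]


def domain_block_origins(n, d, step):
    """Top-left corners of the d x d domain blocks taken every `step` pixels.

    Domain blocks may overlap; only positions that fit inside the image are kept.
    """
    last = n - d
    return [(y, x) for y in range(0, last + 1, step)
            for x in range(0, last + 1, step)]


n, b, step = 256, 8, 2               # illustrative sizes
ranges = range_block_origins(n, b)
domains = domain_block_origins(n, 2 * b, step)
```

With these example values there are 1024 range blocks but 14641 candidate domain blocks, which is why the exhaustive domain search dominates the encoding cost.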

In the image encoding phase, for each range block $R_m$, the best matching domain block $D_i$ needs to be searched for among the domain blocks. As shown in Figure 1, $R_m$ can be approximated by $D_i$ through a contractive transformation operator $w_m$, which can be considered a composition of a shrinking transformation and a gray-level transformation. The shrinking transformation is a combination of geometric mappings. We need to convert $D_i$ to a block of the same size as $R_m$, and the shrunken domain block is represented as $\hat{D}_i$. There are eight such geometric transformations, i.e., $\tau_k$, $k = 1, \ldots, 8$, including four rotations, one horizontal flip, one vertical flip, and two diagonal flips. The gray-level transformation maps the contracted domain block to an approximation of the range block by $R_m \approx s \cdot \tau_k(\hat{D}_i) + o \cdot \mathbf{1}$, where $s$ is the contrast factor, $o$ is the luminance factor, and $\mathbf{1}$ is a block of ones. The adjustment parameters can be determined by minimizing the following collage error over $s$ and $o$:

$$e(R_m, D_i) = \min_{s,\, o} \left\| R_m - \left( s \cdot \tau_k(\hat{D}_i) + o \cdot \mathbf{1} \right) \right\|_2^2. \tag{5}$$

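The parameters $s$ and $o$ can be solved by the least squares method and can be obtained as

$$s = \frac{b \sum_{j=1}^{b} r_j d_j - \sum_{j=1}^{b} r_j \sum_{j=1}^{b} d_j}{b \sum_{j=1}^{b} d_j^2 - \left( \sum_{j=1}^{b} d_j \right)^2}, \qquad o = \frac{1}{b} \left( \sum_{j=1}^{b} r_j - s \sum_{j=1}^{b} d_j \right), \tag{6}$$

where $r_j$ and $d_j$ are the $j$th pixel values of the range block and of the shrunken, geometrically transformed domain block, respectively, and $b$ represents the number of pixels in a range block. To obtain the best matching domain block, we use an exhaustive search. We thus obtain the fractal code $(i, k, s, o)$ of the range block, where $k$ is the optimal parameter of the geometric mapping and $i$ is the position index of the optimal domain block. The set of fractal codes for all range blocks is called the iterated function system for the given image $F$.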
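The matching step for a single range and domain block pair can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: blocks are plain lists of lists, the shrinking step is assumed to be 2 x 2 averaging, and the helper names `shrink`, `isometries`, and `fit_gray_transform` are hypothetical.

```python
def shrink(block):
    """Average each 2x2 neighbourhood so a 2B x 2B domain block becomes B x B."""
    n = len(block) // 2
    return [[(block[2 * y][2 * x] + block[2 * y][2 * x + 1]
              + block[2 * y + 1][2 * x] + block[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(n)] for y in range(n)]


def isometries(block):
    """Return the 8 symmetries of a square block: the 4 rotations of the
    block and the 4 rotations of its mirror image."""
    def rot90(b):
        return [list(row) for row in zip(*b[::-1])]

    out = []
    for start in (block, [row[::-1] for row in block]):
        b = [list(row) for row in start]
        for _ in range(4):
            out.append(b)
            b = rot90(b)
    return out


def fit_gray_transform(range_block, domain_block):
    """Least-squares contrast s and brightness o such that s*d + o approximates r.

    Returns (s, o, collage_error).
    """
    r = [p for row in range_block for p in row]
    d = [p for row in domain_block for p in row]
    b = len(r)
    sum_r, sum_d = sum(r), sum(d)
    sum_rd = sum(x * y for x, y in zip(r, d))
    sum_dd = sum(y * y for y in d)
    denom = b * sum_dd - sum_d ** 2
    s = 0.0 if denom == 0 else (b * sum_rd - sum_r * sum_d) / denom
    o = (sum_r - s * sum_d) / b
    err = sum((s * y + o - x) ** 2 for x, y in zip(r, d))
    return s, o, err


# Synthetic check: build a range block that is an exact gray-level
# transform of the shrunken domain block and recover the parameters.
domain = [[float(4 * y + x) for x in range(4)] for y in range(4)]
small = shrink(domain)
target = [[0.5 * p + 10.0 for p in row] for row in small]
s, o, err = fit_gray_transform(target, small)
```

A full encoder would call `fit_gray_transform` for every domain block and every one of the eight symmetries, keeping the combination with the smallest collage error.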

2.2. Fractal Image Decoding

Fractal decoding is the process of generating the attractor of the original image from its fractal code and an initial image. In the fractal decoding stage, any image can be selected as the initial image, which is generally a blank image. Then, the iterated function system is applied repeatedly until the result converges to an approximation of the original image, as shown in equation (2). Since the self-similarity within the original image is often weak, there is an error between the decoded image and the original image. When enlarging the image $F$ by the scale factor $h$, the initial image must be scaled to the size of the image to be reconstructed. It is worth noting that the range blocks and domain blocks also need to be scaled by the same factor while reusing the same fractal code.

3. Proposed Algorithm

In this section, we discuss the details of the proposed algorithm, which is carried out in the framework of fractal image coding. The flowchart is shown in Figure 2. The algorithm is divided into the following steps. First, fractal coding is performed on the LR image, the corresponding fractal code is determined, and the similarity relationship between the range blocks and the domain blocks is established. Then, because the fractal code is independent of the image resolution, the attractor is reconstructed by super-resolution fractal decoding to obtain the interpolated image. Finally, the fractal coding error image is input into the deep residual network for upsampling estimation, and the resulting HR error image is used as a compensation term to correct the interpolated image, yielding the final HR image.

3.1. Quadtree Fractal Coding

The collage error of traditional fractal coding is given by equation (5). To reduce this error, a quadtree fractal coding strategy with variable block size can be adopted. The basic idea is to first use a larger range block to search for the optimal domain block. If the corresponding collage error is greater than a given threshold, the range block $R$ is decomposed into four smaller blocks $R_1$, $R_2$, $R_3$, and $R_4$, which are related by transformations to the corresponding domain blocks $D_1$, $D_2$, $D_3$, and $D_4$, respectively, as shown in Figure 3. Each sub-block is then searched again recursively for its optimal domain block and encoded, until all image blocks have been encoded.
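The recursive splitting rule can be sketched as below. This is a simplified illustration rather than the authors' encoder: the collage error obtained after the domain search is replaced by a hypothetical stand-in, `block_range` (the max-minus-min intensity inside the block), so that the example stays self-contained. The threshold `eps` and the minimum block size are illustrative values.

```python
def quadtree(y, x, size, error_fn, eps, min_size):
    """Return the leaf blocks (y, x, size) of a quadtree partition.

    A block is kept when its error is at most eps or when it has already
    reached the minimum size; otherwise it is split into four quadrants.
    """
    if error_fn(y, x, size) <= eps or size <= min_size:
        return [(y, x, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree(y + dy, x + dx, half, error_fn, eps, min_size)
    return leaves


def block_range(img, y, x, size):
    """Stand-in for the collage error: intensity range inside the block."""
    vals = [img[yy][xx] for yy in range(y, y + size) for xx in range(x, x + size)]
    return max(vals) - min(vals)


# 16 x 16 test image: flat on the left half, checkerboard on the right half.
img = [[(x + y) % 2 if x >= 8 else 0 for x in range(16)] for y in range(16)]
leaves = quadtree(0, 0, 16, lambda y, x, s: block_range(img, y, x, s),
                  eps=0.5, min_size=4)
```

The flat half is covered by two large 8 x 8 blocks, while the textured half is split down to the minimum size, which is the intended effect of a variable block size.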

3.2. Super-Resolution Fractal Decoding

By fractal coding the LR image, we obtain the transformation relationship $R \approx w(D)$ between similar blocks $R$ and $D$ in the image. This similarity is independent of the resolution of the image, so when the original LR image $F$ is magnified $h$ times, the similarity between $R$ and $D$ remains unchanged. The image magnified $h$ times is denoted $F_h$, so it satisfies

$$F_h = I(F), \qquad R^{h} \approx w\left(D^{h}\right), \tag{7}$$

where $I$ is the interpolation operator, $R^{h}$ and $D^{h}$ are the blocks corresponding to $R$ and $D$ in the HR image $F_h$, and, for a range block of size $B \times B$, their sizes are $hB \times hB$ and $2hB \times 2hB$, respectively.

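Super-resolution fractal decoding is similar to fractal decoding. Given an arbitrary initial image $F_h^{(0)}$ of the same size as $F_h$, the initial image is iterated by the following equation, in which $W_h$ denotes the fractal code applied to the scaled range and domain blocks. The fractal interpolation estimate $\hat{F}_h$ of the image $F$ can be approximated by $L$ iterations:

$$\hat{F}_h \approx F_h^{(L)} = W_h\left(F_h^{(L-1)}\right) = W_h^{L}\left(F_h^{(0)}\right). \tag{8}$$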

The decoding process of the above equation is called the super-resolution fractal decoding process. Fractal image coding is a lossy coding because natural images do not have a strict local self-similar structure. Thus, there is an error $E$ between the decoded image $\hat{F}$ and the original image $F$, which satisfies

$$F = \hat{F} + E. \tag{9}$$

Obviously, $\hat{F}$ is the fractal part of $F$ with strict local self-similarity, and $E$ is the residual part (error image). To obtain an interpolated image with a magnification of $h$ times, an interpolation operator is applied to both sides of equation (9) to obtain

$$F_h = I(F) = I(\hat{F}) + I(E) = \hat{F}_h + E_h, \tag{10}$$

where $E_h$ is the error between the HR image $F_h$ and the interpolated image $\hat{F}_h$. The smaller the error, the more accurate the reconstruction.
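The decomposition above relies on the interpolation operator being linear, so that magnifying the sum of the fractal part and the error image equals the sum of the two magnified parts. The sketch below checks this with nearest-neighbour magnification as a simple stand-in for a linear interpolation operator; the images and function names are made up for illustration.

```python
def upsample(img, h):
    """Nearest-neighbour magnification by an integer factor h (a linear operator)."""
    return [[img[y // h][x // h] for x in range(len(img[0]) * h)]
            for y in range(len(img) * h)]


def add(a, b):
    """Pixel-wise sum of two images of equal size."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    """Pixel-wise difference of two images of equal size."""
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


F = [[10, 20], [30, 40]]             # original LR image (toy values)
F_hat = [[12, 18], [29, 44]]         # fractal-decoded approximation (toy values)
E = sub(F, F_hat)                    # error image, so that F = F_hat + E

lhs = upsample(F, 2)
rhs = add(upsample(F_hat, 2), upsample(E, 2))
```

In the proposed method the fractal part is magnified by super-resolution fractal decoding and the error part by a learned network, so the identity holds only approximately there; the sketch illustrates why compensating with a magnified error image is a reasonable correction.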

3.3. Error Image Estimation for CNN

To estimate the HR error image $E_h$, the traditional method uses a bicubic interpolation algorithm, which produces artifacts and blurring that seriously affect the final reconstruction. To solve this problem, we propose a CNN-based method to estimate the error image. To reduce the interpolation error, the error estimate $\hat{E}_h$ is finally added as a compensation term to the interpolated image $\hat{F}_h$ to obtain a better estimate of the HR image, i.e., $\tilde{F}_h = \hat{F}_h + \hat{E}_h$.

3.3.1. Network Structure

The proposed method is based on fractal technology to achieve SR image reconstruction. To estimate the error compensation term more accurately, a deep residual network architecture is used. ResNet, the residual learning framework, was first proposed by He et al. [30] to address the exploding and vanishing gradients [31] caused by increasing network depth. Because of the shortcut connections of the residual network, information flows more smoothly between layers, and the underfitting caused by vanishing gradients is alleviated. Residual networks are widely applied to reconstruction tasks, for example, in VDSR [19], SRResNet [32], EDSR [20], and RDN [33], where residual learning increases the depth of the network and improves image reconstruction. Based on residual learning, we propose a network architecture for estimating error images, as shown in Figure 4. The figure shows an ensemble architecture in which a deep network and a shallow network share a common input. This two-branch structure increases the width of the network, captures different effective features, and improves reconstruction performance. Residual learning in the deep network increases the convergence speed while increasing the depth of the network, effectively retaining high-frequency features. The shallow network adopts a simple structure, which reduces the training difficulty and helps to restore more features of the image. Through the joint training of the two networks, the detailed HR error image can be recovered more accurately and the reconstruction result can be improved.

The deep network includes a feature extraction phase and a reconstruction phase. Traditional feature extraction methods mostly filter the input image with first-order and second-order filters, while deep learning methods do not require manually designed filters but learn them automatically from the training data. In the feature extraction stage, the LR error image is taken as input and forwarded through the network to produce a series of feature maps. The feature extraction network consists of a convolutional layer and a plurality of residual blocks. First, we use a convolutional layer to learn shallow features and apply the rectified linear unit (ReLU) for nonlinear mapping. The convolutional layer can be expressed as

$$L_0 = \max\left(0,\; W_0 * E + b_0\right), \tag{11}$$

where $W_0$ represents the weight of the convolutional layer, $b_0$ represents the bias of the layer, the symbol $*$ represents the convolution operation, $E$ is the input LR error image, and the feature map $L_0$ of the layer is obtained through the ReLU function. The convolutional layer generates a feature map of 64 channels.

Then, the shallow feature is used as input to multiple residual blocks to learn higher-level features. Each residual block consists of two convolutions with the same kernel size and number of filters. The residual structure is shown in Figure 5. It is a standard feedforward convolutional network with skip connections that bypass some layers. Each time a layer is bypassed, a residual block is formed, and the bypassed layers learn a residual function of the input instead of the original function. The output is composed of the input and the result of two successive convolutions. Each residual block is used as a unit, and the output of each unit is passed to the next unit. Assuming that there are $I$ residual blocks, the output of the $i$th residual block can be obtained by equation (12), i.e.,

$$L_i = Q_i\left(L_{i-1}\right) + L_{i-1}, \qquad i = 1, \ldots, I, \tag{12}$$

where $Q_i$ is the $i$th residual block operation, equivalent to the convolution and ReLU operations in equation (11), and $L_i$ is the output of the $i$th residual block, generated by the convolutional layers within the block, so that earlier layers are directly accessible to subsequent layers.

Because the number of feature maps grows with the depth of the network, the outputs of all units are fed into a convolutional fusion layer that reduces the number of feature maps and controls the output information. The introduction of residual blocks improves the convergence speed of the network, enhances the flow of information, and extracts more image features. Finally, a convolutional layer is added whose output is the input to the reconstruction module. All convolutional layers have the same convolution kernel size, $3 \times 3$. To ensure that the size of the feature map is the same as the size of the input, we set both the stride and the edge padding to 1.

The image reconstruction stage uses pixel shuffle [34] (also known as subpixel convolution) to upsample the image resolution, that is, to enlarge the LR feature maps to the size of the HR space. Compared to deconvolutional layers (as used in FSRCNN [35]), it introduces fewer checkerboard artifacts [36]. Additionally, our upsampling method does not insert one or more convolutional layers after upsampling, as in previous techniques [20, 33], but instead directly uses an upsampling layer, which does not degrade the reconstruction performance but increases the speed significantly. Unlike a conventional convolutional layer, the number of feature channels it outputs is $r^2$, where $r$ is the scale factor.
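The rearrangement performed by pixel shuffle can be written out explicitly. The sketch below is a plain-Python illustration of the standard sub-pixel rearrangement, assuming the common channel ordering in which input channel `c*r*r + i*r + j` supplies the output pixel at sub-position `(i, j)`; the function name and the nested-list tensor layout are choices made here for readability.

```python
def pixel_shuffle(x, r):
    """Rearrange C*r*r channels of size H x W into C channels of size rH x rW.

    x is a nested list indexed as x[channel][row][col].
    """
    c_in = len(x)
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1 x 1 channels become one 2 x 2 image for scale factor r = 2.
tiny = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2)

# Four 2 x 2 channels become one 4 x 4 image.
channels = [[[10 * k + 1, 10 * k + 2], [10 * k + 3, 10 * k + 4]] for k in range(4)]
shuffled = pixel_shuffle(channels, 2)
```

For a single-channel output, the layer feeding this operation therefore has to produce `r*r` channels, and no interpolation or learned deconvolution kernel is involved in the upsampling itself.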

The shallow network also includes feature extraction and reconstruction phases. EDSR, RDN, and similar networks use a global residual path to extract shallow and deep information effectively, but this global residual path is a linear stack of several convolutional layers, which is computationally intensive. Yu et al. [37] show that those linear convolutions are redundant and can be absorbed to some extent into the residual body. Therefore, we adopt the idea of [37] and train two networks jointly. The shallow network uses a single convolutional layer with 64 filters, whose output is fed directly into the pixel shuffle layer for upsampling to obtain the reconstructed image.

3.3.2. Training

The deep network and the shallow network are independent, and the two do not affect each other. The LR error image is used as input for SR reconstruction. The final HR error image is obtained by equation (13), i.e.,

$$\hat{E}_h = H_D\left(E; \theta_D\right) + H_S\left(E; \theta_S\right), \tag{13}$$

where $E$ represents the input LR error image, $H_D(E; \theta_D)$ and $H_S(E; \theta_S)$ represent the HR outputs of the deep and shallow networks parameterized by $\theta_D$ and $\theta_S$, respectively, and $\hat{E}_h$ is the final HR error image produced by the jointly trained networks.

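Suppose we are given $M$ training image pairs $\{E^{(m)}, E_h^{(m)}\}_{m=1}^{M}$, where $E^{(m)}$ represents an input LR error image and $E_h^{(m)}$ represents the corresponding HR error image. Our network is jointly trained by minimizing the Euclidean loss between the predicted HR error image $\hat{E}_h^{(m)}$ and the real HR error image $E_h^{(m)}$, i.e.,

$$\mathcal{L}(\theta) = \frac{1}{2M} \sum_{m=1}^{M} \left\| \hat{E}_h^{(m)} - E_h^{(m)} \right\|_2^2 + \lambda \left\| \theta \right\|_2^2, \tag{14}$$

where $\|\theta\|_2^2$ represents the weight decay applied to the network parameters $\theta = \{\theta_D, \theta_S\}$ and $\lambda$ represents the trade-off parameter.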
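A Euclidean loss with a weight-decay term can be evaluated as in the sketch below. The exact normalisation is an assumption made here (half the mean over the M samples of the squared error, plus a trade-off weight times the squared norm of the parameters); the function name, the flattened-list data layout, and the numbers are illustrative only.

```python
def euclidean_loss(predictions, targets, params, lam):
    """Half the mean squared L2 error over M samples plus lam * ||params||^2.

    predictions and targets are lists of M flattened images.
    """
    m = len(predictions)
    data_term = 0.0
    for pred, tgt in zip(predictions, targets):
        data_term += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    data_term /= 2.0 * m
    decay_term = lam * sum(w * w for w in params)
    return data_term + decay_term


preds = [[1.0, 2.0], [3.0, 4.0]]     # two predicted HR error patches (toy values)
truth = [[1.0, 0.0], [0.0, 4.0]]     # corresponding ground-truth patches
weights = [1.0, -2.0]                # toy network parameters
loss = euclidean_loss(preds, truth, weights, lam=0.1)
```

The squared errors sum to 13, giving a data term of 3.25 for M = 2, and the decay term adds 0.5, so `loss` evaluates to 3.75.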

In summary, the specific steps of the SR method based on fractal interpolation and residual network are as follows (see Algorithm 1).

Input: LR image $F$, threshold $\varepsilon$, scale factor $h$, maximum range block size $B_{\max}$, minimum range block size $B_{\min}$.
Output: Reconstructed image $\tilde{F}_h$.
(1) Construct the range blocks $R_i$ for the LR image $F$; each range block has size $B \times B$ with $B = B_{\max}$;
(2) Mark all $R_i$ as "uncoded" and add them to the encoding queue C;
(3) for each "uncoded" range block in C do
(4) Take an "uncoded" $R_i$ of size $B \times B$ from C and exhaustively search all image blocks of size $2B \times 2B$ for the optimal domain block $D_i$ that gives the smallest collage error $e$ under the approximation of equation (5);
(5) if $e \le \varepsilon$ or $B = B_{\min}$ then
(6)   Record the size of the range block and the fractal code W;
(7)   Mark $R_i$ as "encoded";
(8) else
(9)   Decompose $R_i$ into four smaller sub-blocks $R_{i1}$, $R_{i2}$, $R_{i3}$, and $R_{i4}$, all marked as "uncoded" and added to C;
(10) end if
(11) Delete the original block $R_i$ from C;
(12) end for
(13) Fractal decode the image F according to the fractal code W, and calculate the error E between the decoded image and the original image;
(14) Multiply the upper-left coordinates and the sizes of all recorded range blocks and domain blocks by h to obtain the new range block $R_i^{h}$ and domain block $D_i^{h}$ corresponding to each range block $R_i$;
(15) Take an arbitrary initial image of the same size as F, magnify it h times, and perform super-resolution decoding by equation (8) to obtain the fractal interpolation estimate $\hat{F}_h$ of image F;
(16) Input the error image E into the deep residual network for h-fold reconstruction to obtain the error compensation term $\hat{E}_h$;
(17) Obtain the best estimate of the reconstructed image: $\tilde{F}_h = \hat{F}_h + \hat{E}_h$.

4. Experiments

4.1. Datasets

To demonstrate the effectiveness of the proposed method, quantitative and qualitative comparisons were made at different scale factors. To train the deep residual network, the external dataset DIV2K [38] released by Timofte et al. was used. The DIV2K dataset consists of 800 training images, 100 validation images, and 100 test images. Since we use the deep residual network to estimate the error image, the training data are fractal coding error images. During the experiment, we performed fractal coding on the 800 training images. The error images of the HR and LR image pairs are used as training samples, and 6 validation images are selected for validation. We used four standard benchmark datasets, Set5 [39], Set14 [40], B100 [41], and Urban100 [7], to compare the performance.

4.2. Parameter Settings

The experimental environment includes hardware devices and software configurations. The test computer is configured with an Intel Core i7-5820K CPU @ 3.30 GHz × 12, an NVIDIA GeForce TITAN X GPU, 16 GB of RAM, and the Windows 10 operating system; the software environment is MATLAB 2018a.

To train the residual network, Caffe [42] was used to build the network model, which was optimized with the Adam optimizer. The batch size was set to 256, the momentum was 0.9, and the weight decay parameter was 0.009 (L2 penalty multiplied by 0.009). For weight initialization, we refer to the method proposed by He et al. [43]. The filters of the convolutional layers in the network were randomly initialized from a zero-mean Gaussian distribution with a standard deviation of 0.01. The filter in the deconvolutional layer was initialized by a bilinear interpolation kernel. The initial learning rate for all layers was set to 1e-3 and was halved every 100 epochs. The time required to train the network was approximately 7 hours.
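The step schedule described above, an initial learning rate of 1e-3 halved every 100 epochs, can be written as a one-line function. The function and parameter names below are hypothetical; the sketch only illustrates the schedule and is not tied to any particular framework.

```python
def learning_rate(epoch, base_lr=1e-3, drop_every=100, factor=0.5):
    """Step schedule: multiply the base rate by `factor` once per `drop_every` epochs."""
    return base_lr * factor ** (epoch // drop_every)


schedule = [learning_rate(e) for e in (0, 99, 100, 250)]
```

Epochs 0 through 99 use the full rate, epochs 100 through 199 use half of it, and so on.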

For each scale factor, HR error image blocks were randomly cropped from the training images, and each HR block was then downsampled using the bicubic interpolation algorithm to generate the corresponding LR training sample. To avoid overfitting and further improve accuracy, the dataset was augmented by horizontal flips, vertical flips, and 90° rotations. Since humans are more sensitive to changes in detail than to changes in color, as with other methods, we applied our algorithm to the luminance channel in the YCbCr color space and bicubic interpolation to the other two color channels.
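The flip and rotation augmentation can be sketched on a patch stored as a list of rows. This is an illustrative helper, not the authors' data pipeline; the function names and the choice of a clockwise rotation are assumptions.

```python
def hflip(patch):
    """Mirror the patch left to right."""
    return [row[::-1] for row in patch]


def vflip(patch):
    """Mirror the patch top to bottom."""
    return [list(row) for row in patch[::-1]]


def rot90(patch):
    """Rotate the patch by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def augment(patch):
    """Return the original patch plus its flipped and rotated versions."""
    return [patch, hflip(patch), vflip(patch), rot90(patch)]


patch = [[1, 2], [3, 4]]
variants = augment(patch)
```

The same transform has to be applied to both the LR patch and its HR counterpart so that the training pair stays aligned.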

The parameters in the fractal coding algorithm were: , , and . As shown in Table 1, the deeper the network, the better the performance. As the number of residual blocks and the number of convolutional layers increased, better performance was easily obtained. Therefore, designing a deep network into a deeper, broader network allows for more layered features. The number of residual blocks was set to 10, and each convolutional layer produced 64 feature maps.

4.3. Comparison with State-of-The-Art Models

To verify the validity of the proposed algorithm, we compared our method with seven state-of-the-art methods qualitatively and quantitatively at three different scale factors (2, 3, and 4). The methods compared include two traditional learning methods, A+ [15] and SelfExSR [7], and five deep learning methods: SRCNN [18], VDSR [19], EDSR [20], RDN [33], and PASSR [44].

To evaluate the quality of the reconstruction, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) were used to measure the image quality. PSNR is a full-reference evaluation method. The larger the value of the PSNR, the better the reconstruction of the image. SSIM refers to the similarity of structural information between two images and is also a full-reference image quality index, which measures image similarity in terms of brightness, contrast, and structure. SSIM values lie in the range [0, 1], and the larger the value, the better the reconstruction:

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\frac{1}{PQ} \sum_{p=1}^{P} \sum_{q=1}^{Q} \left( y(p, q) - \hat{y}(p, q) \right)^2}, \tag{15}$$

$$\mathrm{SSIM} = \frac{\left( 2 \mu_y \mu_{\hat{y}} + C_1 \right) \left( 2 \sigma_{y\hat{y}} + C_2 \right)}{\left( \mu_y^2 + \mu_{\hat{y}}^2 + C_1 \right) \left( \sigma_y^2 + \sigma_{\hat{y}}^2 + C_2 \right)}, \tag{16}$$

where MAX represents the maximum gray level of the image, generally 255; $P$ and $Q$ are the width and height of the image, respectively; $y$ is the original HR image and $\hat{y}$ is the reconstructed HR image; $\mu_y$ and $\sigma_y^2$ are the average gray value and variance of the original HR image, respectively; $\mu_{\hat{y}}$ and $\sigma_{\hat{y}}^2$ are the grayscale mean and variance of the reconstructed image, respectively; $\sigma_{y\hat{y}}$ is the covariance of the original HR image and the reconstructed HR image; and $C_1$ and $C_2$ are constants.
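Both metrics can be computed directly from their definitions. The sketch below is a simplified illustration: `ssim_global` evaluates the SSIM expression once over the whole image, whereas practical implementations usually average it over local windows, and the constants `C1 = (0.01*MAX)^2` and `C2 = (0.03*MAX)^2` are the commonly used defaults, assumed here because the text does not give their values.

```python
import math


def psnr(y, y_hat, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    diffs = [(a - b) ** 2 for ra, rb in zip(y, y_hat) for a, b in zip(ra, rb)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)


def ssim_global(y, y_hat, max_val=255.0):
    """SSIM computed from global image statistics (single window)."""
    a = [p for row in y for p in row]
    b = [p for row in y_hat for p in row]
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((p - mu_b) ** 2 for p in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2   # assumed defaults
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))


reference = [[0, 0], [0, 0]]
distorted = [[0, 0], [0, 10]]
score = psnr(reference, distorted)           # MSE = 25
same = ssim_global([[10, 20], [30, 40]], [[10, 20], [30, 40]])
```

In the experimental setting described here, both metrics would be evaluated on the luminance channel of the reconstructed image.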

Table 2 shows the quantitative comparison of all methods for scale factors of 2, 3, and 4 under four datasets. The effect of each method on SR is summarized by the average PSNR and SSIM. During the reconstruction process, the original LR image of any size is taken as input, and the corresponding HR image is directly reconstructed. We experiment with the code provided by the authors of these methods.

It can be seen from the table that the proposed method achieved the best PSNR and SSIM among the compared methods. Bicubic uses a simple interpolation method and did not achieve good performance. The A+ algorithm uses anchored neighborhood regression, which improves the reconstruction: the PSNR is 1∼2 dB higher than that of bicubic, and the SSIM is improved by approximately 0.03∼0.06. SRCNN was the first to introduce deep learning into the SR field and showed superior performance compared to traditional methods such as bicubic, A+, and SelfExSR. Compared with A+, its PSNR increased by approximately 0.1∼0.2 dB. However, SelfExSR, which exploits the self-similarity of the image, was better than SRCNN on the Urban100 dataset, which mostly contains self-similar structures. VDSR, EDSR, RDN, and PASSR all introduce residual learning, which makes the network deeper, learns richer features, and achieves good results. RDN uses residual dense blocks (RDBs) to retain a rich set of features, and when the scale factor was 2, it performed best on most datasets. However, when the scale factor was larger (such as 3 and 4), RDN did not have an advantage. PASSR is an SR reconstruction method for stereoscopic images that uses stereo image datasets for training. Therefore, its PSNR and SSIM are optimal on the dataset of Urban100, which contains a large number of stereo images.

As the scale factor increased, the performance of all methods decreased, but our method consistently gave the best result when the scale factor was 4. This indicates that our method performs better for large scale factors, because fractal coding obtains more image information from less data than the other methods. Moreover, for Urban100, which mostly contains self-similar structures, using more training layers and larger image blocks makes better use of the receptive field to obtain more information. Although the proposed method did not show an advantage at a magnification of 2, it gave pleasing results for large scale factors. In general, our method remains competitive with state-of-the-art methods.

PSNR and SSIM only evaluate the difference between image pixels, so they cannot fully assess reconstruction quality. Therefore, in addition to the quantitative comparison, we analyze the image quality qualitatively based on the visual results.

Figures 6–10 show qualitative comparisons of our method with state-of-the-art methods. Figures 6 and 10 show the results of the various methods at a scale factor of 2. We tested the texture retention of “sculptures,” “text,” and “windows”. The proposed algorithm is visually competitive and reconstructs more textures. Bicubic is based on a smoothness assumption, so its images are very blurred and the reconstruction is poor. Although the A+ algorithm is an improvement, blurring is still present. SelfExSR, SRCNN, and VDSR differ little in the reconstruction of “sculpture” and “text”. However, as can be clearly seen from Figure 10, the texture of the SRCNN image has significant distortion. SelfExSR is slightly better than the SRCNN, but both have artifacts at the edges. Of these three, VDSR works best, with a more complete line texture. EDSR reconstructed more details of “sculpture” and “text,” with a clearer result. However, its line texture of the window is not as good as that of VDSR. The proposed algorithm is slightly better than EDSR: more details are extracted, and the distortion is reduced.

Figures 7 and 8 show visual results with a scale factor of 3. The reconstruction performance of all algorithms is degraded. In Figure 7, A+ appears jagged and the SRCNN exhibits more distortion and edge blurring. This improves with the increased network depth of VDSR. Since both the SRCNN and VDSR are preprocessed using bicubic interpolation, significant artifacts are produced. The pattern on the “clothes” is clearer with SelfExSR, but the collar is not reconstructed well. EDSR and the proposed algorithm reconstruct distinct edge details, which demonstrates the advantage of deep networks. In Figure 8, neither the bicubic nor the A+ method reconstructed the “baby’s” eyebrows. EDSR reconstructed a more complete eyebrow line, and the proposed algorithm recovers more eyebrow texture. In the tail of the “fish,” Bicubic, A+, SRCNN, SelfExSR, and RDN show different degrees of artifacts, while the results of VDSR and our method are more complete and clearer. For complex textures such as “coral,” our algorithm shows finer edge information.

Figure 9 shows the results with a scale factor of 4. As the scale factor increases, the reconstruction ability of the algorithms is greatly reduced. The gap from the original image grows, and edge blur and artifacts are more likely to occur. The results of the SRCNN and VDSR differ little. The “wall,” “cactus,” and hair of the “baboon” are blurred, the edge details are incomplete, and the texture is distorted. The “baboon” image generated by the SelfExSR method has a significant blocking effect because the method groups image blocks according to the self-similarity inside the image. For images with more complex textures, such as img092, the results of Bicubic, A+, SRCNN, and SelfExSR are more severely jagged. PASSR is generally clearer than VDSR, but the textures on the walls are distorted and show artifacts. Although our method also divides the image by fractal techniques, it uses the deep network to compensate effectively for the error, and it therefore achieves good visual results. For example, the thorns on the edge of the “cactus” are better presented, and the “baboon’s” hair not only has no blockiness but is also more complete. In img092, the texture is more realistic and not jagged.

In general, the bicubic method uses a simple interpolation algorithm, and its reconstructions are very blurry and oversmoothed. A+ combines sparse representation with anchoring to calculate the correlation of dictionary atoms. Although it is better than the traditional sparse representation method, its results still appear jagged. The SRCNN uses a simple convolutional neural network and improves on the traditional methods, but on individual images it still cannot match the SelfExSR method, which uses internal examples. The VDSR method uses global residual learning to deepen the network structure and is slightly better than the SRCNN. However, there are artifacts in the edge details, and the texture details cannot be reproduced well. EDSR uses a deeper network structure to learn more image features and reduce edge blurring. RDN uses RDBs to fully extract the hierarchical features of the image and restore more textures. PASSR introduces a parallax-attention mechanism with a global receptive field, which is suited to processing stereoscopic images with large disparity variations, so its range of application is limited. The proposed method builds on the advantages of fractal techniques in texture description and uses fractal super-resolution for magnification. Because exploiting the self-similarity inside the image may cause blockiness, the error caused by fractal coding is compensated by a deep residual network. The reconstruction performance is greatly improved and more texture details are presented, making the estimated image more realistic and complete.

4.4. Ablation Analysis

To further understand the contribution of each part of the proposed algorithm, we evaluate the different components of the model experimentally. The parameter settings are the same as those in Section 4.2. Since the results for different scale factors are similar, only the results for a scale factor of 3 on the Set14 dataset are reported.

The traditional fractal-based SR method generally uses a bicubic interpolation algorithm to estimate the error compensation term. From the analysis in Section 4.3, it can be seen that the bicubic algorithm is based on a smoothness assumption and cannot effectively estimate a complete error image. Consequently, if the estimated error image is inaccurate, the ideal reconstructed image cannot be obtained when it is added to the fractal interpolation image. Table 3 shows the average PSNR obtained using the traditional bicubic method and the depth residual network. It can be seen that estimating the error image with the residual network greatly improves the reconstruction quality.
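The error-compensation step discussed above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the LR error image is assumed to be the difference between the LR input and its fractal decoding at the original size, the helper names (`upsample_nearest`, `compensate`) are hypothetical, and nearest-neighbour upsampling stands in for the error estimator. In the traditional pipeline that estimator would be bicubic interpolation; in the proposed method it would be the trained residual network.

```python
def upsample_nearest(img, scale):
    """Hypothetical stand-in error estimator: nearest-neighbour upsampling.

    The traditional pipeline would use bicubic interpolation here, and the
    proposed method a trained residual network; neither is implemented.
    """
    height, width = len(img), len(img[0])
    return [
        [img[i // scale][j // scale] for j in range(width * scale)]
        for i in range(height * scale)
    ]


def compensate(interpolated_hr, lr_image, decoded_lr, scale,
               estimate_error=upsample_nearest, max_val=255):
    """Add an estimated HR error image to the fractal interpolation image."""
    # Assumed definition of the LR fractal error: the part of the LR input
    # that the fractal code fails to reproduce at the original resolution.
    error_lr = [
        [a - b for a, b in zip(row_in, row_dec)]
        for row_in, row_dec in zip(lr_image, decoded_lr)
    ]
    # Map the LR error to an HR error estimate.
    error_hr = estimate_error(error_lr, scale)
    # Error compensation: interpolation image plus estimated error,
    # clipped to the valid gray-level range.
    return [
        [min(max_val, max(0, p + e)) for p, e in zip(row_p, row_e)]
        for row_p, row_e in zip(interpolated_hr, error_hr)
    ]
```

Passing a different callable as `estimate_error` swaps the estimator without changing the compensation step, which mirrors the comparison between the two estimators in the ablation.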

To verify the effectiveness of the parallel structure in the deep residual network, we split the network into a deep network and a shallow network. Table 3 shows an ablation study of the different components on the Set14 dataset with a scale factor of 3. Joint training of the two networks increases the PSNR by 0.18 dB. Joint training of the two networks is equivalent to widening the network: it increases the number of parameters and features, so the network learns more diverse features, which effectively improves the visual quality of the reconstruction.

To verify the advantage of the ResBlocks, we removed them and observed the effect. In this experiment, all ResBlocks were replaced with the same number of Conv layers, and the subsequent fusion layers were removed. The result of the last convolutional layer is then fed directly into the reconstruction network to obtain the final prediction. As shown in Table 3, the SR reconstruction performance degrades after the ResBlocks are removed, because residual learning reduces the weight of the network, improves the flow of information and gradients, greatly speeds up convergence, and achieves good performance.

4.5. Evaluation in Running Time

Finally, we compare the running time with that of the state-of-the-art methods on the same machine. We use each algorithm to super-resolve the 100 images in Urban100 and record its average PSNR and running time. In Figure 11, the X coordinate represents the running time and the Y coordinate represents the average PSNR. The average PSNR of our algorithm is the highest, and its running time is at a medium level, lower than that of EDSR, RDN, SelfExSR, and A+.
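A measurement of this kind can be organized as in the sketch below. It is a hypothetical harness, not the script used for Figure 11: `method` is any callable that maps an LR image to an SR image, `quality` is any full-reference score such as PSNR, and the choice to report the average time per image (rather than the total) and to exclude the scoring step from the timing are assumptions.

```python
import time


def benchmark(method, lr_images, hr_images, quality):
    """Return (average quality, average seconds per image) for one SR method."""
    if not lr_images or len(lr_images) != len(hr_images):
        raise ValueError("need equally sized, non-empty image lists")
    total_time = 0.0
    total_quality = 0.0
    for lr, hr in zip(lr_images, hr_images):
        start = time.perf_counter()
        sr = method(lr)  # only the reconstruction itself is timed
        total_time += time.perf_counter() - start
        total_quality += quality(hr, sr)  # scoring is excluded from timing
    n = len(lr_images)
    return total_quality / n, total_time / n
```

Running every method through the same function on the same machine and image set keeps the comparison consistent; each method then contributes one point (time, quality) to a plot like Figure 11.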

5. Conclusions

In this paper, we propose a super-resolution method based on fractal image interpolation and a depth residual network, which effectively recovers texture structure details, reduces edge blur and distortion, and achieves high SR reconstruction performance. First, the image is fractal coded to establish the similarity between the range block and the domain block. Then, we use this similarity relationship to perform SR reconstruction of the image by SR fractal decoding. Since fractal image encoding is a lossy compression, there is an error between the decoded image and the original image. To estimate the error compensation term more accurately, the residual network is trained to learn the mapping between low- and high-resolution error images. In future work, the reconstruction accuracy of the interpolation method will be further improved.

Data Availability

All data in this article are derived from publicly available datasets on the Internet.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors acknowledge the National Natural Science Foundation of China (Grant: 61772319, 61976125, 61976124, 61773244, and 61873177), Shandong Natural Science Foundation of China (Grant: ZR2017MF049), and Yantai Key Research and Development Program of China (Grant: 2017ZH065 and 2019XDHZ081).