Dual-Path Image Reconstruction: Bridging Vision Transformer and Perceptual Compressive Sensing Networks

Over the past few years, notable advancements have been made through the adoption of self-attention mechanisms and perceptual optimization, which have proven to be successful techniques in enhancing the overall quality of image reconstruction. Self-attention mechanisms in Vision Transformers have been widely used in neural networks to capture long-range dependencies in image data, while perceptual optimization has been shown to enhance the perceptual quality of reconstructed images. In this paper, we present a novel approach to image reconstruction by bridging the capabilities of Vision Transformer and Perceptual Compressive Sensing Networks. Specifically, we use a self-attention mechanism to capture the global context of the image and guide the sampling process, while optimizing the perceptual quality of the sampled image using a pretrained perceptual loss function. Our experiments demonstrate that our proposed approach outperforms existing state-of-the-art methods in terms of reconstruction quality and achieves visually pleasing results. Overall, our work contributes to the development of efficient and effective techniques for image sampling and reconstruction, which have potential applications in a wide range of domains, including medical imaging and video processing.


I. INTRODUCTION
Compressive Sensing (CS) is an important technique in signal processing and computer vision. It acquires and processes signals at a rate lower than that required by the Nyquist-Shannon sampling theorem. In imaging, CS is used to reconstruct a high-quality image from a set of low-quality or incomplete observations, and it has emerged as an alternative to traditional image compression techniques. The potential of CS and image reconstruction in computer vision lies in their ability to enable high-quality imaging and data acquisition with minimal resources. CS reduces the amount of data required to capture an image, making it possible to store or transmit images more efficiently. It can also reduce the amount of data required for image processing, enabling real-time processing of large datasets. Most CS approaches are CNN-based models, which are limited by the receptive field of the convolution kernels and by their inability to handle long-range dependencies.
Deep learning models have shown impressive success in various computer vision tasks, but they are not always efficient at processing large and complex images. One possible solution is to incorporate visual attention mechanisms into deep learning models, inspired by the way humans selectively process relevant information and filter out distractions. Attention mechanisms allow deep learning models to focus on important regions of an image and suppress irrelevant information, leading to improved accuracy and efficiency.
In addition to visual attention, a selective process that lets us focus on important information in the environment while ignoring distractions, there is visual perception, the process of interpreting and making sense of visual information from the environment. This process involves multiple stages, in which the basic features of a stimulus (e.g., color, shape, motion) are detected and encoded, then integrated into meaningful objects and scenes. Both processes are important in visual recognition; they are intricately linked and work together to support our visual experiences.
In this paper, we propose a novel CS approach for image sampling and reconstruction. Our proposed model combines self-attention and perceptual information to selectively attend to different regions of an image at multiple levels of abstraction. We evaluate the effectiveness of our proposed model using experiments on benchmark datasets and demonstrate that it outperforms existing models in terms of reconstruction quality.
Hence, the main contributions of this paper are:
• We propose a framework based on a hybrid architecture that combines the self-attention mechanism of vision transformers, for modeling long-range dependencies and global context in images, with the advantages of convolutional neural networks for local feature extraction.
• We propose to add a transformer-based coding path, so that coding is done in two paths: a CNN-based CSNet sampling path and a transformer path. These two paths are linked by a fusion layer that merges the features and produces the vector that represents the input image.
• We use perceptual optimization in the training process to semantically guide the model to learn long-range and local high-frequency details of visual and contextual features.
• Finally, we run extensive experiments to evaluate our approach in terms of reconstruction quality and compare it to state-of-the-art methods on different image compression benchmarks.
The remainder of this paper is organized as follows. In Section II, we present and discuss previous works on image-based CS reconstruction. In Section III, we explain the proposed approach. In Section IV, experimental results and comparisons with State-Of-The-Art (SOTA) methods are carried out. Finally, in Section V, we summarize our findings and present some opportunities for future work.

II. RELATED WORK
In this section, we present a CS image reconstruction literature review. We first discuss the existing deep learning-based CS methods. Then we review the recent development of vision transformers for image reconstruction.

A. Deep learning-based CS approaches
Compressive Sensing theory was first proposed in 2004 by David Donoho [1]. Deep learning-based CS approaches have been proposed to solve the CS reconstruction problem by learning significant features from the input signal itself. Several reconstruction algorithms based on CNNs have been proposed to overcome the complexity of traditional methods. At the outset, Kulkarni et al. [2] developed a non-iterative CNN-based reconstruction model (ReconNet). Based on iterative thresholding algorithms, Zhang et al. [3] proposed the convolutional ISTA-Net model for image recovery. Afterward, Shi et al. proposed a Scalable Convolutional Neural Network (SCSNet), followed by a sampling and reconstruction framework called CSNet [4], which replaces the sampling model with a convolutional layer. However, these methods are limited by their random sampling. To address this problem, Siwang Zhou et al. [5] proposed a Block-Based Image Compressive Sensing network (BCS-Net), which uses block correlation for sampling. Nevertheless, its training overlooks the semantic information of the image as a source of prior knowledge. Hence, to improve the reconstruction quality by considering prior knowledge, Wenxue Cui et al. [6] proposed a non-local CSNet (NL-CSNet) based on non-local self-similarity priors.
However, none of the previous methods consider perceptual information, which is important for reconstructing the visual and semantic content of images. Recently, Bairi et al. [7] proposed a perceptual-optimized CS framework that uses perceptual information for image reconstruction. The model is based on an auto-encoder trained using perceptual optimization. Despite its power in reconstructing semantic information, this model still lacks high-frequency feature extraction.

B. Transformer-based image reconstruction
The first transformer was proposed by Vaswani et al. [8] for Natural Language Processing (NLP) tasks. In this architecture, long-range dependencies are modeled by multi-headed self-attention and a feed-forward Multi-Layer Perceptron (MLP) block. Among the best-known models dealing with this type of task are BERT [9] and GPT [10]. Building on their strength in NLP, transformers were recently integrated into image processing. For classification tasks, the innovative Vision Transformer (ViT) [11] divides an image into patches of 16 × 16 pixels and uses multi-headed self-attention and a feed-forward MLP to build a classifier. In addition to the original ViT, transformer models with different versions and architectures were proposed for several computer vision tasks, namely classification [12], [13], [14], [15], [16], [17], [8], object detection [18], [19], and image segmentation [20], [21], [22].
Few works have investigated transformers for image reconstruction. Indeed, this task produces images as a final output, which is more difficult than high-level vision tasks such as classification, segmentation, and object detection, whose outputs are labels or areas. For transformer-based image reconstruction, Hanting et al. [23] proposed a pre-trained model called IPT that can be used for image reconstruction tasks. This approach suffers from a large number of parameters, and image features are still extracted by a CNN. A concurrent work [24] proposed a U-shaped transformer for image reconstruction, which is built upon the UNet architecture and based on the Swin transformer block. However, these models, based solely on pure transformers, overlook local feature identification and low-frequency information. To preserve the advantages of both CNN-based networks for local description and transformers for handling long-range dependencies, Liang et al. [25] proposed SwinIR, a model for restoring compressed or noisy images that combines CNNs with the Swin transformer blocks originally designed for image classification in [15]; this model showed better results than those obtained by IPT. Similarly, a transformer-based image reconstruction (TIC) method is developed in [26]. It uses a canonical variational autoencoder (VAE) architecture, built from convolutional layers and Swin transformer blocks, to capture long- and short-range dependencies of the input image. Test results on the Kodak dataset show the good performance of this approach. Dongjie et al. [27] extend self-attention to compressed sensing to overcome the limitations of convolution layers in modeling global features, with a CSformer model that combines the advantages of CNNs and transformers. The model contains a sampling module, implemented as a convolution layer, and a reconstruction module made of two branches that integrate local and global-range dependencies. Nevertheless, these architectures lack perceptual information, which helps reconstruct the semantic details of the image.

III. PROPOSED APPROACH
In this section, we present the proposed PCST-Net framework, which uses self-attention through a vision transformer for better feature extraction and visual perception to make sense of these features. Fig. 1 illustrates the architecture of the proposed approach. It is based on a CS sampling/reconstruction autoencoder that adopts an attention mechanism to capture long-range contextual information. The learning process is guided by the visual content of the image. The proposed approach involves two neural networks, an encoder and a decoder. The encoder network compresses the input images by projecting them into a lower-dimensional space, while the decoder network restores the original image representation from the compressed representation. The network is trained end to end to minimize the image reconstruction error, allowing it to find the optimal parameters that enable sampling and reconstruction for any input image.

A. Sampling network
The sampling network (encoder) combines CNN and transformer models to take advantage of both spatial locality and self-attention. The CNN model is inspired by PSCS-Net [7] and is laid out as three Convolution/MaxPooling blocks. In the original CS framework, encoded data are the result of sampling the input image; they are called encoded data because they correspond to the rows of the sampled image. In the context of deep learning, the encoded data are arranged as an ordinary 3D tensor, like any CNN feature map. Theoretically, they still correspond to CS sampled vectors, just stacked in a 3D tensor. Applying the sampling operator $S_{CNN}$ to the input image $x$ yields $y_1$, the encoded data obtained by the CNN sampling network.
In Eq. 1, the network operates on 2D image patches using the convolution operator ($*$) with the sampling matrix $W_s^1$. This operation projects an input image $x \in \mathbb{R}^{d_x}$ onto an encoded vector $y_1 \in \mathbb{R}^{d_y}$. The sampling matrix $W_s^1$ is a composition of convolutions and nonlinear activation functions $f$, which allows for better feature extraction.

The transformer-based encoder aims to capture long-range visual dependencies through the self-attention mechanism. It is composed of a projection layer and a transformer block, following the architecture of the ViT backbone [11]. An image projection is a lower-dimensional representation of the image; in other words, it is a dense vector representation of the image. First, the image is divided into $P \times P$ non-overlapping patches. The feature projection layer then projects each input patch of size $(P \times P \times C)$ into a vector of size $(1 \times P_d)$, where $P_d$ is the projection dimension.

The self-attention mechanism is an integral component of a transformer; it explicitly models the interactions between all entities in a sequence. For an input sequence of $N_p$ elements, self-attention captures the interaction between all $N_p$ entities and encodes each entity in terms of global contextual information. For this purpose, three learnable weight matrices are defined: Queries ($W_Q \in \mathbb{R}^{P_d \times q}$), Keys ($W_K \in \mathbb{R}^{P_d \times k}$), and Values ($W_V \in \mathbb{R}^{P_d \times v}$). The input sequence $X$ is projected onto these weight matrices to obtain the queries, keys, and values from which self-attention is computed.

Fig. 2 shows the transformer block architecture, which consists of two layer normalization (LN) layers, a multi-headed self-attention layer (MSA), and an MLP made up of two fully connected layers. The LN is inserted before both the MSA and the MLP.
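The projection and self-attention steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the standard scaled dot-product form from [8], that is $Q = XW_Q$, $K = XW_K$, $V = XW_V$ and $\mathrm{softmax}(QK^T/\sqrt{k})\,V$, with $q = k$. The helper names `matmul`, `softmax`, and `self_attention` and the toy inputs are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (assumed standard form).

    x   : Np x Pd sequence of patch embeddings
    w_q : Pd x q, w_k : Pd x k (q == k assumed), w_v : Pd x v
    Returns the Np x v output and the Np x Np attention weights.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)               # pairwise interactions, Np x Np
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy example: Np = 3 patches, Pd = 2, identity projections (illustrative only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to one, so every output row is a convex combination of all value vectors, which is how each patch comes to encode global context.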
The multi-headed self-attention (MSA) comprises several self-attention blocks, each with its own set of learnable Query, Key, and Value weight matrices. Self-attention runs $h$ times in parallel, where $h$ is the number of heads, and the head outputs are then concatenated into a single matrix. This block takes the sequence of patch embeddings $I_{patches}$, of size $(N_p \times P_d)$, as input and computes self-attention globally among them. The transformer path is composed of four transformer blocks.

Feature fusion aims to extract the most discriminative information and eliminate redundant information. The fusion function combines the global features of the transformer with the local features of the CNN using a fusion strategy such as addition or averaging. The fusion of $y_1$ and $y_2$ is given by Eq. 6.
Since the outputs of the transformer path and the CNN path have different dimensions, the transformer features must be reshaped to match those of the CNN before fusion.
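The reshape-then-fuse step can be illustrated as follows. This is a sketch under stated assumptions: the text does not specify how the transformer features are reshaped, so a row-major layout is assumed, with the token count equal to the spatial size of the CNN feature map and the token width equal to its channel count. The function names `reshape_to_map` and `fuse` are hypothetical.

```python
def reshape_to_map(tokens, height, width):
    """Reshape Np token vectors (Np == height * width) into a height x width x Pd map.

    Row-major ordering is an assumption made for this sketch.
    """
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def fuse(cnn_map, vit_map, strategy="add"):
    """Element-wise fusion of two feature maps of identical shape."""
    scale = {"add": 1.0, "average": 0.5}[strategy]
    return [
        [[scale * (a + b) for a, b in zip(c_px, v_px)]
         for c_px, v_px in zip(c_row, v_row)]
        for c_row, v_row in zip(cnn_map, vit_map)
    ]


# Toy example: a 2 x 2 x 2 CNN feature map and four 2-dimensional tokens.
y1 = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
y2_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y2 = reshape_to_map(y2_tokens, 2, 2)
y_add = fuse(y1, y2, "add")
y_avg = fuse(y1, y2, "average")
```

Addition and averaging differ only by a constant factor of two in this form, so any difference in reconstruction quality between them comes from how the rest of the network adapts to the feature scale during training.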

B. Reconstruction Network
The upsampling network (decoder) follows the design in [7]: a three-block deconvolutional network that learns the inverse convolution filters needed to reconstruct images. The decoder maps $y$ back to the input space by recovering the feature representation during image recovery. It represents a nonlinear mapping, learned by training, from the measurements $y$ to the original image $x$. The decoder is symmetric with the CNN sampling network and consists of three layers: the input layer and two hidden layers. The decoder function (Eq. 7) is used to recover the reconstructed image $\hat{x}$ from the measurement vector $y$.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

Fig. 1: Overview of the proposed image reconstruction-based framework: the model is trained using the combination of visual perception and self-attention. (Diagram labels: PSCS-Net Sampling, ViT Sampling, PSCS-Net Features, ViT Features, Reshaped ViT Features, Merged Features.)

Fig. 2: Architecture of a transformer block. (Diagram labels: Patches, Normalisation, Multi-Head Self-Attention, Normalisation.)

C. Training of PCST-Net
To semantically guide our model to learn visual and contextual features, we use perceptual loss optimization in the training process, as in [7]. The perceptual loss measures the distance between images in a high-level feature space using a pre-trained compressive sensing network (CSNet) [4]. This model is originally trained on the ImageNet dataset. The PCST-Net network is trained end to end by minimizing the global loss $L_{total}$ expressed in Eq. 8, where $L_p$ (Eq. 9) is the perceptual loss, $L_s$ (Eq. 10) is the sparsity loss, and $L_2$ (Eq. 11) is the $L_2$ norm between the original and reconstructed images. The three terms are weighted by $\alpha_1$, $\alpha_2$, and $\alpha_3$, respectively.
In Eq. 9, $\phi$ is the sampling operator of CSNet, used to compute the difference between the feature vectors of the input image $x$ and the predicted image $\hat{x}$.
The first term of $L_s$ in Eq. 10 constrains the weight parameters $W$ with an $L_2$ norm so as to penalize large weights. The second term is the sparsity regularizer. $\beta_1$ is the penalty term and $KL$ is the Kullback-Leibler divergence, which penalizes active code units. $\beta_2$ is the intensity of the sparsity, $\rho$ is the sparsity factor, and $\hat{\rho}_j$ is the mean activation of the $j$-th neuron over each batch of the training set.
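A small sketch of how the two terms of the sparsity loss could be computed is given below. It assumes the usual sparse-autoencoder formulation, in which the KL divergence is taken between two Bernoulli distributions with means $\rho$ and $\hat{\rho}_j$ and activations lie strictly between 0 and 1; the exact form used in Eq. 10 may differ. The function names and the numeric values are hypothetical.

```python
import math


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), both in (0, 1)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparsity_loss(weights, mean_activations, rho, beta1, beta2):
    """Weight penalty plus KL sparsity regularizer (assumed form of the sparsity loss).

    weights          : flat list of weight parameters W
    mean_activations : per-neuron mean activation over a batch (rho_hat_j)
    """
    weight_penalty = sum(w * w for w in weights)  # squared L2 norm of W
    kl_penalty = sum(kl_bernoulli(rho, r) for r in mean_activations)
    return beta1 * weight_penalty + beta2 * kl_penalty


# When every neuron already has mean activation rho, only the weight term remains.
loss_at_target = sparsity_loss([1.0, 2.0], [0.05, 0.05], rho=0.05, beta1=0.5, beta2=3.0)
# Neurons that are more active than rho add a positive KL penalty.
loss_too_active = sparsity_loss([1.0, 2.0], [0.5, 0.5], rho=0.05, beta1=0.5, beta2=3.0)
```

The KL term is zero only when a neuron's mean activation equals $\rho$, so minimizing it pushes most code units toward low activity.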
The $L_2$ norm is used to profit from the qualities of pixel-wise loss functions.
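For reference, one way to write the loss terms described above in LaTeX is sketched below. Only the weighted-sum structure of Eq. 8 and the roles of the symbols are taken from the text; the squared norms, the absence of normalization constants, and the Bernoulli form of the KL term are assumptions.

```latex
\begin{align}
L_{total} &= \alpha_1 L_p + \alpha_2 L_s + \alpha_3 L_2 \tag{8} \\
L_p &= \lVert \phi(x) - \phi(\hat{x}) \rVert_2^2 \tag{9} \\
L_s &= \beta_1 \lVert W \rVert_2^2 + \beta_2 \sum_j KL\!\left(\rho \,\Vert\, \hat{\rho}_j\right) \tag{10} \\
L_2 &= \lVert x - \hat{x} \rVert_2^2 \tag{11}
\end{align}
```

Under these assumptions, $L_p$ and $L_2$ share the same form and differ only in the space where the distance is measured: CSNet feature space for $L_p$, pixel space for $L_2$.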
The goal of training the PCST-Net model is to minimize $L_{total}$, as shown in Algorithm 1. First, the parameters $W_s^1$, $W_Q$, $W_K$, and $W_V$ are randomly initialized to break symmetry. Then, the encoded data $y$ and the reconstructed images $\hat{x}$ are obtained through the encoder and decoder subnetworks, respectively.

Algorithm 1: Training of PCST-Net
for each training batch do
    Compute encoded image $y$
    Compute loss $L_{total}(x, \hat{x})$ (Eq. 8)
    Minimize final loss by gradient descent algorithm
    Update $W_s^1$, $W_Q$, $W_K$, $W_V$, and $W_r$
end for

IV. EXPERIMENTS AND RESULTS
In this section, we first introduce the dataset used for PCST-Net model training and the evaluation metrics. Second, we present the model settings for better training (Section IV-B). Next, in Section IV-C, we conduct an experimental study on image compression benchmarks for objective evaluation of the model and compare the proposed approach with state-of-the-art methods. Finally, in Section IV-D, we evaluate the quality of PCST-Net image reconstruction with a subjective evaluation.

A. Datasets and evaluation metrics
PCST-Net is trained on the large-scale COCO 2017 dataset (see footnote 1). 118k and 40k images are used for training and validation, respectively.
To evaluate the model, two metrics are computed: Peak Signal to Noise Ratio (PSNR) and Structural Similarity (SSIM).PSNR measures image reconstruction quality, while SSIM, a perceptual metric, quantifies image degradation.
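As a concrete reference for the first metric, PSNR is commonly defined as $10 \log_{10}(MAX^2 / MSE)$, where $MSE$ is the mean squared error between the two images and $MAX$ is the largest possible pixel value. The sketch below assumes 8-bit images flattened into lists of pixel values; the function name `psnr` and the sample values are hypothetical.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak Signal to Noise Ratio in dB between two equally sized pixel lists."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: two pixels, each off by 10 gray levels (MSE = 100).
score = psnr([100, 100], [110, 90])
```

Higher PSNR means a lower mean squared error; because the scale is logarithmic, halving the MSE adds about 3 dB.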

B. PCST-Net model settings and training
1) Model hyper-parameters selection:
After an axial empirical study, the hyper-parameters of the model are set to 256 x 256 x 3, 8 x 8 x 3, and 8 x 8 x 3 for Image size, Patch size, and Block size, respectively.The perceptual loss function is optimized using the Adam optimizer with a batch size equal to 32 and a learning rate of 0.002 for 100 epochs.
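To make these settings concrete, the short sketch below computes how many non-overlapping patches the stated image and patch sizes produce and how long each flattened patch is before projection. The function name `patch_grid` is hypothetical, and the calculation assumes the patches tile the image exactly.

```python
def patch_grid(image_size, patch_size):
    """Return (number of patches Np, flattened patch length) for non-overlapping patches."""
    height, width, channels = image_size
    p_h, p_w, p_c = patch_size
    if height % p_h or width % p_w or channels != p_c:
        raise ValueError("patches must tile the image exactly")
    num_patches = (height // p_h) * (width // p_w)
    patch_dim = p_h * p_w * p_c
    return num_patches, patch_dim


num_patches, patch_dim = patch_grid((256, 256, 3), (8, 8, 3))
```

With a 256 x 256 x 3 image and 8 x 8 x 3 patches this gives a 32 x 32 grid, that is 1024 patches of 192 values each, so the self-attention matrix has 1024 x 1024 entries per head.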
1 https://cocodataset.org/home

2) Fusion strategy selection:
Our method adopts an addition strategy to merge the features of the two paths. To illustrate the effectiveness of this choice, we construct a variant in which the features of the CNN and the transformer are averaged rather than summed. Fig. 3 shows the PSNR results of the two models on Set5, Set14, and BSD100. Addition yields higher PSNR across the compression ratios tested. Averaging comes close when the compression ratio is below 10%, but above that ratio addition performs better.
3) Path selection:
PCST-Net is a dual-path model that aims to combine the efficiency of convolution in extracting local features with the capability of the transformer in modeling global representations. To assess the benefit of the two-branch design, we built a single-path model called SPCST-Net, which uses only the transformer path for compression. The results of the tests on three datasets (Set5, Set14, and BSD100) are presented in Fig. 4.
The results obtained on the Set5, Set14, and BSD100 datasets confirm that PCST-Net recovers more details and semantic information than PSCS-Net (based only on a CNN) or SPCST-Net (based only on transformers).
Our experimental results show that our approach achieves higher image reconstruction performance than state-of-the-art algorithms. On the different compression datasets, PCST-Net benefits from the coupling between perception and self-attention and obtains the best PSNR and SSIM values among the compared reconstruction methods.

D. Subjective Evaluation
In this section, we describe the subjective evaluation used to visualize the quality of reconstructed images. This qualitative assessment is done with the naked eye by noting the differences between images at a ratio of 0.25. We also provide PSNR and SSIM values for each image to highlight quantitative differences.
The visualization obtained by PCST-Net in Fig. 5 shows again that using both perception and self-attention gives the best result compared to other reconstruction methods. These results suggest that combining self-attention and perceptual optimization is a powerful tool for improving the quality of image reconstruction. Self-attention captures long-range dependencies in the image data, which can lead to better sampling performance, while perceptual optimization enhances the perceptual quality of the reconstructed images.

V. CONCLUSION
In this paper, we proposed a novel approach for image sampling and reconstruction that combines Vision Transformer and perceptual optimization techniques. Our approach leverages self-attention to capture the global context of the image and guide the sampling process, while optimizing the perceptual quality of the sampled image using a perceptual loss function. We have demonstrated the effectiveness of the proposed approach through experiments on several benchmark datasets, and we have shown that it outperforms existing state-of-the-art methods in terms of reconstruction quality and visual fidelity. Our approach has potential applications in a wide range of domains, including medical imaging, video processing, and computer graphics. In conclusion, our work contributes to the development of efficient and effective techniques for image sampling and reconstruction, which are critical components of multimedia processing. We believe that the proposed approach can serve as a foundation for future research in this area, and we hope that it will inspire further innovations in computer vision.

Fig. 5: Comparison of the visual quality of image reconstruction using ratios of 0.2 (a) and 0.4 (b).
ZAKARIA BAIRI ET AL.: DUAL-PATH IMAGE RECONSTRUCTION: BRIDGING VISION TRANSFORMER AND PERCEPTUAL COMPRESSIVE SENSING 349

TABLE I: Comparison of PSNR (dB) and SSIM on Set5

TABLE II: Comparison of PSNR (dB) and SSIM on Set14

TABLE III: Comparison of PSNR (dB) and SSIM on BSD100