

2022,  4 (3):   247 - 262

Published Date:2022-6-20 DOI: 10.1016/j.vrih.2022.05.001


A 360° video stream gives users the choice of viewing their own point of interest inside the immersive content. Performing head or hand manipulations to view an interesting scene in a 360° video is very tedious, and the user may see the frame of interest only in passing during the head or hand movement, or even miss it. Automatically extracting the user's point of interest (UPI) in a 360° video is very challenging because of subjectivity and differences in comfort. To handle these challenges and provide users with the best and most visually pleasant view, we propose an automatic approach that utilizes two CNN models: an object detector and an aesthetic scorer of the scene. The proposed framework has three stages: pre-processing, the Deepdive architecture, and the view selection pipeline. In the first stage, an input 360° video frame is divided into three sub-frames, each with a 120° view. In the second stage, each sub-frame is passed through the CNN models to extract visual features and calculate an aesthetic score. Finally, the decision pipeline selects the sub-frame with the salient object based on the detected object and the calculated aesthetic score. Compared with other state-of-the-art techniques, which are domain-specific (e.g., supporting only sports 360° videos), our system supports most 360° video genres. Performance evaluation of the proposed framework on data we collected from various websites indicates its performance for different categories of 360° videos.


1 Introduction
A 360° panoramic video provides an entire view of the surrounding environment, which gives it an advantage over standard normal field of view (NFOV) videos with a 90° view. This dramatic increase in field of view (FOV) provides exciting new ways to capture and record visual content. For example, imagine a cricket player who wants to study and analyse his weak shots and the mistakes he made during a game. With an NFOV camera, it is impossible for the player to view all scenes at the same time. With a 360° video, however, he can review all the shots and mistakes that he could not notice while playing. Similarly, in many other situations such as education and training[1], a 360° video provides a view of the virtual world that a standard camera's limited FOV restricts. Further, 360° videos have attracted attention at both the consumer and production grades (e.g., GoPro, 360°Fly), and 360° tools have been warmly welcomed in the market. Major sites including Google, Facebook, and YouTube have also started to support 360° video content. It is predicted that in the near future, 360° video will become a major source of entertainment for both augmented reality (AR) and virtual reality (VR).
Although 360° video provides a wide view of visual content, the limited FOV of human perception leads to several problems when watching it. Foremost, it is very strenuous for a user to decide "where to look" because of the spacious scene of a 360° video. There are different techniques for selecting the current FOV. The first is to navigate the 360° video manually: a standard 360° viewer displays the video, and users search for the region of interest (ROI) through mouse clicks. The second is to watch the 360° video with a wearable device such as Samsung Gear, whose embedded sensors navigate the video according to head movements. Both techniques require the user to select the ROI manually, and finding the ROI through mouse clicks and head movements is very difficult, especially in immersive content. Further, a recent study showed that these techniques produce mental stress (such as motion sickness), and users can feel discomfort while watching 360° video (Figure 1). Therefore, an intelligent system is needed to find the interesting scene in the wide view of a 360° video and bring it into the user's current FOV. The state of the art contains many automatic content selection mechanisms for displaying video. Such mechanisms[2-5] have been used to condense a long movie or surveillance video into a short clip. Other methods[6,7] select a few key-frames as the summary of the entire video. Domain-specific highlights are the focus of [8-10]. Key-frame selection based on weakly supervised techniques is proposed in [11,12]. Deep learning approaches that show high performance are proposed in [13,14]. Video summarization based on the diversity of object faces and on interestingness is proposed in [15]. Further, [16] proposed video summarization based on the influence of each frame and on object tracking. Summarization techniques based on saliency are proposed in [17,18]. Most of these mechanisms make a binary decision on whether or not to select a frame, whereas in 360° videos, finding the interesting content within a single frame and steering that region to the user's FOV is a challenging task.
A study similar to our proposed approach was conducted by [19], in which candidate events of interest are detected and dynamic programming is then applied to link all the detected events together. However, because this method must first observe the object across the entire video, it is not applicable to video streaming. Another similar study was conducted by [20], which selects the most active moving object in the foreground of a frame in sports videos and steers it to the user's FOV. This system selects only the most active foreground object in the 360° video, and it fails on natural scenes in which no object is moving. In this study, we propose the first Deepdive mechanism based on the interestingness of the scene, which analyses each 360° video frame, automatically selects the most fascinating scene in the frame, and steers it to the user's FOV to increase user comfort inside virtual content. Our system not only follows natural scenes but also tracks moving objects in the video. Our main contributions are as follows:
• A 360° video covers the overall scenario, which makes such videos difficult to watch when a user is interested only in a specific FOV. It requires physical effort and creates mental stress for users when the field of interest is outside the current view. To tackle this challenge, we propose an intelligent and novel framework that brings the interesting and visually important view to the user's current FOV.
• In 360° videos, selecting and viewing an object-based FOV is very tedious because of the continuous motion of the object in the wide view. Existing approaches, such as manually selecting the object or using an HMD device, produce mental stress for the viewer while watching a 360° video. In the proposed system, an existing deep learning-based approach is used to efficiently find the salient object for the viewer, thus decreasing mental stress and increasing comfort while watching 360° videos.
• Besides tracking different objects, 360° videos also provide a wide view of the surrounding world. Beautiful scenes (natural landscapes, historic places, and tour videos) are enclosed inside 360° videos. The hardest part of a Deepdive system is to predict the interestingness of a sub-frame as an ROI of a 360° video. To detect an interesting UFOV for a user, we measure the interestingness of the sub-frames using a CNN model for aesthetic calculation and select the one with the highest confidence value.
The rest of the paper is organized into four sections. Section 2 details the current state-of-the-art methods. Section 3 describes the proposed system, including its detailed outline and the flow of frames. Section 4 presents the experimental results, where the efficiency of the system is measured through different experiments and compared with other state-of-the-art techniques. Section 5 concludes the paper and outlines future directions.
2 Literature review
This section provides a detailed summary of the current literature on view selection in videos. A great deal of work has been done on video processing, ranging from detection[21] and quality assessment[22] to autonomous driving[23]. Compared with conventional videos, however, 360° videos give users an exciting experience through the illusion of being present in the virtual content. This new video genre has attracted users, large companies, and researchers, but it has also created new challenges in its exploration for various applications. In 360° content, different problems arise from the high resolution of the video and the limited FoV of human perception. Foremost, it is very strenuous for a user to decide "where to look" because of the wide coverage of a 360° video. Certain manual techniques make it possible to navigate the current FoV in a 360° video. The first technique involves mouse clicks while a viewer watches an input video in a 360° video player. Viewers can also search for the current FoV in a 360° video using a wearable device such as a head-mounted display (HMD), e.g., Samsung Gear. The embedded sensors in such devices navigate the video according to head movements. A typical example of manually viewing a 360° video using an HMD is shown in Figure 1. These manual techniques allow viewers to select the current FoV in a 360° panoramic video. However, searching for an ROI in such vast 360° content through mouse clicks and head movements is dizzying and exhausting for the viewer. Further, a recent study[24] showed that these techniques produce VR sickness, such as mental stress, motion sickness, headache, stomach awareness, and disorientation, so that viewers feel discomfort while watching immersive videos.
To overcome these challenges, automatic virtual cameras for 360° videos[25,26] are an attractive field in which novel techniques are designed to process an unedited video and generate visually appealing and pleasant events. Other related works apply different techniques, including saliency detection[27,28], video retargeting[29,30], and video summarization[31], to extract visually interesting and salient regions in videos. However, these methods have several limitations when dealing with complex 360° videos. Saliency detection in VR images is still at a primitive level[32,33], and these methods are restrained by the dynamic nature of 360° videos. Video retargeting also predicts visual attention in videos, but its main problem is that it requires a well pre-processed and edited video. Moreover, video summarization generates only keyframes and removes redundant frames, resulting in a non-sequential set of frames. In comparison, our focus is to generate a smooth, visually interesting, and pleasant video without losing the sequence of the frames.
Saliency detection aims to capture the viewer's attention by extracting the interesting visual content in input videos. For instance, Jiang et al. treated saliency as a regression problem, where an image is segmented at multiple levels and saliency scores are computed using a supervised learning technique[34]. All these scores are then fused to produce a saliency map. In another study, Tong et al. produced a saliency map from an input image to generate training samples, followed by a classifier that learns to detect salient pixels[35]. Finally, the saliency map and the salient pixels are integrated to improve detection performance. However, saliency maps have blurry edges that limit the performance of saliency detection. Li et al. used an end-to-end CNN with a pixel-level fully connected layer and a segment-wise feature layer to handle the blurry-edge problem[36]. Wang et al. introduced a foreground inference network for detecting interesting and salient objects in an image. However, this method ignored the extraction of global semantic information, which limited effective feature learning[37]. Zhang et al. extracted selective multi-level contextual information in a progressive manner[38] and developed multi-path recurrent feedback to improve system efficiency. Apart from these methods, Wang et al. introduced a novel approach named attentive saliency network, which mimics the human visual system by developing a fixation map to find salient and interesting objects in a visual scene[39]. The main aim of saliency is to emphasize the local content of the image, whereas our proposed system focuses on the entire 360° FOV. Moreover, these methods focus on 2D locations in the image, while the proposed Deepdive architecture uses a virtual camera with a 120° FOV that provides a smooth, interesting, and pleasant video to the viewer.
Besides saliency, another emerging domain, known as "video retargeting", aims at cropping, scaling, and best fitting a source video to a given display. For instance, Lin et al. introduced a novel approach to video retargeting based on content-aware warping[40]. Their system uses object-preserving warping, which reduces the unpleasant distortion of objects inside an event. The visually salient 3D object is passed through the "as-rigid-as-possible warping" scheme, while the least significant content is passed to the "as-close-as-possible" scheme for linear scaling, which enables their system to avoid over-deformation. This problem was also studied by Zhang et al.[41], who proposed a compressed-domain solution for video retargeting on low-power devices such as smartphones. In their method, low-level features such as motion information are extracted from the motion vectors of the bit-streams to enhance the quad-shape mesh deformation in compressed video. These methods operate on specific regions or pixels of the images, which distorts the shapes of the objects. Li et al. proposed a grid flow method for video retargeting to overcome the distortion problem[42]. They developed a two-step approach: in the first step, a video is divided into segments containing grids, called grid flows, for content inconsistency removal; in the second step, these grid flows are used to select key-frames for the summarization of each video segment. Bansal et al. proposed an unsupervised approach that combines spatial and temporal information with an adversarial loss to enhance content translation and preservation[29]. Kim et al. proposed a deep architecture for fast video inpainting that uses an image-based encoder-decoder model to synthesize still-unknown regions, resulting in enhanced and pleasant videos[43]. Both our work and video retargeting focus on selecting a visually appealing and salient portion for the viewer. However, video retargeting requires a well pre-edited video, while our method emphasizes extracting a visually appealing and engaging portion that is displayed to the user through a 120° virtual camera.
Video summarization is another emerging domain in computer vision[44,45], which aims to condense a full-length video into clips by removing redundant frames while preserving the important ones. There is a large body of work on video summarization. For example, Hussain et al. proposed a lightweight CNN model to select suspicious objects in a multi-view video surveillance system[46]. Their system can be used in industrial applications to save bandwidth and reduce other resource consumption. There are other related works on efficient video summarization, such as [31,47,48]. The main goal of these methods is to remove redundant frames and keep the salient information as a combination of key-frames or a concatenation of disjoint frames. In comparison with these methods, the output of our system is a continuous video of interesting and pleasant scenes from a 360° video.
In the current literature, there also exist several methods for virtual camera selection in 360° video. For instance, Su et al. proposed a conventional algorithm that creates a virtual camera within a 360° video to control the viewing angle of the viewer[26]. However, their method lacks salient object detection in the video. Another technique, proposed by Drakopoulos et al.[49], used a conventional iris tracking technique to point the FOV in a mobile-based VR system; however, their system is not robust to illumination and other spatial changes. Similar work was presented by Hu et al.[50], based on a deep learning approach called "deep 360 pilot", an agent that navigates the viewing angle in 360° sports videos. This method focuses only on sports videos for which pre-annotated frames of the objects are supplied, and it is limited when dealing with other categories, e.g., entertainment, tour, cartoon, and documentary videos. Cheng[25] computed saliency-based heat maps to predict the most salient scene in wild 360° videos. This viewing angle was further enhanced by Xu et al.[51], who combined eye-gaze data with saliency data to control "where to look" in 360° videos. In addition, Chen et al.[52] developed a deep learning (DL) approach for predicting user head movement while using a VR device; the authors used a CNN plus long short-term memory (LSTM) model to predict head movement from the user's current head position and FOV in order to provide a realistic environment experience. Their method achieved 16% better accuracy than baseline methods. The problem of "where to look" was further studied by Li et al.[53], who proposed a CNN-based virtual camera called "viewport" with which they predicted the saliency of the user's PoI.
Most of the existing methods are domain-specific: they work only for sports and wild videos and have limitations when dealing with other 360° video categories. Furthermore, 360° videos are considered to be a primary source of entertainment, and to the best of our knowledge, no system exists that finds an interesting and visually pleasant FoV in 360° videos. Therefore, we propose an intelligent system that can find a visually pleasant and interesting FoV and that covers most 360° video categories, such as sports, entertainment, tour, cartoon, and documentary videos.
3 Proposed framework
In this section, we give a detailed overview of the proposed method, which is composed of several steps. For ease of understanding, the methodology is divided into three subsections. Subsection 3.1 (pre-processing) describes how a 360° video frame is divided into sub-frames. Subsection 3.2 presents how deep CNN architectures measure the memorability and aesthetic score of the salient object in the immersive content. Subsection 3.3 describes how the memorability and aesthetic scores are fused and how the salient object is controlled inside 360° videos. A visual representation of the overall framework is depicted in Figure 3.
3.1 Input acquisition
For most people, resolution comes to mind when referring to video content. However, 360° video is somewhat more complicated than normal video. In a 360° video, the content is stretched 360° horizontally and 180° vertically, and the whole scene is split between the viewer's two eyes, which limits the user's FOV to 120° out of 360°. This means that a viewer can see only one third of the 360° view at a given time. Hence, based on the viewer's FOV, the input frame of the immersive video is divided into three sub-frames, each with a 120° view, as shown in Figure 3 (Step 1: Input Acquisition). Further, 360° videos come in different resolutions ranging from 2K to 8K, as shown in Figure 2. In the proposed framework, videos from various sources are pre-processed to automatically adjust the resolution of the user's FOV using the following equations:
$$p_{total} = r \times c \quad (1)$$
$$UFOV = \frac{c}{3} \quad (2)$$
where UFOV is the horizontal extent of the 120° view. Equation 2 is used to automatically split the 120° horizontal view from the input 360° frame.
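The split described by Equations 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that r and c denote the number of rows and columns of the frame (the text does not define them), that the frame is a NumPy array, and the function name `split_into_subframes` is hypothetical.

```python
import numpy as np

def split_into_subframes(frame: np.ndarray, parts: int = 3) -> list:
    """Split an equirectangular frame (rows x cols x channels) into equal-width views."""
    rows, cols = frame.shape[:2]
    width = cols // parts  # Equation 2: UFOV = c / 3 when parts == 3
    return [frame[:, i * width:(i + 1) * width] for i in range(parts)]

# Dummy frame at the resolution reported for the dataset (assumed 720 rows, 1920 columns).
frame = np.zeros((720, 1920, 3), dtype=np.uint8)
sub_frames = split_into_subframes(frame)

# Equation 1: total number of pixels in the frame (assuming r = rows, c = columns).
p_total = frame.shape[0] * frame.shape[1]
```

Under these assumptions, each sub-frame covers one third of the columns, which corresponds to a 120° horizontal view of the 360° frame.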
3.2 Deepdive architecture
To detect objects in images and videos, various CNN architectures have been proposed that process an image multiple times. YOLO (you only look once), as the name suggests, instead scans the image in a single forward pass to find objects and predict their bounding boxes. This single-pass strategy enables YOLO to detect objects in real time. To detect and classify objects in images, the YOLO architecture is divided into two parts: 1) a feature extractor and 2) a feature detector. In our proposed method, we modified the original YOLO architecture to extract only the visual features related to the salient object. For feature extraction, we used YOLO v3 (Darknet-53), which consists of 53 layers of consecutive 3×3 and 1×1 convolutions followed by skip connections. The architecture is stacked with 53 more layers, forming a total of 106 layers; processing is slower than in previous versions, but performance improves in terms of the features related to the salient object. A detailed overview of visual feature extraction is illustrated in Figure 4.
We extracted features at three scales for efficient feature extraction. In the first scale, the features at layer 81 are downsampled with a stride of 32, resulting in our first feature map of size 13×13. In the second scale, the features from layer 79 onward are convolved and then upsampled to a size of 26×26. These features are concatenated with the features of layer 61 to form a new feature map. For the third scale, the same procedure is followed for layer 91 onward, and the result is fused with the features at layer 36.
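The feature-map sizes at the three scales follow from simple stride arithmetic, sketched below. The 416×416 input size is an assumption (it is the size implied by a 13×13 map at stride 32), and the strides of 16 and 8 for the second and third scales are those of the standard YOLO v3 design rather than values stated in the text; `grid_sizes` is a hypothetical helper.

```python
def grid_sizes(input_size: int = 416, strides=(32, 16, 8)) -> list:
    """Return the side length of the feature map produced at each detection scale."""
    return [input_size // stride for stride in strides]

# Side lengths of the feature maps at the first, second, and third scales.
sizes = grid_sizes()
```

With these assumed values, the first two entries match the 13×13 and 26×26 maps described above, and the third scale would yield a finer map suited to smaller objects.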
Besides the visual features related to the salient object, we also extracted aesthetic features for each scene of the 360° videos. For this purpose, we used an existing state-of-the-art model[54] to measure the aesthetic score of each scene of the immersive video. To extract spatial features, the pooling operation inside the network differs slightly for the two feature types. One activation block of the inception module results in a fixed spatial resolution, a 1×1 global resolution, and a 5×5 spatial average pooling for the wide features. All the extracted features are resized and concatenated along their kernel size. More details about the model are available in [54].
3.3 View selection pipeline
Once the salient object and the aesthetic score of each scene of the immersive content are measured, our final goal is to fuse the scores to find the most dominant and interesting view inside the 360° video. For this purpose, we compute a weighted sum of the detected salient object (measured via its confidence value) and the aesthetic score. We assign the balancing weights γ = 0.7 and µ = 0.3 to the salient object and the aesthetic score of the scene, respectively. The overall equation can be summarized as:
$$VS = \varepsilon \gamma + \mu \omega \quad (3)$$
where VS identifies the dominant view among the three FOVs, ε is the confidence score of the detected object, and ω is the aesthetic score.
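The weighted fusion can be illustrated with a short sketch. It is not the authors' code: it assumes that both the detector confidence and the aesthetic score are normalized to [0, 1] (the text does not state their ranges), the example scores are made up, and the names `view_score` and `select_view` are hypothetical.

```python
GAMMA = 0.7  # weight of the salient-object confidence, as stated in the text
MU = 0.3     # weight of the aesthetic score, as stated in the text

def view_score(confidence: float, aesthetic: float,
               gamma: float = GAMMA, mu: float = MU) -> float:
    """Weighted sum of detector confidence and aesthetic score for one sub-frame."""
    return gamma * confidence + mu * aesthetic

def select_view(confidences, aesthetics):
    """Return the index of the sub-frame with the highest fused score, and all scores."""
    scores = [view_score(c, a) for c, a in zip(confidences, aesthetics)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Illustrative (made-up) scores for the three 120° sub-frames of one frame.
best, scores = select_view([0.20, 0.90, 0.50], [0.80, 0.40, 0.60])
```

In this example the second sub-frame wins because its high object confidence outweighs the better aesthetic score of the first, which reflects the larger weight placed on the salient object.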
4 Experimental setup
This section describes the experimental setup of the proposed system and its evaluation on different videos downloaded from YouTube. A detailed discussion of the collected videos and the results obtained from the proposed system is presented in the following subsections.
4.1 Dataset description
Different videos were used to evaluate the efficiency of the proposed system. A total of 7 videos of different categories were downloaded from YouTube. These categories include sports, entertainment, cartoon, and general home videos. The videos were downloaded at 720×1920 resolution with different durations, ranging from 30 seconds to 516 seconds. Each video was downloaded in equirectangular format with the .mp4 extension, and the frame rate was kept at 30 fps. Details of each video are presented in Table 1. The content of the entertainment videos changes over time. The ground truth for each video was generated manually and stored in a separate directory.
Table 2 Performance of CNN models for various objects
| Name of object | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| Animal | 85.71 | 75.00 | 80.00 |
| Person | 65.22 | 66.67 | 65.93 |
| Cartoon | 66.67 | 68.09 | 67.37 |
4.2 Structure of the dataset
To make it easy for other researchers to use the dataset and add more traces, the dataset is organized in a structured format. For each video, there are five sub-directories: sub-frame1, sub-frame2, sub-frame3, ground truth, and predicted. Each input frame is divided into three sub-frames, which are written to the sub-frame1, sub-frame2, and sub-frame3 directories, respectively. The sub-frame that the system selects for the viewer is written to the predicted directory. The predicted frame is named with the name of the sub-frame followed by the frame number, e.g., sub_frame_1_0, where sub_frame_1 denotes the first sub-frame and 0 is the frame number. Later, sub_frame_1_0 is compared with the ground truth to analyse the proposed system.
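The naming scheme and the comparison against the ground truth can be sketched as below. This is only an illustration: the text does not specify how ground-truth entries are stored, so the sketch assumes they use the same sub_frame_<k>_<n> labels, and both function names are hypothetical.

```python
def predicted_name(sub_frame: int, frame_no: int) -> str:
    """Build a label such as 'sub_frame_1_0' (sub-frame 1, frame number 0)."""
    return f"sub_frame_{sub_frame}_{frame_no}"

def selection_accuracy(predicted, ground_truth) -> float:
    """Percentage of frames whose predicted sub-frame label matches the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * hits / len(predicted)

# Made-up labels for four consecutive frames.
pred = [predicted_name(1, 0), predicted_name(2, 1), predicted_name(2, 2), predicted_name(3, 3)]
truth = ["sub_frame_1_0", "sub_frame_2_1", "sub_frame_3_2", "sub_frame_3_3"]
acc = selection_accuracy(pred, truth)
```

Here three of the four predicted labels agree with the assumed ground truth, so the sketch reports an accuracy of 75%.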
4.3 Experimental setup
This system was implemented in Python 3.6. For basic image processing operations, we used the open-source image processing library OpenCV version 4.0. Other required libraries include numpy, keras, tensorflow (GPU version), caffe (compiled for python), matplotlib, scikit-image, and scikit-learn.
To evaluate the proposed system, ground truth was first generated manually from the seven videos by carefully examining each frame. In total, 8000 of the 65000 frames were labelled as ground truth. In each video, an object (human, animal, or vehicle) in a scene is considered the ground truth. The videos contain static viewpoints (SVP), in which the object is static, and moving viewpoints (MVP), in which the object moves around the 360° video. The proposed system performed well and was able to focus only on the view in which the desired objects move around.
4.4 System accuracy
To analyse the accuracy of the system, different types of experiments were carried out, and different networks were combined to improve accuracy. Initially, only the aesthetic scores of sub_frame 1, sub_frame 2, and sub_frame 3 are calculated, and the frame with the highest aesthetic score is shown to the viewer. However, because of the varying content, scenes, and lighting conditions in the videos, the best accuracy obtained was 52.57%. In the next step, only salient objects are measured, and the frames containing the most salient objects are displayed to the viewer. The highest accuracy obtained with salient objects is 62.36%. The main purpose of the proposed system is to track the FOV in which an object moves around in the 360° video. For this purpose, we combined visual features with aesthetic features to improve the overall accuracy of the proposed system. The highest accuracy obtained by the proposed system is thus 68.24%, as shown in Figure 5. For the SVP videos, Video 1 and Video 6, the aesthetic score is 54.12% and 54.81%, respectively; for the MVP videos, Video 2, Video 3, Video 4, Video 5, and Video 7, it is 53.01%, 51.90%, 52.04%, 50.65%, and 53.02%, respectively. Each approach showed better results on SVP videos than on MVP videos because of the static view. In the MVP videos, the content, scenery, size of objects, and motion of objects change over time, which results in lower accuracy than on SVP videos. Moreover, MVP videos involve different objects (humans, animals, and vehicles), and the misclassification of an object decreases the accuracy of the proposed system.
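The SVP versus MVP gap for the aesthetic-only approach can be checked with a little arithmetic on the per-video figures quoted above. The snippet below is a sketch that only averages those reported numbers; the variable names are illustrative.

```python
from statistics import mean

# Per-video figures reported for the aesthetic-only approach (in %), keyed by video number.
aesthetic_accuracy = {1: 54.12, 2: 53.01, 3: 51.90, 4: 52.04, 5: 50.65, 6: 54.81, 7: 53.02}

svp_videos = (1, 6)           # static viewpoint videos
mvp_videos = (2, 3, 4, 5, 7)  # moving viewpoint videos

svp_mean = mean(aesthetic_accuracy[v] for v in svp_videos)
mvp_mean = mean(aesthetic_accuracy[v] for v in mvp_videos)
gap = svp_mean - mvp_mean
```

The averages come out at roughly 54.5% for SVP and 52.1% for MVP, a difference of a little over two percentage points in favour of the static-viewpoint videos.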
Salient object detection, which covers the person, animal, and cartoon classes, is one of the important components of the proposed framework; the detector is evaluated in terms of precision, recall, and F1-score. We used 1200 selected frames from the videos given in Table 1, where each object appeared in a minimum of 1000 frames. The performance of the detector is given in Table 2, where the animal class has the highest precision, recall, and F1-score, while the person and cartoon classes have relatively lower performance. A possible reason is that the animal class has little similarity with the other classes, whereas the person and cartoon classes are relatively similar to each other and both have complex shapes.
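As a consistency check on the detector metrics, the F1-score can be recomputed from the reported precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below uses only the values from the performance table; the helper name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision %, recall %, reported F1 %) for each class in the performance table.
reported = {
    "Animal": (85.71, 75.00, 80.00),
    "Person": (65.22, 66.67, 65.93),
    "Cartoon": (66.67, 68.09, 67.37),
}

recomputed = {name: round(f1_score(p, r), 2) for name, (p, r, _) in reported.items()}
max_gap = max(abs(recomputed[name] - f1) for name, (_, _, f1) in reported.items())
```

The recomputed values agree with the reported F1-scores to within about 0.01, which is the size of discrepancy expected when the inputs are themselves rounded to two decimals.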
Table 1 Details of the dataset videos
| Video No. | Video name | Focus point | Starting offset | FOV | Resolution | FPS |
|---|---|---|---|---|---|---|
| 1 | 360° Degree Kitchen Home Tour | Persons | 0:01 | 2k | 1920×720 | 30 |
| 2 | Kitchen 360 test | Person | 0:01 | 2k | 1920×720 | 30 |
| 3 | 360° Camera England at Wembley, unlike you have seen before! | Persons | 0:05 | 2k | 1920×720 | 30 |
| 4 | Real Madrid vs. Juventus \| 2017 Champions League Final \| 360° VIdeo \| FOX SOCCER | Persons | 0:10 | 2k | 1920×720 | 30 |
| 5 | Lions 360° National Geographic | Animal | 0:07 | 2k | 1920×720 | 30 |
| 6 | Clash of Clans 360°: Experience a Virtual Reality Raid | Cartoon | 0:04 | 2k | 1920×720 | 30 |
| 7 | 360° Underwater National Park National Geographic | Animal | 0:04 | 2k | 1920×720 | 30 |
4.5 Time analysis
A number of experiments were conducted to analyse the effectiveness of the proposed system in terms of time complexity. Using the aesthetic approach, the system took an average of 0.06 seconds to process each frame. Using YOLO visual features, on the other hand, the average time was 0.08 seconds. However, the accuracy of the visual features is higher than that of the aesthetic approach, as shown in Figure 5. Further, the accuracy for various sets of videos is illustrated in Figure 6. When YOLO features are used for object detection to capture the FOV of user interest, the processing time of YOLO is 0.1-fps. The time consumed by each approach and by the fusion of the approaches is presented in Figure 7.
4.6 User study
We conducted a user study to compare and investigate the usefulness of the proposed system in more detail, exploring user interaction with 360° video more comprehensively. The apparatus used in this study consists of a Samsung s6-edge smartphone and a Samsung Gear VR HMD for showing the output videos of the proposed system to the users.
A total of 20 users were recruited from our research facilities. The ages of these users are between 20 and 40 years.
In the first step of the experiment, users watch the video on the HMD device and find the interesting FOV by searching for different objects in the video through head movements. In the second step, the output video of the proposed system is played on the HMD device, and users watch it without manually searching for the FOV. Before each video is presented to the users, a summary of the video is generated using the proposed system and supplied to the smartphone placed inside the HMD device. After watching each video, the user fills in a questionnaire in which the video is rated in one of five classes: 1) Excellent, 2) Good, 3) Satisfactory, 4) Needs improvement, and 5) Poor. The proposed system outperforms manual FOV searching with the HMD device for both MVP and SVP videos. The overall percentage for SVP videos is slightly higher than for MVP videos, as the FOV in the SVP videos is constant and more stable. The percentage ratings given by users to each video are presented in Figure 8.
4.7 Comparison with state-of-the-art methods
The proposed system is compared with other state-of-the-art techniques to assess its efficiency and effectiveness. It is evaluated against similar studies conducted by [53], [50], and [19]. In [53], viewport, saliency, and VQA scores of 360° videos are proposed for the viewer. However, the final version of that system has not been made available at the given link, so its results cannot be compared with those of the proposed system. The method proposed in [50] is limited to sports videos. Further, it works efficiently only when there is a single object in the video; it fails when there are multiple objects in the wide scene of a 360° video or when there is no object at all. The system proposed in [50] processes the whole video and generates pre-defined viewing angles for specific videos. The coordinates of these viewing angles are then used through dynamic programming to steer the UFOV of the 360° video to the viewer. In contrast, the proposed system does not generate any pre-defined viewing angles to focus the UFOV, nor does it require any post-processing of the 360° videos. The limitations discussed for these state-of-the-art methods are handled effectively and efficiently by the proposed system. The proposed system works for most categories of 360° videos, ranging from sports videos to tour videos, natural scenes, entertainment, and cartoon videos, thus outperforming state-of-the-art techniques in category coverage. The comparison of the proposed system with other state-of-the-art methods is presented in Table 3.
Table 3 Comparison with other existing techniques
Method            Supported video category
                  Sports   Entertainment   Tour/General videos   Cartoon
[50]              Yes      No              No                    No
[19]              Yes      Not provided    Not provided          Not provided
Proposed system   Yes      Yes             Yes                   Yes
4.8 Limitation of the proposed system
The proposed system has certain limitations. First, because the UFOV is generated automatically, the system produces only a single output video from the whole 360° video. During the user study, some users were more interested in entertainment videos and documentary videos. As different users have different preferences, the priority-based UFOV selection may not suit every viewer. Moreover, the proposed system fails when an object moves to the edges of the sub-frames, which is handled by the memorability score only to some extent. Further, because the system fuses two deep models, its processing is not real-time, which limits its direct use on an HMD device.
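The edge limitation can be illustrated with a minimal sketch. It assumes an equirectangular frame whose columns are split into three equal 120° sub-frames and an object described only by the left and right columns of its bounding box. The function names and the 3840-pixel width are hypothetical choices for illustration and are not taken from the proposed system.

```python
def subframe_bounds(frame_width, n_subframes=3):
    """Return (start, end) column ranges, end exclusive, for equal-width sub-frames."""
    step = frame_width // n_subframes
    bounds = []
    for i in range(n_subframes):
        start = i * step
        # The last sub-frame absorbs any remainder columns.
        end = frame_width if i == n_subframes - 1 else start + step
        bounds.append((start, end))
    return bounds


def subframes_touched(box_left, box_right, frame_width, n_subframes=3):
    """Indices of the sub-frames overlapped by a box spanning columns [box_left, box_right)."""
    touched = []
    for i, (start, end) in enumerate(subframe_bounds(frame_width, n_subframes)):
        if box_left < end and box_right > start:
            touched.append(i)
    return touched


def straddles_boundary(box_left, box_right, frame_width, n_subframes=3):
    """True when the box is cut by at least one sub-frame boundary."""
    return len(subframes_touched(box_left, box_right, frame_width, n_subframes)) > 1


# Hypothetical 3840-pixel-wide frame: each sub-frame covers 1280 columns.
print(subframe_bounds(3840))
print(straddles_boundary(1200, 1400, 3840))  # box crosses the first boundary
print(straddles_boundary(100, 300, 3840))    # box lies inside the first sub-frame
```

An object whose box straddles a boundary appears only partially in each neighbouring sub-frame, which is the situation in which a per-sub-frame detector is likely to miss it. The sketch does not handle boxes that wrap around the 0°/360° seam.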
5 Conclusion
We developed a system composed of two different deep models that automatically selects the UFOV in 360° videos for users. Our focus was to develop a system that is efficient for most categories of 360° videos, i.e., sports, entertainment, general, and tour videos. The proposed system was validated on newly collected 360° videos of different categories. The experimental results showed that the proposed method outperforms other state-of-the-art techniques, which were developed for domain-specific 360° videos.
Our future work includes reducing the time cost of the system by minimizing the number of deep models and using lightweight models that can run on small devices. We also plan to utilize deep learning models to find and learn the UPI in a single 360° video and thus generate multiple videos based on users' interests. Instead of dividing the input 360° frame into multiple 120° sub-frames, we will focus on designing a single 120° viewing angle that steers according to the UPI. This viewing angle will also address the edge problem in the current system and will minimize the time cost. Our final focus will be developing an application based on a deep model in which the viewing angle guides users in 360° videos, thus minimizing their mental stress while watching 360° videos.
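A single steerable 120° viewing angle could be sketched as follows. This is only an illustration under stated assumptions: the frame is equirectangular, so columns map linearly to yaw, the UPI is given as a yaw angle in degrees, and the view is a plain column crop rather than a perspective reprojection. The names viewport_columns and crop_viewport are hypothetical.

```python
def viewport_columns(center_deg, frame_width, fov_deg=120.0):
    """Column indices of a fov_deg-wide window centred at yaw center_deg.

    Indices wrap around the 0/360 degree seam, so no fixed boundary exists.
    """
    width = round(frame_width * fov_deg / 360.0)
    center_col = round((center_deg % 360.0) / 360.0 * frame_width)
    start = center_col - width // 2
    return [(start + k) % frame_width for k in range(width)]


def crop_viewport(frame, center_deg, fov_deg=120.0):
    """Crop every row of frame (a list of rows of pixels) to the steered window."""
    cols = viewport_columns(center_deg, len(frame[0]), fov_deg)
    return [[row[c] for c in cols] for row in frame]


# Toy one-row frame in which each pixel stores its own column index (1 column per degree).
frame = [list(range(360))]
view = crop_viewport(frame, center_deg=0.0)
print(len(view[0]), view[0][0], view[0][-1])
```

Because the window is centred on the point of interest and wraps at the seam, an object near what would have been a fixed sub-frame edge stays inside the view as long as the yaw estimate follows it.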



Khan N, Muhammad K, Hussain T, Nasir M, Munsif M, Imran A S, Sajjad M. An adaptive game-based learning strategy for children road safety education and practice in virtual space. Sensors, 2021, 21(11): 3661 DOI:10.3390/s21113661


Muhammad K, Hussain T, Baik S W. Efficient CNN based summarization of surveillance videos for resource-constrained devices. Pattern Recognition Letters, 2020, 130: 370–375 DOI:10.1016/j.patrec.2018.08.003


Mehmood I, Sajjad M, Baik S W. Video summarization based tele-endoscopy: a service to efficiently manage visual data generated during wireless capsule endoscopy procedure. Journal of Medical Systems, 2014, 38(9): 1–9 DOI:10.1007/s10916-014-0109-y


Muhammad K, Ahmad J, Sajjad M, Baik S W. Visual saliency models for summarization of diagnostic hysteroscopy videos in healthcare systems. SpringerPlus, 2016, 5(1): 1495 DOI:10.1186/s40064-016-3171-8


Haq I U, Muhammad K, Ullah A, Baik S W. DeepStar: detecting starring characters in movies. IEEE Access, 2019, 7: 9265–9272 DOI:10.1109/access.2018.2890560


Liu D, Hua G, Chen T. A hierarchical visual model for video object summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(12): 2178–2190 DOI:10.1109/tpami.2010.31


Khosla A, Hamid R, Lin C J, Sundaresan N. Large-scale video summarization using web-image priors. 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, 2698–2705 DOI:10.1109/cvpr.2013.348


Potapov D, Douze M, Harchaoui Z, Schmid C. Category-specific video summarization. In: Computer Vision–ECCV 2014. Cham, Springer International Publishing, 2014, 540–555


Sun M, Farhadi A, Seitz S. Ranking domain-specific highlights by analyzing edited videos. In: Computer Vision–ECCV 2014. Cham, Springer International Publishing, 2014, 787–802


Yao T, Mei T, Rui Y. Highlight detection with pairwise deep ranking for first-person video summarization. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, 982–990 DOI:10.1109/cvpr.2016.112


Zhao B, Xing E P. Quasi real-time summarization for consumer videos. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA, IEEE, 2014, 2513–2520 DOI:10.1109/cvpr.2014.322


Gong B Q, Chao W L, Grauman K, Sha F. Diverse sequential subset selection for supervised video summarization. Advances in Neural Information Processing Systems, 2014, 3: 2069–2077


Zhang K, Chao W L, Sha F, Grauman K. Summary transfer: exemplar-based subset selection for video summarization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, IEEE, 2016, 1059–1067 DOI:10.1109/cvpr.2016.120


Zhang K, Chao W L, Sha F, Grauman K. Video summarization with long short-term memory. In: Computer Vision–ECCV 2016. Cham, Springer International Publishing, 2016, 766–782


Lee Y J, Ghosh J, Grauman K. Discovering important people and objects for egocentric video summarization. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA, IEEE, 2012, 1346–1353 DOI:10.1109/cvpr.2012.6247820


Lu Z, Grauman K. Story-driven summarization for egocentric video. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA, IEEE, 2013, 2714–2721 DOI:10.1109/cvpr.2013.350


Perazzi F, Krähenbühl P, Pritch Y, Hornung A. Saliency filters: contrast based filtering for salient region detection. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA, IEEE, 2012, 733–740 DOI:10.1109/cvpr.2012.6247743


Wang J W, Borji A, Jay Kuo C C, Itti L. Learning a combined model of visual saliency for fixation prediction. IEEE Transactions on Image Processing, 2016, 25(4): 1566–1579 DOI:10.1109/tip.2016.2522380


Su Y C, Jayaraman D, Grauman K. Pano2Vid: automatic cinematography for watching 360° videos. 2016


Lin Y C, Chang Y J, Hu H N, Cheng H T, Huang C W, Sun M. Tell me where to look: investigating ways for assisting focus in 360° video. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. Denver Colorado USA, New York, NY, USA, ACM, 2017, 2535–2545 DOI:10.1145/3025453.3025757


Ullah H, Muhammad K, Irfan M, Anwar S, Sajjad M, Imran A S, de Albuquerque V H C. Light-DehazeNet: a novel lightweight CNN architecture for single image dehazing. IEEE Transactions on Image Processing, 2021, 30: 8968–8982 DOI:10.1109/tip.2021.3116790


Ullah H, Irfan M, Han K, Lee J W. DLNR-SIQA: deep learning-based no-reference stitched image quality assessment. Sensors, 2020, 20(22): 6457 DOI:10.3390/s20226457


Sajjad M, Irfan M, Muhammad K, Ser J D, Sanchez-Medina J, Andreev S, Ding W P, Lee J W. An efficient and scalable simulation model for autonomous vehicles with economical hardware. IEEE Transactions on Intelligent Transportation Systems, 2021, 22(3): 1718–1732 DOI:10.1109/tits.2020.2980855


Kim H G, Baddar W J, Lim H T, Jeong H, Ro Y M. Measurement of exceptional motion in VR video contents for VR sickness assessment using deep convolutional autoencoder. VRST '17: Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technology. 2017, 1–7 DOI:10.1145/3139131.3139137


Cheng H T, Chao C H, Dong J D, Wen H K, Liu T L, Sun M. Cube padding for weakly-supervised saliency prediction in 360° videos. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 1420–1429 DOI:10.1109/cvpr.2018.00154


Su Y C, Grauman K. Making 360° video watchable in 2D: learning videography for click free viewing. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, IEEE, 2017, 1368–1376 DOI:10.1109/cvpr.2017.150


Li G B, Yu Y Z. Visual saliency based on multiscale deep features. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, 5455–5463 DOI:10.1109/cvpr.2015.7299184


Yu Y L, Gu J, Mann G K I, Gosine R G. Development and evaluation of object-based visual attention for automatic perception of robots. IEEE Transactions on Automation Science and Engineering, 2013, 10(2): 365–379 DOI:10.1109/tase.2012.2214772


Bansal A, Ma S, Ramanan D, Sheikh Y. Recycle-GAN: Unsupervised Video Retargeting. 2018


Li B, Lin C W, Shi B X, Huang T J, Gao W, Kuo C C J. Depth-aware stereo video retargeting. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 6517–6525 DOI:10.1109/cvpr.2018.00682


Lei J, Luan Q, Song X H, Liu X, Tao D P, Song M L. Action parsing-driven video summarization based on reinforcement learning. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(7): 2126–2137 DOI:10.1109/tcsvt.2018.2860797


Sitzmann V, Serrano A, Pavel A, Agrawala M, Gutierrez D, Masia B, Wetzstein G. How do people explore virtual environments? 2016


Rai Y, Gutiérrez J, le Callet P. A dataset of head and eye movements for 360 degree images. MMSys'17: Proceedings of the 8th ACM on Multimedia Systems Conference. 2017, 205–210


Jiang H Z, Wang J D, Yuan Z J, Wu Y, Zheng N N, Li S P. Salient object detection: a discriminative regional feature integration approach. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA, IEEE, 2013, 2083–2090 DOI:10.1109/cvpr.2013.271


Tong N, Lu H C, Ruan X, Yang M H. Salient object detection via bootstrap learning. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA, IEEE, 2015, 1884–1892 DOI:10.1109/cvpr.2015.7298798


Li G B, Yu Y Z. Deep contrast learning for salient object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, IEEE, 2016, 478–487 DOI:10.1109/cvpr.2016.58


Wang L J, Lu H C, Wang Y F, Feng M Y, Wang D, Yin B C, Ruan X. Learning to detect salient objects with image-level supervision. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, IEEE, 2017, 3796–3805 DOI:10.1109/cvpr.2017.404


Zhang X N, Wang T T, Qi J Q, Lu H C, Wang G. Progressive attention guided recurrent network for salient object detection. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 714–722 DOI:10.1109/cvpr.2018.00081


Wang W G, Shen J B, Dong X P, Borji A, Yang R G. Inferring salient objects from human fixations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 1913–1927 DOI:10.1109/tpami.2019.2905607


Lin S S, Lin C H, Yeh I C, Chang S H, Yeh C K, Lee T Y. Content-aware video retargeting using object-preserving warping. IEEE Transactions on Visualization and Computer Graphics, 2013, 19(10): 1677–1686 DOI:10.1109/tvcg.2013.75


Zhang J Y, Li S W, Kuo C C J. Compressed-domain video retargeting. IEEE Transactions on Image Processing, 2014, 23(2): 797–809 DOI:10.1109/tip.2013.2294541


Li B, Duan L Y, Wang J Q, Ji R R, Lin C W, Gao W. Spatiotemporal grid flow for video retargeting. IEEE Transactions on Image Processing, 2014, 23(4): 1615–1628 DOI:10.1109/tip.2014.2305843


Kim D, Woo S, Lee J Y, Kweon I S. Deep video inpainting. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 5785–5794 DOI:10.1109/cvpr.2019.00594


Khan S, Muhammad K, Mumtaz S, Baik S W, de Albuquerque V H C. Energy-efficient deep CNN for smoke detection in foggy IoT environment. IEEE Internet of Things Journal, 2019, 6(6): 9237–9245 DOI:10.1109/jiot.2019.2896120


Sajjad M, Khan S, Muhammad K, Wu W Q, Ullah A, Baik S W. Multi-grade brain tumor classification using deep CNN with extensive data augmentation. Journal of Computational Science, 2019, 30: 174–182 DOI:10.1016/j.jocs.2018.12.003


Hussain T, Muhammad K, Ser J D, Baik S W, de Albuquerque V H C. Intelligent embedded vision for summarization of multiview videos in IIoT. IEEE Transactions on Industrial Informatics, 2020, 16(4): 2592–2602 DOI:10.1109/tii.2019.2937905


Thomas S S, Gupta S, Subramanian V K. Perceptual video summarization—A new framework for video summarization. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 27(8): 1790–1802 DOI:10.1109/tcsvt.2016.2556558


Zhang Y, Zimmermann R. Efficient summarization from multiple georeferenced user-generated videos. IEEE Transactions on Multimedia, 2016, 18(3): 418–431 DOI:10.1109/tmm.2016.2520827


Drakopoulos P, Koulieris G A, Mania K. Eye tracking interaction on unmodified mobile VR headsets using the selfie camera. ACM Transactions on Applied Perception, 2021, 18(3): 1–20 DOI:10.1145/3456875


Hu H N, Lin Y C, Liu M Y, Cheng H T, Chang Y J, Sun M. Deep 360 pilot: learning a deep agent for piloting through 360° sports videos. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, IEEE, 2017, 1396–1405 DOI:10.1109/cvpr.2017.153


Xu Y Y, Dong Y B, Wu J R, Sun Z Z, Shi Z R, Yu J Y, Gao S H. Gaze prediction in dynamic 360° immersive videos. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 5333–5342 DOI:10.1109/cvpr.2018.00559


Chen X W, Kasgari A T Z, Saad W. Deep learning for content-based personalized viewport prediction of 360-degree VR videos. IEEE Networking Letters, 2020, 2(2): 81–84 DOI:10.1109/lnet.2020.2977124


Li C, Xu M, Jiang L, Zhang S Y, Tao X M. Viewport proposal CNN for 360° video quality assessment. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 10169–10178 DOI:10.1109/cvpr.2019.01042


Hosu V, Goldlücke B, Saupe D. Effective aesthetics prediction with multi-level spatially pooled features. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 9367–9375 DOI:10.1109/cvpr.2019.00960