Bidirectional adaptive feature fusion for remote sensing scene classification


Accepted Manuscript

Bidirectional Adaptive Feature Fusion for Remote Sensing Scene Classification
Xiaoqiang Lu, Weijun Ji, Xuelong Li, Xiangtao Zheng

PII: S0925-2312(18)30947-0
DOI: https://doi.org/10.1016/j.neucom.2018.03.076
Reference: NEUCOM 19861

To appear in: Neurocomputing

Received date: 15 November 2017
Revised date: 27 February 2018
Accepted date: 8 March 2018

Please cite this article as: Xiaoqiang Lu, Weijun Ji, Xuelong Li, Xiangtao Zheng, Bidirectional Adaptive Feature Fusion for Remote Sensing Scene Classification, Neurocomputing (2018), doi: https://doi.org/10.1016/j.neucom.2018.03.076

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Bidirectional Adaptive Feature Fusion for Remote Sensing Scene Classification


Xiaoqiang Lu^a,∗, Weijun Ji^a,b, Xuelong Li^a,b, Xiangtao Zheng^a

a Center for OPTical IMagery Analysis and Learning (OPTIMAL), Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, Shaanxi, P. R. China.
b University of Chinese Academy of Sciences, 19A Yuquanlu, Beijing, 100049, P. R. China.


Abstract

Scene classification has become an effective way to interpret High Spatial Resolution (HSR) remote sensing images. Recently, Convolutional Neural Networks (CNNs) have been found to be excellent for scene classification. However, using deep models directly as feature extractors on aerial images is not appropriate, because the extracted deep features cannot capture the spatial scale variability and rotation variability of HSR remote sensing images. To relieve this limitation, a bidirectional adaptive feature fusion strategy is investigated for remote sensing scene classification. The deep learning feature and the SIFT feature are fused together to obtain a discriminative image representation. The fused feature not only describes the scenes effectively by employing the deep learning feature but also overcomes the scale and rotation variability with the SIFT feature. By fusing both the SIFT feature and the global CNN feature, our method achieves state-of-the-art scene classification performance on the UCM, Sydney, and AID datasets.

∗ Corresponding author. Email address: [email protected] (Xiaoqiang Lu)

Preprint submitted to Elsevier

August 15, 2018


Keywords: Bidirectional Adaptive Feature Fusion, High Spatial Resolution Remote Sensing Images, Scene Classification.


1. Introduction

With the ongoing development of satellite sensors and remote sensing techniques, huge quantities of High Spatial Resolution (HSR) remote sensing images are available [1, 2, 3, 4, 5]. These images often contain abundant spatial and semantic information and are widely used in both military and civilian fields [6, 7, 8, 9, 10, 11]. To make full use of these images, scene classification is a necessary precedent procedure, and it forms the basis of land-use object detection and image understanding. Scene classification is so important that it has attracted much attention in the remote sensing field [12, 13, 14, 15]. Generally speaking, remote sensing image scene classification is the task of automatically labeling an aerial image with a specific category. The most vital and challenging issue in scene classification is to develop an effective holistic representation that directly models an image scene.


To develop an effective holistic representation, many methods have been proposed in recent years. Early on, a bottom-up scheme was proposed to model an HSR remote sensing image scene through three "pixel-region-scene" steps. To express the scene further, many researchers tried to represent the scene directly without classifying pixels and regions. For example, Bag of Visual Words (BoVW) [16] and several extensions of BoVW were proposed to improve classification accuracy. These methods simply count the occurrences of low-level features (e.g., color, texture, and structure features) in an image and assign a semantic label to the image according to its content. Although the BoVW approach is efficient, a large semantic gap between the low-level features and the high-level semantic meanings still exists, which severely limits its ability to handle complex image scenes. Since traditional BoVW-based methods rely on low-level features, they cannot precisely represent the semantics of images because of this semantic gap. To bridge the gap, some authors have applied Latent Dirichlet Allocation (LDA) [17], a hierarchical probabilistic model, for unsupervised learning of the topic allocation based on the visual word distribution. Under these models, each image is represented as a finite mixture of an intermediate set of topics, which are expected to summarize the scene information.

Sparse coding (SC) is another landmark method. Its basic idea is to reconstruct the original signal sparsely with some fixed bases (a dictionary), while enforcing that the selected bases come from as few categories as possible. In [18], a two-layer sparse coding method was proposed to address remote sensing image classification. Sheng et al. [19] proposed a method that uses sparse coding to generate three mid-level representations based on low-level representations (SIFT, local ternary pattern histogram Fourier, and color histogram), respectively. Recently, another unsupervised dictionary learning method was proposed in [20], which achieves favorable performance in remote sensing image classification. All these methods achieve high performance, but they cannot satisfy the demand of learning hierarchical internal feature representations from image data sets. One main reason is that these methods may lack the flexibility and adaptivity to deal with different scenes [21].

Recently, it has been proved that deep learning methods can adaptively learn image features suitable for a specific scene classification task and achieve

Figure 1: The flowchart of the proposed method. (1) shows the process of extracting features, (2) shows the process of feature normalizing, and (3) shows the process of the bidirectional feature fusion.

far better classification performance than traditional scene classification methods [22, 23, 24]. However, two major issues seriously limit the use of deep learning methods. 1) Deep learning methods need large training sets and are time consuming to train. Many works have demonstrated that existing CNN models pretrained on large datasets such as ImageNet can be transferred to other recognition tasks [24, 25]. Nogueira et al. [24] illustrated that pretrained CNNs used as feature extractors achieve better performance than fully training a new CNN on three remote sensing scene datasets. However, the pretrained models take little account of the characteristics of HSR remote sensing images [26], which causes the second problem. 2) The spatial scale variability and rotation variability of HSR remote sensing images cannot be expressed precisely by a pretrained model. Remote sensing images are taken from the upper airspace, so the scenes in HSR images appear in many different orientations. At the same time, the distance between sensors and subjects varies widely, which makes the scale of HSR images changeable. Examples of the spatial scale variability and rotation variability of remote sensing images are shown in Figure 2.

To relieve the above issues, we try to learn a scale- and rotation-invariant image representation from remote sensing images. In this paper, a simple but effective way is proposed to acquire a powerful image representation: both the deep feature and the SIFT feature are fully exploited through a feature fusion strategy. Many works have shown that the SIFT feature performs well in overcoming scale variability and rotation variability [27, 28], so the SIFT feature is used in this paper to increase scale and rotation invariance. However, it is difficult to fuse the two features directly because of the following two serious issues: 1) the dimensions of the two fusion


features must be kept the same, and 2) the mutual complement between the deep feature and the SIFT feature must be exploited to improve scale and rotation invariance. To relieve the first issue, a feature normalizing strategy is proposed to keep a uniform length for the deep feature and the SIFT feature. To relieve the second issue, a bidirectional adaptive feature fusion strategy is used to obtain an effective fused feature. The relationship among the fused feature, the deep feature, and the SIFT feature is learned adaptively during training. We first use the deep feature to guide the feature fusion; then the SIFT feature switches to guide the fusion. These two steps yield two fusion features, which are then fused into the final fusion feature.

All in all, the proposed method can be divided into three steps. First, the deep features and the SIFT features are extracted by a CNN and SIFT filters, respectively. Second, the two features are normalized by a normalization layer to obtain the same dimension. Finally, the normalized features are adaptively assigned optimal weights to obtain the fused feature. Specifically, the fusion strategy assigns weights to the confidence scores from the deep feature and the SIFT feature; the optimal weight of a feature is learned by the model and optimized by Back Propagation Through Time (BPTT). With the help of this adaptive feature fusion strategy, the proposed method obtains a discriminative representation for remote sensing scene classification. The flowchart of the proposed method is shown in Figure 1.


In general, the major contributions of this paper are as follows:

1. For HSR image scene classification, deep learning features are sensitive to scale and rotation variation. To address this problem, deep learning features and SIFT features are jointly exploited in remote sensing scene classification.


Figure 2: Examples of the spatial scale variability and rotation variability of HSR remote sensing images.


2. A bidirectional adaptive feature fusion strategy is investigated to take full advantage of the fused features. The relationship between them is learned by the bidirectional adaptive feature fusion strategy.

3. A feature normalizing strategy is proposed to keep a uniform length for the deep feature and the SIFT feature, which ensures that the feature fusion proceeds smoothly.

The rest of this paper is organized as follows. In Section 2, related work on HSR scene classification is reviewed. Section 3 gives a detailed description of our proposed method. Experiments on three datasets are presented in Section 4. In the last section, we conclude this paper.

2. Related Work


In this section we review previous work on remote sensing scene classification and feature fusion. Recently, several methods have been proposed for remote sensing scene classification, many of them based on deep learning [29, 30, 26, 31, 24, 32]. The deep learning method has become the mainstream approach because it uses a multi-stage global feature learning strategy to adaptively learn image features and casts aerial scene classification as an end-to-end problem. To further improve classification accuracy, many researchers have proposed different methods to overcome the issue that pretrained models take little account of the characteristics of HSR remote sensing images. These methods can be divided into three main groups: methods based on the dataset, methods based on the network construction, and methods based on the classifier.

Some methods address the problem from the dataset side: they enlarge the datasets to increase the scale- and rotation-invariant information. Liu


et al. [33] added a random number α, obeying a normal or uniform distribution, to the raw input image, so that the scale of the cropped patch is α relative to the raw input image; the cropped patch is then stretched back to the same scale. This can be regarded as a data augmentation strategy for remote sensing image classification, which can be combined with other augmentation methods. At the same time, it adds structured noise to the input, which increases the scale and rotation variability information in the raw input image. Cheng et al. [34, 35] rotated the original data to enlarge the dataset and increase the rotation information: the raw input image is rotated in steps of 90° from 0° to 360°, and then five n × n patches (the four corner patches and the center patch) as well as their horizontal reflections are extracted from each rotated image. This strategy not only increases the rotation-invariant information but also increases the size of the dataset. Li et al. [36] proposed a multiscale deep feature learning method: the original remote sensing images are warped into multiple different scales, and the multiscale images are fed into their corresponding spatial pyramid pooling nets to extract multiscale deep features; a multiple kernel learning method is developed to automatically learn the optimal combination of such features. This strategy increases the scale-invariant information prominently. Another mainstream line of methods addressing the problem focuses on the deep

information prominently. Another mainstream methods used to address the problem focused on the deep

learning architecture. They change the structure of the neural networks. Some of

AC

them change the pooling layer, others change the neural networks from the fully connection layer. Different layers are changed with different methods to increase the scale and rotation invariance information. Cheng et al. [34] proposed a novel

and effective approach to learn a Rotation-Invariant CNN (RICNN) model for ad-

9

ACCEPTED MANUSCRIPT

vancing the performance of scene classification, which is achieved by introducing and learning a new rotation-invariant layer on the basis of the existing CNN archi-

CR IP T

tectures. The RICNN model is trained by optimizing a new objective function via imposing a regularization constraint, which explicitly enforces the feature repre-

sentations of the training samples before and after rotating to be mapped close to each other, hence achieving rotation invariance.
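The rotation-based augmentation of Cheng et al. described above (90° rotations, the four corner patches plus the center patch, and horizontal reflections) can be sketched in a few lines; this is a toy numpy illustration, with a function name and layout of our own, not code from the cited works:

```python
import numpy as np

def rotations_and_crops(img, n):
    """Rotate the image in 90-degree steps, then take the four corner n x n
    patches, the center patch, and their horizontal reflections."""
    patches = []
    for k in range(4):                        # 0, 90, 180, 270 degrees
        rot = np.rot90(img, k)
        H, W = rot.shape[:2]
        corners = [rot[:n, :n], rot[:n, W - n:], rot[H - n:, :n], rot[H - n:, W - n:]]
        center = rot[(H - n) // 2:(H - n) // 2 + n, (W - n) // 2:(W - n) // 2 + n]
        for p in corners + [center]:
            patches.append(p)
            patches.append(np.fliplr(p))      # horizontal reflection
    return patches

out = rotations_and_crops(np.zeros((8, 8)), 4)
print(len(out))  # 4 rotations * 5 patches * 2 reflections = 40
```

Each original image thus yields 40 training patches, which illustrates how the strategy enlarges the dataset.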

Moreover, some researchers have focused on the classifier, fusing different neural networks or different features together; fusing different information increases the scale- and rotation-invariant information. Zhang et al. [37] proposed a gradient boosting random convolutional network (GBRCN) framework for scene classification, which can effectively combine many deep neural networks. The basic neural networks are combined to fuse the scale- and rotation-invariant information into the deep model.

However, the methods described above face some problems: enlarging the datasets risks overfitting and makes training time consuming, while changing the model structure likewise requires training a new model. To avoid these problems, we propose a new solution.


In this paper, we address the problem through feature fusion. The objective of feature fusion is to reduce the redundancy, uncertainty, and ambiguity of signatures; under these conditions, the fused feature should enable better classification performance. Two approaches to feature fusion can be summarized: the static approach and the dynamic approach.

Static approach: this approach is based on simple operators such as sum, concatenation, and average. The descriptors are merged into a single vector without any learning over the feature vectors. The sum operator adds the features element-wise, keeping the dimension of the original features. The average operator takes the element-wise mean of the input features. The concatenation operator concatenates the features, so the final dimension is the sum of the input dimensions.
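A minimal numpy sketch of the three static operators described above:

```python
import numpy as np

# Two toy descriptors of equal length (sum/average require matching dimensions).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

fused_sum = a + b                    # element-wise sum: same dimension as the inputs
fused_avg = (a + b) / 2              # element-wise average: same dimension
fused_cat = np.concatenate([a, b])   # concatenation: dimensions add up

print(fused_sum, fused_avg, fused_cat.shape)
```

Note that only concatenation changes the output dimension, which is why sum and average demand inputs of equal length.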

Dynamic approach: this approach is based on dimensionality reduction, whose purpose is to map the data onto a low-dimensional space. Several such methods have been proposed for various learning and data mining problems, e.g., PCA, LDA, ICA, and NMF. PCA seeks the representation that minimizes the mean square error between the representation and the original data; it is derived from the eigenvectors (the principal components) corresponding to the largest eigenvalues of the covariance matrix over the data of all classes. Deep neural networks have also been used for dimensionality reduction, such as the autoencoder, which uses the middle layer of a network trained to reproduce its input as the feature representation; this can yield a more effective feature representation.
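As a concrete sketch of the dynamic approach, here is a minimal PCA reduction in numpy (eigendecomposition of the covariance matrix, keeping the top-k components); the helper name is ours:

```python
import numpy as np

def pca_reduce(X, k):
    """Project centered data onto the eigenvectors of the covariance matrix
    with the largest eigenvalues. X has shape (n_samples, dim)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)                   # ascending order
    components = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k principal components
    return Xc @ components                                   # (n_samples, k)

Z = pca_reduce(np.random.default_rng(0).normal(size=(100, 10)), 3)
print(Z.shape)  # (100, 3)
```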


There has been a long line of previous work incorporating the idea of fusion for scene classification. For example, Guo et al. [19] designed Sparse Coding based on Multiple Feature Fusion (SCMF) for HSR image scene classification. SCMF sets the fused result as the concatenation of the probability images obtained from the sparse codes of SIFT, the local ternary pattern histogram Fourier, and the color histogram features. Zheng et al. [38] also used four features and, for each feature, concatenated the vectors quantized by k-means clustering to form a vocabulary. All these fusion methods proposed for HSR image scene classification obtained good results to some degree. Nevertheless, they were limited because the fused features were not effective enough to represent the scene. In this paper, the proposed strategy fuses deep learning features and SIFT features together to achieve comparable performance.

3. Proposed Method

In this section, we present the details of the proposed method for remote sensing scene classification, which contains three main parts: feature extraction by a deep convolutional neural network together with SIFT feature extraction; feature normalizing using the Fisher Vector (FV) and an embedding method; and feature fusion by the proposed bidirectional adaptive feature fusion strategy. Each step is introduced in detail, and the whole procedure is illustrated in Figure 1. Besides, the inference procedure and some implementation issues are also discussed.

3.1. Feature Extraction


This section is divided into two parts: deep feature extraction and SIFT feature extraction. Deep features are extremely effective for scene classification and are the guarantee of classification accuracy. Considering the limited training data for satellite images and the high computational complexity of tuning convolutional neural networks, we use pre-trained convolutional neural networks. [21] compares three representative high-level deep-learning scene classification methods; the comparison shows that VGG-VD-16 obtains the best result.

Deep feature extraction: VGG gives a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows a significant improvement in accuracy and generalises well to a wide range of tasks and datasets. In our work, we use one of its best-performing models, VGG-VD-16, because of its simpler architecture and slightly better results. It is composed of 13 convolutional layers followed by 3 fully connected layers, thus 16 layers in total. The convolution stride is fixed to 1 pixel, and the spatial padding of each convolutional layer input is chosen so that the spatial resolution is preserved after convolution, i.e., the padding is 1 pixel for the 3 × 3 convolutional layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the convolutional layers (not every convolutional layer is followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window with stride 2. The stack of convolutional layers is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, and the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. Given an input image X, the output of the convolutional layer in the N-th block can be produced by

f_N(X) = Σ_{i=1}^{N} (X^i ∗ W_i + b_i)        (1)

where X^i is the input of the i-th layer, W_i represents the weights, b_i represents the bias, and "∗" is the convolution operation. The parameters of the model are pre-trained on approximately 1.2 million RGB images from the ImageNet ILSVRC 2012 dataset. Our experiments verify that the first fully connected layer performs best, so the first fully connected layer is chosen as the feature vector of each image [21].

The deep features extracted by the convolutional neural network can be represented by

h_i^{d,N} = f_N(I_i^d)        (2)

where h_i^{d,N} is the convolutional output of image I_i^d at the N-th layer and f_N(·) is the pre-trained deep model. The deep feature extracted by the convolutional neural network for a given scene image I_i is denoted h_i^d.
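The dimension bookkeeping above can be checked with a small sketch, assuming the standard 224 × 224 VGG input: five 2 × 2 stride-2 poolings reduce the spatial size to 7 × 7, and the first FC layer maps the flattened 512 × 7 × 7 response to the 4096-d feature used here. This is purely illustrative arithmetic, not the pre-trained model:

```python
def vgg16_first_fc_dims(input_size=224, last_channels=512, fc_dim=4096):
    """Follow the shapes through VGG-VD-16 to the first FC layer."""
    size = input_size
    for _ in range(5):       # five max-pooling stages
        size //= 2           # 224 -> 112 -> 56 -> 28 -> 14 -> 7
    flat = last_channels * size * size   # flattened conv output: 512 * 7 * 7 = 25088
    return flat, fc_dim      # fc1 is a (4096 x 25088) weight matrix

flat, feat = vgg16_first_fc_dims()
print(flat, feat)  # 25088 4096
```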

SIFT feature extraction: The SIFT feature [39] is searched over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation. The initial image is incrementally convolved with Gaussians to produce images separated by a constant factor K in scale space, and adjacent image scales are subtracted to produce the difference-of-Gaussian images. This step detects scale-invariant features across different scale spaces. The data then undergo keypoint localization, orientation assignment, and keypoint description, so that features with scale and rotation invariance are extracted. The SIFT feature describes a patch by histograms of gradients computed over a 4 × 4 spatial grid. The gradients are quantized into eight orientation bins, so the dimension of the final feature vector is 128 (4 × 4 × 8).
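A toy numpy version of such a 4 × 4 × 8 gradient-histogram descriptor illustrates the 128-dimension count; it deliberately omits SIFT's Gaussian weighting, soft binning, and normalization, and the function name is ours:

```python
import numpy as np

def sift_like_descriptor(patch):
    """Orientation histograms of image gradients over a 4x4 spatial grid,
    quantized into 8 bins -> 4 * 4 * 8 = 128 dimensions."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                          # gradient magnitude
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)     # orientation in [0, 2*pi)
    h, w = patch.shape
    hist = np.zeros((4, 4, 8))
    for i in range(4):
        for j in range(4):
            cell = (slice(i * h // 4, (i + 1) * h // 4),
                    slice(j * w // 4, (j + 1) * w // 4))
            bins = (ori[cell] * 8 / (2 * np.pi)).astype(int) % 8
            np.add.at(hist[i, j], bins.ravel(), mag[cell].ravel())
    return hist.ravel()

desc = sift_like_descriptor(np.random.default_rng(0).random((16, 16)))
print(desc.shape)  # (128,)
```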

To obtain a stable number of SIFT features, dense SIFT is adopted in this research. The input images are divided into an n × n grid of equally sized blocks, and in each block a SIFT feature is computed following the steps described above. The number of SIFT features per input image is therefore fixed at n², and each descriptor has dimension 128. The SIFT feature of the i-th input image is denoted h_i^s.

3.2. Feature Normalizing

The dimensions of the deep feature and the SIFT feature differ after feature extraction: the deep feature has dimension 4096, while the SIFT feature has dimension 128 × n² (where n² is the number of interest points). We therefore need a normalizing method to reshape the two features into the same dimension [40, 41, 42].

We first use the Fisher Vector (FV) [27], which has been verified as an effective

feature coding method, to encode the SIFT features. In essence, the representation generated by the FV encoding method is a gradient vector of the log-likelihood. By computing and concatenating the partial derivatives with respect to the means and variances of the Gaussian components, the dimension of the final feature vector is 2 × K × F (where F is the dimension of the local feature descriptors and K denotes the size of the dictionary). Let X = {x_1, x_2, ..., x_T} ∈ R^{D×T} be the set of input SIFT features, and λ = {w_i, µ_i, Σ_i | i = 1, 2, ..., K} be the parameters of the Gaussian mixture model (GMM) fitting the distribution of the SIFT features, where w_i is the mixture weight, µ_i the mean vector, and Σ_i the covariance matrix of the i-th Gaussian; K is the number of GMM components. It is assumed that the generation process of the SIFT features can be modeled by a probability density function p(x|λ) with parameters λ [28]:

G_λ(X) = (1/T) Σ_{t=1}^{T} ∇_λ log p(x_t|λ)        (3)

p(x_t|λ) = Σ_{i=1}^{K} w_i p_i(x_t|λ)        (4)

p_i(x_t|λ) = exp{−(1/2)(x_t − µ_i)^T Σ_i^{−1}(x_t − µ_i)} / ((2π)^{D/2} |Σ_i|^{1/2})        (5)

Accordingly, the Fisher vector g_λ^X of X is defined as

g_λ^X = F_λ^{−1/2} G_λ^X        (6)


where F_λ is the Fisher information matrix of the GMM λ. Following the mathematical derivations in [28], the D-dimensional gradients with respect to the mean µ_i and the standard deviation σ_i are:

g_{µ,i}^X = (1 / (T √w_i)) Σ_{t=1}^{T} γ_t(i) ((x_t − µ_i) / σ_i)        (7)

g_{σ,i}^X = (1 / (T √(2 w_i))) Σ_{t=1}^{T} γ_t(i) [((x_t − µ_i) / σ_i)² − 1]        (8)

γ_t(i) = w_i p_i(x_t|λ) / Σ_{j=1}^{K} w_j p_j(x_t|λ)        (9)

where γ_t(i) is the occupancy probability of x_t for the i-th Gaussian component, σ_i is the standard deviation, and σ_i² = diag(Σ_i). In theory the Fisher vector contains three gradients; the one with respect to the mixture weight w_i is discarded because it carries little information useful for the recognition task. Therefore, only the two gradients above are used to build the Fisher vector representation. As a result, the linear vector representation of a Fisher vector can be denoted φ(X) = {g_{µ,1}^X, g_{σ,1}^X, ..., g_{µ,K}^X, g_{σ,K}^X}, where K is the number of GMM components. We then adjust the dictionary size to control the dimension of the output feature. To match the dimension of the deep features, the dictionary size is calculated as 4096 ÷ (2 × 128) = 16. Finally, this encoding method expresses the SIFT feature more effectively.
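The FV encoding of Eqs. 3-9 can be sketched in numpy for a diagonal-covariance GMM; the helper below is our own illustration and omits the power and L2 post-normalization often applied in practice:

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """Toy Fisher Vector: X is (T, F) local descriptors; w (K,), mu (K, F),
    sigma (K, F) are diagonal-covariance GMM parameters. Returns the 2*K*F
    vector of mean and std gradients (the weight gradient is dropped)."""
    T, F = X.shape
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]    # (T, K, F)
    # Component log-densities of the diagonal Gaussians (Eq. 5).
    logp = (-0.5 * (diff ** 2).sum(-1) - np.log(sigma).sum(1)
            - 0.5 * F * np.log(2 * np.pi))                         # (T, K)
    # Occupancy probabilities gamma_t(i) (Eq. 9), computed stably.
    weighted = np.log(w)[None, :] + logp
    gamma = np.exp(weighted - weighted.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)                           # (T, K)
    # Gradients w.r.t. means and standard deviations (Eqs. 7-8).
    g_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w))[:, None]
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w))[:, None]
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])

rng = np.random.default_rng(0)
fv = fisher_vector(rng.normal(size=(50, 128)), np.full(16, 1 / 16),
                   rng.normal(size=(16, 128)), np.ones((16, 128)))
print(fv.shape)  # (4096,)
```

With K = 16 components and F = 128-d SIFT descriptors, the FV has 2 × 16 × 128 = 4096 dimensions, matching the deep feature as computed above.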

The deep feature and the SIFT feature are then brought to the same dimension by a normalizing layer, which is composed of two fully connected layers. The deep feature O_d(x_i) and the SIFT feature O_s(x_i) pass through the network respectively. W_a and W_b are weights trained together with the feature fusion network. The dimensions and the number of the fully connected layers were determined by experiment: the number of fully connected layers is set to two, with dimensions 4096 and 2048. The parameters of the fully connected layers are trained together with the fusion strategy proposed below.

O_a(x_i) = σ(W_a O_{d/s}(x_i) + B_a)        (10)

O_b(x_i) = σ(W_b O_a(x_i) + B_b)        (11)

3.3. Feature Fusion


We propose a novel bidirectional adaptive feature fusion method inspired by recurrent neural networks [43, 44]. The process of feature fusion can be divided into three procedures. First, the input features pass through the primary fusion node to obtain a primary fusion feature. Then the primary feature and the first input feature are fused with different weights to obtain the intermediate feature. After the two intermediate features are obtained from the two directions, they are summed with different weights, plus a bias, to obtain the final fusion feature. The reset ratio and the update ratio must be calculated before the feature fusion. The details of the feature fusion are described below.


In the forward direction, the framework of feature fusion is shown in Fig. 3. There are two input nodes in the framework: the first input feature is the deep learning feature, marked h_f^d, and the second input feature is the SIFT feature, marked h_f^s. The reset ratio and the update ratio are calculated by Eq. 12

Figure 3: The details of the feature fusion strategy. The input features are the embedded deep feature and the embedded SIFT feature; the output is the label of the image. h_f^d is the deep learning feature used in the forward feature fusion step and h_b^d is the deep learning feature used in the backward step; similarly, h_f^s is the SIFT feature used in the forward step and h_b^s is the SIFT feature used in the backward step. The blue lines denote the forward feature fusion and the red lines the backward feature fusion; together, the two steps constitute the bidirectional feature fusion strategy.

and Eq. 13. The primary fusion feature, marked h_f^p, is calculated by Eq. 14. The primary fusion feature and the deep learning feature are then used as the two input features to calculate the intermediate fusion feature (marked h_f^i) by Eq. 15:

z_f = sigmoid(W_z h_f^s + U_z h_f^d)        (12)

r_f = sigmoid(W_r h_f^s + U_r h_f^d)        (13)

h_f^p = tanh(W h_f^s + r_f ◦ U h_f^d)        (14)

h_f^i = z_f ∗ h_f^d + (1 − z_f) ∗ h_f^p        (15)

In the backward direction, the first input feature is the SIFT feature, marked h_b^s, and the second input feature is the deep learning feature, marked h_b^d. The reset ratio and the update ratio are calculated by Eq. 16 and Eq. 17. The primary fusion feature, marked h_b^p, is calculated by Eq. 18. The primary fusion feature and the SIFT feature are then used as the two input features to calculate the intermediate fusion feature (marked h_b^i) by Eq. 19:

z_b = sigmoid(W_z h_b^d + U_z h_b^s)        (16)

r_b = sigmoid(W_r h_b^d + U_r h_b^s)        (17)

h_b^p = tanh(W h_b^d + r_b ◦ U h_b^s)        (18)

h_b^i = z_b ∗ h_b^s + (1 − z_b) ∗ h_b^p        (19)

In the bidirectional feature fusion, h_f^i and h_b^i are combined to obtain the final fusion feature y, as shown in Eq. 20:

y = W_f h_f^i + W_b h_b^i + b_y        (20)


In all the formulas from Eq. 12 to Eq. 20, W_z, W_r, U_z, U_r, W, U, W_f, W_b, and b_y are weight parameters to be learned during training. After the final fusion feature is obtained, it passes through a softmax layer to classify the HSR images into different categories. The model is trained by the BPTT method. The proposed fusion method takes full advantage of the deep learning feature and the SIFT feature to achieve comparable performance in HSR image scene classification.
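The fusion of Eqs. 12-20 can be sketched in numpy as follows. This is a toy illustration: the dimension is small, the two directions share one set of weights here for brevity (the paper learns its parameters via BPTT), and the random matrices stand in for trained weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h1, h2, P):
    """One fusion direction: h1 sits inside the W-terms, h2 inside the
    U-terms (Eqs. 12-14); the last line gates between h2 and the primary
    fusion feature (Eq. 15 / Eq. 19)."""
    z = sigmoid(P["Wz"] @ h1 + P["Uz"] @ h2)          # update ratio
    r = sigmoid(P["Wr"] @ h1 + P["Ur"] @ h2)          # reset ratio
    h_p = np.tanh(P["W"] @ h1 + r * (P["U"] @ h2))    # primary fusion feature
    return z * h2 + (1.0 - z) * h_p                   # intermediate fusion feature

d = 8  # toy dimension for illustration
rng = np.random.default_rng(0)
P = {k: rng.normal(scale=0.1, size=(d, d)) for k in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
h_d = rng.normal(size=d)  # deep feature after the normalizing layer
h_s = rng.normal(size=d)  # SIFT feature after the normalizing layer

h_if = gated_fusion(h_s, h_d, P)  # forward direction, Eqs. 12-15
h_ib = gated_fusion(h_d, h_s, P)  # backward direction, Eqs. 16-19

# Eq. 20: weighted sum of the two intermediate features plus a bias.
Wf = rng.normal(scale=0.1, size=(d, d))
Wb = rng.normal(scale=0.1, size=(d, d))
by = np.zeros(d)
y = Wf @ h_if + Wb @ h_ib + by
print(y.shape)  # (8,)
```

In the real model, y would then feed the softmax classifier and all the matrices would be trained jointly.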


4. Experiments

To evaluate the effectiveness of the proposed method for scene classification, we perform experiments on three datasets: the UC Merced Land Use dataset [16], the Sydney dataset, and the AID dataset [21]. In the rest of this section, a brief introduction to the datasets is given in Section 4.1. In Section 4.2, the evaluation protocols are introduced. Section 4.3 then gives the experimental results and analysis on the UC Merced Land Use dataset. An additional experiment on the AID dataset is reported in Section 4.4, and Section 4.5 gives the experimental results and analysis on the Sydney dataset. In addition, to evaluate the fusion strategy, we perform experiments on both the single features (deep feature and SIFT feature) and the fused feature. The details of the experiments and the results are described in the following sections.


4.1. Description of the datasets

Experiments are conducted on three public datasets: the UC Merced Land Use dataset, the Sydney dataset and the AID dataset.

UC Merced Land Use dataset: It contains 21 scene categories, and each category contains 100 images. These images are manually extracted from large images


of the USGS National Map Urban Area Imagery collection. Each scene image consists of 256 × 256 pixels, with a spatial resolution of one foot per pixel. Example images are shown in Figure 4. As can be seen, there exist many similarities between different aerial scenes, such as (9) Freeway and (15) Overpass, which makes them very difficult to separate. In the experiment, we randomly choose 80 images from each category as the training set, and the remaining images form the testing set. The similarity between some of the scenes makes the scene classification challenging.

Figure 4: Example images associated with 21 land-use categories in the UC-Merced data set: (1) agricultural, (2) airplane, (3) baseball diamond, (4) beach, (5) buildings, (6) chaparral, (7) dense residential, (8) forest, (9) freeway, (10) golf course, (11) harbor, (12) intersection, (13) medium residential, (14) mobile home park, (15) overpass, (16) parking lot, (17) river, (18) runway, (19) sparse residential, (20) storage tanks and (21) tennis court.

Sydney dataset: It comes from a large high-resolution remote sensing image of 18,000 × 14,000 pixels, acquired from Google Earth over Sydney. The pixel resolution is about 0.5 m. The whole image is cut into 613


non-overlapping patches belonging to 7 categories. The subimages are 500 × 500 pixels, and each subimage is assumed to contain one certain scene. Figure 5 shows examples of each class, and Table 1 shows the number of images in each class.

Figure 5: (a) Whole image for scene classification. (b) Example images associated with the seven land-use categories from the image: (1) residential, (2) airport, (3) meadow, (4) river, (5) ocean, (6) industrial, (7) runway.

AID dataset [21]: It is a new dataset containing 10,000 images, which makes it the largest dataset of high spatial resolution remote sensing images. Thirty scene categories are included in this dataset, and the number of images per scene varies from 220 to 420. Table 2 shows the number of images in each class. Each scene image consists of 600 × 600 pixels. We follow the parameter settings in [21]: fifty percent of the images of each category are chosen as the training set, and the rest are used as the testing set. The dataset is known for its high intra-class diversity and low inter-class dissimilarity. Example images are shown in Figure 6.


Figure 6: Example images associated with 30 land-use categories in the AID data set: (1) airport, (2) bare land, (3) baseball field, (4) beach, (5) bridge, (6) center, (7) church, (8) commercial, (9) dense residential, (10) desert, (11) farmland, (12) forest, (13) industrial, (14) meadow, (15) medium residential, (16) mountain, (17) park, (18) parking, (19) playground, (20) pond, (21) port, (22) railway station, (23) resort, (24) river, (25) school, (26) sparse residential, (27) square, (28) stadium, (29) storage tanks and (30) viaduct.


Table 1: Seven categories and corresponding numbers in the Sydney dataset

Category      Number
Resident      242
Airport       22
Vegetation    50
River         45
Sea           92
White house   96
Runway        66

4.2. Evaluation protocols

To compare the classification quantitatively, we compute two commonly used measures: overall accuracy (OA) and the confusion matrix. OA is defined as the number of correctly predicted images divided by the total number of predicted images; it is a direct measure of the classification performance on the whole dataset. The confusion matrix is a table layout that allows direct visualization of the performance on each class: each row represents the instances of an actual class and each column represents the instances of a predicted class, so each item xij gives the proportion of images truly belonging to the i-th class that are predicted as the j-th class.

4.3. Experiment on UC Merced dataset

To evaluate the proposed method, we first conduct experiments on the UC Merced

dataset, which is an acknowledged challenging dataset in the HSR image scene classification area.
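As a concrete illustration of the evaluation protocols of Section 4.2, OA and a row-normalized confusion matrix can be computed as follows (a self-contained sketch; the toy labels are ours, not results from the paper):

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    # OA = number of correctly predicted images / total number of images
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def confusion_matrix(y_true, y_pred, n_classes):
    # row = actual class, column = predicted class; entries are proportions
    m = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    row_sums = m.sum(axis=1, keepdims=True)
    return m / np.maximum(row_sums, 1)  # avoid division by zero for empty classes

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(overall_accuracy(y_true, y_pred))  # 4 of 6 images are correct
cm = confusion_matrix(y_true, y_pred, 3)
print(cm[1, 1])                          # class 1 is fully correct: 1.0
```

The diagonal of the matrix gives the per-class accuracies, which is exactly what Figures 7 and 8 visualize.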


Table 2: Thirty categories and corresponding numbers in the AID dataset

Category             Number    Category             Number
airport              360       bare land            310
baseball field       220       beach                400
bridge               360       center               260
church               240       commercial           350
dense residential    410       desert               300
farmland             370       forest               250
industrial           390       meadow               280
medium residential   290       mountain             340
park                 350       parking              390
playground           370       pond                 420
port                 380       railway station      260
resort               290       river                410
school               300       sparse residential   300
square               330       stadium              290
storage tanks        360       viaduct              420

We compare our method with the state-of-the-art methods in scene classification. According to the investigation of the related works in Section 2,


we compare our method with the traditional and classical methods, including Bag of Visual Words (BOVW) [16], probabilistic Latent Semantic Analysis (pLSA) [45], Latent Dirichlet Allocation (LDA) [17], Spatial Pyramid Matching (SPM) [46], the extended Spatial Pyramid Co-occurrence Kernel (SPCK) [47], SIFT features coded by Sparse Coding (SIFT+SC) [48] and Saliency-Guided Unsupervised Feature Learning (S-UFL) [49], as well as some recent deep learning based methods, including the Gradient Boosting Random Convolutional Network (GBRCN) [37], the Semantic Allocation Level Probabilistic Topic Model (SAL-PTM) [50] and the Scale Random Stretched Convolutional Neural Network (SRSCNN) [33]. All these methods have been introduced in detail in the related work section.

We test the proposed method using five-fold cross validation and take the average as the final result; the results of the other compared methods are taken from their own papers. The results are shown in Table 3. From Table 3, it can be concluded that the proposed feature fusion strategy, which fuses the deep feature and the SIFT feature together, performs better than the other methods. Compared with the middle-level methods, the higher accuracies imply that the deep learning method learns more powerful features. Besides, the combination of scale-invariant and rotation-invariant information further improves the performance. Specifically, the proposed method boosts the performance to 95.48%. To the best of our knowledge, this result is the best on this dataset, which adequately shows the superiority of our proposed method. The proposed method also obtains a better result than the original CNN, so it can be concluded that the question raised in Section 1 has been addressed. Compared with GBRCN, which combines many


CNNs for scene classification, our method is 0.91% better. This indicates that sufficiently capturing the scale-invariant information in remote sensing scene images can improve the classification performance. In addition, it is worth noting that our method obtains a higher OA (95.48%) than SRSCNN, whose OA is 95.10% and which randomly selects patches from the image and stretches them to a specific scale as input to train the CNN. The experimental results indicate that supplementing the rotation information can improve the classification accuracy.

To further evaluate the proposed method, we calculate the confusion matrix on the UC Merced dataset, shown in Figure 7. Most categories are classified correctly: eleven scenes are classified 100% correctly, and nine other scenes are classified at around 95% accuracy. The scene Buildings gets 80% accuracy. The reason for misclassifying Buildings as Dense residential and Medium residential is that similar regions exist in these three scene categories; their differences are mainly the scales of the buildings, in other words, the density of the buildings. Our proposed method deliberately suppresses the influence of scale variety, so this misclassification is understandable. However, other scenes such as Freeway, which are difficult to classify with other methods, are classified correctly.

4.4. Experiment on AID dataset


The AID dataset is the newest and largest dataset of high spatial resolution remote sensing images, and it is also the most difficult to classify among the three datasets. To evaluate the proposed method on AID, the compared methods are chosen to code the SIFT feature with models such as Bag of Visual Words (BOVW) [16], probabilistic Latent Semantic Analysis (pLSA) [45],


Figure 7: The confusion matrix with the 21 land-use categories in the UC Merced dataset. (1) agricultural, (2) airplane, (3) baseball diamond, (4) beach, (5) buildings, (6) chaparral, (7) dense residential, (8) forest, (9) freeway, (10) golf course, (11) harbor, (12) intersection, (13) medium residential, (14) mobile home park, (15) overpass, (16) parking lot, (17) river, (18) runway, (19) sparse residential, (20) storage tanks and (21) tennis court.

Table 3: COMPARISON WITH THE PREVIOUS REPORTED ACCURACIES WITH THE UC MERCED DATA SET

Method      OA(%)    Method      OA(%)
BOVW        72.05    S-UFL       82.30
pLSA        80.71    SAL-PTM     88.33
SPCK++      76.05    GBRCN       94.53
LDA         81.92    SRSCNN      95.10
SPM+SIFT    82.30    CNN         93.10
SIFT+FC     81.67    PROPOSED    95.48


Spatial Pyramid Matching (SPM) [46], Vector of Locally Aggregated Descriptors (VLAD), Fisher Vector (FV) and Locality-constrained Linear Coding (LLC). The deep models are VGG-VD-16, CaffeNet [51], GoogLeNet [23] and the proposed method. Table 4 gives the comparison, which also shows the effectiveness of the proposed method.

We test the proposed method in the same way as in the UC Merced experiment: we use five-fold cross validation and take the average as the final result, and the results of the other compared methods are taken from their own papers. From Table 4, it can be concluded that the proposed feature fusion strategy, which fuses the deep feature and the SIFT feature together, performs better than the other methods. Compared with the middle-level methods, the proposed method achieves remarkably better performance. Compared with the deep learning methods, the proposed method boosts the performance to 93.56%. To the best of our knowledge, this result is the best on this dataset, which adequately shows the superiority of our proposed method. Compared with the representative deep learning methods VGG-VD-16, CaffeNet and GoogLeNet, our proposed method obtains higher accuracy. The basic model of our proposed method is VGG-VD-16, and the proposed method is 3.93% higher than it, which suggests that the harder the dataset, the larger the accuracy gain. Since the basic VGG already outperforms CaffeNet and GoogLeNet, the proposed method also obtains higher accuracy than them. The experimental results indicate that supplementing the rotation information can improve the classification accuracy.

In addition, we compare our proposed method with VGG-VD-16 by evaluating the accuracy in each class, as shown in Figure 8. From Figure 8, we observe that our proposed method obtains higher or similar accuracy compared with


Table 4: COMPARISON WITH THE FEATURE FUSION WITH THE AID DATA SET

Method       OA(%)    Method      OA(%)
BOVW         68.37    FV          78.99
LLC          63.24    pLSA        63.07
SPM          45.52    VLAD        68.96
VGG-VD-16    89.64    CaffeNet    89.53
GoogLeNet    86.39    Proposed    93.56

the VGG net. Most scenes are classified correctly: the accuracy is higher than 95% for 11 scenes, higher than 90% for 20 scenes and higher than 80% for 25 scenes. The scenes forest and meadow are classified 100% correctly, and desert, mountain, parking and viaduct reach accuracies above 98%; these scenes are relatively easy to classify. The worst-classified scenes are resort and school. The two scenes are hard to discriminate because they differ semantically but not in their appearance in the image; distinguishing them from other scenes is hard even for humans. Nevertheless, an outstanding result of the proposed method is on the commercial scene, where it is 11% higher than the original VGG net. This illustrates that scale invariance and rotation invariance play an important role in discriminating the commercial scene.

4.5. Experiment on Sydney dataset

To further evaluate the proposed method, experiments are carried out on the Sydney dataset.


Figure 8: Per-class accuracy comparison between VGG-VD-16 and our proposed method on the AID dataset. (1) airport, (2) bare land, (3) baseball field, (4) beach, (5) bridge, (6) center, (7) church, (8) commercial, (9) dense residential, (10) desert, (11) farmland, (12) forest, (13) industrial, (14) meadow, (15) medium residential, (16) mountain, (17) park, (18) parking, (19) playground, (20) pond, (21) port, (22) railway station, (23) resort, (24) river, (25) school, (26) sparse residential, (27) square, (28) stadium, (29) storage tanks and (30) viaduct.


Table 5: COMPARISON WITH THE FEATURE FUSION WITH THE SYDNEY DATA SET

Method        OA(%)
SPMK          81.82
LDA-SVM       90.00
Saliency+SC   96.83
Ours          99.05

The comparison methods are the Spatial Pyramid Matching Kernel (SPMK) based method [46], LDA+SVM [52] and Saliency+SC [1]. We test the proposed method by using five-fold cross validation and take the average as the final result. The accuracy results are shown in Table 5: the proposed method achieves an extremely high accuracy, exceeding the traditional SPMK by a large margin. As for the smaller margins over the saliency-guided sparse coding method and LDA+SVM, the main reason is that this dataset is relatively easy to classify.
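The five-fold protocol used throughout the experiments can be sketched generically as follows. This is our own illustration, not code from the paper; `evaluate_fold` stands in for training and testing the full model on one split:

```python
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Shuffle the sample indices and split them into 5 disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), 5)

def cross_validate(n_samples, evaluate_fold):
    """Run 5 rounds, each holding out one fold for testing, and average OA."""
    folds = five_fold_indices(n_samples)
    scores = []
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        scores.append(evaluate_fold(train_idx, test_idx))
    return float(np.mean(scores))

# toy stand-in: "accuracy" is just the fraction of even indices in the test fold
mean_oa = cross_validate(100, lambda tr, te: np.mean(te % 2 == 0))
print(round(mean_oa, 2))
```

Averaging over the five held-out folds, as here, is how the single OA numbers reported in Tables 3-5 are obtained.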


4.6. Experiment on Feature Fusion

To evaluate the proposed fusion strategy, we compare the classification accuracy obtained with the single deep feature and with the single SIFT feature encoded with the Fisher vector. Moreover, to illustrate the effectiveness of the feature fusion, we compare the proposed strategy with two traditional feature fusion methods: the sum of the features and the average of the features.
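The two traditional fusion baselines compared here can be written directly. In this sketch of ours the two features are assumed to have already been mapped to the same length, as in Section 3, and the toy vectors are illustrative only:

```python
import numpy as np

def sum_fusion(h_deep, h_sift):
    # element-wise sum of the two equal-length feature vectors
    return h_deep + h_sift

def avg_fusion(h_deep, h_sift):
    # element-wise average: a fixed, non-adaptive 0.5/0.5 weighting
    return 0.5 * (h_deep + h_sift)

h_deep = np.array([1.0, 2.0, 3.0])
h_sift = np.array([3.0, 2.0, 1.0])
print(sum_fusion(h_deep, h_sift))  # [4. 4. 4.]
print(avg_fusion(h_deep, h_sift))  # [2. 2. 2.]
```

Unlike these fixed rules, the proposed gated fusion of Eqs. 12-20 learns a per-dimension mixing ratio from the data, which is why it is called adaptive.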

The results show that the single SIFT feature encoded with the Fisher vector performs worst, while the deep feature and all fused features reach classification accuracies exceeding 90%. The proposed fusion strategy obtains the best performance: the sum of the features reaches 94.01% and the average of the features reaches 93.05%, whereas the proposed strategy gives the best result among all the compared fusion


Table 6: COMPARISON WITH THE FEATURE FUSION WITH THE UC MERCED DATA SET

Method       OA(%)
FV           85.11
VGG-VD-16    93.09
SUM          94.01
AVG          93.05
PROPOSED     95.48

strategies. The final results are shown in Table 6. This experiment is conducted on the UC Merced dataset.


5. Conclusion


This paper proposed a feature fusion strategy for remote sensing scene classification. The strategy fuses the deep feature and the SIFT feature together to solve the scene classification problem. The deep feature comes from the VGG-VD-16 net, which is used as a feature extractor; in view of the limited number of labeled remote sensing images, a pre-trained deep model is used. The SIFT feature is known for its scale and rotation invariance, which can be used to overcome the scale and rotation variability of remote sensing scenes. To fuse the two features, a normalizing layer is used together with two fully connected layers to resize the two features to a common dimension. Finally, the bidirectional adaptive feature fusion strategy is proposed to fuse the two features together. The experiments on three datasets have demonstrated that the proposed method dramatically improves the classification results compared with the state of the art. In future work, our


proposed method can be applied in other fields [53, 54, 55].

Acknowledgments


This work was supported in part by the National Natural Science Foundation of China under Grant 61761130079, Grant 61472413, and Grant 61772510, in part by the Key Research Program of Frontier Sciences, CAS, under Grant QYZDY-SSW-JSC044, in part by the Young Top-Notch Talent Program of Chinese Academy of Sciences under Grant QYZDB-SSWJSC015, in part by the Open Research Fund of the State Key Laboratory of Transient Optics and Photonics, Chinese Academy of Sciences (SKLST2017010), and in part by the Xi'an Postdoctoral Innovation Base Scientific Research Project.


References


[1] F. Zhang, B. Du, L. Zhang, Saliency-guided unsupervised feature learning for scene classification, IEEE Transactions on Geoscience and Remote Sens-


ing 53 (4) (2014) 2175–2184. [2] Y. Yuan, X. Zheng, X. Lu, Discovering diverse subset for unsupervised hy-


perspectral band selection, IEEE Transactions on Image Processing 26 (1)


(2017) 51–64. doi:10.1109/TIP.2016.2617462.

[3] L. Zhang, M. Wang, R. Hong, B. C. Yin, X. Li, Large-scale aerial image categorization using a multitask topological codebook, IEEE Transactions on Cybernetics 46 (2) (2016) 535.



[4] J. Han, D. Zhang, G. Cheng, L. Guo, J. Ren, Object detection in optical remote sensing images based on weakly supervised learning and high-level


feature learning, IEEE Transactions on Geoscience and Remote Sensing 53 (6) (2015) 3325–3337.

[5] X. Xu, J. Li, X. Huang, M. D. Mura, A. Plaza, Multiple morphological component analysis based decomposition for remote sensing image classifica-

3083–3102.


tion, IEEE Transactions on Geoscience and Remote Sensing 54 (5) (2016)

[6] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, J. Ren, Effective and efficient midlevel visual elements-oriented land-use classification using vhr remote sensing images, IEEE Transactions on Geoscience and Remote Sensing

M

53 (8) (2015) 4238–4249.

[7] X. Lu, Y. Yuan, X. Zheng, Joint dictionary learning for multispectral change


detection, IEEE Transactions on Cybernetics 47 (4) (2017) 884–897. [8] L. Wang, W. Song, P. Liu, Link the remote sensing big data to the image


810.


features via wavelet transformation, Cluster Computing 19 (2) (2016) 793–

[9] C. Deng, X. Liu, C. Li, D. Tao, Active multi-kernel domain adaptation for


hyperspectral image classification, Pattern Recognition 77 (2017) 306 – 315.

[10] X. L. Xu Yang, Cheng Deng, F. Nie, New l21-norm relaxation of multi-way graph cut for clustering, in: Proc. AAAI Conference on Artificial Intelligence, accepted, 2018.



[11] R. Wang, F. Nie, W. Yu, Fast spectral clustering with anchor graph for large hyperspectral images, IEEE Geoscience and Remote Sensing Letters 14 (11)


(2017) 2003–2007. [12] B. Du, W. Xiong, J. Wu, L. Zhang, L. Zhang, D. Tao, Stacked convolutional denoising auto-encoders for feature representation, IEEE Transactions on Cybernetics 47 (4) (2017) 1017.

[13] S. Yan, X. Xu, D. Xu, S. Lin, X. Li, Image classification with densely


sampled image windows and generalized adaptive multiple kernel learning, IEEE Transactions on Cybernetics 45 (3) (2015) 395.

[14] D. Tao, L. Jin, Z. Yang, X. Li, Rank preserving sparse learning for kinect based scene classification, IEEE Transactions on Cybernetics 43 (5) (2013)

M

1406–1417.

[15] B. Du, L. Zhang, A discriminative metric learning based anomaly detec-


tion method, IEEE Transactions on Geoscience and Remote Sensing 52 (11) (2014) 6844–6857.


[16] Y. Yang, S. Newsam, Bag-of-visual-words and spatial extensions for landuse classification, in: Proceedings of Sigspatial International Conference on


Advances in Geographic Information Systems, 2010, pp. 270–279.


[17] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.

[18] D. Dai, W. Yang, Satellite image classification via two-layer sparse coding with biased image representation, IEEE Geoscience and Remote Sensing Letters 8 (1) (2011) 173–176. 36


[19] G. Sheng, W. Yang, T. Xu, H. Sun, High-resolution satellite scene classification using a sparse coding based multiple feature combination, International


Journal of Remote Sensing 33 (8) (2012) 2395–2412. [20] A. M. Cheriyadat, Unsupervised feature learning for aerial scene classification, IEEE Transactions on Geoscience and Remote Sensing 52 (1) (2013) 439–451.

AN US

[21] G. S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, X. Lu, Aid: A benchmark data set for performance evaluation of aerial scene classification, IEEE Transactions on Geoscience and Remote Sensing 55 (7) (2017) 3965– 981.

[22] M. Vakalopoulou, K. Karantzalos, N. Komodakis, N. Paragios, Building de-

M

tection in very high resolution multispectral data with deep learning features,


in: Geoscience and Remote Sensing Symposium, 2015, pp. 1873–1876. [23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Com-

PT

puter Vision and Pattern Recognition, 2015, pp. 1–9.

CE

[24] K. Nogueira, O. A. B. Penatti, J. A. D. Santos, Towards better exploiting convolutional neural networks for remote sensing scene classification, Pat-

AC

tern Recognition 61 (2017) 539–556.

[25] F. Hu, G. S. Xia, J. Hu, L. Zhang, Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery, Remote Sensing 7 (11) (2015) 14680–14707.



[26] X. Lu, X. Zheng, Y. Yuan, Remote sensing scene classification by unsupervised representation learning, IEEE Transactions on Geoscience and Remote


Sensing 55 (9) (2017) 5148–5157. [27] F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image categorization, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.

AN US

[28] F. Perronnin, J. Snchez, T. Mensink, Improving the fisher kernel for largescale image classification, in: Proceedings of European Conference on Computer Vision, 2010, pp. 119–133.

[29] M. Castelluccio, G. Poggi, C. Sansone, L. Verdoliva, Land use classification in remote sensing images by convolutional neural networks, Journal of

M

Molecular Structure Theochem 537 (1) (2015) 163–172.

ED

[30] X. Li, Q. Lu, Y. Dong, D. Tao, Sce: A manifold regularized set-covering method for data partitioning, IEEE Transactions on Neural Networks and

PT

Learning Systems PP (99) (2018) 1–14. [31] F. P. S. Luus, B. P. Salmon, F. V. D. Bergh, B. T. J. Maharaj, Multiview deep

CE

learning for land-use classification, IEEE Geoscience and Remote Sensing Letters 12 (12) (2015) 1–5.

AC

[32] Q. Zou, L. Ni, T. Zhang, Q. Wang, Deep learning based feature selection for remote sensing scene classification, IEEE Geoscience and Remote Sensing Letters 12 (11) (2015) 2321–2325.

[33] Y. Liu, Y. Zhong, F. Fei, L. Zhang, Scene semantic classification based on 38


random-scale stretched convolutional neural network for high-spatial resolution remote sensing imagery, in: Proceedings of IEEE International Geo-

CR IP T

science and Remote Sensing Symposium, 2016, pp. 763–766. [34] G. Cheng, C. Ma, P. Zhou, X. Yao, J. Han, Scene classification of high resolution remote sensing images using convolutional neural networks, in:

Proceedings of IEEE International Geoscience and Remote Sensing Sympo-

AN US

sium, 2016, pp. 767–770.

[35] G. Cheng, P. Zhou, J. Han, Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 54 (12) (2016) 7405–7415. [36] E. Li, J. Xia, P. Du, C. Lin, A. Samat, Integrating multilayer features of

M

convolutional neural networks for remote sensing scene classification, IEEE

ED

Transactions on Geoscience and Remote Sensing (99) (2017) 1–13. [37] F. Zhang, B. Du, L. Zhang, Scene classification via a gradient boosting random convolutional network framework, IEEE Transactions on Geoscience

PT

and Remote Sensing 54 (3) (2016) 1793–1802.

CE

[38] X. Zheng, X. Sun, K. Fu, H. Wang, Automatic annotation of satellite images via multifeature joint sparse coding with spatial relation constraint, IEEE

AC

Geoscience and Remote Sensing Letters 10 (4) (2012) 652–656.

[39] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.

[40] R. Wang, F. Nie, R. Hong, X. Chang, X. Yang, W. Yu, Fast and orthogonal 39


locality preserving projections for dimensionality reduction, IEEE Transactions on Image Processing 26 (10) (2017) 5019–5030.

CR IP T

[41] C. Deng, R. Ji, D. Tao, X. Gao, X. Li, Weakly supervised multi-graph learning for robust image reranking, IEEE Transactions on Multimedia 16 (3) (2014) 785–795.

[42] C. Deng, X. Tang, J. Yan, W. Liu, X. Gao, Discriminative dictionary learning

AN US

with common label alignment for cross-modal retrieval, IEEE Transactions on Multimedia 18 (2) (2016) 208–218.

[43] A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: Proceedings of IEEE International Conference on

M

Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649. [44] J. Chung, C. Gulcehre, K. H. Cho, Y. Bengio, Empirical evaluation of gated

ED

recurrent neural networks on sequence modeling, Eprint Arxiv. [45] A. Bosch, A. Zisserman, X. Munoz, Scene classification via pLSA, in: Pro-

PT

ceedings of European Conference on Computer Vision, 2006. [46] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid

CE

matching for recognizing natural scene categories, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2006, pp.

AC

2169–2178.

[47] Y. Yang, S. Newsam, Spatial pyramid co-occurrence for image classification, in: Proceedings of International Conference on Computer Vision, 2011, pp. 1465–1472. 40


[48] A. M. Cheriyadat, Unsupervised feature learning for aerial scene classification, IEEE Transactions on Geoscience and Remote Sensing 52 (1) (2014)

CR IP T

439–451. [49] F. Zhang, B. Du, L. Zhang, Saliency-guided unsupervised feature learning for scene classification, IEEE Transactions on Geoscience and Remote Sensing 53 (4) (2015) 2175–2184.

AN US

[50] Y. Zhong, Q. Zhu, L. Zhang, Scene classification based on multifeature probabilistic latent semantic analysis for high spatial resolution remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 53 (11) (2015) 6207–6222.

[51] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,

M

S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: ACM International Conference on Multimedia, 2014,

ED

pp. 675–678.

[52] C. Vaduva, I. Gavat, M. Datcu, Latent dirichlet allocation for spatial analysis

PT

of satellite images, IEEE Transactions on Geoscience and Remote Sensing

CE

51 (5) (2013) 2770–2786. [53] X. Lu, B. Wang, X. Zheng, X. Li, Exploring models and data for remote

AC

sensing image caption generation, IEEE Transactions on Geoscience and Remote Sensing (2017) 1–13doi:10.1109/TGRS.2017.2776321.

[54] J. Han, D. Zhang, G. Cheng, N. Liu, D. Xu, Advanced deep-learning techniques for salient and category-specific object detection: A survey, IEEE



Signal Processing Magazine 35 (1) (2018) 84–100. doi:10.1109/MSP. 2017.2749125.


[55] J. Han, R. Quan, D. Zhang, F. Nie, Robust object co-segmentation using background prior, IEEE Transactions on Image Processing 27 (4) (2018)


1639–1651. doi:10.1109/TIP.2017.2781424.


Xiaoqiang Lu is a Full Professor with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, Shaanxi, P. R. China. His current research interests include pattern recognition, machine learning, hyperspectral image analysis, cellular automata, and medical imaging.


Weijun Ji is a master's student with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, Shaanxi, P. R. China. His research interests include remote sensing image analysis and computer vision.


Xuelong Li is a Full Professor with the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China, and with the University of Chinese Academy of Sciences, Beijing, China.


Xiangtao Zheng received the M.Sc. and Ph.D. degrees in signal and information processing from the Chinese Academy of Sciences, Xi'an, China, in 2014 and 2017, respectively. He is currently an Assistant Professor with the Center for Optical Imagery Analysis and Learning (OPTIMAL), Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences. His research interests include computer vision and pattern recognition.
