Exploring spatial-frequency-sequential relationships for motor imagery classification with recurrent neural network


Experimental datasets setup

The performance of the algorithms was evaluated on BCI Competition IV [27] “Dataset 2a” and “Dataset 2b”. The two datasets are compared in Table 3. Figure 6 illustrates how the single-trial EEG data were extracted in “Dataset 2a” and “Dataset 2b”; the two datasets share the same procedure. In the motor imagery classification experiments, each subject sat comfortably in a soft chair facing a computer screen. Each BCI Competition IV experiment consists of the following six steps: (1) Each trial started with a warning tone. (2) Simultaneously, a fixation cross was shown on the computer screen for two seconds. (3) After two seconds, a cue in the form of an arrow was randomly shown in place of the fixation cross, and the subject started the motor imagery task corresponding to the cue. (4) After another 1.25 s, the cue reverted to the fixation cross. (5) The motor imagery task continued until the sixth second, at which time the fixation cross disappeared. (6) Finally, there was a short 1.5 s break. The signals were sampled at 250 Hz and recorded, then notch filtered at 50 Hz and band-pass filtered to 0.1–100 Hz.
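The following is a minimal sketch of the pre-processing described above, assuming the raw recording is a NumPy array of shape (n_channels, n_samples) at 250 Hz; the filter orders and the use of zero-phase filtering are our assumptions, as the paper does not specify them.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

FS = 250.0  # sampling rate (Hz)

def preprocess(raw):
    """Apply the 50 Hz notch and 0.1-100 Hz band-pass filters."""
    b_n, a_n = iirnotch(w0=50.0, Q=30.0, fs=FS)                 # power-line notch
    x = filtfilt(b_n, a_n, raw, axis=-1)                        # zero-phase filtering
    b_bp, a_bp = butter(4, [0.1, 100.0], btype="bandpass", fs=FS)
    return filtfilt(b_bp, a_bp, x, axis=-1)

raw = np.random.randn(22, 7 * int(FS))  # toy 22-channel, 7 s trial
clean = preprocess(raw)                 # same shape, filtered
```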

Fig. 6

The procedure of single-trial motor imagery in BCI Competition IV, following the six steps described in the text: warning tone, fixation cross, arrow cue, reversion to the fixation cross, motor imagery until the sixth second, and a short 1.5 s break

Table 3

Comparison of “Dataset 2a” and “Dataset 2b”

Dataset | Classes                       | Channels       | Subjects | Sessions | Trials per session
2a      | 4 (left, right, foot, tongue) | 22 EEG + 3 EOG | 9        | 2        | 288
2b      | 2 (left, right)               | 3 EEG + 3 EOG  | 9        | 5        | 160

The BCI Competition IV “Dataset 2a” comprises four classes of motor imagery EEG measurements from nine subjects: (1) left hand, (2) right hand, (3) feet, and (4) tongue. Two sessions, one for training and one for evaluation, were recorded from each subject. “Dataset 2b” comprises two classes of motor imagery EEG measurements from nine subjects: (1) left hand and (2) right hand. Five sessions, the first three for training and the last two for evaluation, were recorded from each subject. Following the extraction procedure, the time range [4 s, 6 s] was chosen for motor imagery classification because of the strong ERD/ERS phenomenon within that range [12, 44]; the slicing sketch below illustrates this extraction.
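As a minimal sketch under the paper's timing, extracting the [4 s, 6 s] window from a trial sampled at 250 Hz amounts to slicing 500 samples; the array names here are illustrative.

```python
import numpy as np

FS = 250                                  # sampling rate (Hz)
trial = np.random.randn(22, 7 * FS)       # toy 7 s trial, 22 EEG channels
epoch = trial[:, 4 * FS : 6 * FS]         # [4 s, 6 s] window -> (22, 500)
```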

The spatial-frequency features are extracted by the FB-CSP algorithm. To divide the whole band (8–30 Hz, covering the μ and β rhythms) in a way that generalizes across subjects, the optimal band width is 4 Hz, with each band overlapping the next by 2 Hz [5, 25]. The resulting division of band-pass filters is shown in Table 4. After the raw EEG signals are filtered by the optimal frequency bands, the CSP algorithm is applied to the filtered signals to obtain spatial-frequency features; a sketch of this pipeline follows Table 4. In (2) of the CSP algorithm, the parameter m is set to 2 for “Dataset 2a” and 1 for “Dataset 2b”.

Table 4

Optimal division of band-pass filters

Frequency (Hz): [8, 12], [10, 14], [12, 16], [14, 18], [16, 20], [18, 22], [20, 24], [22, 26], [24, 28], [26, 30]
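Below is a minimal sketch of the FB-CSP extraction under stated assumptions: each of the ten band-pass filters in Table 4 is applied to the epoched trials, and MNE's `CSP` implementation (not necessarily the authors' own code) extracts 2m log-variance components per band; reading the parameter m as m pairs of spatial filters is our interpretation.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from mne.decoding import CSP

FS = 250
BANDS = [(f, f + 4) for f in range(8, 28, 2)]  # [8,12], [10,14], ..., [26,30]

def fbcsp_features(epochs, labels, m=2):
    """epochs: (n_trials, n_channels, n_samples); labels: (n_trials,)."""
    feats = []
    for lo, hi in BANDS:
        b, a = butter(4, [lo, hi], btype="bandpass", fs=FS)
        filtered = filtfilt(b, a, epochs, axis=-1)
        csp = CSP(n_components=2 * m, log=True)    # 2m components per band
        feats.append(csp.fit_transform(filtered, labels))
    return np.concatenate(feats, axis=1)           # (n_trials, 10 * 2m)
```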

After the spatial-frequency features are extracted, two separate experiments are conducted to confirm the parameters and to validate the performance of spatial-frequency-sequential relationships in motor imagery classification:

  1. EEG modeling experiments: First, smoothing time windows with sizes in the range [0, 4] are applied to the FB-CSP features to measure their effect on classification performance. After validating how the sequential relationships affect performance, two sub-experiments on “Dataset 2a Subject 3” are presented to confirm, through cross-entropies and accuracies, whether a deep RNN architecture can model EEG signals well. A further sub-experiment finds the optimal number of hidden layers in the deep RNN architecture.

  2. Classification experiments: For motor imagery classification, the spatial-frequency FB-CSP features are fed into the deep RNN architecture to obtain spatial-frequency-sequential relationships. The spatial-frequency features are cropped by a sliding window whose size equals the optimal number of hidden layers. For classification with the LSTM-RNN and GRU-RNN architectures, the accuracies, errors, and efficiency of classification are compared between spatial-frequency features and spatial-frequency-sequential relationships.

EEG modeling experiments and results

To measure how the sequential relationships influence classification performance, a group of smoothing windows with sizes in the range [0, 4] is applied to the FB-CSP features. In our experiments, an SVM classifier with an RBF kernel performs motor imagery classification on the smoothed FB-CSP features; a sketch of this setup is given below. Figure 7 illustrates the smoothing-time-window results for “Dataset 2a” and “Dataset 2b”; among the results, “SW=0” denotes the FB-CSP features without smoothing. The results show that classification performance is strongly influenced by the smoothing time windows. The RNN architecture is therefore introduced in this study to extract spatial-frequency-sequential relationships from FB-CSP features for classification. First, however, we must validate that an RNN architecture can represent spatial-frequency-sequential relationships.
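The following is a minimal sketch of this experiment, assuming the FB-CSP features are available per time step (shape (n_samples, n_features)) and that “SW=k” denotes a length-k moving average; the paper does not spell out the exact windowing, so this is one plausible reading.

```python
import numpy as np
from sklearn.svm import SVC

def smooth(features, sw):
    """Moving-average smoothing; SW=0 returns the unsmoothed features."""
    if sw == 0:
        return features
    kernel = np.ones(sw) / sw
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, features)

X = np.random.randn(200, 40)                   # toy per-time-step FB-CSP features
y = np.random.randint(0, 2, 200)               # toy binary labels
clf = SVC(kernel="rbf").fit(smooth(X, 3), y)   # RBF kernel, as in the paper
```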

Fig. 7

The classification results using different sizes of smoothed FB-CSP features and SVM for both “Dataset 2a” and “Dataset 2b”. A group of smoothing windows with sizes in the range [0, 4] is applied to the FB-CSP features, and an SVM classifier performs motor imagery classification on the smoothed features; “SW=0” denotes the FB-CSP features without smoothing. The size of the smoothing time window strongly influences the classification performance. a Classification results for “Dataset 2a” and b classification results for “Dataset 2b”

There are three steps to validate that the RNN architecture represents spatial-frequency-sequential relationships. First, to test whether the deep RNN architecture can model EEG signals, we train it with 200 iterations of the SGD algorithm on the 22 channels of the first three seconds of EEG signals from “Dataset 2a Subject 3”. To test the modeling ability, the previous outputs are fed back into the model’s inputs to predict the current EEG samples; a sketch of this free-running prediction appears below. The results on channel “C3” with 20, 30, and 90 hidden layers are drawn in Fig. 8. The results show that the deep RNN architecture predicts the signals more closely as the number of hidden layers increases: the predictions with 20 hidden layers matched the EEG signals only for the first few samples, the predictions with 30 hidden layers matched almost half of the remaining samples, and the predictions with 90 hidden layers matched all remaining samples for both the LSTM-RNN and GRU-RNN architectures. A higher number of hidden layers yields richer sequential relationships whose spectrum is similar to that of the EEG signals.
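A minimal PyTorch sketch of this free-running prediction test follows: after a teacher-forced warm-up on observed EEG, the model's own outputs are fed back as inputs to predict the remaining samples. Note that the paper counts “hidden layers”, which we read here as the width of a single recurrent layer; the GRU cell and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    def __init__(self, n_channels=22, hidden=90):
        super().__init__()
        self.rnn = nn.GRU(n_channels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_channels)

    def forward(self, x, steps):
        # x: (1, t_warmup, n_channels) observed EEG used as warm-up
        _, h = self.rnn(x)
        y = self.out(h[-1]).unsqueeze(1)      # first free-running prediction
        preds = [y]
        for _ in range(steps - 1):            # feed predictions back in
            _, h = self.rnn(y, h)
            y = self.out(h[-1]).unsqueeze(1)
            preds.append(y)
        return torch.cat(preds, dim=1)        # (1, steps, n_channels)

warmup = torch.randn(1, 250, 22)              # toy first second of a trial
future = Predictor()(warmup, steps=500)       # predict the next two seconds
```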

Fig. 8

The prediction results of LSTM-RNN and GRU-RNN with 20, 30, and 90 hidden layers on channel “C3”. The deep RNN architecture predicts the signals more closely as the number of hidden layers increases; a higher number of hidden layers yields richer sequential relationships with a spectrum similar to the EEG signals. a 20 hidden layers of LSTM units, b 20 hidden layers of GRUs, c 30 hidden layers of LSTM units, d 30 hidden layers of GRUs, e 90 hidden layers of LSTM units and f 90 hidden layers of GRUs

Second, we evaluate the classification performance of the deep RNN architecture with 200 iterations of the SGD algorithm on the training data of “Dataset 2a Subject 3”. The loss function for the EEG signals in the RNN architecture is the logarithmic cross-entropy, defined as [61]:

$$ E = -\frac{1}{N}\sum\limits_{t = 1}^{T} \left[ c_{t}\log b_{c}^{t} + (1 - c_{t})\log \left(1 - b_{c}^{t}\right) \right] $$

(8)

where $c_{t}$ is the ground-truth label and $b_{c}^{t}$ is the prediction of the deep classifier. The number of iterations for optimizing the loss function is an empirical value chosen to control the training epochs while limiting the number of hidden layers. Figure 9 gives the training and validation cross-entropies as the number of hidden layers increases. The results show that the training and validation cross-entropies separate beyond 20 hidden layers. The cross-entropies stop decreasing once the signals are over-fitted by the RNN architecture; in fact, they do not increase either, because the deep RNN architecture continues to learn components of the signals that are common to all EEG sequences. Comparing the LSTM-RNN and GRU-RNN architectures, the LSTM-RNN architecture needs more hidden layers to achieve the same level of cross-entropy in EEG classification.
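As a minimal NumPy sketch, the loss in (8) can be computed directly from the classifier's probabilities; the clipping constant is our addition for numerical stability.

```python
import numpy as np

def cross_entropy(targets, preds, eps=1e-12):
    """targets: ground truth c_t; preds: predicted probabilities b_c^t."""
    preds = np.clip(preds, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(targets * np.log(preds)
                    + (1.0 - targets) * np.log(1.0 - preds))
```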

Fig. 9

The training and validation cross-entropy curves of LSTM-RNN and GRU-RNN as the number of hidden layers increases. The training and validation cross-entropies separate beyond 20 hidden layers. The cross-entropies stop decreasing once the signals are over-fitted by the RNN; they do not increase either, because the deep RNN architecture continues to learn components of the signals common to all EEG sequences. a The curves of LSTM-RNN and b the curves of GRU-RNN

Third, since a large number of hidden layers incurs high computational complexity and causes over-fitting, which lowers validation accuracy, Fig. 10 gives the training and validation accuracies as the number of hidden layers increases. The results show that the validation accuracies peak at around 15–20 hidden layers. Comparing the LSTM-RNN and GRU-RNN architectures, the LSTM-RNN architecture needs more hidden layers to achieve the same level of accuracy in EEG classification. When the deep RNN architecture over-fits, the accuracy of the GRU-RNN drops more sharply than that of the LSTM-RNN.

Fig. 10

The training and validation accuracy curves of LSTM-RNN and GRU-RNN as the number of hidden layers increases. The validation accuracies peak at around 15–20 hidden layers. The accuracy results can appear to contradict the cross-entropy results, and EEG classification quickly over-fits: the deep RNN architecture continues to learn common components of the EEG sequences while simultaneously learning signal noise and non-discriminative components. There is therefore a trade-off between classification and the long-term patterns of modeling errors. a The curves of LSTM-RNN and b the curves of GRU-RNN

Classification experiments and results

Let $\overline{Z}_{c} \in \mathbb{R}^{M \times T}$ denote the spatial-frequency features, where M is the feature dimension and T is the number of samples in each trial. After the EEG modeling experiments, the optimal number of hidden layers for the LSTM-RNN and GRU-RNN is confirmed for every subject. This optimal number τ is then used to crop the training and validation sets with the sliding-window cropping strategy. Hence, the number of samples per trial in the training and validation sets increases by a factor of T−τ, enough to satisfy the deep RNN architecture. After the training procedure, the validation procedure produces T−τ classification targets per trial; the unique target of the trial is then obtained by averaging the targets of all time slices, as sketched below.
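A minimal sketch of this cropping-and-averaging scheme, assuming trial features of shape (M, T) and per-crop class probabilities from the trained network:

```python
import numpy as np

def crop_trial(features, tau):
    """features: (M, T) -> (T - tau, M, tau) sliding-window crops."""
    M, T = features.shape
    return np.stack([features[:, t : t + tau] for t in range(T - tau)])

def trial_prediction(crop_probs):
    """crop_probs: (T - tau, n_classes) -> single trial-level target."""
    return int(np.argmax(crop_probs.mean(axis=0)))  # average, then decide
```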

To confirm the parameters and weights of the RNN architecture, the non-linearity and non-stationarity of EEG signals must be considered, since these characteristics limit the reliability of conventional activation functions in deep learning architectures. Three activation strategies, “tanh”, “sigmoid”, and “ReLU”, are therefore used to construct the activation functions [62]. The “tanh” activation is applied to the cell input activation of both the LSTM unit and the GRU. The “sigmoid” activation is applied to the cell output activation of the LSTM unit. To prevent vanishing error flow, the “ReLU” activation is applied to the gate activations of both the LSTM unit and the GRU. The weights of the RNN are initialized from a Gaussian distribution N(0, 0.2). The BPTT algorithm trains the RNN by minimizing the cross-entropy loss function (see (8)). Because the Adam strategy [63] is suitable for time series in deep classifiers and its momentum improves the robustness of error flow, it is applied to compute the learning rate during BPTT. Finally, a “Dropout” strategy is applied to prevent over-fitting [64]; its key idea is to randomly eliminate units (along with their connections) from the neural network during training. Empirically, the dropout rate is set to 0.2 and the maximum number of iterations to 200. A sketch of this configuration is given below.
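Below is a minimal PyTorch sketch of this configuration. Stock LSTM/GRU cells fix the gate activations to sigmoid, so the paper's ReLU-gated variant would need a custom cell; the N(0, 0.2) weight initialization, dropout rate 0.2, Adam optimizer, and cross-entropy loss follow the text, while the hidden size and class count are placeholders.

```python
import torch
import torch.nn as nn

class MIClassifier(nn.Module):
    def __init__(self, n_features, hidden, n_classes, unit="gru"):
        super().__init__()
        rnn_cls = nn.GRU if unit == "gru" else nn.LSTM
        self.rnn = rnn_cls(n_features, hidden, batch_first=True)
        self.drop = nn.Dropout(p=0.2)            # dropout rate from the paper
        self.fc = nn.Linear(hidden, n_classes)
        for p in self.parameters():              # N(0, 0.2) initialization
            if p.dim() > 1:
                nn.init.normal_(p, mean=0.0, std=0.2)

    def forward(self, x):                        # x: (batch, tau, n_features)
        _, h = self.rnn(x)
        h_last = h[0][-1] if isinstance(h, tuple) else h[-1]  # LSTM vs. GRU
        return self.fc(self.drop(h_last))

model = MIClassifier(n_features=40, hidden=30, n_classes=4)
optim = torch.optim.Adam(model.parameters())     # Adam, as in the paper
loss_fn = nn.CrossEntropyLoss()                  # minimized over 200 iterations
```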

For both “Dataset 2a” and “Dataset 2b”, we train two different RNN architectures, one with LSTM units and one with GRUs. Figure 11 illustrates the learning curves for the different memory units on the two datasets. On both datasets, the GRU-RNN architecture converges faster than the LSTM-RNN architecture and requires fewer iterations to reach the lowest loss. For some subjects, the GRU-RNN architecture reaches a lower average cross-entropy loss than the LSTM-RNN architecture within 200 iterations. Overall, the subjects’ EEG signals from “Dataset 2a” and “Dataset 2b” yield similar average cross-entropies for the LSTM-RNN and GRU-RNN architectures.

Fig. 11

The learning curves for different memory units on the two datasets. On both datasets, GRU-RNN converges faster than LSTM-RNN and requires fewer iterations to reach the lowest loss. For some subjects, GRU-RNN reaches a lower average cross-entropy loss than LSTM-RNN within 200 iterations. Overall, the subjects’ EEG signals from “Dataset 2a” and “Dataset 2b” yield similar average cross-entropies for LSTM-RNN and GRU-RNN. a LSTM-RNN on “Dataset 2a”, b GRU-RNN on “Dataset 2a”, c LSTM-RNN on “Dataset 2b” and d GRU-RNN on “Dataset 2b”

Table 5 compares the average training and validation time per trial between spatial-frequency-sequential relationships and spatial-frequency features. “Dataset 2a Subject 7” and “Dataset 2b Subject 8” are used to measure the average training and validation times per trial. As Table 5 shows, since the deep neural network architectures (RNN and CNN) need more iterations to converge during training, their average training time is significantly higher than that of the conventional SVM model. In the validation phase, however, the deep architectures achieve a level of time complexity comparable to the SVM model; hence, the RNN architecture incurs acceptable time consumption in MI-BCI applications. Moreover, comparing the two memory units of the RNN architecture, the GRU-RNN outperforms the LSTM-RNN in both training and validation time on the EEG signals.

Table 5

Training and validation time comparison between spatial-frequency-sequential features and spatial-frequency features on “Dataset 2a Subject 7” and “Dataset 2b Subject 8”

                    |                 | LSTM-RNN | GRU-RNN | CNN    | SVM
Training time (s)   | “Dataset 2a S7” | 272.26   | 214.34  | 478.08 | 8.19
                    | “Dataset 2b S8” | 169.80   | 148.11  | 342.00 | 22.83
Validation time (s) | “Dataset 2a S7” | 1.89     | 1.49    | 3.32   | 2.01
                    | “Dataset 2b S8” | 3.40     | 2.96    | 6.84   | 4.17

The classification results of all subjects in “Dataset 2a” and “Dataset 2b” are listed in Tables 6 and 7, respectively. Spatial-frequency-sequential relationships extracted by the LSTM-RNN and GRU-RNN architectures are compared against spatial-frequency features classified by a CNN and by SVMs with linear, polynomial, and RBF kernels. The results are presented as error rates, and a paired t-test is used to detect whether the spatial-frequency-sequential relationships significantly outperform the spatial-frequency features in MI-BCI classification; a sketch of this test is given below. For each subject, we confirm the optimal number of hidden layers (ONHL) for the LSTM-RNN and GRU-RNN architectures.
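A minimal sketch of this test with SciPy, using the per-subject error rates of GRU-RNN (b) and SVM-Linear (f) from Table 6 under our reading of its columns:

```python
import numpy as np
from scipy.stats import ttest_rel

gru = np.array([15.18, 34.68, 16.46, 32.33, 36.00, 29.13, 15.04, 28.05, 31.10])
svm = np.array([17.71, 46.18, 17.71, 42.71, 38.54, 47.57, 12.16, 31.94, 34.03])
t_stat, p_value = ttest_rel(gru, svm)   # paper reports p < 0.05 for b vs. f
```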

Table 6

The misclassification rate (%) and variance of motor imagery classification on “Dataset 2a”

Subject | ONHL (LSTM) | (a) LSTM-RNN | ONHL (GRU) | (b) GRU-RNN | (c) CNN     | (d) SVM-RBF | (e) SVM-Poly | (f) SVM-Linear
S1      | 37          | 16.41 ±2.92  | 30         | 15.18 ±2.86 | 20.86 ±3.19 | 18.06 ±2.88 | 18.40 ±2.94  | 17.71 ±3.03
S2      | 35          | 34.13 ±5.63  | 29         | 34.68 ±5.56 | 48.18 ±6.87 | 43.40 ±6.69 | 46.53 ±6.55  | 46.18 ±6.36
S3      | 41          | 21.19 ±3.42  | 33         | 16.46 ±3.55 | 17.59 ±3.48 | 15.97 ±3.24 | 15.28 ±3.35  | 17.71 ±3.36
S4      | 40          | 31.67 ±5.15  | 31         | 32.33 ±5.60 | 44.12 ±7.37 | 39.24 ±5.64 | 37.15 ±5.48  | 42.71 ±7.19
S5      | 34          | 36.14 ±5.72  | 27         | 36.00 ±5.65 | 32.53 ±5.28 | 57.99 ±9.75 | 40.62 ±6.24  | 38.54 ±5.82
S6      | 36          | 33.11 ±5.15  | 26         | 29.13 ±5.36 | 49.30 ±8.68 | 46.87 ±8.98 | 47.22 ±8.52  | 47.57 ±8.49
S7      | 33          | 23.52 ±3.40  | 24         | 15.04 ±2.86 | 16.63 ±2.72 | 13.54 ±3.64 | 30.21 ±3.57  | 12.16 ±2.40
S8      | 36          | 23.60 ±3.63  | 28         | 28.05 ±3.75 | 13.61 ±3.86 | 29.51 ±4.35 | 28.13 ±4.63  | 31.94 ±5.72
S9      | 39          | 26.99 ±4.14  | 26         | 31.10 ±4.26 | 15.71 ±3.85 | 26.39 ±4.50 | 29.17 ±5.21  | 34.03 ±5.43
AVG     |             | 27.42 ±4.35  |            | 26.44 ±4.38 | 28.73 ±5.03 | 32.33 ±5.52 | 32.52 ±5.17  | 32.06 ±5.31
Paired t-test |       | a vs. f      |            | b vs. f     | c vs. b     | d vs. b     | e vs. b      |
p-value |             | p=0.13       |            | p<0.05      | p=0.59      | p=0.08      | p<0.05       |

Table 7

The misclassification rate (%) and variance of motor imagery classification on “Dataset 2b”

Subject | ONHL (LSTM) | (a) LSTM-RNN | ONHL (GRU) | (b) GRU-RNN | (c) CNN     | (d) SVM-RBF | (e) SVM-Poly | (f) SVM-Linear
S1      | 39          | 22.61 ±4.63  | 31         | 20.24 ±4.29 | 30.66 ±5.96 | 35.94 ±6.46 | 32.50 ±5.27  | 33.44 ±5.65
S2      | 36          | 28.15 ±5.86  | 27         | 27.24 ±5.69 | 36.76 ±5.42 | 46.43 ±6.48 | 47.86 ±6.86  | 46.79 ±7.62
S3      | 44          | 27.87 ±4.53  | 35         | 26.85 ±4.62 | 38.61 ±5.27 | 45.94 ±7.52 | 44.69 ±7.40  | 45.62 ±7.65
S4      | 42          | 8.64 ±2.13   | 32         | 7.53 ±2.20  | 1.87 ±1.03  | 3.12 ±2.58  | 3.75 ±2.62   | 4.06 ±2.18
S5      | 36          | 14.67 ±3.46  | 28         | 13.75 ±3.35 | 14.58 ±3.28 | 19.37 ±4.25 | 16.25 ±3.86  | 16.25 ±3.97
S6      | 32          | 17.92 ±4.16  | 30         | 15.49 ±4.23 | 12.64 ±3.60 | 21.87 ±5.30 | 22.19 ±5.14  | 23.12 ±5.21
S7      | 33          | 14.53 ±3.59  | 39         | 13.35 ±3.46 | 10.06 ±2.84 | 22.81 ±4.72 | 22.50 ±4.41  | 22.81 ±4.23
S8      | 35          | 7.25 ±2.15   | 25         | 6.53 ±2.03  | 3.16 ±1.86  | 10.00 ±3.68 | 9.69 ±3.13   | 10.31 ±3.24
S9      | 38          | 24.64 ±4.62  | 28         | 24.31 ±4.72 | 30.66 ±5.13 | 17.19 ±3.42 | 16.56 ±3.35  | 17.19 ±3.41
AVG     |             | 18.48 ±3.90  |            | 17.25 ±3.84 | 19.89 ±3.83 | 24.74 ±4.93 | 24.00 ±4.67  | 24.40 ±4.80
Paired t-test |       | a vs. f      |            | b vs. f     | c vs. b     | d vs. b     | e vs. b      |
p-value |             | p=0.08       |            | p<0.05      | p=0.28      | p<0.05      | p=0.06       |

The results in Tables 6 and 7 show that, on both datasets, the spatial-frequency-sequential relationships outperform the spatial-frequency features in MI-BCI classification. On “Dataset 2a”, the LSTM-RNN and GRU-RNN achieve average error rates of 27.42% and 26.44%, respectively; the paired t-tests show that the GRU-RNN (b) achieves a significantly lower error rate than SVM-Polynomial (e) (p<0.05) and SVM-Linear (f) (p<0.05). On “Dataset 2b”, the LSTM-RNN and GRU-RNN achieve average error rates of 18.48% and 17.25%, respectively; the paired t-tests show that the GRU-RNN (b) achieves a significantly lower error rate than SVM-RBF (d) (p<0.05) and SVM-Linear (f) (p<0.05). To compare the classification performance of the GRU-RNN and LSTM-RNN, Fig. 12 gives the classification accuracies of all subjects with all algorithms.

Fig. 12

The classification accuracy of all subjects with all algorithms. CNN and SVM (Linear) outperform RNN for subjects with high (over 60%) accuracies (S3, S7, S8, S9 in “Dataset 2a” and S4, S6, S8, S9 in “Dataset 2b”), whereas RNN outperforms CNN and SVM (Linear) for subjects with low (below 60%) accuracies (S2, S4, S6 in “Dataset 2a” and S2, S3 in “Dataset 2b”). On average, RNN outperforms CNN and SVM (Linear). a Classification accuracy for “Dataset 2a” and b classification accuracy for “Dataset 2b”

From the results in Fig. 12, CNN and SVM (Linear) outperformed RNN for subjects with high (over 60%) accuracies (S3, S7, S8, S9 in “Dataset 2a” and S4, S6, S8, S9 in “Dataset 2b”). However, for subjects with low (below 60%) accuracies, RNN outperformed CNN and SVM (Linear) (S2, S4, S6 in “Dataset 2a” and S2, S3 in “Dataset 2b”). On average, the RNN architecture outperformed CNN and SVM (Linear). Moreover, a comparison of the averaged accuracies of the LSTM unit and GRU on both datasets shows that the GRU-RNN architecture outperformed the LSTM-RNN.
