SpikingYOLOv4

ABSTRACT

Neuromorphic vision is an emerging research field that studies neuromorphic cameras and spiking neural networks (SNNs) for computer vision. Neuromorphic cameras stream spike events rather than frame-based images, so object detection algorithms must operate directly on spike events. In this paper, we propose an object detection method for spike events. Spike events are first decoded into event images according to the computational methodology of neuromorphic theory. The event images can be regarded as high-frame-rate change-detection images of moving objects. A redesigned deep learning framework is proposed to process the event images for object detection. We propose a deep SNN method that is obtained by converting established convolutional neural networks (CNNs) trained on event images. Networks with multiscale representation are discussed and designed in our method. We also design a semi-automatic data labeling method that builds event-image datasets using object tracking algorithms. The proposed solution therefore comprises spike event decoding, a redesigned deep SNN, and an event-image data augmentation algorithm. Experiments are conducted on the MNIST-DVS dataset, a benchmark dataset for neuromorphic vision, and on our event pedestrian detection dataset. The experimental results show that the deep SNN trained with our augmented data performs close to the model trained on manually labeled data. A performance comparison on the PAFBenchmark dataset shows that our proposed method achieves higher accuracy than existing SNN methods and better energy efficiency, with lower energy consumption, than existing CNN methods. These results demonstrate that our deep SNN method is a feasible solution for neuromorphic vision. The intuition that a deep SNN trained with more data achieves better accuracy is also confirmed for this new research field.

INDEX TERMS Deep spiking neural network, semi-automatic data labeling, event camera, multiscale representation, YOLO.

STATE-OF-THE-ART COMPARISON

We compare the performance of our method with existing SNNs and object detection models. Tiny YOLOv3, fcYOLE [1], Spiking-YOLO [2], and YOLOv4 [3] are compared using the MNIST-DVS [4] and FJUPD datasets. In this experiment, our deep SNN method adopts YOLOv4 as the base CNN model for conversion and uses the semi-automatic labeling algorithm for data augmentation. We refer to our method as SpikingYOLOv4 in the following discussion. A manually labeled test set is used to obtain the performance metrics, and power consumption is also compared. We use the semi-automatic labeling algorithm to generate a total of 5,912 labeled images for training, each of size 438 × 438. The labeling algorithm also generates 590 labeled images for validation, and 590 manually labeled images are used as the test set.

Figure 1 shows the training and test results obtained in the FJU event pedestrian detection experiment. Our SpikingYOLOv4 achieves better results than Spiking-YOLO and also has the lowest power consumption of the compared models. Although its performance is slightly lower than that of the CNN models, its power consumption is significantly reduced, which indicates that SNNs have a power advantage over CNNs.

Figure 1. Comparison of the models on (a) F-measure score and (b) power consumption.
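The F-measure reported in Figure 1(a) combines precision and recall into a single score. The text does not state which variant or matching threshold is used, so the following is only a minimal sketch of the standard F-beta definition computed from detection counts. The function name `f_measure` and the example counts are hypothetical and are not taken from the experiments.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from detection counts; beta = 1 gives the usual F1 score.

    tp, fp, fn are the numbers of true positives, false positives, and
    false negatives. How a detection is matched to a ground-truth box
    (for example, an IoU threshold) is an assumption left to the caller.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is detected correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(f_measure(tp=8, fp=2, fn=2))
```

With equal precision and recall the score equals both of them. When the two differ, the harmonic mean pulls the score toward the lower value, which is why the F-measure penalizes detectors that trade one heavily for the other.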

Figure 2. Detection results for three object scales and two models. From top to bottom, the object scales are large, normal, and small. The middle column shows the results of semi-automatic labeling with the PAFBPretrain model, and the right column shows the results of manual labeling with the PAFBPretrain model.

CONCLUSION

This paper proposes a framework for event-based object detection that comprises event images, a multiscale deep SNN, and event-image data augmentation. The spike events captured by an event camera are first decoded into event images for training deep CNN models. The conversion of the multiscale network representation is discussed and designed in our deep SNN method. A semi-automatic data augmentation algorithm is devised that uses a visual tracking algorithm to label the objects in event images.
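To make the decoding step concrete, the sketch below accumulates spike events that fall inside a time window into a single grayscale event image. It is an illustrative assumption rather than one of the decoding schemes evaluated in this work, which are not specified in this section. The event layout `(x, y, t, polarity)`, the function name `events_to_image`, and the count-based normalization are all hypothetical.

```python
def events_to_image(events, height, width, t_start, t_end):
    """Accumulate events with t_start <= t < t_end into a 0..255 image.

    events: iterable of (x, y, t, polarity) tuples (assumed layout).
    Returns a list of `height` rows, each a list of `width` integers.
    Polarity is ignored here; each event simply adds one count to its pixel.
    Assumes height and width are positive.
    """
    counts = [[0] * width for _ in range(height)]
    for x, y, t, _polarity in events:
        inside_window = t_start <= t < t_end
        inside_frame = 0 <= x < width and 0 <= y < height
        if inside_window and inside_frame:
            counts[y][x] += 1

    peak = max(max(row) for row in counts)
    if peak == 0:
        return counts  # no events in the window: an all-zero image
    # Scale so that the busiest pixel maps to 255.
    return [[c * 255 // peak for c in row] for row in counts]


if __name__ == "__main__":
    # Hypothetical events: two at pixel (x=1, y=0), one at (x=2, y=1),
    # and one outside the time window.
    demo = [(1, 0, 10, 1), (1, 0, 20, 0), (2, 1, 15, 1), (0, 0, 99, 1)]
    for row in events_to_image(demo, height=2, width=3, t_start=0, t_end=50):
        print(row)
```

Shortening the window raises the effective frame rate of the event images, at the cost of fewer events per image. Pixels where nothing changes receive no events and stay at zero, which is consistent with viewing event images as change-detection images of moving objects.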

Three different decoding schemes are evaluated experimentally, and their advantages and disadvantages are discussed. The results on the MNIST-DVS benchmark dataset show that the proposed algorithmic flow successfully converts an existing CNN for spike-event domain applications while retaining high accuracy. A detailed performance analysis on the FJUPD dataset demonstrates the multiscale representation of the SpikingYOLOv4 network. The final PAFBenchmark experiment shows that our proposed method achieves higher accuracy than existing SNN methods and better energy efficiency, with lower energy consumption, than existing CNN methods. These results demonstrate that our deep SNN method is a feasible solution for neuromorphic vision. The intuition that a deep SNN trained with more data achieves better accuracy is also confirmed.
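The idea behind CNN-to-SNN conversion can be illustrated with a single neuron. A common approach, assumed here because the neuron model is not described in this section, replaces each ReLU unit with an integrate-and-fire neuron whose firing rate approximates the ReLU output. The sketch below uses reset-by-subtraction and a constant input; the function name `if_neuron_rate` and its parameters are hypothetical.

```python
def if_neuron_rate(input_current: float, timesteps: int,
                   threshold: float = 1.0) -> float:
    """Firing rate of an integrate-and-fire neuron under a constant input.

    The membrane potential integrates the input at every timestep. When it
    reaches the threshold, the neuron emits one spike and the threshold is
    subtracted from the potential (reset-by-subtraction).
    """
    membrane = 0.0
    spikes = 0
    for _ in range(timesteps):
        membrane += input_current
        if membrane >= threshold:
            membrane -= threshold
            spikes += 1
    return spikes / timesteps


if __name__ == "__main__":
    # Compare the firing rate with ReLU(a) = max(0, a) for a few inputs.
    for a in (-0.5, 0.25, 0.5):
        print(a, max(0.0, a), if_neuron_rate(a, timesteps=100))
```

For inputs between zero and the threshold, the rate converges to the ReLU value as the number of timesteps grows, and negative inputs never produce a spike. Inputs above the threshold saturate at one spike per timestep, which is why conversion methods typically normalize weights or thresholds so that activations stay within this range.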


REFERENCES

[1] T. Delbruck, "Fun with asynchronous vision sensors and processing," European Conference on Computer Vision (ECCV) Workshops and Demonstrations, vol. 7583, no. 52, pp. 506–515, Berlin, Heidelberg: Springer Berlin Heidelberg, 2012.

[2] S. Kim, S. Park, B. Na, and S. Yoon, "Spiking-YOLO: Spiking neural network for energy-efficient object detection," Computer Vision and Pattern Recognition (CVPR), 2020.

[3] A. Bochkovskiy, C. Wang, and H. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," Computer Vision and Pattern Recognition (CVPR), 2020.

[4] M. Cannici, M. Ciccone, A. Romanoni, and M. Matteucci, "Asynchronous convolutional networks for object detection in neuromorphic cameras," Computer Vision and Pattern Recognition (CVPR), 2019.