Multi-Domain Joint Learning (MDJL) of Pedestrian Detection for Quadrotors

Publication: Yuan-Kai Wang, J. Guo, T. M. Pan, "Multidomain Joint Learning of Pedestrian Detection for Application to Quadrotors," Drones, vol. 6, 430, 2022. DOI


Pedestrian detection and tracking are critical functions in the field of computer vision. They play important roles in autonomous driving and are vital to the safety of passengers and pedestrians in terms of accident avoidance. These functions are also potentially useful in drones for numerous applications including the rescue of victims after a natural disaster or crowd control. However, drones can capture images at multiple angles and at different heights, which poses a challenge for pedestrian detection and tracking.

In this study, we propose a novel multi-domain joint learning (MDJL) pedestrian detection method that overcomes this challenge by exploiting datasets from multiple domains. Using domain-guided dropout, the network self-organizes task-specific areas according to the neuron impact scores of each domain. After training and fine-tuning the network, the accuracy of the obtained model improved in all the domains. In addition, we combined MDJL with a Markov decision process (MDP) tracker to create a multi-object tracking system for flying drones. The experimental results revealed that MDJL can effectively address a multitude of different scenarios and significantly improve the system’s tracking performance.

INDEX TERMS Pedestrian detection; multi-object tracking; multi-task learning; multi-domain joint learning; MDJL; drones application.

Pedestrian detection with different viewpoints from drones and robots

In computer vision, a domain broadly refers to a collection of data that share the same features. For example, Figure 1 shows three pedestrian detection datasets that belong to different domains. From left to right: (a) videos recorded by a driving recorder, (b) campus surveillance videos, and (c) aerial videos obtained by drones. The objective is to train the pedestrian detection network on the combined datasets of multiple domains so that the learned model generalizes better and can cope with the challenges of different domains.

Figure 1. Pedestrian detection datasets from three domains: (a) front view, (b) overlook view, (c) top view.


Our proposed pedestrian detection method, multi-domain joint learning (MDJL), is derived from multitask learning. It uses the characteristics of domain-guided dropout (DGD) to allow the network to learn features effectively when trained on a multidomain dataset, which improves accuracy and generalization ability across the domains. It was applied to a multi-pedestrian tracking algorithm for use in the changeable environment of a flying drone.

Figure 2 shows the flowchart for the proposed method. First, initialization training is performed so that the network can initially learn the features of each domain. The neuron impact score of each domain is then calculated. The dropout in the network is replaced with DGD (domain-guided dropout) [1] and training is continued. Finally, domain-specific fine-tuning is performed for each domain-specific neuron. This process divides the network into several subnets, as shown in Figure 3, and each subnet is specialized to a specific domain. Based on the characteristics of DGD, subnets may overlap, which means that the domains have common data features.

Figure 2. MDJL flowchart

Figure 3. Schematic diagram of MDJL subnet segmentation
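The impact-score and dropout steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes, following the definition in [1], that a neuron's impact score for a domain is the average increase in loss when that neuron's response is set to zero, and that the deterministic form of DGD keeps only the neurons with a positive score. The toy linear classifier, all function names, and the example numbers are hypothetical.

```python
import math
import random


def softmax_cross_entropy(logits, label):
    """Cross-entropy loss of a softmax over `logits` for the true class `label`."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def class_logits(features, weights):
    """Toy classifier head: weights[c][i] links neuron i to class c."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def impact_scores(samples, weights):
    """Average loss increase per neuron when that neuron is zeroed.

    `samples` is a list of (features, label) pairs from ONE domain.
    """
    num_neurons = len(samples[0][0])
    scores = [0.0] * num_neurons
    for features, label in samples:
        base = softmax_cross_entropy(class_logits(features, weights), label)
        for i in range(num_neurons):
            ablated = list(features)
            ablated[i] = 0.0  # remove neuron i's response
            loss = softmax_cross_entropy(class_logits(ablated, weights), label)
            scores[i] += (loss - base) / len(samples)
    return scores


def deterministic_mask(scores):
    """Keep a neuron only if removing it would increase the domain's loss."""
    return [1 if s > 0 else 0 for s in scores]


def stochastic_mask(scores, temperature=1.0, rng=random):
    """Keep neuron i with probability sigmoid(score_i / temperature)."""
    return [
        1 if rng.random() < 1.0 / (1.0 + math.exp(-s / temperature)) else 0
        for s in scores
    ]


# Hypothetical domain with one sample, three neurons, and two classes.
weights = [[2.0, 0.0, -1.0], [0.0, 0.0, 1.0]]
domain_samples = [([1.0, 1.0, 1.0], 0)]
scores = impact_scores(domain_samples, weights)
mask = deterministic_mask(scores)  # neuron 0 helps, 1 is neutral, 2 hurts
```

Computing one mask per domain gives one subnet per domain. Neurons that receive a positive score in more than one domain are the shared part of the overlapping subnets described above.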


The experiments were run on a computer with an Intel Core i7-930 CPU, 24 GB of RAM, and an Nvidia GTX 1080Ti GPU with 11 GB of memory. The operating system was Ubuntu 14.04 LTS, and the primary software comprised CUDA 8.0, cuDNN 5.1, and MATLAB R2014b. We primarily used datasets from three sources: the Caltech pedestrian detection dataset, the PETS09 dataset, and the DroneFJU dataset. The DroneFJU dataset was obtained on the Fu Jen Catholic University campus using a quadrotor. The equipment was a DJI Matrice 100 with a Zenmuse integrated PTZ camera. Video was recorded at 1080 × 1920 pixels and 60 Hz, at two shooting angles: 45 degrees downward and 90 degrees downward. The challenge associated with this dataset was the downward shooting angle. In addition, because a wide-angle lens was used for video acquisition, the appearance of a pedestrian's face and overall outline changed with position and body posture.

We compared the proposed method with current state-of-the-art approaches. Figure 4 shows that, although MDJL was trained on a multi-domain dataset, it remains competitive with state-of-the-art methods on the pedestrian detection benchmark and outperforms MS-CNN after fine-tuning.

Figure 4. ROC curves compared with state-of-the-art methods.
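As a rough illustration of how a curve like the one in Figure 4 can be computed, the sketch below turns scored detections into points on a miss rate versus false positives per image curve. It assumes the evaluation follows the convention commonly used with the Caltech benchmark. The text does not state the exact protocol, so the function name and the example values are hypothetical, and matching detections to ground truth is taken as already done.

```python
def miss_rate_curve(detections, num_ground_truth, num_images):
    """Return (false positives per image, miss rate) points.

    `detections` is a list of (score, is_true_positive) pairs. One point
    is produced per detection, as the score threshold is lowered.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    points = []
    true_pos = 0
    false_pos = 0
    for _score, is_true_positive in ordered:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        fppi = false_pos / num_images
        miss_rate = 1.0 - true_pos / num_ground_truth
        points.append((fppi, miss_rate))
    return points


# Hypothetical example: 4 detections, 4 annotated pedestrians, 2 images.
example = [(0.9, True), (0.8, False), (0.7, True), (0.4, True)]
curve = miss_rate_curve(example, num_ground_truth=4, num_images=2)
```

A lower curve is better: at the same number of false positives per image, the detector misses fewer pedestrians.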


This paper addresses a common problem in applied machine learning: appropriate datasets are often unavailable. To address this issue, we proposed the MDJL method. It mitigates the drop in per-domain performance that occurs when a network is trained on multiple datasets, and it can improve the network’s performance in similar domains. The MDJL model is also more generalizable. When the number of samples differed significantly between datasets, we used data augmentation to balance the sample counts so that a large dataset would not dominate the network and degrade performance in the other domains. We also combined MDJL with an MDP tracker to create a multi-object tracking system suitable for flying drones. The characteristics of the MDJL model enable the tracking system to cope effectively with a camera platform that has a high degree of spatial freedom, such as an aerial camera. The experimental results show that MDJL can cope with different scenarios and significantly improve the system’s tracking performance.
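The balancing step can be sketched as follows. The text does not say which augmentation operations were used, so this example assumes a horizontal flip and pads every domain up to the size of the largest one. The function names, domain names, and sample counts are placeholders, not figures from the experiments.

```python
def augmentation_plan(domain_sizes):
    """Number of extra (augmented) samples each domain needs to match the largest."""
    target = max(domain_sizes.values())
    return {name: target - size for name, size in domain_sizes.items()}


def flip_image_horizontally(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def flip_box_horizontally(box, image_width):
    """Mirror an (x, y, width, height) pedestrian box to match a flipped image."""
    x, y, width, height = box
    return (image_width - x - width, y, width, height)


# Placeholder counts only: one large domain and two smaller ones.
sizes = {"domain_a": 1000, "domain_b": 400, "domain_c": 250}
plan = augmentation_plan(sizes)
```

Here `plan` states how many augmented samples to generate per domain. The bounding boxes have to be transformed together with the images, which is what `flip_box_horizontally` does for the flip case.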

When samples are expanded using data augmentation, it may be useful to take the characteristics of the original dataset into account, because doing so may yield different training results. In addition, a drone’s camera mobility and high shooting angle have no counterpart in fixed-position surveillance cameras or driving recorders, and an inappropriate viewing angle leads to more misjudgments. Improving the pedestrian tracking recognition rate for horizontal viewing angles should be investigated in the future.


[1] T. Xiao, H. Li, W. Ouyang and X. Wang, "Learning Deep Feature Representations with Domain Guided Dropout for Person Re-identification," in Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016.