Russian Journal of Nonlinear Dynamics, 2022, vol. 18, no. 5, pp. 787-802. Full-texts are available at http://nd.ics.org.ru DOI: 10.20537/nd221218
NONLINEAR ENGINEERING AND ROBOTICS
MSC 2010: 68T40, 68T45, 93C85
Image-Based Object Detection Approaches to Be Used in Embedded Systems for Robot Navigation
A. Ali Deeb, F. Shahhoud
This paper investigates the problem of object detection for real-time agent navigation using embedded systems. In real-world problems, a compromise between accuracy and speed must be found. We describe the architectures of different object detection algorithms, such as R-CNN and YOLO, and compare them on different embedded systems using different datasets. As a result, we provide a trade-off study based on accuracy and speed that helps to choose the appropriate object detection algorithm for a specific application task.
Keywords: robot navigation, object detection, embedded systems, YOLO algorithms, R-CNN algorithms, object semantics
1. Introduction
Target detection has attracted significant attention for autonomous robots due to its notable benefits and recent progress [1]. Target tracking can be used in autonomous vehicles for the development of guidance systems [2]. Pedestrian detection [3], dynamic vehicle detection, and obstacle detection [4] can improve the features of driving assistance systems. Object recognition technologies for self-driving vehicles have strict requirements in terms of accuracy, unambiguousness, robustness, space demand, and costs [5]. Similarly, object recognition and tracking can assist wheeled robots in navigation and obstacle avoidance.
Visual navigation systems in service robots can be used in many applications; they are typically deployed in retail, healthcare, and warehouses. Others are deployed in more rugged settings, such as space, defense, agriculture, and demolition, and for automating dangerous or laborious tasks.
Received September 15, 2022 Accepted November 10, 2022
Ahmad Ali Deeb [email protected] Farah Shahhoud [email protected]
Bauman Moscow State Technical University ul. 2-ya Baumanskaya, Moscow, 105005 Russia
Previously, target detection in robot systems mostly used vision-based target finding algorithms. For example, Raspberry Pi and OpenCV were used to find a target [6]. However, computer vision techniques might provide less accurate results and have issues in predicting unknown future data. On the other hand, machine learning target-detection algorithms can provide a very accurate result, and the model can make predictions from unknown future data. Visual recognition systems involving image classification, localization, and segmentation have accomplished extraordinary research contributions [7]. Moreover, deep learning has made great progress in solving issues in the fields of computer vision, image and video processing, and multimedia [8]. Because of the critical advancements in neural networks, particularly deep learning [9], these visual recognition systems have shown great potential in target tracking.
On-board and off-board ground-based systems are promising platforms in this context. Most of the time, the robot system cannot be equipped with heavy devices due to weight and power consumption. Therefore, off-board ground systems play a vital role. In some cases, communication with the ground station could be impossible due to distance or coverage. An on-board system that can support both weight and power consumption would be a perfect framework for such a situation and environment.
In the present study, we compare various object detection algorithms and different embedded systems on which to execute them. Based on this comparison, we choose the best algorithm and embedded system for real-time object detection in a robot system.
2. Object detection algorithms implemented in robot systems
2.1. Region proposal-based framework
The region proposal-based framework is a two-step process and matches the attentional mechanism of the human brain to some extent: it first performs a coarse scan of the whole scene and then focuses on regions of interest [10].
2.1.1. Region with CNN (R-CNN)
The original paper "Rich feature hierarchies for accurate object detection and semantic segmentation" [11] describes one of the first breakthroughs in the use of CNNs for object detection: a system called "R-CNN" ("Regions with CNN") that achieved much higher object detection performance than other popular methods at the time.
R-CNN generates features in a region using a CNN. The algorithm proposed in [11] employs selective search [12] to extract just 2000 regions from the entire input image. These regions are referred to as region proposals (RoIs) and have a high probability of containing an object. Therefore, instead of classifying a large number of regions, only about 2000 regions need to be processed, as shown in Fig. 1.
Because CNNs require a fixed input image size, the proposed RoIs are warped to a fixed size and then fed to a convolutional neural network that extracts features from each candidate region.
A classifier such as a support vector machine (SVM) [13, 14] then classifies the presence of an object within each candidate region proposal based on the features extracted in the previous step.
In addition to predicting the presence of an object within the region proposals, the algorithm also has a bounding-box regressor that predicts four values, namely the location and size of the bounding box that surrounds the object; the predictions are then filtered with greedy non-maximum suppression (NMS) [15, 16] to produce the final bounding boxes. The R-CNN architecture is shown in Fig. 1.

Fig. 1. R-CNN Architecture: (1) a selective search algorithm extracts RoIs from the input image; (2) the extracted regions are warped before being fed to the ConvNet; (3) each region is forwarded through the pretrained ConvNet to extract features; (4) the network produces bounding-box and classification predictions
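To make the pipeline concrete, the following is a minimal sketch of the R-CNN steps (selective search, warping, CNN feature extraction), not the implementation of [11]; the random placeholder image, the ResNet-18 backbone, and the omission of the SVM, bounding-box regression, and NMS stages are assumptions made for brevity.

```python
# Minimal R-CNN-style sketch: selective search -> warp -> CNN features.
# Assumes opencv-contrib-python, torch, and torchvision are installed.
import cv2
import numpy as np
import torch
import torchvision.models as models
from torchvision import transforms

image = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)   # placeholder BGR image

# 1. Selective search proposes candidate regions (RoIs).
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()
rects = ss.process()[:2000]                                     # keep at most 2000 proposals

# 2-3. Warp each proposal to a fixed size and extract CNN features for it.
backbone = models.resnet18(weights="DEFAULT")                   # torchvision >= 0.13
backbone = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()
to_tensor = transforms.ToTensor()

features = []
with torch.no_grad():
    for (x, y, w, h) in rects[:32]:                             # a few RoIs for brevity
        crop = cv2.resize(image[y:y + h, x:x + w], (224, 224))  # warp to the fixed input size
        features.append(backbone(to_tensor(crop).unsqueeze(0)).flatten())

# 4. In R-CNN, an SVM would then classify each feature vector, a regressor would
#    refine each box, and greedy NMS would filter the final detections.
print(f"{len(rects)} proposals; extracted features for {len(features)} of them")
```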
Limitations of R-CNN. Training the network still takes a huge amount of time, since 2000 region proposals have to be classified per image. The method cannot be implemented in real time, as it takes around 47 seconds per test image. Moreover, selective search is a fixed algorithm, so no learning happens at that stage, which can lead to the generation of bad candidate region proposals.
2.1.2. Fast R-CNN
The R-CNN model took a huge amount of time to train. Girshick [17] built a faster object detection algorithm known as Fast R-CNN to circumvent this problem. Instead of starting with the region proposal module and then running the feature extraction module, as in R-CNN, Fast R-CNN applies the CNN feature extractor first to the entire input image and then proposes regions. In this way, only one ConvNet is run over the entire image instead of 2000 ConvNets over 2000 overlapping regions, and the region proposals are generated using other algorithms, such as Edge Boxes [18].
The ConvNet also performs the classification part; this is done by replacing the traditional SVM classifier [13, 14] with a softmax layer. In this way, a single model performs both tasks: feature extraction and object classification. The Fast R-CNN architecture is shown in Fig. 2.
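To illustrate the shared-feature idea, the sketch below (with made-up feature-map and box sizes) uses torchvision's roi_pool to cut fixed-size features for several RoIs out of a single feature map computed once for the whole image; it is an illustration, not the Fast R-CNN code of [17].

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.rand(1, 256, 50, 50)         # backbone output for one 800 x 800 image
# RoIs in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0, 16.0, 16.0, 320.0, 320.0],
                     [0, 400.0, 100.0, 780.0, 500.0]])
# spatial_scale maps image coordinates to feature-map coordinates (50 / 800 = 1/16)
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)                              # (2, 256, 7, 7): fixed-size features per RoI
```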
Limitations of Fast R-CNN. It still requires candidate regions as input; although the running time of Fast R-CNN is greatly reduced, region proposal computation is exposed as a bottleneck.
2.1.3. Faster R-CNN
Faster R-CNN is the third iteration of the R-CNN family, developed in 2015 by Shaoqing Ren et al. [19]. Similar to Fast R-CNN, the image is provided as input to a convolutional network that produces a convolutional feature map. Instead of using a selective search algorithm on the feature map to identify the region proposals, a region proposal network (RPN) is used to predict the region proposals as part of the training process.

Fig. 2. Fast R-CNN Architecture: the input image passes through a feature extractor; an RoI extractor (selective search) proposes RoIs of different sizes, which the RoI pooling layer converts to a fixed size before the fully-connected layers and the two output layers
The architecture of Faster R-CNN consists of two main networks. The first is the region proposal network (RPN): selective search is replaced by a ConvNet that proposes RoIs from the last feature maps of the feature extractor to be considered for investigation; the RPN has two outputs, the objectness score (object or no object) and the box location. The second network consists of the typical components of Fast R-CNN. The Faster R-CNN architecture is shown in Fig. 3.
Fig. 3. Faster R-CNN Architecture
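For reference, torchvision ships a pretrained Faster R-CNN that bundles the backbone, RPN, and detection head; the short sketch below, using a random placeholder image and the default COCO weights, shows how a single forward pass returns boxes, labels, and scores. This is an off-the-shelf model, not necessarily the configuration evaluated in this paper.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# "DEFAULT" weights require torchvision >= 0.13; older versions use pretrained=True.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)                  # placeholder image tensor in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]               # the RPN and detection head run internally
print(prediction["boxes"].shape, prediction["labels"][:5], prediction["scores"][:5])
```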
2.1.4. Mask R-CNN
Mask R-CNN is an extended version of Faster R-CNN for pixel-level segmentation. Mask R-CNN [20] works by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding-box regression. The branch is a fully convolutional network (FCN) [21] on top of a CNN-based feature map. Once these masks are generated, Mask R-CNN combines them with the classifications and bounding boxes produced by Faster R-CNN. Overall, it generates precise segmentation.
In the second stage of Faster R-CNN, the RoI pooling layer is replaced by RoIAlign, which helps to preserve the spatial information that gets misaligned with RoI pooling. RoIAlign uses bilinear interpolation to create a feature map of fixed size.
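The difference can be seen directly with torchvision's pooling operators; in the sketch below (random feature map, arbitrary box), roi_pool quantizes the box to the feature grid, while roi_align samples the feature map with bilinear interpolation.

```python
import torch
from torchvision.ops import roi_align, roi_pool

feature_map = torch.rand(1, 256, 50, 50)
rois = torch.tensor([[0, 13.0, 27.0, 211.0, 203.0]])   # image-space box, batch index first
aligned = roi_align(feature_map, rois, output_size=(14, 14),
                    spatial_scale=1 / 16, sampling_ratio=2, aligned=True)
pooled = roi_pool(feature_map, rois, output_size=(14, 14), spatial_scale=1 / 16)
# Both outputs are fixed-size (1, 256, 14, 14), but RoIAlign avoids the coordinate
# quantization of RoIPool, so the sampled values differ.
print(aligned.shape, pooled.shape)
```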
The output of the RoIAlign layer is then fed into the mask head, which consists of two convolution layers. It generates a mask for each RoI, thus segmenting the image in a pixel-to-pixel manner. The Mask R-CNN architecture is shown in Fig. 4.
Fig. 4. Mask R-CNN Architecture (Faster R-CNN + mask head)
2.2. Regression/classification-based framework
One-step frameworks based on global regression/classification, which map directly from image pixels to bounding-box coordinates and class probabilities, can reduce the time expenditure [10].
R-CNN object detection systems need to go through two stages to detect objects. YOLO does not need this two-stage pipeline: it only needs to look once at the image to detect all the objects, which is why the name (You Only Look Once) was chosen and why YOLO is a very fast model.
2.2.1. YOLOv1
YOLO [22] uses an innovative strategy, treating object detection as a regression problem: it predicts bounding-box coordinates and class probabilities directly from the image.
YOLO divides the input image into an S x S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object (each grid cell predicts only one object). Each grid cell predicts B bounding boxes and confidence scores for those boxes.
These confidence scores reflect how confident the model is that the box contains an object, Pr(Object), and how accurate the predicted box is, as measured by its intersection over union (IoU) with the ground-truth bounding box.
Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B. The YOLOv1 architecture is shown in Fig. 5.
Fig. 5. YOLOv1 Architecture
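To make the grid encoding concrete, here is a hypothetical sketch of decoding a YOLOv1-style output tensor into boxes and class-specific scores; S = 7, B = 2, C = 20 and the random tensor stand in for a real network output.

```python
import torch

S, B, C = 7, 2, 20
pred = torch.rand(S, S, B * 5 + C)      # per cell: B boxes (x, y, w, h, conf) + C class probs

boxes, scores = [], []
for i in range(S):
    for j in range(S):
        cell = pred[i, j]
        class_probs = cell[B * 5:]                       # Pr(Class_i | Object)
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]
            cx, cy = (j + x) / S, (i + y) / S            # (x, y) are offsets within the cell
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
            # class-specific score for the best class: Pr(Class_i | Object) * confidence
            scores.append(conf * class_probs.max())
print(len(boxes), "candidate boxes before thresholding and non-maximum suppression")
```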
Loss function. YOLO uses the sum-squared error between the predictions and the ground truth to calculate losses. The loss function is composed of the classification loss, the localization loss (errors between the predicted bounding box and the ground truth), and the confidence loss (the objectness of the box).
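A simplified sketch of how these three terms could be combined is given below; it assumes one responsible box per cell and omits the square-root parameterization of width and height used in the original paper, so it is an illustration rather than the exact YOLO loss.

```python
import torch

def yolo_v1_style_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """pred, target: (N, S, S, 5 + C) tensors laid out as [x, y, w, h, conf, classes...];
    obj_mask: (N, S, S) boolean mask of cells responsible for an object."""
    obj = obj_mask.unsqueeze(-1).float()
    noobj = 1.0 - obj
    # Localization loss: only for cells that contain an object.
    loc_err = ((pred[..., :4] - target[..., :4]) ** 2).sum(-1, keepdim=True)
    loc_loss = lambda_coord * (obj * loc_err).sum()
    # Confidence (objectness) loss, down-weighted for empty cells.
    conf_err = (pred[..., 4:5] - target[..., 4:5]) ** 2
    conf_loss = (obj * conf_err).sum() + lambda_noobj * (noobj * conf_err).sum()
    # Classification loss: sum-squared error over the class probabilities.
    cls_err = ((pred[..., 5:] - target[..., 5:]) ** 2).sum(-1, keepdim=True)
    cls_loss = (obj * cls_err).sum()
    return loc_loss + conf_loss + cls_loss

pred = torch.rand(2, 7, 7, 25)
target = torch.rand(2, 7, 7, 25)
obj_mask = torch.rand(2, 7, 7) > 0.5
print(yolo_v1_style_loss(pred, target, obj_mask).item())
```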
2.2.2. YOLOv2
Since YOLOv1 suffers from localization errors and low recall, YOLOv2 [23] introduces many improvements to push the speed vs. accuracy trade-off.
Design improvements in YOLOv2 include batch normalization, classification on high-resolution inputs, convolutional prediction with anchor boxes, multiscale training, and direct location prediction. The YOLOv2 architecture is shown in Fig. 6.
2.2.3. YOLOv3
YOLOv3 [24] is an improved version of YOLOv2. First, YOLOv3 performs multilabel classification [25] (with independent logistic classifiers) to adapt to more complex datasets containing many overlapping labels. Second, YOLOv3 predicts bounding boxes at three different scales, following the idea of a feature pyramid network for object detection [26]; the last convolutional layer predicts a 3-d tensor encoding class predictions, objectness, and bounding boxes. Third, YOLOv3 uses a deeper and more robust feature extractor, called Darknet-53, inspired by ResNet. The YOLOv3 architecture is shown in Fig. 7.
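The difference between a softmax classifier and YOLOv3's independent logistic classifiers can be shown in a few lines; the 80 classes and the 0.5 threshold below are arbitrary choices for illustration.

```python
import torch

logits = torch.randn(80)                      # raw class logits for one predicted box
softmax_probs = torch.softmax(logits, dim=0)  # single-label style: probabilities sum to 1
sigmoid_probs = torch.sigmoid(logits)         # YOLOv3 style: each class scored independently,
                                              # so overlapping labels can both be active
active = (sigmoid_probs > 0.5).nonzero().flatten()
print(round(softmax_probs.sum().item(), 3), active.tolist())
```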
Fig. 6. YOLOv2 Architecture
Fig. 7. YOLOv3 Architecture: a Darknet-53 backbone with residual blocks, producing predictions at three scales for small, medium, and big objects
Due to the advantages of multiscale predictions, YOLOv3 detects small objects better, but shows comparatively worse performance on medium and large objects.
2.2.4. YOLOv3 tiny
Tiny YOLOv3 is a lightweight target detection algorithm for embedded platforms based on YOLOv3. Its running speed is significantly higher, but its detection accuracy is reduced [27-31].
Tiny YOLOv3 reduces YOLOv3's feature extraction network Darknet-53 to 13 convolutional layers and 6 max-pooling layers, and uses pooling layers instead of YOLOv3's stride-2 convolutional layers to achieve dimensionality reduction. Bounding boxes are predicted at two feature-map scales: 13 x 13, and 26 x 26 merged with an upsampled 13 x 13 feature map. The YOLOv3-Tiny architecture is shown in Fig. 8.
Fig. 8. YOLOv3-Tiny Architecture (input 416 x 416)
The YOLO family is a series of end-to-end DL models designed for fast object detection, and it was among the first attempts to build a fast real-time object detector. It is one of the fastest algorithm families available. Although the accuracy of these models is close to, but not as good as, that of the R-CNN family, they are popular for object detection because of their detection speed, often demonstrated on real-time video or camera input.
3. Experiments
3.1. Embedded systems
Since we are dealing with robots that may be small in size, we should look for embedded systems that are compact, consume little power and, most importantly, deliver high computational performance.
NVidia Jetson devices are embedded AI computing platforms that provide high-performance, low-power computing support for deep learning and computer vision. Together with NVIDIA JetPack™ SDK, these Jetson modules open the door to develop and deploy innovative products across all industries [32].
Jetson is used to deploy a wide range of popular DNN models and ML frameworks to the edge with high-performance inferencing, for tasks like real-time classification and object detection, pose estimation, semantic segmentation, and natural language processing (NLP).
Jetson Nano is a small, powerful computer that is able to run multiple neural networks in parallel for applications like image classification, object detection, segmentation, and speech processing. All in an easy-to-use platform that runs in as little as 5 watts [33].
Jetson TX1 is the world's first supercomputer on a module and can provide support for visual computing applications. It is built with the NVidia Maxwell™ architecture and 256 CUDA cores delivering performance of over one teraflop [34].
Jetson TX2 is one of the fastest, most power-efficient embedded AI computing devices. This 7.5-watt supercomputer on a module brings true AI computing to the edge. It is built around an NVidia Pascal™ family GPU with 8 GB of memory and 59.7 GB/s of memory bandwidth, and it includes an assortment of standard hardware interfaces that make it simple to integrate into a wide range of products [35].
Jetson AGX Xavier greatly exceeds the capabilities of previous Jetson modules. In terms of performance and efficiency in deep learning and computer vision, it surpasses the world's most advanced autonomous machines and robots [36]. This powerful AI computing module works under 30 W. It is built around an NVidia Volta™ GPU with Tensor Cores, two NVDLA engines, and an eight-core 64-bit ARM CPU. NVidia Jetson AGX Xavier is the most recent addition to the Jetson platform [37]. This AI GPU computer provides an unparalleled 32 TeraOPS (TOPS) of peak computation in a compact 100-mm x 87-mm module form factor [38]. Xavier's energy-efficient module can be deployed in next-level intelligent machines for end-to-end autonomous capabilities. Table 1 shows a comparison between the Jetson modules.
Since the Jetson Xavier embedded system is the most powerful of the candidates, we will use it to benchmark the different object detection algorithms and choose the most suitable one for our task.
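Before benchmarking, it is useful to confirm which GPU the deep learning framework actually sees on the board; the small sketch below assumes a PyTorch build with CUDA support (as provided by JetPack) and only queries device properties.

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    # Name, memory, and SM count differ between Jetson modules (Nano, TX1, TX2, Xavier).
    print(props.name, props.total_memory // 2**20, "MiB,", props.multi_processor_count, "SMs")
else:
    print("No CUDA device visible; check the JetPack / PyTorch installation")
```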
3.2. Datasets
We compare various object detection methods on two benchmark datasets, including PASCAL VOC 2007 [39] and Microsoft COCO [40].
The evaluated approaches include R-CNN [11], Fast R-CNN [17], Faster R-CNN [19], Mask R-CNN [20], YOLO [22], YOLOv2 [23], YOLOv3 [24] and YOLOv3 tiny [27-31].
PASCAL VOC 2007 dataset consists of 20 categories. Microsoft COCO, on the other hand, is composed of more than 300,000 fully segmented images, in which each image has an average of 7 object instances from a total of 80 categories. As it contains many less iconic objects with a broad range of scales and imposes a stricter requirement on object localization, this dataset is more challenging than PASCAL VOC 2007.

Table 1. Basic comparison between Jetson modules
Module | GPU | CPU | Memory | AI performance | Power
Jetson Nano | 128-core Maxwell | Quad-core ARM A57 @ 1.43 GHz | 4 GB 64-bit LPDDR4, 25.6 GB/s | 472 GFLOPs | 5 W / 10 W
Jetson TX1 | 256-core NVIDIA Maxwell™ GPU | Quad-core ARM® Cortex®-A57 MPCore | 4 GB 64-bit LPDDR4 | 1 TFLOPs | under 10 W
Jetson TX2 | 256-core NVIDIA Pascal™ GPU | Dual-core NVIDIA Denver 2 64-bit CPU + quad-core ARM® Cortex®-A57 MPCore | 8 GB 128-bit LPDDR4, 1866 MHz, 59.7 GB/s | 1.33 TFLOPs | 7.5 W / 15 W
Jetson Xavier | 512-core Volta GPU with Tensor Cores | 8-core ARM v8.2 64-bit CPU, 8 MB L2 + 4 MB L3 | 32 GB 256-bit LPDDR4x, 137 GB/s | 16 TFLOPs | 10 W / 15 W / 30 W

Table 2 compares the PASCAL VOC 2007 and Microsoft COCO datasets.
Table 2. Dataset statistics
Dataset | Microsoft COCO | Pascal VOC 2007
Number of categories | 80 | 20
Number of train-val images | 246 690 | 5011
Number of test images | 81 434 | 4952
Number of annotated objects | 2 500 000 | 24 640
Total objects / total number of images | 7.6 | 2.4
Object detection performance is evaluated by the average precision (AP). For PASCAL VOC 2007, the evaluation terms are the average precision (AP) in each single category and the mean average precision (mAP) across all 20 categories. For the COCO dataset, the AP metric used for evaluation is averaged over multiple Intersection over Union (IoU) values; specifically, 10 IoU thresholds of [.50:.05:.95] are used.
AP is also averaged over all categories. Traditionally, this is called "mean average precision" (mAP), but we will refer to it as AP to distinguish between COCO and VOC 07 evaluation metrics. So, for COCO we will use AP (averaged across all 10 IoU thresholds and all 80 categories). Averaging over IoUs rewards detectors with better localization.
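The following illustrative sketch (not the official COCO evaluator) shows the IoU of two boxes and the final averaging over the 10 IoU thresholds; the per-threshold AP values are placeholders.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(round(iou([0, 0, 10, 10], [5, 5, 15, 15]), 3))   # ~0.143

thresholds = np.arange(0.50, 1.00, 0.05)               # the 10 thresholds [.50:.05:.95]
ap_per_threshold = np.linspace(0.6, 0.2, num=len(thresholds))  # placeholder AP values
coco_style_ap = float(ap_per_threshold.mean())         # AP averaged over IoU thresholds
print(len(thresholds), round(coco_style_ap, 3))
```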
4. Results
4.1. Accuracy comparison
It is important to note that technology is constantly evolving, so any comparison can become obsolete quickly.
These experiments were performed in different environments [10, 11, 17, 19, 20, 22-24, 29, 41, 42], so the results may change slightly if an attempt is made to reproduce them. The purpose of this article, however, is to give a general idea of these methods.
Tables 3-5 show the results of the considered algorithms on different datasets; the backbone used to train each algorithm is given in Tables 3 and 4, where it can be used to reproduce our results.
Table 3. Comparative results on the Microsoft COCO test-dev set (%)
Algorithm | Backbone | AP (%)
R-CNN | — | —
Fast R-CNN | Vgg16 | 19.7
Faster R-CNN | Vgg16 | 21.7
Mask R-CNN | ResNet-101-FPN | 39.8
YOLO | — | —
YOLOv2 | Darknet-19 | 21.6
YOLOv3 | Darknet-53 | 33.0
YOLOv3 tiny | Reduced Darknet-53 | 15.3
Table 4. Comparative results on the VOC 2007 test set (%)
Algorithm | Backbone | mAP (%)
R-CNN | Vgg16 | 66.0
Fast R-CNN | Vgg16 | 70.0
Faster R-CNN | Vgg16 | 76.4
Mask R-CNN | — | —
YOLO | Googlenet | 63.4
YOLOv2 | Darknet-19 | 78.6
YOLOv3 | Darknet-53 | 87.4
YOLOv3 tiny | Reduced Darknet-53 | 61.3
Overall, region proposal-based methods, such as Faster R-CNN and Mask R-CNN, perform better than regression/classification-based approaches like YOLO, because regression/classification-based approaches produce quite a lot of localization errors.
As YOLOv1 is not skilled at producing object localizations with high IoU, it obtains a poor result on VOC 2007. However, with the aid of other strategies, such as anchor boxes, batch normalization (BN), and fine-grained features, the localization errors are corrected in YOLOv2.
Figures 9 and 10 show the performance of the algorithms on PASCAL VOC 2007 and Microsoft COCO dataset, respectively.
We can see that the results on the COCO dataset are much worse than those on VOC 2007, which is due to the large number of nonstandard small objects.
Table 5. Accuracy of different object detection algorithms on the Microsoft COCO and PASCAL VOC 2007 datasets
Algorithm | Microsoft COCO AP (%) | Pascal VOC 2007 mAP (%)
R-CNN | — | 66.0
Fast R-CNN | 19.7 | 70.0
Faster R-CNN | 21.7 | 76.4
Mask R-CNN | 39.8 | —
YOLO | — | 63.4
YOLOv2 | 21.6 | 78.6
YOLOv3 | 33.0 | 87.4
YOLOv3 tiny | 15.3 | 61.3
Fig. 9. Accuracy performance of different object detection algorithms on the PASCAL VOC 2007 dataset
Fig. 10. Accuracy performance of different object detection algorithms on the Microsoft COCO dataset
4.2. Speed comparison
The processing-speed experiment was performed for each of the models on three devices: the Jetson TX1 [34], the Jetson TX2 [35], and the Jetson Xavier [37], which was released for edge computing.
The performance results, in frames per second (FPS), are shown in Table 6 [43, 44]. This table provides a quantitative comparison between different types of on-board embedded GPU systems, and it can be used to choose the best algorithm and system for a specific operation.

Table 6. Performance comparison between Jetson modules used for target detection
Algorithm | TX1 (FPS) | TX2 (FPS) | Xavier (FPS)
Fast R-CNN | — | — | —
Faster R-CNN | — | 1 | 1.3
Mask R-CNN | — | — | —
YOLOv2 | 3 | 10 | 28
YOLOv3 | — | 4 | 17
YOLOv3 tiny | 9 | 11 | 31
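For reproducibility, FPS numbers of this kind are usually obtained with a simple timing loop; the sketch below (placeholder detector, input size, and iteration counts, with warm-up and CUDA synchronization) shows one way such measurements could be taken, and is not the exact protocol of [43, 44].

```python
import time
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder detector; any of the compared models could be benchmarked the same way.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()
image = [torch.rand(3, 416, 416, device=device)]

with torch.no_grad():
    for _ in range(5):                       # warm-up (allocations, cuDNN autotuning)
        model(image)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    n = 20
    for _ in range(n):
        model(image)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{n / elapsed:.1f} FPS")
```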
Regression-based models can usually be processed in real time at the cost of a drop in accuracy compared with region proposal-based models.
Higher-resolution images for the same model give better mAP, but are slower to process. The performance and efficiency of the Jetson AGX Xavier make it possible to process all of the components needed for robots to function safely with full autonomy in real time, including high-performance vision algorithms for real-time perception, navigation, and manipulation.
Figure 11 shows the performance in terms of speed and accuracy for some object detection algorithms that are applicable on NVIDIA Jetson Xavier.
Fig. 11. Object detection algorithms speed vs. accuracy trade-off (COCO AP vs. frames per second on the Jetson Xavier) for Faster R-CNN, YOLOv2, YOLOv3, and YOLOv3 tiny; the real-time region lies beyond 30 FPS
5. Conclusion
It is difficult to make a completely fair comparison of different object detectors; each real-life case may call for a different decision concerning the trade-off between accuracy and speed. It is also necessary to consider other factors that affect performance: the type of feature extractor, the output stride of the extractor, the input image resolution, the matching strategy and IoU threshold (which determine how predictions are excluded when calculating the loss), the non-maximum suppression IoU threshold, the ratio of positive to negative anchors, the number of proposals, training data augmentation, and the use of multiscale training or testing.
Since the main focus of this paper was on object detection for robot navigation, we first needed to figure out which algorithm provides faster and more accurate results.
Even though YOLOv3 tiny is the only algorithm that can run on the Jetson Xavier in real time (more than 30 FPS), our choice for a robot application that demands real-time processing is YOLOv2.
The YOLOv2 selection was based on its speed, which is very close to real time, and its accuracy (21.6% COCO AP), which is about 40% better than that of YOLOv3 tiny (15.3% COCO AP).
In this paper, we have reviewed some object detection algorithms, including region proposal approaches such as the R-CNN family and regression/classification approaches such as the YOLO versions.
We have also discussed the properties of some on-board embedded GPU systems that can be used to perform deep learning processing and the differences between them.
In addition, we performed an experimental comparison of the performance of each algorithm in terms of accuracy, based on the Microsoft COCO and PASCAL VOC 2007 datasets, and in terms of processing speed on on-board embedded GPU systems.
This paper can be used to determine an appropriate object detection algorithm for a specific task, relying on the speed/accuracy trade-off.
Conflict of interest
The authors declare that they have no conflict of interest.
References
[1] Yoon, Y., Gruber, S., Krakow, L., and Pack, D., Autonomous Target Detection and Localization Using Cooperative Unmanned Aerial Vehicles, in Optimization and Cooperative Control Strategies, M.J.Hirsch, C. W. Commander, P. M. Pardalos, R. Murphey (Eds.), Lect. Notes Control Inf. Sci., vol. 381, Berlin: Springer, 2009, pp. 195-205.
[2] Gietelink, O., Ploeg, J., De Schutter, B., and Verhaegen, M., Development of Advanced Driver Assistance Systems with Vehicle Hardware-in-the-Loop Simulations, Veh. Syst. Dyn., 2006, vol. 44, no. 7, pp. 569-590.
[3] Gerónimo, D., López, A.M., Sappa, A.D., and Graf, T., Survey of Pedestrian Detection for Advanced Driver Assistance Systems, IEEE Trans. Pattern Anal. Mach. Intell., 2010, vol. 32, no. 7, pp. 1239-1258.
[4] Ferguson, D., Darms, M., Urmson, C., and Kolski, S., Detection, Prediction, and Avoidance of Dynamic Obstacles in Urban Environments, in IEEE Intelligent Vehicles Symposium (Eindhoven, Netherlands, Jun 2008), pp. 1149-1154.
[5] Hirz, M. and Walzel, B., Sensor and Object Recognition Technologies for Self-Driving Cars, Comput. Aided Des. Appl, 2018, vol. 15, no. 4, pp. 501-508.
[6] Hinas, A., Roberts, J., and Gonzalez, F., Vision-Based Target Finding and Inspection of a Ground Target Using a Multirotor UAV System, Sensors, 2017, vol. 17, no. 12, Art. 2929, 17 pp.
[7] Pathak, A.R., Pandey, M., and Rautaray, S., Application of Deep Learning for Object Detection, Procedia Comput. Sci., 2018, vol. 132, pp. 1706-1717.
[8] Tijtgat, N., Van Ranst, W., Volckaert, B., Goedeme, T., and De Turck, F., Embedded Real-Time Object Detection for a UAV Warning System, in Proc. of the IEEE Internat. Conf. on Computer Vision Workshops (ICCVW'2017), pp. 2110-2118.
[9] Han, S., Shen, W., and Liu, Z., Deep Drone: Object Detection and Tracking for Smart Drones on Embedded System, Stanford, Calif.: Stanford Univ., 2016.
[10] Zhao, Zh.-Q., Zheng, P., Xu, Sh., and Wu, X., Object Detection with Deep Learning: A Review, IEEE Trans. Neural Netw. Learn. Syst, 2019, vol. 30, no. 11, pp. 3212-3232.
[11] Girshick, R., Donahue, J., Darrell, T., and Malik, J., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR, Columbus, Ohio, Jun 2014), pp. 580-587.
[12] Uijlings, J.R. R., van de Sande, K.E.A., Gevers, T., and Smeulders, A. W. M., Selective Search for Object Recognition, Int. J. Comput. Vis., 2013, vol. 104, no. 2, pp. 154-171.
[13] Drucker, H., Burges, Ch. J.C., Kaufman, L., Smola, A., and Vapnik, V., Support Vector Regression Machines, in NIPS'1996: Advances in Neural Information Processing Systems: Vol. 9, M. C. Mozer, M.Jordan, T. Petsche (Eds.), Cambridge, Mass.: MIT Press, 1996, pp. 155-161.
[14] Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., and Scholkopf, B., Support Vector Machines, IEEE Intell. Syst., 1998, vol. 13, no. 4, pp. 18-28.
[15] Neubeck, A. and Van Gool, L., Efficient Non-Maximum Suppression, in Proc. of the 18th Internat. Conf. on Pattern Recognition (ICPR, Hong Kong, Aug 2006), pp. 850-855.
[16] Hosang, J., Benenson, R., and Schiele, B., Learning Non-Maximum Suppression, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR, Honolulu, Hawaii, Jul 2017), pp. 6469-6477.
[17] Girshick, R., Fast R-CNN, in Proc. of the IEEE Internat. Conf. on Computer Vision (ICCV, Santiago, Chile, Dec 2015), pp. 1440-1448.
[18] Zitnick, C.L. and Dollar, P., Edge Boxes: Locating Object Proposals from Edges, in Computer Vision: ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Lecture Notes in Comput. Sci., vol. 8693, Cham: Springer, 2014, pp. 391-405.
[19] Ren, Sh., He, K., Girshick, R., and Sun, J., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Trans. Pattern Anal. Mach. Intell., 2015, vol. 39, no. 6, pp.1137-1149.
[20] He, K., Gkioxari, G., Dollar, P., and Girshick, R., Mask R-CNN, in Proc. of the IEEE Internat. Conf. on Computer Vision (ICCV, Venice, Italy, Oct 2017), pp. 2980-2988.
[21] Long, J., Shelhamer, E., and Darrell, T., Fully Convolutional Networks for Semantic Segmentation, in Proc. of the IEEE Internat. Conf. on Computer Vision and Pattern Recognition (CVPR, Boston, Mass., 2015), pp. 3431-3440.
[22] Redmon, J., Divvala, S., Girshick, R., and Farhadi, A., You Only Look Once: Unified, Real-Time Object Detection, in Proc. of the IEEE Internat. Conf. on Computer Vision and Pattern Recognition (CVPR, Las Vegas, Nev., 2016), pp. 779-788.
[23] Redmon, J. and Farhadi, A., Yolo9000: Better, Faster, Stronger, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR, Honolulu, Hawaii, Jul 2017), pp. 6517-6525.
[24] Redmon, J. and Farhadi, A., YOLOv3: An Incremental Improvement, arXiv:1804.02767 (2018).
[25] Tsoumakas, G. and Katakis, I., Multi-Label Classification: An Overview, Int. J. Data Warehous. Min., 2007, vol. 3, no. 3, pp. 1-13.
[26] Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S., Feature Pyramid Networks for Object Detection, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR, Honolulu, Hawaii, Jul 2017), pp. 2117-2125.
[27] Ding, S., Long, F., Fan, H., Liu, L., and Wang, Y., A Novel YOLOv3-tiny Network for Unmanned Airship Obstacle Detection, in Proc. of the IEEE 8th Data Driven Control and Learning Systems Conference (DDCLS, Dali, China, May 2019), pp. 277-281.
[28] Mao, Q.-C., Sun, H.-M., Liu, Y.-B., and Jia, R.-S., Mini-YOLOv3: Real-Time Object Detector for Embedded Applications, IEEE Access, 2019, vol. 7, pp. 133529-133538.
[29] Fang, W., Wang, L., and Ren, P., Tinier-YOLO: A Real-Time Object Detection Method for Constrained Environments, IEEE Access, 2020, vol. 8, pp. 1935-1944.
[30] Adarsh, P., Rathi, P., and Kumar, M., YOLO v3-Tiny: Object Detection and Recognition Using One Stage Improved Model, in Proc. of the 6th Internat. Conf. on Advanced Computing and Communication Systems (ICACCS, Coimbatore, India, Mar 2020), pp. 687-694.
[31] Xiao, D., Shan, F., Li, Z., Le, B. T., Liu, X., and Li, X., A Target Detection Model Based on Improved Tiny-YOLOv3 under the Environment of Mining Truck, IEEE Access, 2019, vol. 7, pp. 123757-123764.
[32] Meet Jetson, the Platform for AI at the Edge, https://developer.nvidia.com/embedded-computing (2021).
[33] Jetson Nano Developer Kit, https://developer.nvidia.com/embedded/jetson-nano-developer-kit (2021).
[34] Jetson TX1 Module, https://developer.nvidia.com/embedded/buy/jetson-tx1 (2021).
[35] Jetson TX2 Module, https://developer.nvidia.com/embedded/jetson-tx2 (2021).
[36] NVidia Jetson AGX Xavier: The AI Platform for Autonomous Machines, https://www.nvidia.com/en-us/autonomous-machines/jetson-agx-xavier/ (2021).
[37] Jetson AGX Xavier Developer Kit, https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit (2021).
[38] NVidia Jetson AGX Xavier Delivers 32 TeraOps for New Era of AI in Robotics, https://devblogs.nvidia.com/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/ (2021).
[39] Everingham, M., Van Gool, L., Williams, Ch. K.I., Winn, J., and Zisserman, A., The PASCAL Visual Object Classes (VOC) Challenge, Int. J. Comput. Vis., 2010, vol. 88, no. 2, pp. 303-338.
[40] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollar, P., Microsoft COCO: Common Objects in Context, in Computer Vision: ECCV 2014, Lecture Notes in Comput. Sci., vol. 8693, Cham: Springer, 2014, pp. 740-755.
[41] Shen, Z., Liu, Z., Li, J., Jiang, Y.-G., Chen, Y., and Xue, X., Object Detection from Scratch with Deep Supervision, IEEE Trans. Pattern Anal. Mach. Intell., 2019, vol. 42, no. 2, pp. 398-412.
[42] Zhang, F., Luan, J., Xu, Zh., and Chen, W., DetReco: Object-Text Detection and Recognition Based on Deep Neural Network, Math. Probl. Eng., 2020, vol. 2020, Art. 2365076, 15 pp.
[43] Hossain, S., and Lee, D. J., Deep Learning-Based Real-Time Multiple-Object Detection and Tracking from Aerial Imagery via a Flying Robot with GPU-Based Embedded Devices, Sensors, 2019, vol. 19, no. 15, Art. 3371, 24 pp.
[44] Murthy, C.B., Hashmi, M.F., Bokde, N.D., and Geem,Z.W., Investigations of Object Detection in Images/Videos Using Various Deep Learning Techniques and Embedded Platforms: A Comprehensive Review, Appl. Sci., 2020, vol. 10, no. 9, Art. 3280, 46 pp.