Introduction
- To address the highly imbalanced distribution of foreground objects and complex background information in remote sensing images, we design a trans-attention module that models long-range pixel interactions along both the channel and spatial dimensions to capture attention features. The module is plug-and-play, so object detection methods developed for natural images can be adapted to remote sensing images with little effort.
- We take YOLOv8 as the baseline model and redesign it by embedding the trans-attention module, adding a small object detection head, and optimizing the same-level feature fusion method. The result is TA-YOLO, a lightweight small object detection model for remote sensing images.
- TA-YOLO performs well on the remote sensing image dataset VisDrone [25], which contains a large number of dense small objects, as shown in Fig. 1. It achieves higher accuracy with fewer parameters than YOLOv8, which also demonstrates that the proposed trans-attention module is effective at transferring natural image object detection methods to remote sensing images.
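To make the idea of attention along the channel and spatial dimensions concrete, the following is a minimal NumPy sketch, not the authors' module. It assumes plain dot-product self-attention with no learned query, key, or value projections, and the function names (`spatial_self_attention`, `channel_self_attention`, `toy_trans_attention`) and the residual sum that combines the two branches are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def spatial_self_attention(feat):
    """Every pixel attends to every other pixel.

    feat: array of shape (C, H, W). Tokens are the H*W pixel positions.
    """
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T            # (HW, C)
    scores = tokens @ tokens.T / np.sqrt(c)      # (HW, HW) pixel-to-pixel affinity
    out = softmax(scores, axis=-1) @ tokens      # (HW, C)
    return out.T.reshape(c, h, w)


def channel_self_attention(feat):
    """Every channel attends to every other channel.

    feat: array of shape (C, H, W). Tokens are the C channel maps.
    """
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w)              # (C, HW)
    scores = tokens @ tokens.T / np.sqrt(h * w)  # (C, C) channel-to-channel affinity
    out = softmax(scores, axis=-1) @ tokens      # (C, HW)
    return out.reshape(c, h, w)


def toy_trans_attention(feat):
    """Hypothetical combination: residual sum of both attention branches."""
    return feat + channel_self_attention(feat) + spatial_self_attention(feat)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
y = toy_trans_attention(x)
print(y.shape)  # output keeps the input shape
```

Because the output has the same shape as the input, a block of this kind can in principle be inserted between existing layers without changing the rest of the network, which is the sense in which such a module is plug-and-play.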
Related work
Object detection based on CNN
Small object detection
Attention mechanism
Methodology
Revisiting transformer
TA-YOLO
Improvement of multi-level feature fusion
MCSTA module
MCTA module
MSTA module
Experiments
Datasets
Datasets | 10–50 Pixels | 50–300 Pixels | >300 Pixels |
---|---|---|---|
VisDrone | 0.74 | 0.26 | 0 |
PASCAL VOC | 0.14 | 0.61 | 0.25 |
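A distribution like the one in the table above could be computed from bounding-box annotations roughly as follows. This is a sketch under stated assumptions: object "size" is taken to be the square root of the box area, boxes smaller than 10 pixels are skipped, and the handling of the bin boundaries is a guess. The helper names `size_bin` and `size_distribution` are hypothetical, and the example boxes are made up rather than drawn from either dataset.

```python
from math import sqrt

# Bin names follow the column headers of the table above (in pixels).
BINS = ("10-50", "50-300", ">300")


def size_bin(width, height):
    """Return the size bin of a box, or None if it is smaller than 10 px.

    Assumption: "size" is the square root of the box area.
    """
    size = sqrt(width * height)
    if size < 10:
        return None
    if size < 50:
        return "10-50"
    if size <= 300:
        return "50-300"
    return ">300"


def size_distribution(boxes):
    """Fraction of boxes per bin for an iterable of (width, height) pairs."""
    counts = {name: 0 for name in BINS}
    for width, height in boxes:
        name = size_bin(width, height)
        if name is not None:
            counts[name] += 1
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in BINS}
    return {name: counts[name] / total for name in BINS}


# Made-up boxes for illustration only, not taken from either dataset.
example_boxes = [(12, 20), (30, 30), (40, 45), (120, 80)]
print(size_distribution(example_boxes))
```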
Implementation details
Experimental results
Methods | P | R | \(mAP^{50}\) | \(mAP^{50:95}\) | Params (M) | GFLOPs |
---|---|---|---|---|---|---|
YOLOv8-n [22] | 0.810 | 0.733 | 0.806 | 0.599 | 3.2 | 8.9 |
YOLOv8-s [22] | 0.818 | 0.756 | 0.833 | 0.634 | 11.2 | 28.8 |
YOLOv8-m [22] | 0.818 | 0.790 | 0.854 | 0.670 | 25.9 | 79.3 |
TA-YOLO-tiny | 0.782 | 0.743 | 0.807 (\(\uparrow \)0.1%) | 0.608 | 2.3 (\(\downarrow \)0.9) | 7.9 (\(\downarrow \)1.0) |
TA-YOLO-n | 0.817 | 0.742 | 0.820 (\(\uparrow \)1.3%) | 0.624 | 3.8 (\(\uparrow \)0.6) | 14.1 (\(\uparrow \)5.2) |
TA-YOLO-s | 0.836 | 0.774 | 0.845 (\(\uparrow \)1.2%) | 0.656 | 13.9 (\(\uparrow \)2.7) | 43.3 (\(\uparrow \)14.5) |
TA-YOLO-o | 0.827 | 0.787 | 0.854 (\(\uparrow \)0%) | 0.663 | 21.4 (\(\downarrow \)4.5) | 64.6 (\(\downarrow \)14.7) |
TA-YOLO-m | 0.838 | 0.795 | 0.862 (\(\uparrow \)0.8%) | 0.680 | 29.7 (\(\uparrow \)3.8) | 110.2 (\(\uparrow \)30.9) |
Methods | P | R | \(mAP^{50}\) | \(mAP^{50:95}\) | Params (M) | GFLOPs |
---|---|---|---|---|---|---|
YOLOv8-n [22] | 0.450 | 0.338 | 0.339 | 0.196 | 3.2 | 8.9 |
YOLOv8-s [22] | 0.528 | 0.386 | 0.406 | 0.242 | 11.2 | 28.8 |
YOLOv8-m [22] | 0.556 | 0.416 | 0.433 | 0.265 | 25.9 | 79.3 |
TA-YOLO-tiny | 0.485 | 0.349 | 0.365 (\(\uparrow \)2.6%) | 0.218 | 2.3 (\(\downarrow \)0.9) | 7.9 (\(\downarrow \)1.0) |
TA-YOLO-n | 0.502 | 0.389 | 0.401 (\(\uparrow \)6.2%) | 0.241 | 3.8 (\(\uparrow \)0.6) | 14.1 (\(\uparrow \)5.2) |
TA-YOLO-s | 0.539 | 0.443 | 0.454 (\(\uparrow \)4.8%) | 0.277 | 13.9 (\(\uparrow \)2.7) | 43.3 (\(\uparrow \)14.5) |
TA-YOLO-o | 0.558 | 0.450 | 0.465 (\(\uparrow \)3.2%) | 0.286 | 21.4 (\(\downarrow \)4.5) | 64.6 (\(\downarrow \)14.7) |
TA-YOLO-m | 0.583 | 0.466 | 0.488 (\(\uparrow \)5.5%) | 0.302 | 29.7 (\(\uparrow \)3.8) | 110.2 (\(\uparrow \)30.9) |
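The arrows in the table above can be reproduced with simple arithmetic. The short check below copies the \(mAP^{50}\), Params, and GFLOPs values from the table and shows that the reported "%" gains are absolute differences in percentage points rather than relative improvements. The mapping from each TA-YOLO variant to its reference baseline is not stated in the table; it is inferred here from the reported deltas, and the names `reference` and `deltas` are hypothetical.

```python
# (mAP50, Params in M, GFLOPs) copied from the table above.
baselines = {
    "YOLOv8-n": (0.339, 3.2, 8.9),
    "YOLOv8-s": (0.406, 11.2, 28.8),
    "YOLOv8-m": (0.433, 25.9, 79.3),
}
ours = {
    "TA-YOLO-tiny": (0.365, 2.3, 7.9),
    "TA-YOLO-n": (0.401, 3.8, 14.1),
    "TA-YOLO-s": (0.454, 13.9, 43.3),
    "TA-YOLO-o": (0.465, 21.4, 64.6),
    "TA-YOLO-m": (0.488, 29.7, 110.2),
}
# Assumed pairing of each variant with its baseline (inferred, not stated).
reference = {
    "TA-YOLO-tiny": "YOLOv8-n",
    "TA-YOLO-n": "YOLOv8-n",
    "TA-YOLO-s": "YOLOv8-s",
    "TA-YOLO-o": "YOLOv8-m",
    "TA-YOLO-m": "YOLOv8-m",
}


def deltas(model):
    """Return (mAP50 gain in percentage points, Params delta, GFLOPs delta)."""
    map50, params, gflops = ours[model]
    base_map50, base_params, base_gflops = baselines[reference[model]]
    return (
        round((map50 - base_map50) * 100, 1),
        round(params - base_params, 1),
        round(gflops - base_gflops, 1),
    )


for name in ours:
    print(name, deltas(name))
```

Read this way, TA-YOLO-tiny and TA-YOLO-o are the two variants that improve \(mAP^{50}\) while also reducing both parameters and GFLOPs relative to their baselines.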
Methods | P | R | \(mAP^{50}\) | \(mAP^{50:95}\) | Params (M) | GFLOPs |
---|---|---|---|---|---|---|
YOLOv8-n [22] | 0.405 | 0.301 | 0.279 | 0.158 | 3.2 | 8.9 |
YOLOv8-s [22] | 0.456 | 0.347 | 0.329 | 0.190 | 11.2 | 28.8 |
YOLOv8-m [22] | 0.487 | 0.369 | 0.353 | 0.207 | 25.9 | 79.3 |
TA-YOLO-tiny | 0.419 | 0.307 | 0.292 (\(\uparrow \)1.3%) | 0.165 | 2.3 (\(\downarrow \)0.9) | 7.9 (\(\downarrow \)1.0) |
TA-YOLO-n | 0.427 | 0.341 | 0.316 (\(\uparrow \)3.7%) | 0.180 | 3.8 (\(\uparrow \)0.6) | 14.1 (\(\uparrow \)5.2) |
TA-YOLO-s | 0.488 | 0.379 | 0.371 (\(\uparrow \)4.2%) | 0.214 | 13.9 (\(\uparrow \)2.7) | 43.3 (\(\uparrow \)14.5) |
TA-YOLO-o | 0.501 | 0.388 | 0.377 (\(\uparrow \)2.4%) | 0.219 | 21.4 (\(\downarrow \)4.5) | 64.6 (\(\downarrow \)14.7) |
TA-YOLO-m | 0.521 | 0.399 | 0.396 (\(\uparrow \)4.3%) | 0.231 | 29.7 (\(\uparrow \)3.8) | 110.2 (\(\uparrow \)30.9) |
Methods | \(mAP_{50}^{val}\) | \(mAP_{50:95}^{val}\) | \(mAP_{50}^{test}\) | \(mAP_{50:95}^{test}\) | Params (M) | GFLOPs |
---|---|---|---|---|---|---|
YOLOv5-n [19] | 0.337 | 0.194 | 0.278 | 0.156 | 2.5 | 7.2 |
YOLOv5-s [19] | 0.401 | 0.239 | 0.328 | 0.189 | 9.1 | 24.1 |
YOLOv5-m [19] | 0.430 | 0.263 | 0.352 | 0.205 | 25.1 | 64.4 |
YOLOv6-n [20] | 0.325 | 0.188 | 0.275 | 0.158 | 4.7 | 11.1 |
YOLOv6-s [20] | 0.372 | 0.220 | 0.313 | 0.180 | 18.5 | 44.2 |
YOLOv6-m [20] | 0.417 | 0.251 | 0.356 | 0.211 | 34.9 | 82.2 |
YOLOv7-tiny [21] | 0.307 | 0.182 | 0.268 | 0.151 | 6.0 | 13.7 |
YOLOv8-n [22] | 0.339 | 0.196 | 0.279 | 0.158 | 3.2 | 8.9 |
YOLOv8-s [22] | 0.406 | 0.242 | 0.329 | 0.190 | 11.2 | 28.8 |
YOLOv8-m [22] | 0.433 | 0.265 | 0.353 | 0.207 | 25.9 | 79.3 |
TA-YOLO-tiny | 0.365 | 0.218 | 0.292 | 0.165 | 2.3 | 7.9 |
TA-YOLO-n | 0.401 | 0.241 | 0.316 | 0.180 | 3.8 | 14.1 |
TA-YOLO-s | 0.454 | 0.277 | 0.371 | 0.214 | 13.9 | 43.3 |
TA-YOLO-o | 0.465 | 0.286 | 0.377 | 0.219 | 21.4 | 64.6 |
TA-YOLO-m | 0.488 | 0.302 | 0.396 | 0.231 | 29.7 | 110.2 |
Baseline | P2 Head | Fusion | MCTA | MSTA | MCSTA | P | R | \(mAP^{50}\) | \(mAP^{50:95}\) | Params (M) |
---|---|---|---|---|---|---|---|---|---|---|
\(\checkmark \) | | | | | | 0.450 | 0.338 | 0.339 | 0.196 | 3.2 |
\(\checkmark \) | \(\checkmark \) | | | | | 0.468 | 0.368 | 0.375 (\(\uparrow \)3.6%) | 0.224 | 3.4 |
\(\checkmark \) | \(\checkmark \) | \(\checkmark \) | | | | 0.499 | 0.366 | 0.383 (\(\uparrow \)4.4%) | 0.228 | 3.4 |
\(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | | | 0.494 | 0.381 | 0.389 (\(\uparrow \)5.0%) | 0.233 | 3.5 |
\(\checkmark \) | \(\checkmark \) | \(\checkmark \) | | \(\checkmark \) | | 0.504 | 0.373 | 0.391 (\(\uparrow \)5.2%) | 0.234 | 3.5 |
\(\checkmark \) | \(\checkmark \) | \(\checkmark \) | | | \(\checkmark \) | 0.502 | 0.389 | 0.401 (\(\uparrow \)6.2%) | 0.241 | 3.8 |
Ablation experiments
Backbone | P2 | P3 | P4 | P5 | \(mAP^{50}\) | \(mAP^{50:95}\) | Params (M) |
---|---|---|---|---|---|---|---|
\(\checkmark \) | | | | | 0.389 | 0.232 | 3.3 |
\(\checkmark \) | \(\checkmark \) | | | | 0.394 | 0.235 | 3.4 |
\(\checkmark \) | \(\checkmark \) | \(\checkmark \) | | | 0.395 | 0.237 | 3.4 |
\(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | | 0.398 | 0.238 | 3.5 |
\(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | 0.401 | 0.241 | 3.8 |