Published Papers

Award Papers

Environmental awareness in machines: a case study of automated debris removal using Generative Artificial Intelligence and Vision Language Models

Jolly P C Chan, Heiton M H Ho, T K Wong, Lawrence Y L Ho, Jackie Cheung and Samson Tai
Pages: 1-10. Published: 10 Dec 2024
DOI: 10.33430/V31N4THIE-2024-0052

Chan PC, Ho MH, Wong TK, Ho YL, Cheung J and Tai S, Environmental awareness in machines: a case study of automated debris removal using Generative Artificial Intelligence and Vision Language Models, HKIE Transactions, Vol. 31, No. 4 (Award Issue), Article THIE-2024-0052, 2024, 10.33430/V31N4THIE-2024-0052


Abstract:

Water channels play a crucial role in stormwater management, but the build-up of debris in their grilles can lead to flooding, endangering people, animals, property, and critical infrastructure nearby. While automated mechanical grab systems are necessary for efficient debris removal, their deployment in outdoor environments has been non-existent due to safety concerns. Here we report the successful use of Generative Artificial Intelligence (GenAI) and a Vision Language Model (VLM) to endow an automated mechanical grab with “awareness”, which allows it to differentiate between non-living and living objects and to decide whether to initiate or abort grabbing actions. Existing approaches such as YOLOv7 achieve only 86.94% sensitivity (95% CI: 83.44% to 89.93%) in detecting humans and specified animals, and they systematically miss crouching workers and animals facing away from the cameras. Grounding DINO (VLM) achieves a sensitivity of 100% (95% CI: 99.17% to 100.00%) and a specificity of 85.37% (95% CI: 77.86% to 91.09%). Together with BLIP-2 (GenAI), the grab acquires “awareness”, allowing it to detect animals beyond those specified. This opens up possibilities for the application of GenAI/VLM in automation sectors where humans and machines mingle, such as manufacturing, logistics, and construction, potentially improving safety and efficiency in these domains.
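The grab/abort gating described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `detections` stands in for Grounding DINO's open-set detection output, and `caption` for a BLIP-2 scene description; the keyword list and threshold are hypothetical choices for the sketch.

```python
# Hypothetical sketch of the living-object safety gate: abort the grab
# if an open-set detector (e.g. Grounding DINO) finds a living object,
# or if a GenAI caption (e.g. from BLIP-2) names one the detector's
# prompt list did not cover.

LIVING_KEYWORDS = {"person", "human", "worker", "dog", "cat", "bird", "animal"}

def should_abort(detections, caption, score_threshold=0.35):
    """Return True if the grab action should be aborted.

    detections: list of (label, confidence) pairs from the detector.
    caption: free-text scene description from the captioning model.
    """
    # Detector path: any confident living-object detection aborts the grab.
    for label, score in detections:
        if score >= score_threshold and label.lower() in LIVING_KEYWORDS:
            return True
    # GenAI fallback: the caption may mention animals beyond the fixed list
    # of detector prompts (here reduced to a keyword scan for illustration).
    text = caption.lower()
    return any(word in text for word in LIVING_KEYWORDS)
```

A real deployment would replace the keyword scan with a query to the captioning model itself (e.g. asking whether any living creature is present), but the two-stage detect-then-describe structure is the point of the sketch.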

Keywords:

Artificial intelligence; machine learning model; object detection; computer vision; flood management; debris clearance

Reference List:

  1. Atkinson E (2023). Man crushed to death by robot in South Korea. BBC News. Available at: .
  2. Deng J, Dong W, Socher R, Li LJ, Li K and Li FF (2009). ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE.
  3. Dhamija A, Gunther M, Ventura J and Boult T (2020). The overlooked elephant of object detection: Open set. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1021-1030.
  4. Environment Agency (2022). Reducing flood risk - the maintenance work we do to keep rivers flowing. [online]. Available at: .
  5. Fathy I, Abdel-Aal GM, Fahmy MR, Fathy A and Zeleňáková M (2020). The negative impact of blockage on storm water drainage network. Water, 12(7), 1974.
  6. Hennessy K, Lawrence J and Mackey B (2022). IPCC sixth assessment report (AR6): climate change 2022-impacts, adaptation and vulnerability: regional factsheet Australasia.
  7. Hong Kong Observatory (2022). Climate change in Hong Kong – rainfall. Hong Kong Observatory (HKO) Climate Change.
  8. Jiang P, Ergu D, Liu F, Cai Y and Ma B (2022). A Review of Yolo algorithm developments. Procedia Computer Science, 199, pp. 1066-1073.
  9. Li F, Zhang H, Zhang YF, Liu S, Guo J, Ni LM and Zhang L (2022). Vision-language intelligence: Tasks, representation learning, and large models. [online report]. Available at: .
  10. Li J, Li D, Savarese S and Hoi S (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. [online report]. Available at: .
  11. Lin TY, Maire M, Belongie SJ, Bourdev LD, Girshick RB, Hays J, Perona P, Ramanan D and Zitnick CL (2014). Microsoft COCO: Common Objects in Context. [online report]. Available at: .
  12. Liu S, Zeng Z, Ren T, Li F, Zhang H, Yang J and Zhang L (2023). Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. [online report]. Available at: .
  13. MetaAI (no date). Object Detection on COCO minival. Papers With Code. Available at: .
  14. Redmon J, Divvala S, Girshick R, Farhadi A (2015). You Only Look Once: Unified, Real-Time Object Detection. [online report]. Available at: .
  15. Zang Y, Li W, Han J, Zhou K, and Loy CC (2023). Contextual Object Detection with Multimodal Large Language Models. [online report]. Available at: .
  16. Zhang J, Khayatkhoei M, Chhikara P, and Ilievski F (2023). Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models. [online report]. Available at: .
  17. Zou Z, Chen K, Shi Z, Guo Y, and Ye J (2023). Object detection in 20 years: A survey. Proceedings of the IEEE.