Multimodal large model-based approach for construction safety hazard recognition

doi:10.16265/j.cnki.issn1003-3033.2025.09.1298

Abstract

To improve the automatic recognition of safety hazards and the level of safety management in construction scenarios, a construction safety hazard recognition method based on a multimodal large model was proposed, and its core component, the multimodal safety hazard recognition model LLaVA-CS (Large Language and Vision Assistant for Construction Site), was implemented. The system combines images (construction site photos) with safety operating procedures (descriptions of worker behavior) and uses multimodal learning and deep learning techniques to monitor and analyze construction sites in real time. To support the system, a multimodal dataset covering complex conditions such as varying lighting, occlusion, and multi-person scenes was constructed, filling a gap in existing public datasets. The results show that, through prompt tuning of the LLaVA-1.5 model, LLaVA-CS effectively fuses visual and textual information and improves the accuracy and interpretability of safety hazard recognition. The method integrating this model achieved a recognition accuracy of 0.7222 across multiple real construction projects and generates detailed explanatory text in real time, helping managers quickly understand the specific context of a safety hazard and strengthening decision support for safety management. Applying multimodal large models to construction safety management systems helps provide real-time, interpretable safety monitoring solutions.

Key words: multimodal large model, construction safety hazards, complex construction scenarios, safety management, prompt tuning

CLC number:

X948

AN Siqi, CAI Anglin, MA Zicheng, ZHU Baoyan. Multimodal large model-based approach for construction safety hazard recognition[J]. China Safety Science Journal, 2025, 35(9): 185-192.

Figures and tables (8): Figures 1-4, Tables 1-4


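Label category	Count	Share/%
No violation	211	8.42
Smoking	286	11.41
Climbing over safety guardrails	374	14.92
Not wearing a safety helmet and reflective vest	991	39.55
Making or answering phone calls	317	12.65
Falling	327	13.05
Total	2506	100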
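
The shares in the table above follow directly from the counts. A minimal sketch that recomputes them is given here; the English label keys are translations introduced for illustration and are not identifiers taken from the dataset itself.

```python
# Label counts as reported in the dataset table. The English keys are
# illustrative translations, not identifiers from the dataset.
label_counts = {
    "no_violation": 211,
    "smoking": 286,
    "climbing_over_guardrail": 374,
    "no_helmet_and_vest": 991,
    "phone_call": 317,
    "falling": 327,
}


def label_shares(counts):
    """Return each label's share of the dataset in percent (2 decimals)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


total = sum(label_counts.values())
shares = label_shares(label_counts)

for label, pct in shares.items():
    print(f"{label}\t{label_counts[label]}\t{pct:.2f}")
print(f"total\t{total}")
```

The largest class (no helmet and vest) is roughly 4.7 times the size of the smallest (no violation), which may be worth keeping in mind when reading a single aggregate score.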

Model	Precision
LLaVA-CS	0.7222
EfficientNet	0.7056
Swin Transformer	0.6375
ResNet	0.3624
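
To make the size of the gaps in this comparison explicit, the short sketch below computes the absolute difference between LLaVA-CS and each baseline from the reported scores. The helper name `margins_over_baselines` is hypothetical; only the four scores come from the table.

```python
# Scores as reported in the model comparison table.
scores = {
    "LLaVA-CS": 0.7222,
    "EfficientNet": 0.7056,
    "Swin Transformer": 0.6375,
    "ResNet": 0.3624,
}


def margins_over_baselines(scores, target):
    """Absolute gap between the target model's score and every other model's."""
    return {
        name: round(scores[target] - value, 4)
        for name, value in scores.items()
        if name != target
    }


gaps = margins_over_baselines(scores, "LLaVA-CS")

# Print the gaps from smallest to largest.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"LLaVA-CS - {name}: {gap:+.4f}")
```

On these numbers the margin over EfficientNet is under two percentage points, while the margin over ResNet is about 36 points.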

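Feature	GPT-4	LLaVA-1.5	LLaVA-CS
Model type	General-purpose large language model with multimodal capability (text/image)	Multimodal model (text/image)	Specialized multimodal model, domain-optimized for construction safety hazard recognition
Architecture	Transformer-based	Transformer combined with a visual encoder	Builds on LLaVA-1.5, adding dynamic feature fusion and a context adaptation mechanism
Input data	Primarily text input, with image input supported	Processes image and text data together	Construction site images and text data
Main applications	General natural language processing tasks such as text generation, question answering, and dialogue	Cross-modal tasks such as image captioning and visual question answering	Automatic recognition and explanation of construction safety hazards, providing decision support for construction safety monitoring
Training data	Large-scale multi-domain text data, with image data in some versions	Combined image and text datasets, focused on cross-modal information fusion tasks	Customized multimodal dataset focused on safety hazard recognition in complex construction site scenes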

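Prompt: Based on the content of the image, determine whether a safety hazard is present in the image, and give your judgment
Model	Image 1	Image 2	Image 3	Image 4
LLaVA-CS	This is unsafe, because the person in the image is smoking and lacks protective measures	This is unsafe, because the image shows a lack of protective measures	There is no safety hazard in the image, because the image shows a safety helmet being removed in violation of the rules	This is safe
GPT-4	This is unsafe, because the worker in the image is working at height without a safety harness, and the construction platform may pose a fall risk	The worker in the image is also wearing a safety helmet and reflective vest and appears to be on a construction site. This is safe behavior, with no obvious safety hazard visible	Based on the content of the image, no obvious safety hazard was found	This is unsafe, because the person in the image is making an obscene gesture
LLaVA-1.5	This is unsafe, because the person in the image is standing at height without using a safety harness	This is unsafe, because the person in the image has entered an area under construction without using any personal protective equipment	This is unsafe. A worker is working at height without using a safety harness	This is unsafe, because the person in the image is working at height without using a safety harness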
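
The question and the free-text answers in this comparison suggest a simple pre- and post-processing step around the model. The sketch below is an illustration only, not the authors' pipeline. It assumes the single-turn `USER: <image>\n... ASSISTANT:` layout used by publicly released LLaVA-1.5 checkpoints, uses an English rendering of the question, and reduces an answer to a verdict plus explanation with a keyword heuristic. The names `build_prompt` and `parse_verdict` are hypothetical.

```python
import re

# Assumption: the "USER: <image>\n... ASSISTANT:" layout of public LLaVA-1.5
# checkpoints. The tuned prompt actually used by LLaVA-CS is not shown here.
QUESTION = (
    "Based on the content of the image, determine whether a safety hazard "
    "is present in the image, and give your judgment"
)


def build_prompt(question=QUESTION):
    """Wrap the question in a single-turn LLaVA-1.5 style conversation."""
    return f"USER: <image>\n{question} ASSISTANT:"


def parse_verdict(response):
    """Map a free-text answer to (verdict, reason) with a keyword heuristic.

    verdict is one of "unsafe", "safe", or "unknown".
    """
    text = response.strip()
    lowered = text.lower()
    # Check "unsafe" first, because "safe" is a substring of it.
    if re.search(r"\bunsafe\b|\bnot safe\b", lowered):
        verdict = "unsafe"
    elif re.search(r"\bsafe\b|\bno (obvious )?safety hazard", lowered):
        verdict = "safe"
    else:
        verdict = "unknown"
    # Everything after the first "because" is treated as the explanation.
    parts = re.split(r"\bbecause\b", text, maxsplit=1, flags=re.IGNORECASE)
    reason = parts[1].strip() if len(parts) == 2 else ""
    return verdict, reason


example = (
    "This is unsafe, because the person in the image is smoking "
    "and lacks protective measures"
)
print(build_prompt())
print(parse_verdict(example))
```

A keyword heuristic of this kind only reads the stated verdict. It would label an answer such as "There is no safety hazard in the image, because ... in violation of the rules" as safe even though the explanation describes a violation, so the verdict and the reason are best checked together.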