China Safety Science Journal, 2025, Vol. 35, Issue (9): 185-192. doi: 10.16265/j.cnki.issn1003-3033.2025.09.1298

• Safety Engineering Technology •

Construction safety hazard recognition with an integrated multimodal large model

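AN Siqi1, CAI Anglin2, MA Zicheng2, ZHU Baoyan1,**

  1. School of Safety Science and Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China
  2. School of Science, China University of Mining and Technology (Beijing), Beijing 100083, China
  • Received: 2025-04-11 Revised: 2025-06-15 Published: 2025-09-28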
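  • Corresponding author:
    **ZHU Baoyan (b. 1963), male, from Jianchang, Liaoning; Ph.D., professor. His research covers risk pre-control, safety behavior, safety economics, occupational health management systems, and safety and emergency management.
  • About the author:

    AN Siqi (b. 1998), male, from Daqing, Heilongjiang; master's degree. He works mainly on safety management and safety credit evaluation.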

Multimodal large model-based approach for construction safety hazard recognition

AN Siqi1, CAI Anglin2, MA Zicheng2, ZHU Baoyan1,**

  1. School of Safety Science and Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China
  2. School of Science, China University of Mining and Technology-Beijing, Beijing 100083, China
  • Received:2025-04-11 Revised:2025-06-15 Published:2025-09-28

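Abstract:

To improve automatic hazard recognition and safety management in construction scenarios, a construction safety hazard recognition method integrating a multimodal large model is built, together with its core component, the multimodal hazard recognition model LLaVA-CS (a multimodal vision-text large language model (LLaVA) for Construction Site (CS) scenarios). The system combines images (construction site photos) with safety operating procedures (descriptions of worker behavior) and uses multimodal learning and deep learning to monitor and analyze construction sites in real time. To support the system, a multimodal dataset covering complex conditions such as varying lighting, occlusion, and multi-person scenes is constructed, filling a gap in existing public datasets. The results show that, through prompt tuning of the LLaVA-1.5 model, LLaVA-CS effectively fuses visual and textual information and improves the accuracy and interpretability of hazard recognition. The recognition method integrating this model reaches an accuracy of 0.7222 across several real construction projects and generates detailed explanatory text in real time, helping managers quickly understand the specific context of a hazard and strengthening decision support for safety management. Applying multimodal large models to construction safety management systems helps provide real-time, interpretable safety monitoring.

Key words: multimodal large model, construction safety hazards, complex construction scenarios, safety management, prompt tuning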

Abstract:

To improve automatic recognition of safety hazards and safety management in construction scenarios, a construction safety hazard recognition method based on a multimodal large model was proposed, and its core component, the multimodal hazard recognition model LLaVA-CS (Large Language and Vision Assistant for Construction Sites), was implemented. The system combined images (construction site photos) with safety operating procedures (descriptions of worker behavior), using multimodal learning and deep learning to monitor and analyze construction sites in real time. To support the system, a multimodal dataset covering complex conditions such as varying lighting, occlusion, and multi-person scenes was constructed, filling a gap in existing public datasets. The results show that, through prompt tuning of the LLaVA-1.5 model, LLaVA-CS effectively fuses visual and textual information and improves the accuracy and interpretability of hazard recognition. The method achieves a recognition accuracy of 0.7222 across multiple real-world construction projects and generates detailed explanatory text in real time, which helps managers quickly understand the specific context of each hazard and supports safety management decisions. Applying multimodal large models to construction safety management systems can therefore provide real-time, interpretable safety monitoring and new technical support for construction safety management.
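
As a rough illustration of the workflow the abstract describes (pairing a site photo with a safety rule, asking the model for a verdict plus an explanation, and scoring recognition accuracy), the following Python sketch may help. The prompt wording, the label vocabulary ("hazard" / "safe"), and the function names `build_prompt`, `parse_verdict`, and `accuracy` are assumptions made for illustration; they are not the authors' implementation, and the actual call to LLaVA-CS is deliberately left out.

```python
def build_prompt(rule_text: str) -> str:
    """Combine a safety rule with a question about the attached site photo.

    The wording is hypothetical; the paper's actual prompts are not given
    in the abstract.
    """
    return (
        "You are a construction safety inspector.\n"
        f"Safety rule: {rule_text}\n"
        "Question: Does the attached photo violate this rule? "
        "Answer 'hazard' or 'safe', then explain briefly."
    )


def parse_verdict(answer: str) -> str:
    """Read the first word of the model's answer as the predicted label."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,:;!'\"") if words else ""
    return first if first in {"hazard", "safe"} else "unknown"


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

In this sketch the free-text explanation after the verdict is kept for the human reviewer and only the leading label is scored, which is one simple way to get both an accuracy figure and an interpretable output from the same answer.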

Key words: multimodal large model, construction safety hazards, complex construction scenarios, safety management, prompt tuning

CLC number: