China Safety Science Journal ›› 2025, Vol. 35 ›› Issue (10): 106-114. doi: 10.16265/j.cnki.issn1003-3033.2025.10.1435

• Safety Engineering Technology •

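  • About the authors:

    WANG Zhe (born 1980), male, from Wuhan, Hubei; Ph.D., associate professor, and doctoral supervisor. His research focuses on emergency decision-making, artificial intelligence, and construction safety.

    WEI Yongchang, associate professor

  • Funding:
    Youth Fund for Humanities and Social Sciences Research of the Ministry of Education (21YJA630094); Youth Fund for Humanities and Social Sciences Research of the Ministry of Education (20YJC630154); Fundamental Research Funds for the Central Universities (104972024DZB0003)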

Intelligent question answering model for construction safety hazards based on vision-language multimodality

WANG Zhe1,2, HUANG Haichen1,2, LI Ruiqin1,2, WEI Yongchang3

  1 School of Safety Science and Emergency Management, Wuhan University of Technology, Wuhan, Hubei 430070, China
  2 China Emergency Management Research Center, Wuhan University of Technology, Wuhan, Hubei 430070, China
  3 College of Business Administration, Zhongnan University of Economics and Law, Wuhan, Hubei 430073, China
  • Received: 2025-06-10 Revised: 2025-08-13 Published: 2025-10-28


Abstract:

To improve the intelligent diagnosis of safety problems in complex construction environments, an intelligent question-answering model for construction safety hazards based on vision-language multimodality is proposed. A dataset of image-text pairs of construction safety hazards was constructed. A visual encoder encodes the safety hazard images, a language model encodes the question-answering text about the hazards, and a multimodal feature fusion module enables effective interaction between image and text information. A prompt template tailored to visual question answering in construction safety hazard scenarios was designed. The model was fine-tuned using matrix low-rank decomposition, and multi-round prompts were used to guide it toward accurate answers. The results show that, compared with existing comparison models, the proposed model performs better on automatic evaluation metrics, Generative Pre-trained Transformer (GPT)-4 evaluation, and expert evaluation, with significantly improved fluency and semantic relevance of the generated text. Ablation experiments further verify the effectiveness of each sub-module, confirming that the synergy between low-rank decomposition fine-tuning and multi-round reasoning is key to the model's best performance, and that a suitably chosen rank for the low-rank matrices effectively avoids overfitting.

Key words: vision-language, multimodality, construction safety, safety hazard, intelligent question answering model, matrix low-rank decomposition
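
The abstract names fine-tuning based on matrix low-rank decomposition and notes that the rank parameter matters, but it does not give the formulation. The following is a minimal sketch of one common form of this idea, assuming a LoRA-style update W' = W + (alpha / r) · B · A in which W stays frozen and only the small matrices A and B are trained. The function names, the scaling factor `alpha`, and the matrix sizes are illustrative assumptions, not details taken from the paper.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(b, a, alpha):
    """Return the scaled low-rank update (alpha / r) * B @ A (assumed LoRA-style form)."""
    r = len(a)  # rank r = number of rows of A = number of columns of B
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(b, a)]


def adapted_weight(w, b, a, alpha):
    """Add the low-rank update to the frozen weight matrix W."""
    delta = lora_delta(b, a, alpha)
    return [[wv + dv for wv, dv in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Count parameters in A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    random.seed(0)
    d_out, d_in, r, alpha = 4, 6, 2, 4.0  # hypothetical sizes for illustration
    w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
    a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
    b = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero init
    # With B initialised to zero, the adapted weight equals the original W.
    assert adapted_weight(w, b, a, alpha) == w
    # Trained parameters at rank 8 versus a full 4096 x 4096 matrix.
    print(trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

Under this assumed formulation the number of trained parameters grows linearly with the rank r, which is one plausible reading of why the abstract ties the rank setting to overfitting: a larger r gives the update more capacity to fit a small domain dataset.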

CLC number: