Intelligent question answering model for construction safety hazards based on vision-language multimodality

doi:10.16265/j.cnki.issn1003-3033.2025.10.1435

Abstract

To improve the intelligent diagnosis of safety problems in complex construction environments, an intelligent question-answering model for construction safety hazards based on vision-language multimodality was proposed. A dataset of image-text pairs on construction safety hazards was constructed. A visual encoder encodes the hazard images, a language model encodes the hazard question-answering texts, and a multimodal feature fusion module enables interaction between the image and text information. An input template for visual question answering, tailored to construction safety hazard scenarios, was designed. The model was fine-tuned using matrix low-rank decomposition, and multi-round prompts were used to guide it toward accurate answers. The results show that, compared with existing baseline models, the proposed model performs better on automatic evaluation metrics, Generative Pre-trained Transformer (GPT)-4 evaluation, and expert evaluation, with significantly improved fluency and semantic relevance of the generated text. Ablation experiments further verify the effectiveness of each sub-module: the synergy between low-rank decomposition fine-tuning and multi-round reasoning is key to the model's best performance, and an appropriately chosen rank for the low-rank matrices effectively mitigates overfitting.
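The low-rank fine-tuning mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes a LoRA-style update, in which a frozen weight W is adapted as W + (alpha / r) * B A; the abstract does not give the exact formulation. The layer sizes, the scaling factor `alpha`, and the function name `lora_forward` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Linear layer with a low-rank update (LoRA-style sketch, assumed form).

    W (d_out x d_in) is the frozen pretrained weight. Only A (r x d_in)
    and B (d_out x r) would be trained. Effective weight:
    W + (alpha / r) * B @ A.
    """
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4               # illustrative sizes, not from the paper
W = rng.normal(size=(d_out, d_in))       # frozen weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init: update starts at 0
x = rng.normal(size=(2, d_in))           # a batch of two input vectors

y = lora_forward(x, W, A, B, alpha=8.0)

trainable = A.size + B.size              # r * (d_in + d_out)
full = W.size                            # d_in * d_out
print(f"trainable parameters: {trainable} of {full}")
```

The rank `r` sets how many parameters are trained (here 384 against 2048 for the full matrix). This is one way to read the abstract's remark that a suitable rank helps avoid overfitting.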

Key words: vision-language, multimodality, construction safety, safety hazard, intelligent question answering model, matrix low-rank decomposition

WANG Zhe, HUANG Haichen, LI Ruiqin, WEI Yongchang. Intelligent question answering model for construction safety hazards based on vision-language multimodality[J]. China Safety Science Journal, 2025, 35(10): 106-114.

Figures/Tables (11): Fig. 1, Table 1, Fig. 2, Fig. 3, Table 2, Fig. 4, Table 3, Table 4, Fig. 5, Table 5, Fig. 6

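Hazard type	Example safety hazards
Work at height	No edge protection on open sides where the fall height above the datum level exceeds 2 m, etc.
Working scaffolds	Diagonal (scissor) bracing not installed as required, etc.
Suspended platforms for work at height	Platform not parked on the ground; workers not entering or leaving the platform from the ground, etc.
Construction power supply	Exposed conductive parts of electrical equipment that are not normally live are not connected to the protective neutral, etc.
Formwork supports	Bottom horizontal (sweeping) bars not installed as required by the code, etc.
Foundation pit works	Surcharge loads around the pit do not comply with the code, etc.
Construction machinery and tools	Circular saw not fitted with a safety guard, etc.
Fire protection	Some fire hydrant cabinets lack nozzles and hoses, etc.
Civilized construction	Construction waste burned on site, etc.
Other	Site personnel not wearing safety helmets, etc.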

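No.	Q_H	Q_S
1	What safety hazards does the image contain?	What measures should be taken to eliminate the safety hazards in the image?
2	What potential safety hazards does the image show?	To ensure construction safety, how should the safety hazards in the image be handled?
︙	︙	︙
k	What safety hazards can you find in this image?	What measures can eliminate the safety hazards in the image?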
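The two question types in the table above suggest a two-round dialogue: a hazard question (Q_H) followed by a measures question (Q_S). The sketch below shows one plausible way to chain them so that the second round can see the first answer. The abstract does not specify the exact prompting protocol, so the ordering and history handling are assumptions. `ask` is a hypothetical model interface, `fake_ask` is a stand-in for a real vision-language model, and `site.jpg` is a placeholder path.

```python
from typing import Callable, List, Tuple

# Question templates taken from row 1 of the table above (English rendering).
Q_H = "What safety hazards does the image contain?"
Q_S = "What measures should be taken to eliminate the safety hazards in the image?"

History = List[Tuple[str, str]]

def two_round_qa(image_path: str,
                 ask: Callable[[str, History, str], str]) -> History:
    """Ask Q_H first, then Q_S with the first exchange kept in the history.

    `ask(image_path, history, question)` is a hypothetical interface to a
    vision-language model; it is not an API from the paper or any library.
    """
    history: History = []
    for question in (Q_H, Q_S):
        answer = ask(image_path, history, question)
        history.append((question, answer))
    return history

def fake_ask(image_path: str, history: History, question: str) -> str:
    # Stand-in model: reports how many earlier turns it could see.
    return f"answer after {len(history)} earlier turn(s)"

dialogue = two_round_qa("site.jpg", fake_ask)
for question, answer in dialogue:
    print(question, "->", answer)
```

Keeping the first exchange in the history means the measures answer can be conditioned on the hazards the model has already named, rather than on the image alone.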

Model	BLEU/%	ROUGE-L/%	METEOR/%
LLaVA	0.4	10.0	9.7
CogVLM	4.2	19.5	21.3
CSIQM	24.5	38.1	44.7
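For readers unfamiliar with the metrics above, ROUGE-L scores a generated answer by the longest common subsequence (LCS) it shares with a reference answer. The following is a minimal sketch under stated assumptions: whitespace tokenization and an F-measure with `beta = 1`. The paper's actual tokenization and weighting are not stated, and Chinese text would need character-level or word-level segmentation instead. The example sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure on whitespace tokens (tokenization is an assumption)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

# Invented example sentences, not data from the paper.
candidate = "workers are not wearing safety helmets"
reference = "site workers not wearing safety helmets"
score = rouge_l(candidate, reference)
print(f"ROUGE-L = {score:.3f}")
```

Here the shared subsequence is "workers not wearing safety helmets" (5 of 6 tokens on each side), so precision, recall, and the F-measure all equal 5/6.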

Model	GPT-4 evaluation	Expert evaluation: fluency	Expert evaluation: relevance
LLaVA	15	60	20
CogVLM	26	75	40
CSIQM	61	80	70

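Method	BLEU/%	ROUGE-L/%	METEOR/%	GPT-4
Baseline	1.0	14.0	15.3	27
Multi-round reasoning	2.6	17.2	19.3	30
Low-rank decomposition fine-tuning	17.2	29.1	32.7	42
Low-rank decomposition fine-tuning + multi-round reasoning	24.5	38.1	44.7	61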
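The synergy claimed in the abstract can be checked directly against the ablation figures above. This short sketch compares the gain of the combined method over the baseline with the sum of the gains from each component used alone. The numbers are copied from the table; the dictionary keys and the `gain` helper are illustrative names.

```python
# Ablation results copied from the table above.
ablation = {
    "baseline":         {"BLEU": 1.0,  "ROUGE-L": 14.0, "METEOR": 15.3, "GPT-4": 27},
    "multi_round":      {"BLEU": 2.6,  "ROUGE-L": 17.2, "METEOR": 19.3, "GPT-4": 30},
    "lora":             {"BLEU": 17.2, "ROUGE-L": 29.1, "METEOR": 32.7, "GPT-4": 42},
    "lora_multi_round": {"BLEU": 24.5, "ROUGE-L": 38.1, "METEOR": 44.7, "GPT-4": 61},
}

def gain(method, metric):
    """Improvement of `method` over the baseline, rounded to one decimal."""
    return round(ablation[method][metric] - ablation["baseline"][metric], 1)

summary = {}
for metric in ablation["baseline"]:
    separate = round(gain("multi_round", metric) + gain("lora", metric), 1)
    combined = gain("lora_multi_round", metric)
    summary[metric] = (separate, combined)
    print(f"{metric}: separate gains sum to {separate}, combined gain is {combined}")
```

On every metric the combined gain exceeds the sum of the separate gains (for BLEU, 23.5 against 17.8), which is consistent with the two components reinforcing each other rather than contributing independently.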