China Safety Science Journal ›› 2025, Vol. 35 ›› Issue (10): 106-114. doi: 10.16265/j.cnki.issn1003-3033.2025.10.1435

• Safety engineering technology •

Intelligent question answering model for construction safety hazards based on vision-language multimodality

WANG Zhe1,2, HUANG Haichen1,2, LI Ruiqin1,2, WEI Yongchang3

  1 School of Safety Science and Emergency Management, Wuhan University of Technology, Wuhan Hubei 430070, China
  2 China Emergency Management Research Center, Wuhan University of Technology, Wuhan Hubei 430070, China
  3 College of Business Administration, Zhongnan University of Economics and Law, Wuhan Hubei 430073, China
  • Received: 2025-06-10 Revised: 2025-08-13 Online: 2025-10-28 Published: 2026-04-28

Abstract:

To improve the intelligent diagnosis of safety problems in complex construction environments, an intelligent question-answering model for construction safety hazards based on vision-language multimodality was proposed. A dataset of image-text pairs on construction safety hazards was built. A visual encoder was used to encode the hazard images, a language model was used to encode the question-answering texts, and a multimodal feature fusion module was adopted to enable interaction between image and text information. An input template for visual question answering, tailored to construction safety hazard scenarios, was designed. The model was fine-tuned using matrix low-rank decomposition, and multi-round prompts were used to guide it toward accurate answers. The results show that, compared with existing baseline models, the proposed model performs better on automatic evaluation metrics, Generative Pre-trained Transformer (GPT)-4 evaluation, and expert evaluation, with significantly improved fluency and semantic relevance of the generated text. Ablation experiments further verify the effectiveness of each sub-module: the combination of matrix low-rank decomposition fine-tuning and multi-round reasoning is key to the model achieving its best performance, and a suitably chosen rank for the low-rank matrices effectively avoids overfitting.
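
The abstract does not spell out how the matrix low-rank decomposition fine-tuning is formulated. As a minimal sketch, assuming the common formulation in which a frozen weight W is updated by a scaled product of two small trainable factors, W + (alpha / r) * B A, the idea can be illustrated as follows. All names, sizes, and the scaling factor `alpha` are illustrative assumptions, not details taken from the paper.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def low_rank_adapted_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself is never modified."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, rank):
    """Parameter count of the factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


random.seed(0)
d_out, d_in, rank, alpha = 8, 8, 2, 4.0  # illustrative sizes only (assumption)

# Frozen pretrained weight (random stand-in for a real layer).
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable factors: A starts small and random, B starts at zero,
# so the adapted weight equals W before any training step.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_adapted = low_rank_adapted_weight(w, b, a, alpha, rank)
full_params = d_out * d_in
low_rank_params = trainable_params(d_out, d_in, rank)
print(full_params, low_rank_params)
```

Under this assumed formulation the rank controls how many parameters are trained, which is one plausible reason the choice of rank affects overfitting as the abstract reports.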

Key words: vision-language, multimodality, construction safety, safety hazard, intelligent question answering model, matrix low-rank decomposition
