China Safety Science Journal ›› 2022, Vol. 32 ›› Issue (12): 53-62. doi: 10.16265/j.cnki.issn1003-3033.2022.12.2727

• Safety Science Theory and Safety System Science •

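  • About the author:

    XIA Zhanjie (born 1993), male, from Heze, Shandong Province; holds a master's degree; research interests include safety data mining, natural language processing, and data augmentation.

  • Funding:
    National Natural Science Foundation of China (61703026)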

Named entity recognition of HSE inspection minutes based on data enhancement

XIA Zhanjie, ZHANG Beike, GAO Dong**

  1. School of Information and Technology, Beijing University of Chemical Technology, Beijing 100029, China
  • Received:2022-07-30 Revised:2022-10-23 Online:2022-12-28 Published:2023-06-28
  • Contact: GAO Dong

Abstract:

To address the problems that deep learning models face when mining the text of safety inspection minutes, namely small datasets, unevenly distributed samples, and poor named entity recognition (NER) performance, a new data augmentation method for NER is proposed. First, the named entities in the dataset are separated out and randomly replaced with entities of the same type, which prevents the augmentation from corrupting entity information and makes the entity distribution more uniform. Then, the noise data added to the remaining (non-entity) text and the associated ratio parameters are optimized to further improve NER performance. Finally, the separated segments are labeled automatically and recombined, which avoids manual annotation of large amounts of data. The results show that the method quickly alleviates both the small size of the dataset and the uneven distribution of named entities. Compared with the AEDA (An Easier Data Augmentation) method, it achieves better recognition results on datasets such as health, safety and environment (HSE) inspection minutes, raising the model's comprehensive evaluation index on one-fold expanded data from 92.83% to 97.23%. In addition, the spatial distribution of safety hazards during construction and the strong association rules among them can be obtained.

Key words: data enhancement, health, safety and environment (HSE), inspection minutes, named entity recognition (NER), hidden danger, text mining
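
The three steps summarized in the abstract (separate the entities, replace them with same-type entities, add noise only to the remaining text, then relabel and recombine) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the BIO tag scheme, the entity types `EQUIP`, `LOC` and `HAZ`, the function names, the `noise_ratio` parameter, and the choice of AEDA-style random punctuation insertion as the noise are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Punctuation marks used as AEDA-style noise (assumed noise type).
PUNCT = [",", ".", ";", ":", "!", "?"]

def split_segments(tokens, tags):
    """Group a BIO-tagged sentence into (label, tokens) segments.
    label is None for non-entity (O) text."""
    segments = []
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            label = None
            start = not segments or segments[-1][0] is not None
        else:
            prefix, label = tag.split("-", 1)
            start = prefix == "B" or not segments or segments[-1][0] != label
        if start:
            segments.append((label, [tok]))
        else:
            segments[-1][1].append(tok)
    return segments

def build_pool(corpus):
    """Collect the distinct entity mentions of each type across the corpus."""
    pool = defaultdict(list)
    for tokens, tags in corpus:
        for label, seg in split_segments(tokens, tags):
            if label is not None and seg not in pool[label]:
                pool[label].append(seg)
    return pool

def augment(tokens, tags, pool, rng, noise_ratio=0.3):
    """Return one augmented (tokens, tags) pair with automatically rebuilt labels."""
    new_tokens, new_tags = [], []
    for label, seg in split_segments(tokens, tags):
        if label is None:
            # Noise goes into non-entity text only, so entities stay intact.
            seg = list(seg)
            max_insert = max(1, int(noise_ratio * len(seg)))
            for _ in range(rng.randint(0, max_insert)):
                seg.insert(rng.randint(0, len(seg)), rng.choice(PUNCT))
            new_tokens += seg
            new_tags += ["O"] * len(seg)
        else:
            # Swap in a random mention of the same entity type.
            repl = rng.choice(pool.get(label) or [seg])
            new_tokens += repl
            new_tags += [f"B-{label}"] + [f"I-{label}"] * (len(repl) - 1)
    return new_tokens, new_tags

# Toy corpus with hypothetical entity types.
corpus = [
    (["scaffold", "on", "floor", "3", "lacks", "guardrail"],
     ["B-EQUIP", "O", "B-LOC", "I-LOC", "O", "B-HAZ"]),
    (["tower", "crane", "near", "gate", "shows", "loose", "bolts"],
     ["B-EQUIP", "I-EQUIP", "O", "B-LOC", "O", "B-HAZ", "I-HAZ"]),
]
pool = build_pool(corpus)
rng = random.Random(0)
aug_tokens, aug_tags = augment(*corpus[0], pool, rng)
print(list(zip(aug_tokens, aug_tags)))
```

Because every replacement mention is drawn from a pool keyed by entity type, the tags of the augmented sentence can be regenerated mechanically, which is what removes the need for manual annotation in this sketch; sampling uniformly from the pool is also one simple way to even out the entity distribution.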