Named entity recognition of HSE inspection minutes based on data enhancement

doi:10.16265/j.cnki.issn1003-3033.2022.12.2727

Abstract

To address the problems that arise when deep learning models are used for text mining of safety inspection minutes, namely small datasets, unevenly distributed sample data, and poor named entity recognition (NER) performance, a new data enhancement method for NER is proposed. First, the named entities in the dataset are separated out and randomly replaced with entities of the same type, which prevents the data enhancement from damaging named entity information and makes the distribution of named entities more uniform. Then, the noise data added to the remaining (non-entity) parts and the ratio parameters are optimized to further improve NER performance. Finally, the separated data are labeled automatically and recombined, which avoids manually annotating a large amount of data. The results show that the method quickly alleviates the problems of small data volume and uneven named entity distribution in the dataset. Compared with the An Easier Data Augmentation (AEDA) method, it achieves better recognition results on datasets such as health, safety and environment (HSE) inspection minutes, raising the model's comprehensive evaluation index (F₁) on one-fold expanded data from 92.83% to 97.23%. In addition, the spatial distribution patterns and strong association rules of safety hazards in the construction process can be obtained.

Key words: data enhancement, health safety environment (HSE), inspection minutes, named entity recognition (NER), hidden danger, text mining
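
The three steps named in the abstract (same-type entity replacement, noise in the non-entity parts, automatic relabeling on recombination) can be illustrated with a minimal sketch. It is not the authors' implementation: the BIO tagging scheme, the punctuation list used as noise (borrowed from AEDA), the function names `split_segments` and `augment`, and the `entity_pool` structure are all assumptions made for illustration. The `ratio` parameter mirrors the insertion ratio of 0.25 reported in the results tables.

```python
import random

# Assumption: AEDA-style punctuation marks serve as the inserted noise.
PUNCT = [".", ";", "?", ":", "!", ","]


def split_segments(tokens, tags):
    """Split a BIO-tagged sentence into (label, tokens) segments.

    label is the entity type (e.g. "LOC") or None for non-entity text.
    """
    segments = []
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            label = None
            start = (not segments) or (segments[-1][0] is not None)
        else:
            prefix, label = tag.split("-", 1)
            start = prefix == "B" or (not segments) or segments[-1][0] != label
        if start:
            segments.append((label, [tok]))
        else:
            segments[-1][1].append(tok)
    return segments


def augment(tokens, tags, entity_pool, ratio=0.25, rng=random):
    """Return one augmented (tokens, tags) pair.

    Entities are swapped for a random entity of the same type; non-entity
    segments receive random punctuation. Tags are regenerated automatically.
    """
    new_tokens, new_tags = [], []
    for label, seg in split_segments(tokens, tags):
        if label is not None:
            # Same-type replacement keeps the entity intact and rebalances types.
            if entity_pool.get(label):
                seg = list(rng.choice(entity_pool[label]))
            new_tokens += seg
            new_tags += ["B-" + label] + ["I-" + label] * (len(seg) - 1)
        else:
            # Noise goes only into non-entity text, so entity spans are never broken.
            seg = list(seg)
            n_insert = rng.randint(1, max(1, int(ratio * len(seg))))
            for _ in range(n_insert):
                seg.insert(rng.randint(0, len(seg)), rng.choice(PUNCT))
            new_tokens += seg
            new_tags += ["O"] * len(seg)
    return new_tokens, new_tags


# Hypothetical sample and entity pool, for illustration only.
demo_tokens = ["crack", "found", "at", "gate", "3", "on", "Monday"]
demo_tags = ["O", "O", "O", "B-LOC", "I-LOC", "O", "B-TIME"]
demo_pool = {
    "LOC": [["gate", "3"], ["scaffold", "area", "B"]],
    "TIME": [["Monday"], ["last", "Friday"]],
}
print(augment(demo_tokens, demo_tags, demo_pool, ratio=0.25, rng=random.Random(0)))
```

Because the tags are rebuilt from the segment labels, every augmented sentence is labeled without manual work, which is the property the abstract emphasizes.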

XIA Zhanjie, ZHANG Beike, GAO Dong. Named entity recognition of HSE inspection minutes based on data enhancement[J]. China Safety Science Journal, 2022, 32(12): 53-62.

Figures/Tables (17): Fig. 1-11, Tables 1-6

Corpus	HSE	HAZOP	BMES	DEMO	People's Daily
Original data	92.83	76.82	88.06	68.26	86.37
One-fold expanded data (ratio 0.25)	97.23	75.43	87.45	69.90	90.34
Data with adjusted insertion ratio	97.23	77.56	90.08	69.90	90.34

Model	P/%	R/%	F₁/%
BiLSTM+CRF	70.68	69.98	70.33
BiLSTM+CRF+AEDA	69.38	70.48	69.93
BiLSTM+CRF+ADA-NER	71.19	75.25	73.16
BiLSTM+Attention+CRF	92.00	93.67	92.83
BiLSTM+Attention+CRF+AEDA	95.53	96.06	97.79
BiLSTM+Attention+CRF+ADA-NER	99.40	99.55	99.47
BERT+BiLSTM+CRF	97.92	98.21	98.07
BERT+BiLSTM+CRF+AEDA	98.97	99.19	99.08
BERT+BiLSTM+CRF+ADA-NER	99.08	99.28	99.18
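
The F₁ column is the comprehensive evaluation index cited in the abstract. Assuming it is the usual harmonic mean of precision P and recall R, the short sketch below recomputes it for a few rows of the table above. The helper name `f1_score` is introduced here for illustration, and small deviations are expected because the published P and R values are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, copied from the table rows above.
rows = {
    "BiLSTM+CRF": (70.68, 69.98, 70.33),
    "BiLSTM+Attention+CRF": (92.00, 93.67, 92.83),
    "BiLSTM+Attention+CRF+ADA-NER": (99.40, 99.55, 99.47),
}

for model, (p, r, reported) in rows.items():
    print(f"{model}: recomputed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

The same check can be run on any other row to cross-validate a reported F₁ against its P and R.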

Entity type	Model	P/%	R/%	F₁/%
LOC	BiLSTM+Attention+CRF	78.52	81.92	80.18
	BiLSTM+Attention+CRF+AEDA	95.93	97.04	96.49
	BiLSTM+Attention+CRF+ADA-NER	98.35	98.79	98.57
TIME	BiLSTM+Attention+CRF	98.65	99.32	98.99
	BiLSTM+Attention+CRF+AEDA	95.13	95.50	95.32
	BiLSTM+Attention+CRF+ADA-NER	99.93	99.77	99.85
ORG	BiLSTM+Attention+CRF	99.32	99.77	99.55
	BiLSTM+Attention+CRF+AEDA	95.51	95.63	95.57
	BiLSTM+Attention+CRF+ADA-NER	99.84	100.00	99.92

Model	Share of dataset used/%
	20	40	60	80	100
Normal	73.49	81.62	90.21	92.09	92.83
ADA-NER+BiLSTM+Attention+CRF	94.25	97.22	98.67	98.71	99.47
ADA-NER+BERT+BiLSTM+CRF	92.70	97.33	98.63	99.07	99.18

Model	Expansion factor	P/%	R/%	F₁/%
ERR	1	92.08	95.29	93.66
NDO	1	95.08	96.67	95.87
ERR+NDO	1	96.49	97.98	97.23
ERR	3	96.68	99.88	98.25
NDO	3	99.32	99.37	99.35
ERR+NDO	3	99.40	99.55	99.47