China Safety Science Journal ›› 2022, Vol. 32 ›› Issue (6): 109-114. doi: 10.16265/j.cnki.issn1003-3033.2022.06.2732

• Safety Engineering Technology •

Information extraction method for railway equipment accidents based on multi-dimensional character feature representation

ZHANG Pengxiang

  1. Standards & Metrology Research Institute, China Academy of Railway Sciences Corporation Limited, Beijing 100081, China
  • Received: 2022-01-14 Revised: 2022-04-11 Online: 2022-06-28 Published: 2022-12-28
  • Author biography:

    ZHANG Pengxiang (born 1988), male, from Tongwei, Gansu Province; holds a master's degree; engineer; works mainly on railway transportation. E-mail:

Abstract:

To address the difficulty of analyzing data in railway equipment accident investigation reports, an accident information extraction method based on multi-dimensional character feature representation is proposed. In the data preprocessing stage, a subject pattern matching method extracts the subject paragraphs to which the named entities belong. For text feature representation, a multi-dimensional feature representation method transforms the text into feature vectors. A named entity recognition model for railway equipment accidents is then trained using a bidirectional long short-term memory (BiLSTM) network combined with a conditional random field (CRF). Finally, railway equipment accident investigation reports are used for experimental verification. The results show that subject pattern matching preprocessing improves the comprehensive evaluation index of the multi-dimensional character feature + BiLSTM + CRF model by 22.86%, and that, compared with word2vec feature representation, the multi-dimensional character feature representation improves the comprehensive evaluation index of the BiLSTM + CRF model by 4.89%.

Key words: multi-dimensional character feature, railway equipment accident, information extraction, subject pattern matching, named entity recognition
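
The abstract names two steps ahead of the BiLSTM + CRF model: selecting the subject paragraphs of a report, and representing each character by several features. The sketch below is only a rough illustration of those two ideas and is not the author's implementation. The heading regex, the keyword list, the three feature dimensions (`char`, `type`, `position`), the function names, and the sample report are all assumptions made for this example, since the abstract does not specify them.

```python
import re

# Assumed heading style: a Chinese numeral, then "、", then the section title.
HEADING = re.compile(r"^[一二三四五六七八九十]+、\s*(.+)$")


def extract_subject_paragraphs(report, subject_keywords):
    """Return body lines of sections whose heading contains any keyword."""
    selected = []
    keep = False
    for raw_line in report.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            # A new section starts: keep its body only if the title matches.
            keep = any(k in match.group(1) for k in subject_keywords)
            continue
        if keep:
            selected.append(line)
    return selected


def char_type(ch):
    """Coarse character class, used here as one hypothetical feature dimension."""
    if ch.isdigit():
        return "DIGIT"
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    if "\u4e00" <= ch <= "\u9fff":
        return "CJK"
    return "OTHER"


def char_features(text):
    """Describe each character by several (assumed) feature dimensions."""
    return [
        {"char": ch, "type": char_type(ch), "position": i}
        for i, ch in enumerate(text)
    ]


# Made-up sample report with three sections: overview, cause, responsibility.
report = """一、事故概况
2021年3月5日,K123次列车发生设备故障。
二、事故原因
信号设备电缆老化。
三、责任认定
设备管理单位负主要责任。
"""

cause_paragraphs = extract_subject_paragraphs(report, ["原因"])
features = char_features("K123次")
```

In a full pipeline, each feature dimension would typically be mapped to an embedding and the embeddings concatenated into one vector per character before being fed to the sequence model; that step is omitted here because the abstract gives no details on it.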