China Safety Science Journal, 2024, Vol. 34, Issue (2): 37-44. doi: 10.16265/j.cnki.issn1003-3033.2024.02.0121

• Safety social science and safety management •

Short text classification of civil aviation intelligent supervision based on character-word fusion

WANG Xin1, GAN Zurui1, XU Yaxi2,**, SHI Ke3, ZHENG Tao1

  1. School of Computer, Civil Aviation Flight University of China, Guanghan, Sichuan 618307, China
  2. School of Economics and Management, Civil Aviation Flight University of China, Guanghan, Sichuan 618307, China
  3. Institute of Civil Aviation Supervisor Training, Civil Aviation Flight University of China, Guanghan, Sichuan 618307, China
  • Received: 2023-08-14 Revised: 2023-11-20 Published: 2024-02-28
  • Corresponding author:
    ** XU Yaxi (b. 1976), female, from Chengdu, Sichuan; master's degree, associate professor, master's supervisor. Research interests: decision analysis and optimization, data mining. E-mail:
  • About the authors:

    WANG Xin (b. 1973), male, from Mianyang, Sichuan; PhD, professor, master's supervisor. Research interests: machine learning, data mining, natural language processing. E-mail:

    SHI Ke, senior engineer

    ZHENG Tao, associate professor

  • Funding:
    National Natural Science Foundation of China (U2033213); Fundamental Research Funds for the Central Universities (J2022-048); Fundamental Research Funds for the Central Universities (J2019-045)

Abstract:

Inspection records generated by civil aviation supervision are currently classified and analyzed by hand, which is inefficient. To address this, a dual-channel short text classification model based on data augmentation and character-word vector fusion is proposed. The model classifies supervision items into categories covering issues related to people, equipment, facilities and environment, systems and procedures, and organizational responsibilities. To mitigate class imbalance, a data augmentation algorithm transforms the original texts to generate new samples, so that the number of samples per category becomes more balanced. Character vectors and word vectors are concatenated character by character, yielding character vectors that carry word-level feature information. The fused vectors are fed separately into a text convolutional neural network (TextCNN) and a bi-directional long short-term memory (BiLSTM) model, which extract features from the local and the global perspective respectively. Experiments on the civil aviation supervision inspection record dataset show that the model achieves an accuracy of 0.9837 and an F1 score of 0.9836. Compared with several character embedding and word embedding models, accuracy improves by 0.4%. Compared with several commonly used single-channel models, accuracy improves by 3%, which supports the comprehensiveness and effectiveness of the features extracted by the dual-channel model.

Key words: character-word vector fusion, civil aviation supervision, short text, text convolutional neural network (TextCNN), bi-directional long short-term memory (BiLSTM)

CLC number:
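
The character-level fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_char_word_vectors`, the toy 2-dimensional embeddings, the example tokens 跑道 ("runway") and 灯 ("light"), and the zero-vector fallback for unknown tokens are all assumptions made for illustration. The sketch only shows the general idea that each character vector is concatenated with the vector of the word containing that character.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def fuse_char_word_vectors(
    words: Sequence[str],
    char_vecs: Dict[str, Vector],
    word_vecs: Dict[str, Vector],
    char_dim: int,
    word_dim: int,
) -> List[Vector]:
    """Return one fused vector per character of a segmented sentence.

    Each character vector is concatenated with the vector of the word that
    contains the character, so every row has char_dim + word_dim entries.
    Unknown characters or words fall back to zero vectors (an assumption).
    """
    fused: List[Vector] = []
    for word in words:
        w_vec = word_vecs.get(word, [0.0] * word_dim)
        for ch in word:
            c_vec = char_vecs.get(ch, [0.0] * char_dim)
            # List concatenation builds a new list: [char part | word part].
            fused.append(c_vec + w_vec)
    return fused


# Toy 2-dimensional embeddings, invented purely for illustration.
char_vecs = {"跑": [0.1, 0.2], "道": [0.3, 0.4], "灯": [0.5, 0.6]}
word_vecs = {"跑道": [1.0, 2.0], "灯": [3.0, 4.0]}

# The sentence is assumed to be already segmented into words.
fused = fuse_char_word_vectors(["跑道", "灯"], char_vecs, word_vecs, 2, 2)
for row in fused:
    print(row)
```

In this sketch the sequence length stays equal to the number of characters, so the fused sequence could be passed unchanged to both a convolutional channel and a recurrent channel. How the paper segments text and trains its embeddings is not stated in the abstract.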