Research on text categorization for hidden dangers based on Bigram

doi:10.16265/j.cnki.issn1003-3033.2017.08.027

Abstract

Abstract: Traditional text classification research is not tailored to hidden danger records and performs poorly when applied to them. Enterprise hidden danger texts are also short, which makes feature units difficult to select. To extract and analyze useful information efficiently from large volumes of hidden danger text, and to better track how hidden dangers arise and change, a text categorization method for hidden dangers is proposed that uses Bigram two-character strings as feature units in combination with the support vector machine (SVM) data mining algorithm. The method was verified experimentally on the hidden danger records of Sima Coal Industry Co., Ltd. of Lu'an Group for 2009-2015. The results show that the new method achieves high precision, recall, and F-measure, and significantly improves categorization accuracy compared with traditional methods.

Key words: hidden danger, Bigram, feature unit, support vector machine (SVM), text categorization
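
As a rough illustration of the approach summarized in the abstract, the sketch below extracts overlapping two-character strings (Bigrams) from a short Chinese text and computes precision, recall, and F-measure for one class. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the sample sentence, and the class labels are hypothetical, and the SVM training step is omitted (the Bigram counts would be converted to feature vectors and passed to an SVM library).

```python
from collections import Counter


def char_bigrams(text):
    """Return overlapping two-character strings from text.

    Bigrams are formed only inside runs of letters/digits/CJK characters,
    so no Bigram spans a punctuation mark or whitespace.
    """
    bigrams = []
    segment = []
    for ch in text + " ":  # trailing space flushes the last segment
        if ch.isalnum():
            segment.append(ch)
        else:
            bigrams.extend(a + b for a, b in zip(segment, segment[1:]))
            segment = []
    return bigrams


def bigram_counts(text):
    """Bigram frequency counts, usable as a sparse feature vector."""
    return Counter(char_bigrams(text))


def precision_recall_f(y_true, y_pred, label):
    """Precision, recall, and F-measure (F1) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical hidden danger record (illustrative only, not from the data set).
sample = "顶板破碎,支护不及时"
features = bigram_counts(sample)

# Hypothetical class labels for four records.
y_true = ["roof", "gas", "roof", "gas"]
y_pred = ["roof", "roof", "roof", "gas"]
p, r, f = precision_recall_f(y_true, y_pred, "roof")
```

Because Bigrams are taken directly from the character sequence, this step needs no word segmentation, which is one reason two-character strings suit short texts where segmentation errors are costly.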

CLC number: X915.1

CHEN Xiaoci, TAN Zhanglu, SHAN Fei, GAO Qing. Research on text categorization for hidden dangers based on Bigram[J]. China Safety Science Journal, 2017, 27(8): 156-161.
