China Safety Science Journal ›› 2023, Vol. 33 ›› Issue (3): 111-117. doi: 10.16265/j.cnki.issn1003-3033.2023.03.0784

• Public Safety •

Research on weakly supervised crowd counting based on Swin Transformer

RAN Ruisheng, LI Jin, DONG Shuhong

  1. College of Computer & Information Science, Chongqing Normal University, Chongqing 401331, China
  • Received: 2022-10-16 Revised: 2023-01-09 Published: 2023-03-28
  • About the author:

    RAN Ruisheng (born 1976), male, from Chongqing, Ph.D., professor, mainly engaged in research on machine learning and computer vision.

  • Funding:
    General Program of the Chongqing Technology Innovation and Application Development Special Project (cstc2020jscx-msxmX0190); Key Project of Science and Technology Research of the Chongqing Municipal Education Commission (KJZD-K202100505)

Abstract:

To reduce the probability of safety accidents caused by crowd gathering, and to address both the high data annotation cost of fully supervised methods and the poor performance of existing weakly supervised methods, a weakly supervised crowd counting model based on Swin Transformer is proposed. First, a Transformer model, which has a global receptive field and can effectively extract semantic crowd information, is introduced to overcome the limited receptive field and poor performance of weakly supervised crowd counting methods based on convolutional neural networks (CNN). Then, the Swin Transformer model, which has a hierarchical design and computes image features in a multi-scale, hierarchical manner, is adopted as the backbone network to strengthen the learning of features at different scales, so that the model copes better with variations in crowd scale. Finally, the model is trained in a weakly supervised manner that requires only the crowd count as supervisory information, which avoids the tedious and error-prone work of annotating the head of every person in an image. The results show that the mean absolute errors of the proposed model on the ShanghaiTech Part A, ShanghaiTech Part B, and UCF-QNRF datasets are 66.1, 8.7, and 97.1, respectively, and the mean square errors are 106.2, 14.9, and 165.8, respectively, indicating good counting performance on mainstream datasets. The model outperforms previous weakly supervised methods and some fully supervised methods.

Key words: Swin Transformer, weakly supervised, crowd counting, convolutional neural network (CNN), datasets
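
The two evaluation metrics reported in the abstract can be illustrated with a minimal sketch. The abstract does not give their formulas, so this assumes the convention common in crowd counting benchmarks, where the "mean square error" is the square root of the mean squared difference between predicted and true counts. The function names and the sample counts are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Average absolute difference between predicted and true per-image counts."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def root_mean_squared_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Square root of the mean squared count difference.

    Assumption: this is the quantity the abstract calls "mean square error".
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))


# Hypothetical per-image counts, for illustration only (not data from the paper).
predicted = [102.0, 48.0, 310.0]
actual = [100.0, 50.0, 300.0]

mae = mean_absolute_error(predicted, actual)
rmse = root_mean_squared_error(predicted, actual)
print(f"MAE = {mae:.2f}, MSE (root form) = {rmse:.2f}")
```

Both metrics need only one scalar count per image, which is the same image-level information the weakly supervised setting uses for training.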