Research on weakly supervised crowd counting based on Swin Transformer

doi:10.16265/j.cnki.issn1003-3033.2023.03.0784

Abstract:

To reduce the probability of safety accidents caused by crowd gathering, and to address both the high annotation cost of fully supervised methods and the poor performance of existing weakly supervised methods, a weakly supervised crowd counting model based on Swin Transformer is proposed. First, a Transformer model, which has a global receptive field and can effectively extract semantic crowd information, is introduced to overcome the limited receptive field and poor performance of weakly supervised crowd counting methods based on convolutional neural networks (CNN). Then, the Swin Transformer, whose hierarchical design computes image features at multiple scales, is adopted as the backbone network to strengthen the learning of features at different scales, so that the model copes better with variations in crowd scale. Finally, the model is trained in a weakly supervised manner that requires only the number of people as supervision, which avoids the tedious and error-prone work of annotating the head of every person in an image. The results show that the mean absolute error (MAE) of the proposed model on the ShanghaiTech Part A, ShanghaiTech Part B, and UCF-QNRF datasets is 66.1, 8.7, and 97.1, respectively, and the mean square error (MSE) is 106.2, 14.9, and 165.8, respectively, indicating good counting performance on mainstream datasets. The model outperforms previous weakly supervised methods and some fully supervised methods.

Key words: Swin Transformer, weakly supervised, crowd counting, convolutional neural network (CNN), datasets
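The count-only supervision described in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss function, so the L1 loss below is an assumption, and the names `count_label` and `count_l1_loss` are hypothetical. The point is only that the training signal is a single number per image rather than a set of head positions.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # hypothetical (x, y) head annotation


def count_label(head_points: Sequence[Point]) -> float:
    """Reduce a point-level annotation to the count-level label.

    Weak supervision keeps only this number; the positions are discarded.
    """
    return float(len(head_points))


def count_l1_loss(pred_counts: Sequence[float], gt_counts: Sequence[float]) -> float:
    """Mean absolute difference between predicted and true counts.

    Assumption: an L1 regression loss; the source does not name its loss.
    """
    if len(pred_counts) != len(gt_counts) or len(pred_counts) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - g) for p, g in zip(pred_counts, gt_counts))
    return total / len(pred_counts)


# Toy batch of two images: the first label is derived from three
# annotated heads, the second is given directly as a count.
gt = [count_label([(10.0, 12.0), (40.5, 8.0), (33.0, 27.5)]), 120.0]
pred = [2.5, 126.0]  # made-up model outputs for illustration
loss = count_l1_loss(pred, gt)
print(loss)  # 3.25
```

In a real training loop the predicted counts would come from a regression head on top of the backbone; only the label format is shown here.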

RAN Ruisheng, LI Jin, DONG Shuhong. Research on weakly supervised crowd counting based on Swin Transformer[J]. China Safety Science Journal, 2023, 33(3): 111-117.


References

[1]	TOPKAYA I S, ERDOGAN H, PORIKLI F. Counting people by clustering person detector outputs[C]. 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2014: 313-318.
[2]	CHAN A B, VASCONCELOS N. Bayesian Poisson regression for crowd counting[C]. 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009: 545-551.
[3]	ZHANG Yingying, ZHOU Desen, CHEN Siqin, et al. Single-image crowd counting via multi-column convolutional neural network[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 589-597.
[4]	XU Dan, DAI Yong, JI Junhong. Research on driver behavior recognition method based on convolutional neural network[J]. China Safety Science Journal, 2019, 29(10): 12-17. doi: 10.16265/j.cnki.issn1003-3033.2019.10.003
[5]	XIONG Ruoxin, SONG Yuanbin, WANG Yuxuan, et al. Application of convolutional neural network-based 3D posture estimation in behavioral analysis of construction workers[J]. China Safety Science Journal, 2019, 29(7): 64-69. doi: 10.16265/j.cnki.issn1003-3033.2019.07.011
[6]	WU Si, ZHANG Xuguang, FANG Yinfeng. Method of crowd counting based on attention mechanism[J]. China Safety Science Journal, 2022, 32(1): 127-134. doi: 10.16265/j.cnki.issn1003-3033.2022.01.017
[7]	LI Yuhong, ZHANG Xiaofan, CHEN Deming. CSRNet: dilated convolutional neural networks for understanding the highly congested scenes[C]. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 1091-1100.
[8]	CHAN A B, LIANG Z S J, VASCONCELOS N. Privacy preserving crowd monitoring: counting people without people models or tracking[C]. 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008: 1-7.
[9]	GUO Bin, WANG Zhu, YU Zhiwen, et al. Mobile crowd sensing and computing: the review of an emerging human-powered sensing paradigm[J]. ACM Computing Surveys (CSUR), 2015, 48(1): 1-31.
[10]	SHENG Xiang, TANG Jian, XIAO Xuejie, et al. Leveraging GPS-less sensing scheduling for green mobile crowd sensing[J]. IEEE Internet of Things Journal, 2014, 1(4): 328-336. doi: 10.1109/JIOT.2014.2334271
[11]	YANG Yifan, LI Guorong, WU Zhe, et al. Weakly-supervised crowd counting learns from sorting rather than locations[C]. European Conference on Computer Vision. Springer, Cham, 2020: 1-17.
[12]	LEI Yinjie, LIU Yan, ZHANG Pingping, et al. Towards using count-level weak supervision for crowd counting[J]. Pattern Recognition, 2021, 109. doi: 10.1016/j.patcog.2020.107616
[13]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Advances in Neural Information Processing Systems, 2017: 5998-6008.
[14]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020. doi: 10.48550/arXiv.2010.11929
[15]	LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 10012-10022.
[16]	LIANG Dingkang, CHEN Xiwu, XU Wei, et al. TransCrowd: weakly-supervised crowd counting with transformers[J]. Science China Information Sciences, 2022, 65(6): 1-14.
[17]	LIN Min, CHEN Qiang, YAN Shuicheng. Network in network[J]. arXiv preprint arXiv:1312.4400, 2013. doi: 10.48550/arXiv.1312.4400
[18]	IDREES H, TAYYAB M, ATHREY K, et al. Composition loss for counting, density map estimation and localization in dense crowds[C]. Proceedings of the European Conference on Computer Vision (ECCV), 2018: 532-546.
[19]	LIU Xialei, VAN DE WEIJER J, BAGDANOV A D. Exploiting unlabeled data in CNNs by self-supervised learning to rank[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1862-1878. doi: 10.1109/TPAMI.2019.2899857 pmid: 30794168
[20]	KALYANI G, JANAKIRAMAIAH B, PRASAD L V, et al. Efficient crowd counting model using feature pyramid network and ResNeXt[J]. Soft Computing, 2021, 25(15): 10497-10507.
[21]	MA Zhiheng, WEI Xing, HONG Xiaopeng, et al. Bayesian loss for crowd count estimation with point supervision[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 6142-6151.
[22]	WANG Boyu, LIU Huidong, SAMARAS D, et al. Distribution matching for crowd counting[J]. Advances in Neural Information Processing Systems, 2020, 33: 1595-1607.
[23]	JIANG Xiaolong, XIAO Zehao, ZHANG Baochang, et al. Crowd counting and density estimation by trellis encoder-decoder networks[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6133-6142.
[24]	LIU Weizhe, SALZMANN M, FUA P. Context-aware crowd counting[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 5099-5108.
[25]	XIONG Haipeng, LU Hao, LIU Chengxin, et al. From open set to closed set: counting objects by spatial divide-and-conquer[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 8362-8371.
[26]	LIU Lingbo, QIU Zhilin, LI Guanbin, et al. Crowd counting with deep structured scale integration network[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 1774-1783.

Dataset	Resolution	Number of images	Max count	Min count	Average count	Total count
ShanghaiTech Part A	Varies	482	3139	33	501	241667
ShanghaiTech Part B	768×1024	716	578	9	123	88488
UCF-QNRF	Varies	1535	12865	49	815	1251642
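As a quick consistency check on the dataset statistics, the average count per image should equal the total count divided by the number of images. The sketch below recomputes it from the table values; that the table reports the average rounded down is an assumption inferred from the numbers, not something the source states.

```python
# Values copied from the dataset table; "reported_avg" is the table's
# "Average count" column.
datasets = {
    "ShanghaiTech Part A": {"images": 482, "total": 241667, "reported_avg": 501},
    "ShanghaiTech Part B": {"images": 716, "total": 88488, "reported_avg": 123},
    "UCF-QNRF": {"images": 1535, "total": 1251642, "reported_avg": 815},
}

checks = {}
for name, stats in datasets.items():
    exact_avg = stats["total"] / stats["images"]
    floored_avg = stats["total"] // stats["images"]  # assumed rounding rule
    checks[name] = floored_avg == stats["reported_avg"]
    print(f"{name}: exact={exact_avg:.2f}, floored={floored_avg}, "
          f"matches table={checks[name]}")
```

Under the round-down assumption all three rows agree with the table. With round-to-nearest, ShanghaiTech Part B would come out as 124 rather than the reported 123.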

Results on ShanghaiTech Part A
Method	Location label	Count label	MAE	MSE
MCNN^[3]	√	√	110.2	173.2
L2R^[19]	√	√	73.6	112.0
ResNeXtFP^[20]	√	√	69.3	104.7
CSRNet^[7]	√	√	68.2	115.0
SCANet^[6]	√	√	66.0	104.3
BL^[21]	√	√	62.8	101.8
DM-Count^[22]	√	√	59.7	95.7
Sorting^[11]	×	√	104.6	145.2
MATT^[12]	×	√	80.1	129.4
TransCrowd-G^[16]	×	√	70.5	115.1
TransCrowd-T^[16]	×	√	69.8	109.5
Ours	×	√	66.1	106.2

Results on ShanghaiTech Part B
Method	Location label	Count label	MAE	MSE
MCNN^[3]	√	√	26.4	41.3
L2R^[19]	√	√	13.7	21.4
CSRNet^[7]	√	√	10.6	16.0
SCANet^[6]	√	√	8.4	13.9
BL^[21]	√	√	7.7	12.7
DM-Count^[22]	√	√	7.4	11.8
Sorting^[11]	×	√	12.3	21.2
MATT^[12]	×	√	11.7	17.5
TransCrowd-T^[16]	×	√	10.6	23.0
TransCrowd-G^[16]	×	√	9.6	16.3
Ours	×	√	8.7	14.9

Results on UCF-QNRF
Method	Location label	Count label	MAE	MSE
MCNN^[3]	√	√	277.0	426.0
CL^[18]	√	√	132.0	191.0
L2R^[19]	√	√	124.0	196.0
TEDnet^[23]	√	√	113.0	188.0
CAN^[24]	√	√	107.0	183.0
S-DCNet^[25]	√	√	104.4	176.1
DSSI-Net^[26]	√	√	99.1	159.2
BL^[21]	√	√	88.7	154.8
DM-Count^[22]	√	√	85.6	148.3
Sorting^[11]	×	√	—	—
MATT^[12]	×	√	—	—
TransCrowd-T^[16]	×	√	99.7	172.3
TransCrowd-G^[16]	×	√	99.0	169.3
Ours	×	√	97.1	165.8
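The MAE and MSE columns in the result tables are not defined on this page. A minimal sketch of how such metrics are commonly computed in the crowd counting literature follows, under the assumption that "MSE" denotes the square root of the mean squared count error, as is conventional in this field. The function names and the toy counts are hypothetical.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute error between predicted and true per-image counts."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def mse_crowd(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Root of the mean squared count error.

    Assumption: this follows the crowd counting convention of reporting
    the square root under the name "MSE".
    """
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


# Toy predictions for three test images (made-up numbers).
pred_counts = [100.0, 210.0, 48.0]
true_counts = [110.0, 200.0, 50.0]

print(round(mae(pred_counts, true_counts), 2))        # 7.33
print(round(mse_crowd(pred_counts, true_counts), 2))  # 8.25
```

Because the squared term penalizes large errors more heavily, the second metric is always at least as large as the first, which is consistent with every row of the tables.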

Method	Parameters	FLOPs	Runtime
BL^[21]	21.5M	60.8G	—
DM-Count^[22]	21.5M	60.8G	—
MATT^[12]	16.3M	60.9G	—
TransCrowd^[16]	86.4M	49.3G	1.44 s
Ours	87.2M	44.5G	1.46 s
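To put the complexity figures in proportion, the relative differences between the proposed model and TransCrowd can be worked out directly from the table. The sketch below is plain arithmetic on the reported values; the helper name `relative_change` is hypothetical.

```python
def relative_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# Values copied from the complexity table (parameters in millions,
# FLOPs in giga-operations, runtime in seconds).
transcrowd = {"params_M": 86.4, "flops_G": 49.3, "runtime_s": 1.44}
ours = {"params_M": 87.2, "flops_G": 44.5, "runtime_s": 1.46}

changes = {key: relative_change(ours[key], transcrowd[key]) for key in ours}
for key, pct in changes.items():
    print(f"{key}: {pct:+.2f}%")
```

This prints `params_M: +0.93%`, `flops_G: -9.74%`, and `runtime_s: +1.39%`: the proposed model has slightly more parameters and a marginally longer runtime than TransCrowd, with roughly a tenth fewer FLOPs.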