Abstract: Spatio-temporal action detection networks, which need to simultaneously extract and fuse spatial and temporal features, often result in existing models becoming bloated and difficult to run ...