Crowd counting aims to estimate the number of people present in a crowded scene at a given instant, and many promising solutions have been proposed for single-image crowd counting. With video capture devices now ubiquitous in public safety applications, effectively applying crowd counting techniques to video content has become an urgent problem. In this paper, we introduce a novel framework based on temporal-aware modeling of the relationship between video frames. The proposed network contains a few dilated residual blocks, each consisting of layers that compute temporal convolutions over features from adjacent frames to improve the prediction. To reduce the computational cost and meet the demand for fast video crowd counting, we also introduce a lightweight network that balances computational cost against representation ability. We conduct experiments on crowd counting benchmarks and demonstrate the superiority of the proposed method over previous video-based approaches in terms of both effectiveness and efficiency.
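
To make the idea of a dilated residual block over adjacent frames concrete, the following is a minimal sketch in plain Python, not the network proposed here. It assumes each frame has already been reduced to a small feature vector, uses one scalar weight per temporal tap instead of learned channel-mixing filters, and zero-pads at the sequence boundaries. The names `dilated_temporal_conv` and `dilated_residual_block` are hypothetical and chosen for illustration only.

```python
from typing import List

Frame = List[float]  # hypothetical per-frame feature vector


def dilated_temporal_conv(frames: List[Frame], kernel: List[float],
                          dilation: int) -> List[Frame]:
    """Convolve along the time axis with a centered, zero-padded dilated kernel.

    Assumption: one scalar weight per temporal tap, shared by all channels.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd so it can be centered")
    half = len(kernel) // 2
    num_frames = len(frames)
    dim = len(frames[0]) if frames else 0
    out: List[Frame] = []
    for t in range(num_frames):
        acc = [0.0] * dim
        for k, weight in enumerate(kernel):
            # Taps are spaced `dilation` frames apart around frame t.
            src = t + (k - half) * dilation
            if 0 <= src < num_frames:  # out-of-range taps act as zero padding
                for c in range(dim):
                    acc[c] += weight * frames[src][c]
        out.append(acc)
    return out


def dilated_residual_block(frames: List[Frame], kernel: List[float],
                           dilation: int) -> List[Frame]:
    """Add the ReLU of the temporal convolution back onto the input features."""
    conv = dilated_temporal_conv(frames, kernel, dilation)
    return [[x + max(0.0, y) for x, y in zip(frame, conv_frame)]
            for frame, conv_frame in zip(frames, conv)]


# Five frames with one feature each; the kernel averages the two neighbors
# that lie two frames away on either side (dilation = 2).
frames = [[1.0], [2.0], [3.0], [4.0], [5.0]]
refined = dilated_residual_block(frames, kernel=[0.5, 0.0, 0.5], dilation=2)
```

Stacking such blocks with increasing dilation widens the temporal window each frame can draw on without adding taps per layer, which is the usual motivation for dilated convolutions.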