A New Large Scale Dynamic Texture Dataset with Application to ConvNet Understanding

Contributors

  • Isma Hadji
  • Richard P. Wildes

Overview

In this work we introduce a new large scale dynamic texture dataset. With over 10,000 videos, our Dynamic Texture DataBase (DTDB) is two orders of magnitude larger than any previously available dynamic texture dataset. DTDB comes with two complementary organizations, one based on dynamics independent of spatial appearance and one based on spatial appearance independent of dynamics. The complementary organizations allow for uniquely insightful experiments regarding the abilities of major classes of spatiotemporal ConvNet architectures to exploit appearance vs. dynamic information. We also present a new two-stream ConvNet that replaces the standard optical-flow-based motion stream with an alternative that broadens the range of dynamic patterns that can be captured. The resulting motion stream is shown to outperform the traditional optical flow stream by considerable margins. Finally, the utility of DTDB as a pretraining substrate is demonstrated via transfer learning on a different dynamic texture dataset, as well as on the companion task of dynamic scene recognition, resulting in a new state of the art.

A new dataset: The Dynamic Texture DataBase (DTDB)

The following video shows sample video sequences from the new Dynamic Texture DataBase and illustrates the specifications of the proposed dynamics-based vs. appearance-based categories of DTDB.

Please refer to our dataset page and our paper for a more detailed description of DTDB.

DTDB is available for download at the dataset page.

A new algorithm: MSOE-two-stream

In this work we propose a different outlook on the construction of two-stream spatiotemporal ConvNets. The standard two-stream architecture [3] operates in two parallel pathways, one for processing appearance (using RGB frames) and the other for motion (using stacks of optical flow fields).
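To make the layout concrete, the following is a minimal PyTorch sketch of a generic two-stream network in the spirit of [3]. The backbone, layer sizes, and names (small_convnet, TwoStream) are illustrative stand-ins, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn

def small_convnet(in_channels, num_classes):
    """A stand-in backbone; in practice either stream can use any 2D ConvNet."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, num_classes),
    )

class TwoStream(nn.Module):
    def __init__(self, num_classes, flow_stack=10):
        super().__init__()
        # Appearance stream: a single RGB frame (3 channels).
        self.appearance = small_convnet(3, num_classes)
        # Motion stream: a stack of optical flow fields (2 channels per frame pair).
        self.motion = small_convnet(2 * flow_stack, num_classes)

    def forward(self, rgb, flow):
        # Late fusion: average the per-stream class scores.
        return (self.appearance(rgb) + self.motion(flow)) / 2
```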

Problem: Optical flow is known to be a poor representation for many dynamic textures, especially those exhibiting decidedly non-smooth and/or stochastic characteristics.
fig_1
Figure 1. Sample optical flow fields extracted from a fireworks sequence.
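For reference, flow fields like those shown above are typically computed with an off-the-shelf estimator and stacked as motion-stream input. The sketch below uses OpenCV's Farneback method purely as an illustration; the helper name flow_stack and its shapes are assumptions, not part of this work.

```python
import cv2
import numpy as np

def flow_stack(gray_frames):
    """gray_frames: list of (H, W) uint8 frames; returns a (2 * (len - 1), H, W) stack."""
    channels = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        # Dense flow between consecutive frames; any flow estimator could be used here.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
        channels.extend([flow[..., 0], flow[..., 1]])  # horizontal, vertical components
    return np.stack(channels, axis=0)
```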

Solution: An interesting alternative to optical flow in the present context is appearance Marginalized Spatiotemporal Oriented Energy (MSOE) filtering [1]. This approach applies 3D, (x, y, t), oriented filters to a video stream to capture moving patterns along various directions. It also marginalizes appearance information, thereby abstracting away from spatial appearance and emphasizing dynamics (a simplified sketch is given below Figure 2).
fig_2
Figure 2. A sample of MSOE channels capturing 10 directions of motion (only 3 shown here) extracted from the same fireworks sequence.
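The following NumPy/SciPy sketch conveys the flavor of oriented spatiotemporal energy filtering with appearance marginalization. It is a simplification, not the filter bank of [1]: a squared directional derivative of a Gaussian-smoothed volume stands in for the oriented energy filters, and divisive normalization across directions stands in for the marginalization step; the function name msoe_like_channels is an assumption.

```python
import numpy as np
from scipy import ndimage

def msoe_like_channels(video, directions, sigma=2.0, eps=1e-6):
    """video: (T, H, W) grayscale array; directions: list of unit (dt, dy, dx) vectors."""
    smoothed = ndimage.gaussian_filter(video.astype(np.float32), sigma)
    grads = np.gradient(smoothed)                     # derivatives along t, y, x
    energies = []
    for d in directions:
        # Directional derivative along spacetime direction d, squared to form an energy.
        directional = sum(w * g for w, g in zip(d, grads))
        energies.append(directional ** 2)
    energies = np.stack(energies, axis=0)             # (num_directions, T, H, W)
    # Divisive normalization across directions discounts local contrast/appearance.
    return energies / (energies.sum(axis=0, keepdims=True) + eps)
```

With, say, ten unit directions sampled over spacetime, each output channel emphasizes structure oriented along one spacetime direction, loosely paralleling the multi-direction MSOE channels illustrated in Figure 2.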

MSOE-two-stream: As a novel two-stream architecture, we replace input optical flow stacks in the motion stream with stacks of MSOE filtering results. The resulting architecture is able to capture a wider range of dynamics in comparison to what can be captured by optical flow, while maintaining the ability to model appearance.
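Continuing the illustrative sketches above, the swap amounts to little more than changing the motion stream's input channel count; the class count and stack length below are placeholders.

```python
import torch

num_directions, stack_len, num_classes = 10, 5, 10    # placeholder values
model = TwoStream(num_classes)
# Motion stream now ingests stacked MSOE channels instead of optical flow.
model.motion = small_convnet(num_directions * stack_len, num_classes)

rgb = torch.randn(1, 3, 224, 224)                              # one appearance frame
msoe = torch.randn(1, num_directions * stack_len, 224, 224)    # stacked MSOE channels
scores = model(rgb, msoe)
```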

Empirical results

DTDB, in its two organizations, was used to better understand the strengths and weaknesses of learning-based spatiotemporal ConvNets and to evaluate the proposed MSOE-two-stream architecture.

tab_1
  • The dynamics organization of DTDB revealed that traditional motion networks (i.e. C3D [2] and the optical flow stream) are particularly hampered when similar appearances are present across different dynamic categories.
  • The complexity of sequences in the dataset showed that optical flow fails on most categories whose sequences violate the fundamental assumptions of optical flow (e.g. brightness constancy and local smoothness), while the MSOE stream remains robust in those situations.
  • The complementary organization of DTDB confirmed that two-stream networks are better able to disentangle motion from appearance information.
  • The proposed MSOE-two-stream architecture proved superior in capitalizing on both motion and appearance information.

Please refer to our paper for more experiments and detailed discussions.

Related Paper

I. Hadji and R. P. Wildes, "A new large scale dynamic texture dataset with application to ConvNet understanding," in ECCV, 2018.

References

[1] K. Derpanis and R. P. Wildes, "Spacetime texture representation and recognition based on spatiotemporal orientation analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 1193-1205, 2012.
[2] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in ICCV, 2015.
[3] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.
[4] I. Hadji and R. P. Wildes, "A spatiotemporal oriented energy network for dynamic texture recognition," in ICCV, 2017.

Last updated: October 3, 2018