Keywords:
Computer science
Abstract:
Video understanding is a fundamental task in computer vision and artificial intelligence (AI). The human visual system can rapidly recognize objects and their states within a dynamic scene, which lets us interact with the environment in real time. Representative applications include autopilot, sports analysis, and human-robot interaction. Convolutional Neural Networks (CNNs) have achieved great success in tasks such as image classification and object detection. Most previous work has focused on still images, but the real world changes continuously, so it is necessary to extend detection tasks from static images to videos. Unlike with a still image, understanding the content of a video requires a computer to follow how every object in the video changes along the temporal dimension, work out the relationships among those objects, and then infer what happens in the video. For instance, if a still image shows a person standing in front of a chair, the person might have just stood up or might be about to sit down, and it is hard to guess which action is actually taking place. With temporal information, however, the ambiguity disappears: if the person is sitting on the chair in the next several frames, it is easy to infer that this is a sit-down action. Following this idea, in this dissertation we explore video understanding through two detection tasks: 1) video object detection, which focuses on localizing and categorizing objects in videos while their appearances and locations may change across consecutive frames, and 2) action detection, which aims to localize where actions happen and categorize which actions they are. Action detection must consider not only the relationships among the objects the video contains, such as people and a football, but also how those relationships change across frames. We explore video object detection through the polyp localization task in colonoscopy videos.
We propose a two-stream structure that detects polyps using both spatial and temporal representations and then