The purpose of this project was to produce an image processing algorithm that could be used for image segmentation and feature extraction in every frame of a video feed, in real time. Challenges associated with moving cameras, low-resolution imagery, and significant variation in the orientation, scale, posture, and illumination of target objects made existing algorithms unsuitable, especially given the real-time constraint. The algorithm was intended to serve as input to a variety of localization, tracking, and identification algorithms, so it was desirable to detect multiple feature types. Detected features include color-based Maximally Stable Extremal Regions (MSER), Canny lines, corners, and region contours. So far, the algorithm (in various stages of completion) has been used in a handful of demonstrations, shown in the following pages. The 'Person Tracking' page shows initial attempts at foreground segmentation and object tracking. The 'Tabletop Tracking' page shows the algorithm being used for simple behavior identification. The 'Nao Interaction' page shows the same tabletop scenarios being run on a small bipedal robot, which is programmed to interact verbally and physically with the user. A brief description of the segmentation and feature detection algorithm is provided below.
Region Detection, Tracking, & Identification
The multiple components of the project's architecture are summarized below. Each part is illustrated using representative screenshots taken while a person interacted with objects related to the “homework” scenario. It should be noted that our primary motivation for building the system was to achieve real-time operation on a standard laptop while still providing a high degree of functionality. For every part of the system, more accurate algorithms likely exist; however, in most cases these algorithms are too computationally expensive for real-time operation, especially when combined into a complete system. Some of this work has been published previously in book chapters (Kelley, et al., 2013) (King, et al., 2012) (Espina, et al., 2011) (Kelley, Tavakkoli, King, Nicolescu, & Nicolescu, 2010), conference articles (Kelley, King, Ambardekar, Nicolescu, Nicolescu, & Tavakkoli, 2010) (Tavakkoli, Kelley, King, Nicolescu, Nicolescu, & Bebis, 2008) (Kelley, Tavakkoli, King, Nicolescu, Nicolescu, & Bebis, 2008) (Tavakkoli, Kelley, King, Nicolescu, Nicolescu, & Bebis, 2007) (King, Palathingal, Nicolescu, & Nicolescu, 2007) (King, Palathingal, Nicolescu, & Nicolescu, 2007), and journal articles (Kelley, Tavakkoli, King, Ambardekar, Nicolescu, & Nicolescu, 2012) (King, Palathingal, Nicolescu, & Nicolescu, 2009) (Kelley, King, Tavakkoli, Nicolescu, Nicolescu, & Bebis, 2008).
Region & Feature Detection
At the lowest feature level, the framework combines the detection of Canny edges (Canny, 1986) with that of color-based Maximally Stable Extremal Regions (MSER) (Forssen, 2007), and does so in a way that yields detection results surpassing those obtained by running either algorithm independently. Since the detected regions are identified both by their region-level properties and by the edges along their perimeters, only minimal additional processing is needed to extract further features, including line segments, corners, contours, and color histograms. Another advantage of the approach is that the set of detected regions provides dense image coverage, allowing it to be used for traditional image segmentation. If a sparse representation is desired, a stability threshold can be applied to identify the subset of regions considered most stable. In both cases, our segmentation offers multiple interpretations of how an image can be segmented, with relative confidence estimates produced for each.
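The sketch below is only a rough approximation of this idea using off-the-shelf OpenCV calls: OpenCV's MSER implementation operates on single-channel images, so the color-based detection is approximated by running the detector on each channel, and the joint region/edge growth described above is reduced to a simple edge-support check around each region's hull. The function name and the support threshold are illustrative assumptions, not the system's actual parameters.

import cv2
import numpy as np

def detect_regions_and_edges(bgr_frame):
    # Canny edge map over the grayscale frame.
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    # Slightly thicken edges so the support test tolerates small misalignment.
    edges_dilated = cv2.dilate(edges, np.ones((3, 3), np.uint8))

    # Per-channel MSER as a stand-in for color-based MSER.
    mser = cv2.MSER_create()
    regions = []
    for channel in cv2.split(bgr_frame):
        pts_list, _ = mser.detectRegions(channel)
        regions.extend(pts_list)

    # Keep only regions whose hull perimeter is well supported by Canny edges.
    supported = []
    for pts in regions:
        hull = cv2.convexHull(pts.reshape(-1, 1, 2))
        mask = np.zeros_like(gray)
        cv2.polylines(mask, [hull], True, 255, 1)
        perimeter_px = np.count_nonzero(mask)
        edge_px = np.count_nonzero(mask & edges_dilated)
        if perimeter_px and edge_px / perimeter_px > 0.3:   # assumed support threshold
            supported.append(pts)
    return supported, edges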

Region detection: Left image shows detected regions and lines (green). Right image shows segmented regions using best-fit ellipses.

Line & corner detection: Image shows the results of the perimeter lines & corners algorithm. The center picture shows the lines and corners from all regions. The surrounding pictures show the lines and corners of seven individual regions. Lines are shown in green, line end-points are shown as small green circles, and corners are shown using a red 'x'.

Color-perimeter representation: During region detection, smaller sub-regions are also identified. The sub-regions that lie along the perimeter are used as simple trackable features, and the image above displays these clusters. The center picture shows the perimeters from all detected regions. The surrounding pictures show the color-perimeters formed around eleven individual regions. Circles are shaded using the cluster’s average color. Green borders identify clusters on the internal edge of the perimeter; red identifies clusters on the outer edge. Every inner cluster is paired with one outer cluster.
Region Tracking
During the iterative growth of the detected regions, pixels are clustered into smaller sub-regions (containing 12 to 18 pixels each). The sub-regions found along each perimeter serve as the basic unit for tracking regions between frames. Each perimeter-segment is defined by the RGB color of a sub-region on the internal edge of the perimeter, the RGB color of a sub-region on the external edge of the perimeter, and a directional component extracted from the image gradient at that location. Perimeter-segments are matched against those from subsequent frames, and a greedy algorithm is used to estimate the most likely correspondence between frames.
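A minimal sketch of the greedy matching step is shown below, assuming each perimeter-segment is stored as a dictionary holding its inner RGB color, outer RGB color, and gradient direction (in radians). The distance weighting and cost cutoff are illustrative assumptions rather than the dissertation's actual parameters.

import numpy as np

def segment_distance(a, b, angle_weight=40.0):
    # Combine color differences with a wrapped angular difference.
    color_d = np.linalg.norm(a["inner_rgb"] - b["inner_rgb"]) + \
              np.linalg.norm(a["outer_rgb"] - b["outer_rgb"])
    angle_d = abs(np.angle(np.exp(1j * (a["direction"] - b["direction"]))))
    return color_d + angle_weight * angle_d

def greedy_match(prev_segments, curr_segments, max_cost=120.0):
    # Score every previous/current pair, then repeatedly accept the cheapest
    # pair whose endpoints are still unmatched (classic greedy assignment).
    pairs = [(segment_distance(p, c), i, j)
             for i, p in enumerate(prev_segments)
             for j, c in enumerate(curr_segments)]
    pairs.sort(key=lambda t: t[0])
    used_prev, used_curr, matches = set(), set(), []
    for cost, i, j in pairs:
        if cost > max_cost:
            break
        if i not in used_prev and j not in used_curr:
            used_prev.add(i)
            used_curr.add(j)
            matches.append((i, j))
    return matches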

Cluster tracking: Image illustrates inter-frame cluster matching. Green lines extend from each cluster in the current frame to the matching cluster in the previous frame.

Flow fields: Image shows the detected flow fields at different points in the "Homework" sequence. A red '+' marks image patches that have sufficient texture for motion estimation. Blue lines represent the most prominent motion vector found in each patch.

Region tracking: Images show movement lines associated with each region at different points in the "Homework" sequence. Red circles show the region centers. Green lines show computed motion between frames.
Foreground Segmentation
When foreground models of objects of interest are available, they are used directly to segment foreground objects from the background. When foreground models are not known, segmentation is achieved by identifying objects whose motion differs significantly from that of the background.
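The motion-based case can be sketched as follows, under the assumption that each region already carries an inter-frame motion vector from the tracking stage. The dominant background motion is estimated here with a simple per-component median, and the deviation threshold is an illustrative value rather than the system's actual one.

import numpy as np

def segment_foreground(region_motions, threshold=3.0):
    # region_motions: dict of region_id -> (dx, dy) motion in pixels.
    vectors = np.array(list(region_motions.values()), dtype=float)
    background_motion = np.median(vectors, axis=0)   # camera/background motion estimate
    return {
        rid: np.linalg.norm(np.asarray(v, dtype=float) - background_motion) > threshold
        for rid, v in region_motions.items()         # True = treated as foreground
    }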

Foreground segmentation: Images show regions that the system has determined to be foreground. Regions displaying consistent motion are shown in color. All other regions are shown in blue.
Foreground Classification
Regions are modeled and classified using simple color histograms compiled from their constituent pixel-clusters over multiple frames. Since complex object recognition and behavioral analysis were not the focus of this dissertation, simple labels were manually assigned to the stored models (e.g., book, food, drink).
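A hedged sketch of this histogram-based labelling is shown below; the HSV color space, bin counts, correlation metric, and similarity threshold are assumptions chosen for illustration, not the exact choices used in the system.

import cv2
import numpy as np

def region_histogram(bgr_frame, region_mask, bins=(16, 16)):
    # region_mask: uint8 mask of the region's pixels, same size as the frame.
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], region_mask, list(bins), [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def classify_region(hist, labelled_models, min_similarity=0.5):
    # labelled_models: dict of label ('book', 'food', 'drink', ...) -> model histogram.
    best_label, best_score = None, -1.0
    for label, model_hist in labelled_models.items():
        score = cv2.compareHist(hist.astype(np.float32),
                                model_hist.astype(np.float32),
                                cv2.HISTCMP_CORREL)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= min_similarity else None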

Foreground classification: Image shows classified regions highlighted using color-coded ellipses with object names displayed to the right using the same colors.
Action Classification
Foreground objects are classified as either ‘active objects’ (those observed to move under their own power) or ‘passive objects’ (those observed to move only when in close proximity to another object). Interactions are computed between object pairs that contain at least one ‘active object’, and are a function of proximity and the relative direction of the movement vectors. They are classified into three simple categories. When both objects are ‘active’, the categories are ‘converging’, ‘diverging’, and ‘moving together’. When one object is ‘active’ and the other is ‘passive’, the categories are ‘reaching’, ‘dropping’, and ‘carrying’. For demonstration purposes, only simple predefined behaviors are recognized. Behaviors are assigned to every object/interaction combination; when a combination is observed with relatively high probability over a sufficient number of frames, the system labels the behavior and initiates the response sequence.
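The active/active case can be sketched as below, assuming each object carries a position and a motion vector from the tracking stage; the dot-product test and the distance and speed thresholds are illustrative stand-ins for the actual proximity and direction rules.

import numpy as np

def classify_interaction(pos_a, vel_a, pos_b, vel_b,
                         near_dist=80.0, speed_eps=1.0):
    pos_a, vel_a = np.asarray(pos_a, float), np.asarray(vel_a, float)
    pos_b, vel_b = np.asarray(pos_b, float), np.asarray(vel_b, float)
    separation = pos_b - pos_a
    # Positive when the relative motion of A toward B is closing the gap.
    closing_speed = np.dot(vel_a - vel_b, separation) / (np.linalg.norm(separation) + 1e-6)

    if np.linalg.norm(vel_a - vel_b) < speed_eps and np.linalg.norm(separation) < near_dist:
        return "moving together"   # similar motion while close together
    if closing_speed > speed_eps:
        return "converging"        # net motion reduces the separation
    if closing_speed < -speed_eps:
        return "diverging"         # net motion increases the separation
    return None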

Action classification: Image shows classified regions highlighted using color-coded ellipses with labels displayed to the right of the image showing the object name and the likely interaction that is occurring with that object. The apparent “height” of the label is a measure of the interaction likelihood. In this frame, the system has identified the interaction between the “hand” and the “book” as being the most likely, and has drawn a blue line to connect the object pair.
Automated Response
Predefined responses were assigned to behaviors in some of the demonstrations. When one of these behaviors is identified, the system initiates a response sequence involving investigation by one or more robots.
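A minimal sketch of this behavior-to-response dispatch is given below; the behavior names, probability threshold, frame count, and response callables are placeholders rather than the system's actual interface.

RESPONSES = {
    # Placeholder mapping; a real response would dispatch one or more robots.
    "unauthorized laptop use": lambda: print("send patrol robot to photograph the person"),
}

def maybe_respond(behaviour, probability, frames_observed,
                  min_prob=0.8, min_frames=15):
    # Trigger only when the behavior has been observed with high probability
    # over a sufficient number of frames, as described above.
    if probability >= min_prob and frames_observed >= min_frames:
        action = RESPONSES.get(behaviour)
        if action:
            action()
            return True
    return False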

Robot response: Image shows a person who opened a laptop that he was not authorized to use. The system sent a patrol robot to take high-resolution pictures of the person.
References
Canny, J. (1986). A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679-698.
Espina, M., Grech, R., Jager, D., Remagnino, P., Iocchi, L., Marchetti, L., et al. (2011). Multi-Robot Teams for Environmental Monitoring. Innovations in Defence Support Systems – Intelligent Paradigms in Security, Springer-Verlag, 183-209.
Forssen, P.-E. (2007). Maximally Stable Colour Regions for Recognition and Matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Kelley, R., King, C., Ambardekar, A., Nicolescu, M., Nicolescu, M., & Tavakkoli, A. (2010). Integrating Context into Intent Recognition Systems. Proceedings of the International Conference on Informatics in Control, Automation and Robotics, (pp. 315-320). Funchal, Madeira, Portugal.
Kelley, R., King, C., Tavakkoli, A., Nicolescu, M., Nicolescu, M., & Bebis, G. (2008). An Architecture for Understanding Intent Using a Novel Hidden Markov Formulation. International Journal of Humanoid Robotics - Special Issue on Cognitive Humanoid Robots, 5(2), 203-224.
Kelley, R., Tavakkoli, A., King, C., Ambardekar, A., Nicolescu, M., & Nicolescu, M. (2012). Context-Based Bayesian Intent Recognition. IEEE Transactions on Autonomous Mental Development - Special Issue on Biologically-Inspired Human-Robot Interactions, 4(3), 215-225.
Kelley, R., Tavakkoli, A., King, C., Ambardekar, A., Wigand, L., Nicolescu, M., et al. (2013). Intent Recognition for Human-Robot Interaction. Plan, Activity, and Intent Recognition, Elsevier.
Kelley, R., Tavakkoli, A., King, C., Nicolescu, M., & Nicolescu, M. (2010). Understanding Activities and Intentions for Human-Robot Interaction. Human-Robot Interaction, 288-305.
Kelley, R., Tavakkoli, A., King, C., Nicolescu, M., Nicolescu, M., & Bebis, G. (2008). Understanding Human Intentions via Hidden Markov Models in Autonomous Mobile Robots. Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, (pp. 367-374). Amsterdam, Netherlands.
King, C., Palathingal, X., Nicolescu, M., & Nicolescu, M. (2007). A Vision-Based Architecture for Long-Term Human-Robot Interaction. Proceedings of the International Conference on Human-Computer Interaction, (pp. 1-6). Chamonix, France.
King, C., Palathingal, X., Nicolescu, M., & Nicolescu, M. (2007). A Control Architecture for Long-Term Autonomy of Robotic Assistants. Proceedings of the International Symposium on Visual Computing, (pp. 375-384). Lake Tahoe, Nevada.
King, C., Palathingal, X., Nicolescu, M., & Nicolescu, M. (2009). A Flexible Control Architecture for Extended Autonomy of Robotic Assistants. Journal of Physical Agents, 3(2), 59-69.
King, C., Valera, M., Grech, R., Mullen, R., Remagnino, P., Iocchi, L., et al. (2012). Multi-Robot and Multi-Camera Patrolling. Handbook on Soft Computing for Video Surveillance, 255–286.
Tavakkoli, A., Kelley, R., King, C., Nicolescu, M., Nicolescu, M., & Bebis, G. (2007). A Vision-Based Architecture for Intent Recognition. Proceedings of the International Symposium on Visual Computing, (pp. 173-182). Lake Tahoe, Nevada.
Tavakkoli, A., Kelley, R., King, C., Nicolescu, M., Nicolescu, M., & Bebis, G. (2008). A Visual Tracking Framework for Intent Recognition in Videos. Proceedings of the International Symposium on Visual Computing, (pp. 450-459). Las Vegas, Nevada.