Gesture-Based Automatic Picture Taking

2011, Stanford University

OpenCV, Matlab, Image Processing, Computer Vision, Android, Viola-Jones, MAP Classifier

A Stanford EE368 Digital Image Processing project exploring hands-free group photography. The challenge: when everyone needs to be in the picture, no one is available to press the shutter — and camera timers mean sprinting into frame for every shot.

The solution is a two-phase algorithm. Viola-Jones face detection first waits for the subjects to hold still. A skin-color MAP classifier then looks for a specific hand gesture (two hands forming an enclosed shape below the face) and triggers the camera automatically. Face detection ran in real time on Android via OpenCV, while gesture recognition was prototyped in Matlab.

Read the paper

Live Demo

Experience the algorithm with your own camera. After face detection and stability check, a skin-color classifier creates a binary mask of the chest region. Region labeling then searches for a non-skin area enclosed by skin — indicating two hands forming a closed shape. Adjust the MAP threshold to tune sensitivity.


How It Works

Camera Frame Capture

The Android device camera continuously captures preview frames at 640×480 resolution. Each frame is passed to OpenCV for processing through the detection pipeline.

Face Detection (Viola-Jones)

A Haar-like feature cascade scans the frame with a sliding window at multiple scales. Integral images enable constant-time feature evaluation regardless of window size. Detected faces produce bounding boxes with position coordinates.
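
To illustrate why integral images make feature evaluation constant-time, here is a minimal pure-Python sketch. It is not the project's code (the project used OpenCV's cascade on Android); the function names and the choice of a horizontal two-rectangle feature are assumptions made for illustration.

```python
def integral_image(img):
    """Return an (h+1) x (w+1) summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above and to the left, inclusive of (x, y).
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

Once the table is built, `rect_sum` costs the same four lookups whether the rectangle covers 4 pixels or 40,000, which is what lets the detector rescale its window cheaply.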

Face Stability Check

Face bounding box positions are compared across consecutive frames using root mean square (RMS) difference. When the RMS drops below a threshold for several frames, subjects are considered stationary and the system advances to gesture detection.
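
The stability check can be sketched as follows. This is an illustrative version only: the threshold of 5 pixels, the 10-frame requirement, the `(x, y, w, h)` box layout, and the rule that a change in face count resets the counter are all assumptions, not values from the project.

```python
import math


def rms_difference(prev_boxes, curr_boxes):
    """RMS of coordinate differences between two equally long lists of (x, y, w, h) boxes."""
    if len(prev_boxes) != len(curr_boxes) or not curr_boxes:
        return float("inf")  # face count changed or no faces: treat as unstable
    squared = [
        (a - b) ** 2
        for prev, curr in zip(prev_boxes, curr_boxes)
        for a, b in zip(prev, curr)
    ]
    return math.sqrt(sum(squared) / len(squared))


class StabilityChecker:
    """Counts consecutive frames whose RMS box movement stays below a threshold."""

    def __init__(self, rms_threshold=5.0, required_frames=10):
        self.rms_threshold = rms_threshold      # assumed value, in pixels
        self.required_frames = required_frames  # assumed value
        self.prev = None
        self.stable_count = 0

    def update(self, boxes):
        """Feed one frame's boxes; return True once the scene has been still long enough."""
        boxes = sorted(boxes)  # sort so boxes pair up consistently between frames
        if self.prev is not None and rms_difference(self.prev, boxes) < self.rms_threshold:
            self.stable_count += 1
        else:
            self.stable_count = 0
        self.prev = boxes
        return self.stable_count >= self.required_frames
```

Calling `update` once per preview frame keeps the logic independent of the detector: any source of face boxes can drive it.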

Skin Color Detection (MAP Classifier)

A Maximum A Posteriori classifier trained on RGB pixel distributions identifies skin-colored regions in the area below detected faces. A tunable threshold parameter controls the sensitivity of skin vs. non-skin classification.
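
One common way to build such a classifier is with quantized RGB histograms, sketched below. The project does not state how its distributions were modeled, so the histogram approach, the 8 bins per channel, the add-one smoothing, and the class and method names are all assumptions for illustration.

```python
from collections import Counter

BINS = 8  # assumption: 8 bins per channel, i.e. 32 intensity levels per bin


def rgb_bin(pixel, bins=BINS):
    """Map an (r, g, b) pixel with 0-255 channels to its histogram bin."""
    step = 256 // bins
    r, g, b = pixel
    return (r // step, g // step, b // step)


class SkinMapClassifier:
    """Histogram-based estimate of P(skin | rgb), compared against a threshold."""

    def __init__(self, bins=BINS):
        self.bins = bins
        self.skin = Counter()
        self.non_skin = Counter()

    def train(self, pixels, labels):
        """Accumulate per-class histograms from labeled pixels (True means skin)."""
        for pixel, is_skin in zip(pixels, labels):
            target = self.skin if is_skin else self.non_skin
            target[rgb_bin(pixel, self.bins)] += 1

    def posterior(self, pixel):
        """P(skin | rgb) via Bayes' rule with add-one smoothing. Call train() first."""
        key = rgb_bin(pixel, self.bins)
        n_skin = sum(self.skin.values())
        n_non = sum(self.non_skin.values())
        n_bins = self.bins ** 3
        like_skin = (self.skin[key] + 1) / (n_skin + n_bins)
        like_non = (self.non_skin[key] + 1) / (n_non + n_bins)
        prior_skin = n_skin / (n_skin + n_non)
        numerator = like_skin * prior_skin
        return numerator / (numerator + like_non * (1 - prior_skin))

    def is_skin(self, pixel, threshold=0.5):
        """A threshold of 0.5 is the plain MAP decision; raising it makes the mask stricter."""
        return self.posterior(pixel) > threshold
```

With the threshold at 0.5 this picks whichever class has the larger posterior. Moving the threshold trades missed skin pixels against false skin pixels, which is the role the tunable parameter plays in the description above.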

Mask Post-Processing & Region Labeling

Small isolated regions are removed and morphological dilation fills gaps in the binary skin mask. Connected-component labeling then identifies distinct regions, searching for a non-skin region fully enclosed by skin — indicating hands forming a closed shape.
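
The enclosed-region search can be sketched with a flood fill over the non-skin pixels: any non-skin component that never reaches the edge of the mask must be surrounded by skin. The code below is a pure-Python illustration, not the Matlab prototype; the 3×3 dilation kernel, the 4-connectivity, and the `min_area` parameter are assumptions.

```python
from collections import deque


def dilate(mask):
    """3x3 binary dilation: a pixel becomes 1 if it or any 8-neighbor is 1."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                mask[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ))
    return out


def enclosed_non_skin_regions(mask, min_area=1):
    """Return the areas of non-skin (0) regions that never touch the mask border."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    areas = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill over one 4-connected non-skin component.
            queue = deque([(sy, sx)])
            seen[sy][sx] = True
            area, touches_border = 0, False
            while queue:
                y, x = queue.popleft()
                area += 1
                if y in (0, h - 1) or x in (0, w - 1):
                    touches_border = True
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if not touches_border and area >= min_area:
                areas.append(area)
    return areas
```

A non-empty result means some non-skin area is sealed off by skin. Raising `min_area` filters out small holes caused by classification noise rather than by a deliberate hand shape.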

Gesture Recognition & Capture

When an enclosed non-skin region is detected within the skin mask, the algorithm recognizes the “ready” gesture. A short countdown then runs, after which the camera captures the image automatically.

Technical Highlights

Two-Phase Detection Pipeline

Separating face detection from gesture detection reduced false positives significantly. Gesture processing only activates after face positions stabilize, which avoids wasting computation on moving subjects. The prototype was split across two platforms: OpenCV on Android for real-time face detection, and Matlab for rapid iteration on gesture recognition. This validated the full pipeline before committing to a complete mobile implementation.
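
The gating between the two phases can be pictured as a small state machine, sketched below. The phase names, the `step` function, and the `gesture_detector` callable are hypothetical. Falling back to the first phase when subjects move again is also an assumption, not something the project description confirms.

```python
from enum import Enum, auto


class Phase(Enum):
    WAIT_FOR_STILL_FACES = auto()
    WAIT_FOR_GESTURE = auto()
    CAPTURE = auto()


def step(phase, faces_stable, gesture_detector):
    """Advance the pipeline by one frame and return the next phase.

    gesture_detector is a hypothetical zero-argument callable that runs the
    (comparatively expensive) skin-mask analysis and returns True or False.
    It is only ever called while waiting for the gesture.
    """
    if phase is Phase.WAIT_FOR_STILL_FACES:
        return Phase.WAIT_FOR_GESTURE if faces_stable else phase
    if phase is Phase.WAIT_FOR_GESTURE:
        if not faces_stable:
            return Phase.WAIT_FOR_STILL_FACES  # assumed: movement restarts phase one
        return Phase.CAPTURE if gesture_detector() else phase
    return Phase.WAIT_FOR_STILL_FACES  # after a capture, start over
```

Because the detector is passed in as a callable, it is skipped entirely on frames where faces are still moving, which mirrors the computational saving described above.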

Efficient Feature Cascade & Skin Model

The Viola-Jones cascade with integral images enables near-real-time face detection on mobile hardware, with early rejection stages discarding non-face windows quickly. The RGB-space MAP skin classifier, trained on diverse skin tones, provides more robust segmentation than simple hue thresholding — with a tunable threshold to balance precision and recall across lighting conditions.

Improved Version: Hand Landmark Detection

Curious how this concept could work with modern ML? An alternate version replaces pixel-level skin classification with MediaPipe hand landmark tracking, detecting 21 landmarks per hand for precise fingertip positioning.

Try the advanced demo