Python 010: Augmented Reality with OpenCV

This will help you:

Analyze images or videos to identify features (such as faces or eyes), and add images which follow the detected features.

Computer vision is a complicated subject, but recent advances in computer vision technology have enabled computers to identify all kinds of things from a camera feed. On top of that, there is computer vision software available to the public so that anyone with a little knowledge of code can use computer vision for a project. One example of such software is OpenCV, a library which can be used with Python to analyze images. In this activity, you'll use OpenCV to detect faces and eyes in a video, overlay one image onto another, and finally, add an overlay to a video that "sticks on" to your face.

Time: 2-4 hours / Level: C1

You should already:

  • Be familiar with Python functions and slice notation for indexing lists.

  • Install NumPy: try typing pip install numpy in the terminal.

  • Install OpenCV: following OpenCV's official installation instructions can be complicated, so try typing pip install opencv-python in the terminal instead.

Get the code and resources for this activity from the activity's Google Drive folder. Download the folder, unzip it, and save it in a sensible location.

Key terms:

  • Capture - a stream of images from the camera, processed one by one.

  • Image - a multi-dimensional array containing data for each color channel at each x and y value. See Step 3 explanation.

  • Multi-dimensional array - a list is a 1D array. A grid can be represented by a 2D array. Images in OpenCV are generally 3D arrays.

  • Color channel - one component of a color model, for example, the red values in RGB color.

  • Slice indexing - indicating positions in an array using square brackets [] and indexes (locations) in the array, for example my_list[1:3].

  • Classifier - data saved from training a machine learning model to recognize a type of image, so that the trained model can be re-created to recognize images.

  • Feature - a type of image or shape which a classifier data set is trained to recognize, under some visibility conditions and with some accuracy.

  • AR/augmented reality - taking sensory information from the real world and combining it with computer-created sensory information, in this case, adding computer images to a webcam feed.

  • Overlay/filter - in our usage, a computerized image placed over another image or video.

Step 1: Warm-up - Show a video

Open and read the comments to try to understand what is happening. Then, run the program by running python in the terminal.

If your computer has a built-in webcam which is properly configured, cv2.VideoCapture(0) should create a video capture feed from it. If you have a secondary camera, for example a USB webcam in addition to one built-in, you may pass in a different device number, such as cv2.VideoCapture(1). To stop the program, hit 'Q' on your keyboard (while focused on the display window).

There is a block of code responsible for (optionally) saving the video that you record. It assumes you are running the program in a folder with a subfolder named "output". It uses the mp4 video format by default, but if you can't view the output video, try a different format. It uses VideoCapture.get(), cv2.VideoWriter, and VideoWriter.write(). Some of the code in this activity comes from this tutorial on showing and saving videos.

Make sure you understand this code before moving on.

Step 2: Warm-up - Detect faces

Open and read the comments to try to understand what is happening. Then, run the program by running python in the terminal.

Some of the code in this activity comes from this tutorial on detecting faces. The classifier files included in this activity come from here.

There is a face classifier file included in the classifiers folder with this activity. Download a different classifier online by searching "Haar cascade classifier ____" for whatever you want to classify. You should download an .xml file and put it in the classifiers folder. Then, change the file path in the code where it says classifier='classifiers/????.xml'. Run the program again and see if the classifier data you found is effective.

Currently, there is a minimum size for detected features. If features are not showing up, you can decrease it. You can also set a maximum size using maxSize=(w,h). These may help to narrow down what is detected, but they may cause you to miss detections.

Make sure you understand this code before moving on.

Step 3: Warm-up - Overlay images

Open and read the comments to try to understand what is happening. Then, run the program by running python in the terminal.

You can (and should) read more about image arithmetic here. It goes through an example of creating a mask and using it to overlay two images.

Images in OpenCV are treated as 3-dimensional arrays. Imagine a 3-dimensional spreadsheet, where each row represents a row of pixels in the image, each column represents a column of pixels, and the depth (or spreadsheet tab) indicates one of the color channels, like blue, green and red in BGR color. When OpenCV loads a PNG with transparency using cv2.IMREAD_UNCHANGED, it adds a fourth channel: blue, green, red and alpha (transparency) in BGRA color.

OpenCV allows indexing images by row, column, and channel: image[0, 0, 1], for example, gets the green value in the top-left pixel of a BGR image. A bare : in an index means "everything along this axis", and 20: means "from index 20 to the end", so image[20:, :, :] cuts off the top 20 rows of pixels, while image[:, 20:, :] cuts off the left 20 columns. overlay[:, :, 3], as used in this code, gets every row and column but only the 4th color channel, which is transparency for a PNG image. A value of 0 is totally transparent, so the mask excludes every pixel where the overlay image is transparent.
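The indexing rules above can be checked with plain NumPy, no camera needed. This sketch builds a blank image and a made-up 10 x 10 BGRA overlay in place of a real PNG.

```python
import numpy as np

# A blank BGR image: 100 rows (height) x 200 columns (width) x 3 channels
image = np.zeros((100, 200, 3), dtype=np.uint8)

image[0, 0, 1] = 255            # row 0, column 0, channel 1 = green in BGR
print(image[0, 0].tolist())     # [0, 255, 0]

top_cropped = image[20:, :, :]   # drop the top 20 rows
left_cropped = image[:, 20:, :]  # drop the left 20 columns
print(top_cropped.shape)         # (80, 200, 3)
print(left_cropped.shape)        # (100, 180, 3)

# A BGRA overlay, as if loaded from a PNG with cv2.IMREAD_UNCHANGED
overlay = np.zeros((10, 10, 4), dtype=np.uint8)
overlay[:, :5, 3] = 255              # left half opaque, right half transparent
mask = overlay[:, :, 3]              # every row and column, alpha channel only
print(mask.shape)                    # (10, 10)
print(int(np.count_nonzero(mask)))   # 50 pixels will be kept by the mask
```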

See what happens if you change the indices in mask = overlay[:, :, 3], or swap some uses of mask and mask_inv. Try using cv2.imshow("label", ______) to show some of the intermediate images: roi, mask, mask_inv, overlay, img_bg, img_fg, or dst.

Make sure you understand this code before moving on.

Step 4: Activity - Detect multiple features

Open and read the comments to try to understand what is happening. Then, run the program by running python in the terminal.

This program defines a separate function, detectAndOverlay(), for processing a single frame. It does all the complicated work - it flips the frame, detects faces in it, and detects eyes within the region of interest (ROI) of the face. Then it draws ellipses over those features and returns the modified frame.

The main loop just does the routine things - loading classifier files, creating a video capture stream, reading the stream, calling the above function, and showing the modified video.

The code in this activity is based on this tutorial. It provides an explanation of how feature detection works in OpenCV.

Try modifying the code to detect a different facial feature using the included classifier files, or some you find on your own. You could try detecting just the mouth, or both eyes as one feature (currently, they're detected as multiple occurrences of one feature). If you're looking for a feature that should only occur once, add a break statement at the end of the innermost loop. The loop is convenient because the results come as a list, but sometimes you only want to display the first thing found.
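Here is the break idea in isolation, as a hypothetical helper that converts eye rectangles (found inside a face ROI) back to frame coordinates. With only_first=True the loop stops after the first detection.

```python
def eye_centers(face, eyes, only_first=False):
    """Convert eye rectangles found inside a face ROI to frame coordinates.

    face is (x, y, w, h) in the frame; each eye is (x, y, w, h) in the face ROI.
    """
    fx, fy, _, _ = face
    centers = []
    for (ex, ey, ew, eh) in eyes:
        # Add the face's top-left corner to get back to frame coordinates
        centers.append((fx + ex + ew // 2, fy + ey + eh // 2))
        if only_first:
            break  # a feature that should occur once: keep only the first hit
    return centers


face = (10, 20, 100, 100)
eyes = [(5, 5, 10, 10), (50, 5, 10, 10)]
print(eye_centers(face, eyes))                   # [(20, 30), (65, 30)]
print(eye_centers(face, eyes, only_first=True))  # [(20, 30)]
```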

Step 5: Activity - AR filters

Open and read the comments to try to understand what is happening. Start near the bottom, after return frame: everything above that line is the detectAndOverlay() function, which is long because it has to position the overlay.

The first section creates classifiers, just like in Steps 2 and 4. The next section loads an image and creates a mask, just like in Step 3. The background will come from the video, so we don't need to deal with it, and because we're going to modify the mask to fit the video at each frame, we add orig_ to the names of the original masks. The step of combining the masks and images comes at the end of detectAndOverlay() (scroll up a bit).

After that, the routine video steps happen: create a feed, read frames, process, show. There's also code for creating a file to save the video, like in Step 1.

Now, run the program by running python in the terminal.

Once you understand the program except for the frame processing step, scroll up and read through detectAndOverlay(). The first, outer loop just finds a rectangle with a face, selects it out of the frame by indexing the appropriate x and y coordinates, and then detects eyes in that region of interest. The inner loop deals with the eyes detected - in this case, they're detected as a single wide, rectangular region, rather than 2. The rest of the code resizes and positions the overlay, making sure it fits in the face region of interest. Then the overlay, face region of the image, and masks are combined. The region of the face modified by the mask is pasted back in, and the modified image is returned.
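Selecting the face region and pasting it back both rely on how NumPy slicing works, which this plain-NumPy sketch demonstrates with a made-up face rectangle. Note that rows (y) come before columns (x) when indexing, even though OpenCV reports rectangles as (x, y, w, h).

```python
import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)
x, y, w, h = 200, 100, 120, 150  # a made-up face rectangle as (x, y, w, h)

# Rows come first, so the face ROI is frame[y:y+h, x:x+w], not frame[x:x+w, y:y+h]
face_roi = frame[y:y + h, x:x + w]
print(face_roi.shape)             # (150, 120, 3)

# Slicing gives a view: writing into it in place changes the frame too
face_roi[:10, :, :] = 200
print(frame[y, x].tolist())       # [200, 200, 200]

# Arithmetic that returns a new array does NOT change the frame...
brighter = face_roi + 1
print(frame[y + 20, x].tolist())  # [0, 0, 0]

# ...so the result has to be pasted back in
frame[y:y + h, x:x + w] = brighter
print(frame[y + 20, x].tolist())  # [1, 1, 1]
```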

The code in this activity is based on the tutorial here.

Step 6: Make it your own

Download your own PNG images to use as overlays, and add them to any video or image you want. You'll likely want to change what feature (or features) are being detected, using the included classifier files or ones you find online.

You'll probably need to change the scaling and alignment of the overlay. Do this where center_x and center_y are first set for the alignment, and where overlay_width and overlay_height are first set for the scaling. You may want to change which of width and height are computed first; just swap the words 'width' and 'height' in those two lines and the math will work out.
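The scaling and alignment math can be sketched as a small pure-Python function. The scale factor, the choice to center the overlay on the eye region, and the name place_overlay are assumptions; the activity's code may position things differently. The variable names center_x, center_y, overlay_width and overlay_height follow the ones mentioned above.

```python
def place_overlay(ex, ey, ew, eh, roi_w, roi_h, orig_w, orig_h, scale=1.5):
    """Size and center an overlay on an eye region inside a face ROI.

    (ex, ey, ew, eh) is the eye rectangle relative to the face ROI, roi_w and
    roi_h are the ROI size, and orig_w, orig_h are the overlay image's size.
    Returns (x1, y1, x2, y2) in ROI coordinates, clamped to the ROI.
    """
    # Scaling: width from the eye region, height from the overlay's aspect ratio
    overlay_width = int(ew * scale)
    overlay_height = int(overlay_width * orig_h / orig_w)

    # Alignment: center the overlay on the eye region
    center_x = ex + ew // 2
    center_y = ey + eh // 2

    x1 = center_x - overlay_width // 2
    y1 = center_y - overlay_height // 2
    x2 = x1 + overlay_width
    y2 = y1 + overlay_height

    # Keep the overlay inside the face ROI so slicing never goes out of bounds
    return max(x1, 0), max(y1, 0), min(x2, roi_w), min(y2, roi_h)


print(place_overlay(40, 30, 60, 20, 200, 200, 100, 50))  # (25, 18, 115, 63)
print(place_overlay(0, 0, 60, 20, 100, 100, 100, 50))    # (0, 0, 75, 33)
```

To compute height first in this sketch, replace the two scaling lines with overlay_height = int(eh * scale) and overlay_width = int(overlay_height * orig_w / orig_h). Then resize the overlay and masks to the clamped box, for example with cv2.resize(overlay, (x2 - x1, y2 - y1), interpolation=cv2.INTER_AREA); note that cv2.resize takes the size as (width, height). Clamping squeezes the overlay near the edges of the face ROI instead of cropping it, which keeps the code simple.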

Step 7: Going further

Option 1: Combine this with the image processing activity to use your own filters and effects on the video, in addition to the AR overlays.

Option 2: Edit an existing video instead of your camera feed to make it funnier. Instead of passing 0 into cv2.VideoCapture(), pass in a string with a file path to a video. For example, cap = cv2.VideoCapture('my/video/path.mp4').

If the program runs too slowly, try processing every 2nd or 3rd frame. Replace these lines:

while True:
    ret, frame = cap.read()

With these:

frame_count = 0 # initialize frame counter
while True:
    ret, frame = cap.read()
    frame_count += 1 # increment the frame counter
    if frame_count % 2 != 0: # if the frame count is not divisible by 2,
        continue # skip the rest of the loop for this frame

The counter should be initialized right before the while loop, and the divisibility check should happen immediately after the video capture feed is read. The higher the number (2 in this example), the more video frames will be skipped.
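To see which frames the counter pattern keeps, here it is as a pure-Python sketch with a hypothetical frames_to_process function; each pass through the loop stands in for one cap.read() plus the frame processing.

```python
def frames_to_process(total_frames, every_n=2):
    """Return the frame numbers that the skip-counter pattern would process."""
    processed = []
    frame_count = 0
    for _ in range(total_frames):       # stands in for each cap.read()
        frame_count += 1
        if frame_count % every_n != 0:  # not a multiple of every_n: skip it
            continue
        processed.append(frame_count)   # the expensive work would happen here
    return processed


print(frames_to_process(7, 2))  # [2, 4, 6]
print(frames_to_process(7, 3))  # [3, 6]
```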

continue is a Python command that works inside of loops and causes the loop to immediately jump to the end of its code. That means whenever you call continue, the rest of the loop will be skipped over and it will start over with the next iteration. It doesn't stop the loop entirely - that's what break does.
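A tiny comparison of continue and break on the same list:

```python
numbers = [1, 2, 3, 4, 5, 6]

with_continue = []
for n in numbers:
    if n == 3:
        continue  # skip 3 but keep looping
    with_continue.append(n)

with_break = []
for n in numbers:
    if n == 3:
        break  # stop the loop entirely at 3
    with_break.append(n)

print(with_continue)  # [1, 2, 4, 5, 6]
print(with_break)     # [1, 2]
```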