Building a Custom Mask RCNN Model with TensorFlow Object Detection

Doing cool things with data!

You can now build a custom Mask RCNN model using the TensorFlow Object Detection library! Mask RCNN is an instance segmentation model that can identify the pixel-by-pixel location of any object. This article is the second part of my popular post where I explain the basics of the Mask RCNN model and apply a pre-trained mask model to videos.

Training a mask model is a bit more involved than training an object detection model, since the object mask is also needed at training time. It took me a few iterations to figure out the process, and I have shared the key details here. I trained a Mask RCNN model on a toy. See demo below:

Custom Mask RCNN Model on a toy

You can find the code on my GitHub repo.

If you have an interesting project using Mask RCNNs and need help, please reach out to me at

1. Collecting Data and Creating Masks

A regular object detection model requires you to annotate the object in an image using a bounding box. However, the input for a mask model is a PNG file with masks. See example below:

Object Mask — Toy

With this binary mask image, the model can extract both the coordinates of the bounding box as well as the pixel-wise location of the object.
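
To make this concrete, here is a minimal sketch of how a bounding box can be recovered from a binary mask. It uses plain Python lists so it runs without any dependencies; the function name mask_to_bbox and the tiny 5x6 mask are made up for illustration and are not taken from the API or from the script on GitHub.

```python
def mask_to_bbox(mask):
    """Return (xmin, ymin, xmax, ymax) of the nonzero pixels in a 2D mask.

    `mask` is a list of rows; any nonzero value marks an object pixel.
    Coordinates are inclusive pixel indices. Returns None for an empty mask.
    """
    # Row indices that contain at least one object pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    # Column indices of every object pixel.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical mask: 1 = toy pixel, 0 = outside the region of interest.
toy_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(toy_mask)
print(bbox)
```

In practice the mask would be loaded from the PNG file into an array, and the same min/max logic would be applied to the indices of the object pixels.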

The tool I used for creating masks is Pixel Annotation Tool. Its output is a PNG file in the format the API expects. You open the image in the annotation tool and use a brush to “color” the toy. It is also important to color everything outside the toy and mark it as outside the region of interest. It took me about 20 seconds to color and save each mask image, which isn’t too bad. If you want the masks to be very accurate, use a fine brush at the edges. In my experiments, training a Mask RCNN model needed fewer images than training a Faster RCNN model to reach the same accuracy.

2. Generating TF Records

The input to a TensorFlow Object Detection model is a TFRecord file, which you can think of as a serialized binary representation of the image, the bounding box, the mask, etc., so that at training time the model has all the information in one place. The easiest way to create this file is to take the script that generates TFRecords for the Pets dataset and modify it a bit for our case. I have shared the script I used on my GitHub repo.
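
As a rough sketch of what goes into one record, the snippet below collects the per-image fields into a plain dictionary. The key names follow the naming convention that, as far as I understand, the Object Detection API expects, so treat them as an assumption and check them against the Pets script. The function name build_feature_dict, the file name, the image size and the placeholder bytes are all hypothetical. A real script would wrap each value in a tf.train.Feature and serialize the result; that step is left out so the example runs without TensorFlow.

```python
def build_feature_dict(filename, width, height, bbox, class_name, class_id,
                       encoded_image, encoded_mask_png):
    """Collect the fields for one image with a single annotated object.

    `bbox` is (xmin, ymin, xmax, ymax) in pixels. Box coordinates are
    stored normalized to the [0, 1] range by dividing by the image size.
    """
    xmin, ymin, xmax, ymax = bbox
    return {
        "image/filename": filename,
        "image/width": width,
        "image/height": height,
        "image/encoded": encoded_image,
        "image/format": "jpeg",  # assumption: source images are JPEGs
        "image/object/bbox/xmin": [xmin / width],
        "image/object/bbox/xmax": [xmax / width],
        "image/object/bbox/ymin": [ymin / height],
        "image/object/bbox/ymax": [ymax / height],
        "image/object/class/text": [class_name],
        "image/object/class/label": [class_id],
        "image/object/mask": [encoded_mask_png],  # PNG-encoded binary mask
    }


# Hypothetical 640x480 image with one toy; the byte strings are placeholders.
features = build_feature_dict(
    "toy_001.jpg", 640, 480, (160, 120, 480, 360), "toy", 1,
    b"<jpeg bytes>", b"<png bytes>")
print(sorted(features))
```

The per-object fields are lists because an image can contain several objects, each with its own box, label and mask.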

You will also need to create a label.pbtxt file that maps each label name to a numeric id. In my case it was as simple as:

item {
 id: 1
 name: 'toy'
}
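
If you have more than one class, writing this file by hand gets tedious. The sketch below generates the label map text from a list of names; make_label_map is a hypothetical helper, not part of the API. It assigns ids starting at 1, on the assumption that id 0 is reserved for the background class.

```python
def make_label_map(class_names):
    """Build label map text, assigning ids from 1 in list order."""
    blocks = []
    for class_id, name in enumerate(class_names, start=1):
        blocks.append("item {\n  id: %d\n  name: '%s'\n}\n" % (class_id, name))
    return "\n".join(blocks)


label_map_text = make_label_map(["toy"])
print(label_map_text)

# To save it next to the training data (path is hypothetical):
# with open("label.pbtxt", "w") as f:
#     f.write(label_map_text)
```

The names must match the class names written into the TFRecord file, otherwise the ids will not line up at training time.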

3. Selecting Model Hyperparameters

Now you can choose the mask model you want to use. The TensorFlow API provides four model options. I chose Mask RCNN Inception V2, which means that Inception V2 is used as the feature extractor. This model is the fastest at inference time, though it may not have the highest accuracy. The model parameters are stored in a config file. I used the config file for the COCO model of the same type and updated the number of classes and the paths, keeping most of the other model parameters the same.
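
The edits to the config file can be done by hand, but here is a small sketch of the same changes made with plain text substitution. SAMPLE_CONFIG is a heavily abbreviated stand-in for a sample config, written from memory of how those files are laid out, so treat its structure as an assumption. The helper names and the replacement paths are hypothetical.

```python
import re

# Abbreviated stand-in for a sample pipeline config; the real file is longer.
SAMPLE_CONFIG = """\
model {
  faster_rcnn {
    num_classes: 90
  }
}
train_config {
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
}
train_input_reader {
  label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/mscoco_train.record"
  }
}
"""


def set_field(text, field, value):
    """Replace the value of every `field: <old value>` occurrence in text."""
    pattern = r'(\b%s:\s*)("[^"]*"|\S+)' % re.escape(field)
    # A function replacement avoids backslash escapes in `value` being parsed.
    return re.sub(pattern, lambda match: match.group(1) + value, text)


def customize_config(text, num_classes, checkpoint, label_map, train_record):
    """Update the class count and the paths, leaving everything else as is."""
    text = set_field(text, "num_classes", str(num_classes))
    text = set_field(text, "fine_tune_checkpoint", '"%s"' % checkpoint)
    text = set_field(text, "label_map_path", '"%s"' % label_map)
    text = set_field(text, "input_path", '"%s"' % train_record)
    return text


custom_config = customize_config(
    SAMPLE_CONFIG, 1, "pretrained/model.ckpt",
    "data/label.pbtxt", "data/train.record")
print(custom_config)
```

Note that set_field replaces every occurrence of a field, so a config with separate train and eval input readers would need the eval record path set separately.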

4. Training the Model

With the input files and the parameters locked, you can start the training. I was able to train this model on a CPU in a few hours. You can run the training job and the evaluation job in two separate terminals at the same time, and start TensorBoard to monitor performance. I stopped training when I saw the loss plateauing.
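
I judged the plateau by eye in TensorBoard. If you prefer a rule of thumb, one possible way to express it in code is to compare the average loss over the most recent steps with the average over the steps just before them. The function below is only a sketch of that idea; the name has_plateaued, the window size and the 1% threshold are arbitrary choices, not values I tuned.

```python
def has_plateaued(losses, window=5, min_improvement=0.01):
    """Return True when the loss has stopped improving noticeably.

    Compares the mean of the latest `window` values with the mean of the
    `window` values before them. The loss counts as plateaued when the
    relative improvement is smaller than `min_improvement`.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return (previous - recent) < min_improvement * abs(previous)


# Made-up loss curves for illustration.
still_improving = [2.0 - 0.1 * step for step in range(10)]
flat = [0.5] * 10
print(has_plateaued(still_improving), has_plateaued(flat))
```

Training loss is noisy, so a larger window gives a steadier signal at the cost of reacting later.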

The coolest thing in TensorBoard is that it lets you visualize the predictions on sample images from the test set as training progresses. The gif below shows the model becoming more certain of its mask and bounding box predictions as training progresses.

5. Testing the Model on Your Custom Video

To test the model, we first select a model checkpoint (usually the latest) and export it to a frozen inference graph. The script for this is also on my GitHub. I tested the model on a new video recorded on my iPhone. As in my previous article, I used the Python moviepy library to split the video into frames, run the object detector on each frame, and collate the results back into a video.
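
The frame-by-frame loop can be sketched roughly as follows. This assumes the moviepy 1.x interface (the moviepy.editor import path and the fl_image method); newer releases may name these differently. The function annotate_video and the passthrough placeholder are hypothetical; in the real pipeline the per-frame function would run the frozen graph and draw the boxes and masks on the frame.

```python
def annotate_video(input_path, output_path, process_frame):
    """Apply `process_frame` to every frame and write the result to a file.

    `process_frame` receives one RGB frame (a NumPy array) and must return
    the annotated frame with the same shape.
    """
    # Imported lazily so this sketch can be loaded without moviepy installed.
    # Assumption: moviepy 1.x, where VideoFileClip lives in moviepy.editor.
    from moviepy.editor import VideoFileClip

    clip = VideoFileClip(input_path)
    annotated = clip.fl_image(process_frame)  # runs on each frame lazily
    annotated.write_videofile(output_path, audio=False)


def passthrough(frame):
    """Placeholder for the detector: returns the frame unchanged."""
    return frame


# Example call (file names are hypothetical):
# annotate_video("toy_video.mp4", "toy_video_out.mp4", passthrough)
```

Keeping the detector behind a single per-frame function makes it easy to test on one image before processing a whole video.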

Next Steps

Additional explorations for the future:

  • I would love to extend this model to multiple categories of objects in the same image. The TFRecord creator script would need some modifications so that it assigns each object the correct label and mask.
  • As I mentioned, I used the most lightweight model for this project. I would love to see how the other, slower models in the suite perform in terms of detection accuracy.

PS: I have my own deep learning consultancy and love to work on interesting problems. I have helped several startups deploy innovative AI based solutions. Check us out at

If you have a project that we can collaborate on, then please contact me through my website or at


Priya Dwivedi

Priya Dwivedi has 10+ years experience as a data scientist. She now runs her own data analytics consultancy that builds deep learning models for Computer Vision and NLP problems. She has helped many startups deploy innovative AI based solutions. For more info please see the link — If you are interested in collaborating with her then please contact her at

Open Data Science - Your News Source for AI, Machine Learning & more