Plastic pollution is a severe problem that affects both ecosystems and our daily lives. A large share of
this plastic is found in the oceans: an estimated 5.25 trillion pieces of plastic and microplastic are
currently floating in the ocean, of which 15% will eventually land on our beaches. In response to the
plastic problem, we would like to develop a deep learning PET bottle detection model for mapping plastic
pollution on beaches. The resulting data allows governments, councils, and NGOs to get an overview of the
current situation and to estimate how effective their proposals are at reducing the impact of plastic waste.
Researchers can also use it to decide where to focus when preparing and implementing plans for prevention
and cleanup. Our ultimate goal is to help reduce the amount of plastic pollution in the ocean.
Using our drone and phones, we took 718 images along the coastlines of beaches in Hong Kong, including
Cheung Chau and Cheung Sha, and uploaded them to CVAT (Computer Vision Annotation Tool), a website provided by
Clearbot, a company which creates marine plastic clearing robots, to create ground truth instances for the
Fig 1. Students using a drone to capture images on beaches
Together with the human practice team, we organized a plastic tagging activity within our school. We
invited around 5 students from each class, 60 students in total. During the activity, we taught them how to
trace polygons around plastic objects in an image in CVAT. With additional help from some of our team
members, we generated the training data.
We then trained a model on this data with Detectron from Facebook AI Research to obtain plastic
The detection algorithm we use is Mask-RCNN, an instance segmentation algorithm developed by Facebook AI
Research. It extends Faster-RCNN, which only creates bounding boxes around detected objects, by also
predicting a pixel-level mask for each object. We use Detectron2 as the framework to build the model, and
we start from the baselines in Detectron2's Model Zoo for transfer learning to improve model performance.
To build an intuitive understanding of the model structure, suppose we have a video in which we want to
detect PET bottles. We process each frame of the video by passing it to a Convolutional Neural Network,
which applies filters (kernels) to extract features from the image, such as shapes, reflections, and
highlights, and produces a feature map.
Fig.2 Structure of feature extracting backbone
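To make the idea of a filter producing a feature map concrete, here is a minimal sketch of a single convolution step in plain Python. It is only an illustration, not the backbone we trained: the toy `frame` and the hand-picked Sobel kernel are assumptions, whereas a real CNN learns its kernels during training and stacks many such layers.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    Both arguments are nested lists of numbers. As in most deep learning
    frameworks, this is technically a cross-correlation (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x5 grayscale frame: dark on the left (0), bright on the right (1).
frame = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-picked kernel that responds to vertical edges (assumption for illustration).
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

feature_map = conv2d(frame, sobel_x)
for row in feature_map:
    print(row)
```

The feature map is large where the window covers the dark-to-bright edge and zero where the window is uniformly bright, which is the sense in which a filter "extracts" a feature such as an edge or a highlight.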
The feature map is then passed into a Region Proposal Network (RPN), a small neural network that generates
region proposals as bounding boxes, each with a score indicating whether it contains an object. ROI Align is
then applied to each proposed region of interest on the feature map from the previous stage to produce a
fixed-size input for the next stage. The output of the ROI Align layer is passed into two networks: fully
connected layers, which classify the object and refine its bounding box, and a small fully convolutional
network, which predicts its mask.
Fig.3 Structure of RPN, ROI Align and following two neural network layers
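The role of ROI Align, turning regions of different sizes into a fixed-size input, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Detectron2 implementation: it takes a single bilinear sample at the centre of each output bin, treats integer indices as sample locations, and works on one feature channel. The function names `bilinear` and `roi_align` are our own.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) location."""
    h, w = len(feature), len(feature[0])
    # Clamp the sample point so that it stays inside the feature map.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, box, out_size):
    """Pool the region box = (x1, y1, x2, y2) into an out_size x out_size grid.

    Simplification: one sample at the centre of each bin, with no rounding of
    the box coordinates to the feature-map grid.
    """
    x1, y1, x2, y2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    return [
        [
            bilinear(feature, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# Toy 4x4 feature map whose value encodes its position: feature[y][x] = 10*y + x.
feature = [[10 * y + x for x in range(4)] for y in range(4)]

# Two proposals of different sizes both become 2x2 outputs.
large_roi = roi_align(feature, (0.0, 0.0, 3.0, 3.0), out_size=2)
small_roi = roi_align(feature, (1.0, 0.0, 3.0, 1.0), out_size=2)
print(large_roi)
print(small_roi)
```

Because the box coordinates are never rounded, small objects such as distant bottles keep their sub-pixel alignment, which matters for the mask branch.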
To optimize the results, we train the network until its output on the training data is close to the ground
truth. In other words, the training objective is to minimize the difference between the two. This
difference is quantified by a loss function; in the case of Mask-RCNN, it combines the errors of bounding
box prediction, classification, and mask prediction. To validate the training results, we use the COCO mAP
(mean Average Precision) metric.
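As a rough illustration of what COCO mAP measures, the sketch below computes Average Precision for a single class and averages it over the IoU thresholds 0.50 to 0.95 in steps of 0.05. It is a simplification under stated assumptions: it uses bounding boxes rather than masks, handles a single image with at least one ground-truth object, and the example boxes and scores are made up. It is not the official COCO evaluation code.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, iou_thr):
    """AP for one class. `detections` is a list of (score, box) pairs."""
    matched = set()
    hits = []  # one flag per detection, in descending score order
    for _score, box in sorted(detections, key=lambda d: -d[0]):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue  # each ground-truth object can be matched only once
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_thr:
            matched.add(best_idx)
            hits.append(True)   # true positive
        else:
            hits.append(False)  # false positive
    # Precision and recall after each detection.
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / len(ground_truths))
    # 101-point interpolation: best precision achievable at recall >= r.
    total = 0.0
    for k in range(101):
        r = k / 100
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 101


def coco_style_map(detections, ground_truths):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [(50 + 5 * k) / 100 for k in range(10)]
    aps = [average_precision(detections, ground_truths, t) for t in thresholds]
    return sum(aps) / len(aps)


# Made-up example: two annotated bottles and three detections.
ground_truths = [(0, 0, 10, 10), (20, 20, 70, 70)]
detections = [
    (0.9, (0, 0, 10, 10)),     # perfect match
    (0.8, (20, 20, 70, 56)),   # IoU 0.72 with the second bottle
    (0.3, (80, 80, 90, 90)),   # false positive
]
map_score = coco_style_map(detections, ground_truths)
print(round(map_score, 4))
```

The second detection counts as correct only at the looser thresholds, so the averaged score is lower than the AP at IoU 0.5 alone. This is why COCO mAP rewards tight localization and not just finding the object.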
Local Linux environment With:
or Google Colab Notebook
Files and guidelines for the codes can be found in our
We performed data augmentation on the training images, including random flipping and random brightness and
contrast changes by a factor between 0.9 and 1.1. Each iteration processes a mini-batch of 2 training
images for mini-batch stochastic gradient descent. Our dataset contains 718 images in total, split in an 8:2
ratio into 574 images (1145 instances) for training and 144 images (193 instances) for validation. We
trained the models on both Google Colaboratory and Kaggle, using NVIDIA P100 and K80 GPUs, for 1000
iterations. For validation, we recorded the losses and the mean Average Precision of the model on the
validation dataset every 20 iterations and plotted them as line graphs.
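The data handling described above (8:2 split, flip and brightness/contrast augmentation, mini-batches of 2) can be sketched in plain Python as follows. This is an illustrative sketch, not our training code: the file names are hypothetical, the image is a tiny grayscale list, and the exact way brightness and contrast are applied is an assumption.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle `items` reproducibly and split them into (train, validation)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_ratio)
    return items[:n_train], items[n_train:]


def augment(image, rng):
    """Randomly flip a grayscale image and jitter its brightness and contrast.

    `image` is a nested list of pixel values in [0, 255]. Both factors are
    drawn uniformly from [0.9, 1.1], matching the range stated in the text.
    """
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]  # horizontal flip
    brightness = rng.uniform(0.9, 1.1)
    contrast = rng.uniform(0.9, 1.1)
    mean = sum(map(sum, image)) / (len(image) * len(image[0]))
    return [
        [min(255.0, max(0.0, ((p - mean) * contrast + mean) * brightness)) for p in row]
        for row in image
    ]


def mini_batches(items, batch_size=2):
    """Yield consecutive mini-batches of `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical file names standing in for the 718 collected images.
all_images = [f"img_{i:03d}.jpg" for i in range(718)]
train, val = split_dataset(all_images)
print(len(train), len(val))

batches = list(mini_batches(train))
augmented = augment([[0, 128, 255], [10, 20, 30]], random.Random(0))
```

Note that when an image is flipped, its polygon annotations have to be flipped in the same way; the sketch leaves that step out.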
We trained the model with different baselines from the Detectron2 Model Zoo: X101-FPN, R101-FPN, and
R50-FPN. We also performed a power estimation by training on fractions of 0.25, 0.5, and 0.75 of the
training dataset, to find out how much training data is needed to maximize model performance.
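One way to build the reduced training sets for such a power estimation is sketched below. The nesting of the subsets (each smaller subset is contained in the larger ones) and the rounding down of subset sizes are our assumptions for illustration; the file names are hypothetical.

```python
import random


def training_subsets(train_items, fractions=(0.25, 0.5, 0.75), seed=0):
    """Return {fraction: subset} where every subset is a prefix of one shuffle.

    Using prefixes of a single shuffled list means the 0.25 subset is contained
    in the 0.5 subset, and so on, so differences in AP come from the added
    images rather than from a completely different sample.
    """
    shuffled = list(train_items)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[:int(len(shuffled) * f)] for f in fractions}


# Hypothetical file names standing in for the 574 training images.
train_images = [f"train_{i:03d}.jpg" for i in range(574)]
subsets = training_subsets(train_images)
for fraction, subset in subsets.items():
    print(fraction, len(subset))
```

Each subset would then be used to train a separate model with otherwise identical settings, and the resulting AP values plotted against the fraction.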
Table 1: mAP results on the validation dataset for models with different baselines. X101-FPN outperforms
the other models we tested.
Table 2: The mAP results from the Mask-RCNN paper.
As shown in Table 1, the X101-FPN backbone, which combines the concepts of ResNet and InceptionNet,
outperforms the R50-FPN and R101-FPN baselines in overall mean Average Precision (+3.7 and +2.5,
respectively). Compared with the results from the Mask-RCNN paper, whose models were trained on the COCO
dataset of 330k images across 80 object categories, our models clearly have a higher mAP. Sample detection
images are shown below.
Fig. 4: 3 sample images from the validation set detected by the model with X101-FPN baseline
Although we achieved a high Average Precision, the detections were not perfect: the sample images clearly
contain some false positives. The high AP is probably explained in part by the fact that our model detects
only one object category, PET bottles, whereas the Mask-RCNN benchmark covers 80 categories in total, which
drags down its AP.
Besides, our dataset has flaws. The most obvious is the lack of both training and validation data compared
with large-scale datasets such as Pascal VOC, ImageNet, and COCO, which consist of >10000 instances per
category. For training, this limits the ability of our model to learn more features of PET bottles.
Moreover, our dataset contains images of similar objects taken from different angles. The model may come to
rely on this near-duplicate data and only detect bottles with similar features, which biases the model.
Finally, we did not use more advanced validation methods such as K-Fold Cross-validation, in which the
dataset is divided into k sections, each section is used once for validation while the others are used for
training, and the resulting APs are averaged.
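For reference, K-Fold Cross-validation could be organized as in the sketch below. It only handles the index bookkeeping; `train_and_evaluate` is a hypothetical callback that would train a model on the training indices and return its AP on the validation indices. In practice the indices should refer to a shuffled list of images.

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, val_idx) for each of the k folds.

    Fold sizes differ by at most one, and every item is used for validation
    exactly once.
    """
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, val_idx
        start += size


def cross_validate(n_items, k, train_and_evaluate):
    """Average the score returned by `train_and_evaluate` over the k folds."""
    scores = [train_and_evaluate(tr, va) for tr, va in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)


# With our 718 images and k = 5, each fold validates on 143 or 144 images.
folds = list(k_fold_indices(718, 5))
print([len(val_idx) for _, val_idx in folds])

# Dummy callback (assumption): returns the validation size instead of a real AP.
mean_score = cross_validate(718, 5, lambda tr, va: float(len(va)))
```

The cost is that the model has to be trained k times, which is a real consideration when training on shared GPU notebooks.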
Fig 5. The training curves of X101FPN, R101FPN, and R50FPN (top to bottom)
Fig. 5 shows the training curves for the models, where the blue line represents the mAP and the purple line
represents the loss. The mAP of the R50FPN and R101FPN models stabilized at around 500 iterations, while
the X101FPN model stabilized at around 400 iterations, earlier than the rest. Their losses all leveled off
at around 600 iterations. After the 600-iteration mark there is no sudden drop in the mAP, which suggests
that the models were not overfitting.
As the graphs clearly show, our small dataset is not enough to achieve maximum performance. The AP plotted
against the percentage of training images used appears to increase exponentially, which suggests that
further expanding our dataset would continue to improve the AP.