Team:HK GTC/Deep learning

HK_GTC 2021 Homepage


Deep Learning

Detection of Plastic Bottles

Plastic pollution is a severe problem that affects both ecosystems and our daily lives. Much of this plastic ends up in the oceans: an estimated 5.25 trillion pieces of plastic and microplastic are currently floating in the ocean, and about 15% of them will eventually land on our beaches[1]. In response, we set out to develop a deep learning model that detects PET bottles, so that plastic pollution on beaches can be mapped. Such data would give governments, councils and NGOs an overview of the current situation and let them estimate how effective their measures are at reducing the impact of plastic waste. Researchers could also use it to decide where prevention and cleanup efforts should focus. Our ultimate goal is to help reduce the amount of plastic pollution in the ocean.

Our Workflow

Photo Taking

Using our drone and phones, we took 718 images along the coastlines of beaches in Hong Kong, including Cheung Chau and Cheung Sha. We uploaded them to CVAT (Computer Vision Annotation Tool)[2], which we accessed through a website provided by Clearbot, a company that builds marine plastic clearing robots, to create ground truth instances for the training process.

Fig 1. Students using a drone to capture images on beaches

Volunteer Plastic Tagging

Together with the human practice team, we organized a plastic tagging activity within our school. We invited around 5 students from each class, 60 students in total. During the activity, we taught them how to trace polygons around plastic objects in an image in CVAT. With additional help from some of our team members, we generated the training data.

We then used this data to train plastic detection models with Detectron2 from Facebook AI Research.

Model Description

The detection algorithm we use is Mask-RCNN[3], an instance segmentation algorithm developed by Facebook AI Research. It extends Faster-RCNN, which only produces bounding boxes around detected objects, by also predicting a pixel-level mask for each object. We use Detectron2[4] as the framework for building the model, and start from the baselines in the Detectron2 Model Zoo[5] for transfer learning to improve model performance.


To get an intuitive understanding of the model structure, suppose we have a video in which we want to detect PET bottles. We process each frame by passing it through a Convolutional Neural Network, which applies learned filters (kernels) to extract features from the image, such as shapes, reflections and highlights, and produces a feature map.

Fig.2 Structure of feature extracting backbone
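To make the idea of a filter producing a feature map concrete, here is a minimal sketch in plain Python. It is not the backbone used in our model: the 4x5 toy image and the hand-picked vertical-edge kernel are assumptions chosen for illustration, whereas a real CNN learns its kernels during training and stacks many such layers.

```python
def conv2d(image, kernel):
    """Slide one filter over a grayscale image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch under the filter.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy image (assumption): dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-picked filter (assumption) that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d(image, vertical_edge)
print(feature_map)  # [[3, 3, 0], [3, 3, 0]]
```

The large values mark where the filter overlaps the edge, and the zeros mark the uniform bright area. This is the sense in which a feature map records where a feature occurs in the image.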

The feature map is then passed into a Region Proposal Network (RPN), a small neural network that generates region proposals as bounding boxes, each with a score for whether it contains an object. ROI Align is then applied to each region of interest on the feature map to obtain a fixed-size input for the next stage. The output of the ROI Align layer is passed into two networks: fully connected layers, which classify the object and refine its bounding box, and a small fully convolutional network, which predicts its mask.

Fig.3 Structure of RPN, ROI Align and following two neural network layers
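The role of ROI Align, turning regions of different sizes into inputs of one fixed size, can be sketched as follows. This is a simplified illustration, not Detectron2's implementation: it takes a single bilinear sample at the centre of each output bin (the real layer averages several samples per bin), works on one channel, and assumes box coordinates are already expressed in feature-map units. The function names are hypothetical.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) position."""
    h, w = len(feature), len(feature[0])
    # Clamp so that sample points on the border stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_align(feature, box, out_size):
    """Pool box = (x1, y1, x2, y2) into an out_size x out_size grid."""
    x1, y1, x2, y2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # One sample at the centre of each bin (simplification).
            cy = y1 + (i + 0.5) * bin_h
            cx = x1 + (j + 0.5) * bin_w
            row.append(bilinear(feature, cy, cx))
        pooled.append(row)
    return pooled


# Toy 6x6 feature map (assumption): value = x + 10 * y.
feature = [[x + 10 * y for x in range(6)] for y in range(6)]

# Two proposals of different sizes both become 2x2 outputs.
pooled_small = roi_align(feature, (1.0, 1.0, 4.0, 3.0), out_size=2)
pooled_large = roi_align(feature, (0.0, 0.0, 5.0, 5.0), out_size=2)
```

Because the sample positions are fractional rather than rounded to the nearest cell, the pooled values stay aligned with the original box, which matters for pixel-level masks.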


To optimize the results, we train the network until its output for the training data is close to the ground truth. In other words, the training objective is to minimize the difference between the two. This difference is measured by a loss function; for Mask-RCNN, it combines the errors of bounding box prediction, classification and mask prediction. To validate the training results, we use COCO mAP[6].
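Written out, the per-region loss described in the Mask-RCNN paper[3] is the sum of the three error terms. The symbols follow the paper; the losses of the Region Proposal Network, which are also minimized during training, are left out of this sketch.

```latex
L = L_{cls} + L_{box} + L_{mask}
```

Here $L_{cls}$ is the classification loss, $L_{box}$ is the bounding box regression loss, and $L_{mask}$ is the mask loss computed on the mask predicted for the ground truth class.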

Model Usage


Local Linux environment with:

Jupyter Notebook
PyTorch 1.8

or Google Colab Notebook

Files and guidelines for the code can be found in our GitHub.

Model Results

Training / Validation Configurations

We applied data augmentation to the training images: random flipping, and random brightness and contrast scaling between 0.9 and 1.1. Each iteration processes a mini-batch of 2 training images for mini-batch stochastic gradient descent. Our dataset contains 718 images in total, split in an 8:2 ratio into 574 images (1145 instances) for training and 144 images (193 instances) for validation. We trained the models for 1000 iterations on Google Colaboratory and Kaggle, using NVIDIA P100 and K80 GPUs. For validation, we recorded the model's mean Average Precision (mAP) on the validation dataset and the losses every 20 iterations, and plotted them as line graphs.
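An 8:2 split of 718 images into 574 and 144 can be reproduced with a few lines of Python. This is a sketch under assumptions: the shuffling, the fixed seed and the use of integer image IDs are illustrative choices, not a record of how our split was actually drawn.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle the items and cut them into a training and a validation part."""
    shuffled = list(items)
    # A fixed seed (assumption) keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in IDs for the 718 images (assumption).
image_ids = list(range(718))
train_ids, val_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids))  # 574 144
```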

We trained the model with different baselines from the Detectron2 Model Zoo[4]: X101-FPN, R101-FPN and R50-FPN. We also performed power estimations by training on fractions of 0.25, 0.5 and 0.75 of the training dataset, to find out how much training data is needed to maximize model performance.
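A configuration along these lines could be set up in Detectron2 as sketched below. Several details are assumptions rather than our recorded settings: the choice of the "3x" Model Zoo configs, the dataset names `pet_train` and `pet_val` (which would first have to be registered with Detectron2), and the helper name `build_cfg`. The batch size, iteration count, evaluation interval and single class come from the description above.

```python
# Model Zoo config files for the three baselines (the 3x schedule is an assumption).
BASELINES = {
    "R50-FPN": "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml",
    "R101-FPN": "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml",
    "X101-FPN": "COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml",
}


def build_cfg(baseline, train_name="pet_train", val_name="pet_val"):
    """Return a Detectron2 config for one baseline (hypothetical helper)."""
    # Imported lazily so the mapping above can be used without Detectron2 installed.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    config_file = BASELINES[baseline]
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(config_file))
    # Start from COCO-pretrained weights for transfer learning.
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_file)
    # Dataset names are placeholders and must be registered beforehand.
    cfg.DATASETS.TRAIN = (train_name,)
    cfg.DATASETS.TEST = (val_name,)
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # PET bottle only
    cfg.SOLVER.IMS_PER_BATCH = 2         # 2 images per iteration
    cfg.SOLVER.MAX_ITER = 1000           # 1000 training iterations
    cfg.TEST.EVAL_PERIOD = 20            # evaluate every 20 iterations
    return cfg
```

The returned config would typically be handed to a trainer such as Detectron2's `DefaultTrainer`. Periodic evaluation additionally needs an evaluator (for example a COCO-style one) to be supplied by the trainer, which this sketch leaves out.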

Baseline results

Table 1: mAP results on the validation dataset for models with different baselines. X101-FPN outperforms the other models we tested.

Table 2: The mAP results from the Mask-RCNN paper.

As shown in Table 1, the X101-FPN backbone (ResNeXt-101, which combines ideas from ResNet and InceptionNet) outperforms the R50-FPN and R101-FPN baselines in overall mean Average Precision (+3.7 and +2.5 respectively). The models in the Mask-RCNN paper were trained on the COCO dataset, which consists of 330k images covering 80 object categories; compared with those results, our models have a clearly higher mAP. Sample detection images are shown below.

Fig. 4: 3 sample images from the validation set detected by the model with X101-FPN baseline

Potential reasons for High AP & inaccurate performance

Although the Average Precision is high, the detections are not perfect: there are clearly some false positives in the sample images.
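How a detection comes to be counted as a true or false positive can be sketched with an Intersection over Union (IoU) check. This is a simplified illustration rather than the COCO evaluation code: it compares bounding boxes instead of masks, uses greedy matching at a single threshold, and the boxes and scores are made up for the example. COCO mAP repeats this kind of matching at IoU thresholds from 0.50 to 0.95 and averages the resulting precision values.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp(detections, ground_truth, iou_threshold=0.5):
    """Match (score, box) detections to ground truth boxes, best score first."""
    matched = set()
    tp = fp = 0
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground truth box can be matched only once
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    return tp, fp


# Made-up example: one labelled bottle, two detections.
ground_truth = [(0, 0, 10, 10)]
detections = [(0.9, (0, 0, 10, 8)), (0.6, (20, 20, 30, 30))]

print(count_tp_fp(detections, ground_truth, 0.5))  # (1, 1)
print(count_tp_fp(detections, ground_truth, 0.9))  # (0, 2)
```

The first detection overlaps the bottle with an IoU of 0.8, so it is a true positive at the 0.5 threshold but a false positive at 0.9. The second detection overlaps nothing and is a false positive at every threshold.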

The first reason is probably that our model detects only one object category, PET bottles. The Mask-RCNN benchmark covers 80 categories in total, which drags down its AP.

Besides, our dataset has flaws. The most obvious is the small amount of both training and validation data compared with large-scale datasets such as Pascal VOC, ImageNet and COCO, which contain >10000 instances per category. For training, this limits how many features of PET bottles our model can learn. Moreover, our dataset contains images of the same or similar objects taken from different angles. The model can come to rely on this near-duplicate data and only detect bottles with similar features, which biases the model. If such images fall on both sides of the training and validation split, they can also inflate the validation AP.

Finally, we did not use more advanced validation methods such as K-Fold Cross-Validation, in which the dataset is divided into k sections, each section is used once for validation while the model is trained on the rest, and the resulting APs are averaged.
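The K-Fold procedure can be sketched in plain Python as follows. The callback `train_and_score` is hypothetical: it stands in for training a model on one set of images and returning its AP on another, which is not implemented here. The choice of k = 5 in the example is also an assumption.

```python
def k_fold_splits(items, k):
    """Yield (train, validation) pairs so that each item is validated exactly once."""
    items = list(items)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [len(items) // k + (1 if i < len(items) % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = items[start:start + size]
        train = items[:start] + items[start + size:]
        yield train, validation
        start += size


def cross_validate(items, k, train_and_score):
    """Average the score returned by train_and_score(train, validation) over k folds."""
    scores = [train_and_score(train, val) for train, val in k_fold_splits(items, k)]
    return sum(scores) / len(scores)


# Example with stand-in IDs for the 718 images and k = 5 (assumptions).
folds = list(k_fold_splits(range(718), 5))
print([len(val) for _train, val in folds])  # [144, 144, 144, 143, 143]
```

In practice the items would be shuffled before splitting, so that images taken at the same beach or of the same object do not all land in one fold.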

Fig 5. The training curves of X101-FPN, R101-FPN and R50-FPN (top to bottom)

Fig. 5 shows the training curves of the models, where the blue line represents the mAP and the purple line represents the loss. The mAP of the R50-FPN and R101-FPN models stabilized at around 500 iterations, while that of the X101-FPN model stabilized at around 400 iterations, earlier than the rest. Their losses all levelled off at around 600 iterations. After the 600-iteration mark there is no sudden drop in the mAP, which suggests that the models were not overfitting.

Power Estimations of the dataset

As the graphs show, our small dataset is not enough to reach maximum performance. The AP increases with the percentage of training images used, in a trend that appears exponential, which suggests that further expanding our dataset would continue to improve the AP.
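The training subsets for such a power estimation could be drawn as sketched below. The nesting of the subsets (each smaller subset is contained in the larger ones), the fixed seed, and the inclusion of the full set as fraction 1.0 for comparison are assumptions made for illustration, not a record of our procedure.

```python
import random


def nested_subsets(train_items, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Return {fraction: subset}, where each subset is a prefix of one shuffled list."""
    shuffled = list(train_items)
    random.Random(seed).shuffle(shuffled)
    # Taking prefixes keeps the subsets nested, so a larger fraction only adds images.
    return {f: shuffled[:int(len(shuffled) * f)] for f in fractions}


# Stand-in IDs for the 574 training images (assumption).
subsets = nested_subsets(range(574))
print({f: len(s) for f, s in subsets.items()})
# {0.25: 143, 0.5: 287, 0.75: 430, 1.0: 574}
```

A model would then be trained on each subset and evaluated on the same validation set, so that any difference in AP can be attributed to the amount of training data.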

Future Plans/Implementations


  • Increase the size of training and validation data to improve model accuracy and ensure a reliable mAP
  • Train other algorithms, e.g. YOLOv3, on our dataset.


  • Evaluate the actions taken by different stakeholders against coastline plastic pollution (e.g. the effects of government producer responsibility schemes, or how long the effect of NGO beach cleanups lasts)
  • Use the model as the backbone of a plastic cleanup robot for the coastline.
  • Map plastic waste along the coastline using drones to generate a cleanup plan.


