Monitoring the Railroad Using Neural Networks
CS870 Project Report, August 2017
Vyacheslav Derevyanko
David R. Cheriton School of Computer Science, University of Waterloo
[email protected]

ABSTRACT
This project surveys various neural network architectures on a single task: locating a small object in a complex scene image. Experiments are performed on a range of NN architectures of gradually increasing complexity: starting with a logistic regression classifier and simple 3-layer NNs, moving on to convolutional NNs, and concluding with state-of-the-art CNNs used for object detection and localization. Source code for the NN models used is available online (https://github.com/slavik112211/cs870 neural networks). The experiments demonstrate the superiority of modern CNNs over non-CNN approaches for image classification and object detection tasks.

Keywords: Convolutional Neural Networks, Machine Learning, image classification, supervised learning, object detection.

1. INTRODUCTION

Following our class discussions on how CNNs can be applied to computer vision tasks [6, 13], I decided to base my project around the application of neural networks to image recognition, and an evaluation of how various neural network architectures, parameters and techniques affect the quality of image recognition and classification. As the basis for this experimentation, I came up with the following setup: a toy model of a railroad is pictured from above, and the image recognition tasks revolve around recognizing the location of a train on the track. This simulates a real-world scenario of satellite imagery of on-ground activity. Additional experiments would be based on the relative locations of multiple trains, such as whether it is safe for a train to proceed in a particular direction at a given velocity, given the locations of other trains on the track. As NN classification heavily depends on the quality and quantity of training data, all of the experiments require a large collection of training images. Images for the experiments were gathered in the CS452 class lab, which has a large model of a railroad track and control systems that allow trains to be run on the track. The lab also has a camera mounted on the ceiling, allowing pictures of the track to be taken from above.

2. EXPERIMENTS: DETECTING A TRAIN

Initial experiments were based on the task of finding the train's location on the track, given that only one train is running on it. As with any supervised learning system, a neural network needs to be trained on training data before it can be used for image classification on a real data stream. I initially collected around 1500 images of a single train running on different parts of the track. The task of finding the train's location was organized as a classification task: the track was divided into small segments, and the NN was used to classify which segment the locomotive is currently in. Compared to image datasets available on the web, the number of images available as training data may sound quite low: even a simpler dataset of handwritten digits, MNIST (http://yann.lecun.com/exdb/mnist/), has 50,000 images in the training set for 10 output categories. I still decided to start experimenting with a small training dataset, expecting it to provide good feedback on what can be considered a good dataset size. The collected images had a resolution of 960x480, which may sound large, as a neural network input layer would need to consist of 460,800 neurons. I halved the image resolution to 480x240 and converted the images to grayscale to lower the number of neurons in the input layer to 115,200. Nonetheless, the actual train in such an image is small - about 40x10 pixels - and at that zoom level the distinctive features of a train are quite blurry, so it wasn't completely clear whether a neural network would be able to identify and distinguish trains on input images.
2.1 Implementing a linear classifier

To be able to train a neural network on the collected images, these images needed to be classified and labelled according to predetermined track segments. The more fine-grained the division into track segments, the more precisely each segment represents the location of the train. Considering that it wasn't clear how many images would need to be labelled, and what would be a good number of categories, I realized that manual labelling wasn't a great idea. Instead, I implemented a custom image classifier based on luminosity (trains location linear classifier.py). After converting images to grayscale, a train can be recognized in an image as a dark rectangle of size 40x10. To evaluate whether a train is present at a particular location, a 40x10 bounding box is positioned at that location along the track, and a threshold is applied to the summed luminosity of all pixels within that rectangle.
I estimated the luminosity sum to be lower than 35,000 when a train is within the bounding box, and above 43,000 when it is not. The list of pixels within a bounding box is retrieved by positioning the 40x10 bounding box at the origin (0,0) and multiplying each pixel's coordinate vector by a transformation matrix to get the pixel's new coordinates:

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}    (1)

This way, each train position on the track can be described with only three values: the location of the center of the bounding box (x, y) and its rotation angle θ. I defined 88 potential train locations, as that is the number of categories required to express the train's location with a precision of about half a train's length. Sometimes an image would be classified into multiple categories; in that case the category with the smaller luminosity value was picked as the ground truth (i.e. a bigger part of the train shows in one category than in the other). The output of the linear classifier was inspected visually to ensure that the threshold luminosity values classify images correctly and cut out images that don't belong in those categories.

Figure 1: Railroad track divided into 10 segments
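The luminosity check can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical helper names, not the exact code from trains location linear classifier.py; the thresholds and box size are the ones given above, and the image is assumed to be a grayscale array with values 0-255:

    import numpy as np

    TRAIN_ON_TRACK_MAX = 35000   # luminosity sum below this => train present
    EMPTY_TRACK_MIN    = 43000   # luminosity sum above this => no train

    def box_luminosity(image, center, theta, width=40, height=10):
        """Sum luminosity over a 40x10 box centered at `center`, rotated by theta."""
        cx, cy = center
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        total = 0.0
        for x in range(-width // 2, width // 2):
            for y in range(-height // 2, height // 2):
                px, py = rot.dot([x, y])                  # rotate box-local coordinates
                total += image[int(round(cy + py)), int(round(cx + px))]
        return total

    def train_present(image, center, theta):
        return box_luminosity(image, center, theta) < TRAIN_ON_TRACK_MAX

With 400 pixels per box and a maximum luminosity of 255 per pixel, the sum stays below 102,000, so the two thresholds leave a comfortable margin between the two cases.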
2.2 Training simple neural networks
Using the output of the linear classifier as ground-truth labels, I ended up with 1436 labelled images depicting a train at 88 different locations that together completely cover the area of the track. After reserving 1/6 of this dataset as testing data (239 images), the training dataset was reduced to 1197 labelled images. Given that the images are divided across categories roughly uniformly, each category was represented by over 10 training images. These images were provided as input to the neural networks as 2-dimensional arrays, with each pixel represented by its luminosity score - a real value ranging from 0 to 1, where 0 represents black and 1 represents white.
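As a rough sketch of this preprocessing (PIL/NumPy assumed; load_dataset and the file-path handling are illustrative, not the project's actual loader):

    import numpy as np
    from PIL import Image

    def load_dataset(paths, labels, test_fraction=1/6):
        """Load images as 480x240 grayscale arrays scaled to [0, 1] and split train/test."""
        images = []
        for path in paths:
            img = Image.open(path).convert("L").resize((480, 240))    # grayscale, halved resolution
            images.append(np.asarray(img, dtype=np.float32) / 255.0)  # luminosity in [0, 1]
        images = np.stack(images)
        labels = np.asarray(labels)

        n_test = int(len(images) * test_fraction)
        perm = np.random.permutation(len(images))   # shuffle so categories are mixed
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        return images[train_idx], labels[train_idx], images[test_idx], labels[test_idx]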
2.2.1 Custom 3-layer NN
Before using any deep learning libraries, I decided to train a custom implementation of a simple 3-layer neural network (simple neural net.py) taken from [9], implemented using Python's NumPy linear algebra package (http://www.numpy.org/). Before attempting to train the network on my dataset, I verified that it correctly recognizes digits from the MNIST dataset. This implementation programs backpropagation and gradient descent explicitly, which is good for debugging and learning purposes. To simplify the classification case, the 88 categories were joined together to form 10 larger categories (track path chunks), as shown in Fig. 1. The network had the following configuration: 3 layers - an input layer of 115,200 neurons (480x240), a hidden layer of 30 neurons, and an output layer of 10 neurons; sigmoid activation function and a sigmoid output layer (not softmax); mean squared error as the cost function. The network's performance was validated on the test dataset after each training epoch, i.e. each time after the network was trained on 100% of the training data (1197 images).

The network didn't show any capacity for learning, with accuracy not getting higher than 10-20%. Since this network implementation works fine on the MNIST dataset, I tried to identify the reason for the bad performance. I observed that the network outputs 1 for multiple categories for each input image. This behavior is undesirable, as the network's output should signify a single category that the image belongs to, not several. The network has no way to distinguish between multiple categories that all receive a score of 1. This issue slows down the learning process: the error cost isn't attributed properly, and the network can't correctly estimate how wrong the model is in each particular case. This effect could be prevented by using the softmax function as the output layer - a logistic function that represents the NN output as a probability distribution, where each output category receives a probability score and the scores across all categories sum to 1:

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad \text{for } j = 1, \ldots, K    (2)

where K is the number of output categories, and z is a K-dimensional vector of arbitrary real values that is fed into the output layer.

Other than that, I believe I hit a classic case of learning slowdown, which occurs when a sigmoid activation function is used together with a quadratic cost function. The sigmoid function (3)

\sigma(x) = \frac{1}{1 + e^{-x}}; \qquad \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))    (3)

gets very flat in the extremes (when its input is either much larger or much smaller than 0). As a result, its derivative becomes very small whenever the sigmoid input is not close to 0 (Fig. 2).

Figure 2: Sigmoid and its derivative at x = 6.5

The quadratic cost function, defined as

C = \frac{(y - a)^2}{2},    (4)

where a is the neuron's actual output and y is the desired output, tends to cause learning slowdown, as the sigmoid derivative σ'(x) appears in the ∂C/∂w and ∂C/∂b gradients when the quadratic cost is used:

\frac{\partial C}{\partial w} = a\,\sigma'(x); \qquad \frac{\partial C}{\partial b} = a\,\sigma'(x)    (5)

Given that backpropagation is based on consecutive applications of the chain rule (multiplication of derivatives), a small σ'(x) makes the resulting weight/bias updates negligible. The cross-entropy cost function, defined as

C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln(1 - a) \right],    (6)

summing over training inputs x (n in total), doesn't exhibit similar problems, as its ∂C/∂w and ∂C/∂b gradients do not depend on σ'(x) and aren't affected by it being close to 0:

\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j (\sigma(z) - y); \qquad \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z) - y)    (7)
After changing the cost function to cross-entropy, and simplifying the model to two layers by removing the hidden layer, the network was successfully trained to classify the railroad images, achieving 95% accuracy after about 200 epochs. A comparable 2-layer network using the quadratic cost function wasn't successful: the classification accuracy didn't increase at all from epoch to epoch, and the cost function output decreased only extremely slowly. Given that this same network provides over 90% accuracy on the MNIST dataset after just a single training epoch, I concluded that the quadratic cost can still be used on simpler datasets, but it is far less effective than cross-entropy. Instead of continuing to improve this custom network implementation, I decided to reimplement this simple network on top of TensorFlow, which allows reusing existing implementations of softmax, cross-entropy and other functions from the vast TensorFlow collection.
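To make the softmax/cross-entropy argument concrete, here is a minimal NumPy sketch (illustrative only, not taken from the project code) of the output-layer computation and the batch gradient from equation (7):

    import numpy as np

    def softmax(z):
        """Equation (2): turn raw scores into a probability distribution per row."""
        e = np.exp(z - z.max(axis=1, keepdims=True))   # subtract max for numerical stability
        return e / e.sum(axis=1, keepdims=True)

    def cross_entropy(probs, y_onehot):
        """Equation (6), averaged over the n examples in the batch."""
        return -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))

    def gradients(x, probs, y_onehot):
        """Equation (7): gradients w.r.t. weights and biases; no sigma'(z) factor appears."""
        n = x.shape[0]
        delta = probs - y_onehot            # (sigma(z) - y)
        grad_w = x.T.dot(delta) / n         # sums x_j * (sigma(z) - y) over the batch
        grad_b = delta.mean(axis=0)
        return grad_w, grad_b

The key point is visible in the last function: the cross-entropy gradient is proportional to the raw error (sigma(z) - y), so learning does not stall when the sigmoid saturates.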
2.2.2 2-layer NN using TensorFlow & Theano

I started with a very simple implementation of the neural network using TensorFlow: no hidden layers, with all input neurons directly linked to the output layer. This implementation is provided as part of the TensorFlow tutorials (https://www.tensorflow.org/get started/mnist/beginners), where it is used to classify MNIST digits, and I only needed to adapt it a little to feed my own training data into the model (trains tensorflow classifier.py). The fact that this model doesn't use hidden layers makes it identical to another very commonly used classification model - multinomial logistic regression. Neural networks differ in that an activation function is applied at each layer of neurons, and the backpropagation algorithm [11] is used to obtain cost function gradients for every parameter of the network (weights and biases). But when there are no hidden layers the model becomes radically simpler: backpropagation isn't required anymore, as the cost function gradients can be obtained directly with differential calculus. An activation function is never applied to the input layer of a NN (only to subsequent layers), and when there are no hidden layers the input is fed directly into the output layer. The output layer of the neural network can use softmax as its activation function, the same as logistic regression. Finally, both models use the same method to gradually adjust model parameters: stochastic gradient descent.

In the first experiment I used the following configuration: 2 layers, 10 output categories, softmax on the output layer, cross-entropy cost function, 1197 training images, 239 testing images. Again, I first verified that the model is able to train on the MNIST dataset. Since the training dataset is so small, I set the training to proceed for 300 epochs (approx. 5 minutes), where 1 epoch signifies that 100% of the training data has been fed into the network. Accuracy was verified on the test dataset every 50 epochs. Accuracy goes over 50% after the first 100 epochs, and rises to over 95% after 300 epochs. The model was successfully able to classify images into 10 categories, i.e. it correctly identifies which path chunk (color-coded in Fig. 1) the train is located on. Nonetheless, the original problem was identifying the precise location of the train, and thus it's more interesting to see the accuracy of the NN model when train locations are classified into the original 88 categories instead of the 10 larger categories they were combined into. As the custom luminosity classifier had assigned labels for the full 88 categories, changing the setup was simple. With this setup the model doesn't learn as rapidly, but after training for 400 epochs the testing data was categorized with an acceptable 80% accuracy (Fig. 3).

Figure 3: Accuracy and cost function charts for 2-layer NN using TensorFlow (88 categories, 600 epochs)

To cross-validate this successful result, I re-implemented this 2-layer neural network (logistic sgd theano.py) using Theano (http://deeplearning.net/tutorial/logreg.html), a deep learning framework by the UMontreal lab MILA. This network had the following parameters: 2 layers, input - 480x240 neurons, output - 10 categories, softmax activation, cross-entropy as the cost function. The input consisted of the same collection of 1197 training images and 239 testing images, network weights were updated after each batch of 10 input images, and training continued for over 300 epochs. At first, the training procedure didn't provide any meaningful results: the network's accuracy was verified every 10 epochs, and training data classification didn't seem to improve. Moreover, the cross-entropy cost didn't decrease as training progressed; its value was unstable and jumped back and forth. This clearly indicated that weight updates weren't being applied correctly. Adjusting the input batch size (10), the gradient descent step size (0.15), and other hyperparameters didn't provide any measurable improvement. One major difference between the TensorFlow and Theano implementations was that TensorFlow randomly reshuffled the order of input images after each training epoch, whereas my training data was highly ordered, with input images belonging to one category grouped together. It turned out that this was the reason the network's training procedure couldn't show progress. It was enough to reshuffle the input images just once (at the beginning of training) to get Theano's network to properly update weights and reach accuracy comparable to the TensorFlow implementation: 95% after 300 epochs. Another experiment with the input data separated into the more fine-grained 88 categories confirmed the previous result (achieved with the TensorFlow implementation); the network reached over 85% accuracy after training for about 500 epochs.
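A minimal TensorFlow 1.x sketch of this 2-layer (softmax regression) model, in the spirit of the MNIST beginners tutorial but adapted to the setup above (flattened 480x240 inputs, 88 categories, batch size 10, learning rate 0.15). The arrays train_x, train_y, test_x and test_y are assumed to be preloaded NumPy arrays and are not part of the original code; the per-epoch reshuffling reflects the lesson learned above:

    import numpy as np
    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 115200])   # flattened 480x240 grayscale image
    y = tf.placeholder(tf.float32, [None, 88])       # one-hot train-location category

    W = tf.Variable(tf.zeros([115200, 88]))
    b = tf.Variable(tf.zeros([88]))
    logits = tf.matmul(x, W) + b

    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_step = tf.train.GradientDescentOptimizer(0.15).minimize(cost)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1)), tf.float32))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(300):
            order = np.random.permutation(len(train_x))   # reshuffle every epoch
            for start in range(0, len(order), 10):        # batches of 10 images
                idx = order[start:start + 10]
                sess.run(train_step, {x: train_x[idx], y: train_y[idx]})
            if epoch % 50 == 0:
                print(epoch, sess.run(accuracy, {x: test_x, y: test_y}))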
2.2.3 Enlarging the dataset using Gaussian noise
Although the 2-layer networks did relatively well on the classification task, first attempts at training 3-layer models didn't show any signs of incremental accuracy improvement. I assumed that one of the main reasons for the bad training results could have been the lack of input training data relative to the number of model parameters. A NN with 1 hidden layer has a tremendous number of parameters (weights and biases) to learn. Back-of-the-envelope calculations give the following numbers: the input layer size is 480*240 = 115,200, and each input neuron is linked to each neuron in the hidden layer (size 30), which is 3,456,000 connections between the input and hidden layers alone. And to learn correct values for this massive number of parameters I had only provided a total of 1197 labelled images. To increase the size of the dataset, I decided to apply a Gaussian random noise filter to the images: each pixel's luminosity value was slightly modified by adding a random value drawn from a Gaussian distribution. Considering that each pixel's initial value is in the range 0-255, the random values were generated with a mean of 0 and a standard deviation of 8. The noise filter was applied 10 times to each original image, thus generating 10x more training data (14,360 images). Labels were assigned to the noisy images according to the labels of the non-noisy originals. The results were confirmed visually: the noise distorted the images, but not by much - the objects in the images were still easily recognizable. Both the TensorFlow-based and the Theano-based NN implementations displayed classification accuracy improvements. The networks were configured as follows: 2 layers - input/output, 14,360 images (11,967/2,393 training/test), 88 output categories. Both networks reached an accuracy of 90% after 40 epochs, and over 95% after 100 epochs. Naturally, training took a bit longer, as 11,967 images had to be processed each epoch instead of 1197. Both frameworks were configured to do processing on a GPU (NVidia GeForce GTX 750 Ti, 2GB), which significantly improved training performance - approx. 10x in most experiments.
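A sketch of this augmentation step (illustrative, not the exact project script); it assumes the original images are stored as uint8 arrays with values 0-255:

    import numpy as np

    def augment_with_noise(images, labels, copies=10, sigma=8.0):
        """Generate `copies` noisy variants per image; labels are simply repeated."""
        noisy_images, noisy_labels = [], []
        for img, label in zip(images, labels):
            for _ in range(copies):
                noise = np.random.normal(0.0, sigma, img.shape)            # mean 0, std 8
                noisy = np.clip(img.astype(np.float64) + noise, 0, 255)    # keep the valid 0-255 range
                noisy_images.append(noisy.astype(np.uint8))
                noisy_labels.append(label)
        return np.stack(noisy_images), np.asarray(noisy_labels)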
2.2.4 Training a NN with 1 hidden layer
After increasing the volume of training data tenfold, I again tried to train 3-layer networks. It has been proven that neural nets with even a single hidden layer are able to approximate any function - the universality theorem [1]. According to this theorem, we can be certain that there exists a 1-hidden-layer NN that would be able to classify any dataset with 100% accuracy, including my custom dataset. Whereas 2-layer models are quite straightforward - the input is directly mapped to the output - introducing 1 hidden layer increases the number of parameters (weights/biases) to learn and makes the model much more complex, and thus harder to train. At the same time, it allows the model to learn more subtle non-linear dependencies in the input data. A 1-hidden-layer network architecture doesn't have many hyperparameters to adjust: the input layer size is predetermined by the input image resolution, and the output layer size is determined by the number of categories that we want to classify the images into. Thus, this architecture is only flexible in the number of neurons in the hidden layer. Nonetheless, it still proved impossible to achieve any meaningful classification accuracy. The following configuration was used: dataset - 11,967/2,393 training/test images, hidden layer size - 1000 neurons (couldn't go much higher due to the 2GB GPU RAM limit), 88 output categories, tanh as the activation function in the hidden layer, and negative log-likelihood as the cost function. The larger dataset didn't seem to make much difference, with classification accuracy lingering at about 5%, no matter how long the network was trained (up to 400 epochs). This was confirmed with two implementations - TensorFlow and Theano. The real reason, of course, is that a hidden layer of 1000 neurons is simply too small to hold any meaningful features of an input image of over 100,000 pixels. It's generally recommended for a hidden layer to be at least 1/3 the size of the input layer, but when a NN is used to classify images this gives an enormous and prohibitive number of parameters: roughly 100,000 * 33,000, or over 3 billion weights, whereas most modern CNNs don't exceed even 50 million parameters. Thus I concluded that 1-hidden-layer nets aren't applicable to the image classification/object detection task (except for very small images, such as MNIST's 28x28), and moved on to experimenting with CNNs.
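For reference, the back-of-the-envelope arithmetic behind this argument (a quick standalone check, not output from the project code):

    input_size   = 480 * 240           # 115,200 input neurons
    hidden_1000  = 1000
    hidden_third = input_size // 3     # the "1/3 of the input layer" heuristic

    params_1000  = input_size * hidden_1000 + hidden_1000 * 88     # ~115 million weights
    params_third = input_size * hidden_third + hidden_third * 88   # ~4.4 billion weights

    print(params_1000, params_third)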
3. DETECTING OBJECTS USING CNN

3.1 Classification using LeNet-5
Convolutional neural networks handle the aforementioned problem of an exploding number of network weights by creating a hierarchy of hidden layers. The use of CNNs for image classification tasks was popularized in 1998, when the LeNet-5 CNN [7] was created for classifying images of handwritten digits. Later, in 2012, a breakthrough result in image classification was achieved by a UToronto group of researchers [6], who created AlexNet - a deep neural net with 5 convolutional and 3 fully-connected layers - and won the large-scale image classification competition ILSVRC by a wide margin compared to non-CNN approaches (http://www.image-net.org/challenges/LSVRC/2012/results.html). In comparison to fully-connected NNs, CNNs aren't simply classifying image pixels; they are able to recognize features such as lines, geometric shapes and color blobs in the input image, and these concepts increase in complexity with each subsequent layer of feature kernels, as they combine the features learned in previous convolutional layers. Such feature kernels function like image filters that are applied to the input image in a sliding-window manner. On the first convolutional layer the sliding window is relatively small (say 5x5 pixels), but the effective size of the filters doubles with each pooling layer, which merges nearby 2x2 groups of filter outputs to form a filter that covers 10x10 pixels of the original image.
Thus, stacking combinations of convolutional/pooling layers one after another makes it possible to build larger and larger filters with each subsequent layer. An important aspect of CNNs is that the weights learned for each filter at a given layer are shared across all sliding-window positions. This greatly reduces the number of NN parameters that need to be learned during training.

Given that LeNet-5 is a simpler CNN than AlexNet, I decided to try applying it to my dataset first. Both TensorFlow and Theano provide reference implementations of LeNet-5 as part of their tutorials (https://www.tensorflow.org/get started/mnist/pros, http://deeplearning.net/tutorial/lenet.html), so only a data adapter needed to be implemented to import my dataset into both models, and the sizes of the convolutional and fully-connected filters had to be adjusted to the size of the input images. The network had the following configuration. Input layer: 115,200 neurons; 1st convolutional layer: 20 kernels (feature filters) of size 5x5 pixels, 2x2 pooling, tanh activation; 2nd convolutional layer: 40 kernels of size 5x5 (over the 1st-layer kernels, effectively 10x10 over the original image), 2x2 pooling, tanh activation; 1st fully-connected layer: 500 neurons, tanh activation; output layer: 88 categories, negative log-likelihood cost function; SGD learning rate: 0.1; batch size: 10 images. Networks were trained with varying dataset sizes (the clean dataset and the 10x-sized dataset with Gaussian-noise images) and a varying number of output categories: 88 (specific train location) and 10 (larger chunks of track specifying a train location area). None of the attempts to train LeNet-5 on this dataset succeeded; both the TensorFlow and Theano implementations showed a classification accuracy of about 3%, even after training for over 100 epochs.

In retrospect, I realized that trying to classify my dataset using LeNet-5 or even AlexNet would never succeed. Both of these CNNs are designed to classify images which contain a single object that spans the whole image. In comparison, images in my dataset show a table with a model of a railroad, and the train in each image is simply a small dark rectangle somewhere on the tracks. All of the images in my dataset are practically the same, the only difference being the location of that dark rectangle. Thus, the LeNet-5/AlexNet architectures serve a somewhat different purpose. A second reason why classification using LeNet-5 wouldn't work is as follows: the original application of LeNet-5 was classifying small 28x28 images from the MNIST dataset, with little variation across images within one category. AlexNet handles larger and more complex 224x224 images from the ImageNet dataset. This provides the intuition that the LeNet-5 architecture is overly simplistic for the 480x240 images in my dataset, and AlexNet would be more appropriate in this case. For example, one important aspect is the size of the feature kernels in the 1st convolutional layer - with larger images it makes sense to increase their size as well (AlexNet uses 11x11 kernels for the 1st layer). To make use of the AlexNet CNN with my dataset I would need to reformulate my train localization problem as an image classification problem: a sliding window of 50x50 pixels would scan the input image, and each 50x50 tile would be run through AlexNet to identify whether it contains a train. It turned out that object detection is a separate, well-researched topic, and there are appropriate CNN models that tackle this task.
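For illustration, a hedged TensorFlow 1.x sketch of the LeNet-5-style configuration described above (layer sizes are taken from the text; the flattened size assumes the default 'valid' padding; this is not the tutorial code itself):

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 240, 480, 1])   # 480x240 grayscale image
    y = tf.placeholder(tf.int64, [None])                   # one of 88 location categories

    conv1 = tf.layers.conv2d(x, filters=20, kernel_size=5, activation=tf.nn.tanh)
    pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)
    conv2 = tf.layers.conv2d(pool1, filters=40, kernel_size=5, activation=tf.nn.tanh)
    pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)

    flat   = tf.reshape(pool2, [-1, 57 * 117 * 40])   # 57x117x40 after two conv/pool stages
    fc1    = tf.layers.dense(flat, 500, activation=tf.nn.tanh)
    logits = tf.layers.dense(fc1, 88)

    # Negative log-likelihood over softmax outputs = sparse softmax cross-entropy.
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)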
3.2 Object localization
One of the most fundamental approaches to localizing smaller objects in a larger image using CNNs is based on regression. It works as follows. A pretrained CNN such as AlexNet has 5 convolutional layers, which output feature detectors, and 3 fully-connected layers, which classify input images into categories using the features generated by the convolutional layers. To localize an object in an image, such a CNN also has to output the coordinates of a bounding box (in addition to the object class). That is achieved by substituting the 3 fully-connected layers that classify images (the 'classification head') with 3 fully-connected layers that output the 4 coordinates of a bounding box (the 'regression head'). The CNN is then re-trained with this regression head, where the training data contains the 4 ground-truth bounding-box coordinates for each image. A more advanced version of this architecture is introduced by OverFeat [12], which uses a sliding-window approach to improve localization precision. As can be seen, this regression approach only works when the number of bounding boxes per image is fixed (usually 1), since the number of output nodes of the regression head is determined by the number of coordinates being predicted, and it is hard to generalize this approach to an object detection scenario where an image might contain a varying number of objects. Since I wanted to detect multiple trains in a scene, I skipped this model and continued searching for a model better suited to this scenario.
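A rough TensorFlow 1.x sketch of the idea of a box-regression head on top of precomputed convolutional features. The placeholder sizes and layer widths here are assumptions for illustration, not a specific AlexNet implementation:

    import tensorflow as tf

    conv_features = tf.placeholder(tf.float32, [None, 9216])   # flattened conv-layer output (assumed size)
    gt_box = tf.placeholder(tf.float32, [None, 4])              # ground-truth (x, y, w, h) per image

    # 'Regression head': fully-connected layers ending in 4 box coordinates instead of class scores.
    fc1 = tf.layers.dense(conv_features, 4096, activation=tf.nn.relu)
    fc2 = tf.layers.dense(fc1, 4096, activation=tf.nn.relu)
    pred_box = tf.layers.dense(fc2, 4)                           # no activation: raw coordinates

    loss = tf.reduce_mean(tf.reduce_sum(tf.square(pred_box - gt_box), axis=1))  # L2 box loss
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)

The fixed output width of 4 is exactly why this approach cannot handle a variable number of objects per image.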
3.3 Object detection using R-CNN
A rather computationally expensive CNN architecture, R-CNN [3] (proposed in 2014), is able to detect multiple objects of different classes in a single image. This architecture uses a class-agnostic region proposal algorithm, Selective Search [14], as the first step of the object detection pipeline. The algorithm isn't based on any machine learning techniques; it analyzes image features such as color and texture, looking for monotonic blobs that could be object candidates, and generates a large number of bounding-box proposals ('image regions') - up to 2000 per image. Each such region is then treated as a separate input image for a CNN object localization pipeline similar to the one described above. Even though this architecture works well, the method is tremendously slow - 47 seconds per image. Fast R-CNN [2], an improved architecture released by the R-CNN authors a year later, swaps the order of the pipeline operations: potential regions pre-calculated using Selective Search are applied directly to the feature maps produced by the CNN convolutions. This saves the time spent running each proposed region through the convolutional layers, and improves object detection speed almost 25x at test time, with most of the remaining time spent in the region proposal algorithm. In their next architecture, Faster R-CNN [10], the authors got rid of the separate region proposal algorithm altogether. They introduced the Region Proposal Network (RPN), a class-agnostic object detector that is trained on the feature maps produced by the CNN convolutional layers and generates potential regions. Thus, the whole pipeline becomes a monolithic CNN architecture. This latest architecture allows almost real-time object detection: 0.2 seconds per image. Faster R-CNN paired with the recently released deep CNN ResNet (101 layers) provides impressive object detection precision - 59.0 mAP [4] (mean Average Precision, 0-100 scale) when trained on the COCO dataset (a dataset used for object detection evaluation, with an average of 7 objects per image). This result was state of the art at the beginning of 2016.

I decided to experiment with this architecture. Google has recently released the TensorFlow Object Detection API (github.com/tensorflow/models/tree/master/object detection), which includes an implementation of the Faster R-CNN/ResNet101 object detector. The model is pretrained on the COCO dataset, so there is no need to spend weeks training the network. Nonetheless, since I wanted the network to classify my own dataset, it had to be retrained a bit with my custom dataset (create trains tf record.py) using a technique called 'transfer learning': the output classification is augmented by introducing new object classes from an additional dataset. To use the model, I reformulated the problem from image classification (88 categories denoting train location) into a train object detection problem: each image is put through an object detector CNN, which decides whether the image contains any trains and outputs their specific locations. To import my dataset into this model, I converted the images from grayscale to RGB, since modern CNNs work with 3 color channels. The model allows each training image to be annotated with a list of objects and their bounding boxes. Since my images only contain one train (not multiple), I provided a single bounding-box annotation for each image, and introduced 1 additional object class - 'train'. The ground-truth bounding box was calculated as a 50x50 box around the center of the category that the image belongs to (out of the previously defined 88 image categories - for example, the center point for category a05 is [271, 44]). Training progress can be seen in Fig. 4. In general, the model achieves over 95% accuracy in about 5 epochs, which looks really good compared to the 2-layer NNs tried previously. Nonetheless, training this CNN is only possible on a high-end GPU.

Figure 4: SSD (green) vs Faster R-CNN (violet) detection precision. Both models achieve over 95% mAP after training for 6000 batches (5 epochs)
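A hedged sketch of how such a ground-truth box can be derived from a category center and expressed in the normalized [0, 1] coordinates commonly used for detection annotations. The helper name is illustrative and this is not the exact contents of create trains tf record.py; the 50x50 box, the 480x240 image size and the a05 center point [271, 44] come from the text above:

    IMG_W, IMG_H = 480, 240
    BOX = 50  # ground-truth box side, in pixels

    def box_for_category(center_x, center_y):
        """Return (ymin, xmin, ymax, xmax) normalized to [0, 1], clipped to the image."""
        xmin = max(center_x - BOX / 2, 0) / IMG_W
        xmax = min(center_x + BOX / 2, IMG_W) / IMG_W
        ymin = max(center_y - BOX / 2, 0) / IMG_H
        ymax = min(center_y + BOX / 2, IMG_H) / IMG_H
        return ymin, xmin, ymax, xmax

    # Example: category a05 with center point [271, 44].
    print(box_for_category(271, 44))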
3.4 Object detection using SSD
The Single-Shot MultiBox Detector (SSD) CNN [8], introduced by Google at the end of 2015, uses a somewhat simpler approach than Faster R-CNN. Region proposals work in a fashion similar to the Region Proposal Network introduced in Faster R-CNN, i.e. a fixed number of potential regions ('anchor boxes') is considered.
But rather than feeding these potential regions further through the feature extractor to assign an object class (as Faster R-CNN does), SSD uses these regions directly and evaluates a score for each region for each object class. This simplifies the model and allows faster training/evaluation without sacrificing much detection quality. The SSD model is also provided as part of the TensorFlow Object Detection API, where it is paired with the GoogLeNet Inception2 CNN [13]. This model requires data in the same format as Faster R-CNN, so no additional setup was required. The authors of the Object Detection API have written a thorough SSD vs Faster R-CNN comparison paper [5], where they mention that Inception2 is a much more compact network than ResNet101 (10 million vs 42 million parameters respectively), requires up to 3 times fewer GPU FLOPs, and much less GPU RAM. This was confirmed in my tests, where SSD was easier to work with and generally faster. An object detection accuracy comparison with Faster R-CNN/ResNet101 can be seen in Fig. 4. In general, precision improvement with SSD progresses a little more slowly than with Faster R-CNN (in terms of training epochs), but both converge to over 95% precision in fewer than 5 epochs. SSD is faster in wall-clock time, though, due to being a simpler model.

Train detection results can be seen in Fig. 5. An interesting observation is that trains located off-track (in the right track circle) aren't detected at all. This is the desired outcome, and the reason is that these trains were never provided as ground-truth values during network training, so the network didn't learn to recognize them. Also, since I provided 50x50-pixel rectangles as the ground-truth boxes representing a 'train', instead of the dark 40x10-pixel stripe, the network learned to recognize these 50x50 image tiles rather than the real-world 40x10-pixel trains.

Figure 5: Train detection, SSD/Inception2 CNN
4. CONCLUSION
This project surveyed various neural network models and applied them to the task of object detection. In my experiments, I gradually increased the complexity of the NNs - from the simplest 2-layer models to present-day state-of-the-art CNNs used for object detection. The project allowed me to gain hands-on experience with training NNs and a good understanding of their capabilities and limitations. The experiments confirmed that simple NN architectures aren't applicable to the object detection case, and that recent progress in this field is significant. I haven't experimented with detecting multiple train objects on the track, but the SSD/R-CNN object detection models can easily be used for detecting multiple objects in a scene, and thus I consider this goal accomplished as well.
5. REFERENCES
[1] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303-314, Dec. 1989.
[2] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580-587, 2014.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[5] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21-37. Springer, 2016.
[9] M. A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
[10] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533-538, 1986.
[12] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[14] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154-171, 2013.