Saturday, March 17, 2018

Object Detection DL Training with TensorFlow on an AWS GPU Instance

It turns out that if you want to train a model on, say, 5 different categories of images, you need an EC2 instance on AWS that has GPU capabilities.

Otherwise, what happens with CPU-only EC2 instances is that they run out of memory within the first dozen training steps and the process gets killed.

For that you need at the very least a p2.xlarge, which is billed at around $0.9/hr (at the time of writing), so it is still very expensive. Make sure the VM is turned off the moment it's not in use.
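To get a feel for what that hourly rate adds up to, here is a minimal sketch of a cost estimate in shell. The function name estimate_cost is made up for illustration, and the default rate is just the approximate $0.9/hr figure quoted above, so check the current on-demand price for your region before relying on it.

```shell
# Hypothetical helper: estimate the on-demand cost of a training run.
# Usage: estimate_cost HOURS [RATE_PER_HOUR]
# The default rate of 0.90 is the approximate p2.xlarge price quoted above.
estimate_cost() {
  awk -v h="$1" -v r="${2:-0.90}" 'BEGIN { printf "%.2f\n", h * r }'
}

estimate_cost 8     # an 8-hour training run
estimate_cost 720   # forgetting to stop the instance for 30 days

# Stopping the instance from the AWS CLI (the instance ID is a placeholder):
#   aws ec2 stop-instances --instance-ids <your-instance-id>
```

A stopped instance no longer accrues compute charges, although any attached EBS storage is still billed.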

I tried setting up a vanilla p2.xlarge but ended up having issues with the NVIDIA drivers, so when you launch an EC2 instance, try to do so from an already configured AMI, e.g. the AWS Deep Learning AMI.

Follow these steps:
  1. Check that the NVIDIA drivers are properly installed. If you're using the AWS Deep Learning AMI, chances are you don't need to worry about that.
  2. Install Docker CE for Ubuntu.
  3. Make sure the Docker post-installation instructions are also covered.
  4. Install NVIDIA Docker using the instructions on its page.
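Before launching the container, it can help to confirm that each of those steps actually left a working tool behind. The following is a minimal sketch, not part of any official setup guide; the function name check_tools is hypothetical, and it only checks that the commands are on the PATH.

```shell
# Hypothetical helper: report which of the given commands are on the PATH.
# Returns 0 if all are present, 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools nvidia-smi docker nvidia-docker || echo "Finish the steps above before continuing."

# Once all three are present, these confirm the first two layers work:
#   nvidia-smi                     # the driver can see the GPU
#   docker run --rm hello-world    # Docker runs without sudo (post-installation step)
```

If nvidia-smi is reported as missing on a fresh instance, that usually points back to step 1 and the driver installation.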

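You should then be able to launch your Docker container using the following commands. The first creates a named volume so that your notebooks survive container restarts; the second starts the GPU-enabled TensorFlow image with port 8888 published for Jupyter and port 6006 for TensorBoard: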


docker volume create notebooks

nvidia-docker run -it --name tensorflow -v notebooks:/notebooks -p 8888:8888 -p 6006:6006 gcr.io/tensorflow/tensorflow:latest-gpu


Once you have the container running, it's just a question of following my other articles to continue with the training:

1. Setup of the environment, in my case using Docker
2. Labeling and creation of the TFRecord
3. Training custom object detection

So typically you would use GPU instances to train your models, and CPU instances only to run tests against your frozen inference graph (for example using Jupyter), as they are less expensive.
