TensorFlow was the new kid on the block when it was introduced in 2015, and since then, it’s become one of the most used deep learning frameworks.
I jumped on the train a few months after the first release and began my journey into deep learning during my master’s thesis. It took a while to get used to the computation graph and session model, but since then I’ve got my head around most of the quirks and twists.
This short article is not an introduction to TensorFlow, but it instead offers some quick tips, mostly focused on performance, that reveal common pitfalls and may boost your model and training performance to new levels.
We’ll start with pre-processing and your input pipeline, visit graph construction, and then move on to debugging and performance optimizations.
Pre-processing and input pipelines
Keep pre-processing clean and lean
Are you baffled at how long it takes to train your relatively simple model? Check your pre-processing! If you’re doing any heavy pre-processing, like transforming data into neural network inputs, it can significantly slow down your training speed.
In my case, I was using a custom Python function to create so-called ‘distance maps’: grayscale images used as additional inputs in “Deep Interactive Object Selection”.
My training speed topped out at around 2.4 images per second, even after I switched to a much more powerful GTX 1080. I then noticed the bottleneck, and after applying my fix, I was able to train at around 50 images per second.
If you notice such a bottleneck, the usual first impulse is to optimize the code. But a much more effective way to strip away computation time from your training pipeline is to move the pre-processing into a one-time operation that generates TFRecord files.
You only have to do this heavy pre-processing once to create TFRecords for all your training data, and your pipeline then boils down to loading the records. Even if you want to introduce some kind of randomness to augment your data, it’s worth considering creating the different variations once instead of bloating your pipeline.
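The one-time conversion could look roughly like the following minimal sketch. It assumes the `tensorflow.compat.v1` module is available (TensorFlow 1.14 or later); `expensive_preprocess`, the feature names `data` and `label`, and the file name are hypothetical placeholders for your own pre-processing, not part of the original setup.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf


def expensive_preprocess(raw):
    # Hypothetical stand-in for heavy work such as computing distance maps.
    return raw[::-1]


def write_tfrecords(samples, path):
    """Run the heavy pre-processing once and store the results on disk."""
    with tf.io.TFRecordWriter(path) as writer:
        for raw, label in samples:
            example = tf.train.Example(features=tf.train.Features(feature={
                "data": tf.train.Feature(
                    bytes_list=tf.train.BytesList(
                        value=[expensive_preprocess(raw)])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())


def read_tfrecords(path):
    """Load the stored records again; no pre-processing happens here."""
    records = []
    for serialized in tf.io.tf_record_iterator(path):
        example = tf.train.Example.FromString(serialized)
        feature = example.features.feature
        records.append((feature["data"].bytes_list.value[0],
                        feature["label"].int64_list.value[0]))
    return records


# Toy data: (raw bytes, label) pairs standing in for real training samples.
samples = [(b"abc", 0), (b"defg", 1)]
record_path = os.path.join(tempfile.mkdtemp(), "train.tfrecords")
write_tfrecords(samples, record_path)
loaded = read_tfrecords(record_path)
```

The Python iterator is used here only to check the file contents. In a training pipeline you would load the file with the framework’s own record readers, so that the per-step work is reduced to reading and decoding.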
Watch your queues
One way to spot expensive pre-processing pipelines is to look at the queue graphs in TensorBoard. These are generated automatically if you use the framework’s QueueRunners and write the summaries to a file.
The graphs show if your machine was able to keep the queues filled. If you notice negative spikes in the graphs, this means your system is unable to generate new data in the time your machine wants to process one batch.
One of the reasons for this was already discussed in the previous section. The most common reason in my experience is large min_after_dequeue values. If your queues try to keep lots of records in memory, they can easily exhaust your machine’s RAM, which leads to swapping and slows down your queues significantly.
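A quick back-of-the-envelope estimate shows how fast this adds up. The sketch below is plain Python. The record shape, the thread count, and the safety margin of 3 are assumptions chosen for illustration, and the capacity formula is the commonly cited rule of thumb `min_after_dequeue + (num_threads + safety margin) * batch_size`. The memory figure is a lower bound that ignores any overhead.

```python
def shuffle_queue_capacity(min_after_dequeue, num_threads, batch_size,
                           safety_margin=3):
    # Rule of thumb: keep room for the shuffle buffer plus a few batches
    # per enqueuing thread.
    return min_after_dequeue + (num_threads + safety_margin) * batch_size


def queue_memory_gib(num_records, record_shape, bytes_per_value=4):
    # Memory needed just to hold `num_records` dense records (float32 by
    # default), expressed in GiB.
    values_per_record = 1
    for dim in record_shape:
        values_per_record *= dim
    return num_records * values_per_record * bytes_per_value / 2.0 ** 30


capacity = shuffle_queue_capacity(min_after_dequeue=10000, num_threads=4,
                                  batch_size=32)
needed_gib = queue_memory_gib(10000, (512, 512, 3))
print(capacity, round(needed_gib, 1))  # prints: 10224 29.3
```

With these assumed numbers, keeping 10,000 images of 512×512×3 float32 values in the queue already requires about 29 GiB, which is more than many workstations have available.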
Other reasons could be hardware issues, such as disks that are too slow, or data that is too large for your system to handle. Whatever it is, fixing it will speed up your training process.
Graph construction and training
Finalize your graph
TensorFlow’s model of separating graph construction from graph computation is quite rare in day-to-day programming and can confuse beginners. This applies to bugs and error messages, which can surface once when the graph is built and then again when it’s actually evaluated. That is counterintuitive when you’re used to code being evaluated just once.
Another issue is graph construction in combination with training loops. These loops are usually ‘standard’ Python loops and can therefore alter the graph and add new operations to it. Altering a graph while continuously evaluating it will create a major performance loss, but it’s rather hard to notice at first.
Thankfully, there’s an easy fix. Just finalize your graph before starting your training loop by calling tf.get_default_graph().finalize(). This will lock the graph, and any attempt to add a new operation will throw an error. Exactly what we want.
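Here is a minimal sketch of the effect, assuming the `tensorflow.compat.v1` module is available (TensorFlow 1.14 or later). It uses an explicit `tf.Graph()` instead of the default graph so that it is self-contained; the tiny "model" is only a placeholder for illustration.

```python
import tensorflow.compat.v1 as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, shape=[None], name="x")
    doubled = x * 2.0

    # Lock the graph: from here on it is read-only.
    graph.finalize()

    with tf.Session(graph=graph) as sess:
        # Evaluating existing ops in a loop is still fine.
        results = [sess.run(doubled, feed_dict={x: [float(step)]}).tolist()
                   for step in range(3)]

    # Creating a new op now (a typical accident inside a training loop)
    # fails loudly instead of silently growing the graph.
    try:
        tf.constant(1.0)
        modification_blocked = False
    except RuntimeError:
        modification_blocked = True
```

Any operation you need during training, including variable initializers and summary ops, has to be created before the call to `finalize()`.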
Profile your graph
A less prominently advertised feature of TensorFlow is profiling. There is a mechanism to record run times and memory consumption of your graph’s operations. This can come in handy if you’re looking for bottlenecks or need to find out if a model can be trained on your machine without swapping to the hard drive.
To generate profiling data, you need to perform a single run through your graph with tracing enabled:
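The snippet below is a sketch of such a traced run, assuming the `tensorflow.compat.v1` module is available (TensorFlow 1.14 or later). The small matrix multiplication, the log directory, and the tag `step_0` are placeholders chosen for illustration; in your own code you would fetch your training op instead.

```python
import json
import os
import tempfile

import tensorflow.compat.v1 as tf
from tensorflow.python.client import timeline

logdir = tempfile.mkdtemp()    # hypothetical TensorBoard log directory
trace_path = "timeline.json"   # written to the current folder

graph = tf.Graph()
with graph.as_default():
    # Tiny stand-in model; replace `fetch` with your own training op.
    a = tf.random_normal([256, 256])
    fetch = tf.reduce_sum(tf.matmul(a, a))

    # Ask the runtime to record timing and memory statistics for this run.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    with tf.Session(graph=graph) as sess:
        writer = tf.summary.FileWriter(logdir, graph)
        # A single traced run through the graph.
        sess.run(fetch, options=run_options, run_metadata=run_metadata)
        # Makes the run selectable in TensorBoard's graph view.
        writer.add_run_metadata(run_metadata, "step_0")
        writer.close()

# Convert the collected step statistics into Chrome's trace format.
trace = timeline.Timeline(run_metadata.step_stats)
with open(trace_path, "w") as f:
    f.write(trace.generate_chrome_trace_format())
```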
Afterwards, a timeline.json file is saved to the current folder and the tracing data becomes available in TensorBoard. You can now easily see how long an operation takes to compute and how much memory it consumes. Just open the graph view in TensorBoard, select your latest run on the left, and you should see performance details on the right.
On the one hand, this allows you to adjust your model in order to use your machine as much as possible. On the other hand, it lets you find bottlenecks in your training pipeline. If you prefer a timeline view, you can load the timeline.json file in Google Chrome’s Trace Event Profiling Tool.
Another nice tool is tfprof, which makes use of the same functionality for memory and execution time profiling, but offers more convenience features. Additional statistics require code changes.
Watch your memory
Profiling, as explained in the previous section, allows you to keep an eye on the memory usage of particular operations, but watching your whole model’s memory consumption is even more important. Always make sure that you don’t exceed your machine’s memory, as swapping will most certainly slow down your input pipeline and your GPU will start waiting for new data.
A simple top or, as explained in one of the previous sections, the queue graphs in TensorBoard should be sufficient for detecting such behavior. A detailed investigation can then be done using the aforementioned tracing.
Print is your friend
My main tool for debugging issues like stagnating loss or strange outputs is tf.Print. Due to the nature of neural networks, looking at the raw values of tensors inside your model usually doesn’t make much sense. Nobody can interpret millions of floating point numbers and see what’s wrong.
But printing out shapes or mean values can give great insights. If you’re trying to implement an existing model, this allows you to compare your model’s values to the ones in the paper or article and can help you solve tricky issues or expose typos in papers.
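As a small sketch of this pattern, assuming the `tensorflow.compat.v1` module is available (TensorFlow 1.14 or later), the following logs the shape and mean of a tensor. The tensor name and the input values are made up for illustration.

```python
import tensorflow.compat.v1 as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 3], name="activations")

    # tf.Print passes `x` through unchanged and logs the listed tensors to
    # stderr whenever the returned tensor is evaluated.
    x_logged = tf.Print(x, [tf.shape(x), tf.reduce_mean(x)],
                        message="activations shape / mean: ")

    # Downstream ops must use the returned tensor, otherwise nothing prints.
    output = x_logged * 2.0

    with tf.Session(graph=graph) as sess:
        result = sess.run(
            output, feed_dict={x: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})
```

A common mistake is to call `tf.Print` and keep using the original tensor. Since the print op is then not part of the evaluated subgraph, it never runs.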
TensorFlow also includes some tools to help with debugging: the TensorFlow Debugger, introduced with TF 1.0, and eager execution, which became the default in TF 2.0. I haven’t had the chance to use either of these tools yet, but it’s good to know that the TensorFlow team is iterating on solutions for model debugging.
Set an operation execution timeout
So you’ve implemented your model, launched your session, and nothing happens. This is usually caused by empty queues. If you have no idea which queue is responsible for the mishap, finding out is relatively simple. Just enable the operation execution timeout when creating your session, and your script will crash when an operation exceeds your limit:
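The following sketch shows the setting in action, assuming the `tensorflow.compat.v1` module is available (TensorFlow 1.14 or later). The empty queue and the one-second limit are chosen for illustration only, and the exception is caught here just to show which error is raised; without the `try` block, the script would stop with a stack trace.

```python
import tensorflow.compat.v1 as tf

graph = tf.Graph()
with graph.as_default():
    # An empty queue that nobody fills: dequeue would block forever.
    queue = tf.FIFOQueue(capacity=10, dtypes=[tf.float32])
    item = queue.dequeue()

config = tf.ConfigProto()
# Assumed limit of one second; pick a value that fits your slowest op.
config.operation_timeout_in_ms = 1000

timed_out = False
with tf.Session(graph=graph, config=config) as sess:
    try:
        sess.run(item)
    except tf.errors.DeadlineExceededError as error:
        # Raised once the blocked operation exceeds the configured limit.
        timed_out = True
        print("Timed out:", error.message)
```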
Using the stack trace, you can then find out which op is causing the problem, fix the error, and continue training.
Of course, this is only a selection of the tips, tricks, and tools at your disposal when working with TensorFlow. And the TensorFlow team continues to add functionality that makes it easier to build, train, and deploy machine learning models.