On my not-too-shabby laptop CPU, I can run most common CNN models in (at most) 10–100 milliseconds with libraries like TensorFlow.
In 2019, even a smartphone can run “heavy” CNN models (like ResNet) in less than half a second. So imagine my surprise when I timed my own simple implementation of a convolution layer and found that it took over 2 seconds for a single layer!
Of course, modern deep learning libraries ship production-grade, highly optimized implementations of most operations — so the gap between my naive code and theirs shouldn’t be shocking. But a 20–100× difference for the same arithmetic is worth understanding.
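To make the comparison concrete, here is a rough sketch (not the post’s actual code) of what a naive direct convolution looks like: six nested loops, no blocking, no vectorization. The layout (channels-first, no padding, stride 1) and the function name `naive_conv2d` are illustrative assumptions.

```python
import numpy as np

def naive_conv2d(image, kernels):
    """Naive direct convolution (illustrative sketch, not the post's code).

    image:   (C, H, W)    input feature map, channels-first
    kernels: (K, C, R, S) K filters over C channels, each R x S
    returns: (K, H-R+1, W-S+1)  "valid" output, stride 1, no padding
    """
    C, H, W = image.shape
    K, _, R, S = kernels.shape
    out_h, out_w = H - R + 1, W - S + 1
    out = np.zeros((K, out_h, out_w), dtype=image.dtype)
    # Six nested loops: one multiply-accumulate per innermost iteration.
    for k in range(K):              # each output filter
        for i in range(out_h):      # each output row
            for j in range(out_w):  # each output column
                acc = 0.0
                for c in range(C):          # each input channel
                    for r in range(R):      # kernel rows
                        for s in range(S):  # kernel cols
                            acc += image[c, i + r, j + s] * kernels[k, c, r, s]
                out[k, i, j] = acc
    return out
```

This version does exactly the right arithmetic, yet it leaves all the performance on the table: the loop order gives poor cache locality, nothing is vectorized, and nothing runs in parallel. Optimized libraries typically reorganize the same computation (e.g., im2col plus a tuned GEMM, or direct convolution with register blocking) to close that 20–100× gap.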
Continue reading “Anatomy of a High-Performance Convolution”