Architectures for Deep Neural Networks
DNNs are here to stay because of the large amount of data now available to train and run on, and because of the computational capability of current systems.
Accuracy has increased drastically thanks to DNNs.
Now is the time for computer architects to use their creativity to improve performance, instead of relying on process technology to improve performance.
How do DNNs work?
Training & inference
- Training -> Large amount of data. (Backpropagate to learn)
- Inference -> Smaller, varying amounts of incoming data
A lot of linear algebra! Output = Weight * Input
Learn the weights using backpropagation
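To make "Output = Weight * Input" and "learn the weights" concrete, here is a minimal sketch for a single linear layer trained by gradient descent on a squared-error loss. The function names, learning rate, and toy data are illustrative assumptions, not something shown in the talk; a real DNN stacks many such layers with nonlinearities in between.

```python
# Minimal sketch: learn the weights of one linear layer y = W x by gradient descent.
# train_linear, lr and steps are hypothetical names chosen for this example.

def matvec(W, x):
    """Output = Weight * Input for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def train_linear(samples, n_out, n_in, lr=0.1, steps=200):
    W = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(steps):
        for x, target in samples:
            y = matvec(W, x)                                # forward pass
            err = [yi - ti for yi, ti in zip(y, target)]    # dLoss/dy for 0.5 * ||y - t||^2
            for i in range(n_out):                          # backward pass
                for j in range(n_in):
                    W[i][j] -= lr * err[i] * x[j]           # dLoss/dW[i][j] = err[i] * x[j]
    return W

true_W = [[2.0, -1.0], [0.5, 3.0]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
samples = [(x, matvec(true_W, x)) for x in inputs]
learned = train_linear(samples, n_out=2, n_in=2)
```

After training, `learned` should be very close to `true_W`, since the toy targets were generated from it.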
Locality: Batching! Multiple images at a time. Input matrices & output matrices (single images are vectors)
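A small sketch of why batching helps locality: with one image the layer is a matrix-vector product, and with a batch it becomes a matrix-matrix product in which the same weights are reused for every image. The matrices below are made-up toy values.

```python
def matmul(A, B):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]            # 2 outputs, 3 inputs

image_a = [1.0, 0.0, 2.0]
image_b = [0.5, 1.0, -1.0]

# Single image: the input is one column vector, so W is read once per image.
out_a = matmul(W, [[v] for v in image_a])

# Batch of two images: each image is one column of the input matrix,
# and W is read once for the whole batch.
batch = [list(row) for row in zip(image_a, image_b)]   # shape 3 x 2
out_batch = matmul(W, batch)                           # shape 2 x 2
```

Column 0 of `out_batch` matches `out_a`; the batched form does the same arithmetic but reuses each weight across all images in the batch.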
Convolutional stage and fully connected stage
Convolutional stages act as a trained feature-detection stage
6D loop! Dot product accumulator within the image.
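One plausible reading of the "6D loop" is sketched below: loops over output channel, output row, output column, input channel, filter row, and filter column, with a dot-product accumulator in the innermost body. The exact loop set meant in the talk is an assumption (batching would add one more loop), and stride 1 with no padding is assumed for brevity.

```python
def conv2d(inputs, filters):
    """inputs: [C][H][W], filters: [K][C][R][S] -> output [K][H-R+1][W-S+1]."""
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    K, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    out_h, out_w = H - R + 1, W - S + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(K)]
    for k in range(K):                      # output channel (one per filter)
        for y in range(out_h):              # output row
            for x in range(out_w):          # output column
                acc = 0.0                   # dot-product accumulator
                for c in range(C):          # input channel
                    for r in range(R):      # filter row
                        for s in range(S):  # filter column
                            acc += inputs[c][y + r][x + s] * filters[k][c][r][s]
                out[k][y][x] = acc
    return out

image = [[[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]]       # 1 channel, 3 x 3
filters = [[[[1.0, 0.0],
             [0.0, 1.0]]]]        # 1 filter, 1 channel, 2 x 2
result = conv2d(image, filters)
```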
- Run a larger network (deeper: 100s~1000s of levels)
- Run network faster
- Run network more efficiently
Limit of memory capacity (Bandwidth & capacity tradeoff!)
vDNN: Virtualizing training!
Trend is larger and deeper neural networks
Recurrent neural networks! (Temporal factor??) Audio recognition
Training needs to keep each layer's output activations in memory so that they can be used for backpropagation!
Volume of that data is very large! GPU memory usage is proportional to the number of layers
Computing set of gradients!
We don’t need the data of the first layers until the forward pass reaches the end and backpropagation comes back to them! These data take up a lot of space! Thus offload them to DRAM (more capacity)
(forward and backward propagation)
Pipeline writeback and prefetching
Allows training much larger networks! Incurs little overhead relative to hypothetical large-memory GPUs; overhead will drop with faster CPU & GPU links.
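The memory argument above can be sketched with a toy model. It assumes that only activations count (weights, gradients, and workspace are ignored) and that with offloading the GPU holds the layer being computed plus one neighbouring layer that is in flight (being written back during forward, or prefetched during backward). The function names and sizes are hypothetical, and this is not the actual scheduling policy from the talk.

```python
def peak_without_offload(act_sizes_mb):
    # Every layer's activations stay resident until backpropagation reaches them.
    return sum(act_sizes_mb)

def peak_with_offload(act_sizes_mb):
    # Toy model: the current layer plus one neighbour whose transfer
    # (writeback or prefetch) overlaps with the current layer's compute.
    peak = 0
    for i, size in enumerate(act_sizes_mb):
        in_flight = act_sizes_mb[i - 1] if i > 0 else 0
        peak = max(peak, size + in_flight)
    return peak

layers_mb = [400, 300, 200, 100, 50]     # made-up per-layer activation sizes
baseline = peak_without_offload(layers_mb)
offloaded = peak_with_offload(layers_mb)
```

In this model the baseline grows with the number of layers, while the offloaded peak depends only on the largest pair of adjacent layers.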
Optimizing DNNs for Inference
Accessing DRAM is far more expensive than computing. But then to have more accuracy we still need large networks!
Opportunities:
- Reduce numerical precision: FP32->FP16
- Prune networks: Lots of zeros as weights! Thus remove these unnecessary weights. Also share weights among other elements in the weight matrix
- Share weights
- Compress network
- Re-architect network
Importance of staying local! LPDDR: 640 pJ/word, SRAM (MB): 50 -> SRAM (KB): 5
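A quick arithmetic sketch of what these numbers imply. It assumes all three figures are in pJ per word (the notes only state the unit for the first one), and the word count is an arbitrary example.

```python
# Energy per word fetched, taken from the note above.
# Assumption: the SRAM figures use the same pJ/word unit as the LPDDR figure.
ENERGY_PJ_PER_WORD = {"LPDDR": 640, "SRAM_MB": 50, "SRAM_KB": 5}

def fetch_energy_uj(level, words):
    """Energy in microjoules to fetch `words` words from the given memory level."""
    return ENERGY_PJ_PER_WORD[level] * words / 1e6    # pJ -> uJ

words = 1_000_000
costs = {level: fetch_energy_uj(level, words) for level in ENERGY_PJ_PER_WORD}
ratio_dram_to_small_sram = ENERGY_PJ_PER_WORD["LPDDR"] / ENERGY_PJ_PER_WORD["SRAM_KB"]
```

Under that assumption, a word served from a small local SRAM costs 128x less energy than the same word fetched from LPDDR.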
Accumulate at greater precision than your multiplications!
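A small sketch of why the accumulator should be wider than the multiplier inputs. It emulates FP16 rounding with the standard-library `struct` format `"e"` and compares an FP16 accumulator against a wider one (a Python double here, standing in for FP32 or wider; that substitution is an assumption of this example).

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def dot_fp16_accumulate(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + prod)           # accumulator rounded to FP16 after every add
    return acc

def dot_wide_accumulate(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)      # FP16 inputs, wide accumulator
    return acc

a = [1.0] * 4096
b = [1.0] * 4096
narrow = dot_fp16_accumulate(a, b)   # stalls at 2048.0: 2048 + 1 is not representable in FP16
wide = dot_wide_accumulate(a, b)     # 4096.0
```

The FP16 accumulator stops growing once the running sum reaches 2048, because the spacing between adjacent FP16 values there is 2 and each added 1.0 is rounded away.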
Pruning
Lots of the weights are zeros! Prune unnecessary weights!

This allows reducing the network size dramatically. You can also prune weights that are close to 0. The remaining weights can be used to recover from pruning! Retraining is important
Prune up to 90% of weights, but iterative retraining keeps the accuracy pretty high
Pruning can be done aggressively
About a 3x performance improvement from pruning
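A minimal sketch of magnitude-based pruning, assuming the simplest criterion (drop the smallest-magnitude fraction of weights). The function name and the toy matrix are hypothetical; the returned mask is what a retraining loop would use to keep pruned weights at zero.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out roughly `fraction` of the weights, smallest magnitudes first.

    Returns (pruned_weights, mask). Ties at the threshold are all pruned,
    so slightly more than `fraction` may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * fraction)
    if n_prune == 0:
        return [row[:] for row in weights], [[1] * len(row) for row in weights]
    threshold = flat[n_prune - 1]
    mask = [[0 if abs(w) <= threshold else 1 for w in row] for row in weights]
    pruned = [[w if m else 0.0 for w, m in zip(row, mrow)]
              for row, mrow in zip(weights, mask)]
    return pruned, mask

weights = [[0.9, -0.02, 0.0],
           [0.01, -0.7, 0.05]]
pruned, mask = prune_by_magnitude(weights, fraction=0.5)
```

During retraining, multiplying each weight update by `mask` keeps the pruned positions at zero while the surviving weights adjust to recover accuracy.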
Weight Sharing
Similar weights can be clustered and, using simple bit indices, point to fine-tuned centroids
Retrain to account for the change as well!
We can reduce the weight table significantly!
Huffman encoding improves this further! Up to 35x~49x reduction!
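A sketch of weight sharing using a plain 1-D k-means, which is one common way to cluster weights (the clustering method is an assumption here). Each weight is replaced by a small index into a table of centroids; the fine-tuning of centroids during retraining is not shown. Names and values are illustrative.

```python
def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda k: abs(value - centroids[k]))

def kmeans_1d(values, initial_centroids, iterations=20):
    centroids = list(initial_centroids)
    for _ in range(iterations):
        assign = [nearest(v, centroids) for v in values]
        for k in range(len(centroids)):
            members = [v for v, a in zip(values, assign) if a == k]
            if members:                                  # leave empty clusters unchanged
                centroids[k] = sum(members) / len(members)
    return centroids, [nearest(v, centroids) for v in values]

def storage_bits(n_weights, n_centroids, value_bits=32):
    """Bits for per-weight indices plus the centroid table."""
    index_bits = max(1, (n_centroids - 1).bit_length())
    return n_weights * index_bits + n_centroids * value_bits

weights = [-1.02, -0.98, 0.01, -0.01, 0.99, 1.01, 1.0, -1.0]
centroids, indices = kmeans_1d(weights, initial_centroids=[-1.0, 0.0, 1.0])
dense_bits = len(weights) * 32
shared_bits = storage_bits(len(weights), len(centroids))
```

With only 8 weights the centroid table dominates the cost; for a real layer with millions of weights the per-weight index size is what matters.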
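A sketch of why Huffman coding helps on top of weight sharing: when some cluster indices are much more frequent than others, they get shorter codes than a fixed-width index. The index stream below is made up, and only code lengths are computed, not the actual bit strings.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, tie-breaker, symbols in this subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:            # every symbol under the merged node gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, tie, s1 + s2))
        tie += 1
    return lengths

indices = [0] * 12 + [1] * 2 + [2] + [3]        # skewed toy index distribution
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[s] for s in indices)
fixed_bits = len(indices) * 2                    # 2-bit fixed-width index for 4 clusters
```

Here the skewed stream takes 22 bits with Huffman codes versus 32 bits with fixed 2-bit indices.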
Re-architect networks!
SqueezeNet -> Reduce filters! combination of 1×1 filters and 3×3? This allows really shrinking down the networks!
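A parameter-count sketch of the idea, assuming the Fire-module layout from the SqueezeNet paper: a 1×1 "squeeze" layer that cuts the channel count, followed by an "expand" layer mixing 1×1 and 3×3 filters. The channel counts below are illustrative choices and biases are ignored.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights in a kernel x kernel convolution layer (biases ignored)."""
    return in_channels * out_channels * kernel * kernel

def fire_params(in_channels, squeeze, expand_1x1, expand_3x3):
    """Weights in a squeeze (1x1) layer followed by parallel 1x1 and 3x3 expand layers."""
    return (conv_params(in_channels, squeeze, 1)
            + conv_params(squeeze, expand_1x1, 1)
            + conv_params(squeeze, expand_3x3, 3))

plain = conv_params(128, 128, 3)                                       # ordinary 3x3 layer
fire = fire_params(128, squeeze=16, expand_1x1=64, expand_3x3=64)      # same 128 output channels
reduction = plain / fire
```

For these example sizes the Fire-style block uses 12x fewer weights than a plain 3×3 layer with the same input and output channel counts.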
DNN Inference Accelerator Architectures
Eyeriss: ISSCC 2016 & ISCA 2016
Implications of data flow in DNNs
Do we keep filters stationary, or inputs stationary, etc.?
Maximize reuse of data within PE
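A toy illustration of how loop order changes data reuse, using a 1-D convolution and counting how often a weight is fetched into a (hypothetical) PE register. This only contrasts a weight-stationary order with an output-major order; it is not the dataflow Eyeriss actually uses.

```python
def conv1d_output_major(x, w):
    """Finish one output at a time; every MAC re-fetches its weight."""
    out = [0.0] * (len(x) - len(w) + 1)
    weight_fetches = 0
    for i in range(len(out)):
        for r in range(len(w)):
            weight = w[r]
            weight_fetches += 1
            out[i] += x[i + r] * weight
    return out, weight_fetches

def conv1d_weight_stationary(x, w):
    """Fetch each weight once, hold it, and stream all inputs past it."""
    out = [0.0] * (len(x) - len(w) + 1)
    weight_fetches = 0
    for r in range(len(w)):
        weight = w[r]                 # stays in the PE register for the inner loop
        weight_fetches += 1
        for i in range(len(out)):
            out[i] += x[i + r] * weight
    return out, weight_fetches

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 0.0, -1.0]
out_a, fetches_a = conv1d_output_major(x, w)
out_b, fetches_b = conv1d_weight_stationary(x, w)
```

Both orders compute the same outputs; the weight-stationary order cuts weight fetches from 9 to 3 here, at the price of revisiting each partial sum once per weight.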
Data compression! Nonlinear functions are applied to activations between layers. Used to be sigmoid. Now it is ReLU! (Negative values become 0)
Lots of negative values are calculated, so lots of activations become 0!
Reduce data footprints… Compress them before moving to off-chip memory.
Saves volume of transferred data
If there are multiplications by 0, then just disable the rest of the calculation, reduce the power!
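A sketch of the two ideas above: ReLU produces many zeros, which makes activations compressible, and a multiply whose activation is zero can be skipped. The run-length format used here is a generic one chosen for illustration; the notes do not describe the encoding Eyeriss actually uses.

```python
def relu(values):
    return [v if v > 0 else 0.0 for v in values]

def run_length_encode(values):
    """Encode as (zeros_before, nonzero_value) pairs plus a trailing zero count."""
    pairs, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            pairs.append((zeros, v))
            zeros = 0
    return pairs, zeros

def dot_skip_zeros(activations, weights):
    """Dot product that skips (gates) the multiply when the activation is zero."""
    total, macs = 0.0, 0
    for a, w in zip(activations, weights):
        if a == 0:
            continue
        total += a * w
        macs += 1
    return total, macs

pre_activation = [-1.5, 2.0, -0.3, -4.0, 0.5, -2.2, -0.1, 3.0]
acts = relu(pre_activation)
encoded, trailing_zeros = run_length_encode(acts)
total, macs = dot_skip_zeros(acts, [1.0] * len(acts))
```

Of the 8 activations only 3 survive ReLU, so 3 pairs are stored instead of 8 values and only 3 of the 8 multiplies are performed.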
Up to 25x more energy efficient compared to Jetson TK
EIE: Efficient Inference Engine (compressed activations)
All weights are compressed and pruned, weight sharing, zero skipping!
Sparse matrix engine, and skip zeros.
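A software sketch of the sparse matrix-vector idea: store only nonzero weights column by column, skip whole columns when the input activation is zero, and touch only nonzero weights otherwise. The column-wise storage layout is an assumption for illustration, and the weight-sharing codebook lookup is left out.

```python
def to_sparse_columns(dense):
    """For each column, list the (row, value) pairs of its nonzero weights."""
    n_rows, n_cols = len(dense), len(dense[0])
    return [[(r, dense[r][c]) for r in range(n_rows) if dense[r][c] != 0]
            for c in range(n_cols)]

def sparse_matvec(columns, n_rows, activations):
    out = [0.0] * n_rows
    macs = 0
    for c, a in enumerate(activations):
        if a == 0:
            continue                   # zero activation: skip the whole column
        for r, w in columns[c]:        # only pruned-network survivors are stored
            out[r] += w * a
            macs += 1
    return out, macs

W = [[0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 3.0],
     [0.0, 0.0, 0.0, 0.0]]
acts = [4.0, 0.0, 5.0, 1.0]
columns = to_sparse_columns(W)
out, macs = sparse_matvec(columns, n_rows=3, activations=acts)
```

The dense product would need 12 multiply-accumulates; with zero weights and zero activations skipped, this example needs 2.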
Conclusion
Special-purpose processors can maximize energy efficiency







