## Architectures for Deep Neural Networks

DNN is here to stay because of the big data that is available. Lots of data to analyze and run from. Also the computational ability of current systems.

Accuracy have been increased drastically. due to DNN.

Now a time for computer architects to use the creativity to improve performance instead of process technologies iproving performance.

## How do DNNs work?

Training & inference

- Training -> Large number of data. (Back propogate to learn)
- Inference -> Smaller varied number of incoming data

A lot of linear algebra! Output = Weight * Input

Learn the weight using back propagations

Locality: Batching! Multiple images at a time. Input matrices & output matrices (single images are vectors)

**Convolutional stage and fully connected stage**

Convolutional stages act as trained feature detection stage

6D loop! Dot product accumulator within the image.

- Run a larger(deeper 100s~1000s of levels)
- Run network faster
- Run network more efficiently

Limit of memory capacity (Bandwidth & capacity tradeoff!)

## vNN: Virtualizing training!

Trend is larger and deeper neural networks

Recurrant neural networks! (Temporal factor??) Audio recognition

Training needs to keep the memory of each layer output activations so that it an be used to backpopagation!

Volume of that data is very large! GPU memory usage is proportional to the number of layers

Computing set of gradients!

We don’t need data of the first layers until the training goes to the end, then back propagate! These data take up a lot of space! Thus offload to the DRAM (more capacity)

(forward and backward propagation)

Pipeline Writeback and prefetching

Allows training much larger networks! Incurs little overhead relative to hypothetical large memory GPUs, oeverhead will drop with faster CPU & GPU links.

## Optimizing DNNs for Inference

Accessing DRAM is far more expensive than computing. But then to have more accuracy we still need large networks!

Opportunities:

- Reduce numerical precision: FP32->FP16
- Prune networks: Lots of zeros as weights! Thus remove these unnecessary weights. ALso share weights among other elements in the weight matrix
- Share weights
- Compress netowrk
- Re-architect network

Importance of staying local! LPPDDR 640 Pj/word, SRAM(MB) 50 -> SRAM(KB) 5

Accumulate at greater precision than your multiplications!

### Pruning

Lots of the wieghts are zeros! Prune unnecessary weights!

This allows reducing the network dramatically. You can also prune weights that are close to 0. The other wieghts can be used to recover from pruning! Retraining is important

Prune up to 90% of wegiths, but reiterative learning keeps the accuracy pretty high

Pruning can be aggressively done

Factor of 3x performance improvement of pruning

#### Weight Sharing

Similar weights can be clustered, and using simple bit indexes, point to fin-tuned centroids

Retrain to account for the change as well!

We can reduce weight table significantly!

Huffman encoding to further improve! up to 35x~49x reduction!

#### Rearchitect netowrks!

SqueezeNet -> Reduce filters! combination of 1×1 filters and 3×3? This allows really shrinking down the networks!

### DNN Inference Accelerator Architectures

##### Eyeriss: ISSCC 2016 & ISCA 2016

Implications of data flow in DNNs

Do we keep filters stationary or input stationary, etc.

Maximize reuse of data within PE

Data compression! Non linear functios applied between activations. Used to be sigmoid. Now it is ReLU! (Negative values are 0)

Lots of negative values are calculated!

Reduce data footprints… Compress them before moving to offschip ememory.

Saves volume of transfered data

If there are multiplications by 0, then just disable the rest of the calculation, reduce the power!

Up to 25x Energy efficient compared to Jetson TK

##### EIE: Efficient inference Engine (compressed activations)

All weights are compressed and pruned, weight sharing, zero skipping!

Sparse matrix engine, and skip zeros.

## Conclusion

Special purpose processors can maximize energy efficiencies