Prof. Steve Keckler Talk

June 23, 2016 / October 3, 2016 / heartinpiece

Architectures for Deep Neural Networks

DNNs are here to stay because of the big data that is now available (lots of data to analyze and learn from) and the computational ability of current systems.

Accuracy has increased drastically due to DNNs.

Now is the time for computer architects to use their creativity to improve performance, instead of relying on process technology to improve performance.

How do DNNs work?

Training & inference

Training -> Large amount of data. (Backpropagate to learn)
Inference -> Smaller, varied amount of incoming data

A lot of linear algebra! Output = Weight * Input

Learn the weights using backpropagation

Locality: Batching! Multiple images at a time. Input matrices & output matrices (single images are vectors)
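
A minimal sketch of why batching helps locality, in plain Python. The layer sizes and values are made up for illustration, and `matvec` and `matmat` are hypothetical helper names. With a single image (a vector) every weight is fetched for one use; with a batch (a matrix) each weight row is fetched once and reused for every image.

```python
def matvec(weights, x):
    """Output = Weight * Input for one image (a vector)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def matmat(weights, batch):
    """Same layer applied to a batch: one output column per image.

    Each weight row is fetched once and reused for every image in the batch.
    """
    outputs = [[0.0] * len(batch) for _ in weights]
    for r, row in enumerate(weights):      # fetch a weight row once
        for b, x in enumerate(batch):      # reuse it across the batch
            outputs[r][b] = sum(w * xi for w, xi in zip(row, x))
    return outputs


weights = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]   # 3 outputs, 2 inputs
batch = [[1.0, 1.0], [2.0, 0.0]]                  # 2 images

batched = matmat(weights, batch)
# Column b of the batched output equals the single-image result for image b.
per_image = [matvec(weights, x) for x in batch]
```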

Convolutional stage and fully connected stage

Convolutional stages act as trained feature detection stage

6D loop! Dot product accumulator within the image.
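
One way to read the "6D loop" is as a direct convolution written out as six nested loops. The sketch below assumes stride 1 and no padding, and the loop order and the name `conv_layer` are my own choices, not necessarily what the talk showed. Batching would add a seventh loop outside.

```python
def conv_layer(images, filters):
    """Direct convolution (stride 1, no padding).

    images:  [C][H][W]     input channels
    filters: [K][C][R][S]  K filters, each C x R x S
    returns: [K][H-R+1][W-S+1]
    """
    C, H, W = len(images), len(images[0]), len(images[0][0])
    K, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    out_h, out_w = H - R + 1, W - S + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(K)]
    for k in range(K):                      # 1: output channel (filter)
        for y in range(out_h):              # 2: output row
            for x in range(out_w):          # 3: output column
                acc = 0.0                   # dot-product accumulator
                for c in range(C):          # 4: input channel
                    for r in range(R):      # 5: filter row
                        for s in range(S):  # 6: filter column
                            acc += images[c][y + r][x + s] * filters[k][c][r][s]
                out[k][y][x] = acc
    return out


image = [[[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]]                 # 1 channel, 3x3
filt = [[[[1.0, 0.0],
          [0.0, 1.0]]]]                     # 1 filter, 1 channel, 2x2
result = conv_layer(image, filt)
```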

Run a larger network (deeper: 100s~1000s of levels)
Run network faster
Run network more efficiently

Limit of memory capacity (Bandwidth & capacity tradeoff!)

vDNN: Virtualizing training!

Trend is larger and deeper neural networks

Recurrent neural networks! (Temporal factor??) Audio recognition

Training needs to keep each layer's output activations in memory so that they can be used for backpropagation!

Volume of that data is very large! GPU memory usage is proportional to the number of layers

Computing set of gradients!

We don't need the data of the first layers until the forward pass reaches the end and backpropagation works its way back to them! This data takes up a lot of space! Thus offload it to the DRAM (more capacity)

(forward and backward propagation)

Pipeline Writeback and prefetching

Allows training much larger networks! Incurs little overhead relative to hypothetical large-memory GPUs; the overhead will drop with faster CPU & GPU links.
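
A rough sketch of the offload/prefetch idea, as a toy memory model in Python. It only counts activation sizes (weights, gradients and workspace are ignored), and the function names and the exact release policy are assumptions of mine, not the scheme from the talk.

```python
def peak_memory_baseline(act_sizes):
    """Keep every layer's output activations on the GPU until backprop uses them."""
    return sum(act_sizes)


def peak_memory_offload(act_sizes):
    """Offload a layer's activations once the next layer has consumed them.

    Forward: only the current layer's input and output are resident.
    Backward: prefetch a layer's activations just before they are needed.
    """
    resident = {}   # layer index -> size currently on the GPU
    host = {}       # layer index -> size offloaded to the larger DRAM
    peak = 0
    # Forward pass
    for i, size in enumerate(act_sizes):
        resident[i] = size
        peak = max(peak, sum(resident.values()))
        if i - 1 in resident:                    # previous output was consumed
            host[i - 1] = resident.pop(i - 1)    # write back (offload)
    # Backward pass, last layer first
    for i in reversed(range(len(act_sizes))):
        if i in host:
            resident[i] = host.pop(i)            # prefetch
        peak = max(peak, sum(resident.values()))
        resident.pop(i + 1, None)                # later layer is finished
    return peak


shallow = [4] * 4      # 4 layers, 4 units of activation data each
deep = [4] * 100       # 100 layers
```

In this toy model the baseline grows with the number of layers while the offloaded version stays at two layers' worth of activations.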

Optimizing DNNs for Inference

Accessing DRAM is far more expensive than computing. But then to have more accuracy we still need large networks!

Opportunities:

Reduce numerical precision: FP32->FP16
Prune networks: Lots of zeros as weights! Thus remove these unnecessary weights. Also share weights among other elements in the weight matrix
Share weights
Compress network
Re-architect network

Importance of staying local! LPDDR 640 pJ/word, SRAM (MB) 50 -> SRAM (KB) 5

Cost of operations

Accumulate at greater precision than your multiplications!
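
A small illustration of why the accumulator wants more precision than the multiplier. It uses the `struct` module's half-precision format (`"e"`) to round values to FP16. The Python float used as the wide accumulator is FP64, standing in for FP32, and the all-ones input is contrived so that the effect is easy to see.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_fp16_accumulate(a, b):
    """Multiply in FP16 and also keep the running sum in FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_wide_accumulate(a, b):
    """Multiply in FP16 but accumulate in a wider format (Python float here)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x * y)
    return acc


a = [1.0] * 4096
b = [1.0] * 4096
narrow = dot_fp16_accumulate(a, b)   # stalls once adding 1 no longer changes the FP16 sum
wide = dot_wide_accumulate(a, b)     # reaches the exact answer
```

FP16 has an 11-bit significand, so once the running sum reaches 2048 adding 1 rounds back to 2048 and the narrow accumulator stops growing.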

Summary

Pruning

Lots of the weights are zeros! Prune unnecessary weights!

This allows reducing the network dramatically. You can also prune weights that are close to 0. The remaining weights can be retrained to recover from pruning! Retraining is important

Prune up to 90% of weights, but iterative retraining keeps the accuracy pretty high

Pruning can be aggressively done

Factor of 3x performance improvement from pruning
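
A minimal sketch of magnitude-based pruning, assuming "close to 0" means "smallest absolute value". The weights are made up, `prune_smallest` and `sparse_dot` are hypothetical names, and the retraining step that recovers accuracy is not shown.

```python
def prune_smallest(weights, n_prune):
    """Zero out the n_prune weights with the smallest magnitude."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sparse_dot(weights, inputs):
    """Dot product that skips zero weights; returns (result, MAC count)."""
    total, macs = 0.0, 0
    for w, x in zip(weights, inputs):
        if w != 0.0:
            total += w * x
            macs += 1
    return total, macs


weights = [0.9, -0.02, 0.0, 0.4, 0.01, -0.7, 0.03, 0.0, -0.05, 0.6]
inputs = [1.0] * len(weights)

pruned = prune_smallest(weights, 6)              # prune 60% of the weights
_, dense_macs = sparse_dot(weights, inputs)      # only exact zeros skipped
_, pruned_macs = sparse_dot(pruned, inputs)      # near-zero weights skipped too
```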

Weight Sharing

Similar weights can be clustered, and simple bit indexes then point to fine-tuned centroids

Retrain to account for the change as well!

We can reduce weight table significantly!

Huffman encoding improves this further! Up to 35x~49x reduction!
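
A sketch of weight sharing with a tiny 1-D k-means, assuming linearly spaced initial centroids. The weights, the cluster count and the name `share_weights` are illustrative. Fine-tuning the centroids by retraining and the Huffman coding step are left out.

```python
import math


def share_weights(weights, n_clusters, iterations=10):
    """Cluster weights with 1-D k-means; return (centroids, per-weight indices)."""
    lo, hi = min(weights), max(weights)
    # Linear initialisation between the smallest and largest weight
    centroids = [lo + (hi - lo) * k / (n_clusters - 1) for k in range(n_clusters)]
    indices = [0] * len(weights)
    for _ in range(iterations):
        # Assign every weight to its nearest centroid
        indices = [min(range(n_clusters), key=lambda k: abs(w - centroids[k]))
                   for w in weights]
        # Move each centroid to the mean of its members
        for k in range(n_clusters):
            members = [w for w, i in zip(weights, indices) if i == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return centroids, indices


weights = [-1.02, -0.98, -0.05, 0.03, 0.01, 0.97, 1.01, 1.05]
centroids, indices = share_weights(weights, n_clusters=3)

bits_per_index = math.ceil(math.log2(len(centroids)))   # 2 bits for 3 clusters
original_bits = 32 * len(weights)                        # FP32 per weight
shared_bits = bits_per_index * len(weights) + 32 * len(centroids)
```

With only eight weights the centroid table dominates the cost. For a real layer the per-weight index bits dominate, which is where the large reduction comes from.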

Rearchitect networks!

SqueezeNet -> Reduce filters! Combination of 1×1 filters and 3×3? This allows really shrinking down the networks!

DNN Inference Accelerator Architectures

Eyeriss: ISSCC 2016 & ISCA 2016

Implications of data flow in DNNs

Do we keep filters stationary or input stationary, etc.

Maximize reuse of data within PE

Data compression! Nonlinear functions are applied between activations. Used to be sigmoid. Now it is ReLU! (Negative values become 0)

Lots of negative values are calculated!

Reduce data footprints… Compress them before moving to off-chip memory.

Saves volume of transferred data

If there are multiplications by 0, then just disable the rest of the calculation, reduce the power!
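
A sketch of how ReLU output can be compressed before it moves off-chip. The run-length format here (a count of zeros before each nonzero value) is an assumption for illustration and is not claimed to be the encoding the chip uses.

```python
def relu(values):
    """ReLU: negative values become 0."""
    return [v if v > 0 else 0.0 for v in values]


def compress_zeros(values):
    """Run-length encode zeros as (zeros_before, nonzero_value) pairs.

    Trailing zeros are recorded as a final (count, None) entry.
    """
    encoded, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            encoded.append((run, v))
            run = 0
    if run:
        encoded.append((run, None))
    return encoded


def decompress_zeros(encoded):
    """Inverse of compress_zeros."""
    values = []
    for run, v in encoded:
        values.extend([0.0] * run)
        if v is not None:
            values.append(v)
    return values


pre_activation = [-1.5, 0.2, -0.3, -0.8, 1.1, -2.0, -0.1, 0.4]
activations = relu(pre_activation)      # most entries become zero
encoded = compress_zeros(activations)   # 3 pairs instead of 8 values
```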

Up to 25x more energy efficient compared to Jetson TK

EIE: Efficient Inference Engine (compressed activations)

All weights are compressed and pruned, weight sharing, zero skipping!

Sparse matrix engine, and skip zeros.
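
A software sketch of the zero-skipping idea: store only nonzero weights column by column, and skip whole columns whose activation is zero. The storage layout and the names `to_columns` and `sparse_matvec` are assumptions for illustration, not the hardware format.

```python
def to_columns(matrix):
    """Store only nonzero weights, column by column: col -> [(row, weight), ...]."""
    n_cols = len(matrix[0])
    return [[(r, row[c]) for r, row in enumerate(matrix) if row[c] != 0]
            for c in range(n_cols)]


def sparse_matvec(columns, n_rows, activations):
    """y = W * a, skipping zero activations and zero weights. Returns (y, MACs)."""
    y = [0.0] * n_rows
    macs = 0
    for c, a in enumerate(activations):
        if a == 0:
            continue                 # zero activation: skip the whole column
        for r, w in columns[c]:      # only nonzero weights are stored
            y[r] += w * a
            macs += 1
    return y, macs


W = [[0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 3.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 4.0, 5.0, 0.0]]
a = [1.0, 0.0, 2.0, 1.0]
y, macs = sparse_matvec(to_columns(W), len(W), a)   # dense would need 16 MACs
```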

Conclusion

Special purpose processors can maximize energy efficiencies

ISCA Day 3

June 23, 2016 / heartinpiece

DRAM2

DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric

Low-power High density reconfigurable RAM.

A low-power, high-capacity FPGA is desirable: use DRAM instead of SRAM lookup tables to provide low power & high capacity

Challenges of building DRAM-based FPGA

A DRAM LUT is slower than SRAM, and access is destructive (data is lost after access)

Narrower MAT (DRAM array): from 1K -> 8~16 bits

The destructive DRAM read is solved by PRE(charge), ACT(ivation), RST(restore),
followed by a wire transfer. These are sequential, but can be overlapped:
RST can be overlapped with the wire transfer, etc.

Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs

Resistive Memory Write:

Slower writes have endurance benefits (knife sharpening example)

Adaptively choose the write speed.

Make use of the idle memory(?) to slowly write back

Bank-aware mellow writes: Choose banks with fewer blocks to write back in the write-back queue of the memory controller. For those relatively free banks, issue slow writes.
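
A sketch of the bank-aware decision as I understood it: count the pending write backs per bank and issue slow writes only to banks that are relatively free. The threshold of 2 and the name `choose_write_speeds` are made up for illustration.

```python
from collections import Counter


def choose_write_speeds(write_queue, busy_threshold=2):
    """Pick a write speed for each queued write back.

    write_queue: list of (block_address, bank) entries in the controller's queue.
    Banks with fewer than busy_threshold pending writes get slow writes.
    """
    pending = Counter(bank for _, bank in write_queue)
    return [(addr, bank, "slow" if pending[bank] < busy_threshold else "normal")
            for addr, bank in write_queue]


queue = [(0x10, 0), (0x20, 0), (0x30, 0), (0x40, 1), (0x50, 3)]
decisions = choose_write_speeds(queue)   # bank 0 is busy, banks 1 and 3 are not
```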

Performance degradation is not noticeable, endurance improved by 87%

Eager Mellow Writes:

Predict that LLC dirty lines will not be dirtied again, and so writeback slowly to ReRAM

Does some epoch counting? to find cachelines…

Add an eager mellow write queue: lowest priority, but it uses memory bandwidth to write back eagerly.

Eager writeback also improves performance as it reduces write queue congestion!!

Also employs a lifetime quota, where a lifetime is enforced.

Slower writes use more energy!

MITTS: Memory Inter-arrival Time Traffic Shaping

CPU initiated memory bandwidth provisioning.

IaaS can charge the users on memory bandwidth usage (and arrival time)

HW mechanism to provision memory BW (Bulky vs. Bursty bandwidth)

Relative memory inter-arrival time, make into a histogram.

Credits per inter-arrival time in bins. Thus if you use all your credit, you need to wait, and use the next inter-arrival time credit bin.

Array of registers that represent the credits in each bin. Also, a replenisher to refill the bins.
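
A toy model of the credit bins, to make the mechanism concrete. The bin width, the credit counts, the free first request and the class name `InterArrivalShaper` are all assumptions of mine. The real hardware details are not in these notes.

```python
class InterArrivalShaper:
    """Credit bins indexed by inter-arrival time (in cycles, bucketed)."""

    def __init__(self, credits_per_bin, bin_width):
        self.initial = list(credits_per_bin)   # replenish target per bin
        self.credits = list(credits_per_bin)   # array of credit registers
        self.bin_width = bin_width
        self.last_issue = None

    def _bin(self, gap):
        return min(gap // self.bin_width, len(self.credits) - 1)

    def issue(self, now):
        """Return the cycle at which a request arriving at `now` may issue."""
        if self.last_issue is None:            # first request: no gap to charge
            self.last_issue = now
            return now
        b = self._bin(now - self.last_issue)
        # Out of credit in this bin: wait until the gap falls in a later bin.
        while b < len(self.credits) and self.credits[b] == 0:
            b += 1
        if b == len(self.credits):
            return None                        # must wait for the replenisher
        self.credits[b] -= 1
        issue_time = max(now, self.last_issue + b * self.bin_width)
        self.last_issue = issue_time
        return issue_time

    def replenish(self):
        """Refill every bin to its configured credit count."""
        self.credits = list(self.initial)


# Bin 0: gaps of 0-9 cycles (bursty), bin 1: 10-19, bin 2: 20 or more.
shaper = InterArrivalShaper([1, 2, 4], bin_width=10)
times = [shaper.issue(t) for t in (0, 2, 3, 40)]
credits_left = list(shaper.credits)
```

The third request arrives one cycle after the second, but the bursty bin is already empty, so it is delayed until its gap qualifies for the next bin.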

Reliability 2

All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory

Command/Address transfer failed to speed up with newer DRAM specs.

CA more prone to errors than data! Due to DDR DIMM technology

CA-parity was introduced in DDR4.

Read address error! (Read wrong codeword! Data & data ECC are valid within themselves, but it is the wrong codeword!!)
Extended data ECC -> the address is also encoded into the ECC.
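
A sketch of the idea of folding the address into the check bits. A CRC-32 from `zlib` stands in for the real ECC here (it only detects and cannot correct), so this is an illustration of the principle, not the paper's code.

```python
import zlib


def encode(address, data):
    """Store data with a check value computed over address + data."""
    check = zlib.crc32(address.to_bytes(8, "little") + data)
    return data, check


def verify(address, stored):
    """Recompute the check with the address the reader intended to access."""
    data, check = stored
    return zlib.crc32(address.to_bytes(8, "little") + data) == check


memory = {
    0x1000: encode(0x1000, b"codeword A"),
    0x2000: encode(0x2000, b"codeword B"),
}

good_read = verify(0x1000, memory[0x1000])
# Address error: the reader asked for 0x1000 but the DRAM returned 0x2000.
# The data and its check are self-consistent, yet the address does not match.
wrong_read = verify(0x1000, memory[0x2000])
```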

Write address errors are even more severe!

Lots of problems possible with command/address

GPUs 2

ActivePointers: Software Address Translation Layer on GPUs

Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit

Agile Paging: Exceeding…

Virtualization has two problems:

Nested paging-> slower walks

Shadow paging-> fast walks, slow page table updates

Use shadow paging most of the time, and switch to a nested page walk partway through the walk when needed.

Observation!

Nested paging is better if the address space is being frequently switched! However, only a fraction of the address space is dynamic.

Only a few entries of the page table are modified frequently!

Start page walk in shadow mode, optionally switch to nested mode.

One bit of the page table entry signifies that the lower levels require nested paging mode. (Do in nested mode!)

CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture

Sharing architecture -> for each 10-million-cycle execution, the maxima were different when the cache banks and # of slices (cores) were different.

Learning optimizer, has a feedback mechanism that goes through a Kalman filter(?)

ISCA Day 2

June 21, 2016 / heartinpiece

Neural Network

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks

Contribution

Novel energy-efficient CNN dataflow
Fabricated, verified chip
Taxonomy of CNN dataflow

Convolution accounts for more than 90% of overall computation (runtime & energy consumption)

Simple arithmetic: Multiply and accumulate (MAC)

Add local memory hierarchy: Lots of data is reused (due to nature of window sliding)

Image reuse
filter reuse,
Convolution reuse

Reduce read of filter/image by up to 500x

Taxonomy:

Weight Stationary
1. Filter is stationary
Output Stationary
1. Partial sums (outputs) are stationary
No Local Reuse
1. No local reuse….??
Row Stationary <– Contribution
1. Maximize local reuse and accumulation at RF
2. Row-based work dispatching?

Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory

Memory capacity & memory bandwidth issue:
Use PIM in HMC for the issue

We want a programmable (general-purpose) & energy-efficient design

Programmable, Scalable,

PIM in HMC logic die. (Thermal and Energy restriction in the standard?)

Programmability

Data will be pushed into the PE, and this event will trigger data processing

Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory

Design an ISA for Neural Network

Microcontroller

Decoupling Loads for Nano-Instruction Set Computers

McFarlin13: 88% of OoO performance comes from scheduling, 12% from dynamism (reacting to dynamic events)

Ld: Compute address, access memory & write register, (Ordering)
Bne: Specify target, compute condition, transfer control

Multi-action loads make it difficult for the compiler to reorder loads (due to aliasing concerns)

Decoupling Load -> execute load to cache ASAP, writeback to register in order (ordering)

Basically prefetch the data to the Register from the caches early?

These loads are stored using load tags. Thus loads are prefetched into load tags, and any stores to the same physical address as a load tag will update that load tag! Thus consistency is guaranteed

I don’t understand the justification that this is not just prefetching!

But still awesome!

Future Vector Microprocessor Extensions for Data Aggregations

Data aggregation: SQL Group By

Reduction of key-value pairs: sum, min, max, average, etc.
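
For reference, a plain scalar version of the kind of aggregation being accelerated (a SQL-style GROUP BY over key-value pairs). The proposed vector extensions themselves are not shown, and `group_by_aggregate` is a hypothetical name.

```python
def group_by_aggregate(pairs):
    """GROUP BY key over (key, value) pairs: sum, min, max and average."""
    groups = {}
    for key, value in pairs:
        g = groups.setdefault(
            key, {"sum": 0, "min": value, "max": value, "count": 0})
        g["sum"] += value
        g["min"] = min(g["min"], value)
        g["max"] = max(g["max"], value)
        g["count"] += 1
    return {k: {"sum": g["sum"], "min": g["min"], "max": g["max"],
                "avg": g["sum"] / g["count"]}
            for k, g in groups.items()}


pairs = [("a", 3), ("b", 10), ("a", 5), ("b", 2), ("a", 4)]
result = group_by_aggregate(pairs)
```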

(slept ㅡㅡ;;)

Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading

Increase instruction windows in SMT

SMT degrades OoO benefits, as more threads are added.

True data dependencies take longer, thus in-order execution

Add a ‘Shelf’ that holds true dependencies(?)???

Shelf instructions are executed in order, and the shelf forms a FIFO for them.

Baseline is OoO only with ROB(64); compared with ROB+Shelf (64+64) and ROB(128)
The performance is better for ROB(128), but EDP is better with the 64+64 ROB & Shelf

Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Add a little bit of compute to memory controller

Dependent cache miss:

The following instructions are delayed because the loads of previous instructions miss. But the dependent loads also miss, thus dependent cache misses.

Pointer dependent workloads (pointer chasing, graph processing, etc.)

On-chip delays are pretty significant (Shared interconnect, and cache??)

Enhanced Memory Controller(EMC) is used to execute the dependent instructions.

Dependence chain generation generates the instructions to be executed in the EMC.

Miss Predictor Qureshi and Loh (MICRO 2012)

Great work!

Keynote

June 21, 2016 / heartinpiece

Improving DRAM latency by 30%: 28% larger die size, 20% power increase, 8% performance improvement
NVM: PCM, STT-MRAM, ReRAM

PCM, ReRAM are oriented for large storage, STT for latency?

ISCA Day 1

June 20, 2016 / June 21, 2016 / heartinpiece

LAP: Loop-Block Aware Inclusion Properties for Energy-Efficient Asymmetric Last Level Caches

Adaptive selection between inclusive and exclusive cache based on the LLC misses and memory traffic?

This work is of interest, and I should take a look at it.

Loop blocks are used as guides to the adaptive inclusive/exclusive cache.

Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors

Scripting languages offer ease of programming and natural support for event-driven programming model. However, too slow.

Recurring inefficiency of bytecode dispatch loop.

Fetch a bytecode, decode, bounds check, jump address calculation, jump, execute bytecode.

This dispatcher code takes 10~30% of the total instruction count.

There are a few problems:

Hard-to-predict indirect jumps.
Redundant calculations
This work solves the two problems above by using the BTB (with bytecode as key!)

Use the BTB using the bytecode as key (not PC)

Hits short-circuit to the correct bytecode handler
Miss falls back to the original slow path
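
A software analogy of the hit/miss behaviour, with a small dictionary standing in for BTB entries keyed by bytecode. The two-opcode interpreter, the table size and all function names are invented for illustration. The real mechanism is in hardware.

```python
def slow_path_lookup(handlers, opcode):
    """Original dispatch: bounds check, then compute which handler to jump to."""
    if not 0 <= opcode < len(handlers):
        raise ValueError("bad opcode")
    return handlers[opcode]


def run(bytecode, handlers, table_size=4):
    """Interpreter loop with a small table keyed by bytecode (not by PC)."""
    table = {}                      # stands in for BTB entries: opcode -> handler
    stack, hits, misses = [], 0, 0
    for opcode, operand in bytecode:
        handler = table.get(opcode)
        if handler is not None:
            hits += 1               # hit: short-circuit straight to the handler
        else:
            misses += 1             # miss: fall back to the original slow path
            handler = slow_path_lookup(handlers, opcode)
            if len(table) < table_size:
                table[opcode] = handler
        handler(stack, operand)
    return stack, hits, misses


def op_push(stack, operand):
    stack.append(operand)


def op_add(stack, _):
    stack.append(stack.pop() + stack.pop())


PUSH, ADD = 0, 1
program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (ADD, None)]
stack, hits, misses = run(program, [op_push, op_add])
```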

A Measurement Study of ARM Virtualization Performance

KVM & Xen were used to compare ARM & x86

ARM on Xen can be 4x faster for hypercalls
4x slower on KVM

This is because Xen is a bare-metal hypervisor.

KVM is a type 2 hypervisor. Runs app & virtual machine.

ARM EL2 (hypervisor privilege) is designed for simple hardware (Xen)

KVM does a lot better for virtualized I/O: the hosted hypervisor runs at the same level as Linux, which is sophisticated enough to execute I/O

Xen requires switching from VM->Xen->Dom0->Xen->HW, thus a lot of traps!

VHE (Virtualization Host Extensions) allows KVM ARM to run in EL2 (hypervisor privilege)