ISCA Day 2

Neural Network

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks


  • Novel energy-efficient CNN dataflow
  • Fabricated, verified chip
  • Taxonomy of CNN dataflows

Convolutions account for more than 90% of the overall computation (both runtime and energy consumption)

Simple arithmetic: Multiply and accumulate (MAC)

Add a local memory hierarchy: lots of data is reused (due to the sliding-window nature of convolution)

  • Image reuse
  • Filter reuse
  • Convolution reuse

Reduces reads of filter/image data by up to 500x
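To make the reuse types above concrete, here is a naive convolution loop (my own Python sketch, not the paper's dataflow): the same kernel is applied at every output position (filter reuse), and neighbouring windows overlap on the same image pixels (image/convolution reuse).

```python
# Illustrative sketch only: a naive 2D convolution showing where MACs
# happen and which operands get reused across iterations.
def conv2d(image, kernel):
    H, W = len(image), len(image[0])
    K = len(kernel)
    out_h, out_w = H - K + 1, W - K + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            for i in range(K):      # the same kernel serves every (y, x): filter reuse
                for j in range(K):
                    # overlapping windows touch the same pixels: image reuse
                    acc += image[y + i][x + j] * kernel[i][j]  # one MAC
            out[y][x] = acc
    return out
```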


  1. Weight Stationary
    1. Filter stays fixed in the PE's register file
  2. Output Stationary
    1. Partial sums stay fixed; image and filter stream through
  3. No Local Reuse
    1. No register file at the PE; all data goes through the global buffer
  4. Row Stationary <– Contribution
    1. Maximize local reuse and accumulation at the RF
    2. Row-based work dispatching: each PE handles a 1D row convolution
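My rough sketch of the row-stationary idea, under the mapping I took away from the talk: each PE keeps one filter row stationary in its register file and slides it along one image row (a 1D convolution), and the partial-sum rows from K PEs are accumulated to form one output row. This is a functional illustration only, not the actual PE-array mapping.

```python
# One PE: 1D convolution of a single filter row over a single image row.
def pe_1d_conv(image_row, filter_row):
    K = len(filter_row)
    return [sum(image_row[x + j] * filter_row[j] for j in range(K))
            for x in range(len(image_row) - K + 1)]

# K PEs produce partial-sum rows that are accumulated elementwise
# (vertically across the PE array) into one output row.
def row_stationary_conv(image, kernel):
    K = len(kernel)
    out = []
    for y in range(len(image) - K + 1):
        psum_rows = [pe_1d_conv(image[y + i], kernel[i]) for i in range(K)]
        out.append([sum(col) for col in zip(*psum_rows)])
    return out
```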

Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory

Memory capacity & memory bandwidth are the bottleneck:
Use PIM in an HMC to address them

Goals: programmable (general purpose), scalable, and energy efficient

PIM in the HMC logic die. (Thermal and energy restrictions in the HMC standard?)


Data is pushed into the PE, and this arrival event triggers the data processing


Design an ISA for neural networks


Decoupling Loads for Nano-Instruction Set Computers

McFarlin '13: 88% of OoO performance comes from scheduling, 12% from dynamism (reacting to dynamic events)

  • Ld: Compute address, access memory & write register, (Ordering)
  • Bne: Specify target, compute condition, transfer control

Multiple-action loads make it difficult for the compiler to reorder loads (due to aliasing concerns)

Decoupled load -> execute the load (fetch into the cache) ASAP; write back to the register in program order (preserves ordering)

Basically, prefetch the data from the cache into the register early?

These loads are held in load tags: loads are prefetched into load tags, and any store to the same physical address updates the matching tags, so consistency is guaranteed
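A toy model of how I understood the load-tag mechanism (the names and structure here are my own, not from the paper): the load half fetches a value into a tag early, stores snoop the tags so the tagged value stays coherent, and the writeback half commits to the register in order. The snooping is what would distinguish this from a plain prefetch.

```python
# Hypothetical sketch of a load-tag buffer; not the paper's design.
class LoadTags:
    def __init__(self):
        self.tags = {}  # tag id -> (address, value)

    def issue_load(self, tag, memory, addr):
        # "Load" half: fetch the value early into a tag.
        self.tags[tag] = (addr, memory[addr])

    def snoop_store(self, memory, addr, value):
        # Stores update memory AND any tag holding the same address,
        # keeping outstanding tagged values coherent.
        memory[addr] = value
        for t, (a, _) in self.tags.items():
            if a == addr:
                self.tags[t] = (a, value)

    def writeback(self, tag):
        # "Writeback" half: commit to the register in program order.
        return self.tags.pop(tag)[1]
```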

I don’t understand the justification that this is not just prefetching!

But still awesome!

Future Vector Microprocessor Extensions for Data Aggregations

Data aggregation: SQL Group By

Reduction over key-value pairs: sum, min, max, average, etc.
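The scalar pattern being vectorized, sketched in Python (my illustration, not from the talk): a group-by reduction is a per-element gather, update, scatter on a hash table keyed by the group key, and that read-modify-write loop is what the proposed vector extensions aim to run many lanes at a time.

```python
# Scalar group-by-sum, the SQL "GROUP BY" aggregation pattern.
def group_by_sum(keys, values):
    table = {}
    for k, v in zip(keys, values):
        # gather table[k], add, scatter back -- one update per element
        table[k] = table.get(k, 0) + v
    return table
```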

(fell asleep ㅡㅡ;;)

Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading

Increase the instruction window in SMT

SMT erodes the benefits of OoO as more threads are added.

Instructions on true data dependence chains take longer regardless, so they can be executed in order.

Add a 'Shelf': a FIFO that holds instructions with true data dependencies; shelf instructions issue in order from that FIFO.

Baseline is OoO-only with a 64-entry ROB; compared against ROB+Shelf (64+64) and ROB (128).
Raw performance is better with the 128-entry ROB, but EDP is better with the 64+64 ROB+Shelf.

Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Add a small amount of compute to the memory controller

Dependent cache miss:

A load misses in the cache and delays the instructions that depend on it; those dependents are themselves loads that also miss, hence "dependent cache miss".

Pointer-dependent workloads (pointer chasing, graph processing, etc.)
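A minimal illustration of the dependent-miss pattern (my own sketch): in a linked-list walk, the address of each load is produced by the previous load, so the misses serialize and cannot overlap.

```python
# Classic pointer chasing: each access depends on the previous one.
class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def chase(head):
    total = 0
    node = head
    while node is not None:
        total += node.value  # this load depends on the previous node.next
        node = node.next     # next address is only known after this load
    return total
```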

On-chip delays are pretty significant (shared interconnect and cache?)

The Enhanced Memory Controller (EMC) executes the dependent instructions.

Dependence-chain generation produces the instructions to be executed on the EMC.

Miss predictor from Qureshi and Loh (MICRO 2012)

Great work!



