Vector Computers - People

CS252 Graduate Computer Architecture
Lecture 9: Prediction (Cont) (Dependencies, Load Values, Data Values)
February 22nd, 2010
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley

Review: Typical Branch History Table
[Figure: k bits of the fetch PC (above the low 00 bits) index a 2^k-entry BHT with n bits per entry, in parallel with the I-cache access; the opcode and offset of the fetched instruction give "Branch?" and the target PC, and the BHT gives Taken/Not Taken]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions

Pipeline considerations for BHT
Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined.
[Figure: UltraSPARC-III fetch pipeline, marking the correctly predicted taken branch penalty and the jump register penalty]
  A: PC Generation/Mux
  P: Instruction Fetch Stage 1
  F: Instruction Fetch Stage 2
  B: Branch Address Calc/Begin Decode
  I: Complete Decode
  J: Steer Instructions to Functional units
  R: Register File Read
  E: Integer Execute
  Remainder of execute pipeline (+ another 6 stages)

Branch Target Buffer
[Figure: k bits of the PC index IMEM and a Branch Target Buffer of 2^k entries, which supplies the predicted target and the BP bits (BPb)]

BP bits are stored with the predicted target address.
IF stage: if (BP = taken) then nPC = target else nPC = PC + 4
Later: check the prediction; if wrong, kill the instruction and update the BTB and BPb, else update BPb only

Address Collisions in BTB
Assume a 128-entry BTB.
  132:  Jump +100   (BTB entry: target 236, BPb take)
  1028: Add .....
What will be fetched after the instruction at 1028?
  BTB prediction = 236
  Correct target = 1032
  => kill PC=236 and fetch PC=1032
Is this a common occurrence? Can we avoid these bubbles?

BTB is only for Control Instructions
BTB contains useful information for branch and jump instructions only
  => Do not update it for other instructions
For all other instructions the next PC is PC+4!
How to achieve this effect without decoding the instruction?

Branch Target Buffer (BTB)
[Figure: the PC indexes the I-cache and a 2^k-entry direct-mapped BTB (can also be associative); each entry holds the entry PC, a valid bit, and the predicted target PC; the entry PC is compared with the fetch PC to produce match, valid, and target]
Keep both the branch PC and target PC in the BTB
PC+4 is fetched if match fails
Only predicted taken branches and jumps held in BTB
Next PC determined before branch fetched and decoded

Consulting BTB Before Decoding
  132:  Jump +100   (BTB entry: entry PC 132, target 236, BPb take)
  1028: Add .....
The match for PC=1028 fails and 1028+4 is fetched
  => eliminates false predictions after ALU instructions
BTB contains entries only for control transfer instructions
  => more room to store branch targets
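
The tagged lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the 128-entry size, and indexing by the low bits of the byte address (chosen so that the slide's PCs 132 and 1028 land in the same slot) are not taken from the lecture.

```python
class BranchTargetBuffer:
    """Direct-mapped, tagged BTB sketch (size and indexing are assumptions)."""

    def __init__(self, entries=128):
        self.entries = entries
        # Each slot holds (entry_pc, target), or None when invalid.
        self.slots = [None] * entries

    def _index(self, pc):
        # Assumption: index with the low bits of the byte address, so that
        # PCs 132 and 1028 from the slide map to the same slot.
        return pc % self.entries

    def next_pc(self, pc):
        """Predict the next fetch PC without decoding the instruction."""
        slot = self.slots[self._index(pc)]
        if slot is not None and slot[0] == pc:  # valid entry and PC match
            return slot[1]                      # predicted taken target
        return pc + 4                           # match fails: fetch PC+4

    def update(self, pc, taken, target):
        """Called when a branch or jump resolves."""
        i = self._index(pc)
        if taken:
            self.slots[i] = (pc, target)
        elif self.slots[i] is not None and self.slots[i][0] == pc:
            # Keep only predicted-taken control transfers in the BTB.
            self.slots[i] = None


btb = BranchTargetBuffer(entries=128)
btb.update(132, taken=True, target=236)  # the Jump +100 at PC 132
print(btb.next_pc(132))    # 236
print(btb.next_pc(1028))   # 1032: same slot, but the entry PC does not match
```

Because the entry PC is stored and compared, the Add at 1028 no longer inherits the jump's target even though both PCs share a slot.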

Combining BTB and BHT
BTB entries are considerably more expensive than BHT entries, but can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR)
BHT can hold many more entries and is more accurate
BHT in a later pipeline stage corrects when the BTB misses a predicted taken branch
[Figure: the fetch pipeline (A, P, F, B, I, J, R, E) annotated with the BTB and the BHT]
  A: PC Generation/Mux
  P: Instruction Fetch Stage 1
  F: Instruction Fetch Stage 2
  B: Branch Address Calc/Begin Decode
  I: Complete Decode
  J: Steer Instructions to Functional units
  R: Register File Read
  E: Integer Execute
BTB/BHT only updated after branch resolves in E stage

Uses of Jump Register (JR)
Switch statements (jump to address of matching case)
  BTB works well if same case used repeatedly
Dynamic function call (jump to run-time function address)
  BTB works well if same function usually called (e.g., in C++ programming, when objects have same type in virtual function call)
Subroutine returns (jump to return address)
  BTB works well if usually return to the same place
  => Often one function called from many distinct call sites!
How well does BTB work for each of these cases?

Subroutine Return Stack
Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.
  fa() { fb(); nexta: }
  fb() { fc(); nextb: }
  fc() { fd(); nextc: }
Push return address when function call executed
Pop return address when subroutine return decoded
[Figure: return stack of k entries holding &nextc, &nextb, &nexta]
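
The push-on-call, pop-on-return behavior can be sketched as below. This is a minimal illustration, not the hardware design: the class and method names are hypothetical, and dropping the oldest entry on overflow is an assumed policy that the slide does not specify.

```python
class ReturnAddressStack:
    """Small fixed-depth return address predictor (illustrative sketch)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        # Push the return address when a function call executes.
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed overflow policy: drop the oldest entry
        self.stack.append(return_address)

    def on_return(self):
        # Pop the predicted return address when a return is decoded.
        # Returns None when the stack is empty (no prediction available).
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=8)
for label in ("nexta", "nextb", "nextc"):  # fa calls fb, fb calls fc, fc calls fd
    ras.on_call(label)
print(ras.on_return(), ras.on_return(), ras.on_return())  # nextc nextb nexta
```

Unlike a BTB entry, the prediction depends on the call site that is currently active, which is why a function called from many places still returns to the right address.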

Typical return stack depth: k = 8 to 16 entries.

Performance: Return Address Predictor
Cache most recent return addresses:
  Call: push a return address on the stack
  Return: pop an address off the stack and predict it as the new PC
[Figure: misprediction frequency (0% to 70%) versus number of return address buffer entries (0, 1, 2, 4, 8, 16) for go, m88ksim, cc1, compress, xlisp, ijpeg, perl, and vortex]

Correlating Branches
Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch

Two possibilities; the current branch depends on:
  The last m most recently executed branches anywhere in the program. Produces a "GA" (for "global adaptive") in the Yeh and Patt classification (e.g. GAg)
  The last m most recent outcomes of the same branch. Produces a "PA" (for "per-address adaptive") in the same classification (e.g. PAg)
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry
  A single history table shared by all branches, indexed by the history value, appends a "g" at the end of the classification.
  When the address is used along with the history to select the table entry, a "p" is appended at the end of the classification.
  If only a portion of the address is used, an "s" is often appended to indicate "set-indexed" tables (i.e. GAs)

Exploiting Spatial Correlation (Yeh and Patt, 1992)
  if (x[i] < 7) then
      y += 1;
  if (x[i] < 5) then
      c -= 4;
If the first condition is false, the second condition is also false
History register, H, records the direction of the last N branches executed by the processor

Correlating Branches
For instance, consider a global-history, set-indexed BHT. That gives us a GAs history table.
(2,2) GAs predictor
  The first 2 means that we keep two bits of history
  The second 2 means that we have 2-bit counters in each slot
  The behavior of recent branches then selects between, say, four predictions of the next branch, updating just that prediction
  Note that the original two-bit counter solution would be a (0,2) GAs predictor
  Note also that aliasing is possible here...
[Figure: the branch address selects a row of 2-bit-per-branch predictors; each slot is a 2-bit counter; the 2-bit global branch history register selects which slot supplies the prediction]

Two-Level Branch Predictor (e.g. GAs)
Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)
[Figure: k bits of the fetch PC index the BHT; a 2-bit global branch history shift register selects one of the four sets; the Taken/Not Taken result of each branch is shifted in]

What are Important Metrics?
Clearly, hit rate matters
  Even 1% can be important when above 90% hit rate
Speed: does this affect cycle time?
Space: clearly total space matters!
  Papers which do not try to normalize across different options are playing fast and loose with data
  Try to get best performance for the cost
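
The (m, n) correlating predictor described earlier (m bits of global history selecting among n-bit counters in the row picked by the branch address) can be sketched as follows. The table size, the weakly-not-taken initial state, and all names are assumptions made for the example; the alternating branch is a made-up workload, not data from the lecture.

```python
class CorrelatingPredictor:
    """(m, n) predictor sketch: m bits of global history, n-bit counters."""

    def __init__(self, index_bits=10, m=2, n=2):
        self.rows = 1 << index_bits
        self.history_mask = (1 << m) - 1
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)  # counter >= threshold predicts taken
        self.history = 0               # global branch history shift register
        # One counter per (address row, history pattern); start weakly not taken.
        self.table = [[self.threshold - 1] * (1 << m) for _ in range(self.rows)]

    def _row(self, pc):
        return (pc >> 2) % self.rows   # set-indexed by part of the address

    def predict(self, pc):
        return self.table[self._row(pc)][self.history] >= self.threshold

    def update(self, pc, taken):
        row = self.table[self._row(pc)]
        count = row[self.history]
        # Update only the counter selected by the current history.
        row[self.history] = min(count + 1, self.max_count) if taken else max(count - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def accuracy(predictor, pc, outcomes):
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)


pattern = [True, False] * 50  # a branch that alternates taken / not taken
two_two = accuracy(CorrelatingPredictor(m=2, n=2), 0x400, pattern)
zero_two = accuracy(CorrelatingPredictor(m=0, n=2), 0x400, pattern)
print(two_two, zero_two)
```

With m = 0 the history mask is zero and the structure degenerates to the plain two-bit counter table, matching the (0,2) remark; on this alternating pattern the plain counter keeps flipping while the (2,2) version learns both phases.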

Accuracy of Different Schemes
[Figure: frequency of mispredictions (0% to 18%) on nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li for three predictors: 4,096 entries at 2 bits per entry, unlimited entries at 2 bits per entry, and 1,024 entries (2,2)]

BHT Accuracy
Mispredict because either:
  Wrong guess for that branch
  Got branch history of wrong branch when indexing the table
With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
  For SPEC92, 4096 entries are about as good as an infinite table
How could HW predict "this loop will execute 3 times" using a simple mechanism?
  Need to track history of just that branch
  For given pattern, track most likely following branch direction
Leads to two separate types of recent history tracking:
  GBHR (Global Branch History Register)
  PABHR (Per Address Branch History Table)
Two separate types of pattern tracking:
  GPHT (Global Pattern History Table)
  PAPHT (Per Address Pattern History Table)

Yeh and Patt classification
[Figure: GBHR with GPHT gives GAg; PABHR with GPHT gives PAg; PABHR with PAPHT gives PAp]
GAg: Global History Register, Global History Table
PAg: Per-Address History Register, Global History Table
PAp: Per-Address History Register, Per-Address History Table

Two-Level Adaptive Schemes: History Registers of Same Length (6 bits)
PAp best: but uses a lot more state!
GAg not effective with 6-bit history registers
  Every branch updates the same history register => interference
PAg performs better because it has a branch history table

Versions with Roughly same accuracy (97%)
Cost:
  GAg requires 18-bit history register
  PAg requires 12-bit history register
  PAp requires 6-bit history register
PAg is the cheapest among these

Why doesn't GAg do better?
Difference between GAg and both PA variants:
  GAg tracks correlations between different branches
  PAg/PAp track correlations between different instances of the same branch
These are two different types of pattern tracking
  Among other things, GAg is good for branches in straight-line code, while the PA variants are good for loops
Problem with GAg? It aliases results from different branches into the same table
  The issue is that different branches may see the same global pattern and resolve it differently
  GAg doesn't leave flexibility to do this
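
To see why PAg comes out cheapest among the equal-accuracy configurations listed earlier (18-bit GAg, 12-bit PAg, 6-bit PAp), the storage can be estimated with a short calculation. The 512-entry per-address history table and the 2-bit pattern counters are illustrative assumptions, not figures from the slides, and tag storage is ignored.

```python
def gag_bits(k, counter_bits=2):
    # One k-bit global history register plus one global pattern table.
    return k + (1 << k) * counter_bits


def pag_bits(k, branches, counter_bits=2):
    # One k-bit history register per tracked branch, one shared pattern table.
    return branches * k + (1 << k) * counter_bits


def pap_bits(k, branches, counter_bits=2):
    # One k-bit history register and one private pattern table per branch.
    return branches * k + branches * (1 << k) * counter_bits


BRANCHES = 512  # assumed number of per-address history entries

costs = {
    "GAg": gag_bits(18),            # 524306 bits
    "PAg": pag_bits(12, BRANCHES),  # 14336 bits
    "PAp": pap_bits(6, BRANCHES),   # 68608 bits
}
for name, bits in costs.items():
    print(name, bits)
print("cheapest:", min(costs, key=costs.get))  # PAg
```

The pattern table grows as 2^k, so the 18-bit global history dominates GAg's cost, while PAp pays for a private pattern table per branch even with short histories.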

Other Global Variants: Try to Avoid Aliasing
[Figure: GAs (GBHR and address selecting into a PAPHT) and GShare (GBHR and address combined to index a GPHT)]
GAs: Global History Register, Per-Address (Set Associative) History Table
Gshare: Global History Register, Global History Table with a simple attempt at anti-aliasing

Branches are Highly Biased
From: "A Comparative Analysis of Schemes for Correlated Branch Prediction," by Cliff Young, Nicolas Gloy, and Michael D. Smith
Many branches are highly biased to be taken or not taken
  Path history can be used to further bias branch behavior
Can we exploit bias to better predict the unbiased branches?
  Yes: filter out biased branches to save prediction resources for the unbiased ones

Exploiting Bias to avoid Aliasing: Bimode and YAGS
[Figure: BiMode and YAGS block diagrams; each combines the address and the history to select a prediction (Pred), and YAGS adds TAG comparisons]

Is Global or Local better?
Neither: some branches are local, some global
  From: "An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work," Evers, Patel, Chappell, Patt
Difference in predictability quite significant for some branches!
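
For the gshare variant mentioned above, the usual formulation XORs the global history with bits of the branch address to index a single table of 2-bit counters, so two branches that see the same global pattern tend to land on different counters. The sketch below assumes a 12-bit index and hypothetical names; it is not taken from the lecture.

```python
class Gshare:
    """gshare sketch: global history XOR branch address indexes one table."""

    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.history = 0
        self.counters = [1] * (1 << index_bits)  # 2-bit counters, weakly not taken

    def _index(self, pc):
        # XOR spreads branches with identical history over different counters.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask


g = Gshare()
g.history = 0b1010
# In GAg both branches would use entry 0b1010; here they use different entries.
print(g._index(0x1000), g._index(0x1004))
```

Aliasing is reduced rather than eliminated: two (address, history) pairs can still XOR to the same index.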

Dynamically finding structure in Spaghetti
Consider complex "spaghetti" code
Are all branches likely to need the same type of branch prediction?
  No.
What to do about it?
  How about predicting which predictor will be best?
  Called a "Tournament predictor"

Tournament Predictors
Motivation for correlating branch predictors: the 2-bit predictor failed on important branches; adding global information improved performance
Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine them with a selector
Use the predictor that tends to guess correctly
[Figure: the address and the history feed Predictor A and Predictor B, and a selector chooses between them]

Tournament Predictor in Alpha 21264
1. 4K 2-bit counters to choose from among a global predictor and a local predictor
2. Global predictor (GAg): 4K entries, indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
   12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
3. Local predictor consists of a 2-level predictor (PAg):
   Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
   Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)

[Figure: % of predictions from the local predictor in the tournament scheme, per benchmark, on a 0% to 100% axis]
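
The three structures of the Alpha 21264 description above can be sketched with the table sizes from the slide. Several details are assumptions made for the example: the chooser is indexed by the global history, it is trained only when the two components disagree, and the initial counter values and the loop-like test pattern are arbitrary.

```python
def bump(counter, up, max_value):
    """Saturating increment or decrement."""
    return min(counter + 1, max_value) if up else max(counter - 1, 0)


class Tournament:
    def __init__(self):
        self.ghist = 0                 # last 12 branch outcomes
        self.global_ctr = [1] * 4096   # 2-bit counters indexed by global history
        self.chooser = [1] * 4096      # 2-bit counters; >= 2 means "use global"
        self.local_hist = [0] * 1024   # 10-bit history per entry
        self.local_ctr = [3] * 1024    # 3-bit counters indexed by local history

    def _components(self, pc):
        li = (pc >> 2) % 1024
        g = self.global_ctr[self.ghist] >= 2
        l = self.local_ctr[self.local_hist[li]] >= 4
        return li, g, l

    def predict(self, pc):
        _, g, l = self._components(pc)
        return g if self.chooser[self.ghist] >= 2 else l

    def update(self, pc, taken):
        li, g, l = self._components(pc)
        if g != l:  # assumed policy: train the chooser only on disagreement
            self.chooser[self.ghist] = bump(self.chooser[self.ghist], g == taken, 3)
        self.global_ctr[self.ghist] = bump(self.global_ctr[self.ghist], taken, 3)
        lh = self.local_hist[li]
        self.local_ctr[lh] = bump(self.local_ctr[lh], taken, 7)
        self.local_hist[li] = ((lh << 1) | int(taken)) & 0x3FF
        self.ghist = ((self.ghist << 1) | int(taken)) & 0xFFF


# Storage check against the slide's total: chooser + global + local history + local counters.
total_bits = 4096 * 2 + 4096 * 2 + 1024 * 10 + 1024 * 3
print(total_bits, total_bits // 1024)  # 29696 bits, i.e. 29K bits

t = Tournament()
outcomes = [True, True, True, False] * 100  # a loop branch taken 3 times, then not
hits = []
for taken in outcomes:
    hits.append(t.predict(0x400) == taken)
    t.update(0x400, taken)
print(sum(hits[-100:]))  # correct predictions among the last 100
```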

Values read from the chart of predictions taken from the local predictor: nasa7 98%, matrix300 100%, tomcatv 94%, doduc 90%, spice 55%, fpppp 76%, gcc 72%, espresso 63%, eqntott 37%, li 69%

Accuracy of Branch Prediction
[Figure (fig 3.40): branch prediction accuracy (0% to 100%) of profile-based, 2-bit counter, and tournament predictors on tomcatv, doduc, fpppp, li, espresso, and gcc]
Profile: branch profile from last execution (static in that it is encoded in the instruction, but derived from a profile)

Accuracy v. Size (SPEC89)
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for local, correlating, and tournament predictors]

Review: Memory Disambiguation
Question: Given a load that follows a store in program order, are the two related?
  Trying to detect RAW hazards through memory
  Stores commit in order (ROB), so no WAR/WAW memory hazards.
Implementation
  Keep queue of stores, in program order
  Watch for position of new loads relative to existing stores
  Typically, this is a different buffer than ROB!
    Could be ROB (has right properties), but too expensive
When have address for load, check store queue:
  If any store prior to load is waiting for its address?????
  If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard:
    store value available => return value
    store value not available => return ROB number of source
  Otherwise, send out request to memory
Will relax exact dependency checking in later lecture

In-Order Memory Queue
Execute all loads and stores in program order
  => Load and store cannot leave ROB for execution until all previous loads and stores have completed execution
Can still execute loads and stores speculatively, and out-of-order with respect to other instructions

Conservative O-o-O Load Execution
  st r1, (r2)
  ld r3, (r4)
Split execution of store instruction into two phases: address calculation and data write
Can execute load before store, if addresses known and r4 != r2
Each load address compared with addresses of all previous uncommitted stores (can use partial conservative check, i.e., bottom 12 bits of address)
Don't execute load if any previous store address not known
(MIPS R10K, 16 entry address queue)

Address Speculation
  st r1, (r2)
  ld r3, (r4)
Guess that r4 != r2
Execute load before store address known
Need to hold all completed but uncommitted load/store addresses in program order
If subsequently find r4==r2, squash load and all following instructions
  => Large penalty for inaccurate address speculation
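
The conservative store-queue check (stall on any unknown older store address, otherwise forward from a matching store or go to memory) can be sketched as below. The record layout, the function name, and the returned action strings are hypothetical; letting the youngest matching older store win follows the usual forwarding rule and is stated here as an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    address: Optional[int]  # None until the address calculation phase completes
    data: Optional[int]     # None until the store data is available
    rob_tag: int            # hypothetical ROB number of the store


def check_load(load_address, older_stores):
    """Decide what a load may do, given older uncommitted stores (oldest first)."""
    # Conservative rule: do not execute if any previous store address is unknown.
    for store in older_stores:
        if store.address is None:
            return ("stall", None)
    # Associative lookup; the youngest matching older store supplies the value.
    for store in reversed(older_stores):
        if store.address == load_address:
            if store.data is not None:
                return ("forward", store.data)  # store value available
            return ("wait_for", store.rob_tag)  # value not ready: ROB number
    return ("memory", None)                     # no match: send request to memory


queue = [StoreEntry(0x40, 5, rob_tag=1), StoreEntry(0x40, 9, rob_tag=2)]
print(check_load(0x40, queue))  # ('forward', 9)
print(check_load(0x80, queue))  # ('memory', None)
```

Address speculation replaces the first loop's stall with a guess that the unknown address differs, at the cost of a squash when the guess turns out wrong.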

Memory Dependence Prediction (Alpha 21264)
  st r1, (r2)
  ld r3, (r4)
Guess that r4 != r2 and execute load before store
If later find r4==r2, squash load and all following instructions, but mark load instruction as store-wait
Subsequent executions of the same load instruction will wait for all previous stores to complete
Periodically clear store-wait bits

Speculative Loads / Stores
Just like register updates, stores should not modify the memory until after the instruction is committed
  - A speculative store buffer is a structure introduced to hold speculative store data.

Speculative Store Buffer
[Figure: speculative store buffer whose entries each hold a valid bit (V), a speculative bit (S), a tag, and data; the load address is checked against the buffer tags and the L1 data cache tags; committed stores drain to the L1 data cache over the store commit path; load data comes from the buffer or the cache]
On store execute: mark entry valid and speculative, and save data and tag of instruction.
On store commit: clear speculative bit and eventually move data to cache
On store abort: clear valid bit
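
The three store-buffer events above can be sketched as follows. This is only an illustration: the dictionary standing in for the L1 data cache, the assumption that instruction tags increase in program order, and all names are hypothetical. The load lookup uses the rule given on the following slide (prefer the buffer over the cache, and take the youngest store older than the load).

```python
class SpeculativeStoreBuffer:
    def __init__(self):
        self.entries = []  # buffer entries in program order, oldest first
        self.cache = {}    # stand-in for the L1 data cache (address -> data)

    def store_execute(self, tag, address, data):
        # Mark the entry valid and speculative; save the data and instruction tag.
        self.entries.append(
            {"valid": True, "spec": True, "tag": tag, "addr": address, "data": data})

    def store_commit(self, tag):
        for e in self.entries:
            if e["tag"] == tag and e["valid"]:
                e["spec"] = False  # clear the speculative bit

    def store_abort(self, tag):
        for e in self.entries:
            if e["tag"] == tag:
                e["valid"] = False  # clear the valid bit

    def drain(self):
        # Eventually move committed data to the cache, oldest entry first.
        while self.entries and not (self.entries[0]["valid"] and self.entries[0]["spec"]):
            e = self.entries.pop(0)
            if e["valid"]:
                self.cache[e["addr"]] = e["data"]

    def load(self, address, load_tag):
        # Assumes tags increase in program order: youngest store older than the load.
        for e in reversed(self.entries):
            if e["valid"] and e["addr"] == address and e["tag"] < load_tag:
                return e["data"]
        return self.cache.get(address)


sb = SpeculativeStoreBuffer()
sb.cache[100] = 1
sb.store_execute(tag=1, address=100, data=5)
sb.store_execute(tag=3, address=100, data=7)
print(sb.load(100, load_tag=2))  # 5: only the store with tag 1 is older
print(sb.load(100, load_tag=4))  # 7: youngest older store wins
```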

Speculative Store Buffer
[Figure: the same speculative store buffer (V, S, tag, and data per entry) alongside the L1 data cache, with the load address, store commit path, and load data]
If data in both store buffer and cache, which should we use:
  Speculative store buffer
If same address in store buffer twice, which should we use:
  Youngest store older than load

Memory Dependence Prediction
Important to speculate? Two extremes:
  Naive Speculation: always let load go forward
  No Speculation: always wait for dependencies to be resolved
Compare Naive Speculation to No Speculation
  False Dependency: wait when don't have to
  Order Violation: result of speculating incorrectly
Goal of prediction:
  Avoid false dependencies and order violations
From "Memory Dependence Prediction using Store Sets", Chrysos and Emer.

Said another way: Could we do better?
Results from same paper: performance improvement with oracle predictor
  We can get significantly better performance if we find a good predictor
  Question: How to build a good predictor?

Premise: Past indicates Future
Basic premise is that past dependencies indicate future dependencies
  Not always true! Hopefully true most of time
Store Set: set of store insts that affect given load
Example:
  Addr  Inst      Store set
  0     Store C
  4     Store A
  8     Store B
  12    Store C
  28    Load B    { PC 8 }
  32    Load D    { (null) }
  36    Load C    { PC 0, PC 12 }
  40    Load B    { PC 8 }
Idea: store set for load starts empty. If ever the load goes forward and this causes a violation, add the offending store to the load's store set
Approach: for each indeterminate load:
  If a store from its store set is in the pipeline, stall
  Else let it go forward
Does this work?

How well does an infinite tracking work?
"Infinite" here means to place no limits on:
  Number of store sets
  Number of stores in given set
Seems to do pretty well
  Note: "Not Predicted" means load had empty store set
  Only Applu and Xlisp seem to have false dependencies

How to track Store Sets in reality?
SSIT (Store Set Identifier Table): assigns loads and stores to a Store Set ID (SSID)
  Notice that this requires each store to be in only one store set!
LFST (Last Fetched Store Table): maps SSIDs to the most recently fetched store
  When a load is fetched, allows it to find the most recent store in its store set that is executing (if any) => allows stalling until the store has finished
  When a store is fetched, allows it to wait for the previous store in its store set
    Pretty much same type of ordering as enforced by ROB anyway
    Transitivity => loads end up waiting for all active stores in store set
What if a store needs to be in two store sets?
  Allow store sets to be merged together deterministically
    Two loads, multiple stores get same SSID
Want periodic clearing of SSIT to avoid:
  Problems with aliasing across program
  Out of control merging
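
The SSIT/LFST bookkeeping described above can be sketched with two dictionaries. This is a simplified illustration, not the mechanism from the paper: real tables are finite and indexed by PC bits, whereas this sketch uses unbounded dictionaries, identifies an in-flight store by its PC, and merges two sets by keeping the smaller SSID (one deterministic choice among several possible).

```python
class StoreSetPredictor:
    def __init__(self):
        self.ssit = {}      # instruction PC -> store set ID (SSID)
        self.lfst = {}      # SSID -> most recently fetched in-flight store
        self.next_ssid = 0

    def fetch_store(self, store_pc):
        """Return the previous in-flight store of the same set to wait for, if any."""
        ssid = self.ssit.get(store_pc)
        if ssid is None:
            return None
        previous = self.lfst.get(ssid)
        self.lfst[ssid] = store_pc
        return previous

    def fetch_load(self, load_pc):
        """Return the store this load should wait for, or None to go forward."""
        ssid = self.ssit.get(load_pc)
        return None if ssid is None else self.lfst.get(ssid)

    def store_done(self, store_pc):
        ssid = self.ssit.get(store_pc)
        if ssid is not None and self.lfst.get(ssid) == store_pc:
            del self.lfst[ssid]

    def violation(self, load_pc, store_pc):
        """A load went ahead of a store it depended on: put both in one set."""
        l, s = self.ssit.get(load_pc), self.ssit.get(store_pc)
        if l is None and s is None:
            l = s = self.next_ssid
            self.next_ssid += 1
        elif l is None:
            l = s
        elif s is None:
            s = l
        elif l != s:
            winner, loser = min(l, s), max(l, s)  # deterministic merge
            for pc, ssid in list(self.ssit.items()):
                if ssid == loser:
                    self.ssit[pc] = winner
            if loser in self.lfst:
                self.lfst.setdefault(winner, self.lfst.pop(loser))
            l = s = winner
        self.ssit[load_pc] = l
        self.ssit[store_pc] = s

    def clear(self):
        # Periodic clearing limits aliasing and runaway merging.
        self.ssit.clear()
        self.lfst.clear()


p = StoreSetPredictor()
p.violation(load_pc=36, store_pc=0)   # Load C learns about the store at PC 0
p.violation(load_pc=36, store_pc=12)  # ... and about the store at PC 12
p.violation(load_pc=28, store_pc=8)   # Load B learns about the store at PC 8
for pc in (0, 4, 8, 12):
    p.fetch_store(pc)
print(p.fetch_load(28), p.fetch_load(32), p.fetch_load(36))  # 8 None 12
```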

How well does this do?
Comparison against Store Barrier Cache
  Marks individual stores as "tending to cause memory violations"
  Not specific to particular loads.
Problem with APPLU?
  Analyzed in paper: has complex 3-level inner loop in which loads occasionally depend on stores
  Forces overly conservative stalls (i.e. false dependencies)

Load Value Predictability
Try to predict the result of a load before going to memory
Paper: "Value locality and load value prediction"
  Mikko H. Lipasti, Christopher B. Wilkerson and John Paul Shen
Notion of value locality
  Fraction of instances of a given load that match last n different values
Is there any value locality in typical programs?
  Yes!
  With history depth of 1: most integer programs show over 50% repetition
  With history depth of 16: most integer programs show over 80% repetition
  Not everything does well: see cjpeg, swm256, and tomcatv
Locality varies by type:
  Quite high for inst/data addresses
  Reasonable for integer values
  Not as high for FP values

Load Value Prediction Table
[Figure: the instruction address indexes the LVPT, which produces the prediction; results feed back into the table]
Load Value Prediction Table (LVPT)
  Untagged, direct mapped
  Takes instructions => predicted data
Contains history of last n unique values from given instruction
  Can contain aliases, since untagged
How to predict?
  When n=1, easy
  When n=16? Use oracle
Is every load predictable?
  No! Why not?
  Must identify predictable loads somehow

Load Classification Table (LCT)
[Figure: the instruction address indexes the LCT, which outputs "Predictable?"; corrections feed back into the table]
Load Classification Table (LCT)
  Untagged, direct mapped
  Takes instructions => single bit of whether or not to predict
How to implement?
  Uses saturating counters (2 or 1 bit)
  When prediction correct, increment
  When prediction incorrect, decrement
With 2-bit counter
  0, 1: not predictable
  2: predictable
  3: constant (very predictable)
With 1-bit counter
  0: not predictable
  1: constant (very predictable)
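
A depth-1 LVPT paired with a 2-bit LCT can be sketched as follows. The table size, the index function, the initial values, and the returned labels are assumptions for illustration; the counter states follow the 2-bit classification listed above.

```python
class LoadValuePredictor:
    """Untagged, direct-mapped LVPT (history depth 1) with a 2-bit LCT."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.lvpt = [0] * entries  # last value seen by loads mapping to each slot
        self.lct = [0] * entries   # 2-bit saturating counters

    def _index(self, pc):
        return (pc >> 2) % self.entries  # untagged, so different loads may alias

    def predict(self, pc):
        i = self._index(pc)
        state = self.lct[i]
        if state < 2:
            return ("no_prediction", None)  # states 0 and 1: not predictable
        kind = "constant" if state == 3 else "predict"
        return (kind, self.lvpt[i])

    def update(self, pc, actual_value):
        i = self._index(pc)
        if self.lvpt[i] == actual_value:
            self.lct[i] = min(self.lct[i] + 1, 3)  # prediction would be correct
        else:
            self.lct[i] = max(self.lct[i] - 1, 0)  # prediction would be incorrect
        self.lvpt[i] = actual_value


lvp = LoadValuePredictor()
for _ in range(3):
    lvp.update(0x1000, 42)
print(lvp.predict(0x1000))  # ('predict', 42)
lvp.update(0x1000, 42)
print(lvp.predict(0x1000))  # ('constant', 42)
```

Keeping the classification in a separate table means the decision to predict at all is made independently of how the value itself is predicted.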

Accuracy of LCT
Question of accuracy is about how well we avoid:
  Predicting unpredictable loads
  Not predicting predictable loads
How well does this work?
  Difference between "Simple" and "Limit": history depth
    Simple: depth 1
    Limit: depth 16
  Limit tends to classify more things as predictable (since this works more often)
Basic principle:
  It often works better to have one structure decide on basic predictability
  Independent of the prediction structure

Constant Value Unit
Idea: identify a load instruction as "constant"
  Can ignore cache lookup (no verification)
  Must enforce by monitoring result of stores to remove "constant" status
How well does this work?
  Seems to identify 6-18% of loads as constant
  Must be unchanging enough to cause LCT to classify as constant

Load Value Architecture
LCT/LVPT in fetch stage
CVU in execute stage
  Used to bypass cache entirely
  (Know that result is good)
Results: some speedups
  21264 seems to do better than PowerPC
  Authors think this is because the small first-level cache and in-order execution make the CVU more useful

Conclusion
Correlation: recently executed branches are correlated with the next branch.
  Either different branches (GA)
  Or different executions of the same branch (PA).
Two-Level Branch Prediction
  Uses complex history (either global or local) to predict next branch
  Two tables: a history table and a pattern table
  Global predictors: GAg, GAs, GShare, Bimode, YAGS
  Local predictors: PAg, PAp, PAs
Dependence Prediction: try to predict whether load depends on stores before addresses are known
  Store set: set of stores that have had dependencies with load in past
Last Value Prediction
  Predict that the value of a load will be similar to (the same as?) the previous value
  Works better than one might expect
