Active Research Areas and Recent Developments
Alex Deng, Pavel Dmitriev, Somit Gupta, Ronny Kohavi, Paul Raff, Lukas Vermeer

Challenges and Active Research Areas
- Metric Sensitivity

- Problems with NHST and p-values
- Continuous decision making (time permitting)
- Beyond the average treatment effect, a.k.a. effect heterogeneity
- Time-varying effects and experiment duration
- Leakage/violation of SUTVA (stable unit treatment value assumption) and network interference

Metric Sensitivity
Metric sensitivity is not just about statistical power:
P(Detect a True Movement) = P(Detect | True Movement) * P(True Movement)
- Statistical power quantifies P(Detect | True Movement)
- P(True Movement) reflects how readily the metric responds to experiments
- When a metric lacks sensitivity, which part is the problem?

Case 1: Statistical Power
- Increase traffic / run larger experiments (limited by capacity, i.e. the total user base)
- Surprise! Running the experiment longer won't always help
- Variance reduction (CUPED): improves the statistical test; same metric, a pure power increase
- Change the estimand (what you estimate) from the average treatment effect to something else; interpret with caution:
  - Transformation and capping of highly skewed metrics

  - Non-parametric tests such as the Wilcoxon rank-sum / Mann-Whitney U test; hard to interpret (it is not comparing medians)

Challenge: Running Experiments Longer Won't Always Help
- T-stat = (observed delta) / sqrt(Var(delta)): reduce Var(delta) to improve sensitivity without changing the definition of the metric
- Var(delta) can be reduced by increasing the sample size n
- But running experiments longer CAN change the distribution of the metric!
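A minimal sketch of the two-sample t-statistic above, assuming independent samples; the metric values and lift are hypothetical numbers, not from the tutorial.

```python
import numpy as np

def t_stat(treatment, control):
    """Two-sample t-statistic: observed delta over its standard error."""
    delta = treatment.mean() - control.mean()
    se = np.sqrt(treatment.var(ddof=1) / len(treatment)
                 + control.var(ddof=1) / len(control))
    return delta / se

rng = np.random.default_rng(0)
# Hypothetical noisy metric with a small true lift of +0.1.
small_n = t_stat(rng.normal(10.1, 5, 1_000), rng.normal(10.0, 5, 1_000))
large_n = t_stat(rng.normal(10.1, 5, 16_000), rng.normal(10.0, 5, 16_000))
# A 16x larger sample shrinks the standard error 4x, so the same true
# lift produces a larger t-statistic on average.
print(small_n, large_n)
```

This is why variance reduction helps: anything that shrinks the denominator without biasing the numerator makes the same movement easier to detect.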

Variance Reduction
Find another estimator Yhat such that
1. E[Yhat] = E[Y], so both are estimating the same average treatment effect
2. Var(Yhat) < Var(Y), so a test based on Yhat is more sensitive
Motivation: baseline adjustment in a mixture model

total variance = between-group variance + within-group variance
- Because of randomization, the proportion of heavy users vs. light users (X) might be slightly different between treatment and control
- If treatment happens to have more heavy users (baseline), it will likely show higher revenue/user
- Intuitively, we need to adjust away the variance explained by X

CUPED (WSDM 2013)

- Define Ycuped = Y - theta*X; the treatment-control delta of Ycuped is unbiased for the delta of Y AS LONG AS E[X] is the same in treatment and control!
- We call such an X a COVARIATE or baseline. Intuitively, X are things that are not affected by the treatment
- Anything we know at pre-experiment time, or at pre-treatment triggering time, can be considered as X
- What is theta? Pick the one that minimizes the variance of Ycuped: theta = Cov(X, Y) / Var(X)
- We call it Controlled-experiment Using Pre-Experiment Data (CUPED)

Extension
- CUPED is a linear adjustment (we found it often good enough)
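A minimal sketch of CUPED using the pre-experiment value of the same metric as the covariate. The data-generating numbers are hypothetical; the theta formula is the one above.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
# Hypothetical data: pre-experiment metric x, in-experiment metric y.
x = rng.gamma(2.0, 5.0, n)                       # pre-period revenue/user
treat = rng.integers(0, 2, n).astype(bool)
y = 0.8 * x + rng.normal(0, 3, n) + 0.5 * treat  # true lift = 0.5

theta = np.cov(x, y)[0, 1] / x.var()   # theta* = Cov(X, Y) / Var(X)
y_cuped = y - theta * (x - x.mean())   # same mean, lower variance

naive = y[treat].mean() - y[~treat].mean()
cuped = y_cuped[treat].mean() - y_cuped[~treat].mean()
print(f"naive delta {naive:.3f}, CUPED delta {cuped:.3f}")
print(f"variance ratio {y_cuped.var() / y.var():.2f}")  # well below 1
```

Both estimators target the same lift; the CUPED metric simply has much smaller variance, which is the pure power increase the slide describes.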

- A better adjustment is to find the optimal adjustment of the form Y - f(X)
- Any regression method, e.g. boosting or random forests, can be used to learn f
- Any f can be used and the estimate is still unbiased, but a better regression fit means more variance reduction

Case 2: P(True Movement) Is Low
- A perfectly designed metric is not actionable if you cannot move it
- What should I do? Re-engineer your metric
- You need a different OEC at different stages of your product:
  - DAU (daily active users) is easy to move for a new product, but harder for mature sites
  - Session Success Rate and Time To First Success move a lot more than Sessions/UU

- Define a new surrogate metric as a function of metrics with higher P(Movement) and reasonable statistical power
- Calibrate the functional form so that the surrogate metric aligns with your OEC
- Challenge: naive linear regression might be misleading

Combo Metrics
- Let X be the set of surrogate metrics, and Y be the target metric
- We observe many pairs of estimated movements (DeltaX, DeltaY); what we really want to observe are pairs of the true underlying movements

- Regression of DeltaY on DeltaX to find out the relationship between the two?
- For illustration, say Y is Sessions/UU and X is Revenue/UU, and we have 100 results from A/A tests (no true movement in either metric): what is the correlation of the observed DeltaX and DeltaY?
- Sessions/UU and Revenue/UU display positive correlation even in A/A tests!

For audiences willing to see the math

- The problem is that the observed DeltaX and the error term are correlated! Regression will be biased

Combo Metrics (continued)
- Peysakhovich and Eckles (2017) treat the problem as a weak-instrument problem in econometrics, and proposed using L0 regularization
- Essentially, if an observed movement is not large enough, treat it as 0
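The A/A observation above can be checked with a quick simulation; the user-level correlation and sample sizes here are hypothetical. With zero true effect on both metrics, the observed deltas still correlate, which is exactly the shared noise a naive regression mistakes for signal.

```python
import numpy as np

rng = np.random.default_rng(42)

def aa_deltas(n_users=2_000):
    """One simulated A/A test: correlated user-level metrics, random split,
    identical treatment and control (zero true effect on both metrics)."""
    cov = [[1.0, 0.6], [0.6, 1.0]]   # user-level correlation of 0.6
    x, y = rng.multivariate_normal([0.0, 0.0], cov, n_users).T
    split = rng.integers(0, 2, n_users).astype(bool)
    return (x[split].mean() - x[~split].mean(),
            y[split].mean() - y[~split].mean())

deltas = np.array([aa_deltas() for _ in range(500)])
corr = np.corrcoef(deltas[:, 0], deltas[:, 1])[0, 1]
print(f"correlation of observed deltas across A/A tests: {corr:.2f}")
```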

- They devise a cross-validation scheme to choose the shrinkage threshold
- Kharitonov et al. (2017) frame the whole problem as an optimization problem using empirical data:
  - Three classes of historical results: E (detected effect), U (uncertain), C (known A/A)
  - Search for linear combinations that maximize test scores in E and U and minimize test scores in C (with some weights)
  - The key is to decompose the covariance matrix into an effect component and a noise component; the effect component is unknown because we never observe the true treatment effect

Problems with NHST and p-values
- Many published research findings were found not to be reproducible; notable/surprising results even more so
- Many results with small p-values fail to replicate: Twyman's law
- Hard to interpret correctly; a common mistake is to interpret the p-value as P(H0 | Data):
  "This finding did not reach statistical significance (p=0.054), but it indicates a 94.6% probability that statins were responsible for the symptoms" --- an article on adverse effects of statins published in JAMA
- P-value hacking
- Unable to accept the null: if the desired result is not to reject the null, just run a small, under-powered experiment

P(H0|Data), not P(Data|H0)
- P(H1|Data) is the Bayesian posterior belief in the alternative hypothesis; it is closely related to the concept of FDR (false discovery rate)
- P(H1|Data) = 1 - P(H0|Data) represents the confidence of a correct ship decision
- P(H0|Data) and P(H1|Data) are automatically adjusted for multiple testing if metric movements are independent (why should you care about hundreds of other unrelated tests?)

Bayesian Two-Sample Hypothesis Testing: Full Symmetry!
1. H0 and H1, with prior odds P(H1) / P(H0)
2. Given observations, likelihood ratio (Bayes factor) K = P(Data | H1) / P(Data | H0)
3. Bayes rule: posterior odds = prior odds * K
4. P(H1 | Data) = posterior odds / (1 + posterior odds)
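The four-step recipe above is simple arithmetic once the prior and the Bayes factor are in hand; the numbers below are purely illustrative.

```python
def posterior_h1(prior_h1, bayes_factor):
    """Posterior P(H1 | Data) from the prior P(H1) and the Bayes factor
    K = P(Data | H1) / P(Data | H0)."""
    prior_odds = prior_h1 / (1.0 - prior_h1)       # step 1
    posterior_odds = prior_odds * bayes_factor     # steps 2-3
    return posterior_odds / (1.0 + posterior_odds)  # step 4

# Illustrative numbers: suppose most experiments are flat (P(H1) = 0.3)
# and the data favors H1 by a factor of 10.
print(posterior_h1(0.3, 10.0))  # ~0.81
```

Note how a Bayes factor of 10 (which roughly corresponds to a small p-value) yields only ~81% confidence in H1 when the prior says most experiments are flat; this is the gap between P(Data|H0) and P(H0|Data).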

But we don't know P(H0) and P(H1)
- We don't even know P(Data | H1), because we don't know the distribution of effects under H1
- Solution: use historical experiment data to estimate P(H0) and also the distribution of effects under H1

- Cold-start problem: what if we don't have rich historical data?
- How do we know whether historical experiments are similar to the current one we are testing?

Using Rich Historical Experiment Data

Continuous Decision Making

- Traditional tests assume a fixed horizon: the test statistic T is a function of the sufficient statistics at that horizon
- Deng et al. (2017) proved that when data is observed sequentially and we stop data gathering at a random time tau, the functional form of the Bayesian test remains the same!
- This result is widely expected, but not trivial to prove
- The random time must be a proper stopping time: the decision to stop the experiment must be made using only observations up to and including tau. No peeking ahead!
- We cannot cheat by observing until time T > t and then pretending we stopped the experiment at time t
- This result holds only for Bayesian hypothesis testing; it is not true for NHST
- Justifies bandits using Thompson sampling
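The slide notes that this optional-stopping guarantee does not carry over to NHST. A quick simulation (all parameters hypothetical) shows why: checking a fixed-horizon z-test at several interim looks inflates the false positive rate of an A/A experiment well above the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(1)

def peeks_reject(n_per_look=200, looks=10, z_crit=1.96):
    """One A/A experiment (known unit variance), running a z-test at each
    of several interim looks; returns True if ANY look is 'significant'."""
    t = rng.normal(0, 1, n_per_look * looks)
    c = rng.normal(0, 1, n_per_look * looks)
    for k in range(1, looks + 1):
        n = k * n_per_look
        z = (t[:n].mean() - c[:n].mean()) / np.sqrt(2.0 / n)
        if abs(z) > z_crit:
            return True
    return False

rate = np.mean([peeks_reject() for _ in range(2_000)])
print(f"false positive rate with 10 peeks: {rate:.3f}")  # well above 0.05
```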

Beyond Average Treatment Effect
- When we say "treatment effect", in most cases we refer to the Population Average Treatment Effect (ATE or PATE)
- We know the treatment effect differs from unit to unit:
  - A feature might not be popular in some markets -> room for improvement
  - A feature might be broken on one browser -> bug
- There can be many micro-structures in subpopulations, where the treatment effect varies, or even flips sign!

- Heterogeneous Treatment Effect (HTE): a hot topic in economics/policy evaluation, personalized/precision treatment/drugs, etc.
- A hot research intersection of causal inference and machine learning!

[Figure: Browser difference]

[Figure: Weekend vs. weekday]

[Figure: Shift]

CATE
- Potential outcomes Y(1), Y(0) with covariates X and assignment T
- Interested in predicting the individual treatment effect (ITE): Y(1) - Y(0) for a given unit; it is unknown, since we only ever observe one of the two outcomes
- The best prediction is E[Y(1) - Y(0) | X = x], i.e. the regression tau(x), a.k.a. the conditional average treatment effect (CATE)

Meta-Learners
- T-Learner: fit models for mu1(x) = E[Y | X = x, T = 1] and mu0(x) = E[Y | X = x, T = 0] using treatment-group and control-group data separately
- S-Learner: fit one model mu(x, t) using the combined data
- Both need to strike a balance between bias and variance
- Popular base learners: random forest, BART (Bayesian additive regression trees), Lasso
- But bias in the fitted mu1 and mu0 can be misinterpreted as a treatment effect
- Recent development: treat mu1 and mu0 as nuisance parameters, directly model tau(x) as a function of X, and put regularization/sparsity on the form of tau
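A minimal T-Learner sketch, using a linear base learner for brevity (the slides mention forests and BART; any regressor slots into `linfit`). The data-generating process is hypothetical, with a CATE that varies linearly in x.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
x = rng.uniform(-1, 1, n)
t = rng.integers(0, 2, n)
tau = 1.0 + x                                 # true CATE varies with x
y = 2.0 * x + t * tau + rng.normal(0, 1, n)   # outcome

def linfit(features, target):
    """Least-squares linear fit; returns a prediction function."""
    beta, *_ = np.linalg.lstsq(features, target, rcond=None)
    return lambda f: f @ beta

def design(x):
    return np.column_stack([np.ones_like(x), x])

# T-Learner: separate outcome models for treatment and control.
mu1 = linfit(design(x[t == 1]), y[t == 1])
mu0 = linfit(design(x[t == 0]), y[t == 0])

grid = np.linspace(-1, 1, 5)
cate_hat = mu1(design(grid)) - mu0(design(grid))
print(cate_hat)  # should track the true CATE 1 + x
```

An S-Learner would instead fit one model on features (x, t) and difference its predictions at t=1 and t=0; the bias/variance trade-off between the two is the point of the slide.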

  - Examples: Targeted Learning, Double ML, U-Learner

Overfit or Not Overfit?
[Figure: Kunzel et al., 2017, Figure 1a]

References
- Deng et al., 2013: Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
- Peysakhovich and Eckles, 2017: Learning Causal Effects from Many Randomized Experiments Using Regularized Instrumental Variables
- Kharitonov et al., 2017: Learning Sensitive Combinations of A/B Test Metrics
- Deng et al., 2017: Continuous Monitoring of A/B Tests Without Pain: Optional Stopping in Bayesian Testing
- Deng, 2015: Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments
- Wager and Athey, 2015: Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests
- Tian et al., 2012: A Simple Method for Detecting Interactions Between a Treatment and a Large Number of Covariates
- Deng et al., 2017: Concise Summarization of Heterogeneous Treatment Effect Using Total Variation Regularized Regression
- Tansey et al., 2017: Interpretable Low-Dimensional Regression via Data-Adaptive Smoothing

- Zhao et al., 2017: Selective Inference for Effect Modification via the Lasso
- Kunzel et al., 2017: Meta-Learners for Estimating Heterogeneous Treatment Effects Using Machine Learning

Questions? http://exp-platform.com

Appendix

Targeted Learning / Removing the Nuisance Parameter
- We don't care about mu0(X); we only care about tau(X)
- Write the model as Y = mu0(X) + T * tau(X) + noise
- Take the conditional expectation given X on both sides: E[Y | X] = mu0(X) + p * tau(X), where p = P(T = 1)
- Subtract the two:

Y - E[Y | X] = (T - p) * tau(X) + noise
- mu0(X) is removed! Also, p is known in a randomized experiment
- Fit a model m(X) for E[Y | X], plug it in for E[Y | X], fit the model for tau(X), and put sparsity constraints on tau for better interpretation (Deng et al. 2017, Tansey et al. 2017)
- Surprise! In a randomized experiment the fit m(X) doesn't need to be unbiased; in fact, Tian et al. use a working model with a constant m. A better m(X) reduces variance!
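A sketch of the residualization above on hypothetical data, deliberately using the crude constant working model m = mean(Y): regressing the residual on (T - p) times the features recovers tau(x) even though mu0(X) is never modeled.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30_000
x = rng.uniform(0, 1, n)
p = 0.5                              # known assignment probability
t = rng.integers(0, 2, n)
mu0 = np.sin(6 * x)                  # complicated nuisance mu0(X)
tau = 0.5 + x                        # true CATE
y = mu0 + t * tau + rng.normal(0, 1, n)

# Constant working model m for E[Y | X]; (T - p) is mean-zero and
# independent of X, so the leftover mu0(X) does not bias the fit.
m = y.mean()
z = t - p
features = np.column_stack([z, z * x])   # (T - p) * [1, x]
beta, *_ = np.linalg.lstsq(features, y - m, rcond=None)
print(beta)  # approaches the coefficients of tau(x) = 0.5 + x
```

Swapping the constant m for a real regression of Y on X leaves the estimate unbiased but shrinks its variance, which is the "better m(X) reduces variance" point above.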