Statistics for clinicians
Biostatistics course by Kevin E. Kip, Ph.D., FAHA
Professor and Executive Director, Research Center, University of South Florida, College of Nursing
Professor, College of Public Health, Department of Epidemiology and Biostatistics
Associate Member, Byrd Alzheimer's Institute, Morsani College of Medicine
Tampa, FL, USA

SECTION 1.1 Module Overview and Introduction
Introduction to biostatistics, descriptive statistics, SPSS, and PowerPoint.

SECTION 1.4 Introduction to SPSS

Database structure
Data view and variable view
Variable names, labels, and formats
Interactive menus
SPSS syntax generated from interactive analyses

SECTION 1.5 Summarizing Data in Charts

Summarizing Data: Charts
1. One categorical, >1 proportion/percentage
   (i) Bar chart
   (ii) Stacked bar chart
   (iii) Stacked bar chart (100%)
2. One categorical, >1 continuous variable
   (i) Box plot
   (ii) High-low
   (iii) Line
   (iv) Kernel-density plots
3. Two continuous variables
   (i) X-Y scatter
   (ii) Histogram (can be used for 1 variable)

1. One categorical, >1 proportion/percentage
(i) Bar chart
Rectangular bars with lengths proportional to the values that they represent.
Bars can be plotted vertically or horizontally.

(ii) Stacked bar chart
Can be counts or percentages.
The stacks do not sum to a specified value.
[Figure: stacked bar chart; y-axis: % Obese, x-axis: Age Group]

(iii) Stacked bar chart (100%)

Bar Charts and Stacked Bar Charts
It is important to choose between row and column percentages.
Example: Race and blood pressure classification
Usually, the row variable is the predictor, and the column variable is the outcome.

SPSS: Analyze > Descriptive Statistics > Crosstabs

Bar Charts and Stacked Bar Charts
Column percentage (SPSS CROSSTABS):

CROSSTABS
  /TABLES=SCR_RACECAT3 BY SCR_BP_CLASS4
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT COLUMN
  /COUNT ROUND CELL
  /BARCHART.

Race * BP classification Crosstabulation (column percentages)

Race   Statistic                    Normal  Prehypertensive  Hypertensive Stage 1  Hypertensive Stage 2   Total
White  Count                           247              397                   294                    95    1033
       % within BP classification    65.2%            58.3%                 49.8%                 38.0%   54.4%
Black  Count                           117              262                   275                   149     803
       % within BP classification    30.9%            38.5%                 46.6%                 59.6%   42.3%
Other  Count                            15               22                    21                     6      64
       % within BP classification     4.0%             3.2%                  3.6%                  2.4%    3.4%
Total  Count                           379              681                   590                   250    1900
       % within BP classification   100.0%           100.0%                100.0%                100.0%  100.0%

With column percentages it is difficult to identify trends.

Bar Charts and Stacked Bar Charts
Row percentage (SPSS CROSSTABS):

CROSSTABS
  /TABLES=SCR_RACECAT3 BY SCR_BP_CLASS4
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT ROW
  /COUNT ROUND CELL
  /BARCHART.

Race * BP classification Crosstabulation (row percentages)

Race   Statistic                    Normal  Prehypertensive  Hypertensive Stage 1  Hypertensive Stage 2   Total
White  Count                           247              397                   294                    95    1033
       % within Race                 23.9%            38.4%                 28.5%                  9.2%  100.0%
Black  Count                           117              262                   275                   149     803
       % within Race                 14.6%            32.6%                 34.2%                 18.6%  100.0%
Other  Count                            15               22                    21                     6      64
       % within Race                 23.4%            34.4%                 32.8%                  9.4%  100.0%
Total  Count                           379              681                   590                   250    1900
       % within Race                 19.9%            35.8%                 31.1%                 13.2%  100.0%

Use row percentages in the stacked bar chart (PowerPoint).

PowerPoint Chart: Column > 100% Stacked Column

PowerPoint Chart (Practice): Column > 100% Stacked Column
Display Quality of Life from Poor to Excellent by Gender

Column Percentages for QOL
Gender * QOL: Health Crosstabulation

Gender  Statistic               Excellent  Very good    Good    Fair    Poor   Total
Male    Count                         128        245     249      58       6     686
        % within QOL: Health        40.8%      34.4%   34.7%   26.6%   25.0%   34.5%
Female  Count                         186        467     469     160      18    1300
        % within QOL: Health        59.2%      65.6%   65.3%   73.4%   75.0%   65.5%
Total   Count                         314        712     718     218      24    1986
        % within QOL: Health       100.0%     100.0%  100.0%  100.0%  100.0%  100.0%

Row Percentages for QOL
Gender * QOL: Health Crosstabulation

Gender  Statistic               Excellent  Very good    Good    Fair    Poor   Total
Male    Count                         128        245     249      58       6     686
        % within Gender             18.7%      35.7%   36.3%    8.5%    0.9%  100.0%
Female  Count                         186        467     469     160      18    1300
        % within Gender             14.3%      35.9%   36.1%   12.3%    1.4%  100.0%
Total   Count                         314        712     718     218      24    1986
        % within Gender             15.8%      35.9%   36.2%   11.0%    1.2%  100.0%
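The difference between row and column percentages in the Gender * QOL crosstabulation can be checked by hand. The following is a minimal Python sketch (not SPSS output, and not part of the course materials) that recomputes both from the cell counts shown above; the function and variable names are illustrative assumptions.

```python
# Cell counts copied from the Gender * QOL: Health crosstabulation.
QOL_LEVELS = ["Excellent", "Very good", "Good", "Fair", "Poor"]
COUNTS = {
    "Male": [128, 245, 249, 58, 6],
    "Female": [186, 467, 469, 160, 18],
}


def row_percentages(counts):
    """Percent of each row total, so every row sums to 100."""
    return {
        row: [100.0 * n / sum(cells) for n in cells]
        for row, cells in counts.items()
    }


def column_percentages(counts):
    """Percent of each column total, so every column sums to 100."""
    col_totals = [sum(col) for col in zip(*counts.values())]
    return {
        row: [100.0 * n / total for n, total in zip(cells, col_totals)]
        for row, cells in counts.items()
    }


row_pct = row_percentages(COUNTS)
col_pct = column_percentages(COUNTS)

for gender in COUNTS:
    for level, r, c in zip(QOL_LEVELS, row_pct[gender], col_pct[gender]):
        print(f"{gender:6s} {level:10s} row {r:5.1f}%  column {c:5.1f}%")
```

Row percentages answer "what share of males report each QOL level", which is why they suit a 100% stacked column chart with one column per gender.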

[Figures: two PowerPoint 100% stacked column charts]

2. One categorical, >1 continuous variable
(i) Box plot
Also known as a box-and-whisker diagram.
Displays 5 summary statistics: minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum.
Makes no assumptions about the underlying statistical distribution (non-parametric).
SPSS: Graphs > Chart Builder > Boxplot
Example: HDL cholesterol (continuous) distribution by gender (categorical)

Question: Are HDL cholesterol levels positively or negatively skewed? Run the SPSS Frequencies procedure.
Question: Are triglycerides positively or negatively skewed? Run the SPSS Frequencies procedure.

(i) Box plot (Practice)
Draw a box plot of the distribution of HDL cholesterol by ethnicity:
Hispanic: Min=30, Q1=40, Q2=46, Q3=56, Max=86
Non-Hispanic: Min=21, Q1=46, Q2=56, Q3=66, Max=131
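Before drawing the practice box plots, it can help to work out the box height and whisker lengths from the five summary statistics. The Python sketch below does this for the two ethnicity groups above. It also computes Bowley's quartile skewness, (Q3 + Q1 - 2*Q2) / (Q3 - Q1), which is a swapped-in quartile-based measure and not the moment-based skewness that the SPSS Frequencies procedure reports. All names are illustrative assumptions.

```python
from typing import NamedTuple


class FiveNumberSummary(NamedTuple):
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float


def iqr(s):
    """Height of the box: Q3 minus Q1."""
    return s.q3 - s.q1


def whisker_lengths(s):
    """(lower, upper) whisker lengths when whiskers run to min and max."""
    return (s.q1 - s.minimum, s.maximum - s.q3)


def bowley_skewness(s):
    """Quartile skewness: > 0 suggests positive skew, < 0 negative skew."""
    return (s.q3 + s.q1 - 2 * s.median) / (s.q3 - s.q1)


# Values taken from the practice slide.
HISPANIC = FiveNumberSummary(30, 40, 46, 56, 86)
NON_HISPANIC = FiveNumberSummary(21, 46, 56, 66, 131)

for label, s in [("Hispanic", HISPANIC), ("Non-Hispanic", NON_HISPANIC)]:
    lower, upper = whisker_lengths(s)
    print(f"{label}: IQR={iqr(s)}, lower whisker={lower}, "
          f"upper whisker={upper}, Bowley skewness={bowley_skewness(s):.2f}")
```

In both groups the upper whisker is longer than the lower one, which is the visual cue for a positively skewed distribution in a box plot.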

2. One categorical, >1 continuous variable
(ii) High-low
PowerPoint's open-high-low-close chart (i.e., the type used for financial data) can be tricked into showing distributions of continuous variables.
The upper and lower ends (high-low) can represent any percentiles, such as the 5th and 95th percentiles.
In these charts, the filled rectangles depict the interquartile range (25th to 75th percentile). The lower and upper limits of the vertical lines depict the 5th and 95th percentiles, respectively.

[Figure: Total cholesterol (mg/dl) for White vs. Black by self-report (P=0.003) and across Black admixture-defined groups (Ptrend=0.009). Group labels as shown: EU>85%, EU>40%, EU>25%, EU<40%, EU<25%. Group sizes as shown: (753), (464), (753), (68), (201), (195).]

[Figure: Total cholesterol (mg/dl) for U.S. Black, Ghana Urban, and Ghana Rural. U.S. Black vs. Ghana Urban: P=0.0001; U.S. Black vs. Ghana Rural: P<0.0001; Ghana Urban vs. Ghana Rural: P<0.0001. Group sizes as shown: N=594, N=546, N=80, N=111.]

Total Cholesterol (mg/dl): Practice in PowerPoint (first draw by hand)

Percentile    Male   Female
5%             137      153
25%            175      190
75%            224      245
95%            271      295

Trick PowerPoint (open-high-low-close series):
Open = 25%
High = 95%
Low = 5%
Close = 75%

(iii) Line chart
Typically represents the trend in data over intervals of time (i.e., a time series).
Often used to show repeated health outcome measurements over time.
[Figure: line chart; y-axis: Prevalence of Use (%); title: Crohn's Disease Medications]
In this example, the categorical variable is the individual subject nested within each treatment arm of the trial.
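Returning to the high-low practice chart: the Open/High/Low/Close assignment given in the high-low slides can be applied mechanically to the total cholesterol percentiles. The Python sketch below rearranges the practice values into the series order an open-high-low-close chart expects. The names are illustrative assumptions, and the tab-separated output is only one convenient way to paste the values into a chart datasheet.

```python
# Total cholesterol percentiles (mg/dl) from the practice slide.
PERCENTILES = {
    "Male": {5: 137, 25: 175, 75: 224, 95: 271},
    "Female": {5: 153, 25: 190, 75: 245, 95: 295},
}

# Which percentile plays each open-high-low-close role, per the slides.
OHLC_TO_PERCENTILE = {"Open": 25, "High": 95, "Low": 5, "Close": 75}
SERIES_ORDER = ("Open", "High", "Low", "Close")


def ohlc_rows(percentiles):
    """Return one [group, open, high, low, close] row per group."""
    return [
        [group] + [values[OHLC_TO_PERCENTILE[name]] for name in SERIES_ORDER]
        for group, values in percentiles.items()
    ]


rows = ohlc_rows(PERCENTILES)

print("\t".join(("Group",) + SERIES_ORDER))
for row in rows:
    print("\t".join(str(cell) for cell in row))
```

With this mapping the chart's filled box spans Open to Close (the interquartile range) and the vertical line spans Low to High (5th to 95th percentile).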

2. One categorical, >1 continuous variable
(iv) Kernel density plots
Like a histogram, but constructs a smooth probability density function.

3. Two continuous variables
(i) X-Y scatter
Shows the relationship between two sets of continuous data.
Also called a scatter chart, scattergram, scatter diagram, or scatter graph.
[Figure: scatter plot of Body Density (y-axis, 0.96 to 1.1) against Body Mass Index (x-axis, 15 to 60)]

(ii) Histogram(s)
Probability distribution of a continuous variable(s) displayed over discrete intervals (bins).
The bins contain frequency counts, or can be normalized to display relative frequencies (i.e., the proportion of cases that fall into each bin, with total area = 1.0).
[Figure: histogram; y-axis: # subjects]

SECTION 1.6 SPSS Data Manipulation

SPSS Data Manipulation and Syntax Editor
1. Recode continuous variable into arbitrarily defined or pre-defined categories
2. Visual binning of continuous variable
3. Transform a skewed variable
4. Using the SPSS Data Editor

1. Recode continuous variable into arbitrarily defined or pre-defined categories
Example: Define age into 3 categories (arbitrary): 45-54, 55-64, 65 and older
SPSS: Transform > Recode into Different Variables
Input variable is age
Output variable
   Name: age_cat
   Label: Age in 3 categories
Click on Old and New Values
Range (specify explicitly):
   45-54 = value 1
   55-64 = value 2
   65 and older = value 3
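The recode rule above can be written out as a small function to make the category boundaries explicit. This is a Python sketch for illustration, not SPSS syntax; the function name, the label dictionary, and the sample ages are assumptions, and values outside the three ranges are simply left uncoded (None).

```python
AGE_CAT_LABELS = {1: "45-54", 2: "55-64", 3: "65 and older"}


def recode_age(age):
    """Map age in years to category 1, 2, or 3; None if no range matches."""
    if age is None:
        return None
    if 45 <= age <= 54:
        return 1
    if 55 <= age <= 64:
        return 2
    if age >= 65:
        return 3
    return None  # e.g. ages below 45 are not covered by the three categories


# Hypothetical ages chosen to exercise each boundary.
sample_ages = [44, 45, 54, 55, 64, 65, 80]
age_cat = [recode_age(a) for a in sample_ages]

for age, cat in zip(sample_ages, age_cat):
    print(age, cat, AGE_CAT_LABELS.get(cat, "not coded"))
```

Checking the boundary values (54, 55, 64, 65) is a quick way to confirm that the ranges neither overlap nor leave gaps for whole-year ages.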

2. Visual binning of continuous variable
Example: Body mass index
Put in output name for binned variable
Make cutpoints: equal percentiles based on scanned cases
Put in labels for frequency display in bar chart
SPSS code: Visual Binning.

3. Transform a skewed variable
Descriptive statistics for triglycerides in natural scale: mean, median, SD, min, max, skewness, kurtosis
Chart = histogram with normal curve superimposed
Triglycerides are skewed. Use a transformation to create a new variable and reduce the skew in triglycerides.
SPSS: Transform > Compute Variable
Target Variable: LOG_TRIG
Numeric Expression: lg10(LAB_TRIG_VAP)
SPSS syntax:

COMPUTE log_trig=lg10(LAB_TRIG_VAP).

4. Using the SPSS Data Editor
SPSS: File > New > Syntax
Save the file with a new name
1. Select males only (scr_sex=1)
Data > Select Cases > If scr_sex=1

USE ALL.
COMPUTE filter_$=(SCR_SEX=1).
VARIABLE LABELS filter_$ 'SCR_SEX=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

2. Run descriptives for age

DESCRIPTIVES VARIABLES=SCR_AGE
  /STATISTICS=MEAN STDDEV MIN MAX.

3. Copy code and repeat for females (scr_sex=2)

USE ALL.
COMPUTE filter_$=(SCR_SEX=2).
VARIABLE LABELS filter_$ 'SCR_SEX=2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

DESCRIPTIVES VARIABLES=SCR_AGE
  /STATISTICS=MEAN STDDEV MIN MAX.
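The filter-then-describe pattern above (select one sex, then report mean, standard deviation, minimum, and maximum of age) can be mirrored outside SPSS to check results by hand. The Python sketch below is illustrative only: the records are made-up values, not course data, and the field names simply reuse SCR_SEX and SCR_AGE from the syntax. It uses the sample (n-1) standard deviation.

```python
import statistics

# Hypothetical records for illustration; 1 = male, 2 = female.
RECORDS = [
    {"SCR_SEX": 1, "SCR_AGE": 50},
    {"SCR_SEX": 2, "SCR_AGE": 48},
    {"SCR_SEX": 1, "SCR_AGE": 60},
    {"SCR_SEX": 2, "SCR_AGE": 58},
    {"SCR_SEX": 2, "SCR_AGE": 62},
    {"SCR_SEX": 1, "SCR_AGE": 70},
    {"SCR_SEX": 2, "SCR_AGE": 72},
]


def describe_age(records, sex):
    """Mean, sample SD, min, and max of SCR_AGE for one SCR_SEX value."""
    ages = [r["SCR_AGE"] for r in records if r["SCR_SEX"] == sex]
    return {
        "n": len(ages),
        "mean": statistics.mean(ages),
        "sd": statistics.stdev(ages),  # sample SD (n - 1 denominator)
        "min": min(ages),
        "max": max(ages),
    }


males = describe_age(RECORDS, sex=1)
females = describe_age(RECORDS, sex=2)

for label, stats in [("Males", males), ("Females", females)]:
    print(label, stats)
```

Passing the sex code as a parameter plays the same role as copying the syntax block and changing SCR_SEX=1 to SCR_SEX=2.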