**Quantitative Research Design**

**Finding data:** Use Princeton’s guide to political science data sources to find high-quality quantitative datasets to measure your key concepts.

**Choosing a test for your hypotheses:** *The table below can help you select a test for estimating the direction, size, and significance of the relationship between two variables using quantitative data. Links to more information about each test will be added to the table (coming soon). Note: this list is not exhaustive, but it covers tests that are very common in political science.*

| Type of Dependent Variable | Type of Independent Variable | Control Variables | Test |
|---|---|---|---|
| Any | Dichotomous | No | Difference-in-means t-test |
| Categorical | Categorical | No | Chi-squared test |
| Continuous | Continuous | No | Pearson Correlation |
| Continuous | Any | No | Simple (OLS) Regression |
| Continuous | Any | Yes | Multiple (OLS) Regression |
| Dichotomous | Any | Yes | Logistic (Logit) Regression |
| Ordinal | Any | Yes | Ordered Logit Regression |
| Count | Any | Yes | Negative Binomial Regression |
| Event | Any | Yes | Event History Model |

## Online Tutorials and Manuals

There are numerous online manuals and tutorials on data analysis and social scientific research design. Here are some that are free and accessible:

- Wehde, Wesley et al. 2020. *Quantitative Research Methods for Political Science, Public Policy and Public Administration for Undergraduates: 1st Edition with Applications in Excel.*
- Sheppard, Valerie. 2020. *Research Methods for the Social Sciences: An Introduction.*
- Monogan, James E. 2015. *Political Analysis Using R.*
- Franco, Josh et al. 2020. *Introduction to Political Science Research Methods* and *Polimetrics: A Stata Companion to Introduction to Political Science Research Methods.*
- Huntington-Klein, Nick. 2020. *The Effect: An Introduction to Research Design and Causality.*

For an intuitive sense of regression analysis, see Eyeball Regression by Sophie E. Hill.

## Data Analysis in Excel, R, and Stata

**Loading data in R:**

```
#First, set your working directory, replacing "/filepath" with the correct file path for your computer.
setwd("/filepath")
#Then, use the following command to read a .csv file into R. Replace "filename.csv" with the correct name of the file from your working directory. Specify header=TRUE if your .csv file has variable names in the first row. Set na.strings="NA" to code cells containing NA as missing observations.
data<-read.csv("filename.csv", header=TRUE, na.strings="NA")
#You can also read in .dta files. First, install and load the foreign package. Then, read in the .dta file, replacing "filename.dta" with the correct name of the file from your working directory.
#Note: read.dta only reads files saved by Stata versions 5-12; for files saved by newer versions of Stata, one option is read_dta() from the haven package.
install.packages("foreign")
library(foreign)
data<-read.dta("filename.dta")
```
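After loading a file, it can help to check that the import behaved as expected. The following is a minimal sketch, assuming a small made-up dataset written to a temporary file; the column names `country` and `turnout` and the values are hypothetical and only serve the illustration.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,turnout",
             "A,61.2",
             "B,NA",
             "C,48.9"), tmp)

# Read it back; cells containing NA are coded as missing.
mydata <- read.csv(tmp, header = TRUE, na.strings = "NA")

# Inspect the structure, the first rows, and the number of missing values per variable.
str(mydata)
head(mydata)
colSums(is.na(mydata))
```

With your own data, replace the temporary file with your file name and run the same three inspection commands.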

**Loading data in Stata:**

```
*First, set your working directory, replacing "/filepath" with the correct file path for your computer.
cd "/filepath"
*Then, use the following command to read a .csv file into Stata. Replace "filename.csv" with the correct name of the file from your working directory.
import delimited "filename.csv", clear
*You can also read in .dta files. Specify 'clear' to clear out data you've already loaded into Stata.
use "filename.dta", clear
```

**Difference-in-means tests for paired samples (i.e., the same group measured twice) or two samples (i.e., distinct groups):**

Excel

In an empty cell, enter the following function:

```
=TTEST(array1,array2,tails,type)
```

In place of array1, highlight your first group of observations. In place of array2, highlight your second group of observations. For tails, researchers conventionally enter 2 for a two-tailed test, but 1 (a one-tailed test) is acceptable if you have a directional hypothesis. For type, enter 1 for a paired sample t-test, 2 for a two-sample t-test assuming equal variance, or 3 for a two-sample t-test assuming unequal variance.

R

```
#For a two-sample t-test of groups with unequal variance, create an object 'dm' which is defined as the results of the difference-in-means test. Replace 'dv' and 'iv' with the names of the dependent variable and group variable in your own dataset. Replace 'mydata' with the name of your dataset object. Change 'FALSE' to 'TRUE' to specify a t-test of groups with equal variance.
dm<-t.test(dv~iv,data=mydata,var.equal=FALSE)
#Then, type the object name to see the results
dm
#For a paired sample t-test, create an object 'dmp' which is defined as the results of the difference-in-means test for a paired sample. Replace 'x' with the name of the variable that contains the first measurement of the group being observed and replace 'z' with the name of the variable that contains the second measurement of the group being observed. Replace 'mydata' with the name of your dataset object.
dmp<-t.test(mydata$x, mydata$z, paired = TRUE, alternative="two.sided")
#Note the option alternative="two.sided" specifies a two-tailed test. Change to "less" for a one-tailed test of a hypothesis that predicts the first measurement (i.e. mydata$x) will be less than the second measurement (i.e. mydata$z); switch to "greater" if your hypothesis predicts the first measurement will be greater.
#Then, type the object name to see the results
dmp
```
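To see what these commands return, here is a minimal sketch using simulated data. The group labels, means, and sample sizes are made up for illustration, and `dv`, `iv`, `x`, and `z` are the same placeholder names used above.

```r
set.seed(123)  # make the simulated data reproducible

# Hypothetical data: an outcome 'dv' for two distinct groups coded in 'iv'.
mydata <- data.frame(
  iv = rep(c("control", "treated"), each = 50),
  dv = c(rnorm(50, mean = 50, sd = 10), rnorm(50, mean = 55, sd = 10))
)

# Two-sample t-test without assuming equal variances (Welch).
dm <- t.test(dv ~ iv, data = mydata, var.equal = FALSE)
dm$estimate   # the two group means (direction and size of the difference)
dm$p.value    # significance
dm$conf.int   # 95% confidence interval for the difference in means

# Hypothetical paired data: the same 30 units measured twice.
x <- rnorm(30, mean = 50, sd = 10)
z <- x + rnorm(30, mean = 2, sd = 5)
dmp <- t.test(x, z, paired = TRUE, alternative = "two.sided")
dmp$estimate  # mean of the paired differences (x minus z)
dmp$p.value
```

Pulling out `$estimate`, `$p.value`, and `$conf.int` individually is convenient when you want to report direction, size, and significance separately rather than reading them off the printed summary.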

Stata

```
*For a two-sample t-test of unequal variance, use the ttest command with the unequal option (without it, Stata assumes equal variances). Replace 'dv' and 'iv' with the names of the dependent variable and group variable in your own dataset.
ttest dv, by(iv) unequal
*For a paired sample t-test, use the ttest command. Replace 'x' with the name of the variable that contains the first measurement of the group being observed and replace 'z' with the name of the variable that contains the second measurement of the group being observed. The output reports both two-tailed and one-tailed p-values.
ttest x == z
```

**Chi-squared test:**

Excel

A chi-squared test in Excel requires creating a Pivot Table to generate the *observed distribution* of values, using mathematical functions to calculate the *expected distribution* of values, and then entering **=CHISQ.TEST(actual_range,expected_range)**, where 'actual_range' is replaced with the observed distribution of values and 'expected_range' is replaced with the expected distribution of values. To see how to do this multistep process in Excel, check out this YouTube tutorial.

R

```
#First, create an object ('tab1' in example below) using the table function that is the observed distribution of your independent variable across your dependent variable. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset, respectively. Replace 'mydata' with the name of your dataset object. Then, pass that table to the chisq.test function and store the results in an object ('x2t' in example below).
tab1<-table(mydata$iv,mydata$dv)
x2t<-chisq.test(tab1)
#Then, type the object name to see the results
x2t
```
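The observed and expected distributions that Excel makes you build by hand are stored in the object that `chisq.test` returns. A minimal sketch, assuming made-up counts for a hypothetical region variable (`iv`) and vote variable (`dv`):

```r
# Hypothetical categorical data: region (iv) and vote choice (dv).
mydata <- data.frame(
  iv = rep(c("North", "South"), times = c(100, 100)),
  dv = c(rep(c("Yes", "No"), times = c(60, 40)),
         rep(c("Yes", "No"), times = c(45, 55)))
)

tab1 <- table(mydata$iv, mydata$dv)   # observed distribution
x2t <- chisq.test(tab1)

x2t$observed    # observed counts (same as tab1)
x2t$expected    # counts expected if the two variables were independent
x2t$statistic   # chi-squared statistic
x2t$p.value     # significance
```

Note that for 2x2 tables `chisq.test` applies a continuity correction by default; set `correct = FALSE` if you want the uncorrected statistic.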

Stata

```
*Use the tab command and chi2 option in Stata to execute a chi-squared test. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset, respectively.
tab dv iv, chi2
```

**Correlation:**

Excel

In an empty cell, enter the following function:

```
=CORREL(array1,array2)
```

In place of array1, highlight the observations of your independent variable. In place of array2, highlight the observations of your dependent variable.

R

```
#The cor function calculates the Pearson's r correlation. Replace 'x' and 'z' with the names of the variables in your own dataset. Replace 'mydata' with the name of your dataset object. Type the object name to see the results.
cor1<-cor(mydata$x,mydata$z)
cor1
```
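`cor` returns only the correlation coefficient. If you also need its significance, base R's `cor.test` reports the same coefficient with a p-value and confidence interval. A minimal sketch, assuming simulated data (the variables and the strength of their relationship are made up):

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical data: two continuous variables with a positive relationship.
mydata <- data.frame(x = rnorm(100))
mydata$z <- 0.5 * mydata$x + rnorm(100)

# use = "complete.obs" drops rows with missing values before computing r.
cor1 <- cor(mydata$x, mydata$z, use = "complete.obs")
cor1

# cor.test reports the same r along with a p-value and confidence interval.
ct <- cor.test(mydata$x, mydata$z, method = "pearson")
ct$estimate   # Pearson's r
ct$p.value    # significance
ct$conf.int   # 95% confidence interval for r
```

The `use = "complete.obs"` argument matters with real data: without it, `cor` returns NA whenever either variable has a missing value.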

Stata

```
*The corr command calculates the Pearson's r correlation. Replace 'x' and 'z' with the names of the variables in your own dataset. You can also create a correlation matrix by simply adding to the list of variables, as shown below.
corr x z
corr x z a b
```

**Simple and Multiple Regression (OLS)**:

R

```
#First, create an object 'm1' using the lm command that is a linear model of the dependent variable as a function of (i.e. ~) the independent variable. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset. Replace 'mydata' with the name of your dataset object.
m1<-lm(dv~iv,data=mydata)
#Use the summary() command to see the results
summary(m1)
#A simple regression is essentially drawing a best fit line to characterize the relationship between the two variables. You can visualize the results by using the plot() command. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset, respectively. Replace 'mydata' with the name of your dataset object. Use the abline() command to add the linear best fit line.
plot(mydata$iv,mydata$dv,pch=16,col="black", main="Enter Main Title Here", xlab="Enter X-Axis Label Here", ylab="Enter Y-Axis Label Here")
abline(m1,col="red")
box()
#You can estimate a multiple regression by adding control variables to your linear model. Simply use the + sign before each additional control variable. Replace 'dv', 'iv', and 'mydata' with proper names in your own application, as noted above. Replace 'c1' with the name of a control variable in your own dataset. Add additional control variables using the + sign.
m2<-lm(dv~iv+c1,data=mydata)
summary(m2)
```
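To show what the regression output contains, here is a minimal sketch using simulated data in which the true coefficients are known. The data-generating values (an intercept of 1, a slope of 2 on `iv`, and a slope of -1 on `c1`) are assumptions chosen for illustration only.

```r
set.seed(2024)  # make the simulated data reproducible

# Hypothetical data: dv depends on iv and on a control variable c1.
n <- 200
mydata <- data.frame(iv = rnorm(n), c1 = rnorm(n))
mydata$dv <- 1 + 2 * mydata$iv - 1 * mydata$c1 + rnorm(n)

m2 <- lm(dv ~ iv + c1, data = mydata)

# Coefficient table: estimate, standard error, t value, and p-value per term.
coefs <- coef(summary(m2))
coefs

# 95% confidence intervals for each coefficient.
confint(m2)

# Share of variance in dv explained by the model.
summary(m2)$r.squared
```

The sign of each estimate gives the direction of the relationship, its magnitude gives the size, and the p-value column gives the significance, holding the other variables in the model constant.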

Stata

```
*The reg command calculates a simple regression between the dependent variable (listed first) and the independent variable (listed second). Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset.
reg dv iv
*A simple regression is essentially drawing a best fit line to characterize the relationship between the two variables. You can visualize the results by using the twoway command to create a scatter plot with best fit line. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset, respectively.
twoway (scatter dv iv) (lfit dv iv)
*You can estimate a multiple regression by adding control variables to your linear model. Simply list each additional control variable after your primary independent variable. Replace 'dv' and 'iv' with proper names in your own application, as noted above. Replace 'c1' with the name of a control variable in your own dataset.
reg dv iv c1
```
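The table near the top of this guide lists logistic regression for dichotomous dependent variables, and the R syntax closely mirrors `lm`. The following is a minimal sketch in R rather than a full treatment, assuming simulated data and the same placeholder names `dv`, `iv`, and `c1`; it uses `glm` with `family = binomial`, which is part of base R.

```r
set.seed(7)  # make the simulated data reproducible

# Hypothetical data: a 0/1 outcome whose probability rises with iv.
n <- 500
mydata <- data.frame(iv = rnorm(n), c1 = rnorm(n))
p <- plogis(-0.5 + 1 * mydata$iv + 0.5 * mydata$c1)
mydata$dv <- rbinom(n, size = 1, prob = p)

# Logit model: coefficients are on the log-odds scale.
m3 <- glm(dv ~ iv + c1, data = mydata, family = binomial(link = "logit"))
summary(m3)

# Exponentiate the coefficients to read them as odds ratios.
exp(coef(m3))
```

Unlike OLS coefficients, logit coefficients are changes in log-odds, so exponentiating them into odds ratios is a common way to describe the size of an effect.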

**Interpreting Results**:

Coming soon.

**Data Visualization**:

Coming soon.

**Remember:** you can find lots of R and Stata help online. Many users have encountered the same issues and have shared workarounds.