Research Resources for Students

Quantitative Research Design

Finding data: Use Princeton’s guide to political science data sources to find high-quality quantitative datasets that measure your key concepts.

Choosing a test for your hypotheses: The table below can help you select a test for obtaining the direction, size, and significance of the relationship between two variables using quantitative data. Links to learn more about each test will be added to the table (coming soon). Note: this is not an exhaustive list of tests, but it covers tests that are very common in political science.

Type of Dependent Variable | Type of Independent Variable | Control Variables | Test
Any                        | Dichotomous                  | No                | Difference-in-means t-test
Categorical                | Categorical                  | No                | Chi-squared test
Continuous                 | Continuous                   | No                | Pearson Correlation
Continuous                 | Any                          | No                | Simple (OLS) Regression
Continuous                 | Any                          | Yes               | Multiple (OLS) Regression
Dichotomous                | Any                          | Yes               | Logistic (Logit) Regression
Ordinal                    | Any                          | Yes               | Ordered Logit Regression
Count                      | Any                          | Yes               | Negative Binomial Regression
Event                      | Any                          | Yes               | Event History Model
Note: Continuous refers to interval or ratio variables; Dichotomous refers to nominal variables with 2 values; Categorical refers to nominal or ordinal variables with 3 or more values; Ordinal refers to categorical variables with 3 or more values that have a meaningful order, so that higher values indicate more of the underlying concept; Count refers to variables that take whole-number values with none that are negative (0, 1, 2, and so on); Event refers to dichotomous variables indicating when an event happened in time-series cross-sectional data.
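The step-by-step sections further down this page stop at OLS regression. For the logistic regression row of the table, the following is a minimal sketch in R, assuming simulated data. The variable names 'voted' and 'income' are hypothetical and are not from any dataset referenced on this page.

```r
# Simulated data for illustration only; 'voted' and 'income' are hypothetical names
set.seed(42)
n <- 500
income <- rnorm(n, mean = 50, sd = 10)
# Assumed true model: the log-odds of voting rise by 0.1 per unit of income
p <- plogis(-5 + 0.1 * income)
voted <- rbinom(n, size = 1, prob = p)
mydata <- data.frame(voted, income)

# Logistic regression: glm() with a binomial family and logit link
logit1 <- glm(voted ~ income, data = mydata, family = binomial(link = "logit"))
summary(logit1)

# Coefficients are in log-odds; exponentiate them to get odds ratios
exp(coef(logit1))
```

Control variables can be added to the formula with the + sign, in the same way as for a linear model.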

Online Tutorials and Manuals

There are numerous online manuals and tutorials on data analysis and social scientific research design. Here are some that are free and accessible:

For an intuitive sense of regression analysis, see Eyeball Regression by Sophie E. Hill.

Data Analysis in Excel, R, and Stata

Loading data in R:

#First, set your working directory, replacing "/filepath" with the correct file path for your computer.

setwd("/filepath") 

#Then, use the following command to read a .csv file into R. Replace "filename.csv" with the correct name of the file from your working directory. Specify header=TRUE if your .csv file has variable names in the first row. Set na.strings="NA" to code cells that contain NA as missing observations.

data<-read.csv("filename.csv", header=TRUE, na.strings="NA")

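#You can also read in .dta files. First, install and load the foreign package. Then, read in the .dta file, replacing "filename.dta" with the correct name of the file from your working directory. Note that read.dta only reads files saved in the format of Stata versions 5 through 12; for files saved by newer versions of Stata, the read_dta command in the haven package is an alternative.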

install.packages("foreign") 
library(foreign) 
data<-read.dta("filename.dta") 
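After loading a file, it helps to confirm that variables and missing values were read as intended. The following is a minimal sketch that writes a small made-up .csv file to a temporary location and reads it back; the file contents and variable names are hypothetical.

```r
# Write a tiny example file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,gdp,regime",
             "A,1.5,democracy",
             "B,NA,autocracy",
             "C,,democracy"), tmp)

# Treat both the string "NA" and empty cells as missing
mydata <- read.csv(tmp, header = TRUE, na.strings = c("NA", ""))

str(mydata)             # variable names and types
summary(mydata)         # quick look at each variable
colSums(is.na(mydata))  # number of missing values per variable
```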

Loading data in Stata:

*First, set your working directory, replacing "/filepath" with the correct file path for your computer. 

cd "/filepath"

*Then, use the following command to read in a .csv file into Stata. Replace "filename.csv" with the correct name of the file from your working directory. 

import delimited "filename.csv", clear 

*You can also read in .dta files. Specify 'clear' to clear out data you've already loaded into Stata.

use "filename.dta", clear

Difference-in-means tests for paired samples (i.e. same group measured twice) or two-sample t-tests (i.e. distinct groups):

Excel

In an empty cell, enter the following function:

=TTEST(array1,array2,tails,type)

In place of array1, highlight your first group of observations. In place of array2, highlight your second group of observations. For tails, enter 2 for a two-tailed test, which is the convention, or 1 if you have a directional hypothesis. For type, enter 1 for a paired sample t-test, 2 for a two-sample t-test of groups with equal variance, or 3 for a two-sample t-test of groups with unequal variance.

R

#For a two-sample t-test of groups with unequal variance, create an object 'dm' which is defined as the results of the difference-in-means test. Replace 'dv' and 'iv' with the names of the dependent variable and group variable in your own dataset. Replace 'mydata' with the name of your dataset object. Change 'FALSE' to 'TRUE' to specify a t-test of groups with equal variance.

dm<-t.test(dv~iv,data=mydata,var.equal=FALSE)

#Then, retype the object name to see the results

dm

#For a paired sample t-test, create an object 'dmp' which is defined as the results of the difference-in-means test for a paired sample. Replace 'x' with the name of the variable that contains the first measurement of the group being observed and replace 'z' with the name of the variable that contains the second measurement of the group being observed. Replace 'mydata' with the name of your dataset object. 

dmp<-t.test(mydata$x, mydata$z, paired = TRUE, alternative="two.sided")

#Note the option alternative="two.sided" specifies a two-tailed test. Change to "less" for a one-tailed test for hypotheses that predict the first measurement (i.e. mydata$x) will be less than the second measurement (i.e. mydata$z); switch to "greater" if that is what your hypothesis predicts.

#Then, retype the object name to see the results

dmp
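To see what the two-sample test returns, here is a minimal sketch on simulated data. The group labels and the numbers are made up; the point is only to show where the direction, size, and significance of the difference are stored in the results object.

```r
set.seed(1)
# Hypothetical data: 'dv' is the outcome, 'iv' is a two-category group variable
mydata <- data.frame(
  dv = c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 13, sd = 3)),
  iv = rep(c("control", "treated"), each = 50)
)

# Two-sample t-test of groups with unequal variance (Welch test)
dm <- t.test(dv ~ iv, data = mydata, var.equal = FALSE)

dm$estimate   # the two group means (direction and size of the difference)
dm$statistic  # the t statistic
dm$p.value    # significance
dm$conf.int   # 95% confidence interval for the difference in means
```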

Stata

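*For a two-sample t-test of groups with unequal variance, use the ttest command with the unequal option. Replace 'dv' and 'iv' with the names of the dependent variable and group variable in your own dataset. Omit 'unequal' to specify a t-test of groups with equal variance, which is the default.

ttest dv, by(iv) unequal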

*For a paired sample t-test, use the ttest command. Replace 'x' with the name of the variable that contains the first measurement of the group being observed and replace 'z' with the name of the variable that contains the second measurement of the group being observed. The command defaults to a two-tailed test.

ttest x == z

Chi-squared test:

Excel

A chi-squared test in Excel takes three steps: create a Pivot Table to generate the observed distribution of values, use mathematical functions to calculate the expected distribution of values, and then enter =CHISQ.TEST(actual_range,expected_range), where 'actual_range' is replaced with the observed distribution of values and 'expected_range' is replaced with the expected distribution of values. To see how to do this multistep process in Excel, check out this YouTube tutorial.

R

#First, create an object ('tab1' in the example below) using the table function that is the observed distribution of your independent variable across your dependent variable. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset, respectively. Replace 'mydata' with the name of your dataset object. Then, create an object ('x2t' in the example below) which is defined as the results of the chi-squared test on that table.

tab1<-table(mydata$iv,mydata$dv)
x2t<-chisq.test(tab1)

#Then, retype the object name to see the results

x2t
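The results object also stores the expected distribution that the test compares against the observed one. The following is a minimal sketch with made-up counts (200 hypothetical respondents, with 'iv' as region and 'dv' as ideology); the category labels are assumptions for illustration.

```r
# Hypothetical data: 100 respondents in each of two regions
mydata <- data.frame(
  iv = rep(c("north", "south"), each = 100),
  dv = c(rep(c("left", "center", "right"), times = c(50, 30, 20)),
         rep(c("left", "center", "right"), times = c(25, 30, 45)))
)

tab1 <- table(mydata$iv, mydata$dv)
x2t <- chisq.test(tab1)

tab1           # observed counts
x2t$expected   # counts expected if the two variables were unrelated
x2t$statistic  # the chi-squared statistic
x2t$p.value    # significance
```

A common rule of thumb is to check that the expected counts are not too small (often at least 5 per cell) before relying on the result.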

Stata

*Use the tab command and chi2 option in Stata to execute a chi-squared test. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset, respectively. 

tab dv iv, chi2

Correlation:

Excel

In an empty cell, enter the following function:

=CORREL(array1,array2)

In place of array1, highlight the observations of your independent variable. In place of array2, highlight the observations of your dependent variable.

R

#The cor command calculates Pearson's r correlation coefficient. Replace 'x' and 'z' with the names of the variables in your own dataset. Replace 'mydata' with the name of your dataset object. Retype the object name to see the results.

cor1<-cor(mydata$x,mydata$z) 
cor1
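The cor command returns only the correlation coefficient, and it returns NA if either variable has missing values. The following is a minimal sketch on simulated data showing one way to handle missing values and how the cor.test command adds a significance test; the data are made up.

```r
set.seed(7)
# Hypothetical data with one missing value in 'z'
x <- rnorm(100)
z <- 0.6 * x + rnorm(100, sd = 0.8)
z[5] <- NA
mydata <- data.frame(x, z)

cor(mydata$x, mydata$z)                        # NA because of the missing value
cor(mydata$x, mydata$z, use = "complete.obs")  # drops incomplete pairs first

# cor.test reports Pearson's r together with a significance test
ct <- cor.test(mydata$x, mydata$z)
ct$estimate   # r (direction and size)
ct$p.value    # significance
ct$conf.int   # 95% confidence interval for r
```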

Stata


*The corr command calculates Pearson's r correlation coefficient. Replace 'x' and 'z' with the names of the variables in your own dataset. You can also create a correlation matrix by simply adding to the list of variables, as shown below.

corr x z
corr x z a b 

Simple and Multiple Regression (OLS):

R

#First, create an object 'm1' using the lm command that is a linear model of the dependent variable as a function of (i.e. ~) the independent variable. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset. Replace 'mydata' with the name of your dataset object.

m1<-lm(dv~iv,data=mydata) 

#Use the summary() command to see the results

summary(m1)

#A simple regression is essentially drawing a best fit line to characterize the relationship between the two variables. You can visualize the results by using the plot() command. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset, respectively. Replace 'mydata' with the name of your dataset object. Use the abline() command to add the linear best fit line.

plot(mydata$iv,mydata$dv,pch=16,col="black", main="Enter Main Title Here", xlab="Enter X-Axis Label Here", ylab="Enter Y-Axis Label Here")
abline(m1,col="red")
box()

#You can estimate a multiple regression by adding control variables to your linear model. Simply use the + sign before each additional control variable. Replace 'dv', 'iv', and 'mydata' with the proper names in your own application, as noted above. Replace 'c1' with the name of a control variable in your own dataset. Add additional control variables using the + sign.

m2<-lm(dv~iv+c1,data=mydata) 
summary(m2) 
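To illustrate why adding a control variable can change the estimate, here is a minimal sketch on simulated data in which the control 'c1' affects both 'iv' and 'dv'. The data-generating values (a true effect of 2 for 'iv') are assumptions chosen for the example.

```r
set.seed(123)
n <- 200
# Hypothetical data: 'c1' is a control that affects both 'iv' and 'dv'
c1 <- rnorm(n)
iv <- 0.5 * c1 + rnorm(n)
dv <- 1 + 2 * iv + 1.5 * c1 + rnorm(n)
mydata <- data.frame(dv, iv, c1)

m1 <- lm(dv ~ iv, data = mydata)       # simple regression
m2 <- lm(dv ~ iv + c1, data = mydata)  # multiple regression with the control

coef(m1)["iv"]  # picks up part of the effect of the omitted control
coef(m2)["iv"]  # should be closer to the assumed true effect of 2

coef(summary(m2))  # estimates, standard errors, t values, and p-values
confint(m2)        # 95% confidence intervals
```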

Stata

*The reg command calculates a simple regression between the dependent variable (listed first) and the independent variable (listed second). Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset. 

reg dv iv

*A simple regression is essentially drawing a best fit line to characterize the relationship between the two variables. You can visualize the results by using the twoway command to create a scatter plot with best fit line. Replace 'dv' and 'iv' with the names of the dependent variable and independent variable in your own dataset, respectively.

twoway (scatter dv iv) (lfit dv iv)

*You can estimate a multiple regression by adding control variables to your linear model. Simply list each additional control variable after your primary independent variable. Replace 'dv' and 'iv' with the proper names in your own application, as noted above. Replace 'c1' with the name of a control variable in your own dataset.

reg dv iv c1

Interpreting Results:

Coming soon.

Data Visualization:

Coming soon.

Remember: you can find lots of R and Stata help online. Many users have encountered the same issues and have found workarounds.

Qualitative Research Design

The above slide demonstrates how you might choose cases using either method to test the following hypothesis: universities with more undergraduate students will charge more in fees than universities with fewer undergraduate students.
Single-case studies allow researchers to engage in process tracing, and these cases should still be chosen with care and justification.