Economics of Faith, Race, and Religion: Data Analysis 1 (June 29-July 6)

I am proceeding with the instructions from Professor Parman. He suggested that I run both ordered logistic regressions (ologits) and ordinary linear regressions, and that I also look at the residuals of my regressions. I also used bin scatters this week and worked through my missing data issue.

 

I started to implement my ologit and linear regressions and was immediately concerned: at the time, I did not know the difficulties associated with doing residual analysis for an ordered logistic regression, and I was still learning how to interpret the residual scatter plots for my linear regressions.

binscatter religdens attend

Image1

The bin scatter suggests that attendance increases as religious density increases (attend is coded so that 1 is high attendance, so lower values of attend mean more frequent attendance).

 

Additionally, when I compared my ologit and linear regressions, the fit statistics differed substantially. The pseudo-R2 for an ologit was around 0.22, while the R2 for a linear regression shocked me: 0.01. These were among the questions I raised in the following email to Professor Parman.

My email:

 

Next, the current status of my project. I think it might be helpful for me to outline some of the code that I ran in Stata and then ask follow-up questions:

reg attend religdens sex agerec racecmb if attend<9 & agerec<99 & racecmb<9

image2

Here, I am analyzing the data while excluding observations with missing-value codes. I got some interesting results.

While my explanatory variables seem to be statistically significant, my R-squared value is only 0.0376. Should I be concerned? I’m also wondering if I would be better off making sex a dummy variable.

Some questions regarding residuals, too:

  1. When I look at the residuals, I am not sure whether I should do a regular scatter or a binscatter.
  2. When I looked up how to interpret residuals in a regression analysis, I came across a couple of different methods. Some plot the residuals against the predicted values, others against an independent variable or the dependent variable. I’m not sure what each option would mean for my residual analysis. Here are a couple of residual plots.

binscatter attend_residual religdens if attend<9

image3

binscatter attend_residual attend if attend<9

 

I also did an ologit with the same variables:

 

ologit attend religdens sex agerec racecmb if attend<9 & agerec<99 & racecmb<9

 

 

I was also not able to generate residuals for this analysis: predict attend_residual, res returned the error “option res not allowed”. I wasn’t able to find an answer on the Internet, but my guess is that because an ordered logistic regression is different from a linear regression, Stata cannot predict residuals for it. I’m not sure whether this is correct econometrically, but instead I found the predicted values and subtracted them from the actual values to get residuals:

image5

predict attend_predict if e(sample)

gen attend_res=attend-attend_predict

 

Is this something I’m allowed to do?

 

Professor Parman’s response:

 

For your linear regressions, a low R-squared is nothing to worry about. You are running regressions where the unit of observation is an individual and the outcome is something that can vary for all kinds of reasons that won’t be captured by your independent variable. Individual people are going to choose to attend religious services for a wide range of reasons. Your statistically significant coefficients suggest that variation in your independent variables is associated with meaningful variation in attendance which is the key thing you are looking for. The low R-squared just tells you that while your independent variables are associated with differences in attendance, they can only explain a small portion of the variation in attendance. Unobserved factors are driving the bulk of the variation. This is pretty much what I would expect.
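As an illustration of this point, separate from the email itself, here is a minimal Python sketch on simulated data (not the survey data used in this project). It assumes a hypothetical outcome that depends only weakly on a single regressor, with a made-up slope of 0.2 and a large sample. The helper name ols_simple is an illustrative choice. It shows that a coefficient can be estimated very precisely even when the regressor explains only a few percent of the variation.

```python
import math
import random


def ols_simple(x, y):
    """Fit y = a + b*x by least squares.

    Returns the slope, its t-statistic, and the R-squared of the fit.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))  # residual sum of squares
    sst = sum((yi - my) ** 2 for yi in y)                      # total sum of squares
    se_b = math.sqrt(ssr / (n - 2) / sxx)                      # standard error of the slope
    return b, b / se_b, 1 - ssr / sst


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20000               # hypothetical sample size
x = [rng.gauss(0, 1) for _ in range(n)]
# Hypothetical data-generating process: a small true effect plus a lot of
# individual-level noise that the regressor cannot capture.
y = [0.2 * xi + rng.gauss(0, 1) for xi in x]

slope, t_stat, r2 = ols_simple(x, y)
print(f"slope={slope:.3f}  t={t_stat:.1f}  R-squared={r2:.3f}")
```

The t-statistic grows with the sample size while the R-squared does not, which is why the two can tell such different stories in individual-level data.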

 

– For the residuals, a bin scatter is probably easiest to look at given how many data points you have. This will let you see general trends in the means of the residuals. What you miss from looking at a bin scatter instead of a scatter of the residuals themselves is how the variance of the residuals is changing as your variables change. This would let you know something about heteroscedasticity in the data.
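To make the trade-off concrete, here is a small Python sketch, again on simulated values rather than the project data and not part of the email. It assumes hypothetical residuals whose spread grows with x, and the function name bin_stats is illustrative. The bin means (what a bin scatter plots) stay close to zero in every bin, while the within-bin standard deviation, which the bin scatter does not show, rises steadily.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility
n = 20000
x = [rng.uniform(0, 1) for _ in range(n)]
# Hypothetical heteroscedastic residuals: mean zero everywhere, but the
# standard deviation grows from 0.5 to 2.5 as x goes from 0 to 1.
resid = [rng.gauss(0, 0.5 + 2.0 * xi) for xi in x]


def bin_stats(x, e, n_bins=10):
    """Sort by x, cut into equal-sized bins, return (mean, sd) of e per bin."""
    pairs = sorted(zip(x, e))
    size = len(pairs) // n_bins
    stats = []
    for k in range(n_bins):
        chunk = [ei for _, ei in pairs[k * size:(k + 1) * size]]
        stats.append((statistics.fmean(chunk), statistics.pstdev(chunk)))
    return stats


stats = bin_stats(x, resid)
for k, (m, s) in enumerate(stats):
    print(f"bin {k}: mean residual={m:+.3f}  sd of residual={s:.3f}")
```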

 

As for the different approaches to analyzing the residuals, each one will tell you something different. If you look at the residuals versus the predicted values, you can see whether your model is fitting the data well across the range of outcomes or whether you tend to overpredict (or underpredict) over certain ranges. This helps tell you where the model does a good job of predicting the outcome and helps you see if maybe you need to switch to a nonlinear transformation of your outcome variable. For example, suppose the correct relationship between y and x is ln(y)=a+bx. Think about what the residuals would look like if you fit a straight line through a graph of y versus x.
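The thought experiment at the end of that paragraph can be worked through numerically. The sketch below is an illustration added alongside the email, using the hypothetical values a = 0 and b = 1 and no noise. It generates y from ln(y) = a + bx, fits a straight line to y versus x, and then fits a line to ln(y) versus x. The names fit_line, resid_lin, and resid_log are illustrative.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b


# Hypothetical true relationship: ln(y) = 0 + 1*x, on a grid from 0 to 4.
x = [i / 10 for i in range(41)]
y = [math.exp(0.0 + 1.0 * xi) for xi in x]

# Misspecified model: straight line through y versus x.
a_lin, b_lin = fit_line(x, y)
resid_lin = [yi - (a_lin + b_lin * xi) for xi, yi in zip(x, y)]

# Correct specification: straight line through ln(y) versus x.
log_y = [math.log(yi) for yi in y]
a_log, b_log = fit_line(x, log_y)
resid_log = [li - (a_log + b_log * xi) for xi, li in zip(x, log_y)]

for i in (0, 10, 20, 30, 40):
    print(f"x={x[i]:.1f}  linear-fit residual={resid_lin[i]:+.2f}")
print("largest residual after the log transform:", max(abs(e) for e in resid_log))
```

Because the true curve is convex, the straight line underpredicts at both ends and overpredicts in the middle, so the residuals trace a U shape against x (and against the predicted values). After the log transformation that pattern disappears.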

 

Looking at the residuals versus one of your independent variables tells you similar things, but relative to the independent variable instead of the outcome. It would help you see whether there are certain ranges of your x variable where the model is fitting the data well and other areas where it does not fit well and could lead you to think that a transformation of the independent variable might be useful.

 

When looking at your two residual plots above, it looks like your residuals are pretty random relative to the religious density variable, so there isn’t anything there that would suggest you need to change the model specification. When looking at the plot with attend on the horizontal axis, you are basically graphing a mechanical relationship. Your regression is likely giving predicted values for attend that are all pretty close to the overall mean. This means that people that actually have attend=1 will have a very big and negative residual and people that actually have attend=6 have a very big and positive residual. As you move closer to the mean of attend, these residuals will get smaller. This explains that linear relationship on your residual plot.
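The mechanical relationship can be reproduced on simulated data. The Python sketch below is an illustration, not part of the email and not run on the project data. It assumes a hypothetical 1 to 6 outcome that is only weakly related to a single regressor. Because least-squares residuals are uncorrelated with the fitted values, the slope from regressing the residuals on the actual outcome equals 1 minus the R-squared, so with a low R-squared the residuals line up almost one-for-one with the outcome.

```python
import random

rng = random.Random(2)  # fixed seed for reproducibility
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
# Hypothetical ordinal outcome on a 1-6 scale, weakly related to x.
y = [min(6, max(1, round(3.5 + 0.3 * xi + rng.gauss(0, 1.5)))) for xi in x]


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


# Simple least-squares fit of y on x and its residuals.
b = cov(x, y) / cov(x, x)
a = mean(y) - b * mean(x)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

r2 = 1 - cov(resid, resid) / cov(y, y)
# Slope from regressing the residuals on the actual outcome.
slope_resid_on_y = cov(resid, y) / cov(y, y)
# Average residual at each value of the outcome.
mean_resid_by_level = {
    k: mean([e for e, yi in zip(resid, y) if yi == k]) for k in range(1, 7)
}

print(f"R-squared={r2:.3f}  slope of residual on outcome={slope_resid_on_y:.3f}")
for k, m in mean_resid_by_level.items():
    print(f"outcome={k}: mean residual={m:+.2f}")
```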

 

– When working with ologit results, you won’t be able to do a similar residual analysis. The ologit regression can give you predictions, but they will be the predicted probabilities for each possible outcome. So you end up with multiple predicted values for each observation and you can’t calculate a residual in the same way you can with the linear regression. I wouldn’t attempt to do any sort of residual analysis with the ologit results.
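To show what "multiple predicted values for each observation" means, here is a minimal Python sketch of the standard ordered logit formula, Pr(y = j) = F(k_j - xb) - F(k_(j-1) - xb), where F is the logistic function and the k_j are cutpoints. It is an illustration added alongside the email. The cutpoints and the linear index xb below are made-up numbers, not estimates from this project, and ologit_probs is an illustrative name.

```python
import math


def logistic(z):
    """Logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-z))


def ologit_probs(xb, cutpoints):
    """Predicted probability of each ordered category.

    xb is the linear index for one observation; cutpoints must be
    increasing. With J-1 cutpoints there are J categories.
    """
    # Cumulative probabilities Pr(y <= j), ending at 1 for the top category.
    cdf = [logistic(k - xb) for k in cutpoints] + [1.0]
    probs, prev = [], 0.0
    for c in cdf:
        probs.append(c - prev)  # Pr(y = j) = Pr(y <= j) - Pr(y <= j-1)
        prev = c
    return probs


cutpoints = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical: 5 cutpoints, 6 categories
probs = ologit_probs(0.4, cutpoints)     # hypothetical linear index xb = 0.4
for j, p in enumerate(probs, start=1):
    print(f"Pr(attend = {j}) = {p:.3f}")
print("sum of probabilities:", sum(probs))
```

Each observation gets six probabilities that sum to one rather than a single predicted value of attend, so there is no single number to subtract from the actual outcome. This also suggests that a single variable produced by predict after ologit is a probability for one category rather than a predicted attendance level (an assumption about Stata's default worth checking in its documentation), so subtracting it from attend would not give a residual in the linear-regression sense.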

 

 

My plan for next week: restrict by race and religion, and start looking at education and income.