Using the ACS: Predicting income extremes Demographics and income

To what extent is an American's income predictable? Here, I look at a subset of this question, using logistic regression to predict whether or not an individual lives below the poverty line. I focus on my home state of Pennsylvania as a smaller test case.

Bottom line: logistic regression gave me insights regarding how demographics and life histories influence whether a person lives below the poverty line. Predictions using this fit were decent, but not great.

Demographics, life history, and poverty

Again, we learn to stay in school

The graph on the left shows the odds ratios and 95% confidence intervals for the logistic regression. An odds ratio greater than one means a person in that category is more likely to live below the poverty line; an odds ratio less than one means a person in that category is less likely to. If the confidence interval bars extending out from a data point cross the line at 1.0, that effect is not statistically significant at the 5% level.

Each odds ratio compares the odds that a person in a given category lives below the poverty line with the odds for a person in the base case, all other factors being equal. I will specify the base cases for each category in the discussion below.
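To make the link between the fitted coefficients and the plotted odds ratios concrete, here is a minimal sketch. It assumes the usual convention that a logistic regression coefficient is a log-odds, so the odds ratio is its exponential, and that the 95% interval is a Wald interval (coefficient ± 1.96 standard errors). The function names and the coefficient values are hypothetical and are not taken from my fit.

```python
import math


def odds_ratio_with_ci(coef, std_err, z=1.96):
    """Turn a log-odds coefficient into an odds ratio with a Wald interval.

    z=1.96 gives an approximate 95% confidence interval.
    """
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - z * std_err)
    upper = math.exp(coef + z * std_err)
    return odds_ratio, lower, upper


def is_significant(lower, upper):
    """The effect is significant when the interval excludes 1.0."""
    return not (lower <= 1.0 <= upper)


# Hypothetical (coefficient, standard error) pairs, for illustration only.
examples = {
    "hypothetical_risk_factor": (0.80, 0.10),
    "hypothetical_protective_factor": (-0.50, 0.15),
    "hypothetical_noisy_factor": (0.10, 0.20),
}

for name, (coef, se) in examples.items():
    ratio, lo, hi = odds_ratio_with_ci(coef, se)
    print(f"{name}: OR={ratio:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), "
          f"significant={is_significant(lo, hi)}")
```

A positive coefficient gives an odds ratio above one, a negative coefficient gives one below one, and an interval that straddles 1.0 corresponds to bars crossing the reference line in the plot.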

What variables were important?

Education and marriage

The figure to the right shows the relative importance of different variables in the logistic regression results, colored by category as above. For clarity, I show only the top ten predictors.

Living in a household without one's spouse (regardless of whether one has a spouse) was the most important factor in predicting poverty status. This result was surprising, particularly since this category includes unmarried partners living together. Education variables account for half of the top ten most important predictors in the model. Length of time in the current home, age, and disability status were also important.

How well can we predict poverty?

Kind of ok.

We can assess the predictive capability of the logistic regression model in several ways.

  • The accuracy of its predictions on a separate, held-out test data set: 75%. This seems decent, except that if I always guessed a person was living above the poverty line, I would be right about 92% of the time. Doing worse than always guessing the majority class is not great.
  • Its confusion matrix: 25% of all predictions are incorrect, with 2% false positives and 23% false negatives. Unsurprisingly, I do a much better job of predicting the majority class of people living above the poverty line.
  • The area under the ROC curve, pictured to the left. This curve shows the relationship between the model's false positive rate (one minus its specificity) and its sensitivity, or true positive rate, as the classification threshold varies. Ideally, the curve would look like a step function, with an area of 1; a model that guesses at random has an area of 0.5. For my results, I have an area of 0.83, which is ok.
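The three checks above can be sketched in a few lines of plain Python. This is an illustration on a tiny made-up data set, not my actual pipeline. It assumes labels are coded 1 for "below the poverty line" and 0 otherwise, that predictions come from thresholding a predicted probability at 0.5, and it computes the ROC area as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. All names and numbers are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (true pos, false pos, true neg, false neg) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always guessing the most common class."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = below the poverty line) and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.45, 0.80]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("confusion (tp, fp, tn, fn):", confusion_counts(y_true, y_pred))
print("accuracy:", accuracy(y_true, y_pred))
print("majority baseline:", majority_baseline_accuracy(y_true))
print("ROC AUC:", roc_auc(y_true, scores))
```

Comparing accuracy against the majority baseline is what exposes the problem with an imbalanced outcome like poverty status: accuracy alone can look respectable while the model adds nothing over a constant guess, which is why the ROC area is the more informative number here.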

More details for nerds: