Cleaning the dataset

Statistical testing

library(corrplot)

## corrplot 0.84 loaded

#test to see if there is a relationship between 
#sex and income
test<-chisq.test(table(df3$sex, df3$income))
test

## 
##  Pearson's Chi-squared test
## 
## data:  table(df3$sex, df3$income)
## X-squared = 42.441, df = 9, p-value = 2.729e-06

table(df3$sex, df3$income)

##         
##          <10 10-20 20-30 30-40 40-50 50-75 75-100 100-150 >150  NA
##   Male    99   138   132   135   100   211    149     126   98 138
##   Female 145   176   185   171   149   203    149     134   73 256

The chi-squared test of independence between gender and income gives a p-value of 2.729e-06 (0.000002729), so we reject the null hypothesis of independence and conclude that income bracket is associated with gender. This is consistent with a gender wage gap among the workers in our study.

corrplot(test$residuals, is.corr = F)

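As the p-value suggests, the residual plot shows that the income distribution differs by gender. In the table above, men are over-represented in every bracket from 50-75 upward, while women are over-represented in every bracket below that and among the NA responses.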
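
The residuals passed to corrplot are Pearson residuals, (observed - expected) / sqrt(expected). The sketch below recomputes them by hand from the counts printed above, so it runs without df3; the object names `counts`, `expected`, and `pearson` are ours and are not part of the original analysis.

```r
# Counts copied from the table(df3$sex, df3$income) output above.
counts <- matrix(
  c( 99, 138, 132, 135, 100, 211, 149, 126, 98, 138,
    145, 176, 185, 171, 149, 203, 149, 134, 73, 256),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("Male", "Female"),
    income = c("<10", "10-20", "20-30", "30-40", "40-50",
               "50-75", "75-100", "100-150", ">150", "NA")
  )
)

# Same test as in the report, run on the count matrix directly.
fit <- chisq.test(counts)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson residuals; their squares sum to the X-squared statistic.
pearson <- (counts - expected) / sqrt(expected)

print(round(pearson, 2))
print(sum(pearson^2))
```

Positive residuals mark cells with more people than independence would predict, negative residuals mark cells with fewer.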

#test to see if there is a relationship between 
#race and income
test2<-chisq.test(df3$race1_1, df3$income)

## Warning in chisq.test(df3$race1_1, df3$income): Chi-squared approximation
## may be incorrect

test2

## 
##  Pearson's Chi-squared test
## 
## data:  df3$race1_1 and df3$income
## X-squared = 195.77, df = 36, p-value < 2.2e-16
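
The warning above usually means that some expected cell counts are small (a common rule of thumb is below 5). The race-by-income counts are not printed in this report, so the sketch below uses a made-up table, `toy`, purely for illustration. It shows one way to inspect the expected counts and to fall back on a Monte Carlo p-value through the `simulate.p.value` argument of chisq.test.

```r
# Hypothetical counts, NOT taken from df3: three groups by three brackets.
toy <- matrix(
  c(12, 3, 2,
     9, 4, 1,
    15, 2, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(group = c("A", "B", "C"),
                  bracket = c("low", "mid", "high"))
)

# Expected counts under independence; small values trigger the warning.
expected_counts <- suppressWarnings(chisq.test(toy))$expected
print(round(expected_counts, 2))
print(any(expected_counts < 5))

# Monte Carlo p-value, which does not rely on the chi-squared approximation.
set.seed(1)
sim <- chisq.test(toy, simulate.p.value = TRUE, B = 5000)
print(sim$p.value)
```

With df3 loaded, the same check would be `test2$expected`, and the simulated version would be `chisq.test(df3$race1_1, df3$income, simulate.p.value = TRUE)`.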

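In the chi-squared test of independence between race and income level, the p-value (< 2.2e-16) is even smaller than before, so we again reject independence. The evidence of dependence between the race of a worker and their salary level is therefore stronger than for gender and salary, although a smaller p-value does not by itself mean a larger effect, and the warning above means the chi-squared approximation should be treated with some caution.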
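
To compare how strongly two pairs of variables are associated, an effect size is more informative than a p-value. One common choice, which is not part of the original analysis, is Cramér's V: sqrt(X-squared / (n * (min(rows, columns) - 1))). The sketch below defines a hypothetical helper `cramers_v` and applies it to the sex-by-income counts printed earlier; the race table is not shown in this report, so that call is left as a comment.

```r
# Hypothetical helper: Cramér's V for a two-way table of counts.
cramers_v <- function(tab) {
  fit <- suppressWarnings(chisq.test(tab, correct = FALSE))
  n <- sum(tab)                          # total number of observations
  k <- min(nrow(tab), ncol(tab)) - 1     # smaller dimension minus one
  sqrt(unname(fit$statistic) / (n * k))
}

# Counts copied from the table(df3$sex, df3$income) output above.
sex_income <- matrix(
  c( 99, 138, 132, 135, 100, 211, 149, 126, 98, 138,
    145, 176, 185, 171, 149, 203, 149, 134, 73, 256),
  nrow = 2, byrow = TRUE
)

v_sex <- cramers_v(sex_income)
print(round(v_sex, 3))

# With df3 loaded, the race comparison would be:
# v_race <- cramers_v(table(df3$race1_1, df3$income))
```

V ranges from 0 (no association) to 1, so the two values can be compared directly even though the tables have different dimensions.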

corrplot(test2$residuals, is.corr = F)

The residual plot of race against income suggests that White and Asian workers are concentrated in the highest income brackets. In contrast, Black workers are by far the most concentrated in the lowest brackets.

Eddie’s section

model <- glm(bi_var ~ income, data = df3, family = "binomial")
pander::pander(summary(model))

                 Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)      0.2305     0.1289       1.789     0.07369
income10-20      0.635      0.1786       3.556     0.0003764
income20-30      0.5736     0.1772       3.238     0.001205
income30-40      1.08       0.1901       5.68      1.347e-08
income40-50      0.8735     0.1952       4.475     7.624e-06
income50-75      1.362      0.1839       7.406     1.304e-13
income75-100     1.347      0.2007       6.713     1.911e-11
income100-150    1.361      0.2097       6.489     8.668e-11
income>150       1.163      0.2309       5.037     4.733e-07
incomeNA         0.6804     0.1704       3.994     6.489e-05

(Dispersion parameter for binomial family taken to be 1 )

Null deviance:	3329 on 2966 degrees of freedom
Residual deviance:	3233 on 2957 degrees of freedom

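This table reports the estimate, standard error, z value, and p-value for each income level. Because the <10 bracket does not appear among the coefficients, it is the reference level, and each estimate is a difference in the log-odds of bi_var relative to that bracket. The lowest p-value (1.304e-13) belongs to the most common bracket, between $50 and $75 thousand per year.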
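
The estimates in the table are on the log-odds scale, which is hard to read directly. As a rough sketch, the code below copies the reported estimates into a vector (named `est` here, our own name) and converts them to odds ratios against the <10 reference bracket and to predicted probabilities of bi_var = 1.

```r
# Estimates copied from the summary table (log-odds scale).
est <- c(
  "(Intercept)"   = 0.2305,
  "income10-20"   = 0.635,
  "income20-30"   = 0.5736,
  "income30-40"   = 1.08,
  "income40-50"   = 0.8735,
  "income50-75"   = 1.362,
  "income75-100"  = 1.347,
  "income100-150" = 1.361,
  "income>150"    = 1.163,
  "incomeNA"      = 0.6804
)

# Odds ratio of each bracket relative to the <10 reference bracket.
odds_ratios <- exp(est[-1])
print(round(odds_ratios, 2))

# Predicted probability of bi_var = 1 in the reference bracket ...
baseline_prob <- plogis(est[["(Intercept)"]])

# ... and in the 50-75 bracket (intercept plus that bracket's estimate).
prob_50_75 <- plogis(est[["(Intercept)"]] + est[["income50-75"]])

print(round(c(baseline = baseline_prob, `50-75` = prob_50_75), 3))
```

With the fitted object available, `exp(coef(model))` gives the same odds ratios without retyping the estimates.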
