In a recent job interview for an economic analyst position, I was asked to run a statistical analysis and create a presentation on a dataset of my choosing from the Opportunity Insights collection. Unfortunately I wasn’t the right fit for the job, but I wanted to share my findings and the process of working with data in a different way from what I’m used to.
Going into the project I was pretty nervous, since this didn’t really fall within my geography background. It didn’t help that I was moving across the country during part of the time I had to prepare for the interview. Luckily I was able to plan my time accordingly and, maybe more importantly, get the Wi-Fi set up quickly so I could start working and create a presentation I could be proud of.
Before getting started I talked with some of my colleagues from Miami University to get ideas on what data to investigate, and I ended up settling on a dataset titled: Commuting Zone Life Expectancy Estimates by Gender and Income Quartile. I chose it because I wanted to explore whether there are any links between commuting zone diversity and average life expectancy. I was particularly interested in this dataset because it is broken down by gender and income quartile, allowing for a deeper dive into any correlations in the data.
The first step in examining this potential relationship was to create a robust definition of diversity using the demographics dataset, which had information on four demographic groups: Black, White, Asian, and Hispanic. To measure diversity across commuting zones in a standardized way, I used the Shannon diversity index, which converts the shares of the four groups into a single number. There are some flaws in this methodology: most obviously, four categories cannot represent the true diversity of the country. Still, it worked as a starting point, and more representative research could come later if needed. In R, I then joined the commuting zone demographics dataset, with its new Shannon diversity index field, to the commuting zone life expectancy dataset by gender and income quartile. This combined data could then be plotted as the graphs below. For the sake of conciseness in the interview I decided to focus on female life expectancy in particular, though I did compare it to male life expectancy briefly.
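To make the index and the join concrete, here is a minimal sketch of the two steps. My actual analysis was done in R; this version uses Python purely for illustration. The commuting zone IDs, field names, population shares, and life expectancy values are all made-up placeholders, not values from the Opportunity Insights data. The index itself is the standard Shannon formula, H = -Σ p·ln(p), taken over the share p of each group.

```python
import math

def shannon_index(shares):
    """Shannon diversity index H = -sum(p * ln(p)) over the group shares."""
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    h = 0.0
    for s in shares:
        p = s / total  # renormalize so the groups sum to 1
        if p > 0:      # a group with zero share contributes nothing
            h -= p * math.log(p)
    return h

# Hypothetical demographics table keyed by commuting zone ID (placeholder values).
demographics = {
    "cz_a": {"black": 0.25, "white": 0.25, "asian": 0.25, "hispanic": 0.25},
    "cz_b": {"black": 0.0, "white": 1.0, "asian": 0.0, "hispanic": 0.0},
}

# Hypothetical life expectancy rows, one per zone, gender, and income quartile.
life_expectancy = [
    {"cz": "cz_a", "gender": "F", "income_quartile": 1, "life_expectancy": 80.0},
    {"cz": "cz_b", "gender": "F", "income_quartile": 1, "life_expectancy": 82.0},
    {"cz": "cz_c", "gender": "F", "income_quartile": 1, "life_expectancy": 81.0},
]

# Step 1: collapse the four group shares into one diversity score per zone.
diversity = {
    cz: shannon_index([g["black"], g["white"], g["asian"], g["hispanic"]])
    for cz, g in demographics.items()
}

# Step 2: inner join on the commuting zone ID; rows without demographics are dropped.
joined = [
    {**row, "shannon": diversity[row["cz"]]}
    for row in life_expectancy
    if row["cz"] in diversity
]
```

With four groups the index runs from 0, when a single group makes up the whole population, to ln(4), when all four groups have equal shares. That gives every commuting zone a score on the same scale, which is what makes the zones comparable in the plots.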
One thing to keep in mind when interpreting these results is that the commuting zone dataset covers all commuting zones with populations of 25,000 or more. This could create some misleading trends in the data when comparing smallish and likely more homogeneous cities to very large and likely more diverse cities. Overall, though, there were some interesting findings buried in the data, and it was fun to try to piece them together. I must say I was intimidated going into this project, but I came out of it with new confidence, realizing that my data science skills don’t just apply to geography but can apply on a broader scale, tackling different issues and questions than I’m used to.