Project 1 Summary
This week I turned in Project 1 for my R class. This project was all about learning to create functions in order to streamline processes. We were given data sets with census data and were to use that data to analyze public school enrollment data in several different ways. The assignment was broken roughly into four parts: data processing, creating functions, writing generic functions, and then actual execution of the functions using different inputs.
I found the first part of this assignment, data processing, to be rather intuitive. Here I just had to import one data file and then run a series of transformations on the data to get it into an appropriate format for analysis. There were two things that I struggled with a bit on this part. The first was grabbing the two-digit year from a measurementID and then turning it into a four-digit year. I wasn’t sure how to do this with a built-in function, so I used an if/else statement to add 1900 to any years greater than 69 and 2000 to any other years. This worked just fine, but I wonder if there may be a better way to do this, particularly since other data sets could have data older than 1970. The second thing I struggled with just a bit was matching the state names to a census division. Here I just created a series of lists and used them to look up each state and get its corresponding division number. I suspect there isn’t a better way to do this, but I’m curious to see the solutions.
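A minimal sketch of the two steps described above, written in base R. The function name, the example IDs, and the position of the year inside the ID (characters 8 and 9) are assumptions for illustration, not the actual assignment data. The lookup uses a named vector rather than a series of lists, and the last lines show one possible alternative: base R ships `state.name` and `state.division`, which give the division name (not the number) for the 50 states.

```r
# Hypothetical helper: convert a two-digit year to four digits.
# The pivot (69) mirrors the rule in the text; making it an argument
# lets data older than 1970 use a different cutoff.
to_four_digit_year <- function(yy, pivot = 69) {
  yy <- as.integer(yy)
  ifelse(yy > pivot, 1900L + yy, 2000L + yy)
}

# Assumed ID layout: the two-digit year sits in characters 8-9.
ids <- c("ABC010187D", "ABC010206D")
years <- to_four_digit_year(substr(ids, 8, 9))
years  # 1987 2006

# Partial lookup table (only three states shown) as a named vector.
division_lookup <- c("CONNECTICUT" = 1, "NEW YORK" = 2, "OHIO" = 3)
states <- c("OHIO", "NEW YORK", "PUERTO RICO")
divisions <- unname(division_lookup[states])
divisions  # 3 2 NA (names missing from the table come back as NA)

# Alternative using the built-in state data sets (division names only).
ohio_division <- as.character(state.division[match("Ohio", state.name)])
ohio_division  # "East North Central"
```

Indexing by name keeps the whole lookup in one object, and unmatched names return `NA` instead of raising an error, which makes non-state rows easy to spot.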
The second part of the assignment, creating functions, took me the most time, although it was made easier by the work that I’d put into the first part. First, I created a function to import the data, select the appropriate columns, and change the data into long format. Next, I created a function to extract the year from the measurementID as I’d done in the first part. I then wrote another function to pull out the two-character state ID from one of the columns to make it easier to sort by state, and yet another function to match each state in the state-level data to a division. Once all of that was done, I created a function to split the data into state and county data. Here I added a new column to categorize each row as state or county and then filtered each category into its own data set. Finally, I created a wrapper function that puts all of these functions together so they can be used on subsequent data sets. My biggest trouble here was keeping track of all of the data sets and function names, but I found that writing it all down and sketching out the steps was particularly helpful.
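The helper-plus-wrapper structure described above can be sketched roughly as follows. This is a simplified illustration in base R, not the submitted code: every function name, column name, and the rule for telling county rows from state rows (a trailing ", ST" in the area name) is an assumption, and the import and long-format steps are left out to keep it short.

```r
# Hypothetical helper: add a four-digit year column.
# Assumes the two-digit year sits in characters 8-9 of measurement_id.
add_year <- function(df) {
  yy <- as.integer(substr(df$measurement_id, 8, 9))
  df$year <- ifelse(yy > 69, 1900L + yy, 2000L + yy)
  df
}

# Hypothetical helper: split rows into state and county data sets and
# tag each with a class so later methods can dispatch on it.
# Assumed convention: county rows end in ", ST" (e.g. "Franklin County, OH").
split_state_county <- function(df) {
  is_county <- grepl(", [A-Z]{2}$", df$area_name)
  county <- df[is_county, , drop = FALSE]
  state <- df[!is_county, , drop = FALSE]
  class(county) <- c("county", class(county))
  class(state) <- c("state", class(state))
  list(state = state, county = county)
}

# Wrapper: run the helpers in order so a new data set needs one call.
process_enrollment <- function(df) {
  split_state_county(add_year(df))
}

# Made-up example rows.
raw <- data.frame(
  area_name = c("OHIO", "Franklin County, OH"),
  measurement_id = c("ABC010187D", "ABC010206D"),
  enrollment = c(1000, 200),
  stringsAsFactors = FALSE
)
result <- process_enrollment(raw)
class(result$county)  # "county" "data.frame"
```

Because each helper takes a data frame and returns a data frame, the wrapper stays a single nested call, and any step can be tested on its own.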
The third part of the assignment was to create generic functions for plotting the data based on whether the class of the data was state or county. The function for the state data was fairly simple. Here I just took the data, grouped it by division and year, and then summarized it by the average enrollment. I also filtered out the rows that didn’t correspond to a specific state. The function for the county-level data was much more involved. I had to filter the data by three optional user inputs: the state, the number of counties to return, and whether the user wanted to see data from the top counties or the bottom counties (based on enrollment numbers). Again, I found sketching this out on paper very useful for getting my logic straight before beginning to write the code. I first filtered the data by state and put the result into its own data set. Once I had that, I used an if/else if statement to sort the data differently depending on whether the user wanted the top or bottom counties from that state. I then grabbed the user-specified number of top or bottom counties and used those to pull all of the data for those counties. Once I had that in its own data set, I plotted it.
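The county logic above (filter by state, rank, keep the top or bottom n, then pull every row for those counties) could look something like the sketch below. It is an illustration under assumptions: the function and column names are invented, counties are ranked by their mean enrollment (the text only says "based on enrollment numbers"), and the plot itself is a bare-bones base R scatter plot.

```r
# Hypothetical helper: keep all rows for the top or bottom n counties
# of one state, ranked by mean enrollment (an assumed ranking rule).
select_counties <- function(df, state = "OH", n = 5, side = c("top", "bottom")) {
  side <- match.arg(side)
  in_state <- df[df$state == state, , drop = FALSE]
  means <- sapply(split(in_state$enrollment, in_state$county), mean)
  ranked <- sort(means, decreasing = (side == "top"))
  keep <- names(head(ranked, n))
  in_state[in_state$county %in% keep, , drop = FALSE]
}

# S3 method: plot() dispatches here for objects whose class includes "county".
plot.county <- function(x, state = "OH", n = 5, side = "top", ...) {
  chosen <- select_counties(x, state, n, side)
  plot(chosen$year, chosen$enrollment,
       xlab = "Year", ylab = "Enrollment",
       main = paste(side, n, "counties in", state), ...)
  invisible(chosen)
}

# Made-up example data.
enroll <- data.frame(
  state = c("OH", "OH", "OH", "OH", "OH", "OH", "NY"),
  county = c("A", "A", "B", "B", "C", "C", "D"),
  year = c(1987, 1988, 1987, 1988, 1987, 1988, 1987),
  enrollment = c(100, 110, 50, 60, 300, 310, 999),
  stringsAsFactors = FALSE
)
class(enroll) <- c("county", class(enroll))

top_two <- select_counties(enroll, state = "OH", n = 2, side = "top")
bottom_one <- select_counties(enroll, state = "OH", n = 1, side = "bottom")
```

With the class set as above, calling `plot(enroll, state = "OH", n = 2)` would dispatch to `plot.county`. Keeping the selection in its own function means the filtering can be checked without drawing anything.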
The final part of the assignment was simply to combine two data sets and run the plotting functions with a variety of different inputs. After that, I started with four new data sets, which I combined and ran all of the functions on before plotting the results with a new series of inputs.
Overall, I found this assignment to be moderately challenging but fun. I enjoyed learning more about functions and how to use them in real-life scenarios. Going forward, I think I will continue to work out the logic with pencil and paper before diving into R. While it may seem to just add extra steps to the process, I do believe that having the logic sketched out saved me a ton of time on this assignment, particularly as someone who is still new to R. One other useful feature I found that I will definitely use going forward is that RStudio has spell check (Edit, Check Spelling or just F7)!
See my R Markdown File here.