• Danielle Elizabeth Anne Quinn

Taking the "Next Steps in R"

On Day 3 of CarpentryCon 2018 in Dublin, Ireland, I had the pleasure of co-instructing a three-hour workshop called "Next Steps in R" alongside the wonderful Francois Michonneau. Our intended audience was those who had previous experience using R, ideally having dabbled with components of the tidyverse like {dplyr} or {ggplot2}, and who wanted to take their skills to the next level. Specifically, the goal of this workshop was not only for learners to acquire new data manipulation skills, but also to practice using these skills while considering the thought processes behind developing workflow pipelines. Essentially, if you are dealing with one or more data sets and want to get from point A to point B, what functions are available, and what do you need to think about when using them?

As taught in The Carpentries Instructor Training, “teach things that are quick to master and immediately useful first”. This could be considered the golden rule for motivating learners and establishing the value of the lesson. So, before diving into the world of {dplyr} (and while waiting for packages to install!), we began with a demonstration of sections and code folding in RStudio, which break a script into discrete, nameable regions for easy navigation via a drop-down menu and document outline. In my opinion, this functionality in RStudio is a game-changer!
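For anyone who has not seen sections before, here is a minimal sketch of what they look like in a script. In RStudio, a comment line with at least four trailing dashes (or equals signs, or pound signs) starts a named section that can be folded and that appears in the document outline. The object names and values below are made up for illustration and are not from the workshop materials.

```r
# Load data ----
# Hypothetical data; the numbers are invented for this sketch.
cell_phones <- data.frame(
  country = c("Canada", "Ireland"),
  subscribers = c(100, 200)
)

# Summarize ----
# This second section shows up as its own entry in the outline.
total_subscribers <- sum(cell_phones$subscribers)
```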

Using multiple data sets from gapminder, the remainder of the lesson guided learners through the process of exploring, tidying, merging, manipulating, summarizing, visualizing, and analyzing data (whew!). The data included annual records of population, GDP per capita, and life expectancy of each country, with supplemental data sets containing information about the number of cell phone subscribers and car-related deaths per year.

After reviewing some of the exploration functions seen in The Carpentries lessons, like select() and filter(), “helpers” for these functions were introduced, which let users explore data frames more dynamically. For example, when using select() to choose columns of interest, the helper function starts_with() can be added to choose columns with names that begin with specific characters! (If you’re interested in more helper functions, check out ends_with(), contains(), and matches() as a starting point.)
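A minimal sketch of a select() helper in action, assuming a small made-up data frame (the column names and values are hypothetical, not the workshop data):

```r
library(dplyr)

# Hypothetical wide data frame; names and values are invented.
phones <- data.frame(
  country = c("Canada", "Ireland"),
  continent = c("Americas", "Europe"),
  year_2000 = c(8.7, 2.5),
  year_2010 = c(25.8, 4.7),
  stringsAsFactors = FALSE
)

# Keep only the columns whose names begin with "year"
year_cols <- select(phones, starts_with("year"))

# Helpers can be mixed with ordinary column names
country_years <- select(phones, country, starts_with("year"))
```

The same pattern should work with ends_with() and contains(), swapping in the characters you want to match.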

Next, data structure was discussed in the context of “long” versus “wide” format, and learners practiced using gather() to conveniently reshape data frames into the desired long format; at this point, one learner gasped, put their hands on their head and exclaimed “this would have saved me so much time…!” With all three of their data sets in the appropriate format, learners practiced merging data frames using left_join(). This exercise gave everyone the opportunity to do some troubleshooting (hooray for error framing!), encouraging them to think about differing variable classes and other potential problems that need to be considered before attempting to join or merge data. A key objective in these lessons was for learners to justify when an existing object should be overwritten to reflect changes to the data frame, and to apply this knowledge. As a rule, we demonstrated the value of first adding functions like glimpse() or View() to the pipeline to visualize the workflow results before actually overwriting an object.
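The reshape, join, and "look before you overwrite" steps described above can be sketched roughly as follows. The data frames here are made up, and the class mismatch (a character year column after gather() against an integer year column in the other table) is just one assumed example of the kind of problem learners ran into.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per year (values invented)
phones_wide <- data.frame(
  country = c("Canada", "Ireland"),
  `2000` = c(8.7, 2.5),
  `2010` = c(25.8, 4.7),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Reshape to long format: one row per country and year.
# The new year column is character, because it comes from column names.
phones_long <- phones_wide %>%
  gather(key = "year", value = "subscribers", -country)

# Hypothetical second table, where year is stored as an integer
population <- data.frame(
  country = c("Canada", "Ireland"),
  year = c(2000L, 2000L),
  pop = c(30.7, 3.8),
  stringsAsFactors = FALSE
)

# Preview the joined result with glimpse() before saving anything
phones_long %>%
  mutate(year = as.integer(year)) %>%   # make the key classes match
  left_join(population, by = c("country", "year")) %>%
  glimpse()

# Once the preview looks right, store the result
combined <- phones_long %>%
  mutate(year = as.integer(year)) %>%
  left_join(population, by = c("country", "year"))
```

Rows with no match in the second table (here, the 2010 rows) are kept by left_join() and get NA in the new column. In newer versions of {tidyr}, pivot_longer() covers the same ground as gather().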

We wrapped up the first section of the workshop by working through the process of summarizing a data frame using not only the standard summarize() but also the extension summarize_at(), and discussing when each would be appropriate. The summarize_at() function is particularly useful for running multiple functions on multiple variables.
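As a rough illustration of that last point, here is a minimal sketch that applies two functions to two variables within each group. The data frame is made up (the column names simply echo gapminder-style naming), and the named-list form of the function argument assumes a reasonably recent version of {dplyr}.

```r
library(dplyr)

# Hypothetical data; values are invented for this sketch.
gap <- data.frame(
  continent = c("Americas", "Americas", "Europe", "Europe"),
  lifeExp = c(70, 80, 76, 82),
  gdpPercap = c(10000, 30000, 20000, 40000),
  stringsAsFactors = FALSE
)

# Apply mean() and max() to both variables, per continent.
# Output columns are named <variable>_<function>, e.g. lifeExp_mean.
continent_summary <- gap %>%
  group_by(continent) %>%
  summarize_at(vars(lifeExp, gdpPercap), list(mean = mean, max = max))
```

With plain summarize(), each of those four output columns would have to be written out by hand.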

During the second portion of the workshop, Francois guided learners through a case-study analysis, with the goal of visualizing cell phone subscriptions per capita over time in the three countries in each continent that had the highest rates of cell phone subscriptions. This included a great series of problem-solving exercises that required learners to apply what they had learned about {dplyr} functionality and thought processes in order to build workflow pipelines. Learners were encouraged to explore other {dplyr} functions, including top_n(), arrange(), and slice() to generate and manipulate the data required to complete the analysis, which led to discussion among learners, and valuable “tinkering” with the code and data.
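One possible way to pick the top three countries per continent is sketched below. This is not the code from the workshop; the data frame, the placeholder country names, and the subs_per_capita column are all made up, and the sketch uses a single snapshot rather than a full time series.

```r
library(dplyr)

# Hypothetical snapshot of subscriptions per capita (values invented)
subs <- data.frame(
  continent = c(rep("Europe", 4), rep("Asia", 4)),
  country = c("A", "B", "C", "D", "E", "F", "G", "H"),
  subs_per_capita = c(1.2, 0.9, 1.5, 1.1, 0.7, 1.3, 0.8, 1.0),
  stringsAsFactors = FALSE
)

# Option 1: top_n() keeps the three largest values within each group
top_three <- subs %>%
  group_by(continent) %>%
  top_n(3, subs_per_capita) %>%
  arrange(continent, desc(subs_per_capita)) %>%
  ungroup()

# Option 2: sort within each group, then slice() the first three rows
top_three_slice <- subs %>%
  group_by(continent) %>%
  arrange(desc(subs_per_capita), .by_group = TRUE) %>%
  slice(1:3) %>%
  ungroup()
```

The two options can differ when there are ties: top_n() keeps every tied row, while slice(1:3) always returns exactly three rows per group.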

In true Carpentries fashion, the workshop included a brilliant example of peer instruction and lifelong learning: I nearly jumped out of my chair when Francois demonstrated that {ggplot2} functions could be added directly to a workflow pipeline! (This trick has since been widely shared here in Newfoundland and beyond...)
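A minimal sketch of the trick, assuming a small made-up data frame (the placeholder countries and values are hypothetical): the pipe passes the filtered data frame to ggplot() as its first argument, so no intermediate object is needed. Note that the layers after ggplot() are added with +, not with the pipe.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data; values are invented for this sketch.
subs <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year = rep(c(2000, 2005, 2010), times = 2),
  subs_per_capita = c(0.1, 0.5, 0.9, 0.2, 0.6, 1.1)
)

# Filter, then hand the result straight to ggplot()
subs_plot <- subs %>%
  filter(year >= 2005) %>%
  ggplot(aes(x = year, y = subs_per_capita, colour = country)) +
  geom_line()
```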

The data sets and commented code can be found here. Thank you so much to those who attended, and thank you to our fantastic helpers, Jean Manguy and Anna Krystalli!

This piece was originally posted at https://daniellequinn.github.io/blog-posts/nextstepsinr/blogtext.html on June 1, 2018.