Data Wrangling in R with dplyr and tidyr

CCHMC R Users Group

Cole Brokamp

2023-10-11

👋   Welcome

 

Join the RUG Outlook group for updates and events:  https://t.ly/vRw7y

 

Informal poll from last meeting

Tidy Data Wrangling in R with {dplyr} and {tidyr}

tidyr (version 1.3.0): Create tidy data

https://tidyr.tidyverse.org

{tidyr} cheatsheet

 

dplyr (version 1.1.3): Tidy data manipulation

https://dplyr.tidyverse.org

{dplyr} cheatsheet

Tidy Data

{dplyr}, {tidyr}, and generally all tidyverse functions expect tidy data:

Tidy data works well with pipes: x |> f(y) becomes f(x,y)

    R for Data Science (2e): https://r4ds.hadley.nz/

Tidying Data with {tidyr}

pivoting

    pivot_longer(), pivot_wider()

rectangling

    unnest_longer(), unnest_wider(), hoist()

split and combine character columns

    separate(), extract(), unite()

missing values helpers

    complete(), drop_na(), fill(), replace_na()

Manipulating Data with {dplyr}

select() picks variables based on their names

filter() picks cases based on their values

mutate() adds new variables that are functions of existing variables

summarise() reduces multiple values down to a single summary

Tidy Evaluation

data masking: use data variables as if they were variables in the environment

    my_variable instead of my_df$my_variable

tidy selection: choose variables based on name, position, type

    starts_with("x") or is.numeric

Example Data from CoDEC

library(dplyr, warn.conflicts = FALSE)

# install.packages('codec', repos = c('https://geomarker-io.r-universe.dev', 'https://cloud.r-project.org'))
d <- codec::codec_data("hh_acs_measures")

codec::glimpse_attr(d)
# A tibble: 7 × 2
  name        value                                                             
  <chr>       <chr>                                                             
1 profile     tabular-data-resource                                             
2 name        hh_acs_measures                                                   
3 path        hh_acs_measures.csv                                               
4 version     1.1.0                                                             
5 title       Harmonized Historical American Community Survey Measures          
6 homepage    https://geomarker.io/hh_acs_measures                              
7 description 2010 - 2021 measures derived from ACS variables for census tracts…

select() & filter()

d <-
  d |>
  select(-median_income, -median_home_value) |>
  filter(year >= 2016) |>
  select(-ends_with("_moe"))
d
# A tibble: 1,340 × 30
   census_tract_id_2010  year fraction_poverty n_children_lt18 n_pop
   <chr>                <int>            <dbl>           <int> <int>
 1 39061010300           2016            0.442             388  1465
 2 39061007800           2016            0.23              389  2542
 3 39061006600           2016            0.394             427  2262
 4 39061010202           2016            0.131             638  2565
 5 39061010201           2016            0.23             1348  5336
 6 39061021506           2016            0.118             873  3341
 7 39061008400           2016            0.384             673  2078
 8 39061021001           2016            0.086             711  3195
 9 39061001700           2016            0.545             348  1074
10 39061026400           2016            0.42              805  2434
# ℹ 1,330 more rows
# ℹ 25 more variables: n_household_lt18 <int>, n_household <int>,
#   fraction_insured <dbl>, fraction_snap <dbl>, fraction_fam_nospouse <dbl>,
#   fraction_employment <dbl>, n_housing_units <int>,
#   median_home_value_2010adj <dbl>, fraction_housing_renters <dbl>,
#   median_rent_to_income_percentage <dbl>, fraction_high_rent <dbl>,
#   fraction_conditions <dbl>, fraction_builtbf1970 <dbl>, …

mutate() & rename()

d <-
  d |>
  mutate(marginalized_population_percentage =
           fraction_hl_w + fraction_hl_b + fraction_hl_o +
           fraction_nhl_b + fraction_nhl_o) |>
  rename(mpp = marginalized_population_percentage)
d
# A tibble: 1,340 × 31
   census_tract_id_2010  year fraction_poverty n_children_lt18 n_pop
   <chr>                <int>            <dbl>           <int> <int>
 1 39061010300           2016            0.442             388  1465
 2 39061007800           2016            0.23              389  2542
 3 39061006600           2016            0.394             427  2262
 4 39061010202           2016            0.131             638  2565
 5 39061010201           2016            0.23             1348  5336
 6 39061021506           2016            0.118             873  3341
 7 39061008400           2016            0.384             673  2078
 8 39061021001           2016            0.086             711  3195
 9 39061001700           2016            0.545             348  1074
10 39061026400           2016            0.42              805  2434
# ℹ 1,330 more rows
# ℹ 26 more variables: n_household_lt18 <int>, n_household <int>,
#   fraction_insured <dbl>, fraction_snap <dbl>, fraction_fam_nospouse <dbl>,
#   fraction_employment <dbl>, n_housing_units <int>,
#   median_home_value_2010adj <dbl>, fraction_housing_renters <dbl>,
#   median_rent_to_income_percentage <dbl>, fraction_high_rent <dbl>,
#   fraction_conditions <dbl>, fraction_builtbf1970 <dbl>, …

transmute()

d <-
  d |>
  filter(year < 2020) |>
  transmute(census_tract_id = census_tract_id_2010,
            year,
            mpp,
            fraction_housing_renters,
            fraction_high_rent,
            fraction_fam_nospouse,
            fraction_employment,
            n_children_lt18,
            median_income_2010adj)
d
# A tibble: 888 × 9
   census_tract_id  year   mpp fraction_housing_renters fraction_high_rent
   <chr>           <int> <dbl>                    <dbl>              <dbl>
 1 39061010300      2016 0.34                     0.631              0.395
 2 39061007800      2016 0.365                    0.481              0.458
 3 39061006600      2016 0.783                    0.85               0.416
 4 39061010202      2016 0.421                    0.387              0.399
 5 39061010201      2016 0.275                    0.545              0.535
 6 39061021506      2016 0.813                    0.308              0.596
 7 39061008400      2016 0.591                    0.589              0.518
 8 39061021001      2016 0.08                     0.461              0.328
 9 39061001700      2016 0.771                    0.928              0.432
10 39061026400      2016 0.876                    0.859              0.536
# ℹ 878 more rows
# ℹ 4 more variables: fraction_fam_nospouse <dbl>, fraction_employment <dbl>,
#   n_children_lt18 <int>, median_income_2010adj <dbl>

arrange(): change ordering of observations

d |>
  arrange(desc(median_income_2010adj)) |>
  head()
# A tibble: 6 × 9
  census_tract_id  year   mpp fraction_housing_renters fraction_high_rent
  <chr>           <int> <dbl>                    <dbl>              <dbl>
1 39061024400      2016 0.16                     0.18               0.323
2 39061024400      2017 0.175                    0.162              0.304
3 39061025104      2017 0.076                    0.02               0    
4 39061024400      2018 0.154                    0.124              0.439
5 39061024400      2019 0.147                    0.136              0.55 
6 39061025104      2018 0.067                    0.045              0.433
# ℹ 4 more variables: fraction_fam_nospouse <dbl>, fraction_employment <dbl>,
#   n_children_lt18 <int>, median_income_2010adj <dbl>

summarise(): reduce multiple values to a single summary

d |>
  summarise(total_children = sum(n_children_lt18),
            median_children = median(n_children_lt18),
            max_children = max(n_children_lt18),
            median_mpp = median(mpp))
# A tibble: 1 × 4
  total_children median_children max_children median_mpp
           <int>           <dbl>        <int>      <dbl>
1         751234             781         3018      0.310

group_by()

d |>
  group_by(year)
# A tibble: 888 × 9
# Groups:   year [4]
   census_tract_id  year   mpp fraction_housing_renters fraction_high_rent
   <chr>           <int> <dbl>                    <dbl>              <dbl>
 1 39061010300      2016 0.34                     0.631              0.395
 2 39061007800      2016 0.365                    0.481              0.458
 3 39061006600      2016 0.783                    0.85               0.416
 4 39061010202      2016 0.421                    0.387              0.399
 5 39061010201      2016 0.275                    0.545              0.535
 6 39061021506      2016 0.813                    0.308              0.596
 7 39061008400      2016 0.591                    0.589              0.518
 8 39061021001      2016 0.08                     0.461              0.328
 9 39061001700      2016 0.771                    0.928              0.432
10 39061026400      2016 0.876                    0.859              0.536
# ℹ 878 more rows
# ℹ 4 more variables: fraction_fam_nospouse <dbl>, fraction_employment <dbl>,
#   n_children_lt18 <int>, median_income_2010adj <dbl>

All functions designed to be paired with group_by()

d |>
  group_by(year) |>
  summarise(total_children = sum(n_children_lt18),
            median_children = median(n_children_lt18),
            max_children = max(n_children_lt18),
            median_mpp = median(mpp))
# A tibble: 4 × 5
   year total_children median_children max_children median_mpp
  <int>          <int>           <dbl>        <int>      <dbl>
1  2016         187730             793         2571      0.295
2  2017         187626             770         2453      0.311
3  2018         187979             779         2567      0.307
4  2019         187899             782         3018      0.316

group_by() multiple, “on the fly” variables

d |>
  group_by(year, high_mpp = mpp > median(mpp)) |>
  summarise(fraction_housing_renters = median(fraction_housing_renters),
            fraction_high_rent = median(fraction_high_rent),
            fraction_fam_nospouse = median(fraction_fam_nospouse),
            fraction_employment = median(fraction_employment))
# A tibble: 8 × 6
# Groups:   year [4]
   year high_mpp fraction_housing_ren…¹ fraction_high_rent fraction_fam_nospouse
  <int> <lgl>                     <dbl>              <dbl>                 <dbl>
1  2016 FALSE                     0.311              0.39                  0.22 
2  2016 TRUE                      0.602              0.51                  0.521
3  2017 FALSE                     0.308              0.384                 0.22 
4  2017 TRUE                      0.596              0.488                 0.511
5  2018 FALSE                     0.318              0.386                 0.219
6  2018 TRUE                      0.592              0.492                 0.498
7  2019 FALSE                     0.303              0.389                 0.218
8  2019 TRUE                      0.599              0.471                 0.491
# ℹ abbreviated name: ¹​fraction_housing_renters
# ℹ 1 more variable: fraction_employment <dbl>

across()

d |>
  group_by(year, high_mpp = mpp > median(mpp)) |>
  summarise(across(c(starts_with("fraction")), median))
# A tibble: 8 × 6
# Groups:   year [4]
   year high_mpp fraction_housing_ren…¹ fraction_high_rent fraction_fam_nospouse
  <int> <lgl>                     <dbl>              <dbl>                 <dbl>
1  2016 FALSE                     0.311              0.39                  0.22 
2  2016 TRUE                      0.602              0.51                  0.521
3  2017 FALSE                     0.308              0.384                 0.22 
4  2017 TRUE                      0.596              0.488                 0.511
5  2018 FALSE                     0.318              0.386                 0.219
6  2018 TRUE                      0.592              0.492                 0.498
7  2019 FALSE                     0.303              0.389                 0.218
8  2019 TRUE                      0.599              0.471                 0.491
# ℹ abbreviated name: ¹​fraction_housing_renters
# ℹ 1 more variable: fraction_employment <dbl>

Result of summary() is a tibble

d |>
  group_by(year, high_mpp = mpp > median(mpp)) |>
  summarise(across(c(starts_with("fraction"), mpp), median)) |>
  mutate(housing_burden = fraction_housing_renters^2 + fraction_high_rent) |>
  select(-fraction_housing_renters, -fraction_high_rent)
# A tibble: 8 × 6
# Groups:   year [4]
   year high_mpp fraction_fam_nospouse fraction_employment   mpp housing_burden
  <int> <lgl>                    <dbl>               <dbl> <dbl>          <dbl>
1  2016 FALSE                    0.22                0.952 0.126          0.487
2  2016 TRUE                     0.521               0.889 0.591          0.872
3  2017 FALSE                    0.22                0.96  0.125          0.480
4  2017 TRUE                     0.511               0.9   0.573          0.844
5  2018 FALSE                    0.219               0.965 0.130          0.486
6  2018 TRUE                     0.498               0.915 0.584          0.843
7  2019 FALSE                    0.218               0.969 0.135          0.481
8  2019 TRUE                     0.491               0.929 0.586          0.830

Wrangle summary() output into a table

d |>
  group_by(year, high_mpp = mpp > median(mpp)) |>
  summarise(across(c(starts_with("fraction"), mpp), median)) |>
  mutate(housing_burden = fraction_housing_renters^2 + fraction_high_rent) |>
  select(year, high_mpp, housing_burden) |>
  tidyr::pivot_wider(names_from = year, values_from = housing_burden) |>
  knitr::kable(digits = 2)
high_mpp 2016 2017 2018 2019
FALSE 0.49 0.48 0.49 0.48
TRUE 0.87 0.84 0.84 0.83

Distinct observations

distinct(d, census_tract_id)
# A tibble: 222 × 1
   census_tract_id
   <chr>          
 1 39061010300    
 2 39061007800    
 3 39061006600    
 4 39061010202    
 5 39061010201    
 6 39061021506    
 7 39061008400    
 8 39061021001    
 9 39061001700    
10 39061026400    
# ℹ 212 more rows

Slicing observations

slice_head(d, n = 2)
# A tibble: 2 × 9
  census_tract_id  year   mpp fraction_housing_renters fraction_high_rent
  <chr>           <int> <dbl>                    <dbl>              <dbl>
1 39061010300      2016 0.34                     0.631              0.395
2 39061007800      2016 0.365                    0.481              0.458
# ℹ 4 more variables: fraction_fam_nospouse <dbl>, fraction_employment <dbl>,
#   n_children_lt18 <int>, median_income_2010adj <dbl>
slice_sample(d, prop = 0.1)
# A tibble: 88 × 9
   census_tract_id  year   mpp fraction_housing_renters fraction_high_rent
   <chr>           <int> <dbl>                    <dbl>              <dbl>
 1 39061025600      2017 0.12                     0.619              0.491
 2 39061010202      2018 0.487                    0.398              0.276
 3 39061021421      2016 0.032                    0.077              0.202
 4 39061025102      2019 0.148                    0.23               0.347
 5 39061005000      2019 0.14                     0.514              0.365
 6 39061008501      2018 0.763                    0.813              0.572
 7 39061007300      2018 0.587                    0.502              0.421
 8 39061023201      2019 0.109                    0.551              0.484
 9 39061004602      2018 0.275                    0.414              0.531
10 39061020741      2018 0.423                    0.258              0.469
# ℹ 78 more rows
# ℹ 4 more variables: fraction_fam_nospouse <dbl>, fraction_employment <dbl>,
#   n_children_lt18 <int>, median_income_2010adj <dbl>

Manipulate Observations

d |>
  group_by(census_tract_id) |>
  slice_min(fraction_employment, with_ties = FALSE)
# A tibble: 222 × 9
# Groups:   census_tract_id [222]
   census_tract_id  year   mpp fraction_housing_renters fraction_high_rent
   <chr>           <int> <dbl>                    <dbl>              <dbl>
 1 39061000200      2016 0.969                    1                  0.585
 2 39061000700      2016 0.219                    0.943              0.283
 3 39061000900      2019 0.454                    0.559              0.481
 4 39061001000      2017 0.464                    0.739              0.309
 5 39061001100      2016 0.723                    0.771              0.361
 6 39061001600      2016 0.755                    0.925              0.599
 7 39061001700      2016 0.771                    0.928              0.432
 8 39061001800      2016 0.382                    0.536              0.285
 9 39061001900      2016 0.482                    0.696              0.276
10 39061002000      2016 0.367                    0.677              0.268
# ℹ 212 more rows
# ℹ 4 more variables: fraction_fam_nospouse <dbl>, fraction_employment <dbl>,
#   n_children_lt18 <int>, median_income_2010adj <dbl>

Combine Tables

d_411 <-
  codec::read_tdr_csv("https://github.com/geomarker-io/hamilton_property_code_enforcement/releases/download/0.1.3")

d <- left_join(d, d_411, by = join_by(census_tract_id == census_tract_id_2010))

tibble::glimpse(d)
Rows: 888
Columns: 10
$ census_tract_id          <chr> "39061010300", "39061007800", "39061006600", …
$ year                     <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 201…
$ mpp                      <dbl> 0.340, 0.365, 0.783, 0.421, 0.275, 0.813, 0.5…
$ fraction_housing_renters <dbl> 0.631, 0.481, 0.850, 0.387, 0.545, 0.308, 0.5…
$ fraction_high_rent       <dbl> 0.395, 0.458, 0.416, 0.399, 0.535, 0.596, 0.5…
$ fraction_fam_nospouse    <dbl> 0.526, 0.255, 0.791, 0.373, 0.579, 0.494, 0.6…
$ fraction_employment      <dbl> 0.776, 0.893, 0.911, 0.903, 0.865, 0.909, 0.9…
$ n_children_lt18          <int> 388, 389, 427, 638, 1348, 873, 673, 711, 348,…
$ median_income_2010adj    <dbl> 21200, 37900, 20600, 45000, 28100, 49600, 292…
$ violations_per_household <dbl> 77.8241758, 173.7297297, 12.5778364, 22.91818…

List-column workflow with tidyr

Create a data list-column using a combination of dplyr::group_by(), tidyr::nest(), dplyr::nest_by(), or dplyr::rowwise()

d <- d |> dplyr::nest_by(year)
d
# A tibble: 4 × 2
# Rowwise:  year
   year               data
  <int> <list<tibble[,9]>>
1  2016          [222 × 9]
2  2017          [222 × 9]
3  2018          [222 × 9]
4  2019          [222 × 9]

List structure

d$data[d$year == "2019"][[1]]
# A tibble: 222 × 9
   census_tract_id   mpp fraction_housing_renters fraction_high_rent
   <chr>           <dbl>                    <dbl>              <dbl>
 1 39061010300     0.182                    0.652              0.484
 2 39061007800     0.332                    0.567              0.478
 3 39061006600     0.837                    0.915              0.536
 4 39061010202     0.509                    0.419              0.293
 5 39061010201     0.421                    0.537              0.471
 6 39061021506     0.816                    0.37               0.514
 7 39061008400     0.619                    0.552              0.467
 8 39061021001     0.087                    0.348              0.451
 9 39061001700     0.783                    0.929              0.507
10 39061026400     0.804                    0.786              0.323
# ℹ 212 more rows
# ℹ 5 more variables: fraction_fam_nospouse <dbl>, fraction_employment <dbl>,
#   n_children_lt18 <int>, median_income_2010adj <dbl>,
#   violations_per_household <dbl>

rowwise

Use a rowwise tibble to apply non vectorized functions across list-cols

d |>
  mutate(
    lm = list(lm(fraction_high_rent ~ mpp + fraction_employment, data = data)),
    lm_pred = list(predict(lm, data))
  ) |>
  summarise(
    rmse = sqrt(mean((lm_pred - data$fraction_high_rent) ^ 2)),
    mae = median(abs(lm_pred - data$fraction_high_rent))
  ) |>
  knitr::kable(digits = 3)
year rmse mae
2016 0.130 0.070
2017 0.127 0.081
2018 0.122 0.071
2019 0.124 0.079

broom::glance()

broom::glance() is designed for extracting summary measures of modeling objects into tidy data frames

d |>
  mutate(lm = list(lm(fraction_high_rent ~ mpp + fraction_employment, data = data))) |>
  summarise(broom::glance(lm))
# A tibble: 4 × 13
# Groups:   year [4]
   year r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
  <int>     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
1  2016     0.198         0.191 0.130      27.1 3.08e-11     2   139. -269.
2  2017     0.219         0.212 0.128      30.7 1.82e-12     2   143. -278.
3  2018     0.247         0.240 0.123      36.0 3.09e-14     2   151. -295.
4  2019     0.207         0.200 0.124      28.6 9.50e-12     2   149. -291.
# ℹ 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>

broom::tidy()

… or model parameters

d |>
  mutate(lm = list(lm(fraction_high_rent ~ mpp + fraction_employment, data = data))) |>
  reframe(broom::tidy(lm))
# A tibble: 12 × 6
    year term                estimate std.error statistic    p.value
   <int> <chr>                  <dbl>     <dbl>     <dbl>      <dbl>
 1  2016 (Intercept)            0.554    0.137       4.04 0.0000745 
 2  2016 mpp                    0.186    0.0403      4.60 0.00000701
 3  2016 fraction_employment   -0.204    0.140      -1.45 0.148     
 4  2017 (Intercept)            0.684    0.153       4.47 0.0000127 
 5  2017 mpp                    0.169    0.0395      4.29 0.0000271 
 6  2017 fraction_employment   -0.352    0.156      -2.26 0.0248    
 7  2018 (Intercept)            0.763    0.166       4.60 0.00000729
 8  2018 mpp                    0.178    0.0375      4.75 0.00000371
 9  2018 fraction_employment   -0.435    0.169      -2.57 0.0107    
10  2019 (Intercept)            0.849    0.174       4.89 0.00000192
11  2019 mpp                    0.148    0.0365      4.06 0.0000674 
12  2019 fraction_employment   -0.517    0.176      -2.93 0.00375   

More dplyr

  • tidyselect
  • across(), pick(), unpack()
  • offset, cumulative aggregate, ranking functions
  • “SQL-esque” functions:
    • coalesce()
    • if_else()
    • na_if()
    • case_when()

dplyr data backends

  • arrow: Apache Arrow
  • dtplyr: use data.table
  • dbplyr: use SQL
  • duckdb: use DuckDB
  • sparklyr: use Apache Spark

dplyr “extensions”

  • tibble and vctrs packages: sf, s2, tsibble, feasts, etc…

Tidy Data + Tidy Evaluation

Thank You

🌐 https://colebrokamp.com

👨‍💻️ github.com/cole-brokamp

🐦 @cole_brokamp

📧 cole.brokamp@cchmc.org