class: center, middle, inverse, title-slide

# Workflow for Data Analysis
## with R
### Julie Scholler | B 246 | January 2020

---
class: inverse

<img src="images/horst-eco-r4ds.png" width="110%" style="display: block; margin: auto;" />

.right[*Image credit: Allison Horst*]

---
# Workflow

<img src="images/r4ds_data-science.png" width="75%" style="display: block; margin: auto;" />

.pull-left[
**Steps**

1. Import
1. Tidy: clean and transform
1. Manipulate and transform
1. Visualize
1. Model
1. Communicate
]

.pull-right[
**Tips and Tricks**

- Optimizing
- When it's too big

**Multiple ecosystems**

- base R
- tidyverse
- data.table
- etc.
]

---
# Comparison

### Base R
- a very good starting point
- easy to program with but not always easy to read

### Tidyverse
- very easy to write and read
- each function does exactly one task
- chaining approach

### Data.table
- extremely fast and memory efficient, hence the approach of choice for large data
- chaining approach, but longer queries are not always easy to read

### Links
[base R vs. data.table](https://github.com/mayer79/data_preparation_r), [syntax equivalents: base R vs Tidyverse](https://tavareshugo.github.io/data_carpentry_extras/base-r_tidyverse_equivalents/base-r_tidyverse_equivalents.html)

---
# Tidyverse

<img src="images/tidyverse-logo.png" width="200px" style="display: block; margin: auto;" />

- coherent system of packages that work in harmony
- for data science
- [tidyverse.org](https://www.tidyverse.org/)

<img src="core_packages.png" width="50%" style="display: block; margin: auto;" />

---
# Data.table

<img src="rdatatable.png" width="200px" style="display: block; margin: auto;" />

- syntax
  * concise and consistent
  * fast to read and fast to type
  * corresponding to SQL queries: `FROM[where|orderby, select, groupby]`
- fast
- memory efficient

---
class: center, middle, inverse

# Import data

---
# Import tabular data

## Base R
- `read.table`, `read.csv`

Don't forget:
* `na.strings`, `stringsAsFactors`, `fileEncoding`

--

## Tidyverse and `readr`
- `read_csv`, `read_csv2`, etc.
  * `col_names = TRUE`: logical or character vector
  * `col_types`: to specify column types
  * `na`: character vector of strings to interpret as missing values
  * `locale = locale(encoding = "ISO-8859-1")`: e.g. to import a Windows-encoded file

--

## Data.table
- `fread`
  * `colClasses`
  * `select`, `drop`: select or drop columns

???
- export functions: `write_csv`, `write_csv2`, etc.

---
# Import (big) data

- BlackFriday: dataset of 537,577 observations

```r
system.time(bf1 <- read.csv("BlackFriday.csv"))
```

```
## user system elapsed
## 3.18 0.14 3.33
```

```r
system.time(bf2 <- read_csv("BlackFriday.csv"))
```

```
## user system elapsed
## 0.78 0.00 0.78
```

```r
system.time(bf3 <- fread("BlackFriday.csv"))
```

```
## user system elapsed
## 0.22 0.11 0.18
```

---
# Import (big) data

```r
bf1 <- read.csv("BlackFriday.csv")
bf2 <- read_csv("BlackFriday.csv")
bf3 <- fread("BlackFriday.csv")
```

## What object?
```r
class(bf1)
```

```
## [1] "data.frame"
```

```r
class(bf2)
```

```
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
```

```r
class(bf3)
```

```
## [1] "data.table" "data.frame"
```

---
# Tibble

```r
bf2
```

```
## # A tibble: 537,577 x 12
## User_ID Product_ID Gender Age Occupation City_Category
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1000001 P00069042 F 0-17 10 A
## 2 1000001 P00248942 F 0-17 10 A
## 3 1000001 P00087842 F 0-17 10 A
## 4 1000001 P00085442 F 0-17 10 A
## 5 1000002 P00285442 M 55+ 16 C
## 6 1000003 P00193542 M 26-35 15 A
## 7 1000004 P00184942 M 46-50 7 B
## 8 1000004 P00346142 M 46-50 7 B
## 9 1000004 P0097242 M 46-50 7 B
## 10 1000005 P00274942 M 26-35 20 A
## # ... with 537,567 more rows, and 6 more variables:
## # Stay_In_Current_City_Years <chr>, Marital_Status <dbl>,
## # Product_Category_1 <dbl>, Product_Category_2 <dbl>,
## # Product_Category_3 <dbl>, Purchase <dbl>
```

---
# Tibble

- modern reimagining of the data.frame
- convert with `as_tibble`

### Tibble vs. data.frame
- printing
- more characters allowed in column names
- no partial matching
- no row names
  * `column_to_rownames()`
  * `rownames_to_column()`

---
# Data.table object

```r
bf3
```

```
## User_ID Product_ID Gender Age Occupation City_Category
## 1: 1000001 P00069042 F 0-17 10 A
## 2: 1000001 P00248942 F 0-17 10 A
## 3: 1000001 P00087842 F 0-17 10 A
## 4: 1000001 P00085442 F 0-17 10 A
## 5: 1000002 P00285442 M 55+ 16 C
## ---
## 537573: 1004737 P00193542 M 36-45 16 C
## 537574: 1004737 P00111142 M 36-45 16 C
## 537575: 1004737 P00345942 M 36-45 16 C
## 537576: 1004737 P00285842 M 36-45 16 C
## 537577: 1004737 P00118242 M 36-45 16 C
## Stay_In_Current_City_Years Marital_Status Product_Category_1
## 1: 2 0 3
## 2: 2 0 1
## 3: 2 0 12
## 4: 2 0 12
## 5: 4+ 0 8
## ---
## 537573: 1 0 1
## 537574: 1 0 1
## 537575: 1 0 8
## 537576: 1 0 5
## 537577: 1 0 5
## Product_Category_2 Product_Category_3 Purchase
## 1: NA NA 8370
## 2: 6 14 15200
## 3: NA NA 1422
## 4: 14 NA 1057
## 5: NA NA 7969
## ---
## 537573: 2 NA 11664
## 537574: 15 16 19196
## 537575: 15 NA 8043
## 537576: NA NA 7172
## 537577: 8 NA 6875
```

---
# Data.table object

- enhanced version of data.frame
- columns of character type are never converted to factors by default
- SQL-like syntax

```r
data[i, j, by]
```

- `i`: subset rows
- `j`: select/compute column(s)
- `by`: group (sort with `keyby`)

Take data, subset/reorder rows using `i`, then calculate `j`, grouped by `by`.

### data.table vs. data.frame/base R
- additional to base R
- optimized `[]`
- more readable, faster than base R

???
### data.table vs. tibble
...

---
# Import other formats

## Excel data
- package `readxl` with `read_excel`

## Hierarchical data
- JSON with `jsonlite`
- XML with `XML`
- HTML with `rvest` (web scraping)

## Relational data
- SQLite, MySQL, PostgreSQL, etc. with `DBI`

## Miscellaneous
- SAS, SPSS and Stata data with the package `haven`
- dBase data with the package `foreign`

---
# Hierarchical data

- JSON, XML, HTML, etc.
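A minimal sketch of parsing JSON with `jsonlite` (the inline JSON string is invented for illustration; `fromJSON` also accepts a file path such as `companies.json`):

```r
library(jsonlite)

# A small JSON document (made-up example data)
txt <- '{"name": "Bob", "scores": [9, 13, 15]}'

# fromJSON() parses JSON into R objects (lists, vectors, data frames)
obj <- fromJSON(txt)
obj$name    # "Bob"
obj$scores  # 9 13 15
```

By default `fromJSON` simplifies JSON arrays into R vectors and arrays of objects into data frames, which is usually what you want for tabular-ish JSON.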
### JSON: JavaScript Object Notation

- lightweight data-interchange format
- easy for humans to read and write
- easy for machines to parse and generate
- in R with the `jsonlite` package

### Syntax

```json
{
  "key1": value1,
  "key2": value2,
  ...
}
```

---
class: left, top, background-slide-light
background-image: url(images/keyboard-woman.jpeg)
background-size: cover

## Action

- import: `companies.json` and `backbone.xml`
- look at them
- try to clean them a little

---
# Cleaning strings

## base R
- `gsub`

## tidyverse and stringr
- provides a cohesive set of functions designed to make working with strings as easy as possible
- `str_*`
  * `str_replace`, `str_replace_all`, `str_replace_na`
  * `str_remove`, `str_remove_all`
  * `str_to_upper`, `str_to_lower`

## comparison
[base R vs. stringr](https://stringr.tidyverse.org/articles/from-base.html)

---
# Extract the numbers

<img src="images/parse_number.png" width="70%" style="display: block; margin: auto;" />

.right[*Image credit: Allison Horst*]

---
# janitor

- minor but useful commands: `remove_empty_cols()`, `remove_empty_rows()`

--

<img src="images/janitor_clean_names.png" width="75%" style="display: block; margin: auto;" />

.right[*Image credit: Allison Horst*]

- `clean_names()`: handles problematic variable names, returning only lowercase letters with underscores as separators, and many other improvements to colnames

---
# Coding cases

<img src="images/coding_cases.png" width="90%" style="display: block; margin: auto;" />

.right[*Image credit: Allison Horst*]

---
class: inverse, middle, center

# Manipulate

---
class: inverse, middle, center

# Manipulate with `dplyr`

---
# dplyr

<img src="images/dplyr-logo.png" width="200px" style="display: block; margin: auto;" />

- grammar of data manipulation
- consistent set of verbs
- mutating and filtering joins (two-table verbs)

## Rules

- the first argument is always the dataset
- subsequent arguments say what to do with that data frame
- always return a new data frame
- don't modify in place

---
# dplyr verbs

## single-table verbs
- `slice`: pick rows by index
- `filter`: pick rows matching criteria
- `select`: pick columns by name or index
- `pull`: grab a column as a vector
- `rename`: rename specific columns
- `arrange`: reorder rows
- `mutate`: add new variables
- `transmute`: create a new data frame from computed variables
- `group_by`: create groups inside the dataset
- `summarise`: reduce variables to values
- `distinct`: keep unique rows
- ...

## two-table verbs
- `left_join`, `right_join`
- `inner_join`, `full_join`

---
class: left, top, background-slide-light
background-image: url(images/keyboard-woman.jpeg)
background-size: cover

## Action

- import the BlackFriday data
- play with them

---
# data

```r
bf <- read_csv("BlackFriday.csv")
bf
```

```
## # A tibble: 537,577 x 12
## User_ID Product_ID Gender Age Occupation City_Category
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1000001 P00069042 F 0-17 10 A
## 2 1000001 P00248942 F 0-17 10 A
## 3 1000001 P00087842 F 0-17 10 A
## 4 1000001 P00085442 F 0-17 10 A
## 5 1000002 P00285442 M 55+ 16 C
## 6 1000003 P00193542 M 26-35 15 A
## 7 1000004 P00184942 M 46-50 7 B
## 8 1000004 P00346142 M 46-50 7 B
## 9 1000004 P0097242 M 46-50 7 B
## 10 1000005 P00274942 M 26-35 20 A
## # ... with 537,567 more rows, and 6 more variables:
## # Stay_In_Current_City_Years <chr>, Marital_Status <dbl>,
## # Product_Category_1 <dbl>, Product_Category_2 <dbl>,
## # Product_Category_3 <dbl>, Purchase <dbl>
```

---
## `slice`

```r
slice(bf, 12:15)
```

## `filter`

```r
filter(bf, Gender == "F")
filter(bf, Gender == "F" & City_Category == "A")
filter(bf, Gender == "F", City_Category == "A")
filter(bf, Gender == "F" | City_Category == "A")
```

## `select` to keep some variables

```r
select(bf, Gender, City_Category)
```

---
## `select` to keep some variables

```r
table(select(bf, Gender, City_Category))
```

```
## City_Category
## Gender A B C
## F 34807 56494 40896
## M 109831 169999 125550
```

---
# Pipe operator

## `select` to keep some variables

```r
# table(select(bf, Gender, City_Category))
select(bf, Gender, City_Category) %>% table()
```

<img src="images/pipe-logo.png" width="200px" style="display: block; margin: auto;" />

- `%>%`: pipe operator
- This means you "pipe" the output of the previous line of code as the first input of the next line of code.
- pipeline: we can use multiple pipes in a row
- never at the beginning of a line

---
class: left, top, background-slide-light
background-image: url(images/keyboard-woman.jpeg)
background-size: cover

## Action

- select variables about product category
- exclude variables about ID
- select variables about the city

---
## Special functions

Only work inside commands like `select`:

- `starts_with()`
- `ends_with()`
- `contains()`
- `matches()`
- `everything()`

---
class: left, top, background-slide-light
background-image: url(images/keyboard-woman.jpeg)
background-size: cover

## Action

- reorder variables to put Purchase first
- compare
  * `bf %>% pull(Purchase) %>% head()`
  * `bf %>% select(Purchase) %>% head()`
- add purchase in euros
- create a new tibble with Product_ID and price in euros
- test `bf %>% arrange(Purchase, Marital_Status) %>% select(12, 8)`

---
# `group_by` to create subgroups

```r
bf %>% group_by(Age)
```

```
## # A tibble: 537,577 x 12
## # Groups: Age [7]
## User_ID Product_ID Gender Age Occupation City_Category
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1000001 P00069042 F 0-17 10 A
## 2 1000001 P00248942 F 0-17 10 A
## 3 1000001 P00087842 F 0-17 10 A
## 4 1000001 P00085442 F 0-17 10 A
## 5 1000002 P00285442 M 55+ 16 C
## 6 1000003 P00193542 M 26-35 15 A
## 7 1000004 P00184942 M 46-50 7 B
## 8 1000004 P00346142 M 46-50 7 B
## 9 1000004 P0097242 M 46-50 7 B
## 10 1000005 P00274942 M 26-35 20 A
## # ... with 537,567 more rows, and 6 more variables:
## # Stay_In_Current_City_Years <chr>, Marital_Status <dbl>,
## # Product_Category_1 <dbl>, Product_Category_2 <dbl>,
## # Product_Category_3 <dbl>, Purchase <dbl>
```

---
# `group_by` to create subgroups

```r
bf %>% group_by(Age) %>% slice(1)
```

```
## # A tibble: 7 x 12
## # Groups: Age [7]
## User_ID Product_ID Gender Age Occupation City_Category Stay_In_Current~
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1000001 P00069042 F 0-17 10 A 2
## 2 1000018 P00366542 F 18-25 3 B 3
## 3 1000003 P00193542 M 26-35 15 A 3
## 4 1000007 P00036842 M 36-45 1 B 1
## 5 1000004 P00184942 M 46-50 7 B 2
## 6 1000006 P00231342 F 51-55 9 A 1
## 7 1000002 P00285442 M 55+ 16 C 4+
## # ... with 5 more variables: Marital_Status <dbl>,
## # Product_Category_1 <dbl>, Product_Category_2 <dbl>,
## # Product_Category_3 <dbl>, Purchase <dbl>
```

---
# `group_by` to create subgroups

.pull-left[
```r
bf %>%
  group_by(Age) %>%
  arrange(desc(Purchase)) %>%
  slice(1) %>%
  select(Age, Purchase)
```
]

.pull-right[
```
## # A tibble: 7 x 2
## # Groups: Age [7]
## Age Purchase
## <chr> <dbl>
## 1 0-17 23955
## 2 18-25 23958
## 3 26-35 23961
## 4 36-45 23960
## 5 46-50 23960
## 6 51-55 23960
## 7 55+ 23960
```
]

---
# `group_by` and `summarize`

```r
bf %>%
  group_by(Age) %>%
  summarise(
    Max_Purchase = max(Purchase, na.rm = TRUE)
  )
```

```
## # A tibble: 7 x 2
## Age Max_Purchase
## <chr> <dbl>
## 1 0-17 23955
## 2 18-25 23958
## 3 26-35 23961
## 4 36-45 23960
## 5 46-50 23960
## 6 51-55 23960
## 7 55+ 23960
```

---
# `group_by` and `summarize`

```r
bf %>%
  group_by(Age) %>%
  summarise(Min_Purchase = min(Purchase, na.rm = TRUE),
            Max_Purch = max(Purchase, na.rm = TRUE),
            Mean_Purch = mean(Purchase, na.rm = TRUE),
            Purchase = sum(Purchase))
```

```
## # A tibble: 7 x 5
## Age Min_Purchase Max_Purch Mean_Purch Purchase
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0-17 187 23955 9020. 132659006
## 2 18-25 185 23958 9235. 901669280
## 3 26-35 185 23961 9315. 1999749106
## 4 36-45 185 23960 9401. 1010649565
## 5 46-50 186 23960 9285. 413418223
## 6 51-55 187 23960 9621. 361908356
## 7 55+ 187 23960 9454. 197614842
```

---
# `group_by` and `summarize_at`

```r
getmode <- function(x) {
  # most frequent value of x
  names(which.max(table(x)))
}
nb.na <- function(x) sum(is.na(x))

bf %>%
  group_by(Age) %>%
  summarise_at(
    vars(Gender, Occupation, City_Category),
    getmode
  )
```

```
## # A tibble: 7 x 4
## Age Gender Occupation City_Category
## <chr> <chr> <chr> <chr>
## 1 0-17 M 10 C
## 2 18-25 M 4 B
## 3 26-35 M 0 B
## 4 36-45 M 7 B
## 5 46-50 M 1 B
## 6 51-55 M 7 B
## 7 55+ M 13 C
```

---
# `*_at`, `*_all` and `*_if`

`summarise_all()`, `mutate_all()`, `summarise_at()`, `mutate_at()`, `summarise_if()`, `mutate_if()`

- scoped variants of `summarise()` and `mutate()`
- make it easy to apply the same transformation to multiple variables

There are three variants.

- `_all` affects every variable
- `_at` affects variables selected with a character vector or `vars()`
- `_if` affects variables selected with a predicate function

---
# Join databases

.pull-left[
```r
etu
```

```
## # A tibble: 7 x 2
## name sujet
## <chr> <chr>
## 1 Bob SVM
## 2 Alice LASSO
## 3 John RandomForest
## 4 Simon Nothing
## 5 Ines SVM
## 6 Lisa LASSO
## 7 Omar RandomForest
```
]

.pull-right[
```r
exam
```

```
## # A tibble: 4 x 2
## dataset method
## <chr> <chr>
## 1 energy SVM
## 2 pollution LASSO
## 3 ecommerce RandomForest
## 4 cars NeuralNetwork
```
]

---
# `left_join` and `right_join`

.pull-left[
```r
etu %>%
  left_join(exam, by = c("sujet" = "method"))
```

```
## # A tibble: 7 x 3
## name sujet dataset
## <chr> <chr> <chr>
## 1 Bob SVM energy
## 2 Alice LASSO pollution
## 3 John RandomForest ecommerce
## 4 Simon Nothing <NA>
## 5 Ines SVM energy
## 6 Lisa LASSO pollution
## 7 Omar RandomForest ecommerce
```
]

.pull-right[
```r
etu %>%
  right_join(exam, by = c("sujet" = "method"))
```

```
## # A tibble: 7 x 3
## name sujet dataset
## <chr> <chr> <chr>
## 1 Bob SVM energy
## 2 Ines SVM energy
## 3 Alice LASSO pollution
## 4 Lisa LASSO pollution
## 5 John RandomForest ecommerce
## 6 Omar RandomForest ecommerce
## 7 <NA> NeuralNetwork cars
```
]

---
# `inner_join` and `full_join`

.pull-left[
```r
etu %>%
  inner_join(exam, by = c("sujet" = "method"))
```

```
## # A tibble: 6 x 3
## name sujet dataset
## <chr> <chr> <chr>
## 1 Bob SVM energy
## 2 Alice LASSO pollution
## 3 John RandomForest ecommerce
## 4 Ines SVM energy
## 5 Lisa LASSO pollution
## 6 Omar RandomForest ecommerce
```
]

.pull-right[
```r
etu %>%
  full_join(exam, by = c("sujet" = "method"))
```

```
## # A tibble: 8 x 3
## name sujet dataset
## <chr> <chr> <chr>
## 1 Bob SVM energy
## 2 Alice LASSO pollution
## 3 John RandomForest ecommerce
## 4 Simon Nothing <NA>
## 5 Ines SVM energy
## 6 Lisa LASSO pollution
## 7 Omar RandomForest ecommerce
## 8 <NA> NeuralNetwork cars
```
]

---
# Recap of dplyr verbs

### One-table verbs
- `slice` and `filter`: pick rows
- `select`: pick columns by name or index
- `pull`: grab a column as a vector
- `rename`: rename specific columns
- `arrange`: reorder rows
- `mutate`: add new variables
- `transmute`: create a new data frame from computed variables
- `group_by`: create groups inside the dataset
- `summarise`: reduce variables to values
- many more

### Two-table verbs
- `left_join`, `right_join`
- `inner_join`, `full_join`

---
class: inverse, middle, center

# Tidy

---
# Tidy data

Tidy data is data where:

1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.
<img src="images/tidy-data.png" width="1000px" style="display: block; margin: auto;" />

.right[Image credit: Hadley Wickham and Garrett Grolemund]

---
# Data

.pull-left[
```r
res
```

```
## # A tibble: 14 x 3
## name type note
## <chr> <chr> <dbl>
## 1 Bob Oral 9
## 2 Alice Oral 13
## 3 John Oral 15
## 4 Simon Oral 11
## 5 Ines Oral 13
## 6 Lisa Oral 11
## 7 Omar Oral 10
## 8 Bob Dossier 20
## 9 Alice Dossier 16
## 10 John Dossier 13
## 11 Simon Dossier 19
## 12 Ines Dossier 19
## 13 Lisa Dossier 15
## 14 Omar Dossier 13
```
]

```r
ggplot(res) +
  aes(x = note) +
  geom_density(fill = "grey", col = "grey") +
  facet_wrap(~type)
```

<img src="M2-CM1-xaringan_files/figure-html/unnamed-chunk-45-1.png" width="360" />

- scatterplot Oral~Dossier?

---
background-image: url(images/tidyr-logo.png)
background-size: 20%
background-position: 93% 7%

# Tidy with `tidyr`

### Helps to
- create tidy data
- create the right dataset for ggplot2

### Two fundamental verbs of data tidying

.pull-left[
- `pivot_longer`
  * to collect several columns into name/value pairs (longer data)
  * replaces `gather`, now superseded
]

.pull-right[
- `pivot_wider`
  * to spread name/value pairs across several columns (wider data)
  * replaces `spread`, now superseded
]

### Other verbs
- `separate`, `unite`, `extract`, `complete`, etc.
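The two pivots are inverses of each other; a minimal round-trip sketch on a throwaway tibble (toy data, not the course dataset):

```r
library(tidyr)
library(tibble)

# Toy long-format data: one row per (name, type) pair
long <- tibble(
  name = c("Bob", "Bob", "Alice", "Alice"),
  type = c("Oral", "Dossier", "Oral", "Dossier"),
  note = c(9, 20, 13, 16)
)

# Widen: one column per value of `type` (2 rows, 3 columns)
wide <- pivot_wider(long, names_from = type, values_from = note)

# Lengthen again: back to one row per (name, type) pair
back <- pivot_longer(wide, cols = c(Oral, Dossier),
                     names_to = "type", values_to = "note")
```

Note the round trip recovers the same rows, though row and column order may differ from the original.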
---
# `pivot_wider`

.pull-left[
```r
res2 <- res %>%
  pivot_wider(
    names_from = type,
    values_from = note)
res2
```

```
## # A tibble: 7 x 3
## name Oral Dossier
## <chr> <dbl> <dbl>
## 1 Bob 9 20
## 2 Alice 13 16
## 3 John 15 13
## 4 Simon 11 19
## 5 Ines 13 19
## 6 Lisa 11 15
## 7 Omar 10 13
```
]

```r
ggplot(res2) +
  aes(x = Dossier, y = Oral) +
  geom_point() +
  lims(x = c(0, 20), y = c(0, 20))
```

<img src="M2-CM1-xaringan_files/figure-html/unnamed-chunk-47-1.png" width="360" />

---
# `pivot_longer`

```r
res2 %>%
  pivot_longer(cols = c(Dossier, Oral),
               names_to = "type",
               values_to = "note")
```

```
## # A tibble: 14 x 3
## name type note
## <chr> <chr> <dbl>
## 1 Bob Dossier 20
## 2 Bob Oral 9
## 3 Alice Dossier 16
## 4 Alice Oral 13
## 5 John Dossier 13
## 6 John Oral 15
## 7 Simon Dossier 19
## 8 Simon Oral 11
## 9 Ines Dossier 19
## 10 Ines Oral 13
## 11 Lisa Dossier 15
## 12 Lisa Oral 11
## 13 Omar Dossier 13
## 14 Omar Oral 10
```

---
# `unite` to combine two columns

```r
uni <- res %>% unite(col = results, type, note, sep = "-")
uni
```

```
## # A tibble: 14 x 2
## name results
## <chr> <chr>
## 1 Bob Oral-9
## 2 Alice Oral-13
## 3 John Oral-15
## 4 Simon Oral-11
## 5 Ines Oral-13
## 6 Lisa Oral-11
## 7 Omar Oral-10
## 8 Bob Dossier-20
## 9 Alice Dossier-16
## 10 John Dossier-13
## 11 Simon Dossier-19
## 12 Ines Dossier-19
## 13 Lisa Dossier-15
## 14 Omar Dossier-13
```

---
# `separate` to split a column into multiple columns

```r
uni %>% separate(results, c("type", "value"))
```

```
## # A tibble: 14 x 3
## name type value
## <chr> <chr> <chr>
## 1 Bob Oral 9
## 2 Alice Oral 13
## 3 John Oral 15
## 4 Simon Oral 11
## 5 Ines Oral 13
## 6 Lisa Oral 11
## 7 Omar Oral 10
## 8 Bob Dossier 20
## 9 Alice Dossier 16
## 10 John Dossier 13
## 11 Simon Dossier 19
## 12 Ines Dossier 19
## 13 Lisa Dossier 15
## 14 Omar Dossier 13
```

---
class: left, top, background-slide-light
background-image: url(images/keyboard-woman.jpeg)
background-size: cover

## Action

- plot the count of Product_Category_1
- plot prices according to Product_Category_1

---
class: inverse, middle, center

# Tame your factors with forcats

---
background-image: url(images/forcats-logo.png)
background-size: 20%
background-position: 93% 7%

# forcats

- package from the tidyverse
- just for factors

## Main commands

- reordering levels
  * `fct_infreq`: order levels by frequency
  * `fct_reorder`, `fct_reorder2`: order according to other variables
  * `fct_relevel`: reorder manually
- combining levels
  * `fct_collapse`
  * `fct_lump`
- recoding levels
  * `fct_recode`
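The main forcats verbs can be sketched on a toy factor (the levels here are invented for illustration; forcats must be installed):

```r
library(forcats)

x <- factor(c("b", "b", "a", "c", "c", "c"))

# Reorder levels by decreasing frequency
fct_infreq(x)             # levels: c, b, a

# Reorder levels manually (unmentioned levels keep their order)
fct_relevel(x, "c", "a")  # levels: c, a, b

# Keep the n most frequent levels, lump the rest into "Other"
fct_lump(x, n = 1)        # levels: c, Other

# Rename levels (new name = old name)
fct_recode(x, alpha = "a")
```

Each function returns a new factor; the underlying data values are unchanged, only the levels (and, for `fct_lump`/`fct_recode`, the labels) differ.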