In this notebook, we will learn how to perform data manipulation — and create data processing pipelines — using a powerful package called dplyr.

dplyr is part of tidyverse, a collection of R packages for data science. tidyverse also includes ggplot2, one of the most widely used data visualization packages.

Bonus: If you just need a quick refresher on dplyr, you can refer to this cheat sheet: https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf.

Preliminaries

tidyverse is not built into R, so we first have to install it:

install.packages("tidyverse")

Afterwards, we have to tell R that we want to use tidyverse (in technical terms, we are loading the package):

library("tidyverse")

Goodbye Data Frames! Hello Tibbles

tidyverse introduced a faster and better version of R’s built-in data frame; we call this a tibble.

💡 If you are interested in diving into the differences between R’s built-in data frame and tidyverse’s tibble, you may refer to this article: https://jtr13.github.io/cc21fall1/tibble-vs.-dataframe.html

Since this notebook aims to familiarize ourselves with dplyr (and tidyverse), we will make a shift from data frames to tibbles.

We start by loading our dataset (note that tibbles are loaded using read_delim while data frames are loaded using read.delim):

data <- read_delim("phages.tsv")
Rows: 18406 Columns: 27── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (19): Accession, Description, Classification, Molecule, Modification Date, Low Coding Capacity Warning, Host, Lowest Taxa, Genus, Sub-family, Family, Order, Class,...
dbl  (7): Genome Length (bp), molGC (%), Number CDS, Positive Strand (%), Negative Strand (%), Coding Capacity(%), tRNAs
lgl  (1): Jumbophage
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We check the columns of our dataset:

str(data)
spc_tbl_ [18,406 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Accession                                               : chr [1:18406] "LR756511" "LR756508" "LR756504" "LR756503" ...
 $ Description                                             : chr [1:18406] "uncultured phage" "uncultured phage" "uncultured phage" "uncultured phage" ...
 $ Classification                                          : chr [1:18406] "uncultured phage environmental samples Viruses" "uncultured phage environmental samples Viruses" "uncultured phage environmental samples Viruses" "uncultured phage environmental samples Viruses" ...
 $ Genome Length (bp)                                      : num [1:18406] 595163 484177 636363 735411 642428 ...
 $ Jumbophage                                              : logi [1:18406] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ molGC (%)                                               : num [1:18406] 42.1 39.6 26.4 32.2 31.5 ...
 $ Molecule                                                : chr [1:18406] "DNA" "DNA" "DNA" "DNA" ...
 $ Modification Date                                       : chr [1:18406] "26-MAR-2020" "26-MAR-2020" "26-MAR-2020" "26-MAR-2020" ...
 $ Number CDS                                              : num [1:18406] 1080 683 920 1014 971 ...
 $ Positive Strand (%)                                     : num [1:18406] 20.6 98.4 52 30.3 54.6 ...
 $ Negative Strand (%)                                     : num [1:18406] 79.35 1.61 48.04 69.72 45.42 ...
 $ Coding Capacity(%)                                      : num [1:18406] 93.2 89.6 92.8 93.1 94.5 ...
 $ Low Coding Capacity Warning                             : chr [1:18406] NA NA NA NA ...
 $ tRNAs                                                   : num [1:18406] 62 31 34 56 45 56 26 35 57 30 ...
 $ Host                                                    : chr [1:18406] "Unspecified" "Unspecified" "Unspecified" "Unspecified" ...
 $ Lowest Taxa                                             : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Genus                                                   : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Sub-family                                              : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Family                                                  : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Order                                                   : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Class                                                   : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Phylum                                                  : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Kingdom                                                 : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Realm                                                   : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Baltimore Group                                         : chr [1:18406] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ Genbank Division                                        : chr [1:18406] "ENV" "ENV" "ENV" "ENV" ...
 $ Isolation Host (beware inconsistent and nonsense values): chr [1:18406] "Unspecified" "Unspecified" "Unspecified" "Unspecified" ...
 - attr(*, "spec")=
  .. cols(
  ..   Accession = col_character(),
  ..   Description = col_character(),
  ..   Classification = col_character(),
  ..   `Genome Length (bp)` = col_double(),
  ..   Jumbophage = col_logical(),
  ..   `molGC (%)` = col_double(),
  ..   Molecule = col_character(),
  ..   `Modification Date` = col_character(),
  ..   `Number CDS` = col_double(),
  ..   `Positive Strand (%)` = col_double(),
  ..   `Negative Strand (%)` = col_double(),
  ..   `Coding Capacity(%)` = col_double(),
  ..   `Low Coding Capacity Warning` = col_character(),
  ..   tRNAs = col_double(),
  ..   Host = col_character(),
  ..   `Lowest Taxa` = col_character(),
  ..   Genus = col_character(),
  ..   `Sub-family` = col_character(),
  ..   Family = col_character(),
  ..   Order = col_character(),
  ..   Class = col_character(),
  ..   Phylum = col_character(),
  ..   Kingdom = col_character(),
  ..   Realm = col_character(),
  ..   `Baltimore Group` = col_character(),
  ..   `Genbank Division` = col_character(),
  ..   `Isolation Host (beware inconsistent and nonsense values)` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

We view our dataset (this opens a new tab in RStudio):

View(data)

Selecting Columns & Filtering Rows

To select columns, we use select(). The first argument is the dataset, and the succeeding arguments are the columns to be included.

Use case: Suppose we want to get the family, order, and class of each phage in our dataset.

data_subset <- select(data, Accession, Family, Order, Class)
data_subset

To filter rows, we use filter().

Use case: Suppose we want to remove all entries where the family, order, and class are unclassified.

data_subset <- filter(data_subset, Family != "Unclassified" & Order != "Unclassified" & Class != "Unclassified")
data_subset

Pipe: The “Then” Operator

The processing that we just performed — selecting columns then filtering rows — is actually a simple pipeline already! But notice how our code can easily become cluttered if we are to add more intermediate steps.

Fortunately, dplyr provides a convenient operator called a pipe: %>% (a shortcut to typing this operator is by pressing Ctrl+Shift+M or Cmd+Shift+M for Mac). We can think of %>% as equivalent to the English “then.”

To illustrate its usage, we rewrite our pipeline like so:

data_subset_using_pipe <- data %>% 
  select(Accession, Family, Order, Class) %>% 
  filter(Family != "Unclassified" & Order != "Unclassified" & Class != "Unclassified")
data_subset_using_pipe

Observe how the syntax is mostly the same, with the exception of the first argument of select() and filter(). Since we already specified data at the start of the pipeline, we do not need to pass it anymore as an argument to the data manipulation functions.

Adding Columns (Mutate)

To add columns, we use mutate().

Use case: We have a column called Genome Length (bp) but we want a new column where the genome length is expressed in terms of kbp.

# We enclose column names with spaces in backticks ``

data_with_new_column <- data %>%
    mutate(`Genome Length (kbp)` = `Genome Length (bp)` / 1000)
data_with_new_column

Getting a Column (Pull)

Suppose we want to get all the accessions of the phages in our dataset.

As we learned earlier, we can use select().

data %>% select(Accession)

This works, but this is a one-column tibble. What if we want a vector? We can use pull() instead.

data %>% pull(Accession)
   [1] "LR756511"        "LR756508"        "LR756504"        "LR756503"        "LR756502"        "LR756501"        "LR756500"        "LR745208"        "LR745206"       
  [10] "MK250029"        "MK250028"        "MK250027"        "MK250026"        "MK250025"        "MK250024"        "MK250023"        "MK250022"        "MK250021"       
  [19] "MK250020"        "MK250019"        "MK250018"        "MK250017"        "MK250016"        "MK250015"        "LC727904"        "LC727903"        "LC727902"       
  [28] "LC727901"        "LC727900"        "LC727899"        "LC727898"        "LC727897"        "LC727896"        "LC727895"        "LC727894"        "LC727893"       
  [37] "LC727892"        "LC727891"        "LC727890"        "LC727889"        "LC727888"        "LC727887"        "LC727886"        "LC727885"        "LC727884"       
  [46] "LC727883"        "LC727882"        "LC727881"        "LC727880"        "LC727879"        "LC727878"        "LC727877"        "LC727876"        "LC727875"       
  [55] "LC727874"        "LC727873"        "LC727872"        "LC727871"        "LC727870"        "LC727869"        "LC727868"        "LC727867"        "LC727866"       
  [64] "LC727865"        "LC727864"        "LC727863"        "LC727862"        "LC727861"        "LC727860"        "LC727859"        "LC727858"        "LC727857"       
  [73] "LC727856"        "LC727855"        "LC727854"        "LC727853"        "LC727852"        "LC727851"        "LC727850"        "LC727849"        "LC727848"       
  [82] "LC727847"        "LC727846"        "LC727845"        "LC727844"        "LC727843"        "LC727842"        "LC727841"        "LC727840"        "CP101890"       
  [91] "CP101896"        "LC727701"        "LC727700"        "LC727699"        "LC727698"        "LC727697"        "LC727696"        "LC727695"        "MZ047271"       
 [100] "OP312987"        "CP103976"        "ON602767"        "ON602766"        "ON602765"        "ON602764"        "ON602763"        "ON602762"        "ON602761"       
 [109] "ON602760"        "ON602759"        "ON602758"        "ON602757"        "ON602756"        "ON602755"        "ON602754"        "ON602753"        "ON602752"       
 [118] "ON602751"        "ON602750"        "ON602749"        "ON602748"        "ON602747"        "ON602746"        "ON602745"        "ON602744"        "ON602743"       
 [127] "ON602742"        "ON602741"        "ON602740"        "ON602739"        "ON602738"        "ON602737"        "ON602736"        "ON602735"        "ON602734"       
 [136] "ON602733"        "ON602732"        "ON602731"        "ON602730"        "ON602729"        "ON602728"        "ON602727"        "ON602726"        "ON602725"       
 [145] "ON602724"        "ON602723"        "OP132241"        "ON489264"        "ON809560"        "ON809559"        "OP056300"        "OP121182"        "ON843697"       
 [154] "ON331942"        "OL744217"        "OL744218"        "OL744215"        "OL744213"        "OL744212"        "OL744210"        "OL744209"        "OL744220"       
 [163] "OL744219"        "OL744216"        "OL744214"        "OL744211"        "OM621814"        "OP168678"        "ON814135"        "OP172878"        "OP172877"       
 [172] "OP172876"        "OP172875"        "OP172874"        "OP172873"        "OP172872"        "OP172871"        "OP172870"        "OP172869"        "OP172868"       
 [181] "OP172867"        "OP292222"        "OP292288"        "OM869715"        "OM869714"        "OM869713"        "OM869712"        "OM869711"        "OM869710"       
 [190] "OM869709"        "OM869708"        "OM869707"        "OM869706"        "OM869705"        "OM869704"        "OM869703"        "OM869702"        "OM869701"       
 [199] "OM869700"        "OM869699"        "OM869698"        "OM869697"        "OM869696"        "OM869695"        "OM869694"        "OM869693"        "OM869692"       
 [208] "OM869691"        "OM869690"        "OM869688"        "OM869685"        "OM869684"        "OM869683"        "OM869682"        "OM869681"        "OM869680"       
 [217] "OM869679"        "OM869678"        "OM869677"        "OM869672"        "OM869669"        "OM869668"        "OM869667"        "OM869666"        "OM869665"       
 [226] "OM869664"        "OM869663"        "OM869662"        "OM869661"        "OM869660"        "OM869659"        "OM869656"        "OM869655"        "OM869654"       
 [235] "OM869653"        "OM869652"        "OM869651"        "OM869650"        "OM869649"        "OM869648"        "OM869647"        "OM869646"        "OM869645"       
 [244] "OM869644"        "OM869643"        "OM869642"        "OM869641"        "OM869639"        "OM869638"        "OM869637"        "OM869636"        "OM869635"       
 [253] "OM869634"        "OM869633"        "OM869631"        "OM869630"        "OM869629"        "OM869628"        "OM869627"        "OM869626"        "OM869625"       
 [262] "OM869624"        "OM869623"        "OM869612"        "OM869611"        "OM869610"        "OM869609"        "OM869608"        "OM869607"        "OM869606"       
 [271] "OM869605"        "OM869604"        "OM869603"        "OM869594"        "OM869593"        "OM869592"        "OM869591"        "OM869590"        "OM869589"       
 [280] "OM869588"        "OM869587"        "OM869586"        "OM869585"        "OM869583"        "OM869580"        "OM869579"        "OM869578"        "OM869577"       
 [289] "OM869576"        "OM869575"        "OM869574"        "OM869573"        "OM869572"        "OM869571"        "OM869570"        "OM869569"        "OM869568"       
 [298] "OM869566"        "OM869565"        "OM869564"        "OM869563"        "OM869562"        "OM869561"        "OM869560"        "OM869559"        "OM869558"       
 [307] "OM869557"        "OM869556"        "OM869555"        "OM869554"        "OM869553"        "OM869552"        "OM869551"        "OM869550"        "OM869549"       
 [316] "OM869548"        "OM869547"        "OM869546"        "OM869545"        "OM869544"        "OM869543"        "OM869542"        "OM869541"        "OM869540"       
 [325] "OM869539"        "OM869538"        "OM869537"        "OM869536"        "OM869535"        "OM869534"        "OM869533"        "OM869532"        "OM869531"       
 [334] "OM869530"        "OM869529"        "OM869528"        "OM869527"        "OM869526"        "OM869525"        "OM869524"        "OM869523"        "OM869522"       
 [343] "OM869521"        "OM869520"        "OM869519"        "OM869518"        "OM869517"        "OM869516"        "OM869515"        "OM869514"        "OM869513"       
 [352] "OM869512"        "OM869511"        "OM869510"        "OM869509"        "OM869508"        "OM869507"        "OM869506"        "OM869505"        "OM869504"       
 [361] "OM869503"        "OM869502"        "OM869501"        "OM869500"        "OM869499"        "OM869498"        "OM869497"        "OM869496"        "OM869495"       
 [370] "OM869640"        "OM869567"        "OM869494"        "OM869493"        "OM869632"        "ON857943"        "ON857941"        "ON857940"        "ON857939"       
 [379] "ON857938"        "ON857937"        "ON857936"        "ON857935"        "ON857942"        "ON857934"        "ON857933"        "ON857932"        "ON857931"       
 [388] "ON857930"        "ON857929"        "ON857928"        "ON857927"        "ON857926"        "ON857925"        "ON922920"        "ON922919"        "ON922918"       
 [397] "ON922917"        "ON970626"        "ON970625"        "ON970624"        "ON970623"        "ON970622"        "ON970621"        "ON970620"        "ON970619"       
 [406] "ON970618"        "ON970617"        "ON970616"        "ON970615"        "ON970614"        "ON970613"        "ON970612"        "ON970611"        "ON970610"       
 [415] "ON970609"        "ON970608"        "ON970607"        "ON970606"        "ON970605"        "ON970604"        "ON970603"        "ON970602"        "ON970601"       
 [424] "ON970600"        "ON970599"        "ON970598"        "ON970597"        "ON970596"        "ON970595"        "ON970594"        "ON970593"        "ON970592"       
 [433] "ON970591"        "ON970590"        "ON970589"        "ON970588"        "ON970587"        "ON970586"        "ON970585"        "ON970584"        "ON970583"       
 [442] "ON970582"        "ON970581"        "ON970580"        "ON970579"        "ON970578"        "ON970577"        "ON970576"        "ON970575"        "ON970574"       
 [451] "ON970573"        "ON970572"        "ON970571"        "ON970570"        "ON970569"        "ON970568"        "ON970567"        "ON970566"        "ON970565"       
 [460] "ON970564"        "ON970563"        "ON970562"        "ON970561"        "MT552976"        "MT560059"        "MT560058"        "KJ801817"        "OP068344"       
 [469] "OP068342"        "OP068341"        "OP068340"        "OP068339"        "OP068343"        "OP068338"        "OP068337"        "OP068336"        "OP068335"       
 [478] "OP068334"        "OP068333"        "OP068332"        "OP068331"        "OP068330"        "ON585039"        "OP012469"        "ON630910"        "ON724018"       
 [487] "ON724017"        "ON724016"        "ON724015"        "ON724014"        "ON724013"        "ON724012"        "ON724011"        "ON724010"        "ON724009"       
 [496] "ON724008"        "ON755189"        "ON755187"        "ON755188"        "ON755186"        "ON755185"        "ON755184"        "ON755183"        "ON755182"       
 [505] "ON755181"        "ON755180"        "OP177727"        "ON911717"        "ON911716"        "ON169962"        "ON169961"        "ON169963"        "MT724048"       
 [514] "MK801680"        "ON506928"        "ON506927"        "ON506926"        "ON814136"        "ON814134"        "OP171943"        "OP171942"        "OP079919"       
 [523] "OP079918"        "ON856258"        "ON856257"        "ON778007"        "ON778006"        "ON687736"        "ON687735"        "OP021684"        "OP021683"       
 [532] "OP021682"        "OP021681"        "OP021680"        "OP021679"        "OP021678"        "OP021677"        "OP235318"        "OP197928"        "OP186293"       
 [541] "OP168898"        "OP142323"        "OP046317"        "OP022426"        "OP018999"        "OP019135"        "ON653033"        "OM962992"        "OP125534"       
 [550] "OP123707"        "OP125547"        "OP114732"        "OP094641"        "OP087522"        "OP136151"        "OP120783"        "OP117450"        "ON113334"       
 [559] "ON637763"        "OM032871"        "MZ571833"        "MZ571831"        "MZ571827"        "MZ571834"        "MZ571832"        "MZ571830"        "MZ571829"       
 [568] "MZ571828"        "MZ612130"        "MW924642"        "MW507127"        "MW055909"        "MW055907"        "MW321655"        "MW291020"        "MT952850"       
 [577] "MT657332"        "MT776807"        "MT639647"        "MT639642"        "MT521990"        "MT498067"        "MT498065"        "MT498051"        "MT498047"       
 [586] "MT498046"        "MT498040"        "MT498038"        "MT310864"        "MT316462"        "MT316459"        "MT316458"        "MN585990"        "MN586010"       
 [595] "MN586008"        "MN369745"        "MN428064"        "MN428062"        "MN284892"        "MN329679"        "MN234191"        "MK967392"        "MN010759"       
 [604] "MN010755"        "MK801730"        "MK878899"        "MK894434"        "MK620897"        "MK524502"        "MH744416"        "MH825700"        "MH779512"       
 [613] "MH651181"        "MH576963"        "MH509446"        "MH399783"        "MH371111"        "MH371118"        "MH371124"        "MH045561"        "MH045557"       
 [622] "MH045556"        "MZ570427"        "KP792622"        "OP018674"        "ON996340"        "ON989481"        "OP066531"        "ON872379"        "ON872163"       
 [631] "OK428535"        "ON548421"        "ON548420"        "ON548419"        "ON548418"        "ON548417"        "ON911718"        "ON911715"        "ON911714"       
 [640] "OP009288"        "OP009287"        "OP009286"        "OP009285"        "OP009284"        "OP009283"        "OP009282"        "OP009281"        "OP009280"       
 [649] "OP009279"        "OP009278"        "OP009277"        "OP009276"        "OP009275"        "OP009274"        "OP009273"        "OP009272"        "OP009271"       
 [658] "OP009270"        "OP009269"        "OP009268"        "OP009267"        "OP009266"        "OP009265"        "OM908766"        "MT995928"        "OP067662"       
 [667] "ON960072"        "ON548433"        "ON548432"        "ON506924"        "ON081052"        "ON001686"        "OM960734"        "OP038547"        "ON920540"       
 [676] "ON933637"        "OP028995"        "ON645346"        "ON645344"        "ON645343"        "ON645345"        "ON645342"        "ON645341"        "ON645340"       
 [685] "ON645339"        "ON645338"        "ON645337"        "ON854455"        "OM735689"        "OM858838"        "ON868915"        "ON881243"        "MZ707157"       
 [694] "MZ707156"        "ON950090"        "ON995367"        "ON996339"        "OM401714"        "OM401713"        "OM401712"        "ON880501"        "ON886116"       
 [703] "OL362280"        "MH153810"        "ON165415"        "OM982647"        "OM982646"        "MZ622713"        "OM416780"        "OM416779"        "OM416778"       
 [712] "OM416777"        "OM416776"        "OM416775"        "OM416774"        "OM416773"        "OM416772"        "OM416771"        "OM416770"        "OM416769"       
 [721] "OM416768"        "OM416767"        "OM416766"        "OM416765"        "OM416764"        "OM416763"        "MW398869"        "OM716002"        "OM716001"       
 [730] "OM716000"        "ON754979"        "OM328157"        "MZ574433"        "MZ574432"        "MZ574431"        "MZ574430"        "MZ574429"        "ON287378"       
 [739] "ON568341"        "ON615601"        "ON782582"        "ON715521"        "ON699007"        "ON711490"        "ON714422"        "ON677538"        "ON645936"       
 [748] "ON623732"        "ON624112"        "ON604651"        "ON571554"        "ON568193"        "ON551388"        "ON556632"        "OK625546"        "ON862890"       
 [757] "ON864052"        "ON464764"        "ON464762"        "ON464761"        "ON464760"        "ON464759"        "OX241591"        "OX241576"        "OX241575"       
 [766] "OX241574"        "OX241573"        "OX241572"        "OX241570"        "OX241569"        "OX241568"        "OX241567"        "OX241566"        "OX241563"       
 [775] "OX241562"        "OX241561"        "OX241560"        "OX241559"        "OX241558"        "OX241557"        "OX241556"        "OX241555"        "OX241504"       
 [784] "OX241501"        "OX241498"        "OX241494"        "OX241464"        "OX241455"        "OX241449"        "OX241448"        "OX241447"        "OX241446"       
 [793] "OX241445"        "OX241444"        "OX241443"        "OX241442"        "OX241441"        "OX241438"        "OX241437"        "OX241436"        "OX241435"       
 [802] "OX241434"        "OX241432"        "OX241431"        "OX241430"        "OX241429"        "OX241428"        "OX241427"        "OX241426"        "OX241425"       
 [811] "OX241423"        "ON456356"        "ON456354"        "ON456353"        "ON456352"        "ON456351"        "ON456350"        "ON456349"        "ON456348"       
 [820] "ON456347"        "ON456355"        "ON456346"        "ON456345"        "ON456344"        "ON456343"        "ON456342"        "ON456341"        "ON456340"       
 [829] "ON456339"        "ON456338"        "ON456337"        "ON456336"        "ON456335"        "ON456334"        "ON456333"        "ON456332"        "ON456331"       
 [838] "ON456330"        "ON881905"        "ON287377"        "OM258170"        "JQ245707"        "BK061312"        "BK061311"        "BK061310"        "BK061309"       
 [847] "BK061308"        "ON529854"        "ON239132"        "ON239131"        "MH626557"        "KF024722"        "JX042579"        "JN185608"        "JN020140"       
 [856] "ON464735"        "ON464736"        "ON637170"        "ON637251"        "ON637250"        "ON782130"        "MW221967"        "MH444512"        "MG913376"       
 [865] "MG596799"        "ON529866"        "ON529865"        "ON529858"        "ON529853"        "ON529851"        "ON529850"        "ON005621"        "OM953433"       
 [874] "ON807605"        "ON712643"        "MZ570263"        "ON853583"        "ON866946"        "ON190025"        "ON190024"        "LC644557"        "ON325435"       
 [883] "ON720977"        "ON720976"        "ON720975"        "ON758385"        "MW822601"        "MT889393"        "CP072507"        "OM262540"        "OL770108"       
 [892] "OL770107"        "OM065843"        "OM065842"        "OM065841"        "OM065840"        "OM065839"        "OM065838"        "OM065837"        "MZ821031"       
 [901] "OK632026"        "OK632025"        "MW427215"        "OX090893"        "OX090892"        "OL502173"        "ON721385"        "ON721384"        "OM103621"       
 [910] "CAKLQC020000001" "CAKLQD020000001" "CAKLQA020000001" "CAKLQA020000002" "CAKLQA020000003" "CAKLQE020000001" "CAKLQB020000001" "CAKLQB020000002" "CAKLQG020000001"
 [919] "ON529868"        "ON529864"        "ON529863"        "ON529867"        "ON529862"        "ON529861"        "ON529860"        "ON529859"        "ON529857"       
 [928] "ON529856"        "ON529855"        "ON529852"        "MZ234052"        "MZ234048"        "MZ234015"        "ON642070"        "OM908436"        "ON409570"       
 [937] "ON409569"        "ON409568"        "ON337196"        "ON337195"        "ON337194"        "ON375840"        "ON375839"        "ON375838"        "ON550478"       
 [946] "ON550260"        "ON631220"        "ON571632"        "MZ234029"        "MZ234026"        "ON212267"        "ON212266"        "ON212265"        "ON529291"       
 [955] "ON528933"        "ON528935"        "OM135992"        "OL840290"        "OL963731"        "MH494197"        "KC748969"        "OW991346"        "OW991345"       
 [964] "OX001802"        "OX016465"        "OX001577"        "OV696622"        "OV696621"        "OV696620"        "OV696619"        "OV696617"        "OV696614"       
 [973] "OV696612"        "OV696611"        "OV696610"        "OV696608"        "OK574338"        "MZ357096"        "MZ333136"        "MZ333135"        "MZ333134"       
 [982] "MZ333133"        "MZ333132"        "MZ333131"        "MT427400"        "OM141125"        "ON416862"        "ON286973"        "ON286975"        "ON286976"       
 [991] "ON286974"        "ON286972"        "ON464763"        "ON637248"        "ON526985"        "ON526984"        "ON526983"        "ON526982"        "ON526981"       
[1000] "ON526980"       
 [ reached getOption("max.print") -- omitted 17406 entries ]

Split-Apply-Combine Data Analysis

Let us now try to create more complex pipelines for exploratory data analysis — and, along the way, introduce some functions for aggregating entries and statistics!

Suppose we want to get the number of phages infecting each host genus. This is a good use case for group_by() and n() (for counting the number of entries in a group):

data %>% 
  select(Accession, Host) %>% 
  filter(Host != "Unspecified") %>%
  group_by(Host) %>% 
  summarize(count = n()) %>%
  arrange(desc(count))           # desc() means arrange in descending order

Suppose we want to get the mean and the median guanine-cytosine (GC) content of the phages when grouped by host genus:

# We enclose column names with spaces in backticks ``

data %>% 
  select(Accession, Host, `molGC (%)`) %>% 
  filter(Host != "Unspecified") %>%
  group_by(Host) %>% 
  summarize(mean_gc = mean(`molGC (%)`),
            median_gc = median(`molGC (%)`)) %>%
  arrange(desc(mean_gc))           # desc() means arrange in descending order

  1. De La Salle University, Manila, Philippines, ↩︎

LS0tDQp0aXRsZTogIk1hbmlwdWxhdGluZyBEYXRhIHdpdGggYGRwbHlyYCINCmF1dGhvcjogTWFyayBFZHdhcmQgTS4gR29uemFsZXNeW0RlIExhIFNhbGxlIFVuaXZlcnNpdHksIE1hbmlsYSwgUGhpbGlwcGluZXMsIGdvbnphbGVzLm1hcmtlZHdhcmRAZ21haWwuY29tXQ0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KSW4gdGhpcyBub3RlYm9vaywgd2Ugd2lsbCBsZWFybiBob3cgdG8gcGVyZm9ybSBkYXRhIG1hbmlwdWxhdGlvbiDigJQgYW5kIGNyZWF0ZSBkYXRhIHByb2Nlc3NpbmcgcGlwZWxpbmVzIOKAlCB1c2luZyBhIHBvd2VyZnVsIHBhY2thZ2UgY2FsbGVkIGBkcGx5cmAuDQoNCmBkcGx5cmAgaXMgcGFydCBvZiBbYHRpZHl2ZXJzZWBdKGh0dHBzOi8vd3d3LnRpZHl2ZXJzZS5vcmcvKSwgYSBjb2xsZWN0aW9uIG9mIFIgcGFja2FnZXMgZm9yIGRhdGEgc2NpZW5jZS4gYHRpZHl2ZXJzZWAgYWxzbyBpbmNsdWRlcyBgZ2dwbG90MmAsIG9uZSBvZiB0aGUgbW9zdCB3aWRlbHkgdXNlZCBkYXRhIHZpc3VhbGl6YXRpb24gcGFja2FnZXMuDQoNCioqQm9udXM6KiogSWYgeW91IGp1c3QgbmVlZCBhIHF1aWNrIHJlZnJlc2hlciBvbiBgZHBseXJgLCB5b3UgY2FuIHJlZmVyIHRvIHRoaXMgY2hlYXQgc2hlZXQ6IGh0dHBzOi8vZ2l0aHViLmNvbS9yc3R1ZGlvL2NoZWF0c2hlZXRzL3Jhdy9tYWluL2RhdGEtdHJhbnNmb3JtYXRpb24ucGRmLg0KDQojIyBQcmVsaW1pbmFyaWVzDQoNCmB0aWR5dmVyc2VgIGlzIG5vdCBidWlsdCBpbnRvIFIsIHNvIHdlIGZpcnN0IGhhdmUgdG8gaW5zdGFsbCBpdDoNCg0KYGBgDQppbnN0YWxsLnBhY2thZ2VzKCJ0aWR5dmVyc2UiKQ0KYGBgDQoNCkFmdGVyd2FyZHMsIHdlIGhhdmUgdG8gdGVsbCBSIHRoYXQgd2Ugd2FudCB0byB1c2UgYHRpZHl2ZXJzZWAgKGluIHRlY2huaWNhbCB0ZXJtcywgd2UgYXJlIGxvYWRpbmcgdGhlIHBhY2thZ2UpOg0KDQpgYGB7cn0NCmxpYnJhcnkoInRpZHl2ZXJzZSIpDQpgYGANCg0KIyMgR29vZGJ5ZSBEYXRhIEZyYW1lcyEgSGVsbG8gVGliYmxlcw0KDQpgdGlkeXZlcnNlYCBpbnRyb2R1Y2VkIGEgZmFzdGVyIGFuZCBiZXR0ZXIgdmVyc2lvbiBvZiBSJ3MgYnVpbHQtaW4gZGF0YSBmcmFtZTsgd2UgY2FsbCB0aGlzIGEgKip0aWJibGUqKi4gDQoNCvCfkqEgSWYgeW91IGFyZSBpbnRlcmVzdGVkIGluIGRpdmluZyBpbnRvIHRoZSBkaWZmZXJlbmNlcyBiZXR3ZWVuIFIncyBidWlsdC1pbiBkYXRhIGZyYW1lIGFuZCBgdGlkeXZlcnNlYCdzIHRpYmJsZSwgeW91IG1heSByZWZlciB0byB0aGlzIGFydGljbGU6IGh0dHBzOi8vanRyMTMuZ2l0aHViLmlvL2NjMjFmYWxsMS90aWJibGUtdnMuLWRhdGFmcmFtZS5odG1sDQoNClNpbmNlIHRoaXMgbm90ZWJvb2sgYWltcyB0byBmYW1pbGlhcml6ZSBvdXJzZWx2ZXMgd2l0aCBgZHBseXJgIChhbmQgYHRpZHl2ZXJzZWApLCB3ZSB3aWxsIG1ha2UgYSBzaGlmdCBmcm9tIGRhdGEgZnJhbWVzIHRvIHRpYmJsZXMuIA0KDQpXZSBzdGFydCBieSBsb2FkaW5nIG91ciBkYXRhc2V0IChub3RlIHRoYXQgdGliYmxlcyBhcmUgbG9hZGVkIHVzaW5nIGByZWFkX2RlbGltYCB3aGlsZSBkYXRhIGZyYW1lcyBhcmUgbG9hZGVkIHVzaW5nIGByZWFkLmRlbGltYCk6DQoNCmBgYHtyfQ0KZGF0YSA8LSByZWFkX2RlbGltKCJwaGFnZXMudHN2IikNCmBgYA0KV2UgY2hlY2sgdGhlIGNvbHVtbnMgb2Ygb3VyIGRhdGFzZXQ6DQoNCmBgYHtyfQ0Kc3RyKGRhdGEpDQpgYGANCg0KV2UgdmlldyBvdXIgZGF0YXNldCAodGhpcyBvcGVucyBhIG5ldyB0YWIgaW4gUlN0dWRpbyk6IA0KDQpgYGB7cn0NClZpZXcoZGF0YSkNCmBgYA0KDQojIyBTZWxlY3RpbmcgQ29sdW1ucyAmIEZpbHRlcmluZyBSb3dzDQoNClRvIHNlbGVjdCBjb2x1bW5zLCB3ZSB1c2UgYHNlbGVjdCgpYC4gVGhlIGZpcnN0IGFyZ3VtZW50IGlzIHRoZSBkYXRhc2V0LCBhbmQgdGhlIHN1Y2NlZWRpbmcgYXJndW1lbnRzIGFyZSB0aGUgY29sdW1ucyB0byBiZSBpbmNsdWRlZC4NCg0KX1VzZSBjYXNlOiBTdXBwb3NlIHdlIHdhbnQgdG8gZ2V0IHRoZSBmYW1pbHksIG9yZGVyLCBhbmQgY2xhc3Mgb2YgZWFjaCBwaGFnZSBpbiBvdXIgZGF0YXNldC5fDQoNCmBgYHtyfQ0KZGF0YV9zdWJzZXQgPC0gc2VsZWN0KGRhdGEsIEFjY2Vzc2lvbiwgRmFtaWx5LCBPcmRlciwgQ2xhc3MpDQpkYXRhX3N1YnNldA0KYGBgDQoNClRvIGZpbHRlciByb3dzLCB3ZSB1c2UgYGZpbHRlcigpYC4NCg0KX1VzZSBjYXNlOiBTdXBwb3NlIHdlIHdhbnQgdG8gcmVtb3ZlIGFsbCBlbnRyaWVzIHdoZXJlIHRoZSBmYW1pbHksIG9yZGVyLCBhbmQgY2xhc3MgYXJlIHVuY2xhc3NpZmllZC5fDQoNCmBgYHtyfQ0KZGF0YV9zdWJzZXQgPC0gZmlsdGVyKGRhdGFfc3Vic2V0LCBGYW1pbHkgIT0gIlVuY2xhc3NpZmllZCIgJiBPcmRlciAhPSAiVW5jbGFzc2lmaWVkIiAmIENsYXNzICE9ICJVbmNsYXNzaWZpZWQiKQ0KZGF0YV9zdWJzZXQNCmBgYA0KDQojIyBQaXBlOiBUaGUgIlRoZW4iIE9wZXJhdG9yDQoNClRoZSBwcm9jZXNzaW5nIHRoYXQgd2UganVzdCBwZXJmb3JtZWQg4oCUIHNlbGVjdGluZyBjb2x1bW5zIHRoZW4gZmlsdGVyaW5nIHJvd3Mg4oCUIGlzIGFjdHVhbGx5IGEgc2ltcGxlIHBpcGVsaW5lIGFscmVhZHkhIEJ1dCBub3RpY2UgaG93IG91ciBjb2RlIGNhbiBlYXNpbHkgYmVjb21lIGNsdXR0ZXJlZCBpZiB3ZSBhcmUgdG8gYWRkIG1vcmUgaW50ZXJtZWRpYXRlIHN0ZXBzLg0KDQpGb3J0dW5hdGVseSwgYGRwbHlyYCBwcm92aWRlcyBhIGNvbnZlbmllbnQgb3BlcmF0b3IgY2FsbGVkIGEgcGlwZTogYCU+JWAgKGEgc2hvcnRjdXQgdG8gdHlwaW5nIHRoaXMgb3BlcmF0b3IgaXMgYnkgcHJlc3NpbmcgQ3RybCtTaGlmdCtNIG9yIENtZCtTaGlmdCtNIGZvciBNYWMpLiBXZSBjYW4gdGhpbmsgb2YgYCU+JWAgYXMgZXF1aXZhbGVudCB0byB0aGUgRW5nbGlzaCAidGhlbi4iDQoNClRvIGlsbHVzdHJhdGUgaXRzIHVzYWdlLCB3ZSByZXdyaXRlIG91ciBwaXBlbGluZSBsaWtlIHNvOiANCg0KYGBge3J9DQpkYXRhX3N1YnNldF91c2luZ19waXBlIDwtIGRhdGEgJT4lIA0KICBzZWxlY3QoQWNjZXNzaW9uLCBGYW1pbHksIE9yZGVyLCBDbGFzcykgJT4lIA0KICBmaWx0ZXIoRmFtaWx5ICE9ICJVbmNsYXNzaWZpZWQiICYgT3JkZXIgIT0gIlVuY2xhc3NpZmllZCIgJiBDbGFzcyAhPSAiVW5jbGFzc2lmaWVkIikNCmRhdGFfc3Vic2V0X3VzaW5nX3BpcGUNCmBgYA0KDQpPYnNlcnZlIGhvdyB0aGUgc3ludGF4IGlzIG1vc3RseSB0aGUgc2FtZSwgd2l0aCB0aGUgZXhjZXB0aW9uIG9mIHRoZSBmaXJzdCBhcmd1bWVudCBvZiBgc2VsZWN0KClgIGFuZCBgZmlsdGVyKClgLiBTaW5jZSB3ZSBhbHJlYWR5IHNwZWNpZmllZCBgZGF0YWAgYXQgdGhlIHN0YXJ0IG9mIHRoZSBwaXBlbGluZSwgd2UgZG8gbm90IG5lZWQgdG8gcGFzcyBpdCBhbnltb3JlIGFzIGFuIGFyZ3VtZW50IHRvIHRoZSBkYXRhIG1hbmlwdWxhdGlvbiBmdW5jdGlvbnMuDQoNCiMjIEFkZGluZyBDb2x1bW5zIChNdXRhdGUpDQoNClRvIGFkZCBjb2x1bW5zLCB3ZSB1c2UgYG11dGF0ZSgpYC4NCg0KX1VzZSBjYXNlOiBXZSBoYXZlIGEgY29sdW1uIGNhbGxlZCBgR2Vub21lIExlbmd0aCAoYnApYCBidXQgd2Ugd2FudCBhIG5ldyBjb2x1bW4gd2hlcmUgdGhlIGdlbm9tZSBsZW5ndGggaXMgZXhwcmVzc2VkIGluIHRlcm1zIG9mIGticC5fDQoNCmBgYHtyfQ0KIyBXZSBlbmNsb3NlIGNvbHVtbiBuYW1lcyB3aXRoIHNwYWNlcyBpbiBiYWNrdGlja3MgYGANCg0KZGF0YV93aXRoX25ld19jb2x1bW4gPC0gZGF0YSAlPiUNCiAgICBtdXRhdGUoYEdlbm9tZSBMZW5ndGggKGticClgID0gYEdlbm9tZSBMZW5ndGggKGJwKWAgLyAxMDAwKQ0KZGF0YV93aXRoX25ld19jb2x1bW4NCmBgYA0KDQojIyBHZXR0aW5nIGEgQ29sdW1uIChQdWxsKQ0KDQpTdXBwb3NlIHdlIHdhbnQgdG8gZ2V0IGFsbCB0aGUgYWNjZXNzaW9ucyBvZiB0aGUgcGhhZ2VzIGluIG91ciBkYXRhc2V0Lg0KDQpBcyB3ZSBsZWFybmVkIGVhcmxpZXIsIHdlIGNhbiB1c2UgYHNlbGVjdCgpYC4NCg0KYGBge3J9DQpkYXRhICU+JSBzZWxlY3QoQWNjZXNzaW9uKQ0KYGBgDQoNClRoaXMgd29ya3MsIGJ1dCB0aGlzIGlzIGEgb25lLWNvbHVtbiB0aWJibGUuIFdoYXQgaWYgd2Ugd2FudCBhIHZlY3Rvcj8gV2UgY2FuIHVzZSBgcHVsbCgpYCBpbnN0ZWFkLg0KDQpgYGB7cn0NCmRhdGEgJT4lIHB1bGwoQWNjZXNzaW9uKQ0KYGBgDQojIyBTcGxpdC1BcHBseS1Db21iaW5lIERhdGEgQW5hbHlzaXMNCg0KTGV0IHVzIG5vdyB0cnkgdG8gY3JlYXRlIG1vcmUgY29tcGxleCBwaXBlbGluZXMgZm9yIGV4cGxvcmF0b3J5IGRhdGEgYW5hbHlzaXMg4oCUIGFuZCwgYWxvbmcgdGhlIHdheSwgaW50cm9kdWNlIHNvbWUgZnVuY3Rpb25zIGZvciBhZ2dyZWdhdGluZyBlbnRyaWVzIGFuZCBzdGF0aXN0aWNzIQ0KDQpTdXBwb3NlIHdlIHdhbnQgdG8gZ2V0IHRoZSBudW1iZXIgb2YgcGhhZ2VzIGluZmVjdGluZyBlYWNoIGhvc3QgZ2VudXMuIFRoaXMgaXMgYSBnb29kIHVzZSBjYXNlIGZvciBgZ3JvdXBfYnkoKWAgYW5kIGBuKClgIChmb3IgY291bnRpbmcgdGhlIG51bWJlciBvZiBlbnRyaWVzIGluIGEgZ3JvdXApOg0KDQpgYGB7cn0NCmRhdGEgJT4lIA0KICBzZWxlY3QoQWNjZXNzaW9uLCBIb3N0KSAlPiUgDQogIGZpbHRlcihIb3N0ICE9ICJVbnNwZWNpZmllZCIpICU+JQ0KICBncm91cF9ieShIb3N0KSAlPiUgDQogIHN1bW1hcml6ZShjb3VudCA9IG4oKSkgJT4lDQogIGFycmFuZ2UoZGVzYyhjb3VudCkpICAgICAgICAgICAjIGRlc2MoKSBtZWFucyBhcnJhbmdlIGluIGRlc2NlbmRpbmcgb3JkZXINCmBgYA0KDQpTdXBwb3NlIHdlIHdhbnQgdG8gZ2V0IHRoZSBtZWFuIGFuZCB0aGUgbWVkaWFuIGd1YW5pbmUtY3l0b3NpbmUgKEdDKSBjb250ZW50IG9mIHRoZSBwaGFnZXMgd2hlbiBncm91cGVkIGJ5IGhvc3QgZ2VudXM6DQoNCmBgYHtyfQ0KIyBXZSBlbmNsb3NlIGNvbHVtbiBuYW1lcyB3aXRoIHNwYWNlcyBpbiBiYWNrdGlja3MgYGANCg0KZGF0YSAlPiUgDQogIHNlbGVjdChBY2Nlc3Npb24sIEhvc3QsIGBtb2xHQyAoJSlgKSAlPiUgDQogIGZpbHRlcihIb3N0ICE9ICJVbnNwZWNpZmllZCIpICU+JQ0KICBncm91cF9ieShIb3N0KSAlPiUgDQogIHN1bW1hcml6ZShtZWFuX2djID0gbWVhbihgbW9sR0MgKCUpYCksDQogICAgICAgICAgICBtZWRpYW5fZ2MgPSBtZWRpYW4oYG1vbEdDICglKWApKSAlPiUNCiAgYXJyYW5nZShkZXNjKG1lYW5fZ2MpKSAgICAgICAgICAgIyBkZXNjKCkgbWVhbnMgYXJyYW5nZSBpbiBkZXNjZW5kaW5nIG9yZGVyDQpgYGA=