Introduction
The basepenguins package provides tools to convert R scripts and R Markdown/Quarto documents (or other specified file types) that use the palmerpenguins package so that they instead use the versions of penguins and penguins_raw from datasets (R ≥ 4.5.0).
With R ≥ 4.5.0, the popular Palmer Penguins datasets are now directly available without loading the palmerpenguins package. This makes them more accessible, especially for new R users and for teaching purposes. However, there are some differences between the variable names in the palmerpenguins package and those in R’s datasets package:
palmerpenguins | datasets |
---|---|
bill_length_mm | bill_len |
bill_depth_mm | bill_dep |
flipper_length_mm | flipper_len |
body_mass_g | body_mass |
These shorter variable names in the base R version were chosen for more compact code and data display. It does mean, however, that switching to R’s version of penguins is not simply a matter of removing the call to library(palmerpenguins), or of replacing palmerpenguins with datasets in data("penguins", package = "palmerpenguins"): a script that refers to the old variable names will no longer run.
The basepenguins package takes care of converting files by removing the call to palmerpenguins and making the necessary conversions to variable names, ensuring that the resulting scripts still run using the datasets (R ≥ 4.5.0) versions of penguins and penguins_raw.
Package features
The basepenguins package provides four functions to convert files:
- convert_files(): Convert specified files to new output locations
- convert_files_inplace(): Convert files in-place
- convert_dir(): Convert files in a specified directory and its subdirectories to a new output directory, preserving nesting structure
- convert_dir_inplace(): Convert files in a directory in-place
If using convert_files_inplace() or convert_dir_inplace(), we recommend doing so in conjunction with a version-control system such as git, so that any changes can be easily checked.
Additionally, there are helper functions:
- example_files() and example_dir(): Access example files included in the package
- output_paths(): Generate modified file paths
- files_to_convert(): List files in a directory with specified extensions
What changes when converting files?
When a file is ‘convertible’, i.e. contains a call to library(palmerpenguins) or data("penguins", package = "palmerpenguins") and has one of the specified extensions (by default "R", "r", "qmd", "rmd", "Rmd"), the conversion makes these changes:
- Replaces library(palmerpenguins) (or the same with palmerpenguins in quotes) with the empty string ""
- Replaces data("penguins", package = "palmerpenguins") (with any style of quotes) with data("penguins", package = "datasets")
- Replaces variable names:
  - bill_length_mm → bill_len
  - bill_depth_mm → bill_dep
  - flipper_length_mm → flipper_len
  - body_mass_g → body_mass
- Replaces ends_with("_mm") with starts_with("flipper_"), starts_with("bill_")
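The substitution rules listed above can be sketched in base R. This is only an illustration, not the package's implementation: the helper name sketch_convert_lines() and its regular expressions are assumptions made for this example, and unlike the real functions it does not check file extensions or report line numbers.

```r
# Illustrative sketch only: sketch_convert_lines() is a hypothetical helper,
# not a function from basepenguins.
sketch_convert_lines <- function(lines) {
  # Leave the text alone unless it loads palmerpenguins
  uses_pp <- any(grepl("library\\(['\"]?palmerpenguins['\"]?\\)", lines)) ||
    any(grepl("package\\s*=\\s*['\"]palmerpenguins['\"]", lines))
  if (!uses_pp) {
    return(lines)
  }

  # library(palmerpenguins), quoted or not, becomes the empty string
  lines <- gsub("library\\(['\"]?palmerpenguins['\"]?\\)", "", lines)
  # A package = "palmerpenguins" argument (as in data()) is pointed at datasets
  lines <- gsub("package\\s*=\\s*['\"]palmerpenguins['\"]",
                "package = \"datasets\"", lines)

  # Variable renames listed above
  renames <- c(
    bill_length_mm = "bill_len",
    bill_depth_mm = "bill_dep",
    flipper_length_mm = "flipper_len",
    body_mass_g = "body_mass"
  )
  for (old in names(renames)) {
    lines <- gsub(old, renames[[old]], lines, fixed = TRUE)
  }

  # ends_with("_mm") becomes two starts_with() selectors
  gsub("ends_with\\(['\"]_mm['\"]\\)",
       "starts_with(\"flipper_\"), starts_with(\"bill_\")", lines)
}

input <- c(
  "library(palmerpenguins)",
  "penguins |> select(body_mass_g, ends_with(\"_mm\"))"
)
converted <- sketch_convert_lines(input)
converted
```

Text that never loads palmerpenguins is returned unchanged, which mirrors the ‘convertible’ condition described above.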
Example directory and files
The package includes an example directory with four example files to demonstrate how the conversion works. These are accessible through example_files() and example_dir().
# List all example files
example_files()
#> [1] "nested/not_a_script.md" "nested/penguins.qmd" "no_penguins.Rmd"
#> [4] "penguins.R"
These example files include:
- penguins.R: An R script using the palmerpenguins package
- no_penguins.Rmd: An R Markdown file that includes ends_with("_mm") but not in the context of the palmerpenguins package
- nested/penguins.qmd: A Quarto document using the palmerpenguins package
- nested/not_a_script.md: Contains library(palmerpenguins), but is not a script type that is converted by default
You can examine the content of any of these files, e.g.:
penguins_script <- example_files("penguins.R")
cat(readLines(penguins_script), sep = "\n")
#> library(palmerpenguins)
#> library(ggplot2)
#> library(dplyr)
#>
#> # exploring scatterplots
#> penguins |>
#> select(body_mass_g, ends_with("_mm")) |>
#> ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
#> geom_point(aes(color = species, shape = species), size = 2) +
#> scale_color_manual(values = c("darkorange", "darkorchid", "cyan4"))
The example_dir() function returns the path to the directory containing all example files. It also has a copy.dir argument that allows you to copy all the example files to a new directory. This is especially useful for testing the conversion functions that modify files in-place without affecting the original example files distributed with the package:
# Copy all example files to a new subdirectory of the working directory
example_dir("examples")
# List the files in the copied directory
list.files("examples", recursive = TRUE)
#> [1] "nested/not_a_script.md" "nested/penguins.qmd" "no_penguins.Rmd"
#> [4] "penguins.R"
Note that for the purposes of this vignette (and to adhere to CRAN
policies), the working directory has been set to a tempdir
and all new directories and files are written there, using relative
paths.
Converting files
The package offers two main approaches to converting files: creating new converted versions with convert_files() or modifying files in place with convert_files_inplace().
Let’s start by converting a single file to see how it works:
# Convert a single file to a new output file
convert_files(penguins_script, "converted_penguins.R")
#> - ends_with("_mm") replaced on line 7 in converted_penguins.R
#> - Please check the changed output files.
# Look at the converted file
cat(readLines("converted_penguins.R"), sep = "\n")
#>
#> library(ggplot2)
#> library(dplyr)
#>
#> # exploring scatterplots
#> penguins |>
#> select(body_mass, starts_with("flipper_"), starts_with("bill_")) |>
#> ggplot(aes(x = flipper_len, y = body_mass)) +
#> geom_point(aes(color = species, shape = species), size = 2) +
#> scale_color_manual(values = c("darkorange", "darkorchid", "cyan4"))
Notice how the function has:
- Removed the library(palmerpenguins) line
- Replaced the variable names used in palmerpenguins with their datasets equivalents
- Modified ends_with("_mm") to use starts_with() patterns instead
Both the input and output parameters of convert_files() take a vector of file paths, allowing you to convert multiple files at once.
If you want to overwrite the original files rather than creating new ones, you can use convert_files_inplace(), which works exactly the same as convert_files(), except that it doesn’t take an output argument: it is simply a convenience wrapper around convert_files(input, input, extensions).
Return values and messages
All the convert_*() functions invisibly return a list with two components:
- changed: Files that were modified
- not_changed: Files that were not modified (either they don’t have the specified extensions or they don’t use the palmerpenguins package)
If the output paths differ from the input paths, the values in the changed and not_changed vectors will be subsets of output, and they will be named with the corresponding input paths. If files are overwritten, then the values in changed and not_changed will be subsets of input and the vectors will not be named.
This list is returned invisibly for two reasons:
- If many files are converted, and/or absolute file paths are used, this list can occupy a lot of console space
- With the list occupying a lot of console space, messages generated by the functions might be missed
The convert_*() functions generate messages in the following circumstances:
- If any files are changed, a message recommending you check the changed output files
- If any R Markdown or Quarto documents are changed, a message prompting you to re-knit or re-render them
- If any ends_with("_mm") substitutions are made, a message with the output file paths and line numbers of those changes
Converting a directory
To convert all convertible files in a directory (and its subdirectories), use convert_dir(). We’ll use the "examples" directory that we created above with the call to example_dir("examples").
result <- convert_dir("examples", "converted_examples")
#> - ends_with("_mm") replaced on line 7 in converted_examples/penguins.R
#> - Please check the changed output files.
#> - Remember to re-knit or re-render and changed Rmarkdown or Quarto documents.
result
#> $changed
#> examples/nested/penguins.qmd
#> "converted_examples/nested/penguins.qmd"
#> examples/penguins.R
#> "converted_examples/penguins.R"
#>
#> $not_changed
#> examples/no_penguins.Rmd
#> "converted_examples/no_penguins.Rmd"
#> examples/nested/not_a_script.md
#> "converted_examples/nested/not_a_script.md"
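The named vectors in the returned list can be taken apart with base R. The following is a minimal sketch that rebuilds by hand a list with the same shape as the result printed above (the values are copied from that output, so no files are touched); the variable name to_render is an assumption made for this example.

```r
# A hand-built list with the same shape as the printed result
result <- list(
  changed = c(
    "examples/nested/penguins.qmd" = "converted_examples/nested/penguins.qmd",
    "examples/penguins.R" = "converted_examples/penguins.R"
  ),
  not_changed = c(
    "examples/no_penguins.Rmd" = "converted_examples/no_penguins.Rmd",
    "examples/nested/not_a_script.md" = "converted_examples/nested/not_a_script.md"
  )
)

# Input paths are stored as the names, output paths as the values
names(result$changed)
unname(result$changed)

# Changed R Markdown / Quarto outputs, i.e. the ones to re-knit or re-render
is_doc <- grepl("\\.(qmd|rmd)$", result$changed, ignore.case = TRUE)
to_render <- unname(result$changed[is_doc])
to_render
```

When files are converted in place the vectors are unnamed, so names() returns NULL and only the values are available.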
To convert all files in a directory in place, use convert_dir_inplace(). A useful call is convert_dir_inplace(".") to overwrite all convertible files in the working directory, though we don’t run that here, demonstrating on a fresh copy of the example directory instead.
example_dir("in_place_dir")
result <- convert_dir_inplace("in_place_dir")
#> - ends_with("_mm") replaced on line 7 in in_place_dir/penguins.R
#> - Please check the changed output files.
#> - Remember to re-knit or re-render and changed Rmarkdown or Quarto documents.
result
#> $changed
#> [1] "in_place_dir/nested/penguins.qmd" "in_place_dir/penguins.R"
#>
#> $not_changed
#> [1] "in_place_dir/no_penguins.Rmd" "in_place_dir/nested/not_a_script.md"
Helper functions
Finding files with specific extensions
When working with large directories, the files_to_convert() function helps you find files with specific extensions that might be candidates for conversion:
# List all files with convertible extensions in a directory
potential_files <- files_to_convert("examples")
potential_files
#> [1] "nested/penguins.qmd" "no_penguins.Rmd" "penguins.R"
It’s important to note that files_to_convert() only filters files by their extensions and does not look for palmerpenguins in their content.
By default, this function looks for files with extensions "R", "r", "qmd", "rmd", or "Rmd". You can specify different extensions if needed, or return absolute file paths. See the documentation for files_to_convert() for further details:
# Only look for R scripts
files_to_convert("examples", extensions = "R")
#> [1] "penguins.R"
# All extensions
files_to_convert("examples", extensions = NULL)
#> [1] "nested/not_a_script.md" "nested/penguins.qmd" "no_penguins.Rmd"
#> [4] "penguins.R"
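Because files_to_convert() filters on extension only, a content check can be layered on top to see which candidates mention palmerpenguins at all. The sketch below is self-contained base R and makes several assumptions: list.files() stands in for files_to_convert(), the directory name and the helper mentions_pp() are hypothetical, and the test is looser than the package's own rule (any mention of the package name rather than a library() or data() call).

```r
# Build a small throwaway directory so the sketch is self-contained
dir <- file.path(tempdir(), "sketch_candidates")  # hypothetical directory name
dir.create(dir, showWarnings = FALSE)
writeLines(c("library(palmerpenguins)", "head(penguins)"),
           file.path(dir, "uses.R"))
writeLines("x <- 1", file.path(dir, "plain.R"))

# Extension filter first (the default extensions listed above)
candidates <- list.files(dir, pattern = "\\.(R|r|qmd|rmd|Rmd)$",
                         full.names = TRUE)

# Then a content filter: does the file mention palmerpenguins anywhere?
mentions_pp <- function(path) {
  any(grepl("palmerpenguins", readLines(path, warn = FALSE), fixed = TRUE))
}
found <- basename(candidates[vapply(candidates, mentions_pp, logical(1))])
found
```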
Generating output paths
When converting files to new locations, the output_paths() function helps generate appropriate output paths, based on the input paths (which are preserved as names). These can then be passed to the output argument in convert_files(). By default, output_paths() adds a "_new" suffix to the file name, but other suffixes, or prefixes, can be specified. Other output directories can also be given:
input_files <- files_to_convert("examples")
# Default
output_paths(input_files)
#> nested/penguins.qmd no_penguins.Rmd penguins.R
#> "nested/penguins_new.qmd" "no_penguins_new.Rmd" "penguins_new.R"
# Generate output paths with prefix instead, in new directory
output_paths(input_files, prefix = "base_", suffix = "", dir = "~/output")
#> nested/penguins.qmd no_penguins.Rmd
#> "~/output/nested/base_penguins.qmd" "~/output/base_no_penguins.Rmd"
#> penguins.R
#> "~/output/base_penguins.R"
Considerations regarding the ends_with("_mm") substitution
The palmerpenguins Get started vignette has examples of using ends_with("_mm") within calls to dplyr::select(), as a convenient way to select the flipper_length_mm, bill_length_mm and bill_depth_mm columns.
This pattern presents a design challenge for basepenguins. We need a way to select the flipper_len, bill_len and bill_dep columns.
The most obvious substitution for ends_with("_mm") is therefore flipper_len, starts_with("bill_"), which preserves the use of a tidyselect function. However, suppose a script has an earlier call to dplyr::select() that drops flipper_len, and the file has been converted using this substitution. Then the following code will generate an error, because flipper_len is no longer available to be selected:
penguins |>
select(bill_len, bill_dep) |>
select(flipper_len, starts_with("bill_"))
Although the above example is contrived, we don’t want to break anyone’s code, so instead we replace ends_with("_mm") with:
starts_with("flipper_"), starts_with("bill_")
This won’t error, even if there are no column names starting with "flipper_" or "bill_". However, we shouldn’t ever really need starts_with("flipper_"), as there is only one column in penguins that meets that criterion, so we suggest manually checking this substitution and either replacing starts_with("flipper_") with flipper_len if flipper_len is still a column in the data frame, or removing starts_with("flipper_") entirely if not.
To facilitate this, the convert_*() functions all print a message indicating where these substitutions were made, to help you manually review and potentially refine these changes if desired.
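If the messages have scrolled out of view, the substituted selectors can also be found again with a plain text search. A minimal base R sketch, using a temporary file as a stand-in for a converted script (its contents are modelled on the converted example shown earlier, and the variable names are assumptions made for this example):

```r
# Stand-in for a converted file
path <- tempfile(fileext = ".R")
writeLines(c(
  "library(dplyr)",
  "penguins |>",
  "  select(body_mass, starts_with(\"flipper_\"), starts_with(\"bill_\"))"
), path)

# Line numbers on which the substituted selector appears
hits <- grep("starts_with(\"flipper_\")", readLines(path), fixed = TRUE)
hits
```

Each reported line can then be edited by hand, either swapping starts_with("flipper_") for flipper_len or deleting it.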
The use of the ends_with("_mm") pattern with the penguins dataset is also the reason why we only convert files if library(palmerpenguins) or data("penguins", package = "palmerpenguins") is found in the file. It is possible to imagine different data frames for which this selector could be used, and we don’t want to inadvertently alter those. We provide an example file to demonstrate this:
# Look at the example file that doesn't use palmerpenguins
no_penguins_file <- example_files("no_penguins.Rmd")
cat(readLines(no_penguins_file), sep = "\n")
#> ---
#> title: No penguins
#> ---
#>
#> A file to make sure we're not changing `ends_with("_mm")`
#> if the script doesn't load the palmerpenguins package.
#>
#> ```{r}
#> dat <- data.frame(length_mm = 1:3, depth_mm = 4:6)
#>
#> dat |>
#> dplyr::select(ends_with("_mm"))
#> ```
# Pass it to a convert function
convert_files(no_penguins_file, "no_penguins_converted.Rmd")
# The content doesn't change
cat(readLines("no_penguins_converted.Rmd"), sep = "\n")
#> ---
#> title: No penguins
#> ---
#>
#> A file to make sure we're not changing `ends_with("_mm")`
#> if the script doesn't load the palmerpenguins package.
#>
#> ```{r}
#> dat <- data.frame(length_mm = 1:3, depth_mm = 4:6)
#>
#> dat |>
#> dplyr::select(ends_with("_mm"))
#> ```
Even though this file contains ends_with("_mm"), and is an R Markdown file, it doesn’t use the palmerpenguins package, so no substitutions are made. Notice also that there were no messages generated when convert_files() was called, indicating that none of the input files changed.
Final considerations
Class
The versions of penguins and penguins_raw in the datasets package of R ≥ 4.5.0 will always have class data.frame only. In contrast, the palmerpenguins versions will have classes tbl_df, tbl and data.frame if the tibble package is installed on your computer (and just class data.frame if not).
penguins_raw
The versions of penguins_raw in palmerpenguins and datasets are identical, except potentially for their class, as described above. No specific changes are made to penguins_raw by the convert_*() functions in basepenguins, but by removing the call to library(palmerpenguins), the datasets version will be used in any scripts, which is always a data.frame (never a tbl_df).
The palmerpenguins package
Note that the palmerpenguins package provides features that are not in R, such as vignettes and articles on the package website. The package also contains the data in two csv files and provides a function to access them. And, of course, Allison Horst’s wonderful penguins artwork! The palmerpenguins package will remain on CRAN and keep its package website.
We are extremely grateful to the authors of palmerpenguins, Allison Horst, Alison Hill and Kristen Gorman, for their support for adding the Palmer Penguins data to datasets, and their enthusiasm about basepenguins.