| Title: | Create Messy Data from Clean Data Frames |
|---|---|
| Description: | For the purposes of teaching, it is often desirable to show examples of working with messy data and how to clean it. This R package creates messy data from clean, tidy data frames so that students have a clean example to work towards. |
| Authors: | Nicola Rennie [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-4797-557X>) |
| Maintainer: | Nicola Rennie <[email protected]> |
| License: | CC BY 4.0 |
| Version: | 0.1.0.9003 |
| Built: | 2026-05-26 07:50:06 UTC |
| Source: | https://github.com/nrennie/messy |
Add special characters to strings
add_special_chars(data, cols = NULL, messiness = 0.1)

| data | input dataframe |
|---|---|
| cols | set of columns to apply transformation to. If |
| messiness | Percentage of values to change. Must be between 0 and 1. Default 0.1. |

a dataframe the same size as the input data.
add_special_chars(mtcars)
Randomly add whitespaces to the end of some values in all or a subset of columns.
add_whitespace(data, cols = NULL, messiness = 0.1)

| data | input dataframe |
|---|---|
| cols | set of columns to apply transformation to. If |
| messiness | Percentage of values to change. Must be between 0 and 1. Default 0.1. |

a dataframe the same size as the input data.
add_whitespace(mtcars)
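Undoing the padding is the exercise this function sets up for students. A minimal base R sketch, assuming the added whitespace is trailing spaces as described above; the vector `x` is a hand-made stand-in, not actual package output:

```r
# Hand-made stand-in for a character column after add_whitespace()
x <- c("apple ", "banana", "cherry  ")

# trimws() is base R; which = "right" strips trailing whitespace only
cleaned <- trimws(x, which = "right")
cleaned
```

Using `which = "right"` leaves any leading whitespace alone, which may matter if it is meaningful in the original data.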
Randomly switch between title case and lowercase for character strings
change_case(data, cols = NULL, messiness = 0.1, case_type = "word")

| data | input dataframe |
|---|---|
| cols | set of columns to apply transformation to. If |
| messiness | Percentage of values to change. Must be between 0 and 1. Default 0.1. |
| case_type | Whether the case should change based on the |

a dataframe the same size as the input data.
change_case(mtcars)
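One way a student might undo inconsistent casing is to normalise everything to a single case before comparing values. The sketch below uses base R only; the vector `x` is an invented stand-in for a column that has been through `change_case()`:

```r
# Invented stand-in: the same label in several casings
x <- c("Hello World", "hello world", "HELLO world")

# Normalise to lowercase so that the variants collapse to one value
normalised <- tolower(x)
unique(normalised)
```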
Randomly change the separators in character strings through random replacement
change_separators(
  data,
  cols = NULL,
  messiness = 0.1,
  sep_in = c("-", "_", " ", " "),
  sep_out = c("-", "_", " ", " ")
)

| data | input dataframe |
|---|---|
| cols | set of columns to apply transformation to. If |
| messiness | Percentage of values to change. Must be between 0 and 1. Default 0.1. |
| sep_in | A single value, or vector, or list of what is considered a separator in the input data. Default |
| sep_out | A single value, or vector, or list of what the separators may be randomly replaced with. Default |

a dataframe the same size as the input data.
change_separators(mtcars)
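To clean mixed separators, one option is to replace every run of separator characters with a single agreed character. This is a base R sketch on an invented vector, assuming the separators are hyphens, underscores, and spaces as in the defaults shown above:

```r
# Invented stand-in: one value written with different separators
x <- c("new-york", "new_york", "new york", "new  york")

# Replace any run of "-", "_" or space with a single underscore
standardised <- gsub("[-_ ]+", "_", x)
standardised
```

Matching runs (`+`) rather than single characters means a double space collapses to one separator instead of two.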
Duplicate columns and insert them into the dataframe at random
duplicate_columns(data, messiness = 0.1, random = TRUE, name_sep = "")

| data | input dataframe |
|---|---|
| messiness | Probability that each column is duplicated. Must be between 0 and 1. Default 0.1. |
| random | Whether duplicated column names should be randomly selected from other column names, or maintain the original. Default |
| name_sep | Separator to use for adding numbers to end of names. Default |

A dataframe with duplicated columns inserted
Jordi Rosell
duplicate_columns(mtcars, messiness = 0.1)
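Because duplicated columns may end up with unrelated names, a clean-up step has to compare column contents rather than names. A minimal base R sketch on an invented data frame (the column name `b_1` is made up for illustration):

```r
# Invented data frame where b_1 repeats the contents of column a
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
df$b_1 <- df$a

# duplicated() on a list flags elements identical to an earlier one,
# so this keeps the first copy of each distinct column
deduped <- df[!duplicated(as.list(df))]
names(deduped)
```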
Duplicate rows and insert them into the dataframe in order or at random. May result in numbers being added to the end of row names.
duplicate_rows(data, messiness = 0.1, shuffle = FALSE)

| data | input dataframe |
|---|---|
| messiness | Percentage of rows to duplicate. Must be between 0 and 1. Default 0.1. |
| shuffle | Insert duplicated data underneath original data or insert randomly |

A dataframe with duplicated rows inserted
Philip Leftwich, Barry Rowlingson
duplicate_rows(mtcars, messiness = 0.1)
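Removing the duplicates again can be done in base R with `duplicated()`, which compares row contents and ignores row names, so any numbers appended to row names do not get in the way. A sketch on an invented data frame:

```r
# Invented data frame with one repeated row
df <- data.frame(id = c(1, 2, 2, 3), value = c("a", "b", "b", "c"))

# Keep the first occurrence of each distinct row
deduped <- df[!duplicated(df), ]

# Reset row names so they run 1..n again
rownames(deduped) <- NULL
deduped
```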
Randomly make values missing in all data columns, or a subset of columns
make_missing(data, cols = NULL, messiness = 0.1, missing = NA)

| data | input dataframe |
|---|---|
| cols | set of columns to apply transformation to. If |
| messiness | Percentage of values to change. Must be between 0 and 1. Default 0.1. |
| missing | A single value, vector, or list of what the missing values will be replaced with. If length is greater than 1, values will be replaced randomly. Default |

a dataframe the same size as the input data.
make_missing(mtcars)
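When `missing` is set to placeholder strings rather than `NA`, the clean-up task is to map those placeholders back to `NA`. The sketch below is base R on an invented vector; the placeholder tokens are hypothetical choices, not package defaults:

```r
# Invented column where missing values were written as placeholder strings
x <- c("12", "N/A", "7", "", "missing", "3")

# Hypothetical placeholders that a teacher might have passed as `missing`
tokens <- c("N/A", "", "missing")

# Map the placeholders back to NA, then restore the numeric type
x[x %in% tokens] <- NA
values <- as.numeric(x)
values
```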
Make a data frame messier.
messy(
  data,
  messiness = 0.1,
  missing = NA,
  case_type = "word",
  sep_in = c("-", "_", " ", " "),
  sep_out = c("-", "_", " ", " ")
)

| data | input dataframe |
|---|---|
| messiness | Percentage of values to change per function. Must be between 0 and 1. Default 0.1. |
| missing | A single value, vector, or list of what the missing values will be replaced with. If length is greater than 1, values will be replaced randomly. Default |
| case_type | Whether the case should change based on the |
| sep_in | A single value, or vector, or list of what is considered a separator in the input data. Default |
| sep_out | A single value, or vector, or list of what the separators may be randomly replaced with. Default |

a dataframe the same size as the input data.
messy(mtcars)
Adds special characters and randomly capitalises characters in the column names of a data frame.
messy_colnames(data, messiness = 0.2)

| data | data.frame to alter column names |
|---|---|
| messiness | Percentage of values to change per function. Must be between 0 and 1. Default 0.2. |

data.frame with messy column names
Athanasia Monika Mowinckel
messy_colnames(mtcars)
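A typical clean-up for such column names is to strip unwanted characters and settle on one case. This base R sketch works on invented names; which special characters the function actually inserts is not stated here, so the ones below are assumptions:

```r
# Invented messy column names (the special characters are assumptions)
messy_names <- c("M*pG", "c!yl", "DI_sp")

# Drop everything except letters, digits and underscores, then lowercase
clean_names <- tolower(gsub("[^A-Za-z0-9_]", "", messy_names))
clean_names
```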
Takes any date(time) column and transforms it into a character column, sampling from any number of valid character representations.
messy_datetime_formats(
  data,
  cols = NULL,
  formats = c("%Y/%m/%d %H:%M:%S", "%d/%m/%Y %H:%M:%S")
)
messy_date_formats(
  data,
  cols = NULL,
  formats = c("%Y/%m/%d", "%d/%m/%Y")
)

| data | input dataframe |
|---|---|
| cols | set of columns to apply transformation to. If |
| formats | A vector of any number of valid |

a dataframe the same size as the input data.
Jack Davison
Other Messy date(time) functions:
messy_datetime_tzones(),
split_datetimes()
data <- data.frame(dates = rep(Sys.Date(), 10))
messy_date_formats(data)
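Parsing a column that mixes the two default date formats shown in the usage above needs a per-value decision about which format applies. The base R sketch below routes each string by its shape before parsing; the input vector is invented, and the approach is only one possible clean-up, not something the package provides:

```r
# Invented column mixing "%Y/%m/%d" and "%d/%m/%Y" strings
x <- c("2024/12/25", "25/12/2024", "2024/01/05", "05/01/2024")

# Decide the format per value from where the four-digit year sits
is_ymd <- grepl("^[0-9]{4}/[0-9]{2}/[0-9]{2}$", x)
is_dmy <- grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", x)

# Start from all-NA dates and fill in each group with its own format
parsed <- rep(as.Date(NA), length(x))
parsed[is_ymd] <- as.Date(x[is_ymd], format = "%Y/%m/%d")
parsed[is_dmy] <- as.Date(x[is_dmy], format = "%d/%m/%Y")
parsed
```

Checking the shape first avoids relying on a failed parse to signal the wrong format, since a mismatched format string does not always return `NA`.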
Takes any number of datetime columns and changes their timezones either totally at random, or from a user-provided list of timezones.
messy_datetime_tzones(data, cols = NULL, tzones = OlsonNames(), force = FALSE)

| data | input dataframe |
|---|---|
| cols | set of columns to apply transformation to. If |
| tzones | Valid time zones to sample from. By default samples from all |
| force | By default ( |

a dataframe the same size as the input data.
Jack Davison
Other Messy date(time) functions:
messy_datetime_formats(),
split_datetimes()
data <- data.frame(dates = rep(Sys.time(), 10))
data$dates
attr(data$dates, "tzone")
messy <- messy_datetime_tzones(data, tzones = "Poland")
messy$dates
attr(messy$dates, "tzone")
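The example above inspects the `tzone` attribute. As background, in base R that attribute controls how a POSIXct value is displayed, while the stored instant stays the same. The sketch below shows this with plain base R and does not call the package; how `messy_datetime_tzones()` itself treats the underlying instant is not covered here:

```r
# A fixed instant, stored with a UTC display time zone
x <- as.POSIXct("2024-06-01 12:00:00", tz = "UTC")

# Copy it and change only the display time zone
y <- x
attr(y, "tzone") <- "Europe/Warsaw"

# The printed clock times differ...
format(x)
format(y)

# ...but the underlying number of seconds since the epoch is unchanged
same_instant <- as.numeric(x) == as.numeric(y)
same_instant
```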
These functions can split the "date" and "time" components of POSIXt columns and the "year", "month", and "day" components of Date columns into multiple columns.
split_datetimes(data, cols = NULL, class = c("character", "date"))
split_dates(data, cols = NULL)

| data | input dataframe |
|---|---|
| cols | set of columns to apply transformation to. If |
| class | For |

a dataframe
Jack Davison
Other Messy date(time) functions:
messy_datetime_formats(),
messy_datetime_tzones()
# split datetimes
data <- data.frame(today = Sys.time())
split_datetimes(data)
# split dates
data <- data.frame(today = Sys.Date())
data
split_dates(data)
This function takes an arbitrary number of 'joining' columns and any number
of additional column names and splits a dataframe in two such that a user
could then re-join using merge() or dplyr::left_join(). The user may find
it appropriate to go on and apply messy() to each new dataframe
independently to impede rejoining.
unjoin(data, by, cols, distinct = "none", names = c("left", "right"))

| data | input dataframe |
|---|---|
| by | a vector of column names which will be present in both outputs, to rejoin the dataframes |
| cols | specific columns to be present in the 'right' dataframe. Implicitly, all other columns not in 'cols' will be present in the 'left' dataframe. |
| distinct | Apply |
| names | The names of the output list. If |

Real data is often found across multiple datasets. For example, in environmental monitoring, measurements at a monitoring station may need to be bound with metadata about the station such as geographic coordinates, or even meteorological data from an external source, to produce desired outputs. In clinical research it may be necessary to combine the results of a clinical trial with relevant patient information, such as weight or sex. This function undoes existing joins to present learners with an authentic problem to solve: joining two independent datasets to achieve some goal.
A list of two dataframes
Jack Davison
Other data deconstructors:
unrbind()
dummy <- dplyr::tibble(
  patient_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  test = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  result = c("++", "+", "-", "--", "+", "-", "+", "++", "-"),
  sex = c("M", "M", "M", "M", "M", "M", "F", "F", "F"),
  age = c(50, 50, 50, 25, 25, 25, 30, 30, 30)
)
unjoin(
  dummy,
  by = "patient_id",
  cols = c("sex", "age"),
  distinct = "right",
  names = c("tests", "patient_info")
)
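The learner's task is then to rejoin the two pieces. The sketch below uses base R `merge()`, which the description mentions, on two hand-built data frames that stand in for the `tests` and `patient_info` outputs; they are shortened illustrations, not actual `unjoin()` results:

```r
# Hand-built stand-in for the 'left' output: one row per test
tests <- data.frame(
  patient_id = c(1, 1, 2, 2, 3, 3),
  test = c(1, 2, 1, 2, 1, 2),
  result = c("++", "+", "--", "+", "+", "++")
)

# Hand-built stand-in for the 'right' output: one row per patient
patient_info <- data.frame(
  patient_id = c(1, 2, 3),
  sex = c("M", "M", "F"),
  age = c(50, 25, 30)
)

# Rejoin on the shared key; each test row picks up its patient's details
rejoined <- merge(tests, patient_info, by = "patient_id")
rejoined
```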
This function splits a dataframe into any number of dataframes such that they
can be rejoined by using rbind()/dplyr::bind_rows() for unrbind() or
cbind()/dplyr::bind_cols() for uncbind(). The user may find it
appropriate to go on and apply messy() to each new dataframe independently
to impede rejoining.
unrbind(data, sizes = NULL, probs = NULL, names = NULL, shuffle = TRUE)
uncbind(data, sizes = NULL, probs = NULL, names = NULL, shuffle = TRUE)

| data | input dataframe |
|---|---|
| sizes | A vector of numeric inputs summing to |
| probs | A vector of numeric inputs summing to |
| names | The names of the output list. If |
| shuffle | Shuffle rows in |

Real data can often be found in disparate files. For example, data reports
may come in monthly and require row-binding together to obtain a complete
annual time series. Scientific results may arrive from different laboratories
and require binding together for further analysis and comparisons. This
function may simulate a single dataframe having come from different sources
and requiring binding back together. Base R's split() offers an alternative
to unrbind(), but requires a pre-existing factor column to split by and
cannot as easily create random splits in the data.
A list of dataframes
Jack Davison
Other data deconstructors:
unjoin()
unrbind(dplyr::tibble(mtcars), probs = c(0.5, 0.3, 0.2))
uncbind(dplyr::tibble(mtcars), probs = c(0.5, 0.3, 0.2))
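Binding the pieces back together is the clean-up step students would practise. The sketch below imitates a row-wise split with base R `split()` on a random grouping (the alternative mentioned above, used here as a stand-in for `unrbind()` output) and then rejoins the parts with `rbind()`:

```r
set.seed(1)

# Randomly assign each row of mtcars to one of three groups
groups <- sample(rep(1:3, length.out = nrow(mtcars)))

# Stand-in for unrbind() output: a list of three data frames
parts <- split(mtcars, groups)

# Row-bind the parts; unname() stops rbind() from prefixing
# the list names onto the row names
rejoined <- do.call(rbind, unname(parts))

# Rows come back in group order rather than the original order
nrow(rejoined)
```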