Title: | Tools for Outbreak Investigation/Infectious Disease Surveillance |
---|---|
Description: | Create epicurves or epigantt charts in 'ggplot2'. Prepare data for visualisation or other reporting for infectious disease surveillance and outbreak investigation. Includes tidy functions to solve date based transformations for common reporting tasks, like (A) seasonal date alignment for respiratory disease surveillance, (B) date-based case binning based on specified time intervals like isoweek, epiweek, month and more, (C) automated detection and marking of the new year based on the date/datetime axis of the 'ggplot2'. An introduction on how to use epicurves can be found on the US CDC website (2012, <https://www.cdc.gov/training/quicklearns/epimode/index.html>). |
Authors: | Alexander Bartel [aut, cre] |
Maintainer: | Alexander Bartel <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.2 |
Built: | 2025-02-11 18:35:10 UTC |
Source: | https://github.com/biostats-dev/ggsurveillance |
Standardizes dates from multiple years to enable comparison of epidemic curves and visualization of seasonal patterns in infectious disease surveillance data. Commonly used for creating periodicity plots of respiratory diseases like influenza, RSV, or COVID-19.
align_dates_seasonal( x, dates_from = NULL, date_resolution = c("week", "isoweek", "epiweek", "day", "month"), start = NULL, target_year = NULL, drop_leap_week = TRUE ) align_and_bin_dates_seasonal( x, n = 1, dates_from, population = 1, fill_gaps = FALSE, date_resolution = c("week", "isoweek", "epiweek", "day", "month"), start = NULL, target_year = NULL, drop_leap_week = TRUE )
align_dates_seasonal( x, dates_from = NULL, date_resolution = c("week", "isoweek", "epiweek", "day", "month"), start = NULL, target_year = NULL, drop_leap_week = TRUE ) align_and_bin_dates_seasonal( x, n = 1, dates_from, population = 1, fill_gaps = FALSE, date_resolution = c("week", "isoweek", "epiweek", "day", "month"), start = NULL, target_year = NULL, drop_leap_week = TRUE )
x |
Either a data frame with a date column, or a date vector.
|
dates_from |
Column name containing the dates to align. Used when x is a data.frame. |
date_resolution |
Character string specifying the temporal resolution. One of:
|
start |
Numeric value indicating epidemic season start:
|
target_year |
Numeric value for the reference year to align dates to. The default target year is the start of the most recent season in the data. This way the most recent dates stay unchanged. |
drop_leap_week |
If |
n |
Numeric column with case counts. Supports quoted and unquoted column names. |
population |
A number or a numeric column with the population size. Used to calculate the incidence. |
fill_gaps |
Logical; If |
This function helps create standardized epidemic curves by aligning surveillance data from different years. This enables:
Comparison of disease patterns across multiple seasons
Identification of typical seasonal trends
Detection of unusual disease activity
Assessment of current season against historical patterns
The alignment can be done at different temporal resolutions (daily, weekly, monthly) with customizable season start points to match different disease patterns or surveillance protocols.
A data frame with standardized date columns:
year
: Calendar year from original date
week/month/day
: Time unit based on chosen resolution
date_aligned
: Date standardized to target year
season
: Epidemic season identifier (e.g., "2023/24")
current_season
: Logical flag for most recent season
Binning also creates the columns:
n
: Sum of cases in bin
incidence
: Incidence calculated using n/population
# Sesonal Visualization of Germany Influenza Surveillance Data library(ggplot2) influenza_germany |> align_dates_seasonal( dates_from = ReportingWeek, date_resolution = "epiweek", start = 28 ) -> df_flu_aligned ggplot(df_flu_aligned, aes(x = date_aligned, y = Incidence, color = season)) + geom_line() + facet_wrap(~AgeGroup) + theme_bw()
# Sesonal Visualization of Germany Influenza Surveillance Data library(ggplot2) influenza_germany |> align_dates_seasonal( dates_from = ReportingWeek, date_resolution = "epiweek", start = 28 ) -> df_flu_aligned ggplot(df_flu_aligned, aes(x = date_aligned, y = Incidence, color = season)) + geom_line() + facet_wrap(~AgeGroup) + theme_bw()
Creates age groups from numeric values using customizable break points and formatting options. The function allows for flexible formatting and customization of age group labels.
If a factor is returned, this factor includes factor levels of unobserved age groups. This allows for reproducible age groups, which can be used for joining data (e.g. adding age grouped population numbers for incidence calculation).
create_agegroups( values, age_breaks = c(5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90), breaks_as_lower_bound = TRUE, first_group_format = "0-{x}", interval_format = "{x}-{y}", last_group_format = "{x}+", pad_numbers = FALSE, pad_with = "0", collapse_single_year_groups = FALSE, na_label = NA, return_factor = FALSE )
create_agegroups( values, age_breaks = c(5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90), breaks_as_lower_bound = TRUE, first_group_format = "0-{x}", interval_format = "{x}-{y}", last_group_format = "{x}+", pad_numbers = FALSE, pad_with = "0", collapse_single_year_groups = FALSE, na_label = NA, return_factor = FALSE )
values |
Numeric vector of ages to be grouped |
age_breaks |
Numeric vector of break points for age groups. |
breaks_as_lower_bound |
Logical; if |
first_group_format |
Character string template for the first age group. Uses glue syntax. |
interval_format |
Character string template for intermediate age groups. Uses glue syntax. |
last_group_format |
Character string template for the last age group. Uses glue syntax. |
pad_numbers |
Logical or numeric; if numeric, pad numbers up to the specified length (Tip: use |
pad_with |
Character to use for padding numbers. Default: |
collapse_single_year_groups |
Logical; if |
na_label |
Label for |
return_factor |
Logical; if |
Vector of age group labels (character or factor depending on return_factor)
# Basic usage create_agegroups(1:100) # Custom formatting with upper bounds create_agegroups(1:100, breaks_as_lower_bound = FALSE, interval_format = "{x} to {y}", first_group_format = "0 to {x}" ) # Ages 1 to 5 are kept as numbers by collapsing single year groups create_agegroups(1:10, age_breaks = c(1, 2, 3, 4, 5, 10), collapse_single_year_groups = TRUE )
# Basic usage create_agegroups(1:100) # Custom formatting with upper bounds create_agegroups(1:100, breaks_as_lower_bound = FALSE, interval_format = "{x} to {y}", first_group_format = "0 to {x}" ) # Ages 1 to 5 are kept as numbers by collapsing single year groups create_agegroups(1:10, age_breaks = c(1, 2, 3, 4, 5, 10), collapse_single_year_groups = TRUE )
Creates a epicurve plot for visualizing epidemic case counts in outbreaks (epidemiological curves).
An epicurve is a bar plot, where every case is outlined. geom_epicurve
additionally provides
date-based aggregation of cases (e.g. per week or month and many more).
For week aggregation both isoweek (World + ECDC) and epiweek (US CDC) are supported.
stat_bin_date
and its alias stat_date_count
provide date based binning only. After binning the by date, these
stats behave like ggplot2::stat_count.
geom_epicurve( mapping = NULL, data = NULL, stat = "epicurve", position = "stack", date_resolution = NULL, week_start = getOption("lubridate.week.start", 1), width = NULL, relative.width = 1, ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE ) stat_bin_date( mapping = NULL, data = NULL, geom = "line", position = "stack", date_resolution = NULL, week_start = getOption("lubridate.week.start", 1), ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE ) stat_date_count( mapping = NULL, data = NULL, geom = "line", position = "stack", date_resolution = NULL, week_start = getOption("lubridate.week.start", 1), ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE )
geom_epicurve( mapping = NULL, data = NULL, stat = "epicurve", position = "stack", date_resolution = NULL, week_start = getOption("lubridate.week.start", 1), width = NULL, relative.width = 1, ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE ) stat_bin_date( mapping = NULL, data = NULL, geom = "line", position = "stack", date_resolution = NULL, week_start = getOption("lubridate.week.start", 1), ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE ) stat_date_count( mapping = NULL, data = NULL, geom = "line", position = "stack", date_resolution = NULL, week_start = getOption("lubridate.week.start", 1), ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE )
mapping |
Set of aesthetic mappings created by
|
data |
The data frame containing the variables for the plot |
stat |
either " |
position |
Position adjustment. Currently supports " |
date_resolution |
Character string specifying the time unit for date aggregation.
Set to
|
week_start |
Integer specifying the start of the week (1 = Monday, 7 = Sunday). |
width |
Numeric value specifying the width of the bars. If |
relative.width |
Numeric value between 0 and 1 adjusting the relative width of bars. Defaults to 1 |
... |
Other arguments passed to
|
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
geom |
The geometric object to use to display the data for this layer.
When using a |
Epi Curves are a public health tool for outbreak investigation. For more details see the references.
A ggplot2
geom layer that can be added to a plot
Centers for Disease Control and Prevention. Quick-Learn Lesson: Using an Epi Curve to Determine Mode of Spread. USA. https://www.cdc.gov/training/quicklearns/epimode/
Dicker, Richard C., Fátima Coronado, Denise Koo, and R. Gibson Parrish. 2006. Principles of Epidemiology in Public Health Practice; an Introduction to Applied Epidemiology and Biostatistics. 3rd ed. USA. https://stacks.cdc.gov/view/cdc/6914
scale_y_cases_5er()
, geom_vline_year()
# Basic epicurve with dates library(ggplot2) set.seed(1) plot_data_epicurve_imp <- data.frame( date = rep(as.Date("2023-12-01") + ((0:300) * 1), times = rpois(301, 0.5)) ) ggplot(plot_data_epicurve_imp, aes(x = date, weight = 2)) + geom_vline_year(year_break = "01-01", show.legend = TRUE) + geom_epicurve(date_resolution = "week") + labs(title = "Epicurve Example") + scale_y_cases_5er() + scale_x_date(date_breaks = "4 weeks", date_labels = "W%V'%g") + # Correct ISOWeek labels week'year coord_equal(ratio = 7) + # Use coord_equal for square boxes. 'ratio' are the days per week. theme_bw() # Categorical epicurve library(tidyr) library(outbreaks) sars_canada_2003 |> # SARS dataset from outbreaks pivot_longer(starts_with("cases"), names_prefix = "cases_", names_to = "origin") |> ggplot(aes(x = date, weight = value, fill = origin)) + geom_epicurve(date_resolution = "week") + scale_x_date(date_labels = "W%V'%g", date_breaks = "2 weeks") + scale_y_cases_5er() + theme_classic()
# Basic epicurve with dates library(ggplot2) set.seed(1) plot_data_epicurve_imp <- data.frame( date = rep(as.Date("2023-12-01") + ((0:300) * 1), times = rpois(301, 0.5)) ) ggplot(plot_data_epicurve_imp, aes(x = date, weight = 2)) + geom_vline_year(year_break = "01-01", show.legend = TRUE) + geom_epicurve(date_resolution = "week") + labs(title = "Epicurve Example") + scale_y_cases_5er() + scale_x_date(date_breaks = "4 weeks", date_labels = "W%V'%g") + # Correct ISOWeek labels week'year coord_equal(ratio = 7) + # Use coord_equal for square boxes. 'ratio' are the days per week. theme_bw() # Categorical epicurve library(tidyr) library(outbreaks) sars_canada_2003 |> # SARS dataset from outbreaks pivot_longer(starts_with("cases"), names_prefix = "cases_", names_to = "origin") |> ggplot(aes(x = date, weight = value, fill = origin)) + geom_epicurve(date_resolution = "week") + scale_x_date(date_labels = "W%V'%g", date_breaks = "2 weeks") + scale_y_cases_5er() + theme_classic()
Various ways of representing a vertical interval defined by y
,
xmin
and xmax
. Each case draws a single graphical object.
geom_epigantt( mapping = NULL, data = NULL, stat = "identity", position = "identity", ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE )
geom_epigantt( mapping = NULL, data = NULL, stat = "identity", position = "identity", ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE )
mapping |
Set of aesthetic mappings created by |
data |
The data to be displayed in this layer. There are three options: If A A |
stat |
The statistical transformation to use on the data for this layer.
When using a
|
position |
A position adjustment to use on the data for this layer. This
can be used in various ways, including to prevent overplotting and
improving the display. The
|
... |
Other arguments passed on to
|
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
A ggplot2
geom layer that can be added to a plot
Determines turn of year dates based on the range of either the x or y axis of the ggplot.
geom_vline_year()
draws vertical lines at the turn of each year
geom_hline_year()
draws horizontal lines at the turn of each year
geom_vline_year( mapping = NULL, position = "identity", year_break = "01-01", just = -0.5, ..., show.legend = NA ) geom_hline_year( mapping = NULL, position = "identity", year_break = "01-01", just = -0.5, ..., show.legend = NA )
geom_vline_year( mapping = NULL, position = "identity", year_break = "01-01", just = -0.5, ..., show.legend = NA ) geom_hline_year( mapping = NULL, position = "identity", year_break = "01-01", just = -0.5, ..., show.legend = NA )
mapping |
Mapping created using |
position |
Position adjustment, either as a string, or the result of a call to a position adjustment function. |
year_break |
String specifying the month and day of the year break ("MM-DD").
Defaults to: |
just |
Numeric offset in days (justification). Shifts the lines from the year break date.
Defaults to |
... |
Other arguments passed to
|
show.legend |
logical. Should this layer be included in the legends? |
A ggplot2 layer that can be added to a plot.
geom_epicurve()
, ggplot2::geom_vline()
library(ggplot2) set.seed(1) plot_data_epicurve_imp <- data.frame( date = rep(as.Date("2023-12-01") + ((0:300) * 1), times = rpois(301, 0.5)) ) ggplot(plot_data_epicurve_imp, aes(x = date, weight = 2)) + geom_epicurve(date_resolution = "week") + geom_vline_year(year_break = "01-01", show.legend = TRUE) + labs(title = "Epicurve Example") + scale_y_cases_5er() + scale_x_date(date_breaks = "4 weeks", date_labels = "W%V'%g") + # Correct ISOWeek labels week'year theme_bw()
library(ggplot2) set.seed(1) plot_data_epicurve_imp <- data.frame( date = rep(as.Date("2023-12-01") + ((0:300) * 1), times = rpois(301, 0.5)) ) ggplot(plot_data_epicurve_imp, aes(x = date, weight = 2)) + geom_epicurve(date_resolution = "week") + geom_vline_year(year_break = "01-01", show.legend = TRUE) + labs(title = "Epicurve Example") + scale_y_cases_5er() + scale_x_date(date_breaks = "4 weeks", date_labels = "W%V'%g") + # Correct ISOWeek labels week'year theme_bw()
The geometric mean is typically defined for strictly positive values. This function computes the geometric mean of a numeric vector, with the option to replace certain values (e.g., zeros, non-positive values, or values below a user-specified threshold) before computation.
geometric_mean( x, na.rm = FALSE, replace_value = NULL, replace = c("all", "non-positive", "zero") )
geometric_mean( x, na.rm = FALSE, replace_value = NULL, replace = c("all", "non-positive", "zero") )
x |
A numeric or complex vector of values. |
na.rm |
Logical. If |
replace_value |
Numeric or |
replace |
Character string indicating which values to replace:
|
Replacement Considerations:
The geometric mean is only defined for strictly positive numbers ().
Despite this, the geometric mean can be useful for laboratory measurements which can contain 0 or negative values.
If these values are treated as NA and are removed, this results in an upward bias due to missingness.
To reduce this, values below the limit of detection (LOD) or limit of quantification (LOQ)
are often replaced with the chosen limit, making this limit the practical lower limit of the measurement scale.
This is therefore an often recommended approach.
There are also alternatives approaches, where values are replaced by
either or
(or LOQ). These approaches create a gap in the distribution
of values (e.g. no values for
) and should therefore be used with caution.
If the replacement approach for values below LOD or LOQ has a material effect on the interpretation of the results, the values should be treated as statistically censored. In this case, proper statistical methods to handle (left) censored data should be used.
When replace_value
is provided, the function will first perform
the specified replacements, then proceed with the geometric mean calculation.
If no replacements are requested but zero or negative values remain and
na.rm = FALSE
, an NA
will be returned with a warning.
A single numeric value representing the geometric mean of the
processed vector x
, or NA
if the resulting vector is empty
(e.g., if na.rm = TRUE
removes all positive values) or if non-positive
values exist when na.rm = FALSE
.
# Basic usage with no replacements: x <- c(1, 2, 3, 4, 5) geometric_mean(x) # Replace all values < 0.5 with 0.5 (common in LOD scenarios): x3 <- c(0.1, 0.2, 0.4, 1, 5) geometric_mean(x3, replace_value = 0.5, replace = "all") # Remove zero or negative values, since log(0) = -Inf and log(-1) = NaN x4 <- c(-1, 0, 1, 2, 3) geometric_mean(x4, na.rm = TRUE)
# Basic usage with no replacements: x <- c(1, 2, 3, 4, 5) geometric_mean(x) # Replace all values < 0.5 with 0.5 (common in LOD scenarios): x3 <- c(0.1, 0.2, 0.4, 1, 5) geometric_mean(x3, replace_value = 0.5, replace = "all") # Remove zero or negative values, since log(0) = -Inf and log(-1) = NaN x4 <- c(-1, 0, 1, 2, 3) geometric_mean(x4, na.rm = TRUE)
A subset of the weekly German influenza surveillance data from January 2020 to January 2025.
influenza_germany
influenza_germany
An object of class tbl_df
(inherits from tbl
, data.frame
) with 1037 rows and 4 columns.
A data frame with 1,037 rows and 4 columns:
Reporting Week in "2024-W03" format
Age groups: 00+
for all and 00-14
, 15-59
and 60+
for age stratified cases.
Weekly case count
Calculated weekly incidence
License CC-BY 4.0: Robert Koch-Institut (2025): Laborbestätigte Influenzafälle in Deutschland. Dataset. Zenodo. DOI:10.5281/zenodo.14619502. https://github.com/robert-koch-institut/Influenzafaelle_in_Deutschland
A continuous ggplot scale for count data with sane defaults for breaks.
It uses base::pretty()
to increase the default number of breaks and prefers 5er breaks.
Additionally, the first tick (i.e. zero) is aligned to the lower left corner.
scale_y_cases_5er( name = waiver(), n = 8, n.min = 5, u5.bias = 4, expand = NULL, labels = waiver(), limits = NULL, oob = scales::censor, na.value = NA_real_, transform = "identity", position = "left", sec.axis = waiver(), guide = waiver(), ... ) scale_x_cases_5er( name = waiver(), n = 8, n.min = 5, u5.bias = 4, expand = NULL, labels = waiver(), limits = NULL, oob = scales::censor, na.value = NA_real_, transform = "identity", position = "bottom", sec.axis = waiver(), guide = waiver(), ... )
scale_y_cases_5er( name = waiver(), n = 8, n.min = 5, u5.bias = 4, expand = NULL, labels = waiver(), limits = NULL, oob = scales::censor, na.value = NA_real_, transform = "identity", position = "left", sec.axis = waiver(), guide = waiver(), ... ) scale_x_cases_5er( name = waiver(), n = 8, n.min = 5, u5.bias = 4, expand = NULL, labels = waiver(), limits = NULL, oob = scales::censor, na.value = NA_real_, transform = "identity", position = "bottom", sec.axis = waiver(), guide = waiver(), ... )
name |
The name of the scale. Used as the axis or legend title. If
|
n |
Target number of breaks passed to |
n.min |
Minimum number of breaks passed to |
u5.bias |
The "5-bias" parameter passed to |
expand |
Uses own expansion logic. Use |
labels |
One of:
|
limits |
One of:
|
oob |
One of:
|
na.value |
Missing values will be replaced with this value. |
transform |
For continuous scales, the name of a transformation object or the object itself. Built-in transformations include "asn", "atanh", "boxcox", "date", "exp", "hms", "identity", "log", "log10", "log1p", "log2", "logit", "modulus", "probability", "probit", "pseudo_log", "reciprocal", "reverse", "sqrt" and "time". A transformation object bundles together a transform, its inverse,
and methods for generating breaks and labels. Transformation objects
are defined in the scales package, and are called |
position |
For position scales, The position of the axis.
|
sec.axis |
|
guide |
A function used to create a guide or its name. See
|
... |
Additional arguments passed on to |
A ggplot2
scale object that can be added to a plot.
geom_epicurve()
, ggplot2::scale_y_continuous()
, base::pretty()
library(ggplot2) data <- data.frame(date = as.Date("2024-01-01") + 0:30) ggplot(data, aes(x = date)) + geom_epicurve(date_resolution = "week") + scale_y_cases_5er()
library(ggplot2) data <- data.frame(date = as.Date("2024-01-01") + 0:30) ggplot(data, aes(x = date)) + geom_epicurve(date_resolution = "week") + scale_y_cases_5er()
uncount()
is provided by the tidyr package, and re-exported
by ggsurveillance. See tidyr::uncount()
for more details.
uncount()
and its alias expand_counts()
are complements of dplyr::count()
: they take
a data frame with a column of frequencies and duplicate each row according to
those frequencies.
uncount(data, weights, ..., .remove = TRUE, .id = NULL) expand_counts(data, weights, ..., .remove = TRUE, .id = NULL)
uncount(data, weights, ..., .remove = TRUE, .id = NULL) expand_counts(data, weights, ..., .remove = TRUE, .id = NULL)
data |
A data frame, tibble, or grouped tibble. |
weights |
A vector of weights. Evaluated in the context of |
... |
Additional arguments passed on to methods. |
.remove |
If |
.id |
Supply a string to create a new variable which gives a unique identifier for each created row. |
A data.frame
with rows duplicated according to weights.
df <- data.frame(x = c("a", "b"), n = c(2, 3)) df |> uncount(n) # Or equivalently: df |> expand_counts(n)
df <- data.frame(x = c("a", "b"), n = c(2, 3)) df |> uncount(n) # Or equivalently: df |> expand_counts(n)