Title: | Techniques to Build Better Balance |
---|---|
Description: | Build better balance in causal inference models. 'halfmoon' helps you assess propensity score models for balance between groups using metrics like standardized mean differences and visualization techniques like mirrored histograms. 'halfmoon' supports both weighting and matching techniques. |
Authors: | Malcolm Barrett [aut, cre, cph] |
Maintainer: | Malcolm Barrett <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0.9000 |
Built: | 2025-02-03 19:26:45 UTC |
Source: | https://github.com/r-causal/halfmoon |
This function replaces the counts in the default header of gtsummary::tbl_svysummary() tables with counts representing the Effective Sample Size (ESS). See ess() for details.
    add_ess_header(
      x,
      header = "**{level}** \nESS = {format(n, digits = 1, nsmall = 1)}"
    )
x |
( |
header |
( |
a 'gtsummary' table
    svy <- survey::svydesign(~1, data = nhefs_weights, weights = ~ w_ate)

    gtsummary::tbl_svysummary(svy, include = c(age, sex, smokeyrs)) |>
      add_ess_header()

    hdr <- paste0(
      "**{level}** \n",
      "N = {n_unweighted}; ESS = {format(n, digits = 1, nsmall = 1)}"
    )

    gtsummary::tbl_svysummary(svy, by = qsmk, include = c(age, sex, smokeyrs)) |>
      add_ess_header(header = hdr)
This function computes the effective sample size (ESS) given a vector of weights, using the classical formula (sometimes referred to as "Kish's effective sample size").
ess(wts)
wts | A numeric vector of weights (e.g., from survey or inverse-probability weighting). |
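The effective sample size (ESS) reflects how many observations you would have if all were equally weighted. If the weights vary substantially, the ESS can be much smaller than the actual number of observations. Formally:

$$\mathrm{ESS} = \frac{\left(\sum_{i} w_i\right)^2}{\sum_{i} w_i^2}$$

where $w_i$ is the weight of observation $i$.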
Diagnostic Value:
Indicator of Weight Concentration: A large discrepancy between ESS and the actual sample size indicates that a few observations carry disproportionately large weights, effectively reducing the usable information in the dataset.
Variance Inflation: A small ESS signals that weighted estimates are more sensitive to a handful of observations, inflating the variance and standard errors.
Practical Guidance: If the ESS is much lower than the total sample size, investigate why some weights are extremely large or small. Techniques like weight trimming or stabilized weights can help mitigate the issue.
A single numeric value representing the effective sample size.
    # Suppose we have five observations with equal weights
    wts1 <- rep(1.2, 5)
    # Returns 5, because all weights are equal
    ess(wts1)

    # If the weights vary, the ESS is smaller than 5
    wts2 <- c(0.5, 2, 2, 0.1, 0.8)
    ess(wts2)
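To make the formula concrete, here is a minimal base R sketch of Kish's effective sample size as described above. The helper name `kish_ess()` and the cap of 1 used for trimming are assumptions made for illustration; this is not the package's own `ess()` source.

```r
# Kish's effective sample size: (sum of weights)^2 / (sum of squared weights).
# `kish_ess()` is a hypothetical helper for illustration, not halfmoon's ess().
kish_ess <- function(wts) {
  stopifnot(is.numeric(wts), length(wts) > 0)
  sum(wts)^2 / sum(wts^2)
}

# Equal weights: the ESS equals the number of observations.
equal_wts <- rep(1.2, 5)
ess_equal <- kish_ess(equal_wts)

# Unequal weights: the ESS falls below the number of observations.
varied_wts <- c(0.5, 2, 2, 0.1, 0.8)
ess_varied <- kish_ess(varied_wts)

# Capping large weights (a simple form of trimming; the cap of 1 is arbitrary)
# makes the weights more even, which raises the ESS.
trimmed_wts <- pmin(varied_wts, 1)
ess_trimmed <- kish_ess(trimmed_wts)

print(c(equal = ess_equal, varied = ess_varied, trimmed = ess_trimmed))
```

Comparing the ESS before and after trimming is one way to check whether a few large weights are driving the loss of information.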
The empirical cumulative distribution function (ECDF) provides an alternative visualization of distribution. geom_ecdf() is similar to ggplot2::stat_ecdf() but it can also calculate weighted ECDFs.
    geom_ecdf(
      mapping = NULL,
      data = NULL,
      geom = "step",
      position = "identity",
      ...,
      n = NULL,
      pad = TRUE,
      na.rm = FALSE,
      show.legend = NA,
      inherit.aes = TRUE
    )
mapping | Set of aesthetic mappings created by |
data | The data to be displayed in this layer. There are three options: If A A |
geom | The geometric object to use to display the data for this layer. When using a |
position | A position adjustment to use on the data for this layer. This can be used in various ways, including to prevent overplotting and improving the display. The |
... | Other arguments passed on to |
n | If NULL, do not interpolate. If not NULL, this is the number of points to interpolate with. |
pad | If |
na.rm | If |
show.legend | Logical. Should this layer be included in the legends? |
inherit.aes | If |
a geom
In addition to the aesthetics for ggplot2::stat_ecdf(), geom_ecdf() also accepts:
weights
    library(ggplot2)

    ggplot(
      nhefs_weights,
      aes(x = smokeyrs, color = qsmk)
    ) +
      geom_ecdf(aes(weights = w_ato)) +
      xlab("Smoking Years") +
      ylab("Proportion <= x")
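For readers who want to see what a weighted ECDF computes, the following base R sketch evaluates one by hand: at each value, it reports the share of total weight carried by observations at or below that value. The helper `weighted_ecdf()` is hypothetical and written only for illustration; it does not reproduce how geom_ecdf() is implemented.

```r
# A hand-rolled weighted ECDF: F(q) = sum of weights with x <= q / total weight.
# `weighted_ecdf()` is a hypothetical helper, not part of halfmoon or ggplot2.
weighted_ecdf <- function(x, w = rep(1, length(x))) {
  stopifnot(length(x) == length(w), all(w >= 0), sum(w) > 0)
  ux <- sort(unique(x))
  # Total weight at each distinct value of x (handles ties)
  w_at_ux <- vapply(ux, function(v) sum(w[x == v]), numeric(1))
  cum_prop <- cumsum(w_at_ux) / sum(w)
  # Return a step function: the number of distinct values <= q picks the step
  function(q) {
    idx <- findInterval(q, ux)
    c(0, cum_prop)[idx + 1]
  }
}

x <- c(1, 2, 3, 4)
w <- c(1, 1, 1, 3)

f_unweighted <- weighted_ecdf(x)
f_weighted <- weighted_ecdf(x, w)

# With equal weights, 3 of 4 observations are <= 3.
# With these weights, only half of the total weight sits at or below 3.
print(c(unweighted = f_unweighted(3), weighted = f_weighted(3)))
```

With equal weights the result matches the ordinary ECDF, so the weighted version can be read as a generalization of it.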
Create mirrored histograms
    geom_mirror_histogram(
      mapping = NULL,
      data = NULL,
      position = "stack",
      ...,
      binwidth = NULL,
      bins = NULL,
      na.rm = FALSE,
      orientation = NA,
      show.legend = NA,
      inherit.aes = TRUE
    )
mapping | Set of aesthetic mappings created by |
data | The data to be displayed in this layer. There are three options: If A A |
position | A position adjustment to use on the data for this layer. This can be used in various ways, including to prevent overplotting and improving the display. The |
... | Other arguments passed on to |
binwidth | The width of the bins. Can be specified as a numeric value or as a function that calculates width from unscaled x. Here, "unscaled x" refers to the original x values in the data, before application of any scale transformation. When specifying a function along with a grouping structure, the function will be called once per group. The default is to use the number of bins in The bin width of a date variable is the number of days in each time; the bin width of a time variable is the number of seconds. |
bins | Number of bins. Overridden by |
na.rm | If |
orientation | The orientation of the layer. The default ( |
show.legend | Logical. Should this layer be included in the legends? |
inherit.aes | If |
a geom
    library(ggplot2)

    ggplot(nhefs_weights, aes(.fitted)) +
      geom_mirror_histogram(
        aes(group = qsmk),
        bins = 50
      ) +
      geom_mirror_histogram(
        aes(fill = qsmk, weight = w_ate),
        bins = 50,
        alpha = 0.5
      ) +
      scale_y_continuous(labels = abs)
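The use of `labels = abs` in the example suggests that one group's bar heights are drawn as negative values below the axis. The base R sketch below illustrates that idea under stated assumptions: both groups share the same bins, weights are summed within each bin, and the first group's totals are negated. The helper `mirror_counts()`, its column names, and the choice of which group goes below the axis are all hypothetical and are not taken from the package source.

```r
# Mirrored (weighted) histogram counts for a two-level grouping variable.
# `mirror_counts()` is a hypothetical helper for illustration only.
mirror_counts <- function(x, group, wts = rep(1, length(x)), bins = 10) {
  groups <- sort(unique(group))
  stopifnot(length(groups) == 2, length(x) == length(wts))
  # Shared, equal-width bins across both groups
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  bin_id <- cut(x, breaks = breaks, include.lowest = TRUE)
  # Sum the weights in each bin for one group; empty bins get 0
  bin_sums <- function(g) {
    keep <- group == g
    as.numeric(tapply(wts[keep], bin_id[keep], sum, default = 0))
  }
  data.frame(
    bin_left = head(breaks, -1),
    upper = bin_sums(groups[2]),   # drawn above the axis
    lower = -bin_sums(groups[1])   # negated, drawn below the axis
  )
}

x <- c(0, 0.2, 0.6, 1, 0.1, 0.8)
group <- c(0, 0, 0, 0, 1, 1)
wts <- c(1, 1, 1, 1, 2, 3)

mirrored <- mirror_counts(x, group, wts, bins = 2)
print(mirrored)
```

Plotting `upper` and `lower` as bars on a common axis, with absolute values as axis labels, would give a mirrored display of the two groups.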
A dataset containing various propensity score weights for causaldata::nhefs_complete.
nhefs_weights
A data frame with 1566 rows and 14 variables:
Quit smoking
Race
Age
Sex
Education level
Smoking intensity
Number of smoke-years
Exercise level
Daily activity level
Participant weight in 1971 (baseline)
ATE weight
ATT weight
ATC weight
ATM weight
ATO weight
Propensity score