Skip to contents

Base functions

epifitter provides functions to fit classic two-parameter population dynamics models to disease progress curve (DPC) data: exponential, monomolecular, logistic, and Gompertz.

The goal of fitting these models to DPCs is to estimate epidemiologically meaningful parameters: y0, which represents primary inoculum, and r, which represents the apparent infection rate. Choosing the model that best describes the epidemic helps users interpret disease progress and compare epidemics more effectively.

Two approaches can be used to obtain the parameters:

  • Linear regression models fitted to the transformed disease data required by each model.
  • Non-linear regression models fitted to the original disease intensity data.

Both approaches are available in epifitter. The simplest way to fit these models to a single epidemic is with fit_lin() and fit_nlin(). The alternative fit_nlin2() allows estimation of a third parameter, the upper asymptote, when the maximum disease intensity is not close to 100%. The fit_multi() function extends these workflows to multiple epidemic datasets.

First, we need to load the packages we’ll need for this tutorial.

Basic usage

Dataset

To use epifitter, at least two variables are needed: one representing assessment time and another representing disease intensity as a proportion (for example, incidence, severity, or prevalence). In designed experiments with replicates, a third variable identifying the experimental units is also needed.

Let’s simulate a DPC dataset for one epidemic measured in replicated plots. The simulated data resemble a polycyclic epidemic with a sigmoid shape. We can generate this dataset with sim_logistic().

dpcL <- sim_logistic(
  N = 100, # duration of the epidemics in days
  y0 = 0.01, # disease intensity at time zero
  dt = 10, # interval between assessments
  r = 0.1, # apparent infection rate
  alpha = 0.2, # level of noise
  n = 7 # number of replicates
)

Let’s give a look at the simulated dataset.

knitr::kable(head(dpcL), digits = 4)
replicates time y random_y
1 0 0.0100 0.0100
1 10 0.0267 0.0281
1 20 0.0695 0.0380
1 30 0.1687 0.1685
1 40 0.3555 0.3840
1 50 0.5999 0.6550

The object generated by sim_logistic() is a data frame with four columns. The y variable contains disease intensity values on a proportional scale (0 < y < 1). To facilitate visualization, we can plot the epidemic with the ggplot2 package.

ggplot(
  dpcL,
  aes(time, y,
    group = replicates
  )
) +
  geom_point(aes(time, random_y), shape = 1) + # plot the replicate values
  geom_point(color = "steelblue", size = 2) +
  geom_line(color = "steelblue") +
  labs(
    title = "Simulated 'complete' epidemics of sigmoid shape",
    subtitle = "Produced using sim_logistic()"
  )+
  theme_minimal_hgrid()

Plot of simulated logistic epidemics with replicate observations and the underlying mean disease progress curve over time.

Linear regression using fit_lin()

The fit_lin() requires at least the time and y arguments. In the example, we will call the random_y which represents the replicates. A quick way to call these variables attached to the dataframe is shown below.

f_lin <- fit_lin(
  time = dpcL$time,  
  y = dpcL$random_y
)

fit_lin() outputs a list object which contains several elements. Three elements of the list are shown by default: stats of model fit, Infection rate and Initial Inoculum

f_lin
## Results of fitting population models 
## 
## Stats:
##                  CCC r_squared    RSE
## Logistic      0.9977    0.9955 0.2146
## Gompertz      0.9776    0.9563 0.4873
## Monomolecular 0.9367    0.8809 0.6480
## Exponential   0.9131    0.8402 0.6237
## 
##  Infection rate:
##                 Estimate    Std.error      Lower      Upper
## Logistic      0.09961316 0.0007732535 0.09807276 0.10115356
## Gompertz      0.07111327 0.0017559327 0.06761527 0.07461126
## Monomolecular 0.05498609 0.0023351295 0.05033427 0.05963791
## Exponential   0.04462707 0.0022475856 0.04014965 0.04910449
## 
##  Initial inoculum:
##                    Estimate Linearized     lin.SE         Lower         Upper
## Logistic       1.014725e-02  -4.580354 0.04574629  9.271593e-03  0.0111046715
## Gompertz       3.243387e-05  -2.335663 0.10388238  3.012411e-06  0.0002239497
## Monomolecular -1.776962e+00  -1.021357 0.13814812 -2.656706e+00 -1.1088701071
## Exponential    2.846738e-02  -3.558996 0.13296895  2.184279e-02  0.0371010991

Model fit stats

The Stats element of the list shows how each of the four models predicted the observations based on three measures:

  • Lin’s concordance correlation coefficient CCC (Lin 2000), a measure of agreement that takes both bias and precision into account
  • Coefficient of determination r_squared (R2), a measure of precision
  • Residual standard deviation RSE for each model.

The four models are sorted from the high to the low CCC. As expected because the sim_logistic function was used to create the synthetic epidemic data, the the Logistic model was superior to the others.

Model coefficients

The estimates, and respective standard error and upper and lower 95% confidence interval, for the two coefficients of interest are shown in the Infection rate and Initial inoculum elements. For the latter, both the back-transformed (estimate) and the linearized estimate are shown.

Global stats

The element f_lin$stats_all provides a wide format dataframe with all the stats for each model.

knitr::kable(f_lin$stats_all, digits = 4)
best_model model r r_se r_ci_lwr r_ci_upr v0 v0_se r_squared RSE CCC y0 y0_ci_lwr y0_ci_upr
1 Logistic 0.0996 0.0008 0.0981 0.1012 -4.5804 0.0457 0.9955 0.2146 0.9977 0.0101 0.0093 0.0111
2 Gompertz 0.0711 0.0018 0.0676 0.0746 -2.3357 0.1039 0.9563 0.4873 0.9776 0.0000 0.0000 0.0002
3 Monomolecular 0.0550 0.0023 0.0503 0.0596 -1.0214 0.1381 0.8809 0.6480 0.9367 -1.7770 -2.6567 -1.1089
4 Exponential 0.0446 0.0022 0.0401 0.0491 -3.5590 0.1330 0.8402 0.6237 0.9131 0.0285 0.0218 0.0371

Model predictions

The predicted values are stored as a dataframe in the data element called using the same $ operator as above. Both the observed and (y) and the back-transformed predictions (predicted) are shown for each model. The linearized value and the residual are also shown.

knitr::kable(head(f_lin$data), digits = 4)
time y model linearized predicted residual
0 0.0100 Exponential -4.6052 0.0285 -0.0185
0 0.0100 Monomolecular 0.0101 -1.7770 1.7870
0 0.0100 Logistic -4.5951 0.0101 -0.0001
0 0.0100 Gompertz -1.5272 0.0000 0.0100
10 0.0281 Exponential -3.5736 0.0445 -0.0164
10 0.0281 Monomolecular 0.0285 -0.6024 0.6304

Plot of predictions

The plot_fit() produces, by default, a panel of plots depicting the observed and predicted values by all fitted models. The arguments pont_size and line_size that control for the size of the dots for the observation and the size of the fitted line, respectively.

plot_lin <- plot_fit(f_lin,
  point_size = 2,
  line_size = 1
) 

# Default plots
plot_lin 

Faceted comparison of observed data and fitted disease progress curves for the candidate linearized models.

Publication-ready plots

The plots are ggplot2 objects which can be easily customized by adding new layers that override plot paramaters as shown below. The argument models allows to select the models(s) to be shown on the plot. The next plot was customized for the logistic model.

# Customized plots

plot_fit(f_lin,
  point_size = 2,
  line_size = 1,
  models = "Logistic")+
  theme_minimal_hgrid(font_size =18) +
  scale_x_continuous(limits = c(0,100))+
  scale_color_grey()+
   theme(legend.position = "none")+
  labs(
    x = "Time",
    y = "Proportion disease "
    
  )

Customized plot of the logistic fitted curve with observed disease intensity points over time.

Non-linear regression

Two-parameters

The fit_nlin() function uses the Levenberg-Marquardt algorithm for least-squares estimation of nonlinear parameters. In addition to time and disease intensity, starting values for y0 and r should be given in the starting_par argument. The output format and interpretation is analogous to the fit_lin().

NOTE: If you encounter error messages saying “matrix at initial parameter estimates”, try to modify the starting values for the parameters to solve the problem.

f_nlin <- fit_nlin(
  time = dpcL$time,
  y = dpcL$random_y,
  starting_par = list(y0 = 0.01, r = 0.03)
)

f_nlin
## Results of fitting population models 
## 
## Stats:
##                  CCC r_squared    RSE
## Logistic      0.9982    0.9963 0.0246
## Gompertz      0.9959    0.9933 0.0367
## Exponential   0.8869    0.8245 0.1758
## Monomolecular     NA        NA     NA
## 
##  Infection rate:
##                 Estimate   Std.error      Lower      Upper
## Logistic      0.10250959 0.002086494 0.09835308 0.10666609
## Gompertz      0.07226208 0.002129110 0.06802067 0.07650348
## Exponential   0.01916079 0.001407090 0.01635772 0.02196386
## Monomolecular         NA          NA         NA         NA
## 
##  Initial inoculum:
##                   Estimate    Std.error         Lower        Upper
## Logistic      8.533242e-03 8.419049e-04  6.856081e-03 1.021040e-02
## Gompertz      1.466911e-08 2.542743e-08 -3.598492e-08 6.532314e-08
## Exponential   1.787016e-01 2.092668e-02  1.370135e-01 2.203897e-01
## Monomolecular           NA           NA            NA           NA

We can check the results using plot_fit.

plot_fit(f_nlin) +
  theme_minimal_hgrid()#changing plot theme
## Warning: Removed 101 rows containing missing values or values outside the scale range
## (`geom_line()`).

Faceted comparison of observed data and fitted curves from nonlinear two-parameter model fits.

Estimating K (maximum disease)

In many epidemics the last measure (final time) of a DPC does not reach the maximum intensity and, for this reason, estimation of maximum asymptote (carrying capacity K) may be necessary. The fin_lin2() provides an estimation of K in addition to the estimates provided by fit_lin().

Before demonstrating the function, we can transform our simulated data by creating another variable with y_random2 with maximum about 0.8 (80%). Simplest way is to multiply the y_random by 0.8.

dpcL2 = dpcL %>% 
  mutate(random_y = random_y * 0.8)

Then we run the fit_nlin2() for the new dataset.

f_nlin2 <- fit_nlin2(
  time = dpcL2$time,
  y = dpcL2$random_y,
  starting_par = list(y0 = 0.01, r = 0.2, K =  0.6)
)
f_nlin2
## Results of fitting population models 
## 
## Stats:
##                  CCC r_squared    RSE
## Logistic      0.9982    0.9963 0.0198
## Gompertz      0.9965    0.9939 0.0275
## Monomolecular     NA        NA     NA
## 
##  Infection rate:
##                 Estimate   Std.error      Lower      Upper
## Logistic      0.10267337 0.002532523 0.09762721 0.10771953
## Gompertz      0.06535311 0.002535110 0.06030179 0.07040442
## Monomolecular         NA          NA         NA         NA
## 
##  Initial inoculum:
##                   Estimate    Std.error         Lower        Upper
## Logistic      6.784477e-03 7.643834e-04  5.261410e-03 8.307544e-03
## Gompertz      5.923352e-07 8.447990e-07 -1.090964e-06 2.275634e-06
## Monomolecular           NA           NA            NA           NA
## 
##  Maximum disease intensity:
##                Estimate   Std.error     Lower     Upper
## Logistic      0.7994452 0.004854938 0.7897715 0.8091189
## Gompertz      0.8289752 0.009009765 0.8110228 0.8469275
## Monomolecular        NA          NA        NA        NA
plot_fit(f_nlin2)

Faceted comparison of observed data and fitted curves from nonlinear model fits that estimate the upper asymptote K.

NOTE: The exponential model is not included because it doesn’t have a maximum asymptote. The estimated value of K is the expected 0.8.

Fit models to multiple DPCs

Most commonly, there are more than one epidemics to analyse either from observational or experimental studies. When the goal is to fit a common model to all curves, the fit_multi() function is in hand. Each DPC needs an unique identified to further combined in a single data frame.

Data

Let’s use the sim_ family of functions to create three epidemics and store the data in a single data.frame. The Gompertz model was used to simulate these data. Note that we allowed to the y0 and r parameter to differ the DPCs. We should combine the three DPCs using the bind_rows() function and name the identifier (.id), automatically created as a character vector, for each epidemics as ‘DPC’.

epi1 <- sim_gompertz(N = 60, y0 = 0.001, dt = 5, r = 0.1, alpha = 0.4, n = 4)
epi2 <- sim_gompertz(N = 60, y0 = 0.001, dt = 5, r = 0.12, alpha = 0.4, n = 4)
epi3 <- sim_gompertz(N = 60, y0 = 0.003, dt = 5, r = 0.14, alpha = 0.4, n = 4)

multi_epidemic <- bind_rows(epi1,
  epi2,
  epi3,
  .id = "DPC"
)
knitr::kable(head(multi_epidemic), digits = 4)
DPC replicates time y random_y
1 1 0 0.0010 0.0010
1 1 5 0.0152 0.0162
1 1 10 0.0788 0.0762
1 1 15 0.2141 0.3436
1 1 20 0.3927 0.5165
1 1 25 0.5672 0.6408

We can visualize the three DPCs in a same plot

p_multi <- ggplot(multi_epidemic,
       aes(time, y, shape = DPC, group = DPC))+
  geom_point(size =2)+
  geom_line()+
  theme_minimal_grid(font_size =18) +
   labs(
    x = "Time",
    y = "Proportion disease "
    
  )
p_multi

Line plot comparing three simulated Gompertz disease progress curves identified by DPC.

Or use facet_wrap() for ploting them separately.

p_multi +
  facet_wrap(~ DPC, ncol = 1)

Faceted display of the three simulated disease progress curves, one panel for each DPC.

Using fit_multi()

fit_multi() requires at least four arguments: time, disease intensity (as proportion), data and the curve identifier (strata_cols). The latter argument accepts one or more strata include as c("strata1",strata2"). In the example below, the stratum name is DPC, the name of the variable.

By default, the linear regression is fitted to data but adding another argument nlin = T, the non linear regressions is fitted instead.

multi_fit <- fit_multi(
  time_col = "time",
  intensity_col = "random_y",
  data = multi_epidemic,
  strata_cols = "DPC"
)

All parameters of the list can be returned using the $ operator as below.

knitr::kable(head(multi_fit$Parameters), digits = 4)
DPC best_model model r r_se r_ci_lwr r_ci_upr v0 v0_se r_squared RSE CCC y0 y0_ci_lwr y0_ci_upr
1 1 Gompertz 0.1010 0.0027 0.0955 0.1065 -1.8740 0.0965 0.9648 0.3684 0.9821 0.0015 0.0004 0.0047
1 2 Monomolecular 0.0730 0.0032 0.0667 0.0794 -0.5698 0.1120 0.9140 0.4275 0.9550 -0.7679 -1.2140 -0.4116
1 3 Logistic 0.1602 0.0081 0.1440 0.1765 -4.5241 0.2856 0.8872 1.0899 0.9403 0.0107 0.0061 0.0189
1 4 Exponential 0.0872 0.0094 0.0682 0.1062 -3.9543 0.3339 0.6303 1.2741 0.7733 0.0192 0.0098 0.0375
2 1 Gompertz 0.1206 0.0031 0.1144 0.1269 -1.9129 0.1099 0.9678 0.4195 0.9837 0.0011 0.0002 0.0044
2 2 Monomolecular 0.0947 0.0038 0.0870 0.1024 -0.7282 0.1357 0.9241 0.5179 0.9605 -1.0714 -1.7206 -0.5772

Similarly, all data can be returned.

knitr::kable(head(multi_fit$Data), digits = 4)
DPC time y model linearized predicted residual
1 0 0.0010 Exponential -6.9078 0.0192 -0.0182
1 0 0.0010 Monomolecular 0.0010 -0.7679 0.7689
1 0 0.0010 Logistic -6.9068 0.0107 -0.0097
1 0 0.0010 Gompertz -1.9326 0.0015 -0.0005
1 5 0.0162 Exponential -4.1238 0.0297 -0.0135
1 5 0.0162 Monomolecular 0.0163 -0.2270 0.2432

If nonlinear regression is preferred, the nlim argument should be set to TRUE

multi_fit2 <- fit_multi(
  time_col = "time",
  intensity_col = "random_y",
  data = multi_epidemic,
  strata_cols = "DPC",
  nlin = TRUE)
knitr::kable(head(multi_fit2$Parameters), digits = 4)
DPC model y0 y0_se r r_se df CCC r_squared RSE y0_ci_lwr y0_ci_upr r_ci_lwr r_ci_upr best_model
1 Gompertz 0.0018 0.0011 0.1034 0.0043 50 0.9925 0.9854 0.0463 -0.0003 0.0040 0.0948 0.1120 1
1 Logistic 0.0354 0.0067 0.1489 0.0084 50 0.9875 0.9783 0.0598 0.0218 0.0489 0.1320 0.1657 2
1 Exponential 0.2513 0.0315 0.0259 0.0026 50 0.8476 0.7652 0.1880 0.1880 0.3145 0.0206 0.0312 3
1 Monomolecular NA NA NA NA NA NA NA NA NA NA NA NA 4
2 Gompertz 0.0019 0.0013 0.1157 0.0057 50 0.9906 0.9815 0.0523 -0.0008 0.0045 0.1041 0.1272 1
2 Logistic 0.0348 0.0067 0.1667 0.0095 50 0.9887 0.9789 0.0571 0.0213 0.0483 0.1477 0.1858 2

Want to estimate K?

If you want to estimate K, set nlin = TRUE and estimate_K = TRUE.

NOTE: If you do not set both arguments TRUE, K will not be estimated, because nlin defaut is FALSE. Also remember that when estimating K, we don’t fit the Exponential model.

multi_fit_K <- fit_multi(
  time_col = "time",
  intensity_col = "random_y",
  data = multi_epidemic,
  strata_cols = "DPC",
  nlin = T,
  estimate_K = T
)
knitr::kable(head(multi_fit_K$Parameters), digits = 4)
DPC model y0 y0_se r r_se K K_se df CCC r_squared RSE y0_ci_lwr y0_ci_upr r_ci_lwr r_ci_upr K_ci_lwr K_ci_upr best_model
1 Gompertz 0.0018 0.0013 0.1034 0.0063 1 0.0157 49 0.9925 0.9854 0.0468 -0.0008 0.0044 0.0908 0.1160 0.9685 1.0315 1
1 Logistic 0.0354 0.0075 0.1489 0.0102 1 0.0168 49 0.9875 0.9783 0.0605 0.0202 0.0505 0.1283 0.1694 0.9663 1.0337 2
1 Monomolecular NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3
2 Gompertz 0.0019 0.0015 0.1157 0.0077 1 0.0151 49 0.9906 0.9815 0.0528 -0.0012 0.0049 0.1001 0.1312 0.9696 1.0304 1
2 Logistic 0.0348 0.0073 0.1667 0.0111 1 0.0142 49 0.9887 0.9789 0.0577 0.0201 0.0495 0.1445 0.1890 0.9715 1.0285 2
2 Monomolecular NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3

Graphical outputs

Use ggplot2 to produce elegant data visualizations of models curves and the estimated parameters.

DPCs and fitted curves

The original data and the predicted values by each model are stored in multi_fit$Data. A nice plot can be produced as follows:

multi_fit$Data %>%
  ggplot(aes(time, predicted, color = DPC)) +
  geom_point(aes(time, y), color = "gray") +
  geom_line(size = 1) +
  facet_grid(DPC ~ model, scales = "free_y") +
  theme_minimal_hgrid()+
  coord_cartesian(ylim = c(0, 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
##  Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Facet grid of observed data and fitted disease progress curves for each DPC and model combination.

Using the dplyr function filter only the model of interest can be chosen for plotting.

multi_fit$Data %>%
  filter(model == "Gompertz") %>%
  ggplot(aes(time, predicted, color = DPC)) +
  geom_point(aes(time, y),
    color = "gray",
    size = 2
  ) +
  geom_line(size = 1.2) +
  theme_minimal_hgrid() +
  labs(
    x = "Time",
    y = "Disease Intensity"
  )

Plot of observed data and fitted Gompertz curves for each DPC over time.

Apparent infection rate

The multi_fit$Parameters element is where all stats and parameters as stored. Let’s plot the estimates of the apparent infection rate.

multi_fit$Parameters %>%
  filter(model == "Gompertz") %>%
  ggplot(aes(DPC, r)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = r_ci_lwr, ymax = r_ci_upr),
    width = 0,
    size = 1
  ) +
  labs(
    x = "Time",
    y = "Apparent infection rate"
  ) +
  theme_minimal_hgrid()

Point and error-bar plot of apparent infection rate estimates with confidence intervals for each DPC.

References

Lin L (2000). A note on the concordance correlation coefficient. Biometrics 56: 324 - 325.