HomeArtificial IntelligenceModeling censored information with tfprobability

Modeling censored information with tfprobability


Nothing’s ever good, and information isn’t both. One sort of “imperfection” is lacking information, the place some options are unobserved for some topics. (A subject for an additional publish.) One other is censored information, the place an occasion whose traits we need to measure doesn’t happen within the statement interval. The instance in Richard McElreath’s Statistical Rethinking is time to adoption of cats in an animal shelter. If we repair an interval and observe wait instances for these cats that truly did get adopted, our estimate will find yourself too optimistic: We don’t have in mind these cats who weren’t adopted throughout this interval and thus, would have contributed wait instances of size longer than the entire interval.

On this publish, we use a barely much less emotional instance which nonetheless could also be of curiosity, particularly to R bundle builders: time to completion of R CMD verify, collected from CRAN and offered by the parsnip bundle as check_times. Right here, the censored portion are these checks that errored out for no matter cause, i.e., for which the verify didn’t full.

Why will we care in regards to the censored portion? Within the cat adoption situation, that is fairly apparent: We would like to have the ability to get a sensible estimate for any unknown cat, not simply these cats that can grow to be “fortunate”. How about check_times? Properly, in case your submission is a kind of that errored out, you continue to care about how lengthy you wait, so although their share is low (

How can we mannequin durations for that censored portion, the place the “true period” is unknown? Taking one step again, how can we mannequin durations normally? Making as few assumptions as doable, the most entropy distribution for displacements (in house or time) is the exponential. Thus, for the checks that truly did full, durations are assumed to be exponentially distributed.

For the others, all we all know is that in a digital world the place the verify accomplished, it might take no less than as lengthy because the given period. This amount will be modeled by the exponential complementary cumulative distribution perform (CCDF). Why? A cumulative distribution perform (CDF) signifies the chance {that a} worth decrease or equal to some reference level was reached; e.g., “the chance of durations

Let’s see this in motion.

The info

The next code works with the present secure releases of TensorFlow and TensorFlow Chance, that are 1.14 and 0.7, respectively. Should you don’t have tfprobability put in, get it from Github:

These are the libraries we’d like. As of TensorFlow 1.14, we name tf$compat$v2$enable_v2_behavior() to run with keen execution.

Apart from the verify durations we need to mannequin, check_times stories numerous options of the bundle in query, reminiscent of variety of imported packages, variety of dependencies, measurement of code and documentation information, and so forth. The standing variable signifies whether or not the verify accomplished or errored out.

df  check_times %>% choose(-bundle)
glimpse(df)
Observations: 13,626
Variables: 24
$ authors         1, 1, 1, 1, 5, 3, 2, 1, 4, 6, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,…
$ imports         0, 6, 0, 0, 3, 1, 0, 4, 0, 7, 0, 0, 0, 0, 3, 2, 14, 2, 2, 0…
$ suggests        2, 4, 0, 0, 2, 0, 2, 2, 0, 0, 2, 8, 0, 0, 2, 0, 1, 3, 0, 0,…
$ relies upon         3, 1, 6, 1, 1, 1, 5, 0, 1, 1, 6, 5, 0, 0, 0, 1, 1, 5, 0, 2,…
$ Roxygen         0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,…
$ gh              0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,…
$ rforge          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ descr           217, 313, 269, 63, 223, 1031, 135, 344, 204, 335, 104, 163,…
$ r_count         2, 20, 8, 0, 10, 10, 16, 3, 6, 14, 16, 4, 1, 1, 11, 5, 7, 1…
$ r_size          0.029053, 0.046336, 0.078374, 0.000000, 0.019080, 0.032607,…
$ ns_import       3, 15, 6, 0, 4, 5, 0, 4, 2, 10, 5, 6, 1, 0, 2, 2, 1, 11, 0,…
$ ns_export       0, 19, 0, 0, 10, 0, 0, 2, 0, 9, 3, 4, 0, 1, 10, 0, 16, 0, 2…
$ s3_methods      3, 0, 11, 0, 0, 0, 0, 2, 0, 23, 0, 0, 2, 5, 0, 4, 0, 0, 0, …
$ s4_methods      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ doc_count       0, 3, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ doc_size        0.000000, 0.019757, 0.038281, 0.000000, 0.007874, 0.000000,…
$ src_count       0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 3, 0, 0, 0, 0, 0, 0, 54, 0, 0…
$ src_size        0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000,…
$ data_count      2, 0, 0, 3, 3, 1, 10, 0, 4, 2, 2, 146, 0, 0, 0, 0, 0, 10, 0…
$ data_size       0.025292, 0.000000, 0.000000, 4.885864, 4.595504, 0.006500,…
$ testthat_count  0, 8, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0,…
$ testthat_size   0.000000, 0.002496, 0.000000, 0.000000, 0.000000, 0.000000,…
$ check_time      49, 101, 292, 21, 103, 46, 78, 91, 47, 196, 200, 169, 45, 2…
$ standing          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Of those 13,626 observations, simply 103 are censored:

0     1 
103 13523 

For higher readability, we’ll work with a subset of the columns. We use surv_reg to assist us discover a helpful and attention-grabbing subset of predictors:

survreg_fit 
  surv_reg(dist = "exponential") %>% 
  set_engine("survreg") %>% 
  match(Surv(check_time, standing) ~ ., 
      information = df)
tidy(survreg_fit) 
# A tibble: 23 x 7
   time period             estimate std.error statistic  p.worth conf.low conf.excessive
                                         
 1 (Intercept)     3.86      0.0219     176.     0.             NA        NA
 2 authors         0.0139    0.00580      2.40   1.65e- 2       NA        NA
 3 imports         0.0606    0.00290     20.9    7.49e-97       NA        NA
 4 suggests        0.0332    0.00358      9.28   1.73e-20       NA        NA
 5 relies upon         0.118     0.00617     19.1    5.66e-81       NA        NA
 6 Roxygen         0.0702    0.0209       3.36   7.87e- 4       NA        NA
 7 gh              0.00898   0.0217       0.414  6.79e- 1       NA        NA
 8 rforge          0.0232    0.0662       0.351  7.26e- 1       NA        NA
 9 descr           0.000138  0.0000337    4.10   4.18e- 5       NA        NA
10 r_count         0.00209   0.000525     3.98   7.03e- 5       NA        NA
11 r_size          0.481     0.0819       5.87   4.28e- 9       NA        NA
12 ns_import       0.00352   0.000896     3.93   8.48e- 5       NA        NA
13 ns_export      -0.00161   0.000308    -5.24   1.57e- 7       NA        NA
14 s3_methods      0.000449  0.000421     1.06   2.87e- 1       NA        NA
15 s4_methods     -0.00154   0.00206     -0.745  4.56e- 1       NA        NA
16 doc_count       0.0739    0.0117       6.33   2.44e-10       NA        NA
17 doc_size        2.86      0.517        5.54   3.08e- 8       NA        NA
18 src_count       0.0122    0.00127      9.58   9.96e-22       NA        NA
19 src_size       -0.0242    0.0181      -1.34   1.82e- 1       NA        NA
20 data_count      0.0000415 0.000980     0.0423 9.66e- 1       NA        NA
21 data_size       0.0217    0.0135       1.61   1.08e- 1       NA        NA
22 testthat_count -0.000128  0.00127     -0.101  9.20e- 1       NA        NA
23 testthat_size   0.0108    0.0139       0.774  4.39e- 1       NA        NA

Evidently if we select imports, relies upon, r_size, doc_size, ns_import and ns_export we find yourself with a mixture of (comparatively) highly effective predictors from completely different semantic areas and of various scales.

Earlier than pruning the dataframe, we save away the goal variable. In our mannequin and coaching setup, it’s handy to have censored and uncensored information saved individually, so right here we create two goal matrices as an alternative of 1:

# verify instances for failed checks
# _c stands for censored
check_time_c  df %>%
  filter(standing == 0) %>%
  choose(check_time) %>%
  as.matrix()

# verify instances for profitable checks 
check_time_nc  df %>%
  filter(standing == 1) %>%
  choose(check_time) %>%
  as.matrix()

Now we will zoom in on the variables of curiosity, organising one dataframe for the censored information and one for the uncensored information every. All predictors are normalized to keep away from overflow throughout sampling. We add a column of 1s to be used as an intercept.

df  df %>% choose(standing,
                    relies upon,
                    imports,
                    doc_size,
                    r_size,
                    ns_import,
                    ns_export) %>%
  mutate_at(.vars = 2:7, .funs = perform(x) (x - min(x))/(max(x)-min(x))) %>%
  add_column(intercept = rep(1, nrow(df)), .earlier than = 1)

# dataframe of predictors for censored information  
df_c  df %>% filter(standing == 0) %>% choose(-standing)
# dataframe of predictors for non-censored information 
df_nc  df %>% filter(standing == 1) %>% choose(-standing)

That’s it for preparations. However in fact we’re curious. Do verify instances look completely different? Do predictors – those we selected – look completely different?

Evaluating just a few significant percentiles for each courses, we see that durations for uncompleted checks are increased than these for accomplished checks all through, other than the 100% percentile. It’s not shocking that given the large distinction in pattern measurement, most period is increased for accomplished checks. In any other case although, doesn’t it appear like the errored-out bundle checks “had been going to take longer”?

accomplished 36 54 79 115 211 1343
not accomplished 42 71 97 143 293 696

How in regards to the predictors? We don’t see any variations for relies upon, the variety of bundle dependencies (other than, once more, the upper most reached for packages whose verify accomplished):

accomplished 0 1 1 2 4 12
not accomplished 0 1 1 2 4 7

However for all others, we see the identical sample as reported above for check_time. Variety of packages imported is increased for censored information in any respect percentiles apart from the utmost:

accomplished 0 0 2 4 9 43
not accomplished 0 1 5 8 12 22

Identical for ns_export, the estimated variety of exported features or strategies:

accomplished 0 1 2 8 26 2547
not accomplished 0 1 5 13 34 336

In addition to for ns_import, the estimated variety of imported features or strategies:

accomplished 0 1 3 6 19 312
not accomplished 0 2 5 11 23 297

Identical sample for r_size, the dimensions on disk of information within the R listing:

accomplished 0.005 0.015 0.031 0.063 0.176 3.746
not accomplished 0.008 0.019 0.041 0.097 0.217 2.148

And at last, we see it for doc_size too, the place doc_size is the dimensions of .Rmd and .Rnw information:

accomplished 0.000 0.000 0.000 0.000 0.023 0.988
not accomplished 0.000 0.000 0.000 0.011 0.042 0.114

Given our process at hand – mannequin verify durations bearing in mind uncensored in addition to censored information – we gained’t dwell on variations between each teams any longer; nonetheless we thought it attention-grabbing to narrate these numbers.

So now, again to work. We have to create a mannequin.

The mannequin

As defined within the introduction, for accomplished checks period is modeled utilizing an exponential PDF. That is as easy as including tfd_exponential() to the mannequin perform, tfd_joint_distribution_sequential(). For the censored portion, we’d like the exponential CCDF. This one just isn’t, as of at present, simply added to the mannequin. What we will do although is calculate its worth ourselves and add it to the “most important” mannequin chance. We’ll see this beneath when discussing sampling; for now it means the mannequin definition finally ends up easy because it solely covers the non-censored information. It’s product of simply the stated exponential PDF and priors for the regression parameters.

As for the latter, we use 0-centered, Gaussian priors for all parameters. Normal deviations of 1 turned out to work nicely. Because the priors are all the identical, as an alternative of itemizing a bunch of tfd_normals, we will create them abruptly as

tfd_sample_distribution(tfd_normal(0, 1), sample_shape = 7)

Imply verify time is modeled as an affine mixture of the six predictors and the intercept. Right here then is the entire mannequin, instantiated utilizing the uncensored information solely:

mannequin  perform(information) {
  tfd_joint_distribution_sequential(
    checklist(
      tfd_sample_distribution(tfd_normal(0, 1), sample_shape = 7),
      perform(betas)
        tfd_independent(
          tfd_exponential(
            charge = 1 / tf$math$exp(tf$transpose(
              tf$matmul(tf$solid(information, betas$dtype), tf$transpose(betas))))),
          reinterpreted_batch_ndims = 1)))
}

m  mannequin(df_nc %>% as.matrix())

All the time, we take a look at if samples from that mannequin have the anticipated shapes:

samples  m %>% tfd_sample(2)
samples
[[1]]
tf.Tensor(
[[ 1.4184642   0.17583323 -0.06547955 -0.2512014   0.1862184  -1.2662812
   1.0231884 ]
 [-0.52142304 -1.0036682   2.2664437   1.29737     1.1123234   0.3810004
   0.1663677 ]], form=(2, 7), dtype=float32)

[[2]]
tf.Tensor(
[[4.4954767  7.865639   1.8388556  ... 7.914391   2.8485563  3.859719  ]
 [1.549662   0.77833986 0.10015647 ... 0.40323067 3.42171    0.69368565]], form=(2, 13523), dtype=float32)

This seems to be positive: Now we have a listing of size two, one factor for every distribution within the mannequin. For each tensors, dimension 1 displays the batch measurement (which we arbitrarily set to 2 on this take a look at), whereas dimension 2 is 7 for the variety of regular priors and 13523 for the variety of durations predicted.

How probably are these samples?

m %>% tfd_log_prob(samples)
tf.Tensor([-32464.521   -7693.4023], form=(2,), dtype=float32)

Right here too, the form is appropriate, and the values look affordable.

The subsequent factor to do is outline the goal we need to optimize.

Optimization goal

Abstractly, the factor to maximise is the log probility of the info – that’s, the measured durations – beneath the mannequin.
Now right here the info is available in two components, and the goal does as nicely. First, we now have the non-censored information, for which

m %>% tfd_log_prob(checklist(betas, tf$solid(target_nc, betas$dtype)))

will calculate the log chance. Second, to acquire log chance for the censored information we write a customized perform that calculates the log of the exponential CCDF:

get_exponential_lccdf  perform(betas, information, goal) {
  e   tfd_independent(tfd_exponential(charge = 1 / tf$math$exp(tf$transpose(tf$matmul(
    tf$solid(information, betas$dtype), tf$transpose(betas)
  )))),
  reinterpreted_batch_ndims = 1)
  cum_prob  e %>% tfd_cdf(tf$solid(goal, betas$dtype))
  tf$math$log(1 - cum_prob)
}

Each components are mixed in somewhat wrapper perform that permits us to match coaching together with and excluding the censored information. We gained’t do this on this publish, however you could be to do it with your personal information, particularly if the ratio of censored and uncensored components is rather less imbalanced.

get_log_prob 
  perform(target_nc,
           censored_data = NULL,
           target_c = NULL) {
    log_prob  perform(betas) {
      log_prob 
        m %>% tfd_log_prob(checklist(betas, tf$solid(target_nc, betas$dtype)))
      potential 
        if (!is.null(censored_data) && !is.null(target_c))
          get_exponential_lccdf(betas, censored_data, target_c)
      else
        0
      log_prob + potential
    }
    log_prob
  }

log_prob 
  get_log_prob(
    check_time_nc %>% tf$transpose(),
    df_c %>% as.matrix(),
    check_time_c %>% tf$transpose()
  )

Sampling

With mannequin and goal outlined, we’re able to do sampling.

n_chains  4
n_burnin  1000
n_steps  1000

# hold observe of some diagnostic output, acceptance and step measurement
trace_fn  perform(state, pkr) {
  checklist(
    pkr$inner_results$is_accepted,
    pkr$inner_results$accepted_results$step_size
  )
}

# get form of preliminary values 
# to start out sampling with out producing NaNs, we are going to feed the algorithm
# tf$zeros_like(initial_betas)
# as an alternative 
initial_betas  (m %>% tfd_sample(n_chains))[[1]]

For the variety of leapfrog steps and the step measurement, experimentation confirmed {that a} mixture of 64 / 0.1 yielded affordable outcomes:

hmc  mcmc_hamiltonian_monte_carlo(
  target_log_prob_fn = log_prob,
  num_leapfrog_steps = 64,
  step_size = 0.1
) %>%
  mcmc_simple_step_size_adaptation(target_accept_prob = 0.8,
                                   num_adaptation_steps = n_burnin)

run_mcmc  perform(kernel) {
  kernel %>% mcmc_sample_chain(
    num_results = n_steps,
    num_burnin_steps = n_burnin,
    current_state = tf$ones_like(initial_betas),
    trace_fn = trace_fn
  )
}

# necessary for efficiency: run HMC in graph mode
run_mcmc  tf_function(run_mcmc)

res  hmc %>% run_mcmc()
samples  res$all_states

Outcomes

Earlier than we examine the chains, here’s a fast have a look at the proportion of accepted steps and the per-parameter imply step measurement:

0.995
0.004953894

We additionally retailer away efficient pattern sizes and the rhat metrics for later addition to the synopsis.

effective_sample_size  mcmc_effective_sample_size(samples) %>%
  as.matrix() %>%
  apply(2, imply)
potential_scale_reduction  mcmc_potential_scale_reduction(samples) %>%
  as.numeric()

We then convert the samples tensor to an R array to be used in postprocessing.

# 2-item checklist, the place every merchandise has dim (1000, 4)
samples  as.array(samples) %>% array_branch(margin = 3)

How nicely did the sampling work? The chains combine nicely, however for some parameters, autocorrelation remains to be fairly excessive.

prep_tibble  perform(samples) {
  as_tibble(samples,
            .name_repair = ~ c("chain_1", "chain_2", "chain_3", "chain_4")) %>%
    add_column(pattern = 1:n_steps) %>%
    collect(key = "chain", worth = "worth",-pattern)
}

plot_trace  perform(samples) {
  prep_tibble(samples) %>%
    ggplot(aes(x = pattern, y = worth, shade = chain)) +
    geom_line() +
    theme_light() +
    theme(
      legend.place = "none",
      axis.title = element_blank(),
      axis.textual content = element_blank(),
      axis.ticks = element_blank()
    )
}

plot_traces  perform(samples) {
  plots  purrr::map(samples, plot_trace)
  do.name(grid.organize, plots)
}

plot_traces(samples)

Trace plots for the 7 parameters.

Determine 1: Hint plots for the 7 parameters.

Now for a synopsis of posterior parameter statistics, together with the standard per-parameter sampling indicators efficient pattern measurement and rhat.

all_samples  map(samples, as.vector)

means  map_dbl(all_samples, imply)

sds  map_dbl(all_samples, sd)

hpdis  map(all_samples, ~ hdi(.x) %>% t() %>% as_tibble())

abstract  tibble(
  imply = means,
  sd = sds,
  hpdi = hpdis
) %>% unnest() %>%
  add_column(param = colnames(df_c), .after = FALSE) %>%
  add_column(
    n_effective = effective_sample_size,
    rhat = potential_scale_reduction
  )

abstract
# A tibble: 7 x 7
  param       imply     sd  decrease higher n_effective  rhat
                     
1 intercept  4.05  0.0158  4.02   4.08       508.   1.17
2 relies upon    1.34  0.0732  1.18   1.47      1000    1.00
3 imports    2.89  0.121   2.65   3.12      1000    1.00
4 doc_size   6.18  0.394   5.40   6.94       177.   1.01
5 r_size     2.93  0.266   2.42   3.46       289.   1.00
6 ns_import  1.54  0.274   0.987  2.06       387.   1.00
7 ns_export -0.237 0.675  -1.53   1.10        66.8  1.01

Posterior means and HPDIs.

Determine 2: Posterior means and HPDIs.

From the diagnostics and hint plots, the mannequin appears to work fairly nicely, however as there isn’t any easy error metric concerned, it’s laborious to know if precise predictions would even land in an applicable vary.

To ensure they do, we examine predictions from our mannequin in addition to from surv_reg.
This time, we additionally cut up the info into coaching and take a look at units. Right here first are the predictions from surv_reg:

train_test_split  initial_split(check_times, strata = "standing")
check_time_train  coaching(train_test_split)
check_time_test  testing(train_test_split)

survreg_fit 
  surv_reg(dist = "exponential") %>% 
  set_engine("survreg") %>% 
  match(Surv(check_time, standing) ~ relies upon + imports + doc_size + r_size + 
        ns_import + ns_export, 
      information = check_time_train)
survreg_fit(sr_fit)
# A tibble: 7 x 7
  time period         estimate std.error statistic  p.worth conf.low conf.excessive
                                    
1 (Intercept)  4.05      0.0174     234.    0.             NA        NA
2 relies upon      0.108     0.00701     15.4   3.40e-53       NA        NA
3 imports      0.0660    0.00327     20.2   1.09e-90       NA        NA
4 doc_size     7.76      0.543       14.3   2.24e-46       NA        NA
5 r_size       0.812     0.0889       9.13  6.94e-20       NA        NA
6 ns_import    0.00501   0.00103      4.85  1.22e- 6       NA        NA
7 ns_export   -0.000212  0.000375    -0.566 5.71e- 1       NA        NA
survreg_pred  
  predict(survreg_fit, check_time_test) %>% 
  bind_cols(check_time_test %>% choose(check_time, standing))  

ggplot(survreg_pred, aes(x = check_time, y = .pred, shade = issue(standing))) +
  geom_point() + 
  coord_cartesian(ylim = c(0, 1400))

Test set predictions from surv_reg. One outlier (of value 160421) is excluded via coord_cartesian() to avoid distorting the plot.

Determine 3: Take a look at set predictions from surv_reg. One outlier (of worth 160421) is excluded by way of coord_cartesian() to keep away from distorting the plot.

For the MCMC mannequin, we re-train on simply the coaching set and procure the parameter abstract. The code is analogous to the above and never proven right here.

We are able to now predict on the take a look at set, for simplicity simply utilizing the posterior means:

df  check_time_test %>% choose(
                    relies upon,
                    imports,
                    doc_size,
                    r_size,
                    ns_import,
                    ns_export) %>%
  add_column(intercept = rep(1, nrow(check_time_test)), .earlier than = 1)

mcmc_pred  df %>% as.matrix() %*% abstract$imply %>% exp() %>% as.numeric()
mcmc_pred  check_time_test %>% choose(check_time, standing) %>%
  add_column(.pred = mcmc_pred)

ggplot(mcmc_pred, aes(x = check_time, y = .pred, shade = issue(standing))) +
  geom_point() + 
  coord_cartesian(ylim = c(0, 1400)) 

Test set predictions from the mcmc model. No outliers, just using same scale as above for comparison.

Determine 4: Take a look at set predictions from the mcmc mannequin. No outliers, simply utilizing similar scale as above for comparability.

This seems to be good!

Wrapup

We’ve proven how you can mannequin censored information – or reasonably, a frequent subtype thereof involving durations – utilizing tfprobability. The check_times information from parsnip had been a enjoyable alternative, however this modeling method could also be much more helpful when censoring is extra substantial. Hopefully his publish has offered some steering on how you can deal with censored information in your personal work. Thanks for studying!

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments