Mortality
The goal of the analysis is to calculate the association between an exposure endpoint and death.
Data preprocessing
 Start of follow-up: 1998-01-01. We chose this date because we have complete coverage for all registries.
 End of follow-up: death or 2019-12-31, whichever comes first.
 If the date of diagnosis for the exposure endpoint is before 1998-01-01, we assume that it happened on 1998-01-01.
 The analysis is run only if there are at least 10 deaths among individuals diagnosed with the exposure endpoint.
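The preprocessing rules above can be sketched for a single individual as follows. This is a minimal illustration, not the pipeline's code: the function names and the tuple layout are hypothetical, and the sketch assumes dates are available as datetime.date values (or None when missing).

```python
from datetime import date

STUDY_START = date(1998, 1, 1)
STUDY_END = date(2019, 12, 31)


def mortality_followup(death_date, exposure_date):
    """Return (start, end, died, exposure_date) for one individual.

    death_date and exposure_date are datetime.date values, or None.
    """
    # Follow-up ends at death or at the study end, whichever comes first.
    died = death_date is not None and death_date <= STUDY_END
    end = death_date if died else STUDY_END
    # A diagnosis before the start of follow-up is moved to the start date.
    if exposure_date is not None and exposure_date < STUDY_START:
        exposure_date = STUDY_START
    return STUDY_START, end, died, exposure_date


def enough_deaths(n_deaths_among_exposed, minimum=10):
    """The analysis is run only with at least `minimum` deaths among the exposed."""
    return n_deaths_among_exposed >= minimum
```

For example, mortality_followup(date(2005, 6, 1), date(1995, 3, 2)) keeps the death date as the end of follow-up and moves the diagnosis to 1998-01-01.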
Case-cohort design
To improve computational speed, we used a case-cohort design.
Briefly, from the original cohort, we selected a subcohort at the start of follow-up. The subcohort can include individuals who died. The size of the subcohort is 10,000 individuals.
The final population includes all the individuals in the subcohort and all the individuals who died outside the subcohort.
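A minimal sketch of this sampling step, using only the Python standard library. The function name and the weighting scheme are assumptions made for illustration: individuals who died are always included and so get weight 1, while the remaining subcohort members get the inverse of the subcohort sampling fraction. The exact weights used in the analysis may differ.

```python
import random


def case_cohort_sample(ids, died, subcohort_size=10_000, seed=0):
    """Return {id: weight} for the case-cohort population.

    ids: list of individual identifiers.
    died: set of identifiers of individuals who died during follow-up.
    """
    rng = random.Random(seed)
    size = min(subcohort_size, len(ids))
    # The subcohort is drawn from the whole cohort, so it can include deaths.
    subcohort = set(rng.sample(ids, size))
    sampling_probability = size / len(ids)
    weights = {}
    for i in ids:
        if i in died:
            # Every death is kept, inside or outside the subcohort.
            weights[i] = 1.0
        elif i in subcohort:
            # Assumed scheme: survivors enter only through the subcohort.
            weights[i] = 1.0 / sampling_probability
    return weights
```

Individuals who are neither in the subcohort nor among the deaths do not appear in the returned dictionary, which is where the computational saving comes from.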
Cox regression
To perform the analyses, we used a Cox regression with a time-varying covariate, weighted by the inverse of the sampling probability to account for the case-cohort design. Robust standard errors were used. The model is defined as:
Surv(time,death) ~ exposure_endpoint + birth_year + sex

time is calculated as (date of end of follow-up - date of entry into the study), as defined in Data preprocessing. For individuals diagnosed with the exposure endpoint, time is split into two intervals: from entry until diagnosis, and from diagnosis until the end of follow-up (see below).

exposure_endpoint is treated as a time-varying covariate. This means that an individual is unexposed (the variable is set to 0) from 1998-01-01 until the diagnosis of the exposure endpoint and exposed (the variable is set to 1) after that. That is, an individual who experiences the exposure endpoint has two rows in the dataset.
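The row splitting described above can be sketched as follows. The function and the start/stop/event column names are hypothetical. The sketch also assumes that an individual diagnosed exactly at the start of follow-up gets a single exposed row rather than a zero-length unexposed row; the source does not say how that case is handled.

```python
def split_followup(entry, end, event, diagnosis=None):
    """Return start/stop rows for one individual.

    entry, end, diagnosis: times in years since the start of follow-up.
    event: 1 if the individual died at `end`, else 0.
    """
    if diagnosis is None or diagnosis >= end:
        # Never exposed during follow-up: a single unexposed row.
        return [{"start": entry, "stop": end, "exposure_endpoint": 0, "event": event}]
    if diagnosis <= entry:
        # Diagnosed at (or moved to) the start: exposed for the whole follow-up.
        return [{"start": entry, "stop": end, "exposure_endpoint": 1, "event": event}]
    return [
        # Unexposed interval: it ends at the diagnosis, never with the event.
        {"start": entry, "stop": diagnosis, "exposure_endpoint": 0, "event": 0},
        # Exposed interval: it carries the event indicator.
        {"start": diagnosis, "stop": end, "exposure_endpoint": 1, "event": event},
    ]
```

For instance, split_followup(0.0, 10.0, 1, diagnosis=4.0) yields an unexposed row covering years 0 to 4 with no event and an exposed row covering years 4 to 10 that carries the death.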
Lagged hazard ratios are computed with the following follow-up time windows: < 1 year, between 1 and 5 years, between 5 and 15 years.
The Cox regression is implemented using the lifelines library.
Absolute Risk (AR)
The absolute risk represents the probability of dying. It is defined as AR = 1 - survival_probability. The survival probability is derived using Breslow's method, assuming these values for the other covariates in the model:
 year of birth: 1959
 sex ratio: 50%
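As an illustration of the definition above: under a Cox model the survival probability at time t is S(t) = exp(-H0(t) * exp(linear predictor)), where H0(t) is the baseline cumulative hazard (the quantity Breslow's method estimates). The sketch below assumes H0(t) and the coefficients are already available, and that covariates are expressed as deviations from the reference the baseline was estimated at. The function name and all numeric values are hypothetical.

```python
import math


def absolute_risk(baseline_cumulative_hazard, coefficients, covariates):
    """AR = 1 - S(t), with S(t) = exp(-H0(t) * exp(linear predictor))."""
    linear_predictor = sum(
        coefficients[name] * covariates[name] for name in coefficients
    )
    survival_probability = math.exp(
        -baseline_cumulative_hazard * math.exp(linear_predictor)
    )
    return 1.0 - survival_probability


# Hypothetical coefficients, for illustration only.
coef = {"exposure_endpoint": 0.7, "birth_year": -0.02, "sex": 0.3}
# birth_year is 0 here because it is measured relative to the reference year.
exposed = {"exposure_endpoint": 1, "birth_year": 0, "sex": 0.5}
risk_exposed = absolute_risk(0.1, coef, exposed)
```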
Survival analyses between endpoints
Associations between endpoints are calculated loosely following the approach described in the NB-COMO study.
The goal of the analysis is to study the association between an exposure endpoint and an outcome endpoint.
For example: what is the association between a diagnosis of type 2 diabetes (exposure endpoint) and cardiovascular diseases (outcome endpoint)?
Data preprocessing
 Start of follow-up: 1998-01-01. We chose this date because we have complete coverage for all registries.
 End of follow-up: diagnosis of the outcome endpoint, death, or 2019-12-31, whichever comes first.
 Prevalent cases (i.e. individuals diagnosed with the outcome endpoint before 1998-01-01) were removed from the study. We consider only incident cases.
 If the date of diagnosis for the exposure endpoint is before 1998-01-01, we assume that it happened on 1998-01-01.
 Only consider endpoint pairs:
 with at least 10 individuals in each cell of the 2x2 contingency table between the two endpoints.
 with at least 25 individuals having the outcome endpoint.
 where the endpoints are not “overlapping”. That is, neither endpoint is a descendant of the other in the tree hierarchy, and their underlying ICD codes do not overlap.
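The three pair-selection criteria can be expressed as a single check. In this sketch the function name and argument layout are hypothetical, and the overlap test (tree hierarchy and ICD codes) is assumed to have been computed elsewhere and passed in as a boolean.

```python
def pair_is_eligible(n_both, n_exposure_only, n_outcome_only, n_neither, overlapping):
    """Apply the endpoint-pair filters to the counts of a 2x2 contingency table."""
    cells = (n_both, n_exposure_only, n_outcome_only, n_neither)
    # At least 10 individuals in each cell of the 2x2 table.
    enough_per_cell = all(n >= 10 for n in cells)
    # At least 25 individuals with the outcome endpoint (exposed or not).
    enough_outcomes = (n_both + n_outcome_only) >= 25
    return enough_per_cell and enough_outcomes and not overlapping
```

A pair with counts (10, 300, 10, 5000) passes the per-cell rule but fails the outcome rule, since only 20 individuals have the outcome endpoint.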
Case-cohort design
To improve computational speed, we used a case-cohort design.
Briefly, from the original cohort, we selected a subcohort at the start of follow-up.
The subcohort can include individuals who experience the outcome endpoint. The size of the subcohort is always 10,000 individuals, randomly selected for each analysis.
The final population includes all the individuals in the subcohort and all the individuals who experience the outcome endpoint outside the subcohort.
Cox regression
To perform the analyses, we used a Cox regression with a time-varying covariate,
weighted by the inverse of the sampling probability to account for the case-cohort design.
Robust standard errors were used. The model is defined as:
Surv(time,outcome_endpoint) ~ exposure_endpoint + birth_year + sex

time is calculated as (date of end of follow-up - date of entry into the study), as defined in Data preprocessing.
For individuals diagnosed with the exposure endpoint, time is split into two intervals: from entry until diagnosis, and from diagnosis until the end of follow-up (see below).

exposure_endpoint is treated as a time-varying covariate.
This means that an individual is unexposed (the variable is set to 0) from 1998-01-01 until the diagnosis of the exposure endpoint and exposed (the variable is set to 1) after that.
That is, an individual who experiences the exposure endpoint has two rows in the dataset.
Lagged hazard ratios are computed with the following follow-up time windows: < 1 year, between 1 and 5 years, between 5 and 15 years.
If an outcome endpoint happens outside the time window, the individual who experiences it is kept, but the outcome endpoint is not counted (i.e. the variable is set to 0).
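One way to read the time-window rule is sketched below. Several details are assumptions, since the text does not specify them: the window is measured from the exposure diagnosis, it is half-open (lower bound included, upper bound excluded), and a year is counted as 365.25 days. The function and constant names are hypothetical.

```python
from datetime import date, timedelta

DAYS_PER_YEAR = 365.25  # assumption: the text does not say how years are counted
LAG_WINDOWS = [(0, 1), (1, 5), (5, 15)]  # years: < 1, 1 to 5, 5 to 15


def outcome_in_window(exposure_date, outcome_date, min_years, max_years):
    """Return 1 if the outcome falls inside the lag window, else 0."""
    if outcome_date is None:
        return 0
    # Adding a timedelta to a date ignores the fractional-day part.
    lower = exposure_date + timedelta(days=min_years * DAYS_PER_YEAR)
    upper = exposure_date + timedelta(days=max_years * DAYS_PER_YEAR)
    # Assumed half-open window [lower, upper), measured from the exposure diagnosis.
    return 1 if lower <= outcome_date < upper else 0
```

The individual stays in the dataset in every case; only the outcome indicator changes from window to window.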
The Cox regression is implemented using the lifelines library.
Drug Statistics
The drug score is computed in a 2-step process:
 Fit the logistic model to the data:
y ~ sex + yearofbirth + yearofbirth^2 + yearatendpoint + yearatendpoint^2
 Use the fitted model to predict the probability for the following data:
 sex = 0.5, assume an even number of females and males.
 yearofbirth = 1960, the mean year of birth of the FinnGen cohort.
 yearatendpoint = 2021, predict the probability at the end of the study.
The resulting probability value is the drug score. The higher the drug score, the more likely the drug is to be taken after the given endpoint.
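The prediction step can be sketched as follows, assuming the coefficients of the logistic model have already been fitted. The function name, the coefficient dictionary, and the presence of an intercept term are assumptions made for illustration; the default argument values are the reference values listed above.

```python
import math


def drug_score(coef, sex=0.5, yearofbirth=1960, yearatendpoint=2021):
    """Predicted probability from the fitted logistic model at the reference values."""
    logit = (
        coef["intercept"]  # assumption: the fitted model includes an intercept
        + coef["sex"] * sex
        + coef["yearofbirth"] * yearofbirth
        + coef["yearofbirth^2"] * yearofbirth ** 2
        + coef["yearatendpoint"] * yearatendpoint
        + coef["yearatendpoint^2"] * yearatendpoint ** 2
    )
    # Numerically stable logistic function.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

With all coefficients set to zero the score is 0.5, the value a model with no information would predict.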