Version: 0.1-1
Author: Giampiero Marra [aut, cre]
Maintainer: Giampiero Marra <giampiero.marra@ucl.ac.uk>
Title: Data Sets for Copula Additive Distributional Regression Using R
Description: Data sets used in the book Marra and Radice (2025, ISBN:9781032973111) "Copula Additive Distributional Regression Using R", for illustrating the fitting of various joint (and univariate) regression models, with several types of covariate effects, in the presence of equations' errors association.
Depends: R (≥ 3.6.0)
Suggests: GJRM
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
NeedsCompilation: no
Packaged: 2025-06-25 16:25:43 UTC; Giampiero
Repository: CRAN
Date/Publication: 2025-06-29 10:30:02 UTC

AREDS: Age-related Eye Disease Study

Description

Real dataset of bivariate interval and right censored data with 628 subjects and three covariates. The dataset is a reshaped version of the AREDS data from the CopulaCenR package. The dataset was selected from the Age-related Eye Disease Study (AREDS Group, 1999). The two events are the progression times (in years) to late-AMD in the left and right eyes.

Usage

data(areds)

Format

war is a 628 row data frame with the following columns:

t11, t12

left and right bounds of the intervals for the left eye. If t12 = NA then the observation is right-censored.

t21, t22

left and right bounds of the intervals for the right eye. If t22 = NA then the observation is right-censored.

SevScore1, SevScore2

baseline AMD severity scores for left and right eyes, respectively. Possible values are: 4, 5, 6, 7, 8.

age

age at baseline.

rs2284665

a genetic variant covariate highly associated with late-AMD progression. Possible values are: 0, 1, 2.

cens1, cens2

type of censoring for left and right eyes.

cens

joint censoring indicator for left and right eyes.

Source

Data are from:

AREDS Group (1999), The Age-Related Eye Disease Study (AREDS): design implications. AREDS report no. 1. Control Clinical Trials, 20, 573-600.


Blood pressure data in children

Description

Blood pressure data in 11 year old children. The dataset is a subsample from Solomon-Moore et al. (2020).

Usage

data(bpc)

Format

bpc is a 1052 row data frame with the following columns:

sbp

Systolic Blood Pressure (mmHg).

dbp

Diastolic Blood Pressure (mmHg).

gender

1 = Male, 2 = Female.

bmi

Body Mass Index.

mvpa

Average minutes of moderate to vigorous physical activity per day.

sed

Average sedentary minutes per day.

Source

Data are from Solomon-Moore E, Salway R, Emm-Collison L, Thompson JL, Sebire SJ, Lawlor DA, Jago R (PI), 2020.


ACDIS data

Description

Fictitious data designed to closely replicate the characteristics and patterns observed in the Africa Centre Demographic Information System (ACDIS).

Usage

data(cd4)

Format

cd4 is a 2645 row data frame with the following columns:

cd4.count

CD4 count measurements.

hiv

Binary variable indicating whether an individual is HIV positive (hiv = 1) or not (hiv = 0).

age

Age in years.

location

Three levels: PER, RUR, URB.

marital

Six levels: Married, Polygamous, Divorced/Separated/Widowed, Engaged, Never Married, Under Legal Age.

water

If present or not.

education

Four levels: None, Primary, Junior Secondary, Upper Secondary.

distance1

Km to nearest primary school.

distance2

Km to nearest secondary school.

Source

The data have been produced as described in:

Tanser F. at al., (2007), Cohort Profile: Africa Centre Demographic Information System (ACDIS) and population-based HIV survey. International Journal of Epidemiology, 37(5), 956-962.


Simulated data with two endogenous variables

Description

Simulated data with two endogenous variables and binary outcome.

Usage

data(dataDE)

Format

dataDE is a 2000 row data frame with the following columns:

y1

First endogenous variable.

y2

Second endogenous variable.

y3

Binary outcome.

x1, x2

Covariates.

x3

Covariate influencing only y1.

x4

Covariate influencing only y2.

Examples


# Data have been simulated as shown below

n  <- 2000
x1 <- round(runif(n))
x2 <- runif(n)
x3 <- runif(n)
x4 <- rnorm(n)
u  <- rnorm(n)

y1 <- ifelse(-1.55 +               x1 - x2 + x3 + u + rnorm(n) > 0, 1, 0)
y2 <- ifelse(-0.25 -           0.5*x1 + x2 + x4 + u + rnorm(n) > 0, 1, 0)
y3 <- ifelse(-0.75 + 0.5*y1 - y2 + x1 + x2      + u + rnorm(n) > 0, 1, 0)

dataDE <- data.frame(y1, y2, y3, x1, x2, x3, x4)


Simulated data with double sample selection

Description

Simulated data with double sample selection and binary outcome.

Usage

data(dataDSS)

Format

dataDSS is a 10000 row data frame with the following columns:

y1

First selection.

y2

Second selection.

y3

Binary outcome.

x1, x2

Covariates.

x3

Covariate influencing only y1.

x4

Covariate influencing only y2.

y3.o

Original outcome, without missingness.

Examples


# Data have been simulated as shown below

n  <- 10000
x1 <- round(runif(n))
x2 <- runif(n)
x3 <- runif(n)
x4 <- rnorm(n)
u  <- rnorm(n)

y1 <- ifelse(-1.55 +     x1 - x2 + x3 + u + rnorm(n) > 0, 1, 0)
y2 <- ifelse(-0.25 - 0.5*x1 + x2 + x4 + u + rnorm(n) > 0, 1, 0)
y3 <- y3.o <- ifelse( -0.75 + x1 + x2 + u + rnorm(n) > 0, 1, 0)

y2 <- y2*y1
y3 <- y3*y2
y3 <- ifelse(y2 == 0, NA, y3)

dataDSS <- data.frame(y1, y2, y3, x1, x2, x3, x4, y3.o)


World Happiness Report Data

Description

Data from the 2019 World Happiness Report, an annual publication of the United Nations Sustainable Development Solutions Network.

Usage

data(happy)

Format

happy is a 155 row data frame with the following columns:

country

Country.

gdp

Gross domestic product per capita.

support

Indicator of social support (or having someone to count on in times of trouble) calculated at national level.

hle

Indicator of healthy life expectancies at birth.

freedom

Freedom to make life choices is the national average of responses to the question: Are you satisfied or dissatisfied with your freedom to choose what you do with your life?

generosity

Generosity is the residual of regressing national average of response to the question: Have you donated money to a charity in the past month? on GDP per capita.

corruption

Corruption Perception: The measure is the national average of the survey responses to two questions in the: Is corruption widespread throughout the government or not? and Is corruption widespread within businesses or not? The overall perception is just the average of the two 0-or-1 responses.

score

Subjective well-being. 1 low, 2 medium low, 3 medium, 4 high.


Hiring Incentive Experiment - HIE

Description

Full description available at the web link below.

Usage

data(hie)

Format

hie is a 7734 row data frame with the following columns:

agree

Equal to 1 if the individual is in the HIE group and agreed to participate, and 0 if the individual is assigned to the control group or refuses to participate.

bonus

Random allocation variable equal to 1 if the individual/employer was assigned to the hiring incentive experiment group and 0 to the control group. This is the IV.

benefit

Weekly benefit amount + dependents' allowance.

unemp.dur

Weeks of benefits.

status

Equal to 1 if unemp.dur < 26 and 0 otherwise.

age

Age of claimant.

gender

1 = male and 0 = female.

ethnicity

1 = black and 0 otherwise.

prearn

Claimant's pre-claim earnings.

Source

https://www.upjohn.org/data-tools/employment-research-data-center/illinois-unemployment-incentive-experiments


HIV Zambian data

Description

HIV Zambian data by region, together with polygons describing the regions' shapes.

Usage

data(hiv)
data(hiv.polys)

Format

hiv is a 6416 row data frame with the following columns:

consent

binary variable indicating consent to test for HIV.

status

binary variable indicating whether an individual is HIV positive (status = 1) or not (status = 0).

age

age in years.

education

years of education.

wealth

wealth index.

region

code identifying region, and matching names(hiv.polys). It can take nine possible values: 1 central, 2 copperbelt, 3 eastern, 4 luapula, 5 lusaka, 6 northwestern, 7 northern, 8 southern, 9 western.

marital

never married, currently married, formerly married.

std

had a sexually transmitted disease.

highhiv

had high risk sex.

partner

number of partners.

condom

used condom during last intercourse.

aidscare

equal to 1 if would care for an HIV-infected relative.

knowsdiedofaids

equal to 1 if know someone who died of HIV.

evertestedHIV

equal to 1 if previously tested for HIV.

smoke

smoker or not.

ethnicity

bemba, lunda (luapula), lala, ushi, lamba, tonga, luvale, lunda (northwestern), mbunda, kaonde, lozi, chewa, nsenga, ngoni, mambwe, namwanga, tumbuka, other.

language

English, Bemba, Lozi, Nyanja, Tonga, other.

interviewerID

interviewer identifier.

agehadsex

age the individual had sex.

religion

four categories.

sw

survey weights.

hiv.polys contains the polygons defining the areas in the format described below.

Details

The data frame hiv relates to the regions whose boundaries are coded in hiv.polys. hiv.polys[[i]] is a 2 column matrix, containing the vertices of the polygons defining the boundary of the ith region. names(hiv.polys) matches hiv$region (order unimportant).

Source

The data have been produced as described in:

McGovern M.E., Barnighausen T., Marra G. and Radice R. (2015), On the Assumption of Joint Normality in Selection Models: A Copula Approach Applied to Estimating HIV Prevalence. Epidemiology, 26(2), 229-237.

References

Marra G., Radice R., Barnighausen T., Wood S.N. and McGovern M.E. (2017), A Simultaneous Equation Approach to Estimating HIV Prevalence with Non-Ignorable Missing Responses. Journal of the American Statistical Association, 112(518), 484-496.


U.S. hospital data from the state of Virginia

Description

Data on 978 randomly selected patients admitted between January and September 2014 to an over-500-bed medical center (Lewis Gale Medical Center) in the state of Virginia.

Usage

data(hospital)

Format

hospital is a 978 row data frame with the following columns:

los

Patient length of hospital stay (in days).

died

In-hospital mortality. 1 dead, 0 alive.

age

Age of the patient.

gender

Either male or female

bmi

Body mass index.

severity

Subjective assessment of severity level of patient. Value between 1 and 4, with 1 representing the lowest severity level.

risk

Subjective assessment of risk of dying. Value between 1 and 4, with 1 representing the lowest level.

sp02

Oxygen saturation level.

sbp

Systolic blood pressure.

dbp

Diastolic blood pressure.

pulse

Pulse rate.

respiratory

Respiratory rate.

avpu

AVPU score (A: alert, V: responding to voice, P: responding to painful stimuli, U: unresponsive).

temp

Temperature.

Source

Azadeh-Fard N, Ghaffarzadegan N, Camelio JA (2016), Can a Patient's In-Hospital Length of Stay and Mortality Be Explained by Early-Risk Assessments?, PLoS ONE 11(9): e0162976.


Infant statistic data from North Carolina

Description

Individual-level infant mortality data on 20000 randomly selected births of female babies in the U.S. state of North Carolina, in 2008, together with polygons describing the county shapes.

Usage

data(infants)
data(NC.polys)

Format

infants is a 20000 row data frame with the following columns:

county

Number code identifying North Carolina county in which birth occurred, and matching names(NC.polys). It can take 100 possible values.

age

Age of mother.

wksgest

Completed weeks of gestation.

marital

Equal to 1 if married, and 0 otherwise.

grams

Infant's birth weight.

lbw

Equal to 1 if infant's birth weight < 2500 grams, and 0 otherwise.

ethnicity

Four categories of ethnicity: White, Hispanic, Black, Other.

educ

Education of mother: Primary, Secondary, Tertiary.

smoke

Equal to 1 if smoker, and 0 otherwise.

firstbirth

Equal to 1 if it was the mother's first birth, and 0 otherwise.

ptb

Equal to 1 if completed weeks of gestation < 37.

NC.polys contains the polygons defining the areas in the format described below.

Details

The data frame infants relates to the counties whose boundaries are coded in NC.polys. NC.polys[[i]] is a 2 column matrix, containing the vertices of the polygons defining the boundary of the ith county. names(NC.polys) matches infants$county (order unimportant).

Source

The data were compiled by the North Carolina State Center for Health Statistics (https://schs.dph.ncdhhs.gov/).


MEPS: Medical Expenditure Panel Survey (year 2012)

Description

Subsample of the 2012 MEPS data, collected and published by the U.S. Agency for Healthcare Research and Quality.

Usage

data(meps)

Format

meps is a 10638 row data frame with the following columns:

general

General health: 1 excellent, 2 very good, 3 good, 4 fair, 5 poor.

mental

Mental health (as above).

bmi

Body mass index.

income

Income.

age

Age.

gender

Male 1, Female 0.

ethnicity

1 white, 2 black, 3 native american, 4 others.

education

Education in years.

region

1 Northeast, 2 Midwest, 3 South, 4 West.

hypertension

Equal to 1 if hypertension present and 0 otherwise.

hyperlipidemia

Equal to 1 if hyperlipidemia present and 0 otherwise.

dvisit

Number of doctor (physicians) visits.

ndvisit

Number of non doctor visits (non-physician providers).

dvexpend

Expenditure on doctor visits.

ndvexpend

Expenditure on non doctor visits.

Source

https://meps.ahrq.gov


Civil war data

Description

Civil war data from Fearon and Laitin (2003).

Usage

data(war)

Format

war is a 6326 row data frame with the following columns:

onset

equal to 1 for all country-years in which a civil war started.

instab

equal to 1 if unstable government.

oil

equal to 1 for oil exporter country.

cwar

equal to 1 if the country had a distinct civil war ongoing in the previous year.

gdp

GDP per capita (measured as thousands of 1985 U.S. dollars) lagged one year.

ncontig

equal to 1 for non-contiguous state.

nwstate

equal to 1 for new state.

lpop

log(population size).

lmnt

log(mountainous).

ethfrac

measure of ethnic fractionalization (calculated as the probability that two randomly drawn individuals from a country are not from the same ethnicity).

relfrac

measure of religious fractionalisation.

poldem

measure of political democracy (ranges from -10 to 10) lagged one year.

Source

Data are from:

Fearon J.D., Laitin D.D. (2003), Ethnicity, Insurgency, and Civil War. The American Political Science Review, 97, 75-90.