% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/qlm_validate.R
\name{qlm_validate}
\alias{qlm_validate}
\title{Validate coded results against a gold standard}
\usage{
qlm_validate(
  ...,
  gold,
  by,
  level = NULL,
  average = c("macro", "micro", "weighted", "none"),
  ci = c("none", "analytic", "bootstrap"),
  bootstrap_n = 1000
)
}
\arguments{
\item{...}{One or more data frames, \code{qlm_coded}, or \code{as_qlm_coded} objects
containing predictions to validate. Must include a \code{.id} column and the
variable(s) specified in \code{by}. Plain data frames are automatically converted
to \code{as_qlm_coded} objects. Multiple objects will be validated separately
against the same gold standard, and results combined with a \code{rater} column
to distinguish them.}

\item{gold}{A data frame, \code{qlm_coded}, or object created with \code{\link[=as_qlm_coded]{as_qlm_coded()}}
containing gold standard annotations. Must include a \code{.id} column for joining
with objects in \code{...} and the variable(s) specified in \code{by}. Plain data frames
are automatically converted. \strong{Optional} when an object in \code{...} was
created with \code{as_qlm_coded(data, is_gold = TRUE)}; such objects are
detected automatically as the gold standard.}

\item{by}{Optional. Name(s) of the variable(s) to validate, quoted or
unquoted. If \code{NULL} (default), all coded variables are validated. Can
be a single variable (\code{by = sentiment}) or a character vector
(\code{by = c("sentiment", "rating")}).}

\item{level}{Optional. Measurement level(s) for the variable(s). Can be:
\itemize{
\item \code{NULL} (default): Auto-detect from codebook
\item Character scalar: Use same level for all variables
\item Named list: Specify level for each variable
}
Valid levels are \code{"nominal"}, \code{"ordinal"}, or \code{"interval"}.}

\item{average}{Character scalar. Averaging method for multiclass metrics
(nominal level only):
\describe{
\item{\code{"macro"}}{Unweighted mean across classes (default)}
\item{\code{"micro"}}{Aggregate contributions globally (sum TP, FP, FN)}
\item{\code{"weighted"}}{Weighted mean by class prevalence}
\item{\code{"none"}}{Return per-class metrics in addition to global metrics}
}}

\item{ci}{Confidence interval method:
\describe{
\item{\code{"none"}}{No confidence intervals (default)}
\item{\code{"analytic"}}{Analytic CIs where available (ICC, Pearson's r)}
\item{\code{"bootstrap"}}{Bootstrap CIs for all metrics via resampling}
}}

\item{bootstrap_n}{Number of bootstrap resamples when \code{ci = "bootstrap"}.
Default is 1000. Ignored when \code{ci} is \code{"none"} or \code{"analytic"}.}
}
\value{
A \code{qlm_validation} object (a tibble/data frame) with the following columns:
\describe{
\item{\code{variable}}{Name of the validated variable}
\item{\code{level}}{Measurement level used}
\item{\code{measure}}{Name of the validation metric}
\item{\code{value}}{Computed value of the metric}
\item{\code{class}}{For nominal data: averaging method used (e.g., "macro", "micro",
"weighted") or class label (when \code{average = "none"}). For ordinal/interval
data: NA (averaging not applicable).}
\item{\code{rater}}{Name of the object being validated (from input names)}
\item{\code{ci_lower}}{Lower bound of confidence interval (only if \code{ci != "none"})}
\item{\code{ci_upper}}{Upper bound of confidence interval (only if \code{ci != "none"})}
}
The object has class \code{c("qlm_validation", "tbl_df", "tbl", "data.frame")} and
attributes containing metadata (\code{n}, \code{call}).

\strong{Metrics computed by measurement level:}
\itemize{
\item \strong{Nominal:} accuracy, precision, recall, f1, kappa
\item \strong{Ordinal:} rho (Spearman's), tau (Kendall's), mae
\item \strong{Interval:} icc, r (Pearson's), mae, rmse
}

\strong{Confidence intervals:}
\itemize{
\item \code{ci = "analytic"}: Provides analytic CIs for ICC and Pearson's r only
\item \code{ci = "bootstrap"}: Provides bootstrap CIs for all metrics via resampling
}
}
\description{
Validates LLM-coded results from one or more \code{qlm_coded} objects against a
gold standard (typically human annotations) using metrics appropriate to the
measurement level. For nominal data, computes accuracy, precision, recall,
F1-score, and Cohen's kappa. For ordinal data, computes Spearman's rho,
Kendall's tau, and mean absolute error (MAE), which account for the ordering
of categories. For interval data, computes ICC, Pearson's r, MAE, and RMSE.
}
\details{
The function performs an inner join between each object in \code{...} and
\code{gold} on the \code{.id} column, so only units present in both datasets
are validated. Rows with missing values (\code{NA}) in either the predictions
or the gold standard are excluded with a warning.
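
The join-and-filter behavior described above can be sketched in base R. This
is only an illustration, not the package's internal code: the data frames
\code{pred} and \code{gold_df} and the column \code{sentiment} are
hypothetical.

\preformatted{
# Hypothetical toy data; not shipped with the package
pred    <- data.frame(.id = c(1, 2, 3, 4),
                      sentiment = c("pos", "neg", NA, "pos"))
gold_df <- data.frame(.id = c(2, 3, 4, 5),
                      sentiment = c("neg", "pos", "neg", "pos"))

# Inner join on .id: units 1 and 5 appear in only one input and are dropped
joined <- merge(pred, gold_df, by = ".id", suffixes = c("_pred", "_gold"))

# Remove rows with NA in either column: unit 3 is excluded
joined <- joined[stats::complete.cases(joined), ]
nrow(joined)  # 2 units remain (ids 2 and 4)
}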

\strong{Measurement levels:}
\itemize{
\item \strong{Nominal}: Categories with no inherent ordering (e.g., topics, sentiment
polarity). Metrics: accuracy, precision, recall, F1-score, Cohen's kappa
(unweighted).
\item \strong{Ordinal}: Categories with meaningful ordering but not necessarily equal intervals
(e.g., ratings 1-5, Likert scales). Metrics: Spearman's rho (\code{rho}, rank
correlation), Kendall's tau (\code{tau}, rank correlation), and MAE (\code{mae}, mean
absolute error). These measures account for the ordering of categories
without assuming equal intervals.
\item \strong{Interval/Ratio}: Numeric data with equal intervals (e.g., counts,
continuous measurements). Metrics: ICC (intraclass correlation), Pearson's r
(linear correlation), MAE (mean absolute error), and RMSE (root mean squared
error).
}
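
As a rough illustration of the ordinal and interval metrics listed above,
most of them can be reproduced with base R. This is a sketch under stated
assumptions: the vectors \code{pred} and \code{gold_v} are hypothetical, and
the package's own estimators (for example tie handling, or the ICC variant,
which has no base R equivalent and is omitted here) may differ.

\preformatted{
# Hypothetical numeric codes from a model and from human annotators
pred   <- c(1, 2, 2, 4, 5, 3)
gold_v <- c(1, 3, 2, 4, 4, 3)

rho  <- stats::cor(pred, gold_v, method = "spearman")  # rank correlation
tau  <- stats::cor(pred, gold_v, method = "kendall")   # rank correlation
r    <- stats::cor(pred, gold_v, method = "pearson")   # linear correlation
mae  <- mean(abs(pred - gold_v))                       # 2 / 6
rmse <- sqrt(mean((pred - gold_v)^2))                  # sqrt(2 / 6)
}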

For multiclass problems with nominal data, the \code{average} parameter controls
how per-class metrics are aggregated:
\itemize{
\item \strong{Macro averaging} computes metrics for each class independently and takes
the unweighted mean. This treats all classes equally regardless of size.
\item \strong{Micro averaging} aggregates all true positives, false positives, and
false negatives globally before computing metrics. This weights classes by
their prevalence.
\item \strong{Weighted averaging} computes metrics for each class and takes the mean
weighted by class size.
\item \strong{No averaging} (\code{average = "none"}) returns global macro-averaged metrics
plus per-class breakdown.
}
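
The difference between the averaging methods can be seen in a small worked
example for precision. This is a base R sketch with hypothetical labels, not
the package's implementation:

\preformatted{
# Hypothetical three-class problem with unequal class sizes
truth <- factor(c("a", "a", "a", "a", "a", "b", "b", "c"))
pred  <- factor(c("a", "a", "a", "a", "b", "b", "c", "c"),
                levels = levels(truth))

cm <- table(pred, truth)        # rows: predicted, columns: gold
tp <- diag(cm)                  # true positives per class
fp <- rowSums(cm) - tp          # predicted as the class, gold differs
precision <- tp / (tp + fp)     # per class: a = 1, b = 0.5, c = 0.5

macro    <- mean(precision)                           # 0.667
micro    <- sum(tp) / sum(tp + fp)                    # 0.75
weighted <- sum(precision * colSums(cm)) / sum(cm)    # 0.8125
}

Macro averaging gives the rare classes \code{b} and \code{c} the same
influence as the frequent class \code{a}, which is why it yields the lowest
value here.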

Note: The \code{average} parameter only affects precision, recall, and F1 for
nominal data. For ordinal and interval data, these metrics are not computed.
}
\examples{
# Load example coded objects
examples <- readRDS(system.file("extdata", "example_objects.rds", package = "quallmer"))

# Validate against gold standard (auto-detected)
validation <- qlm_validate(
  examples$example_coded_mini,
  examples$example_gold_standard,
  by = "sentiment",
  level = "nominal"
)
print(validation)

# Explicit gold parameter (backward compatible)
validation2 <- qlm_validate(
  examples$example_coded_mini,
  gold = examples$example_gold_standard,
  by = "sentiment",
  level = "nominal"
)
print(validation2)
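
\dontrun{
# Untested sketches using only the arguments documented above. Whether they
# run unchanged on the bundled example objects is an assumption.

# Per-class precision, recall, and F1 in addition to the global metrics
per_class <- qlm_validate(
  examples$example_coded_mini,
  gold = examples$example_gold_standard,
  by = "sentiment",
  level = "nominal",
  average = "none"
)
print(per_class)

# Bootstrap confidence intervals with fewer resamples than the default 1000
with_ci <- qlm_validate(
  examples$example_coded_mini,
  gold = examples$example_gold_standard,
  by = "sentiment",
  level = "nominal",
  ci = "bootstrap",
  bootstrap_n = 200
)
print(with_ci)

# Per-variable levels as a named list. my_coded and my_gold are hypothetical
# objects containing both a sentiment and a rating variable.
mixed <- qlm_validate(
  my_coded,
  gold = my_gold,
  by = c("sentiment", "rating"),
  level = list(sentiment = "nominal", rating = "ordinal")
)
}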

}
\seealso{
\code{\link[=qlm_compare]{qlm_compare()}} for inter-rater reliability between coded objects,
\code{\link[=qlm_code]{qlm_code()}} for LLM coding, \code{\link[=as_qlm_coded]{as_qlm_coded()}} for converting human-coded data,
\code{\link[yardstick:accuracy]{yardstick::accuracy()}}, \code{\link[yardstick:precision]{yardstick::precision()}}, \code{\link[yardstick:recall]{yardstick::recall()}},
\code{\link[yardstick:f_meas]{yardstick::f_meas()}}, \code{\link[yardstick:kap]{yardstick::kap()}}, \code{\link[yardstick:conf_mat]{yardstick::conf_mat()}}
}
