Notes: Bootstrapping and OOL (steps C and D) are applied only to Type 1 text co-word inputs (rows = articles).
For numeric Type 2, 3, and 4 matrices, C and D are skipped even if enabled.
A — Inter/ Intra-rater (core-theme sample)
Randomly sample a fraction of articles from the major themes (core themes).
Compare TAAA vs Reviewer 1, TAAA vs Reviewer 2, and Reviewer 1 vs Reviewer 2.
Outputs: confusion matrices + Cohen’s κ (kappa).
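As an illustration of the outputs above, here is a minimal sketch of a confusion matrix and Cohen's κ for one pair of raters. It is not the tool's own code: the function names and the example labels are hypothetical, and returning 1.0 when both raters use a single shared category is a convention assumed here (κ is formally undefined in that case).

```python
def confusion_matrix(labels_a, labels_b):
    """Return (sorted categories, count matrix); rows = rater A, columns = rater B."""
    cats = sorted(set(labels_a) | set(labels_b))
    index = {c: i for i, c in enumerate(cats)}
    matrix = [[0] * len(cats) for _ in cats]
    for a, b in zip(labels_a, labels_b):
        matrix[index[a]][index[b]] += 1
    return cats, matrix


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two equally long label lists."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    _, m = confusion_matrix(labels_a, labels_b)
    k = len(m)
    p_o = sum(m[i][i] for i in range(k)) / n                 # observed agreement
    row_totals = [sum(row) for row in m]
    col_totals = [sum(m[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)  # chance agreement
    if p_e == 1.0:
        return 1.0  # assumed convention: a single shared category counts as full agreement
    return (p_o - p_e) / (1 - p_e)


# Hypothetical theme labels for six sampled articles
taaa = ["A", "A", "B", "B", "A", "B"]
reviewer1 = ["A", "A", "B", "A", "A", "B"]
cats, matrix = confusion_matrix(taaa, reviewer1)
kappa = cohen_kappa(taaa, reviewer1)
```

The same two calls would be repeated for TAAA vs Reviewer 2 and Reviewer 1 vs Reviewer 2.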
B — External generalization (optional)
If your data contains TRUE_LABEL (external ground truth), TAAA themes are compared to it.
Outputs: confusion matrix + Cohen’s κ.
C — Bootstrapping stability (optional)
Resample the sampled articles with replacement bootstrap_reps times.
Recompute κ on each resample to quantify stability; report the mean and the 95% percentile CI (2.5th and 97.5th percentiles).
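The resampling step can be sketched as follows. This is an assumption-laden illustration rather than the tool's implementation: bootstrap_reps comes from the text, while the function names, the seed argument, and the linear-interpolation percentile are choices made here.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    # Assumed convention: return 1.0 when kappa is undefined (p_e == 1).
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def percentile(sorted_vals, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def bootstrap_kappa(a, b, bootstrap_reps=1000, seed=0):
    """Resample article indices with replacement and recompute kappa each time."""
    rng = random.Random(seed)
    n = len(a)
    kappas = []
    for _ in range(bootstrap_reps):
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both raters
        kappas.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    kappas.sort()
    return {
        "mean": sum(kappas) / len(kappas),
        "ci_low": percentile(kappas, 2.5),
        "ci_high": percentile(kappas, 97.5),
    }


# Hypothetical labels for eight sampled articles
taaa = ["A", "A", "B", "B", "A", "B", "A", "B"]
reviewer1 = ["A", "A", "B", "A", "A", "B", "B", "B"]
result = bootstrap_kappa(taaa, reviewer1, bootstrap_reps=200, seed=1)
```

Resampling indices rather than labels keeps each article's pair of ratings together, which is what makes the recomputed κ comparable to the original.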
D — OOL (holdout) robustness (optional)
Draw a random holdout subset (a holdout_frac fraction of the articles) ool_reps times.
Compute κ on each holdout subset to quantify robustness against dataset shift.
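A rough sketch of the holdout loop, under the assumption that each subset is drawn without replacement and that its size is holdout_frac times the number of articles, rounded. Only holdout_frac and ool_reps come from the text; the function names and the seed argument are hypothetical.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    # Assumed convention: return 1.0 when kappa is undefined (p_e == 1).
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def holdout_kappas(a, b, holdout_frac=0.2, ool_reps=100, seed=0):
    """Return one kappa per random holdout subset (drawn without replacement)."""
    rng = random.Random(seed)
    n = len(a)
    size = max(1, round(holdout_frac * n))  # at least one article per subset
    kappas = []
    for _ in range(ool_reps):
        idx = rng.sample(range(n), size)
        kappas.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    return kappas


# Hypothetical labels for eight articles
taaa = ["A", "A", "B", "B", "A", "B", "A", "B"]
reviewer1 = ["A", "A", "B", "A", "A", "B", "B", "B"]
kappas = holdout_kappas(taaa, reviewer1, holdout_frac=0.5, ool_reps=20, seed=3)
spread = (min(kappas), max(kappas))  # a narrow spread suggests a robust kappa
```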
E — Sankey (Theme → Term flow)
Create SankeyMATIC-compatible text describing flows from theme leaders to terms (top nodes only).
Outputs: terms_sankeymatic.txt and a ready-to-open SankeyMATIC URL in the report.
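To show what the SankeyMATIC text might look like, the sketch below emits one "Source [amount] Target" line per flow, which is SankeyMATIC's basic input syntax. The (theme_leader, term, weight) input shape, the top_n cutoff, and the function name are assumptions for illustration; building the report URL is not attempted here.

```python
def sankeymatic_lines(flows, top_n=5):
    """Build SankeyMATIC input text from (theme_leader, term, weight) tuples.

    Only the top_n heaviest terms per theme leader are kept (assumed reading
    of "top nodes only").
    """
    by_theme = {}
    for leader, term, weight in flows:
        by_theme.setdefault(leader, []).append((term, weight))
    lines = []
    for leader, items in by_theme.items():
        items.sort(key=lambda item: -item[1])  # heaviest flows first
        for term, weight in items[:top_n]:
            lines.append(f"{leader} [{weight:g}] {term}")
    return "\n".join(lines)


# Hypothetical flows: theme leader, term, flow weight
flows = [
    ("covid", "vaccine", 12),
    ("covid", "mask", 7),
    ("covid", "policy", 3),
    ("cancer", "therapy", 9),
]
text = sankeymatic_lines(flows, top_n=2)
# The text could then be written to terms_sankeymatic.txt.
```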
How it works (2 stages)
Stage 1 — Auto-classify input data (4 types)
Type 1: text in (almost) all cells (keyword/phrase occurrences per row)
Type 2: numeric square matrix (k×k) with colnames == rownames
Type 3: numeric square matrix (k×k) with colnames != rownames
Type 4: numeric rectangular matrix (m×n) with colnames != rownames
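The four-way classification could be sketched like this. The 0.9 threshold for "(almost) all cells" is an assumed value, the function names are hypothetical, and the table is represented as plain lists rather than whatever structure the tool actually uses.

```python
def is_number(cell):
    """True if the cell can be parsed as a float."""
    try:
        float(cell)
        return True
    except (TypeError, ValueError):
        return False


def classify_input(rownames, colnames, cells, text_threshold=0.9):
    """Return the input type (1 to 4) following the rules listed above."""
    filled = [c for row in cells for c in row if c not in ("", None)]
    if not filled:
        raise ValueError("table has no filled cells")
    text_share = sum(not is_number(c) for c in filled) / len(filled)
    if text_share >= text_threshold:
        return 1  # text in (almost) all cells
    if text_share > 0:
        raise ValueError("mixed text/numeric table matches none of the four types")
    square = len(rownames) == len(colnames)
    if square and list(rownames) == list(colnames):
        return 2  # square, colnames == rownames
    return 3 if square else 4  # square vs rectangular, colnames != rownames


# Hypothetical miniature inputs, one per type
t1 = classify_input(["a1", "a2"], ["k1", "k2"], [["ai", "ml"], ["nlp", "ai"]])
t2 = classify_input(["x", "y"], ["x", "y"], [[0, 2], [2, 0]])
t3 = classify_input(["x", "y"], ["p", "q"], [[1, 2], [3, 4]])
t4 = classify_input(["x", "y"], ["p", "q", "r"], [[1, 2, 3], [4, 5, 6]])
```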
Stage 2 — Convert to edges + nodes for FLCA
Type 1 → build co-occurrence edges: (term1, term2, WCD)
Type 2 → edges from the upper triangle: WCD = matrix[i,j]
Type 3 & 4 → edges from all numeric cells: term1=rowname, term2=colname, WCD=value
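The three conversions can be sketched as below. Several details are assumptions rather than documented behavior: for Type 1, WCD is taken to be the number of rows in which both terms appear; for Type 2, the diagonal is excluded; and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def edges_type1(rows):
    """Co-occurrence edges from text rows; WCD = number of rows sharing both terms."""
    counts = Counter()
    for row in rows:
        # Unique, non-empty terms per row, sorted so each pair has one orientation
        terms = sorted({t.strip() for t in row if t and t.strip()})
        counts.update(combinations(terms, 2))
    return [(a, b, w) for (a, b), w in sorted(counts.items())]


def edges_type2(names, matrix):
    """Edges from the strict upper triangle (i < j) of a symmetric matrix."""
    k = len(names)
    return [(names[i], names[j], matrix[i][j])
            for i in range(k) for j in range(i + 1, k)]


def edges_type34(rownames, colnames, matrix):
    """One edge per cell: term1 = rowname, term2 = colname, WCD = value."""
    return [(r, c, matrix[i][j])
            for i, r in enumerate(rownames)
            for j, c in enumerate(colnames)]


# Hypothetical miniature inputs
e1 = edges_type1([["ai", "ml", "nlp"], ["ai", "ml"]])
e2 = edges_type2(["x", "y", "z"], [[0, 1, 2], [1, 0, 3], [2, 3, 0]])
e34 = edges_type34(["r1"], ["c1", "c2"], [[5, 6]])
```

The node list can then be derived as the set of distinct term1 and term2 values in the edge list.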