AI Scoring on Chest Radiographs to Guide Biopsy Decisions in Suspected Lung Cancer: Evidence from a Biopsy-Proven Cohort
Author Block: E. G. Kahraman, Y. Varol, A. Selver, O. Ozdemir, E. Hasbay, Y. Erol; Izmir/TR
Purpose: To develop and evaluate a novel biopsy-indication scoring system based on TorchXRayVision (TxRV), an open-source deep learning model trained on large chest radiograph datasets, aiming to support clinical decision-making in differentiating malignant from benign lung lesions.
Methods or Background: Chest radiographs of 300 patients were screened; 285 (206 malignant, 79 benign) were eligible after exclusion of anteroposterior views and cases with indeterminate pathology. TxRV outputs for 18 radiological findings were extracted. For biopsy indication, six core features (effusion, pneumonia, nodule, mass, lung lesion, opacity) were combined into three handcrafted scores (simple sum, weighted sum, maximum). For malignancy prediction, four models using all 18 features were tested: simple sum, weighted sum, logistic regression, and random forest. Statistical analyses included Mann–Whitney U tests, ROC/AUC, confusion matrices, and feature importance mapping.
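The three handcrafted biopsy scores can be illustrated with a minimal sketch. It assumes each radiograph has already been reduced to one model output per core finding; the dictionary keys, the function name `biopsy_scores`, and the weights are hypothetical, since the abstract does not report the weights actually used.

```python
# Keys follow the six core findings named in the Methods; the exact label
# strings returned by the model are an assumption here.
CORE_FEATURES = ["effusion", "pneumonia", "nodule", "mass", "lung_lesion", "opacity"]

# Hypothetical weights for illustration only; the study's weights are not reported.
EXAMPLE_WEIGHTS = {
    "effusion": 0.5, "pneumonia": 0.5, "nodule": 1.5,
    "mass": 2.0, "lung_lesion": 1.5, "opacity": 1.0,
}

def biopsy_scores(probs, weights=EXAMPLE_WEIGHTS):
    """Return the simple-sum, weighted-sum, and maximum scores for one radiograph."""
    values = [probs[name] for name in CORE_FEATURES]
    return {
        "simple_sum": sum(values),
        "weighted": sum(weights[name] * probs[name] for name in CORE_FEATURES),
        "maximum": max(values),
    }

# Made-up model outputs for a single radiograph (not study data).
example = {
    "effusion": 0.10, "pneumonia": 0.20, "nodule": 0.60,
    "mass": 0.70, "lung_lesion": 0.65, "opacity": 0.40,
}
scores = biopsy_scores(example)
```

The maximum score depends only on the single strongest finding, whereas the two sums accumulate evidence across findings; the weighted variant additionally lets findings such as mass or nodule count for more than the others.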
Results or Findings: Nodule, mass, and lung lesion scores were significantly higher in malignant cases (p < 0.01), while opacity showed a borderline association (p = 0.06). Among the handcrafted biopsy scores, the weighted score yielded the highest discriminatory capacity, with a malignancy prevalence of ~80% above the 75th percentile cut-off. Logistic regression offered an interpretable model, achieving AUC 0.71 with balanced sensitivity/specificity. Random forest demonstrated superior performance (AUC 0.94, accuracy 90%), and its feature importance confirmed that classic oncologic signs (mass, nodule, lesion) remained the strongest predictors. Precision–recall analysis supported these findings, with random forest showing F1 = 0.92, PPV 0.94, and NPV 0.85, underscoring robust diagnostic value.
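For readers less familiar with the reported metrics, the sketch below shows how PPV, NPV, sensitivity, specificity, and F1 follow from a confusion matrix, and how malignancy prevalence above a 75th percentile cut-off could be computed. The function names are hypothetical, the quantile method (Python's default "exclusive" interpolation) is an assumption, and all numbers in the usage lines are made up rather than taken from the study.

```python
import statistics

def confusion_metrics(tp, fp, fn, tn):
    """Derive standard diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall for malignant cases
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "ppv": ppv, "npv": npv, "f1": f1,
    }

def prevalence_above_p75(scores, labels):
    """Return the 75th percentile cut-off and the malignant fraction above it.

    labels: 1 = malignant, 0 = benign. The quantile method is an assumption;
    the abstract does not state how the cut-off was computed.
    """
    cutoff = statistics.quantiles(scores, n=4)[2]
    selected = [label for score, label in zip(scores, labels) if score > cutoff]
    if not selected:
        raise ValueError("no scores lie above the cut-off")
    return cutoff, sum(selected) / len(selected)

# Illustrative counts and scores only (not study data).
metrics = confusion_metrics(tp=8, fp=2, fn=2, tn=8)
cutoff, prevalence = prevalence_above_p75(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    [0, 0, 1, 0, 1, 0, 1, 1],
)
```

Note that PPV and NPV depend on the malignant-to-benign ratio of the cohort, so values obtained in a biopsy-proven sample with a high share of malignant cases would not carry over unchanged to a population with a different prevalence.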
Conclusion: TxRV-derived scoring provides an interpretable, reproducible framework to guide biopsy indication and malignancy risk stratification. Weighted scoring improved diagnostic balance, and logistic regression offered stable performance suitable for clinical translation. Such modeling may help reduce unnecessary lung biopsies.
Limitations: This was a single-center retrospective study with a modest sample size. The random forest results suggest possible overfitting, highlighting the need for external validation in larger, multi-center cohorts.
Funding for this study: None
Has your study been approved by an ethics committee?: Yes
Ethics committee - additional information: This study was approved by the Ethics Committee of Izmir City Hospital (Approval Number: 2025/364).
All procedures were conducted in accordance with the Declaration of Helsinki.