Article | June 2025


Radiomics helps radiologists to capture information that is often imperceptible to the human eye. These high-dimensional representations of image characteristics have shown increasing promise in predicting clinical outcomes.

Despite this potential, radiomics-based models are often constrained by limited access to annotated imaging data - particularly in subgroup analyses where sample sizes are further reduced. One emerging solution is the use of synthetic medical images, but their value depends on how well they replicate the radiomic properties of real data.

<aside> 💡

What is radiomics? Radiomics is defined as the process of extracting a large number of quantitative features from medical images (e.g., CT, MRI, PET) that are typically not discernible by the naked eye. Features include shape, texture, intensity, wavelet-based attributes and more.

</aside>

Does Synthetic Data follow a similar Radiomic Feature Distribution as Real Data?

To address this question, we conducted a study to evaluate how closely synthetic CT images reflect real ones in terms of radiomic features. Specifically, we analyzed overall distribution similarity as well as fidelity within clinically relevant subgroups, aiming to assess whether synthetic data can help overcome the data scarcity problem in radiomics.

We used the NSCLC Radiomics dataset from The Cancer Imaging Archive (TCIA), which includes CT scans and clinical metadata from 416 patients. Of these, 331 were randomly selected to generate synthetic CT images.

Radiomic features were extracted from both real and synthetic images using standard feature sets. We also analyzed the clinical data to identify the factors that best explain differences between patients; the most important was histological subtype.
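
To make "radiomic feature" concrete, here is a minimal sketch of a few first-order features computed from the voxel intensities inside a region of interest. The article does not name the extraction toolkit or its settings, so the function name `first_order_features`, the `bin_width` of 25 and the sample intensities are assumptions for illustration only, not the pipeline used in the study.

```python
import math
from collections import Counter


def first_order_features(voxels, bin_width=25):
    """Compute a few first-order features from ROI voxel intensities.

    Hypothetical helper for illustration; not the study's extraction code.
    """
    n = len(voxels)
    mean = sum(voxels) / n
    variance = sum((v - mean) ** 2 for v in voxels) / n
    energy = sum(v * v for v in voxels)
    # Discretize intensities into fixed-width bins before computing entropy.
    bins = Counter(math.floor(v / bin_width) for v in voxels)
    entropy = -sum((c / n) * math.log2(c / n) for c in bins.values())
    return {"mean": mean, "variance": variance, "energy": energy, "entropy": entropy}


# Made-up intensities (Hounsfield units) standing in for a segmented tumor.
roi = [-50, -20, 0, 10, 30, 40, 60, 80]
features = first_order_features(roi)
print(features)
```

A full feature set would add shape, texture and wavelet-based descriptors, computed in the same way for real and synthetic images so the two can be compared feature by feature.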

Results show: Synthetic data mimics real data well on most of the features

1. Overall Feature Fidelity (No Fine-Tuning)

We first analyzed synthetic images generated by our model without any task-specific fine-tuning. The radiomic feature distributions of synthetic images were broadly consistent with those of real images across the full dataset.
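
One way to quantify how consistent two feature distributions are is the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distributions. The article does not state which similarity measure was used, so the sketch below is an assumption, and the sample values are invented.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


# Invented values of a single feature for real and synthetic images.
real = [1.0, 2.0, 3.0, 4.0]
synthetic_close = [1.5, 2.5, 3.5, 4.5]
synthetic_far = [10.0, 11.0, 12.0, 13.0]

print(ks_statistic(real, synthetic_close))  # small gap: similar distributions
print(ks_statistic(real, synthetic_far))    # gap of 1.0: no overlap at all
```

Values near 0 indicate that the synthetic feature follows the real distribution closely; values near 1 indicate that the two barely overlap.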

Illustration 1: UMAP with top 5 discriminative non-correlated features
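
The caption refers to the "top 5 discriminative non-correlated features" but the article does not describe how they were chosen. A plausible sketch, under that caveat, is a greedy filter: rank features by some discriminative score, then keep a feature only if it is not strongly correlated with one already kept. The function names, the 0.9 correlation threshold, the scores and the feature values below are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(columns, scores, k=5, max_abs_corr=0.9):
    """Greedily pick up to k high-scoring features that are not mutually correlated."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    selected = []
    for name in ranked:
        # Keep the feature only if it adds information beyond those already selected.
        if all(abs(pearson(columns[name], columns[s])) < max_abs_corr for s in selected):
            selected.append(name)
        if len(selected) == k:
            break
    return selected


# Invented feature values per patient and invented discriminative scores.
columns = {
    "volume": [1.0, 2.0, 3.0, 4.0],
    "surface": [2.0, 4.0, 6.0, 8.1],  # nearly a multiple of volume
    "entropy": [4.0, 1.0, 3.0, 2.0],
}
scores = {"volume": 0.9, "surface": 0.8, "entropy": 0.5}

print(select_features(columns, scores, k=2))
```

Here `surface` is dropped despite its high score because it is almost perfectly correlated with `volume`, so the selection is `volume` and `entropy`.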


2. Subgroup Analysis by Histological Subtype (With Fine-Tuning)

Clustering of the accompanying clinical metadata revealed histological subtype to be a major factor explaining the differences between patients. We therefore fine-tuned our generative model using subtype labels for three histological tumor types (adenocarcinoma, squamous-cell carcinoma and large-cell carcinoma).

This fine-tuning improved alignment between real and synthetic data within each subgroup:

Illustration 2: UMAP of all 110 features for three specific histologies
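
Subgroup fidelity can also be checked numerically by comparing real and synthetic values of a feature within each histological subtype. The sketch below uses a standardized mean difference per subgroup. This metric, the function name `subgroup_mean_gap` and all numbers are assumptions for illustration; the article reports its subgroup results through the UMAP plot rather than through this measure.

```python
from collections import defaultdict
from statistics import mean, pstdev


def subgroup_mean_gap(real_rows, synth_rows):
    """Standardized mean difference of one feature, per subgroup label.

    Each row is a (label, feature_value) pair. Assumes every real subgroup
    has non-zero spread.
    """
    def by_label(rows):
        groups = defaultdict(list)
        for label, value in rows:
            groups[label].append(value)
        return groups

    real, synth = by_label(real_rows), by_label(synth_rows)
    gaps = {}
    for label in sorted(real.keys() & synth.keys()):
        # Scale by the real-data spread so features with different units are comparable.
        spread = pstdev(real[label])
        gaps[label] = abs(mean(real[label]) - mean(synth[label])) / spread
    return gaps


# Invented values of one feature, labeled by histological subtype.
real_rows = [
    ("adenocarcinoma", 10.0), ("adenocarcinoma", 14.0),
    ("squamous-cell carcinoma", 20.0), ("squamous-cell carcinoma", 24.0),
]
synth_rows = [
    ("adenocarcinoma", 11.0), ("adenocarcinoma", 15.0),
    ("squamous-cell carcinoma", 26.0), ("squamous-cell carcinoma", 30.0),
]

gaps = subgroup_mean_gap(real_rows, synth_rows)
print(gaps)
```

In this toy example the synthetic adenocarcinoma values sit close to the real ones, while the squamous-cell values are shifted, which is the kind of subgroup-specific mismatch that an aggregate comparison can hide.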


Synthetic Data holds potential to solve data bottlenecks - especially in data-scarce environments

This study demonstrates that synthetic CT images can closely replicate the radiomic feature distributions of real NSCLC scans. Even without fine-tuning, the distributions were broadly consistent across the full dataset. When the data is stratified by clinically relevant variables such as histological subtype, targeted fine-tuning further improves the fidelity of the synthetic images within each subgroup.

Although minor discrepancies remain at the feature level, the overall agreement is strong. These findings support the use of synthetic imaging to alleviate data scarcity in radiomics-based studies. Future research should align synthetic generation more closely with specific downstream applications such as classification or survival prediction to unlock its full utility.



Contact us!

[email protected] www.ryver.ai
