“In oncology, every data point matters in clinical research—and so does every step to make it usable.”
As oncology moves deeper into the era of precision medicine and data-driven care, one thing becomes clear: the value of meaningful insights depends on the quality and structure of the underlying data. Unfortunately, in cancer care, data doesn’t come pre-packaged. Oncology data is often disparate, siloed, messy, fragmented, and incomplete. Accurately capturing and mapping these data assets as they are generated and used is essential for transforming real-world data into real-world evidence. Real-world evidence (RWE) is playing an increasingly important role in regulatory, drug development, and healthcare decisions. As cancer care grows more complex, leveraging data from many patients treated outside clinical trials helps bridge the gap between randomized trial results and everyday clinical needs.[1]
That’s where oncology data mapping comes in, and why it is one of the most underestimated challenges in cancer analytics. Mapping oncology data is a foundational step in turning disparate clinical, molecular, and administrative data into usable real-world data, whether for research, clinical decision support, or operational reporting. Yet oncology brings a unique set of challenges that make data mapping particularly daunting when generating real-world evidence. Scientific and clinical literature has documented how technological advances and policy changes have fostered the use of real-world data (RWD) to improve clinical evidence generation. A definition centered on data collected at the point of care, as distinct from conventional clinical trial data, emphasizes how RWD, even when generated through experimental designs, offers advantages in study efficiency and in balancing internal and external validity [2, 3]. However, persistent challenges in community oncology continue to hinder the effective mapping of real-world data and the generation of real-world evidence. Below are a few of the challenges contributing to the RWD collection dilemma.
The Fragmented Landscape of Oncology Data in the Community Setting
Unlike data in many other specialties, oncology data spans many disparate and, at times, disconnected sources. These sources include:
- Electronic Health Records (EHRs)
- Pathology and radiology systems
- Genomic and biomarker labs
- Pharmacy and infusion centers
- Payer claims and billing platforms
Each of these systems has its own data structures, standards (or lack thereof), and formats. And much of the clinically relevant content—like progression status, staging details, or histology subtypes—lives in unstructured notes or scanned documents, not neatly organized in structured fields.
The Staging and Diagnosis Dilemma
One of the cornerstones of oncology is understanding cancer stage (TNM classification), yet this is notoriously difficult to extract consistently. Some practices use structured TNM fields, others rely on free text. Diagnosis codes (e.g., ICD-10) don’t always specify histological subtypes, and patients with multiple primary tumors further complicate mapping logic. A recent study found that clinical stage for any given patient was missing from the structured fields 40.6% of the time while always documented in the oncologist’s clinical note [4].
For example, identifying a patient with triple-negative breast cancer or EGFR+ non-small cell lung cancer, and deriving that patient’s clinical stage, may require piecing together data from pathology, molecular reports, and clinical notes.
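The fallback from structured fields to free text described above can be sketched as follows. This is a minimal illustration, not a production extractor: the regular expression, the function names `extract_tnm` and `resolve_stage`, and the compact "cT2N1M0" notation it targets are all assumptions made for this example. Real clinical notes use many more variants (comma-separated components, stage groups such as "IIB", multiple primaries) and would typically need NLP and clinical review.

```python
import re
from typing import Optional

# Hypothetical pattern for compact TNM strings such as "cT2N1M0" or "pT1c N0 M0".
# It deliberately ignores stage groups and other notations found in real notes.
TNM_PATTERN = re.compile(
    r"\b(?P<prefix>[cpy]{1,2})?"            # optional clinical/pathologic/post-therapy prefix
    r"T(?P<t>is|[0-4][a-d]?|[xX])\s*"       # tumor component
    r"N(?P<n>[0-3][a-c]?|[xX])\s*"          # node component
    r"M(?P<m>[01][a-c]?|[xX])\b"            # metastasis component
)


def extract_tnm(text: str) -> Optional[dict]:
    """Return the first TNM match in the text as a dict, or None if absent."""
    match = TNM_PATTERN.search(text)
    if match is None:
        return None
    return {
        "prefix": match.group("prefix") or "",
        "t": match.group("t"),
        "n": match.group("n"),
        "m": match.group("m"),
    }


def resolve_stage(structured_tnm: Optional[str], note_text: str) -> Optional[dict]:
    """Prefer the structured field; fall back to the clinical note if it is empty."""
    if structured_tnm:
        parsed = extract_tnm(structured_tnm)
        if parsed is not None:
            return {**parsed, "source": "structured"}
    parsed = extract_tnm(note_text)
    if parsed is not None:
        return {**parsed, "source": "note"}
    return None


if __name__ == "__main__":
    print(resolve_stage(None, "Assessment: cT2N1M0 invasive ductal carcinoma."))
```

Recording the `source` alongside the parsed value keeps provenance visible, so downstream analyses can tell a structured entry from a value recovered from narrative text.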
Line of Therapy
Mapping lines of therapy in oncology (e.g., first-line, second-line) sounds straightforward but involves several complications:
- Sequencing drugs and regimens over time
- Distinguishing maintenance therapy from curative therapy, and from changes prompted by progression
- Accounting for dose changes, interruptions, and intent of treatment
- Determining whether a dose adjustment, or a switch from IV to oral administration within a treatment cycle, constitutes a new line of therapy
Even defining what constitutes a new line can vary by tumor type, payer requirements, and institutional protocols. Drug regimens are documented inconsistently, and drug names also vary between brand, generic, and regimen-based references, which further complicates normalization. Without clear definitions and time-aligned treatment history, analyses of treatment pathways or outcomes lose precision.
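To make the normalization and sequencing problem concrete, here is a minimal sketch of one possible rule set. Everything in it is an assumption for illustration: the `DRUG_SYNONYMS` table, the `Administration` record, and the two thresholds (`regimen_window_days`, `max_gap_days`) are hypothetical, and the sketch ignores maintenance therapy, treatment intent, and dose changes, all of which a real definition would need to address per tumor type.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical synonym table mapping brand names to generic names.
DRUG_SYNONYMS = {
    "taxol": "paclitaxel",
    "paraplatin": "carboplatin",
    "keytruda": "pembrolizumab",
}


@dataclass(frozen=True)
class Administration:
    drug: str        # name as documented (brand or generic)
    given_on: date   # administration date


def normalize_drug(name: str) -> str:
    """Lowercase the name and map known brand names to their generic name."""
    key = name.strip().lower()
    return DRUG_SYNONYMS.get(key, key)


def assign_lines(administrations, regimen_window_days=28, max_gap_days=120):
    """Return (line_number, generic_drug, date) tuples in chronological order.

    Assumed, illustrative rules:
      - drugs first given within `regimen_window_days` of the line start
        belong to that line's regimen;
      - a drug outside the regimen given after that window starts a new line;
      - a gap longer than `max_gap_days` between administrations starts a new line.
    """
    ordered = sorted(administrations, key=lambda a: a.given_on)
    results = []
    line_number = 0
    line_start = None
    last_date = None
    regimen = set()
    for adm in ordered:
        drug = normalize_drug(adm.drug)
        new_line = line_start is None
        if not new_line:
            gap = (adm.given_on - last_date).days
            in_window = (adm.given_on - line_start).days <= regimen_window_days
            if gap > max_gap_days or (drug not in regimen and not in_window):
                new_line = True
        if new_line:
            line_number += 1
            line_start = adm.given_on
            regimen = set()
        regimen.add(drug)
        last_date = adm.given_on
        results.append((line_number, drug, adm.given_on))
    return results


if __name__ == "__main__":
    history = [
        Administration("Taxol", date(2023, 1, 2)),
        Administration("Paraplatin", date(2023, 1, 2)),
        Administration("paclitaxel", date(2023, 1, 23)),
        Administration("Keytruda", date(2023, 3, 1)),
    ]
    for row in assign_lines(history):
        print(row)
```

Keeping the thresholds as parameters reflects the point made above: what counts as a new line varies by tumor type, payer, and institution, so the rule should be configurable, not hard-coded.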
Genomics: The Wild West of Oncology Data and Clinical Evidence Generation
Genomic and biomarker testing is central to modern cancer care—but it creates chaos for data mapping:
- Test results are often returned in non-standardized formats (PDF, XML, text).
- Labs use different variant nomenclature (e.g., EGFR exon 19 vs. delE746-A750).
- Clinical actionability isn’t always annotated.
- Mapping a genomic alteration to a specific therapy (e.g., BRAF V600E → vemurafenib) requires advanced tools and curation. Although there has been progress, no plug-and-play standard has been developed to date.
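The nomenclature and actionability points above can be illustrated with a small lookup sketch. The tables `VARIANT_SYNONYMS` and `THERAPY_MAP` and both function names are hypothetical and hand-written for this example; they are not a curated knowledge base. Collapsing a specific deletion such as delE746-A750 into an exon-level label is itself an assumption that discards detail, and real actionability also depends on tumor type and clinical context.

```python
import re
from typing import List, Optional

# Hypothetical synonym table: (gene, lowercased lab label) -> canonical alteration.
VARIANT_SYNONYMS = {
    ("EGFR", "dele746-a750"): "exon 19 deletion",
    ("EGFR", "exon 19 del"): "exon 19 deletion",
    ("EGFR", "exon 19 deletion"): "exon 19 deletion",
    ("BRAF", "v600e"): "V600E",
    ("BRAF", "p.v600e"): "V600E",
    ("BRAF", "p.val600glu"): "V600E",
}

# Illustrative therapy map containing only the example pairing from the text.
THERAPY_MAP = {
    ("BRAF", "V600E"): ["vemurafenib"],
}


def normalize_variant(gene: str, raw_alteration: str) -> Optional[str]:
    """Map a lab-reported alteration to a canonical label, or None if unknown."""
    gene_key = gene.strip().upper()
    alt_key = re.sub(r"\s+", " ", raw_alteration.strip().lower())
    return VARIANT_SYNONYMS.get((gene_key, alt_key))


def candidate_therapies(gene: str, raw_alteration: str) -> List[str]:
    """Return therapies listed for the normalized variant, or an empty list."""
    canonical = normalize_variant(gene, raw_alteration)
    if canonical is None:
        return []  # unmapped variants would be routed to manual curation
    return THERAPY_MAP.get((gene.strip().upper(), canonical), [])


if __name__ == "__main__":
    print(normalize_variant("egfr", "delE746-A750"))
    print(candidate_therapies("BRAF", "p.V600E"))
```

Returning `None` or an empty list for anything outside the table, instead of guessing, makes the unmapped cases explicit so they can be reviewed by a curator.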
The Problem of Time: Temporal and Longitudinal Data Gaps
Oncology is deeply longitudinal, but data often isn’t:
- Diagnosis, progression, and survival dates are often missing or estimated.
- Patients may move between care settings or drop out of the system.
- Claims data lags behind real-time care by months.
This makes it difficult to build real-world time-to-event models like progression-free survival (PFS) or overall survival (OS) without careful mapping and imputation.
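As a rough illustration of why missing dates and loss to follow-up matter, the sketch below computes a Kaplan-Meier survival estimate from mapped dates. The function names and the input layout are assumptions for this example. It treats a patient with no recorded event as censored at the last contact date, which is one common convention and not the only defensible one, and it does not attempt any date imputation.

```python
from datetime import date
from typing import List, Optional, Tuple


def time_to_event(start: date, event_date: Optional[date],
                  last_contact: date) -> Tuple[int, bool]:
    """Return (duration_days, observed); censor at last contact if no event."""
    if event_date is not None:
        return (event_date - start).days, True
    return (last_contact - start).days, False


def kaplan_meier(observations: List[Tuple[int, bool]]) -> List[Tuple[int, float]]:
    """Return (time, survival_probability) at each distinct observed event time."""
    data = sorted(observations)
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group everyone whose duration equals t (events and censored alike).
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    cohort = [(5, True), (8, False), (12, True), (12, True), (20, False)]
    for t, s in kaplan_meier(cohort):
        print(t, round(s, 3))
```

Censored patients still count in the at-risk denominator until their last contact, which is why an inaccurate or missing last-contact date biases the curve even when the event dates themselves are correct.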
Quality, Governance & Regulatory Challenges
Several quality, governance, and regulatory challenges hinder the efficient use of technology to map oncology data, such as:
- Inconsistent documentation practices across providers and EMRs
- High missingness in key fields (e.g., stage, ECOG performance status, other performance scores)
- Delayed data availability in claims or registries
- Privacy requirements when combining structured and genomic data (HIPAA, GDPR)
“As datasets become richer, so do the risks of re-identification, especially with genomic data involved.”
Robust data governance, expert de-identification, and clear patient consent are not optional; they are essential.
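Missingness in key fields, one of the quality issues listed above, is straightforward to quantify once records are mapped to a common shape. The sketch below is illustrative only: the field names, the placeholder strings in `MISSING_MARKERS`, and the dictionary-based record layout are assumptions, not a description of any particular system.

```python
from typing import Dict, Iterable, List, Mapping

# Assumed placeholder strings that should count as "not documented".
MISSING_MARKERS = {"", "unknown", "n/a", "not documented"}


def is_missing(value) -> bool:
    """Treat None and placeholder strings as missing; 0 is a valid value."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in MISSING_MARKERS
    return False


def missingness(records: Iterable[Mapping], fields: List[str]) -> Dict[str, float]:
    """Return the fraction of records in which each field is missing."""
    rows = list(records)
    if not rows:
        # No records: report 0.0 for every field instead of dividing by zero.
        return {field: 0.0 for field in fields}
    return {
        field: sum(is_missing(row.get(field)) for row in rows) / len(rows)
        for field in fields
    }


if __name__ == "__main__":
    patients = [
        {"stage": "IIB", "ecog": 1},
        {"stage": "", "ecog": None},
        {"stage": "Unknown", "ecog": 0},
        {"ecog": 2},
    ]
    print(missingness(patients, ["stage", "ecog"]))
```

Note that an ECOG value of 0 is a documented result, so the check tests explicitly for `None` and placeholder strings and does not rely on truthiness.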
Unlocking the Power of Oncology Data
Oncology patients generate large volumes of clinical, genomic, and treatment data annually, yet community oncology practices often struggle with fragmented, unstructured information that hinders data-driven care, clinical research, and real-world evidence generation. Addressing this challenge requires more than traditional ETL—it demands a strategic competency in data mapping to curate disparate data into a clinically coherent view. Our solution meets this need by integrating clinical, molecular, and administrative data from point-of-care sources, enabling accurate, actionable real-world data for research, decision-making, and operational reporting in even the most resource-constrained settings.
References
- Miksad RA, Abernethy AP. “Harnessing the power of real-world evidence (RWE): A checklist to ensure regulatory-grade data quality.” Clinical Pharmacology & Therapeutics, 2018; 103(2): 202–205.
- Khozin S, Blumenthal GM, Pazdur R. “Real-world data for clinical evidence generation in oncology.” Nature Reviews Clinical Oncology, 2017; 14(6): 365–376.
- Makady A, de Boer A, Hillege H, Klungel O, Goettsch W. “What is real-world data? A review of definitions based on literature and stakeholder interviews.” Value in Health, 2017; 20(7): 858–865.
- Rocha CT, Hankala I, Mekuria L, McEvoy O, Walker J, Erickson R, Goede P. “Evaluation and Use of Natural Language Processing (NLP) Reasoning and Classification Models to Support Clinical Trial Patient Identification and Enrollment in the Community Oncology Setting.” Journal of the Society for Clinical Data Management, 2024; 1. doi: https://doi.org/10.47912/jscdm.363