Automated real-world data integration improves cancer outcome prediction


The Use of Machine Learning for Analyzing Real-World Data in Disease Prediction and Management: Systematic Review

Review

1College of Medicine, Qassim University, Buraidah, Saudi Arabia

2Applied Biotechnology, Faculty of Chemistry, Warsaw University of Technology, Warsaw, Poland

3Malaysian Health Technology Assessment Section, Medical Development Division, Ministry of Health Malaysia, Wilayah Persekutuan Putrajaya, Malaysia

4Health Economics and Health Technology Assessment, School of Health and Wellbeing, University of Glasgow, Glasgow, United Kingdom

5Health Sciences Research Center, Imam Mohammad ibn Saud Islamic University, Riyadh, Saudi Arabia

*these authors contributed equally

Corresponding Author:

Nasser Alotaiq, PhD

Health Sciences Research Center

Imam Mohammad ibn Saud Islamic University

Othman Bin Affan Rd. Al-Nada 13317

Riyadh

Saudi Arabia

Phone: 966 50 411 9153

Email: naalotaiq@imamu.edu.sa


Abstract

Background: Machine learning (ML) and big data analytics are rapidly transforming health care, particularly disease prediction, management, and personalized care. With the increasing availability of real-world data (RWD) from diverse sour

2025 Agenda

June 25, 2025

Biodata Stage

  • We'll discuss the importance of making multi-modal, real-world cancer patient data available and interpretable to researchers and physician scientists, and the computational tools required to analyse the data effectively.
  • Discuss methods of automated real-world data integration.
  • Demonstrate how state-of-the-art platforms allow the assimilation, storage and access of huge clinical data sets. See how this data is being used to discover new clinically actionable cancer drivers and identify new opportunities for precision medicine.
  • The importance of having structured clinical data, the tools being used to structure data, and the role of AI/ML (e.g. LLMs).

Automated real-world data integration improves cancer outcome prediction

  • Justin Jee

    (Memorial Sloan Kettering Cancer Center)

  • Christopher Fong

    (Memorial Sloan Kettering Cancer Center)

  • Karl Pichotta

    (Memorial Sloan Kettering Cancer Center)

  • Thinh Ngoc Tran

    (Memorial Sloan Kettering Cancer Center)

  • Anisha Luthra

    (Memorial Sloan Kettering Cancer Center)

  • Michele Waters

    (Memorial Sloan Kettering Cancer Center)

  • Chenlian Fu

    (Memorial Sloan Kettering Cancer Center)

  • Mirella Altoe

    (Memorial Sloan Kettering Cancer Center)

  • Si-Yang Liu

    (Memorial Sloan Kettering Cancer Center)

  • Steven B. Maron

    (Memorial Sloan Kettering Cancer Center; Dana-Farber Cancer Institute)

  • Mehnaj Ahmed

    (Memorial Sloan Kettering Cancer Center)

  • Susie Kim

    (Memorial Sloan Kettering Cancer Center)

  • Mono Pirun

    (Memorial Sloan Kettering Cancer Center)

  • Walid K. Chatila

    (Memorial Sloan Kettering Cancer Center)

  • Ino Bruijn

    (Memorial Sloan Kettering Cancer Center)

  • Arfath Pasha

    (Memorial Sloan Kettering Cancer Center)

  • Ritika Kundra

    (Memorial Sloan Kettering Cancer Center)

  • Benjamin Gross

    (Memorial Sloan Kettering Cancer Center)

  • Brooke Mastrogiacomo

    (Memorial Sloan Kettering Cancer C

    A research team from Memorial Sloan Kettering Cancer Center (MSK) is demonstrating that cancer outcome predictions can be improved by breaking down hospitals' traditional data silos and analyzing the information—including physicians' clinical notes—with the help of artificial intelligence (AI).

    A new study describes a real-time, automated approach developed at MSK that brings together doctors' free-text notes, clinical treatment and outcomes data, patient demographic data, and tumor genomic data from the MSK-IMPACT platform to identify biomarkers that can predict outcomes and likely responses to therapy. Dubbed MSK-CHORD (for Clinicogenomic Harmonized Oncologic Real-World Dataset), the effort is the largest of its kind, combining data from nearly 25,000 patients with non-small cell lung, breast, colorectal, prostate, and pancreatic cancers.
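As a rough illustration of what bringing these sources together involves, the sketch below joins per-patient records from separate sources on a shared patient identifier. It is a minimal sketch under stated assumptions, not the MSK-CHORD pipeline: the `integrate` helper, the source and field names, and the toy records are all hypothetical.

```python
from collections import defaultdict


def integrate(sources):
    """Merge per-source records into one flat dict per patient.

    `sources` maps a (hypothetical) source name such as "genomics" to a
    list of records, each of which carries a "patient_id" key.
    """
    merged = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            patient_id = record["patient_id"]
            # Prefix every field with its source name so that fields from
            # different silos cannot silently overwrite one another.
            for key, value in record.items():
                if key != "patient_id":
                    merged[patient_id][f"{source_name}.{key}"] = value
    return dict(merged)


# Hypothetical toy records for illustration only; not real patient data.
sources = {
    "demographics": [
        {"patient_id": "P1", "age": 64},
        {"patient_id": "P2", "age": 58},
    ],
    "genomics": [{"patient_id": "P1", "altered_genes": ["KRAS", "TP53"]}],
    "notes_nlp": [{"patient_id": "P1", "smoking_history": True}],
}

patients = integrate(sources)
```

In this sketch a patient who is missing from one source simply lacks those fields, and a later record from the same source overwrites an earlier one. A real system would need explicit rules for both cases.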

    The study was led by co-first authors Justin Jee, MD, Ph.D., Christopher Fong, Ph.D., Karl Pichotta, Ph.D., Thinh Ngoc Tran, Ph.D., and Anisha Luthra, and overseen by senior author Nikolaus Schultz, Ph.D., Director of MSK's Cancer Data Science Initiative. It is published in the journal Nature.

    The team found that cancer outcome predict

    Structured, Multimodal Real-World Data Can Improve Cancer Outcome Prediction

    Machine learning models are revolutionizing cancer outcome predictions by analyzing vast amounts of patient data to identify patterns and insights that traditional methods often miss. However, efforts to build these models are limited by manual extraction of key data elements from unstructured data such as clinical notes and pathology reports. This process is time-consuming and error-prone, and it does not scale.

    A recent study published in Nature demonstrates a promising way to overcome these obstacles: using AI to automatically annotate free-text clinician notes and reports, so that robust, high-performing machine learning models can be built from clinicogenomic data.
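To make the idea of annotating free text concrete, here is a deliberately simple sketch that turns a clinical note into one structured field. It uses hand-written keyword rules as a stand-in for the AI models the study describes. The patterns, the `annotate_smoking` function, and the example notes are hypothetical, and a rule set this small would miss many real phrasings.

```python
import re

# Hypothetical keyword patterns; real notes need far broader coverage.
NEGATED = re.compile(
    r"\b(never smoker|non-?smoker|denies smoking|no history of smoking)\b",
    re.IGNORECASE,
)
POSITIVE = re.compile(
    r"\b(former smoker|current smoker|ex-smoker|pack[- ]years?|quit smoking)\b",
    re.IGNORECASE,
)


def annotate_smoking(note):
    """Return True, False, or None (not mentioned) for smoking history."""
    # Check negated phrasings first so "never smoker" is not read as positive.
    if NEGATED.search(note):
        return False
    if POSITIVE.search(note):
        return True
    return None


# Hypothetical example notes, invented for illustration.
notes = [
    "Former smoker, 30 pack-year history.",
    "Patient is a never smoker.",
    "Presents with persistent cough.",
]
labels = [annotate_smoking(note) for note in notes]
```

Returning `None` when the note says nothing keeps "not mentioned" distinct from "no", a distinction that matters once such labels feed a prediction model.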

    Using Multimodal, Structured Data to Improve Model Performance and Identify Biomarkers

    The study, conducted by a team of researchers at Memorial Sloan Kettering Cancer Center (MSK), introduces the MSK-CHORD dataset, a compilation of real-world clinical, radiographic, histopathologic, laboratory, and genomic sequencing data from 24,950 patients. The researchers achieved this by combining automatically generated natural