Is Open-Source AI the Cure for Cancer? Groundbreaking Models Released for Free

Navigating the near future: an insider's perspective.

Dear Hippogram readers,

As I predicted earlier this year, the open-source movement is dramatically "infecting" the AI development landscape in healthcare, particularly in pathology. Several startups and established companies are now releasing powerful open-source AI models, making their advances accessible to all. These models, designed to enhance cancer research, diagnostics, and patient outcomes, are available for free on platforms like GitHub and Hugging Face. In this issue, we look at the latest contributions from Paige, Bioptimus, Kaiko.ai, HistAI, and Microsoft Research, and explore how their open foundation models are setting new standards in medical AI.

Open Foundation Models from for-profit organisations

PAIGE (US-based Startup)

Paige is an AI-powered pathology solutions provider that has raised a total of $239M over 7 funding rounds. Now the company has done what few believed possible: it has released two open-source AI foundation models for pathology, Virchow (developed in collaboration with Microsoft Research) and PRISM. Both models are available on Hugging Face and aim to support researchers, diagnosticians, and developers in cancer research.

Open Paige Foundation Models

VIRCHOW:

Training Data: 1.5 million WSI (whole slide images).
Model Size: 600 million parameters.
Performance: Published in Nature Medicine, excels in pan-cancer and rare cancer detection.
Capabilities: Tile embeddings and fine-tuning.
License: Apache 2.0 License.
Model Access: Virchow Model

Additional Resources:

Paper: Virchow Performance

PRISM:

Training Data: 587,000 WSI and 195,000 clinical reports.
Performance: Pan-cancer detection, subtyping, IHC biomarkers prediction.
Capabilities: Slide embeddings and fine-tuning.
License: CC-BY-NC-ND 4.0 License.
Model Access: PRISM Model
GitHub: PRISM Code

Additional Resources:

Paper: PRISM Performance

BIOPTIMUS (France-based Startup)

Bioptimus is a French startup and Owkin spin-off focused on building universal AI foundation models for biology. It raised a $35M seed round to develop these models. Investors in the round, led by Sofinnova Partners, include Bpifrance's Large Venture fund, Frst, Cathay Innovation, Headline, Hummingbird, NJF Capital, Owkin, Top Harvest Capital and Xavier Niel.

Their flagship project, H-optimus-0, is a large-scale AI model for pathology with 1.1 billion parameters. It's trained on hundreds of millions of images from over 500,000 histopathology slides across 4,000 clinical practices. The model aims to assist in various diagnostic tasks, including cancer

H-OPTIMUS-0

Key Features:

Training Data: Over 500,000 histopathology slides.
Performance: State-of-the-art in key diagnostic tasks, high performance on both tile-level and slide-level tasks.
Applications: Identifies tissue types, tissue characteristics, biomarkers, and metastasis across various cancers.
Open Source Availability: Promotes collaboration and further advancements in pathology AI.
Model Access: GitHub repository
License: Apache 2.0

KAIKO AI (Netherlands-based Startup)

Kaiko.ai is a data and AI company that bridges medical research and clinical application by developing and integrating multimodal foundation models in oncology. It aims to unlock insights hidden within clinical data to improve patient outcomes and redefine the delivery of care. Kaiko.ai was founded and is majority owned by the Hartwig Foundation, a charitable foundation that supports medical research, education and art. The foundation established and maintains a large database and runs a central medical DNA/RNA sequencing facility. The company collaborates with leading European cancer research institutes, particularly the Netherlands Cancer Institute (NKI-AVL), to develop its multimodal foundation models and clinical AI stack.

KAIKO-1

Key Features:

Training Data: TCGA (The Cancer Genome Atlas, 20,000 primary cancer and matched normal samples spanning 33 cancer types), CRC (Colorectal Cancer) datasets, BACH, MHIST, PatchCAM, CoNSeP.
Performance: Includes evaluation framework for foundation models, facilitating thorough assessment and benchmarking.
Applications: Aims to enhance diagnostic capabilities across various pathology tasks, including tissue identification, characteristic analysis, and biomarker detection.
Open Source Availability: Available on GitHub to encourage collaboration and further development in the pathology AI field.
Model Access: https://github.com/kaiko-ai/towards_large_pathology_fms
License: Kaiko Non-Commercial Public License

Additional Resources:

Paper: arXiv

HIST AI (US-based Startup)

HistAI is a startup focused on developing large vision foundation models for pathology and computational oncology. The company is based in the United States and has not disclosed its funding partners. HistAI recently launched its first family of vision foundation models, called Hibou, which includes Hibou-L and Hibou-B. The flagship model, Hibou-L, was trained on 1.2 billion image patches and is available to subscribers of the CELLDX platform via API. Hibou-B is freely distributed under an open-source license that allows commercial use. The models were trained on over 1.1 million whole slide images from a diverse proprietary dataset encompassing human tissues, organs, and various stains. The company aims to accelerate innovation in computational pathology and to enhance drug development pipelines and laboratory diagnostics.

HIBOU-B

Training Data: Proprietary, diverse dataset of 936,441 H&E slides, 202,464 non-H&E slides, and 2,676 cytology slides.
Performance: Demonstrates high accuracy and robustness in various diagnostic tasks, from tissue identification to biomarker detection.
Applications: Enhances diagnostic capabilities across different types of pathology tasks, including detailed tissue analysis and metastasis identification.
Open Source Availability: Available on GitHub and Hugging Face to promote open collaboration and further advancements in pathology AI.
Model Access: GitHub repository and Hugging Face
License: Apache 2.0

Additional Resources:

Paper: arXiv

Microsoft Research

Microsoft, in collaboration with Providence Health System and the University of Washington, developed Prov-GigaPath. This whole-slide pathology foundation model is pretrained on 1.3 billion pathology image tiles from 171,189 digital whole-slide images. The data covers 31 major tissue types from over 30,000 patients, and the model uses Microsoft's LongNet technology for long-context modeling of whole-slide images. It is designed to capture global patterns across entire slides, improving predictions for cancer mutations and subtypes.
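
To put those figures in perspective, the short sketch below derives the average number of tiles per slide from the numbers quoted above, and then shows the simplest possible way to collapse many tile embeddings into a single slide-level vector. Mean pooling is used purely as an illustrative stand-in. It is not Prov-GigaPath's LongNet-based aggregation, and the helper name and toy vectors are hypothetical.

```python
from statistics import fmean

# Figures quoted in the text above.
TOTAL_TILES = 1_300_000_000
TOTAL_SLIDES = 171_189

# Average number of tiles per whole-slide image.
tiles_per_slide = TOTAL_TILES / TOTAL_SLIDES


def mean_pool(tile_embeddings):
    """Average tile vectors dimension by dimension (hypothetical helper)."""
    if not tile_embeddings:
        raise ValueError("need at least one tile embedding")
    return [fmean(dim) for dim in zip(*tile_embeddings)]


# Toy example: three 4-dimensional tile embeddings from one slide.
tiles = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
slide_embedding = mean_pool(tiles)

print(f"average tiles per slide: {tiles_per_slide:,.0f}")
print("slide embedding:", slide_embedding)
```

With several thousand tiles per slide on average, a plain average discards where each tile sits on the slide, which is the kind of global context a long-context model is meant to retain.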

PROV-GIGAPATH

Key Features:

Training Data: 1.3 billion image tiles from 171,189 whole slides.
Performance: State-of-the-art on 25 out of 26 digital pathology tasks.
Capabilities: Vision–language pretraining, improved predictions for cancer mutations and subtypes. Developed using Microsoft's LongNet technology.
Applications: Captures global patterns across slides, enhancing diagnostic accuracy.
Model Access: PROV-GIGAPATH
Code Access: PROV-GIGAPATH
License: Prov-GigaPath License

Comparative Table

These projects represent significant advancements in open-source AI for pathology, each offering unique features and capabilities to support research and improve diagnostic processes in the field.
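
As a plain-text stand-in for a side-by-side comparison, the sketch below collects only the training-data figures and licenses stated in this issue and prints them as an aligned table. The `ModelInfo` class, its field names, and the `render_table` helper are illustrative choices, not part of any of the projects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    """One row of the comparison (hypothetical structure)."""
    name: str
    developer: str
    training_data: str
    license: str


# Values copied from the sections above; nothing is added or estimated.
MODELS = [
    ModelInfo("Virchow", "Paige", "1.5M WSI", "Apache 2.0"),
    ModelInfo("PRISM", "Paige", "587,000 WSI + 195,000 reports", "CC-BY-NC-ND 4.0"),
    ModelInfo("H-optimus-0", "Bioptimus", "500,000+ slides", "Apache 2.0"),
    ModelInfo("KAIKO-1", "Kaiko.ai", "TCGA and other listed datasets",
              "Kaiko Non-Commercial Public License"),
    ModelInfo("Hibou-B", "HistAI", "1.1M+ WSI (proprietary)", "Apache 2.0"),
    ModelInfo("Prov-GigaPath", "Microsoft Research",
              "1.3B tiles from 171,189 WSI", "Prov-GigaPath License"),
]


def render_table(models):
    """Return the rows as left-aligned, space-padded text columns."""
    headers = ("Model", "Developer", "Training data", "License")
    rows = [headers] + [
        (m.name, m.developer, m.training_data, m.license) for m in models
    ]
    widths = [max(len(row[i]) for row in rows) for i in range(len(headers))]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    ]
    lines.insert(1, "  ".join("-" * w for w in widths))
    return "\n".join(lines)


print(render_table(MODELS))
```

Keeping the license as its own column makes it easy to see that only three of the six entries are released under Apache 2.0.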

Regulatory Scope

Open-source AI has some important implications for the In Vitro Diagnostic Medical Devices Regulation (IVDR) in the European Union:

Regulatory scope: The IVDR applies to open-source AI systems used in in vitro diagnostic medical devices, just as it does to proprietary systems. There are no broad exemptions for open-source AI under the IVDR.
Data transparency and provenance: For AI models to comply with the IVDR and the AI Act, clear documentation of the data used for training and validation is essential. This includes:
1. Data Origin: The source of the data must be clearly stated. This involves detailing whether the data is "free-to-use" and the conditions under which it was obtained and processed.
2. Validation: The AI model must be validated by a team of pathologists or other relevant experts to ensure its accuracy and reliability.
3. Representativeness: The training data should be representative of the target population, considering factors such as age distribution, ethnicity, and other demographic variables.
4. Provenance: Comprehensive provenance information must be maintained to track the history and handling of the data throughout its lifecycle. This ensures that the data's integrity and suitability for the intended diagnostic purpose can be verified.
Classification and risk assessment: Open-source AI systems used in IVDs would need to be classified and risk-assessed according to IVDR rules. The open nature of the code does not inherently change the risk classification.
Performance evaluation: Open-source AI algorithms used in IVDs would need to undergo rigorous performance evaluation as required by the IVDR. This includes analytical performance, clinical performance, and scientific validity assessments.
Technical documentation: Manufacturers using open-source AI in IVDs would need to provide comprehensive technical documentation, including details on the AI algorithm, its development process, and performance data.
Post-market surveillance: The IVDR requires ongoing monitoring and updates for AI-based IVDs. For open-source systems, this could involve tracking community updates and assessing their impact.
Transparency: While open-source AI provides inherent transparency of code, manufacturers would still need to meet IVDR requirements for providing clear information to users about the device's functioning, limitations, and performance.
Quality management: The use of open-source AI does not exempt manufacturers from implementing a quality management system as required by the IVDR.
Liability considerations: Manufacturers remain responsible for the performance and safety of their IVDs, even when using open-source AI components.
Innovation potential: Open-source AI could potentially accelerate innovation in IVDs, but regulatory compliance under the IVDR still needs to be ensured.
Challenges in change management: The collaborative and rapidly evolving nature of open-source AI may pose challenges for manufacturers in managing and documenting changes as required by the IVDR.
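
One way a manufacturer might keep track of the four data-documentation points listed above (data origin, validation, representativeness, provenance) is a simple record with a completeness check. This is a minimal sketch: the class, field names, and sample values are hypothetical and are not prescribed by the IVDR or the AI Act.

```python
from dataclasses import dataclass, fields


@dataclass
class TrainingDataRecord:
    """Hypothetical documentation record for one training dataset."""
    data_origin: str = ""         # source of the data and its terms of use
    validation: str = ""          # who validated the model, and how
    representativeness: str = ""  # age, ethnicity, other demographics
    provenance: str = ""          # handling history across the lifecycle

    def missing_items(self):
        """Return the names of documentation fields that are still empty."""
        return [
            f.name for f in fields(self)
            if not getattr(self, f.name).strip()
        ]


# Example: only two of the four points have been documented so far.
record = TrainingDataRecord(
    data_origin="Example hospital archive, research-use agreement",
    validation="Reviewed by a panel of pathologists",
)

print("still missing:", record.missing_items())
```

A check like this only shows that something has been written down for each point. Whether the content is adequate remains a question for the performance evaluation and technical documentation.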

While open-source AI offers benefits like transparency and collaborative development, it does not circumvent the regulatory requirements of the IVDR. Manufacturers using open-source AI in IVDs must still comply with all applicable IVDR provisions to ensure the safety and performance of their devices.
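
For the change-management and post-market surveillance points above, one practical building block is to pin the exact model artefact that was evaluated, so that any later change to the file is detected. The sketch below compares a file's SHA-256 digest against a recorded value. The function names and the small demo file are hypothetical; real model weights would be far larger.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned(path, pinned_digest):
    """True if the file is byte-identical to the version that was pinned."""
    return sha256_of(path) == pinned_digest


# Demo with a small stand-in file instead of real model weights.
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "model.bin"
    weights.write_bytes(b"validated weights v1")
    pinned = sha256_of(weights)  # digest recorded at validation time
    unchanged = matches_pinned(weights, pinned)

    # Simulate an upstream community update replacing the file.
    weights.write_bytes(b"community update v2")
    changed = not matches_pinned(weights, pinned)

print("unchanged before update:", unchanged)
print("change detected after update:", changed)
```

A digest check says nothing about whether an update is better or worse. It only flags that the evaluated version and the deployed version differ, which is the trigger for a documented re-assessment.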

Closing Comment

These groundbreaking releases mark the beginning of the commodification of medical knowledge, a shift that will compel many in the healthcare industry to rethink their business models. This transformative movement aligns with the vision I've been advocating and speaking about for the past five years, and I now advise investors, founders, boardrooms and strategy teams across the spectrum, from agile startups to Fortune 500 giants. Crafting a successful open-source strategy is challenging and requires thorough preparation, so feel free to contact me if you need help. As open-source AI continues to democratize access to cutting-edge medical technologies, the entire landscape of healthcare will inevitably evolve, driving innovation and improving patient outcomes on a global scale.

Credits: Special thanks go to Walter de Back and Falk Zakrzewski (co-founders of Katana Labs GmbH, an AI startup in digital pathology and spatial biology), who helped me compile this overview of foundation models.

Updated 6.08.2024: added data source for Hibou (https://arxiv.org/pdf/2406.05074)