Massive, Transformer, Openness -- 🦛 💌 Hippogram #9
The 9th edition of the Hippogram focuses on large transformer models and openness
I'm Bart de Witte, and I've been inside the health technology industry for more than 20 years as a social entrepreneur. During that time, I've witnessed and been part of the evolution of technologies that are changing the face of healthcare, business models, and culture in unexpected ways.
In my newsletter, I share knowledge and insights about building a more equitable and sustainable global digital health ecosystem. Sharing knowledge is also what Hippo AI Foundation, named after Hippocrates, focuses on, and it is an essential part of a modern Hippocratic oath. Know-how will increasingly result from the data we produce, so it's crucial to share it in our digital health systems.
Welcome to our newsletter for health and tech professionals - the bi-weekly Hippogram.
Language Models and Healthcare: the Perfect Wedding
The cool thing about having followed machine learning developments for over 10 years now is that every few years somebody invents something unimaginable that makes you totally reconsider what's possible, like models that can play Go or generate hyper-realistic faces. Today, the mind-blowing discovery rocking everyone's world is a type of neural network called a transformer.
A team of Google researchers started this new wave of unexpected progress in AI research. When they published their paper “Attention Is All You Need”, they could hardly have known that 5 years later their methods would lead to breakthroughs in natural language processing, biology, and even image classification. Google's PaLM, the largest model as of early 2022, outperforms average humans on grade school logic and math (BIG-bench) by simulating reasoning steps. Transformers are like a magical machine learning hammer that turns every problem into a nail. If you've heard of the trendy ML models BERT, GPT-3, or T5, these models are all based on transformers.
Transformers are models that can translate texts, write poems and opinion pieces, and even generate computer code. Unfortunately, most of us are unaware of how this silent revolution, the transformer, changes our world.
Transformer models such as Codex power Copilot on the software development platform GitHub, significantly increasing the productivity of software developers and thus reducing the cost of software production. Codex translates natural human language into software code. For those unfamiliar with GitHub, it's the home of 83 million developers who work together on software projects.
Understanding Clinical Notes
In healthcare, transformers have been used on all sorts of data. For example, they have been successfully applied to the clinical text de-identification problem (>99% accuracy; 96.7% precision/recall). Another example is GatorTron, a large pre-trained language model trained on a corpus of >90 billion words. Its performance was measured on clinical language-related tasks at different linguistic levels (phrase level, sentence level, and document level) using publicly available benchmark datasets from the clinical domain. It recognized clinical concepts and medical relations and outperformed biomedical and clinical machine learning models on all clinical NLP tasks. It’s logical to think that transformer models will also substantially impact the performance of medical image-to-text transcription. However, there are still only a limited number of studies examining large transformer models in the clinical domain, due to the absence of large and freely available datasets and the massive computing requirements.
The most significant breakthrough has been in biology. For those who lack the expertise but still want to understand why transformers will dramatically change the speed of innovation and, hopefully, progress, I'll try to summarise biology in a few sentences (dumber than biology for dummies).
Simplified, one could say that DNA is the cell's blueprint, whereas proteins are the machinery built from this blueprint. The DNA contains a sequence of bases, often encoded as the letters A, T, G and C. This sequence is then transcribed into RNA and translated further into the language of amino acids. Amino acids are just molecules, but they form proteins when connected into a long chain. Just as we encode bases as the letters A, T, G and C, we can also encode the 20 amino acids with letters of the alphabet. This amino-acid sequence is essential for a well-functioning protein. The idea that biological function and structure are recorded in the statistics of protein sequences selected through evolution has a long history.
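The path from DNA letters to amino-acid letters can be sketched in a few lines of Python. This is a simplified illustration, not a bioinformatics tool: the function names `transcribe` and `translate` and the example sequence are made up for this sketch, the codon table is only a tiny subset of the standard genetic code, and the DNA is assumed to be the coding strand, so transcription reduces to swapping T for U.

```python
# Tiny, illustrative subset of the standard genetic code:
# RNA codon (three bases) -> one-letter amino-acid code.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "GCU": "A",  # alanine
    "GGU": "G",  # glycine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop codon
}


def transcribe(dna: str) -> str:
    """Turn coding-strand DNA into RNA by replacing T with U (simplified)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read the RNA three bases at a time and emit amino-acid letters."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTGGTAAATGGTAA"  # hypothetical example sequence
rna = transcribe(dna)
protein = translate(rna)
print(rna)      # AUGGCUGGUAAAUGGUAA
print(protein)  # MAGKW
```

The resulting string of amino-acid letters is exactly the kind of "text" that protein language models are trained on.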
The significance of applying transformer models to biology is that, for the first time, we can see underlying mechanisms of biology that haven't been accessible to us before. We can use this growing knowledge to find new diagnostic methods or even new drugs.
New models such as the open-source RoseTTAFold or AlphaFold2 (DeepMind) can predict the 3D shape of proteins from their amino-acid sequence with, for the most part, pinpoint accuracy. The publication of both prediction models, together with their freely available, open-source code, came as an earthquake. It has led to continuous exponential growth in the number of published papers, which is further proof that open source accelerates innovation and progress faster than closing and patenting these models.
In biochemistry, researchers are now using AI to develop new neural networks that “hallucinate” proteins with new, stable structures. This breakthrough expands our capacity to understand the construction of proteins. Similar things are happening in the RNA space. Last weekend, I had a long conversation with Berlin-based Jakob Uszkoreit, who conducted deep learning research in the Google Brain research group. With his new venture, Inceptive, he is using novel deep learning techniques and transformer models to design new molecules. His long-term vision is to write RNA-based software for our cells, and he compares this work with the creation of computer software. According to Jakob, computer software can be viewed as the end product of a process that begins with writing code that is then compiled into bits. By analogy, his vision is to create a code language he calls bioprolegy, where the input describes the desired behavior and the output is RNA molecules that exhibit that behavior when they are in our cells. Only 3% of our genome describes proteins, but 90% of it is transcribed into RNA, which makes this vision a huge upheaval, and one that will take longer to realize. Longer because his team needs to develop novel wet-lab experiments at scale that are specifically optimized to generate training data.
Flaws and Limitations
The foundation models needed to create these scientific breakthroughs are not free from flaws and limitations. Whereas this might be less the case for biology, large language models are famously capable of generating toxic language full of stereotypes and harmful bias; that troubling tendency results from training data that might include hateful language. Neither the training data nor the software code of these transformer models has been opened. Given the rising capabilities of these models, this is a big issue. As I mentioned above, Google’s PaLM outperforms average humans on grade school logic and math by simulating reasoning steps. For a system that never learned mathematics, following chain-of-thought prompting is impressive enough to wake up all those who do not believe in the power of deep learning and still call it glorified statistics:
Question to PaLM: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Output: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
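The reasoning steps in that answer are easy to verify by hand or in code. The short Python sketch below simply replays the same steps; the function name `count_tennis_balls` is made up for this illustration, and the code says nothing about how the model arrives at its answer internally.

```python
def count_tennis_balls(start, cans, balls_per_can):
    """Replay the reasoning steps from the example and return them with the total."""
    bought = cans * balls_per_can  # 2 cans of 3 balls each
    total = start + bought         # balls Roger had plus balls he bought
    steps = [
        f"Roger started with {start} balls.",
        f"{cans} cans of {balls_per_can} tennis balls each is {bought} tennis balls.",
        f"{start} + {bought} = {total}.",
        f"The answer is {total}.",
    ]
    return steps, total


steps, total = count_tennis_balls(start=5, cans=2, balls_per_can=3)
print(" ".join(steps))
```

Writing out each intermediate result before the final answer is the essence of what chain-of-thought prompting asks the model to do.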
Connor Leahy, who works for Aleph Alpha in Heidelberg, Germany, is a co-founder of EleutherAI, an open-source project focused on building open-source language models. Its GPT-NeoX-20B, a 20-billion-parameter model, was until recently the largest publicly available model.
Earlier this month, Meta (the parent company of Facebook) took a big step toward transparency. For the first time, Meta will make its fully trained large language model, known as OPT-175B, available to any researcher who wants to use it for scientific scrutiny. Meta's Open Pretrained Transformer is only available for non-commercial use, which makes it semi-open, but that is still a great step forward. Meta also published its code and a logbook documenting the training process. The logbook includes daily updates from team members about the training data: how and when it was added to the model, what worked, and what didn't. Such a logbook should become an essential requirement for model certification in healthcare.
Meta isn't exactly known for being transparent and opening up its algorithms behind Facebook and Instagram. The fact that it's not available for commercial use makes it limited, but it's a big step in the right direction.
I hope our policymakers and investors are paying close attention. Openness seems to be the new hype in the field of AI research. What happens if this progress continues? Transformer models are self-supervised and no longer necessarily require labelled data to train. What they need is access to vast amounts of health data. Current models are trained on hundreds of billions of tokens. Enabling free and open access to an unprecedented volume of data while maintaining privacy and accountability should be our common goal if we want to accelerate progress in healthcare. This is a technical and engineering challenge, which is why I supported Elon Musk's vision of an open-source Twitter algorithm.
Needless to say, this can only be achieved if we de-economize and democratize access to data, which remains my main concern and focus.
Innovator of the week
Joelle Pineau, the director of Meta AI Research Labs, has been fighting for increased openness in AI for a number of years and is a significant reason behind Meta AI's move to give open access to Meta's OPT. Pineau was instrumental in changing the way research is published at some of the world's most prestigious conferences by creating a checklist of items that researchers must provide with their findings, including code and instructions about how experiments are carried out. She has championed that culture at Meta's AI lab since joining the company in 2017.
The false hope of current approaches to explainable artificial intelligence in health care
"Explainability approaches can't yet guarantee that an individual's judgment is accurate, build user confidence, or justify the use of AI suggestions in clinical practice. Until there are significant improvements in explainable AI, we must approach these systems as black boxes, justified in their usage not by just-so rationalizations, but by their dependable and experimentally proved performance.
Behind the Scenes:
Investments in digital health have accelerated the deployment of data-driven deep learning models for healthcare applications. Concerns about interpretability, fairness, and bias in healthcare situations where human lives are at stake have not been sufficiently addressed.
Why it matters:
In a recent study, published in Nature, Stanford researchers reviewed the 130 medical AI devices cleared at the time by the FDA. The researchers found that 126 out of the 130 devices were evaluated using only previously collected data, meaning that no one gauged how well the AI algorithms work on patients in combination with active human clinician input. Moreover, less than 13% of the publicly available summaries of approved device performances reported sex, gender, or race/ethnicity.
What Bart thinks:
Black box machine learning models are currently being used for high-stakes decision-making throughout society, causing problems in healthcare, criminal justice, and other domains. People have hoped that creating methods for explaining these black box models will alleviate some of these problems. Progress can only come with the development of open-source global AI models that are trained on open data repositories and built with publicly shared algorithms.
Call to Action
Julia Katrin Rohde helped us to make healthcare a bit more sustainable. Thank you Julia for spreading the word 😜
About Bart de Witte
Bart de Witte is not only a leading and distinguished expert on digital transformation in healthcare in Europe but also one of the most progressive thought leaders in his field. He focuses on developing alternative strategies for creating a more desirable future for the post-modern world and all of us. With his co-founder, Viktoria Prantauer, he founded the non-profit organisation HIPPO AI Foundation, located in Berlin.
About Hippo AI Foundation
The Hippo AI Foundation is a non-profit that accelerates the development of open-source medical AI by creating data and AI commons (i.e. data and AI as digital common goods / open source). As an altruistic "data trustee", Hippo unites, cleanses, and de-identifies data from individual and institutional data donations. This means data that is made available without reward for open-source usage that benefits communities or society at large, such as the use of breast-cancer data to improve global access to breast cancer diagnostics.
Share Your Feedback
Was it useful? Help us to improve!
With your feedback, we can improve the letter.