May 18, 2023

No Priors 🎙️117: The AI Will See You Now: Exploring Biomedical AI and Google’s Med-PaLM2 With Karan Singhal

What if AI could revolutionize healthcare with advanced language learning models? Sarah and Elad welcome Karan Singhal, Staff Software Engineer at Google Research, who specializes in medical AI and the development of MedPaLM2. On this episode, Karan emphasizes the importance of safety in medical AI applications and how language models like MedPaLM2 have the potential to augment scientific workflows and transform the standard of care.

Other topics include the best workflows for AI integration, the potential impact of AI on drug discoveries, how AI can serve as a physician's assistant, and how privacy-preserving machine learning and federated learning can protect patient data, while pushing the boundaries of medical innovation.

No Priors is now on YouTube! Subscribe to the channel on YouTube and like this episode.

Show Notes

[00:22] - Google's Medical AI Development
[08:57] - Medical Language Model and Med-PaLM 2 Improvements
[18:18] - Safety, cost/benefit decisions, drug discovery, health information, AI applications, and AI as a physician's assistant.
[24:51] - Privacy Concerns - HIPAA's implications, privacy-preserving machine learning, and advances in GPT-4 and Med-PaLM 2.
[37:43] - Large Language Models in Healthcare and short/long term use.


Sarah Guo: Welcome to No Priors. Today we're speaking with Karan Singhal, a researcher at Google, where he's a leader on medical AI, specifically on Med-PaLM 2, where he and a team are working on a responsible path to generative AI in healthcare. Google just announced the launch of its next-generation language model, PaLM 2, with improved multilingual, reasoning, and coding capabilities, which is behind Med-PaLM 2. So it's a great time to be speaking with Karan about everything he and his team are working on. Karan, welcome to No Priors.

Karan Singhal: Hey guys.

Sarah Guo: So you've been working in this field for a long time. Tell us about how you ended up working on medical AI at Google. I think I saw you started a fake news detector using AI as a 19-year-old.

Karan Singhal: Yeah, that was one of my first AI projects. I really got into AI thinking about how it could be used in socially responsible ways. And for me, I was thinking around the time of the 2016 election that, maybe a little bit naively, that AI-based solutions could be a bit of help for things like misinformation and detecting that. I think in the longer run, I mean I've thought of that as a more naive project, and I think in the longer run I've been thinking more about how it can help shape the trajectory of AI to be more beneficial and more broadly. And I think for me, thinking about the medical setting has been motivated largely by thinking about the fact that it's a great place to think about concerns around safety, reducing hallucination and misinformation as well here.

Thinking about how we can produce medical question answers that are less likely to be harmful and all these kinds of things. And that motivation, I think, has driven us to this point where we're really going for the jugular in terms of thinking about how to train these models and make them better in this setting. And so I'm very excited about that kind of work.

Sarah Guo: Have you been working on the medical domain your entire time with Google?

Karan Singhal: No. I mean for me, this is just something I've gotten into the last year and a half. So I've been new to it. I've been learning from an excellent team and it's been an amazing journey so far.

Sarah Guo: What else has been the most interesting in your work at Google so far?

Karan Singhal: Yeah. I started out working on representation learning and federated learning. Representation learning in particular is the technology underlying a lot of the deep neural networks of today, including GPT-3, GPT-4, and so on. It's largely about learning representations of text, of images, of other modalities such that you can efficiently encode them, you can learn from them in the future, you can generalize to new texts and images and so on. The work for this really started back in the beginning of the deep learning era in 2013 with convolutional neural networks and scaling those up, and Word2Vec around 2015 and GloVe and all these things.

And I think since then, we've been working on technologies around self-supervised learning, around doing that in a privacy-preserving way. And so, after a couple years of working on that at Google, I had the opportunity to quickly grow and start to lead a team. I got to the point where I was thinking, "Okay, I've upskilled in a lot of ways. I've gotten to the point where I can mentor many other researchers in a lot of ways. And now it's a great time to be thinking about my next thing and going for something ambitious in terms of shaping the trajectory of AI." And so, about a year and a half ago, a few of us had the idea to think about this medical setting as a setting in which these concerns are especially important and there was a ripe opportunity to think about this paradigm of foundation models and medical AI.

And so, within Google we had the opportunity to pitch what's called a Brain moonshot, which is an internal incubator program for ambitious research projects. A lot of the cool research projects that you've heard of from Google have eventually come out of this program. We pitched that and got it accepted and funded. We got the ability to get a bunch of compute and to bring other folks on board with the sponsorship of a bunch of leaders. And our first thing together was really Med-PaLM. And so that was a really amazing thing for us to be able to work on together.

Elad Gil: Can you talk a little bit about PaLM and how that's related to Med-PaLM and what PaLM is to begin with and then how Med-PaLM is different?

Karan Singhal: Yeah, absolutely. I mean, so the original Med-PaLM work built on this model called PaLM, which stands for Pathways Language Model. And so this is really an infrastructure that Google has built to be able to scale up large language model training, that is Google-wide. And so the first PaLM model was released in 2022, which was this 540B-parameter decoder-only transformer model, at the time the largest densely activated model. And it realized these breakthrough achievements in code, in multilingual capabilities, in reasoning. And so, I think a lot of the work with respect to improving benchmarks specifically that we're seeing with PaLM, Med-PaLM, GPT-4 recently, I think all comes down to a lot of the improvements that were made during the training of PaLM.

And so shortly after PaLM, there was this Minerva work where, maybe a few months after the PaLM work itself, people were able to show that on STEM benchmarks there was this zero-to-a-hundred, or zero-to-60 at least, effect where you went from random chance to solid performance across a bunch of benchmarks. And that laid the foundation for a lot of the work that Jason Wei and others have done on emergent abilities in large language models. And so for us, that was part of the motivation for looking at multiple-choice benchmarks as well for Med-PaLM. And so for Med-PaLM in particular, what we did was we took PaLM, this general large language model trained on web-scale data, and then further aligned it to the medical domain. We evaluated the base model, but also thought about, given its limitations in long-form medical question answering, things like safety, factuality, and low likelihood of outputting an answer with bias: what do we need to do to better align that model with this domain? And so, really Med-PaLM was an attempt to do that.

Elad Gil: Yeah. So basically it sounds like you started off with PaLM and PaLM was tested against a bunch of different types of tests. And so, you could take the MCAT or you could take other types of effectively tests for professional gradation or for knowledge, understanding. And then it sounds like you then said, "Hey, this seems really interesting. We're starting to get really good performance here, and so can we do something that's in the medical domain specifically." And that was Med-PaLM. And so how did you do that alignment that you mentioned, was it some form of RLHF? Was it some other form of fine-tuning? Was it how you trained the model to begin with? What was the difference in terms of Med-PaLM versus PaLM?

Karan Singhal: Yeah, absolutely. I mean, when we tried evaluating PaLM in the medical setting, we noticed that it was, out of the box on multiple-choice questions, performing pretty well. And when we took a variation of PaLM, the Flan-PaLM model, which was again work from Jason Wei and team, this is an instruction-tuned model, a model that's been trained to follow instructions better. Again, it was able to perform quite well out of the box. And this was the first model that was able to perform above the pass mark on the MedQA set of USMLE-style questions. But then what we noticed is that when we evaluated it on long-form medical question answering, actually getting the model to generate a response, there were a lot of limitations, and when we compared that to clinician performance, it actually didn't do super well.

And so really that was the motivation for that Med-PaLM-specific alignment. What we did there was instruction prompt tuning, a technique we explored in the Med-PaLM paper. It's a data-efficient technique, one that doesn't require too much data to work, which matters because getting labels from doctors is expensive. We took a bunch of expert demonstrations of good behavior from doctors and used them to tune the parameters of the model, in a way that's a little bit more learned than prompting but also less expensive than full fine-tuning.

Elad Gil: And so you did that and then, I guess if you start looking at this now shift from PaLM to PaLM 2 and from Med-PaLM to Med-PaLM 2, did you basically just reproduce that same approach for Med-PaLM 2 or did you do anything different there?

Karan Singhal: Yeah. This is the work of many folks other than myself, so just to preface it with that. I mean, I think a few things have been important. One is better objectives for pre-training, using something like a mixture of training objectives. And so that's been something that's been crucial. This is work that started with UL2, a paper that was released also last year. And then two other things ended up being super important. One is following the optimal scaling laws that were empirically evaluated again in this work. And I think there have been a few works that have tried to do this from OpenAI and DeepMind. And again, this work tried to understand in this context, what are the optimal scaling laws with respect to data and compute, and how do you trade those things off?

And so this paper, again, found something similar to the Chinchilla paper, which was that the total amount of data being used for these models was relatively low compared to the number of parameters and that if we wanted to add in more data, we could do so and we could train a better model in a more compute efficient way. So this model also did that. So that's an important improvement as well. And the third thing was improvements in the data that were used to train the model. And so this especially focused on multilingual data, including more multilingual data and more code data in a bunch of different coding languages as well.
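
As a rough illustration of the data-versus-parameters trade-off described above, here is a minimal sketch. It rests on two approximations that are not stated in the conversation and are assumptions here: the common estimate that training compute is about 6 x parameters x tokens, and the rule of thumb often quoted from the Chinchilla work of roughly 20 training tokens per parameter. The function names and the FLOPs budget are hypothetical, and this is not the fitting procedure used for PaLM 2.

```python
import math


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the common C ~= 6 * N * D estimate."""
    return 6.0 * n_params * n_tokens


def compute_optimal_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget into (parameters, tokens) at a fixed tokens-per-parameter ratio.

    With C = 6 * N * D and D = r * N, solving for N gives N = sqrt(C / (6 * r)).
    The default ratio r = 20 is an assumed rule of thumb, not a fitted value.
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


if __name__ == "__main__":
    # 540B is the PaLM size mentioned in the conversation; the ratio is an assumption.
    print(f"{20.0 * 540e9:.3g} tokens would pair with 540B parameters at 20 tokens/param")
    n, d = compute_optimal_split(1e24)  # hypothetical compute budget
    print(f"1e24 FLOPs -> {n:.3g} parameters, {d:.3g} tokens")
```

Under these assumptions, a model that is large relative to its training data would do better, for the same compute, as a smaller model trained on more tokens.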

Sarah Guo: Maybe just zooming out a little bit in terms of when you might apply some of these different techniques to align a model to a specific domain, do you have a framework in your mind for why you might do full pre-training from scratch. Why you might do fine-tuning, why you might do a more efficient form of fine-tuning and when you can just get away with prompt tuning or prompting. How do you think about that?

Karan Singhal: Yeah. This is a great question. I think it really comes down to the data that's available, both in quantity and relevance to a particular topic. I think if you have an infinite supply of data that's relevant for the specific problem that you're trying to solve, then probably the best thing to do is pre-train everything from scratch and do everything end to end, if you don't mind the compute and money as well. If you are working on a task in which general pre-training data on the web confers general advantages to that task, and that could be domain knowledge or could be general abilities like reasoning, which is very applicable across many tasks, which I think is the case for medical reasoning as well, then I think it makes a lot of sense to build on top of an existing model, especially if you're sensitive to things like cost or compute, which most people are these days.

And so I think on that spectrum between things like prompting and prompt tuning, all the way to full fine-tuning, I think it largely comes down to... So given an existing pre-trained model, which is I think a big hurdle for most teams and most people, to train a large-scale pre-trained model, then the question is do you prompt it? Do you prompt tune it? Do you full fine-tune it? I think that largely comes down to data. If you have three to five examples, let's say, then I would prompt it. If you have maybe 10 or 50 examples, it would either be prompt tuning or fine-tuning. I think generally in that realm, prompt tuning and fine-tuning perform similarly. And I would prefer prompt tuning if you're at all sensitive to things like compute or cost. If you care about the best performance and you have more than a hundred examples, then probably fine-tuning is your best bet. And it's not as expensive as full pre-training if you're doing it with a model that's been pre-trained, of course.
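
The rule of thumb in this answer can be written down as a small decision function. This is only a sketch of the heuristic as stated: the function name is hypothetical, and how to treat the ranges the answer does not mention (6 to 9 examples, and 51 to 100) is an assumption made here to keep the function total.

```python
def suggest_adaptation(num_examples: int, cost_sensitive: bool = True) -> str:
    """Suggest how to adapt an existing pre-trained model, given labeled example count.

    Thresholds follow the rough numbers from the conversation. The gaps between
    the stated ranges are folded into the middle bucket, which is an assumption.
    """
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    if num_examples <= 5:
        # A handful of examples: put them directly in the prompt.
        return "prompting"
    if num_examples <= 100:
        # Tens of examples: the two options perform similarly, so cost decides.
        return "prompt tuning" if cost_sensitive else "fine-tuning"
    # More than a hundred examples and best performance is the goal.
    return "fine-tuning"


if __name__ == "__main__":
    for n in (4, 30, 500):
        print(n, suggest_adaptation(n))
```

Full pre-training from scratch is left out on purpose, since the answer reserves it for the case of effectively unlimited relevant data and budget.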

Sarah Guo: When you thought about evaluation of this model, you must have been surveying the landscape for the other medical, probably science specific and then medical specific models. What's out there and how did you guys think about evaling and changing eval?

Karan Singhal: Yeah, absolutely. And this is not the first work to explore the potential of a large language model in science or biomedicine. And so I think it's important to acknowledge all the work that's come before us. What we saw when we first came into this work and tried to understand what other models existed, what other evaluation had been done, was that one, there were a few exciting works from other teams, like Galactica or BioGPT and so on, that we thought we could learn from and benefit from. And so that was a really exciting thing to be able to see. And the second thing we saw was that there was a bit of a shortage of a systematic way of doing evaluation of these models. And so it didn't feel like there was a systematic way to think about automated evaluation of the clinical knowledge of these models. So for example, via multiple-choice benchmarks, there were a few popular benchmarks like the MedQA benchmark, but it varied across papers what benchmarks they were studying.

In some cases, we felt like these benchmarks were not high quality. And so that was one thing that we saw. I mean, another thing that we saw, which was more acute, I think, was a lack of detailed human evaluation across many of these works. And so there were some steps in this direction that we were able to build on, but I think for the most part, a lot of these models that already existed didn't have detailed human evaluation given a use case like medical question answering. And so I think that, to us, was a significant limitation as we think about the real-world potential of these models, because when it comes down to it, we have to make sure that it actually serves humans and is beneficial to humans. And so for us, that was a significant motivator for the Med-PaLM work being relatively evaluation-forward and thinking carefully about human evaluation with both physicians and lay people.

Elad Gil: How do you think about where that bar is? Because I think it's one of those things that, having started a medical-centric company before, on the one hand, you really want to be cautious in terms of providing people back information that's accurate. And so when I was working actively on the operating side of Color, we spent a lot of time agonizing over ensuring that the results that we provided back to patients were as accurate as possible, particularly in the context of anything that had to do with core genetic or other information. The flip side of it is, I remember I took my son to the emergency room when he was younger and the doctor said, "I'm going to go research this case and I'll be right back." And I had to go ask him another follow-up question and I go around the corner and he's in his cube, literally googling the symptoms. And so-

Sarah Guo: Oh no.

Elad Gil: It wasn't like he had some deep accurate source, he was just making things up effectively. I mean, I've seen Google results and you're clicking around and he was just clicking around. I was like, "Oh my gosh." And I could see the query. So I knew he was looking at my kid's symptoms, he had no idea. And so there's this bar from, hey, it needs to be incredibly accurate and correct on through to, well, the state of the art actually isn't that amazing in many circumstances. And so how do you think about the right quality bar for these sorts of things, in terms of real use application or practice?

Karan Singhal: That's an amazing, great question. I think as you said, there are two competing forces here. Obviously the stakes are high in the medical setting and, counterfactually, you want to make sure that the information you provide versus the information they would've otherwise gotten is actually high quality. And so you have to be very careful as you think about any informational use case for these models. At the same time, I think it's useful to recognize that people are searching for health information online and indecision is a decision as well. A large percentage, roughly 10%, of searches on the internet are for health information. And some of these are coming from physicians themselves, as you were mentioning. And so I think that there is a responsibility to think about how to shepherd this technology carefully and safely towards that real-world impact for patient health information.

And I think that is crucial as well. And I think one thing that has been missing from our work so far is really grounded evaluations in a specific use case in a workflow to show that there is a benefit, both in terms of safety in the short term and in terms of long term patient outcomes as well. And so, I think that could be a health informational use case, it could be other clinical workflows, but I think that's one thing that we have to really make sure we do and are careful about before any real world use case here.

Elad Gil: Yeah, that makes sense. Yeah, it definitely feels like in the medical world, the importance of safety is paramount and at the same time there's very little cost benefit being done anymore. And so there's interviews with Janssen and other giants of the industry basically saying, "We need to think about the benefit side, not just the cost side or the safety side." And what you're working on, I think is so important in terms of if you think of the really big areas of societal impact, it's what you folks are doing. If you could provide amazing health equity globally for everyone in terms of this information, how powerful is that? I mean, that's fundamental and maybe education is the other one. And it feels like AI really has a promise in both of these areas. And so I always worry about how do you make sure that this can get to market because it's so valuable but there's going to be all these regulatory or safety obstacles that in some cases are merited, but in some cases may actually prevent the emergence of really important applications.

So I think it's awesome that you folks are working on all this and are being so thoughtful about it. How do you think about what workflows this is going to be most useful for? So if you look at a lot of the bio or biomedical AI companies, for some reason they keep doing drug development. A, why do you think that is? Because this seems like such an important part of healthcare and probably the bigger driver of healthcare efficacy. And so A, why is everybody just going and building another protein folding model or molecular company and B, where you think are the best applications of what you've been working on?

Karan Singhal: Yeah, these are great questions. I think on the drug discovery front, there's a bit of a playbook here which any new company here looking for some revenue in the short term can follow. And that could be a safe option. There are, for example, existing AI augmented pipelines for doing things like given small molecule chemistry, predicting things like absorption or toxicity. And it's relatively easy to see that some of the more modern models if placed into these pipelines, could perform better. And so there's a relatively safe bet there. And so I think that probably accounts for a lot of the popularity of that as a use case. I totally agree that there is a chance to go for the jugular here, in terms of health information for example. And so I think this is something that is going to be crucial but I think it is also something where a lot of the big players are more risk averse.

And so the people who gave access to health information or provide access to health information are also thinking not maybe super counter-factually, about the positive benefits of things and they're thinking more about the risks. And so I think that is also a concern that's been slowing folks down both in terms of big companies and smaller companies. And I think there is an opportunity to think more about that and what that could look like. And I think the company that gets that right or the set of companies that get that right, I think will also have a seat at the conversation when it comes to policy and regulation and things like that. And so they have the chance to shape what this looks like for the future. And so, I think that's going to be potentially quite impactful.

Elad Gil: Yeah. It seems very exciting because if you look at healthcare, it's 20% of GDP. Pharmaceuticals are about 20% of that, and then drug development is a fraction of that. So really what folks are focused on in terms of the types of models that you're building, is at least 16% of GDP. Maybe it's more than that. If some of the pharma stuff is more clinical decision making around who gets a certain pharmaceutical. Do you view this as a technology that's initially a physician's assistant? Do you view it as something that helps with adjudication of medical claims and billing? There's so many places where this can insert. I'm just curious, where do you think you'll see this technology popping up first?

Karan Singhal: Yeah, I think we're already starting to see it in some clinical workflows when it comes to documentation and billing. I think there are a lot of companies and people thinking about taking models like GPT-4 and applying them in that setting. And I think that is definitely going to be something, and I think that is also going to be something where players like Epic are going to be able to partner with existing models and I think potentially deliver real value there. And I think that's very exciting and I think that's something that also general domain models will be potentially quite good at as well. I think where there might be more of a need for specialized models is when it comes down to higher-stakes workflows, and I think that might look in the short term more like a physician's assistant.

And so imagine, for example, an agent that can work with the radiologist, help them interpret a scan, and leverage the benefits of AI to help contextualize it with a patient's medical record or any previous scans or different angles of scans that a patient has had, to help a radiologist write a more accurate report. I think that's the thing which is in the sweet spot of being feasible today and leveraging the benefits of AI in terms of taking in additional context and potential multi-modality and all these things. And it's also potentially in a sweet spot with respect to regulation as well. And so I think that's something that could happen in the medium to short term.

Elad Gil: How do you architect a model or workflow in this context to deal with things like HIPAA or patient privacy? So I feel like healthcare data is unique from the context of what you're allowed to do in terms of who you send it to, with what permissions from users. So is it just that you have to get the right user opt-in and then it's fine, or is there extra work that you need to do in terms of blinding data or doing other things relative to the prompts or queries you're sending in?

Karan Singhal: Yeah, it's a great question. I mean, I think this is something that people are just trying right now and just seeing what happens. And it's interesting, people are just putting patient information into GPT-4, sometimes they're redacting information and all these things. I mean, I think the ideal way to do this obviously is more privacy-forward, I think in terms of building trust with the relevant stakeholders and all these things. I think a starting point is just models that are able to automatically redact very sensitive information from being sent further down a pipeline. I think that's very low-hanging fruit that many people can do. There's also potential for HIPAA compliance within organizations. So I know some organizations working in the space are partially HIPAA compliant or are trying to make that claim. And I think that's something that's useful and I think that's something that we should work towards as well.

I think in the longer run, a lot of these concerns are actually unclear in terms of how things will work out. I think there is a bigger question about software of unknown provenance and how that will be used and regulated in the future. There could be some situation in which these things actually end up being very hard to scale up and apply in the real world for high-stakes settings, but I think we'll probably end up with a scenario where it'll become obvious that we need to and that we must, and that doing so will improve patient outcomes. And so then I think it'll be time to have a serious conversation about what regulating these models and making sure privacy concerns are mitigated looks like. And I think we have yet to have that discussion.
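
The "automatically redact before sending further down a pipeline" idea from this answer can be sketched with a few regular expressions. This is a deliberately simple, hypothetical example: the patterns and placeholder labels are assumptions, it does not catch names or free-text identifiers (which would need a trained model), and it is not a HIPAA-compliant de-identification method.

```python
import re

# Hypothetical, deliberately simple patterns; real identifiers take many more forms.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed placeholder label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    note = "DOB 04/12/1961, phone 555-123-4567, email j.doe@example.com"
    # Only the redacted text would be passed on to a downstream model.
    print(redact(note))
```

In a real pipeline this step would sit in front of any call to an external model, so that only the redacted text leaves the trusted environment.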

Elad Gil: Yeah. HIPAA is interesting from the context of it was an incredibly well-intentioned piece of legislation, but the flip side of it is it's really backfired in all sorts of ways in terms of actual patient good. And you see that sometimes as well in terms of some of the things that you sign up for in a clinical trial, or what you can actually do with your own data, where sometimes you're constrained from accessing it. I know of one example where somebody had brain cancer, a glioblastoma. He was a researcher at MIT and he participated in a small clinical trial, and then they were unable because of compliance to give him his own data so that he could try and discover drugs against his own glioblastoma, his own brain cancer.

And so sometimes you see these very well-intentioned approaches in terms of the protocols around a clinical trial or around HIPAA or other things that are very well-intentioned in terms of what they want to do, but then sometimes they may backfire as you start to enter the modern data world, since I think that legislation is now almost 30 years old. And so I just think it was set up for a world that's very different from what we have now in terms of the liquidity influence of your ability to interact with information and patients driving their own diagnoses and things like that. So, my hope is that some of these things get rebalanced in the AI world since it could be so valuable to things like what you're doing.

Sarah Guo: I was just going to say that is the status quo. And you've also worked on the areas of privacy-preserving machine learning and federated learning. Those areas have broadly taken a backseat to, let's say, scaling and aligning these more centralized models. Do you see a place for that technology in this field?

Karan Singhal: Yeah, that's a great question. So I mean, as I mentioned before, the first couple years of my career were really thinking more about privacy-preserving machine learning and federated learning and scaling that up and coming up with new algorithms that can learn new things without sending all the data to a centralized place. And so in a lot of ways that has a very natural fit with this setting. And part of my motivation when I first started working on the setting was bringing a lot of that expertise into it. I think one hesitation I have there is that a lot of the most impactful work that's going to happen in this setting is going to happen with the largest and most capable models, at least for the next few years it seems like.

And I think that one thing that we're seeing is that even without any patient health information put into these models, for example, Med-PaLM and Med-PaLM 2 are trained without any patient health information, they're just taking all the knowledge of PaLM and PaLM 2 and then just aligning them and making them behave in a certain way. I think in the short term, there is this thing that we see where models like GPT-4 and Med-PaLM and Med-PaLM 2 are able to do surprisingly well without any patient health information. And so it seems like we can get fairly far with that.

I mean, in the longer run, I do think that coming back to that question of data and how do you think about how to train a model depending on how much data you have and how relevant that data is, the ideal thing would be to have access to all the data but in a privacy-preserving way. In a way that people are in control of their data, are able to revoke access to that data, and are able to benefit from that shared understanding of their data. And so that's the ideal world. But I think there are real-world obstacles to doing federated learning on health data, which actually increase the activation energy to the point where, in the next few years, I doubt that the biggest advances are going to come from using federated learning approaches.

But I think there are kind of intermediate solutions which people sometimes refer to as federated, but maybe are not technically federated. These are things like trusted execution environments or other environments in which models are running but the folks at Google don't have access to the data or direct access to the models. And so there is this ability to silo any patient health information in the future, potentially, or any other data that's quite sensitive, from engineers or other folks at big companies or small companies.
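
For readers unfamiliar with the term, the core idea of federated learning mentioned above (learning without sending raw data to a central place) can be sketched with plain federated averaging on a toy one-parameter model. Everything here is an illustrative assumption: the function names, the learning rate, the toy data, and the choice of a linear model y = w * x. It includes no secure aggregation or differential privacy, and it is not how Med-PaLM was trained, which per the conversation used no patient health information.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's private (x, y) pairs."""
    for _ in range(epochs):
        # Gradient of mean squared error for the model y_hat = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, clients):
    """One round: each client trains locally, the server averages the weights.

    Only the updated weight leaves each client; the raw (x, y) pairs do not.
    The average is weighted by each client's number of examples.
    """
    updates = [(local_update(global_w, data), len(data)) for data in clients]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(clients, rounds=50):
    w = 0.0
    for _ in range(rounds):
        w = federated_round(w, clients)
    return w


if __name__ == "__main__":
    # Three hypothetical clients whose private data all follow y = 3 * x.
    clients = [
        [(1.0, 3.0), (2.0, 6.0)],
        [(3.0, 9.0)],
        [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
    ]
    print(train(clients))
```

The practical obstacles raised in the answer, such as coordinating many institutions and their infrastructure, are exactly what this toy version leaves out.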

Sarah Guo: Going back to perhaps more promising near-term areas of research, you've had this idea of building a medical assistant as a laboratory for safety and alignment research. Can you talk about that?

Karan Singhal: Yeah, absolutely. I mean, this is a lot of what got me thinking about the setting, especially coming into the setting as somebody who didn't have much of a medical background in terms of expertise. I was really thinking about what are the big things that I could do to help shape the trajectory of AI or nudge it in a more beneficial direction. And thinking about AI safety seriously, in terms of both short-term and longer-term risks, I think was important to me. And so, one thing I've become more convinced of over time is this idea that many organizations right now, Google, DeepMind, Anthropic, OpenAI, are looking at the idea of a general chat assistant and, instead of doing alignment research in a vacuum, are looking at that setting as a way in which we can think about better refining these models and better aligning them to human values.

I think there's a good chance that this setting, this medical setting, for example, medical question answering or maybe more broadly, I think ends up being a better scenario to study concerns about technical safety and to mitigate concerns like misalignment with human values or hallucinations or things like that. So, I think this comes down to things like making sure the incentives are aligned with respect to releasing products. So for example, I think if any organization wants to release products in the space, it actually needs to work on these problems more so than, I think, ChatGPT. I think it also comes down to the stakes of the setting. I think everybody feels like the stakes of the setting are high enough that these issues are especially important, and there's no debate about that. And I think there's also some more subtle technical points.

I think one issue that alignment researchers are now working on is this idea of scalable oversight. Which means, how do you give human feedback to a model when human feedback might not be super well-informed or it might be unreliable because AI capabilities are starting to reach human level. And so, when we start to get to that point, things like RLHF start to fail and it starts to become unclear what to do. And so I actually think the medical setting is a scenario in which this is already more obvious. So you're already in the setting in which you need experts to be able to evaluate answers. And one thing we're seeing with Med-PaLM 2, as we get closer to physician-level performance on medical question answering, is that it's hard to tell the difference anymore. It's hard to tell the difference between different models. It's hard to tell the difference between models and physicians. And when you're at that point where it's uninformed oversight, then it becomes very tricky to think about aligning to human values. And so that problem is super well motivated in the setting and that's something I'm very excited about.

Elad Gil: What do you think is a solution to that? Because if you look at the gaming analog, which is probably a bad analog here, once machines were better than humans at things like Go or Chess or other things, people started learning off of the things that the machines were doing that were unique or creative or different or the problem solving was very different. And if we really want this technology to be incredibly valuable for medical applications, in some cases we may end up with these suggestions that will really work well but that, to your point, people may misinterpret or misunderstand. And so how do you think about evaluating things when the AI will be better than a person at medical adjudication or better than an expert?

Karan Singhal: Yeah. I mean this is a really interesting question. I don't think I have all the answers but I think there are approaches that people at Google and other organizations have been looking at. I mean, I think there are a couple of ideas here that are interesting and useful. One is the idea of self-refinement or self-critique of these models. And so, this is the idea that these models can take their own responses and give critiques, often guided by human feedback. And so that's the place where human feedback comes in. But in some of these techniques, there is no human feedback and in that case, I'm not sure if that's as valuable. Give critiques guided by human feedback and then use that to produce better answers. That's one line of approaches.

I think a second line of approaches is around debate. And so the idea here is that it's easier for a human to judge a debate between two different answers than to judge the answer itself. And so, the standard for verification is a bit lower here. And so, there's that ability for humans to be able to judge a response that potentially they wouldn't be able to judge otherwise via things like debate. So that's another thing. I mean another thing which people are working on as well is thinking about how we can take AIs that are less capable and use them to supervise other AIs that are more capable. And so this is the motivation. I mean this is partly the motivation of RLHF as well, even though it's about human feedback. It's about training a reward model that takes into account human feedback and then, at that point it's AI feedback from then on, and then you use your RL algorithm and then you get rewards from your reward model.

RLAIF or Constitutional AI builds on that idea but there's also limitations to that approach as well. I mean, I think if you ask researchers across all these organizations, have we solved this problem? Do we know what we're supposed to do? I think most of them would say no. And it seems like a pretty consequential problem. So I'm excited for more folks to work on it.

Elad Gil: Yeah. One thing that I feel like would also be generated as a side effect of all this is just you end up with these really interesting closed loop data sets over time that may be unique outside of an EMR or somewhere else or a really robust medical record system because if you have effectively a physician's assistant or something else and then you have the endpoint of what happened based on treatment, you actually have a really interesting retrospective data mining training set.

Karan Singhal: Yeah, I mean I think that's another opportunity for feedback for these models that could have a huge impact on the world.

Elad Gil: Yeah, it'll actually be data-driven medicine. Which I think sometimes happens, but sometimes doesn't. So it's very exciting. I guess one more question is just, there's amazing potential here. And if I look at the history of medical technology, in the 1970s, there was something known as the MYCIN project at Stanford where they built an expert system, which was an old computer program of its time or a computer program of its day that was a precursor to some of the things that eventually happened in AI. They had an expert system that outperformed all of Stanford's medical staff on the prediction of the infectious disease that somebody had. So 40 years ago almost, we had a machine that outperformed people in terms of diagnosis but it never got adopted.

And so often when I look at medical technologies, there's this almost anti-adoption curve, in some cases for the things that may be most impactful. How has the medical field embraced or not embraced these AI models? Is it different this time? Are people excited about it? Are they not excited? Does it really depend on the type of physician? I'm just curious what the reaction has been from the medical community to date.

Karan Singhal: Absolutely. That's a really great question. I think when we started this brain moonshot, which we call it within Google, that was actually our motivation. It was really to think about the fact that these models already existed, and there was this opportunity to catalyze the medical AI community to really think about them carefully and think about the promise there. And to catalyze the AI community to think about how we can resolve any remaining limitations that would prevent real-world uptake. And so this was really our goal. And I think when we started this, there was much less conversation about the potential for large language models and foundation models for healthcare. I mean partly because of, I think largely also because of other work that's gone on with GPT-4 and excitement around that, I think there's much more conversation about how these models can be used in the setting in a productive way.

And I think that's really exciting and I think there's a lot of optimism, I see. But there's also a lot of justified concern about the potential limitations of these models and how we can get over them. Personally, I mean from what I've seen from giving talks to different groups and chatting with different folks and different stakeholders, I think there's a widely held optimism about this technology and about the potential. But I think there's also a little bit of fear that I think people have seen in other domains. I think programmers often feel a little bit of fear when they see GPT-4, for example. And I think it's not necessarily a fear that jobs will be replaced in the short term or things like that, but it's more of a fear of, look how fast things are moving. This is nuts. Think about just the improvement from Med-PaLM 1, GPT-4, Med-PaLM 2 in three months. It's absolutely crazy.

It's definitely an inflection point for AI as you guys know, and I think it's definitely a good time to think about what are the most important problems we need to solve versus getting caught up in the hype wave and forgetting to solve the most important problems as well.

Sarah Guo: I think back to Elad's sort of point earlier, thinking about the actual benefits of these technologies at scale if adopted even at human and at some defined superhuman level, should we come to some sort of agreement as a democratic society about what eval looks like, is really important. In that, if you just think about what the status quo is for somebody who has a complex case in a median background in America, what do they know about the error metrics of their doctor? In a field that's also advancing in parallel to AI, the specific rare condition that they have, it's not super encouraging. And so in terms of leverage for a field where the status quo is not sufficient, not as a comment on the class of physicians and researchers, but just in terms of the quality of care that we want to be able to offer every person, it seems like we want to set a reasonable safety case, not an unlimited safety case. Which I think is one of the things that has held back other mission-critical AI applications in the past.

Maybe on that note, one last ask for you in terms of encouraging some optimism, you're working on the state of the art in this field and thinking about the barriers to the applied use, five years from now, how do you hope we are using large language models in the medical field?

Karan Singhal: Yeah, I guess I think about this in two broad buckets. I think there are two broad types of things that we can do with large language models in the medical field. I think the first is increasing the standard of care, very broadly. And so that looks a lot like increasing access to health information, providing assistance to physicians, so the radiology example I gave earlier. Potentially clinical decision support like double checking a doctor's decision or quality assurance for a radiologist report. So if a radiologist is dictating a report, they say "no pleural effusion seen" but then it's written down as "pleural effusion seen," then maybe an AI double checks that and just makes sure that that's what was intended. I think augmenting telemedicine is a short-term opportunity that I think in the next five years is very achievable.

I think the other big bucket of things that is very much achievable is augmenting scientific workflows. And I think this could be a longer-term thing than five years but I think there's also short-term things that we can do as well. So thinking about looking at correlations across modalities and existing data to find novel biomarkers for existing diseases that we know about or using large language models as research assistants. So I think there's already a lot of work on the idea of literature search and augmenting literature search with large language models. I think there's a lot of opportunity there. And that goes a little bit beyond what Med-PaLM is likely going to do.

But I think that's something that I think is going to be really promising with respect to the future of AI because I think the long term, when things go really well with AI, it's going to be because we've solved a lot of the most pressing scientific problems of today. And I think that's going to be because it augmented scientists, it helped scientists, it helped us figure out what are the things that we're missing, and I think there's a lot of potential there. So I'm also really excited about that in the long term.

Sarah Guo: Awesome. Wrapping up, is there anything else you think we should touch on?

Karan Singhal: Yeah, absolutely. I mean, I think for real-world uptake of these models there are a few large language model capabilities that, in some cases, already exist but we need to figure out the right way to do them. And I think a few of them are just multi-modality, which is something that we were working on and previewed last week at I/O. And grounding in authoritative sources, I think, is important as well. Thinking about how these models can use Toolformer-like approaches to, for example, query authoritative medical information like a human would, but potentially better. And I think that's also one way of getting around the risk-averseness that you see in this area with respect to health information. If you're able to attribute information to an authoritative source, I think that has been something that has progressed this area in big companies before.

And so where, for example, Google is doing that with health information, it is largely because it can attribute things to the Mayo Clinic and other organizations. And so, I think that's going to be really important for moving this forward. I think also solid research, thinking about better ways to improve the ways we are taking in human feedback. I think the jury's still out with respect to how to best collect human feedback even. I think people are still debating things like whether pairwise comparisons or rewrites are the best things to do. And that's a valuable thing to think about. I think another thing to think about is how to actually use that human feedback in the most valuable way, especially given all the scalable oversight concerns that you guys mentioned. I think that's a significant limitation of Med-PaLM as it is today. And I think there's a lot of exciting things to do and I think a lot of these questions are foundational questions for AI more broadly, but become more acute and more relevant in this setting.

Sarah Guo: It's been great to have you on No Priors. Thanks for doing this.

Elad Gil: Yeah, thanks so much for joining.

Karan Singhal: Thanks guys.