No Priors 🎙️107, with Percy Liang, Stanford Center for Research on Foundation Models: what is the role of academia in the age of scaling AI? (TRANSCRIPT)
SARAH:
We're very pleased today to have Dr. Percy Liang, professor of computer science here at Stanford and director of the Center for Research on Foundation Models, a recently founded center here. Dr. Liang is the author of over a hundred heavily cited research papers on helping machines understand natural language and helping humans reason about those models, and he has contributed a number of novel technical approaches and creative uses of data to the machine learning field. And as a special treat, we're recording here in his office at Stanford today. Thanks, Percy.
PERCY:
Great, welcome.
SARAH:
So I think just to start, can you tell us a little bit about how you got into the machine learning research field and your personal background?
PERCY:
Yeah, so I've been in the field of machine learning and natural language processing for over 20 years. I started getting into it in undergrad; I was an undergrad at MIT. I liked theory, and I had a fascination with languages. I was fascinated by how humans could be exposed to just strings of text, I mean speech, and somehow acquire a very sophisticated understanding of the world, and also syntax, and learn all that in a fairly unsupervised way. My dream was to get computers to do the same. So then I went to grad school at Berkeley, and after that started at Stanford. Ever since, I've been in pursuit of developing systems that can really, truly understand natural language. And of course, in the last four years, this once-upon-a-time dream has really taken off, in a sense, maybe not in a way that I would necessarily expect.
But with the coming out of large language models such as GPT-3, it's truly kind of astonishing how much of the structure of language and the world these models can capture. In some ways it harkens back to when I first started NLP: I was training language models then too, but of a very different type. They were based on hidden Markov models, and there the goal was to discover hidden structure in text. I was very excited by the fact that the model could tease apart which words were, say, city names versus days of the week and so on. But now it's on a completely different level.
SARAH:
You've worked on multiple generations of NLP at this point, pushing the forefront of semantic parsing. Was there a moment at which you decided that you were going to focus on foundation models and large language models?
PERCY:
Yeah, there was a very decisive moment, and that moment was when GPT-3 came out, in the middle of the pandemic. It wasn't so much the capabilities of the model that shocked me, but the way the model was trained, which was basically taking a massive amount of text and asking the model to predict the next word over and over again, billions of times. What arose from that was not only a model that could generate fluent text, but also a model that could do in-context learning, which means you can prompt a language model with instructions, for example "summarize this document," give it some examples, and have the model, on the fly, in context, figure out what the task is. And this was a paradigm shift, in my opinion, because it changed the way we conceptualize machine learning and NLP systems: from bespoke systems, where one is trained to do question answering and another is trained to do something else, to a general substrate where you can ask the model to do various things.
And the idea of a task, which is so central to AI, I think begins to dissolve. And I find that extremely exciting. That's the reason why, later in 2021, we founded the Center for Research on Foundation Models. We coined the term "foundation models" because we thought something was happening in the world whose significance the phrase "large language models" didn't really capture. It was not just about language; it was about images and multi-modality. It was a more general phenomenon. So we coined the term foundation models and the center started. And it's been kind of a rollercoaster ride ever since.
ELAD:
We're gonna be talking a bit about both your experiences in research and academia, and then we'll also separately be talking about Together, a company you're involved with now. Could you tell us a little bit more about what the Center does and what you're focused on?
PERCY:
Yeah, so the Center for Research on Foundation Models was founded about two years ago under the Human-Centered AI Institute at Stanford. The main mission of the center, I would say, is to increase transparency and accessibility of foundation models. Foundation models are becoming more and more ubiquitous, but at the same time, one thing we have noticed is the lack of transparency and accessibility of these models. If you think about the last decade of deep learning, it has profited a lot from a culture of openness, with tools like PyTorch and TensorFlow, datasets that are open, and people publishing openly about their research. And this has led to a lot of community and progress, not just in academia but also in industry, with different startups and hobbyists and whoever else getting involved. What we're seeing now is a retreat from that open culture, where models are now only accessible via APIs. We don't really know all the secret sauce that's going on behind them, and access is limited.
SARAH:
What's your diagnosis of why that's happening?
PERCY:
I think this is very natural, because these models take a lot of capital to train. They can generate an enormous amount of value, and they're a competitive advantage, so the incentives are to keep them under control. There's also another factor, which is safety. I think these models are extremely powerful. If the models we have right now were out in the open, it would maybe be okay, but in the future these models could be extremely good, and having them completely anything-goes, running amok, is something we might have to think about a little bit more carefully.
ELAD:
How do you think all this evolves? If you look at the history of ML or NLP or AI, we've had waves of innovation in academia and waves of innovation and implementation in industry, and in some cases both happening simultaneously, but it feels a little bit like it's ping-ponged over time. Now that people on the industry side are becoming more closed about some of these models, publishing less and being less open, how do you view the roles of academia and industry diverging, if at all? Do you think it'll be different types of research that each type of institution tackles? Do you think there'll be overlap? I'm curious how you view all that evolving.
PERCY:
I mean, I think industry and academia have very distinctive and important functions, and I always tell my students we should be working on things that lean on academia's competitive advantage. Historically, that has meant different things. Before ML was that big, a lot of academic research was really about developing the tools to make these models work at all. I remember working on ML models back in grad school, and basically nothing was working <laugh>: computer vision wasn't working, question answering wasn't working. The goal of academia then was to make things work, and a lot of the advances born in academia influenced other ideas, which influenced other ideas, before it all started clicking. Now we're seeing the fruits of both academia's and industry's research fueling the kind of industry drive you see today.
And today, I think, the dynamic is quite different, because academia's job is no longer just to get things to work; you can do that in other ways. There are a lot of resources going into tech companies, where, if you have data and compute, you can just scale and blast through a lot of barriers. I think a lot of the role of academia now is understanding, because for all their impressive feats, we just don't understand why these models work, how they work, or what the principles are. How does the training data affect behavior? How does the model architecture affect different behaviors? What is the best way to weight data? What's the right training objective? Many of these questions could benefit from more rigorous analysis. The other piece, which is a different type of understanding, is understanding social impact.
And this goes back to the question about what CRFM's role is. CRFM is a center with over 30 different faculty across 10 different departments at Stanford, so it's quite interdisciplinary. We're looking at foundation models not just from the technical perspective of how you get these models to work, but also thinking about their economic impact and the challenges when it comes to copyright and legal questions; we're working on a paper that explores some of those questions. We're looking at different questions of social bias, and thinking carefully through the impact these models have on issues like homogenization, where you have a central model that's perhaps making decisions for a single user across all the different aspects of their life. So those are some of the types of questions. There are also people at the center looking at risks of disinformation, monitoring to what extent these tools are persuasive, which they are getting increasingly so, and what the actual risks are when it comes to, say, foreign state actors leveraging this technology. And there are also people at the center who are in medicine, and we're exploring ways of deploying foundation models in actual clinical practice.
ELAD:
How near-term do you think some of those deployments are? If you go back to the seventies, there was the MYCIN project here at Stanford, an expert system that outperformed Stanford medical school staff at predicting what infectious disease somebody had, for example. That was almost 50 years ago, and it never really got implemented in the real world. So one of my concerns sometimes, in terms of the impact of some of these things, is: are there industries that are resistant to adoption or resistant to change? It is exciting to hear that at Stanford they're actually starting to look at how you integrate these things into real clinical care. Do you view those things as very far out on the healthcare side, or do you view them as near? I know that isn't the main topic we're gonna cover, but I'm just a little bit curious given how close you are to all this.
PERCY:
Yeah, I think it's a good question. I think there are a bunch of different issues that need to be resolved. For example, foundation models are trained on a lot of data; how do you deal with privacy? How do you deal with robustness? Because once you're talking about the healthcare space especially, there are cases where we know these models can still hallucinate facts and sound very confident while doing so. And how do you...
ELAD:
I have some doctors like that too. <laugh>
PERCY:
Yeah, there you go.
SARAH:
But you've also taken the point of view that we should expect superhuman performance from these models, that holding them to the standard of a human doctor is actually insufficient.
PERCY:
Right, yeah, I think that's a great point. For ages, human level has been the target for AI, and that has been a north star fueling many dreams and efforts over the decades. But I think we're getting to a point where, on many axes, it's superhuman, or it should be superhuman. And I think we should define a more objective measure of what we actually want. We want something that's very reliable and grounded. I often want more statistical evidence when I speak to doctors, and sometimes fail to get it, and I'd want something that is a lot more principled and rational. This is more of a general statement about how we should think about technology: not just chasing after mimicking a human, because we already have a lot of humans.
ELAD:
It's an interesting point. It's really fascinating to watch all this evolve. Now, you've done extensive research on natural language processing and computational semantics. Can you explain what those terms mean and how they're relevant to the development of AI?
PERCY:
So computational semantics is the process where you take language, text, and compute, quote unquote, meaning from it. And "meaning" is something I'm not gonna <laugh> attempt to define; there's a huge literature in linguistics and philosophy about what meaning is. I will say that a lot of my research in the past, maybe five to ten years ago, adopted the view that language is a programming language. It computes: you can give orders, you can instruct, you can do things with language. And therefore it was natural to model natural language as a formal language. So a lot of semantic parsing is about mapping natural language into a formal space so that machines can execute it. One concrete application of this that I worked on for a while is mapping natural language questions into essentially SQL queries, which obviously has many different applications as well.
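As a small, hedged illustration of what "mapping a question to an executable query" looks like (the question, table schema, and SQL below are invented for this example, not taken from Percy's systems), the key property is that the parse is a program whose execution yields the answer:

```python
# Illustrative only: a hand-written example of the kind of mapping a semantic
# parser learns. The question, schema, and query are hypothetical.
import sqlite3

question = "How many cities have a population over one million?"

# A semantic parser would map the question into an executable query like this:
sql = "SELECT COUNT(*) FROM cities WHERE population > 1000000"

# Executing the query against a toy database yields the answer directly,
# rather than retrieving it from a document or generating it as free text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [("Tokyo", 13960000), ("Oslo", 700000), ("Mumbai", 12440000)])
(answer,) = conn.execute(sql).fetchone()
print(answer)  # -> 2
```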
What was nice about this framework is that to really do this, you had to understand how the words contributed to different parts of the SQL query. And then you'd get a program that you could execute and deliver the results, as opposed to many question answering systems where you ask a question, maybe retrieve some documents, and either retrieve the answer or make something up, rather than computing it rigorously. So that was the paradigm I was working in maybe five to ten years ago. But the main problem is that the world isn't a database. A small part of the world is a database, but most of the world is unstructured. Then I started thinking about question answering in general, and we developed the SQuAD question answering benchmark to fuel progress in open-domain question answering.
That in turn, along with many other datasets developed both at Stanford and elsewhere, I think led to the development of these powerful language models, from BERT and RoBERTa and ELMo around 2018, ancient history now <laugh>, to the 2020 generation of large foundation models. So I think there's certainly a place for that type of thinking. There are cases where you do want to map natural language into what people now call tool use: if you ask some question that requires calculation, the model should just use a calculator rather than try to, quote unquote, do it in the transformer's head. But there are also a lot of aspects of reasoning that are not quite formal. We do this all the time, and a lot of that happens natively in the language model.
And I think it's still an interesting question how to marry the two. I feel like the two are still sort of jammed together. And maybe it's natural, because there are certain things you can do in your head and certain things for which you invoke a tool. But this has also been one of the classic debates in AI: neural versus symbolic. For a while symbolic AI was dominant; now neural AI has really taken off and become dominant. But some of those central problems of how you do planning and how you do reasoning, which were the focus of symbolic AI, are now again really relevant, because we've moved past simple classification and entity extraction to more ambitious tasks.
ELAD:
What do you think are some of the more interesting research programs right now in that area?
PERCY:
I think it's interesting to remark on what's happening, because to a first order of approximation, larger models trained on the relevant data seem to do well on various benchmarks. Maybe there isn't enough emphasis on data efficiency: how quickly, and how robustly, you can get these capabilities. We know, and it has been well documented, that benchmarks can be gameable, so doing well on a benchmark doesn't mean you've necessarily solved the problem. So one has to be a little bit cautious about that. Obviously scale and more data is one clear direction, but in terms of orthogonal directions, several things have to happen. One is the ability to handle longer contexts. If you think about a long reasoning chain, transformers have a fixed context; there are ways to extend it, but fundamentally it's a fixed model.
Take advanced problem solving. For example, if you want to solve a math problem or prove something, the language model generates this chain of thought token by token and then produces an answer. But we know that when humans solve a problem, it's much more iterative: you try different things, you backtrack, and it can last a lot longer than just a few iterations. What architecture can handle that level of complexity is, I think, still an outstanding question.
ELAD:
Are there any aspects of foundation models or large language models that are emergent that you didn't anticipate, or that really surprised you?
PERCY:
I think, going back to GPT-3, in-context learning is something that surprised many people, including me. Here you're prompting a language model with an instruction and input-output pairs: here's a sentence, it's classified positive; here's a sentence, it's classified negative. And the model is somehow able to latch onto these examples, figure out what you're trying to do, and solve the task. This is really intriguing because it's emergent. It wasn't hand-coded by the designers, as in "I want to do in-context learning this way." Of course you could have done that, but I think the real magic is that you didn't have to, and yet it still does something. It's not completely reliable, but it can get better with better models and better data.
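To make the in-context learning setup concrete, here is a minimal sketch of the kind of prompt being described. The reviews and labels are invented for illustration, and the call to an actual completion API is left out; the point is only that the task is specified by examples in the prompt rather than by fine-tuning.

```python
# A minimal sketch of an in-context learning prompt for sentiment
# classification. The examples are hypothetical; the finished prompt would be
# sent to any text-completion model, which is omitted here.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
    ("The soundtrack alone is worth the ticket.", "positive"),
]
query = "Two hours of my life I will never get back."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
# The model is never fine-tuned on this task; it is expected to infer the
# pattern from the examples and continue the text with "negative".
```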
Then there's chain of thought.
ELAD:
Do you wanna explain what that is?
PERCY:
So the idea is, if a question is presented to a language model, the model could just answer, and it'll maybe get it right or wrong. But if you ask the language model to generate an explanation of how it would solve the problem, kind of thinking out loud, then it's much more likely to get the answer right. It's very natural that this would be the case for humans as well, but the chain-of-thought capability is, again, something that emerges. The other thing I think is really wild, and maybe it's a general principle, is the ability to mix and match. You can ask the model to explain the quicksort algorithm in the style of Shakespeare, and it will actually construct something that is semantically pretty on point, but also stylistically much better than what many people could come up with.
Which means it has learned different concepts, what Shakespeare is and what quicksort is, and is able to fuse them. If you think about creativity, I think this is an example of creative use. People sometimes say that language models just memorize, because they're so big and trained on so much text. But these examples really indicate that they are not just memorizing, because this text simply doesn't exist anywhere; you have to have some creative juice and invent something new. And just to riff on that a little bit: the creative aspects of these language models, with the potential for scientific discovery, or doing research, or pushing the boundaries beyond what humans can do, are really fascinating. Because up until now, remember, the AI dream topped out at humans. But now we can actually go beyond, in many ways, and I think that unlocks a lot of possibilities.
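Percy describes chain-of-thought prompting above; here is a minimal sketch of such a prompt. The word problems are invented for illustration; the technique is simply to show a worked example so the model generates its reasoning before the final answer.

```python
# A minimal sketch of chain-of-thought prompting: instead of asking for the
# answer directly, the prompt demonstrates "thinking out loud" on a worked
# example. The arithmetic word problems are invented for illustration.
cot_prompt = """Q: A library has 42 books and lends out 17. How many remain?
A: The library starts with 42 books. Lending out 17 leaves 42 - 17 = 25. The answer is 25.

Q: A train travels 60 miles per hour for 3 hours. How far does it go?
A:"""

print(cot_prompt)
# A model prompted this way tends to produce the intermediate reasoning
# ("60 miles per hour for 3 hours is 60 * 3 = 180 miles") before the final
# answer, which empirically improves accuracy on multi-step problems.
```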
SARAH:
Yeah, there are a lot of really interesting examples. You could argue that connecting concepts in any novel way is creativity, but I love the example of discovering new tactics in Go that humans hadn't found after thousands of years of play. Maybe we'll ask you to risk a prediction that's probably impossible: emergent behaviors of models at the next level of scale. Are there any capabilities you might predict emerging, given that we wouldn't have thought chain of thought or in-context learning would work?
PERCY:
So I can give you an example of something I think is emerging, and I can give you an example of a hope, though I don't know if I would call it a prediction. What we're seeing today is the ability to instruct a model using natural language to do certain things. You see a lot of this online with ChatGPT and Bing Chat, and some of Anthropic's work as well: you can instruct a model to be succinct, to generate three paragraphs in the style of so-and-so, and so on. You can lay out these guidelines and have the model actually follow them. So this instruction-following ability is getting extremely good. Now, I will say that how much of it is emergent and how much is not is hard to tell, because for a lot of these models, it's not just a language model trained to predict the next word.
There's a lot of secret sauce that goes on under the hood. So if you define emergence as "it was not intended by the designers," I don't know how much of that is emergent, but at least it's a capability I find very striking. The hope is about hallucination: language models currently make stuff up, they hallucinate, and this is clearly a big problem, and in some ways a very difficult one to crack. The hope is that as models get better, some of this will actually go away. I don't know if that will happen. The way I think about these models is that predicting the next word seems very simple, but to do it you have to really internalize a lot of what is going on in the context.
What are the previous words, what's the syntax, who's saying them? All of that information and context has to get compressed, and that's what allows you to predict the next word. If you're able to do this extremely well, then you sort of have a model of what's happening in the world, at least the world you've captured in text. And so while the notion of truth might be ambiguous in many cases, I think the model can get an idea of which parts of the internet are reliable and which are not, and the idea of entities and dates and locations and activities will maybe become more salient in the model. If you think of a language model that's just predicting the next word, and it's only trained to do that, and you say "Elad traveled to blank," of course it's gonna make <laugh> something up without further context. But if it has a better understanding of what's happening, and of course with more context, then maybe it can use that context to know, well, okay, I don't know, maybe I should ask where you went.
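As a toy illustration of the next-word prediction objective discussed here (using simple bigram counts rather than a transformer, and an invented three-sentence corpus), note how, without further context, the model can only spread probability over the destinations it has seen, which is exactly the "Elad traveled to blank" point:

```python
# A toy next-word predictor built from bigram counts. Real language models
# condition on far richer context, which is what lets them build something
# like a world model rather than guessing from local statistics.
from collections import Counter, defaultdict

corpus = ("elad traveled to japan . sarah traveled to france . "
          "elad traveled to japan").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    counts = bigrams[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(predict_next("to"))  # roughly {'japan': 0.67, 'france': 0.33}
```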
ELAD:
Yeah. So with scale you're basically increasing the statistical accuracy of the prediction of the next word, because you have more context and more data for what's coming, and therefore it will reduce hallucinations because you're increasing accuracy.
PERCY:
Yeah. So I think there's pre-training, which is predicting the next word and developing a world model, so to speak. And with those capabilities, you still have to say "don't hallucinate." But it'll be much easier to control that model if it has a notion of what hallucination even is.
ELAD:
I was talking to somebody who was close to the development of the transformer model, and his claim was that one of the reasons it's done so well is, to your point around scale: eventually you hit enough scale that you see it clearly has these really interesting emergent properties, so you keep scaling it up and you keep growing it, and it becomes a self-reinforcing loop to keep using these types of models. And his claim was that it's expensive to do that sort of scale, so there may be other architectures or approaches that we've just never scaled up sufficiently to see whether they have the same emergent properties, or certain characteristics that may be superior. How do you think about that, from the perspective of just going down the transformer path versus other architectures that may be really interesting and maybe neglected because we just haven't thrown enough compute at them, since it's expensive?
PERCY:
Yeah, I really hope that in 10 years we won't still be using the transformer, because the transformer is, I mean, it's a very good architecture. People have tried to improve it, but it's sort of good enough for people to press ahead, and scientifically there's no reason to believe that this is the one. And there have been some efforts. One of my colleagues, Chris Ré, and his students have developed other architectures which, at smaller scale, are competitive with transformers and don't require the central operation of attention. I would love to see much more research exploring alternatives to the transformer. This is something, again, that academia is very well suited to do, because it involves challenging the status quo. You're not just trying to get something to work and get it out there; you're trying to reflect on the principles: what can we learn from transformers, what is the architecture trying to do, and how can we incorporate that in a much more principled way? At some level, it's still going to be about compute, right? People have shown scaling laws for LSTMs: if you were able to scale up LSTMs, maybe they would work pretty well too, but the amount of compute required is many times more. And given a fixed compute budget, we're always in a compute-constrained environment.
ELAD:
Is it an efficient enough architecture to keep trying?
PERCY:
Yeah, you would not use an LSTM; a transformer strictly dominates an LSTM from the perspective of a fixed compute budget. So the question of "what if I could scale the LSTM" becomes a little bit irrelevant.
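For readers who want the quantitative version of this trade-off: scaling-law analyses typically fit validation loss as a function of parameter count N and training tokens D. One commonly used functional form, shown here purely as an illustration of why a fixed compute budget forces the comparison Percy describes (the constants are fit empirically, separately for each architecture), is:

```latex
% Illustrative scaling-law form: loss as a function of model size N and data D.
% E is the irreducible loss; A, B, \alpha, \beta are fit per architecture.
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}},
\qquad \text{with training compute } C \;\approx\; 6\,N\,D .
```

Under a fixed C, an architecture that needs many times more compute to reach the same loss is strictly dominated, regardless of how it might behave at unbounded scale.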
ELAD:
So for the approaches where you see transformer-like performance, what sort of compute budget would you need to test them out? Is it the scale of a million dollars, 10 million, a hundred million dollars of compute? I know it changes based on compute pricing; I'm just trying to get a rough sense of how expensive it is to try today. And then if we extrapolate down a compute cost curve, maybe three years from now it's tractable again, or something.
PERCY:
Yeah, it really depends on the gaps that you're seeing. Right now in academia you can train one-billion-parameter models. It's not cheap by academia's standards, but you can do it. And here at CRFM we're training models in the six-or-seven-billion-parameter range, and I think that's enough to try out some ideas. But ultimately, because of emergent properties and the importance of scale, you can only form a hypothesis: you can find something that seems promising at smaller scales, but you still have to go out and test whether it really pans out or whether the gap just closes. And maybe this is a good segue to talk about compute and Together. We founded Together on the premise that compute is a central bottleneck in foundation models.
On the other hand, there's a lot of compute that's decentralized and maybe underutilized or idle. If we could harness that compute and bring it to bear for both research and commercial purposes, then we could actually do a lot more. There are some pretty hefty technical challenges around doing that, because foundation models are typically trained in very high-end data center environments where the interconnect between devices is extremely good, whereas if you just grab your average desktop or home interconnect, it's a hundred times or more slower. But Chris Ré and Saja and others, who really deserve most of the credit for this, have developed techniques that allow you to leverage this weakly connected compute and actually get pretty interesting training going. So hopefully with that type of infrastructure we can begin to unlock a bit more compute, both for academic research and also for other startups and so on.
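For a sense of what "compression" can mean in this setting, here is a minimal sketch of one generic, textbook technique, top-k gradient sparsification with local error feedback. It is shown only as an illustration of how communication can be cut over slow links; it is not a description of Together's actual method.

```python
# Illustrative only: send just the largest-magnitude gradient entries and keep
# the rest locally as a residual to be added to the next step's gradient.
import numpy as np

def topk_compress(grad: np.ndarray, k: int):
    """Keep the k largest-magnitude entries; return (indices, values, residual)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    values = grad[idx]
    residual = grad.copy()
    residual[idx] = 0.0  # error kept locally, folded into the next gradient
    return idx, values, residual

rng = np.random.default_rng(0)
grad = rng.normal(size=1_000_000)
idx, values, residual = topk_compress(grad, k=10_000)  # ~1% of entries on the wire
print(len(idx), float(np.abs(residual).max()) <= float(np.abs(values).min()))
```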
ELAD:
That's really cool. So it sounds a little bit like earlier predecessors of this might be things like Folding@home, where people did protein folding collectively on their computers, or SETI@home, where there was a search through astronomical data, and now you can do this for training an AI system on your desktop, or on excess compute that exists in data centers or other places.
PERCY:
So Folding@home is, I think, a great inspiration for a lot of this work. At some point during the middle of the pandemic, it was actually the world's largest supercomputer in terms of flop count, because it was being used to do molecular dynamics simulations for COVID. The main challenge with foundation models is that there are big models and big data that need to be shuffled around, so the task decomposition is much, much harder. That's why many of the technical things we've been doing around scheduling and compression are needed to overcome these hurdles. And then there's the question of incentives. I think there are two aspects of what Together is building. One is what I'll call a research computer, for academic research purposes, where people can contribute compute, and in the process of contributing compute they're able to use the decentralized cloud for training: when they're not using it, others can, and when they are using it, they can use much more of it.
So the hope is that it provides a much more efficient use of the compute, because you're spreading it across a larger set of people. And then on the commercial side, the hope is that for the open models developed in the open-source ecosystem, the Together platform can allow people to fine-tune and adapt those models to various different use cases. One thing I think is noteworthy: we think of foundation models today as maybe a few foundation models that are very good and exist, but I think in the future there are going to be many different ones for different use cases as this space takes off. Many of them will be derived from existing foundation models, but many will also perhaps be trained from scratch.
SARAH:
I think this is actually a pretty uncommon viewpoint right now. Can you talk a little bit about where you, or research efforts you're associated with, choose to train models, maybe in biomedicine or whatever else you think is relevant here?
PERCY:
So foundation models are a pretty broad category, and the core of it is large language models trained on lots of internet data. We've trained a model here at CRFM, in collaboration with MosaicML, called BioMedLM. It's not a huge model, but it's trained on PubMed articles and it exhibits pretty good performance on various benchmarks. For a while we were state of the art on the US medical licensing exam. Google did come up with a model that was, I think, 200 times larger, and they beat that. So scale does matter, but I think there are many cases where, for efficiency reasons, you do want a smaller model, since cost is a big concern.
SARAH:
I wanna talk about some of what I think is the most important, or hopefully most important, work the center has done so far. Can you explain what HELM is and what the goal has been?
PERCY:
Yeah, so HELM stands for Holistic Evaluation of Language Models, which is a project that happened over the last year. The goal is to evaluate language models. The trouble is that "language models" is a very generic thing; it's like saying "evaluate the internet." What does that mean? A language model takes text in and puts text out, and one of its features is that it can be used for a myriad of different applications. So what we did in that paper was to be as systematic and rigorous as we could in laying out the different scenarios in which language models could be used, and also to measure aspects of these uses that include not just accuracy, which a lot of benchmarks focus on, but also how robust the model is, how well it's calibrated (meaning, does the model know what it doesn't know), whether the models are fair according to some definition of fairness, whether they're biased, whether they spew out toxic content, and how efficient they are.
Then we basically grabbed every prominent language model we could access, which includes open-source models like OPT and BLOOM, but also API access from Cohere, AI21, OpenAI, and also Anthropic and Microsoft. Overall there were 30 different models, 42 scenarios, and seven metrics, and we ran the same evaluations on all of them. We've put all the results on the HELM website, so you can see the top-level statistics and accuracies, but you can also drill down into a particular benchmark: what are the instances, what are the predictions these models are making, all the way down to what prompts are being used for the language models. The idea here is that we're trying to provide transparency to this space. We know these models are powerful, they have some deficiencies, and we're trying to lay that all out in a scientific manner.
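Schematically, the evaluation is a grid of models, scenarios, and metrics. The sketch below is an invented skeleton of that structure, not HELM's actual code; the model names, scenario names, and scoring function are placeholders.

```python
# A schematic sketch of a HELM-style evaluation grid (models x scenarios x
# metrics). Everything here is a placeholder for illustration.
from itertools import product

models = ["model_a", "model_b"]                      # e.g. API or open-source models
scenarios = ["question_answering", "summarization"]  # "scenarios" in HELM's terminology
metrics = ["accuracy", "calibration", "toxicity"]    # a few of the seven metric groups

def evaluate(model: str, scenario: str, metric: str) -> float:
    # Placeholder: in reality this runs prompts through the model and scores outputs.
    return 0.0

results = {
    (m, s, metric): evaluate(m, s, metric)
    for m, s, metric in product(models, scenarios, metrics)
}
for key, score in results.items():
    print(key, score)
```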
So I'm pretty excited about this project. The challenging thing is that since we put out the paper, maybe three months ago, a bunch of different models have come out, including ChatGPT and LLaMA, and Cohere and AI21 have updated their models. GPT-4 might come out at some point. So this project has evolved into something dynamically updating, where every two weeks we refresh it with new models that are coming out, as well as new scenarios. Because one thing we also realized, which has been made clear by ChatGPT, is that the type of things we ask of a language model is changing. We don't ask it just to do question answering; as the capabilities increase, these models can do a lot more. They can write an email, or give you life advice on X, Y, Z if you put in a scenario, or write an essay about X, Y, Z.
And I think what we need to do with the benchmark is add scenarios that capture these capabilities, as well as new risks. So we're definitely interested in benchmarking how persuasive these language models are, which governs some of the risks, and also how secure they are. One thing I'm actually worried about, given all the jailbreaking that is extremely common with these models, where you can basically bypass the safety controls: if these models start interacting with the world and accepting external inputs, then you can not only jailbreak your own model, you can jailbreak other people's models and get them to do various things, and that could lead to a cascade of errors. So those are some of the concerns we hope to also capture with the benchmark. I should also mention we're trying to look at multimodal models, which I think is going to be pretty pertinent. So lots to do.
SARAH:
A bunch of the things you've described as the role you see for the center, or even for academia broadly in the age of foundation models, have more of an intersection with policy than machine learning research traditionally has. How do you think about that?
PERCY:
Yeah, I'm glad you asked that, because we've been thinking a lot about the social implications of these models: not the models themselves, which we focus a lot on talking about, but the environment in which these models are built. There are a few players in this space with different opinions about how models should be built; some are more closed, some are more open. And there's also, again, this lack of transparency, where we have a model that's produced and it's apparently aligned to human values. But once you start questioning, you can ask: okay, well, which values? Which humans are we talking about? Who determines these values? What legitimacy does that have, and what's the accountability? Then you start noticing that a lot of this is just completely a black box.
So one thing we've been working on at the center is developing norms, starting with transparency. I think transparency is necessary but not sufficient: you need some level of transparency to even have a conversation about any of the policy issues. So making sure that the public can understand how these models are built, at least some notion of what the data is and what instructions are given to align the models, we're trying to advocate for greater transparency there. And I think this will be really important as these models get deployed at scale and start impacting our lives. An analogy I like to think about is nutrition labels, or the specification sheets on electronic devices. There's some sort of obligation, I think, that producers of a product should have to make sure their product is used properly and has some bounds on it.
ELAD:
I guess I'll ask two questions. One is, if people wanted to participate in Together, is there a client they can download and install or use? How can people help support the Together effort?
PERCY:
Yeah, so we are developing a client that will be made available, both from the perspective of joining the Together cloud so that you can contribute your compute, and also an API we're developing so that people can use the Together infrastructure to do inference and fine-tune models. We are also training some open models. We have something called OpenChatKit that we're releasing soon; it's built on top of EleutherAI's GPT-NeoX model, but improved to include various different capabilities. You should think about it as a work in progress. What we're trying to do is open it up so that people can play with it, give feedback, and have the community improve it together, rather than us trying to produce some finished product and putting it out there. This goes back to the point about the spirit of open source: involving the community to build these foundation models together, as opposed to someone building them unilaterally.
SARAH:
While we're talking about timelines and predictions you don't quite feel comfortable making: how do you think, as a rigorous scientist, about AGI?
PERCY:
I must say that my opinions about AGI have changed over time. For a while it was perceived by most of the community as laughable. I will say that in the last 10 years I have been aware of a certain community that thinks about AGI and also existential risk and things like that, so I've been in touch with people who think about these issues. I think I see the world maybe differently: these are certainly powerful technologies that could have extreme social consequences, but there are a lot of nearer-term issues, and I've focused a lot on the robustness of ML systems in the last five years. But one thing I've learned from foundation models, because of their emergent qualities, is to be very open-minded. I was asking earlier about No Priors, where that name comes from, and I think it's a fitting way to think about the world, because everyone, including scientists, often gets drawn into a particular worldview and paradigm. And I think the world is changing, both on the technical side and in how we conceive of AI, and maybe even of humans at some level. And I think we have to be open-minded about how that's going to evolve over the next few years.
SARAH:
Awesome. Thanks for doing this conversation with us, Percy. It was great.
ELAD:
Yeah. Thanks for joining us.
PERCY:
Yeah, thank you very much.