
April 06, 2023

No Priors 🎙️111: The Future is Small Models, with Matei Zaharia, Founder and CTO of Databricks (TRANSCRIPT)

EPISODE DESCRIPTION:

If you have 30 dollars, a few hours, and one server, then you are ready to create a ChatGPT-like model that can do what's known as instruction following. Databricks' latest launch, Dolly, foreshadows a potential move in the industry toward smaller and more accessible but extremely capable AIs. Plus, Dolly is open source, requires less computing power, and has fewer parameters than its counterparts.

Matei Zaharia, Cofounder & Chief Technologist at Databricks, joins Sarah and Elad to talk about how big data sets actually need to be, why manual annotation is becoming less necessary to train some models, and how he went from a Berkeley PhD student with a little project called Spark to the founder of a company that is now critical data infrastructure that’s increasingly moving into AI.

No Priors is now on YouTube! Subscribe to the channel on YouTube and like this episode.

Show Links:

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Show Notes

[01:29] - Origin of Databricks
[04:30] - Work at Stanford Lab
[05:29] - Dolly and Role of Open Source
[12:30] - Industry focus on high parameter count, understanding reasoning at small model scale
[18:42] - Enterprise applications for Dolly & chat bots
[25:06] - Making bets as an academic turned CTO
[36:23] - The early stages of AI and future predictions

SARAH:

If you have $30, a few hours, and one server, then you're ready to create a ChatGPT-like model that can do what's known as instruction following. The latest launch, Dolly, from Databricks, which is available open source, foreshadows a potential move in the industry towards smaller and more accessible but extremely capable AIs. Matei Zaharia, co-founder and chief technologist at Databricks, is here to tell us all about Dolly. We'll talk about how big data sets actually need to be, why manual annotation is becoming less necessary to train some models, and how he went from a Berkeley PhD student with a little project you may have heard of called Spark to the founder of a company that's now critical data infrastructure that's increasingly moving into AI. Welcome to the podcast, Matei.

MATEI:

Thanks a lot. Excited to be here.

SARAH:

Can you start by telling us a little bit about the origins of Databricks and how it led you to where you are today?

MATEI:

Sure, yeah. So Databricks started from a group of seven researchers at UC Berkeley back in 2013, and we were really excited about democratizing the use of large data sets and of machine learning. We had seen, you know, the web companies at the time were very successful with these things, but most other companies, most other organizations, things like scientific labs and so on, weren't. And we were really excited to look at making it easier to do computation on large amounts of data and also to do machine learning at scale with the latest algorithms. During our research we worked with some of the web companies, and we also started open source projects, most notably Apache Spark, whose first version was essentially my PhD thesis. We had seen a lot of interest in these, and we thought, you know, it would be great to start a company to really reach enterprises, make this type of thing much better, and actually allow other companies to use this stuff.

SARAH:

Can you just give us a sense of what Databricks looks like today from like a, you know, scale and product suite perspective?

MATEI:

Sure, yeah. So Databricks offers a pretty comprehensive data and ML platform in the cloud. It runs on top of the three major cloud providers, Amazon, Microsoft, and Google, and it includes support for, you know, data engineering, data warehousing, and machine learning. Most interestingly, all this is integrated into one product. So for example, you can have one definition of your business metric that you use in your BI dashboards, and the same exact definition is used as a feature in machine learning. You don't have this drift or copying of data, and you can just kind of go back and forth between these worlds. The company has about 6,000 employees now, and last year we said that we crossed a billion dollars in ARR, and we're continuing to grow. It's a consumption-based cloud model where, you know, customers that are successful can grow over time and bring in new use cases and so on.

SARAH:

Did you think the opportunity was as big as it has been when you started the company?

MATEI:

Yeah, well, we definitely didn't anticipate necessarily growing to this size, right, as a lot of things can go wrong, but we were excited about the confluence of a few trends. First of all, you know, it's so easy to collect large amounts of data, and people are doing it automatically in many industries. Second, cloud computing makes it possible to scale up very quickly, do experiments, scale down, and so on, which enables more companies to work with this kind of thing. And the third one was machine learning. So we thought, you know, these are powerful trends, and the exciting thing for us as a company is we didn't invent cloud computing, we didn't necessarily invent big data or anything, but we were able to start at a point in time when many companies were thinking of moving into this space, and just provide a great platform for that. The migration was already happening, and if you provide the best platform as people are migrating to the cloud, they'll consider it.

SARAH:

You still keep roots in research, you have a research group at Stanford. Can you talk about that?

MATEI:

Yeah, so, um, I'm a computer science professor there, so I split my time between that and Databricks, and we work on a bunch of things. We usually like looking farther ahead into the future. We've worked a lot on scalable systems for machine learning, how to do efficient training on lots of GPUs and things like that, or how to do efficient serving. Another thing I'm really excited about, that we started a few years ago, is looking at knowledge-intensive applications, where you combine a language model with something like a search engine or an API you call, and you try to produce a correct result, maybe for a complicated task, like do a literature survey and then tell me, you know, what you found about this thing with a bunch of references or counter-arguments or whatever. And I have a great group of PhD students that are working on that and exploring different ways to do it.

ELAD:

How did Databricks decide to start working on Dolly? What sparked that, and, you know, how did you first get going on it?

MATEI:

Yeah, so we'd had customers working with large language models of various forms even before ChatGPT came out, but they were doing the more standard things like translation or sentiment analysis or things like that. A lot of them were tuning models for their specific domains. I think we had almost a thousand customers that were using these in some form. But then when ChatGPT came out in November, it got people interested in using these for a lot more than just analyzing a bit of data, and instead creating entire new interfaces or new types of computer applications, new experiences in them. And so there was intense interest in this, even at a time when, you know, the industry in general is being conscious about spending and which things are really required and so on.

This was an exciting one, and the really exciting thing about ChatGPT, as you both know, is the instruction following: basically, its ability to carry on a conversation and, you know, listen to the things you're telling it to do and do those, as opposed to just completing text or just telling you a small amount of information, like whether this is a positive or negative sentiment. So we really wanted to see whether it's possible to democratize this and to let people build their own models with their own data, without sending it to some centralized provider that's trying to learn from everyone's data, and to kind of control their own destiny in this space. We were exploring different ways of doing it, and in particular, Dolly is partly based on this great result from some other faculty members at Stanford called Alpaca, where they tested a way to, basically, use a model to generate a bunch of realistic conversations and then use those to train another model that can now carry on a conversation on its own.

And so we tried essentially cloning that approach but starting with an open source model, and it actually worked pretty well. That's what became Dolly. But yeah, we've been looking at this space for a while and seen, you know, incredible demand for these kinds of applications.
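The Alpaca-style recipe Matei describes can be sketched in a few lines: use a capable "teacher" model to answer a set of seed instructions, then format the results as supervised fine-tuning pairs for a smaller open model. The prompt template, `generate_fn`, and the stand-in teacher below are illustrative assumptions, not Databricks' actual pipeline.

```python
# Minimal sketch of generating instruction-tuning data, Alpaca-style.
# A teacher model answers seed instructions; the (prompt, completion)
# pairs become training data for a smaller open-source model.

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_training_examples(seed_instructions, generate_fn):
    """Ask a teacher model (generate_fn) to answer each seed instruction
    and return prompt/completion pairs for supervised fine-tuning."""
    examples = []
    for instruction in seed_instructions:
        prompt = PROMPT_TEMPLATE.format(instruction=instruction)
        completion = generate_fn(prompt)  # call out to the teacher model
        examples.append({"prompt": prompt, "completion": completion})
    return examples

# Toy stand-in for a real teacher model, just to show the data shape.
fake_teacher = lambda prompt: "Example answer."
pairs = build_training_examples(
    ["Summarize the plot of Snow Crash."], fake_teacher
)
```

In a real run, `generate_fn` would call an actual model, and `pairs` would feed a standard fine-tuning loop.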

ELAD:

Yeah, I think the industry's really been very focused on scaling data, parameter size, and flops. And I think you all have really showcased the power of instruction following even at, you know, something that's relatively smaller scale. Could you explain that and how that all works?

MATEI:

It's very interesting, and I think there's actually a lot of research still to be done here, because these models have been mostly locked up in these very large companies for a while and everyone thought it's too hard to reproduce them. So the interesting thing is, you know, language models had existed for a while. You basically trained them to complete words: here's a missing word in the text, can you fill it in? And at the beginning, when people tried to apply them to real applications, not just, you know, I erased a word on my homework, fill it back in, but actual applications, they had always trained something else on top of, say, the feature representation in these. So there was a lot of domain-specific work, but you could build, like, a sentiment classifier or stuff like that.

Is it positive or negative? Probably like three years ago now, OpenAI published the GPT-3 paper, which is called "Language Models are Few-Shot Learners." And they said, number one, we trained a language model to 175 billion parameters, and we trained it on, I think it's like 45 terabytes of text. So lots of data, lots of parameters, and it's pretty good at language modeling. And then, number two, they said you can actually kind of prompt this with a few examples of a task, and it picks up on the task and does it. So lots of people were working on that: how do you prompt it, what's the best example to show? But everyone assumed that for that capability you need a giant model to begin with. So even the researchers in academia were calling into GPT-3 and trying to build, you know, stuff based on it and studying this phenomenon.
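The few-shot prompting idea from that paper is simple to picture in code: instead of fine-tuning, you concatenate a handful of worked input/output examples ahead of the new input and let the model infer the task. The formatting below is one common convention, not the paper's exact template.

```python
# Sketch of building a few-shot prompt: show the model worked examples,
# then the new input, and let it continue from "Output:".

def few_shot_prompt(examples, query):
    """Concatenate (input, output) examples followed by the new input."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{shots}\nInput: {query}\nOutput:"

prompt = few_shot_prompt(
    [("I loved this movie!", "positive"),
     ("Total waste of time.", "negative")],
    "The plot dragged on forever.",
)
```

A large model completing this prompt picks up the sentiment-classification task purely from the two examples shown.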

And then last year, in 2022, OpenAI published this other paper about instruction tuning these models, where they said, hey, we used some human feedback and then some reinforcement learning, and we got this GPT-3 model to actually just listen to one instruction. It doesn't need a complicated prompt or lots of examples, and it kind of works. And then they released a version of this as ChatGPT. So I think in a lot of people's minds, the scientific view of it was: first you need a giant model, and then you need this reinforcement learning thing, and only then do you get this conversational capability and broad world knowledge. So it's actually very surprising that in Alpaca they just had a data set of, you know, human-like conversations and this very modest-size open source model that's only 6 billion parameters, trained on less than one terabyte of text.

So like 50 times less data than GPT-3, and it still has this behavior. I think it's been pretty surprising to a lot of researchers what size of model still gets you this kind of instruction-following ability. So I think it is kind of an open research problem: what exactly about these data sets makes them good at this? What are the limitations, you know, tasks that these are clearly worse at or better at? It's actually kind of hard to evaluate with long answers, because it's hard to automatically score them and say, you know, this is a good Seinfeld skit that you generated and this is a bad, you know, Barack Obama speech. But I think we'll figure this out. Yeah.

ELAD:

Were there any things that emerged from the model that you also found surprising? You mentioned one aspect of it, just in terms of the approach you took: with, you know, dramatically more limited data, you ended up with really performant behavior. Were there other unexpected properties of what you did with Dolly?

MATEI:

Yeah, I think to me the most interesting thing is it's surprisingly good at just freeform, fluent text generation. So you can tell it, create a story or create a tweet or create a scientific paper abstract, and it does a pretty good job at that. And before that, whenever I talked to my, you know, NLP researcher friends, they thought that creativity was the thing that required a lot of parameters, something like GPT-3. They actually told me, oh, the knowledge-intensive stuff, like remembering facts, tell me the capital of France or whatever, it's not surprising that a small model with a few parameters can do that, but the creativity, that's really hard. So this one is actually pretty good at the creativity and generation. It's less good at remembering lots of facts, which kind of makes sense given the parameters. If you ask it about common topics, you know, it'll be good; if you ask it, like, the author of a book, it might give the wrong one. We had an example, because we've actually been building a slightly bigger version of this too, and we had this question of who is the author of Snow Crash, which is Neal Stephenson, and the initial Dolly model said Neil Gaiman. So, you know, it's still a Neil, it's still an author, but it's still wrong.

ELAD:

A Neil, the sci-fi writer. Yeah, yeah.

MATEI:

So it's less good at remembering facts but pretty good at coherent sort of generation.

ELAD:

Yeah. The name Dolly basically references the first cloned mammal, Dolly the sheep. Can you explain the reference within the AI space?

MATEI:

Yeah, so it's based on, you know, cloning this other model from Stanford called Alpaca by doing that with an open data set. And that itself was based on something Meta released, I think maybe three weeks ago or less, called LLaMA, where they took a modest-size model, 7 billion parameters, and trained it on a ton of data. I think they said 1.4 trillion tokens or something like that; I don't know how many bytes of data it was, but it was multiple terabytes of data, basically. And they said, hey, by just training this for longer, we got a small model that's actually producing pretty high-quality content for its size. So there were all these kind of, you know, woolly animals out there, and we thought it's just too perfect not to clone it. And there are all these other things, like, you know, the Dalai Llama, I don't know. There are all these things. Yeah.

ELAD:

So yeah, I think that was a great name. Are there other things that you can share that you all have coming, in the background at Databricks or your Stanford lab, in terms of this more general area of language models?

MATEI:

Yeah, I mean, at Databricks we're definitely, you know, using everything we learned from Dolly, and we're learning from our customers, to offer a great suite of tools for training and operating LLM applications. We already have a popular MLOps platform, and we also have this open source project called MLflow that integrates with a lot of the tools out there that our offering is built around. So you can expect some nice integrations into that. Separately, we're also working on Databricks product features that use language models internally, and we're learning a lot from developing those and feeding that into our product. So I think in the next few months you can expect that, and we also have this big user conference, Data + AI Summit, coming up in June that will probably have, you know, a lot of stuff about this.

And I would say, you know, as a researcher and also kind of with my Databricks hat on, the thing I'm most excited about is really connecting these models with reliable data sources and making them produce reliable results. Because if you use ChatGPT or GPT-4, the two big problems with them are, number one, the knowledge is not up to date; it only knows stuff it was trained on. And number two, a lot of the things they say are inaccurate: confident but wrong in various ways. And I think you can tackle both of these by combining some kind of language model with, you know, a system that pulls out vetted data, either from documents, like a search engine, or from APIs and tables and stuff like that inside your company.

You know, like for example, when I talk to the chatbot at my bank, it should know my latest bank account balance and transactions and stuff. If I'm like, can you cancel the payment I made, because I unsubscribed, it should just know what that means. So cracking exactly how to do that isn't easy. It may actually be easier with small models than with big ones to reduce hallucination, but, you know, I think it's still an open question. If we can figure this out, though, then these become a much more reliable component in an application.
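The bank-chatbot example amounts to retrieval-augmented generation: fetch the relevant, vetted records first, then put them in the prompt so the model answers from current facts rather than stale training data. The keyword retrieval and the account data below are made up for illustration; a real system would use proper search over operational systems.

```python
# Toy sketch of grounding a language model in vetted data: retrieve
# matching records, then assemble them into the prompt as context.

ACCOUNT = {
    "balance": 1240.50,
    "transactions": [
        {"id": 1, "desc": "Streaming subscription", "amount": -9.99},
        {"id": 2, "desc": "Grocery store", "amount": -54.20},
    ],
}

def retrieve(query, account):
    """Naive keyword retrieval over the user's transactions."""
    words = query.lower().split()
    return [t for t in account["transactions"]
            if any(w in t["desc"].lower() for w in words)]

def grounded_prompt(query, account):
    """Build a prompt that puts retrieved records ahead of the question."""
    hits = retrieve(query, account)
    context = "\n".join(f"- {t['desc']}: {t['amount']}" for t in hits)
    return f"Relevant account data:\n{context}\n\nUser question: {query}"

prompt = grounded_prompt("cancel my streaming subscription payment", ACCOUNT)
```

The model then only has to reason over the retrieved context, which is what makes the "up to date and verifiable" part tractable.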

SARAH:

Maybe we'll go from there to just projecting a little bit about architecture and research. You know, so much of the industry is focused on model scaling, right, improving reasoning that way. How much do you think that matters in terms of, I guess, real-world usage in production with your customers in the near term?

MATEI:

Mm-hmm <affirmative>. Yeah, great question. So, to me at least, the relationship between the scale of the model versus, you know, the quality of the data and supervision you put in versus the design of an application around it, and how those affect overall quality, is not a hundred percent clear yet. To get a really reliable model that can, say, I dunno, write a pharmacy prescription or something like that, maybe you need a trillion parameters; maybe you actually need a really carefully designed data set and supervision process, which is kind of traditional ML engineering type work. Or maybe you actually need a cleverly designed application where you're chaining together a couple of models and things, and you're saying, well, does this make sense? Can I find a reference? Can I show this example to a human if it's really hard?

So I think it's a little bit open. The thing I can say for sure, which Dolly and, you know, other results like this really highlighted, is it does seem that the core tech is getting commoditized very quickly. So if you just want to run something like today's ChatGPT, it'll be a lot cheaper, because all these hardware manufacturers are building devices that are specialized and much cheaper. And another thing that's making it less expensive is we're figuring out ways to get a smaller model, with less data and fewer parameters and stuff, to similar performance. That's happening faster than at least I would've thought, you know, a few months ago. So at least to get something with today's capabilities, I think it'll be very affordable, and you might just be able to run it locally on, you know, your phone or something.

The question of whether, if you make a much larger model, it's going to be a lot smarter, I think is still a bit unknown. I mean, there are people who argue it's going to be very good at reasoning, but at the same time, this kind of token-by-token generation we're doing now is not an amazing format for reasoning, because you have to linearly say one thing at a time. So it's not really good for making plans or comparing versions. I think to get a really smart application you'll need to combine today's language modeling with some other sort of framework around it that, you know, uses it multiple times or explores a plan space or whatever, and then you might get something good. And it's also possible that the very largest models are simply memorizing more stuff. So they're impressive in terms of trivia, like I can ask about some random topic and it'll know, but they're not really smarter at solving even a basic, you know, word problem. So yeah, I'm not sure, unfortunately. Especially with training from the web, it's often very hard to tell apart reasoning from memorization: essentially, did it see that thing before? So I actually think being able to do experiments where you train these on carefully selected data will lead to better understanding of what they can do.

SARAH:

Yeah, yeah, that makes sense. Maybe if we think a little bit, just because you have great visibility from your role at Databricks: what other tooling do companies need, like your enterprise customers or just generally enterprises, to make use of these models? Because you said, you know, we believe the core technology, the models themselves, is getting commoditized.

MATEI:

Yeah, so definitely the first piece is you need a data platform where you can actually build, you know, reliable data, right? We think that's like the bread and potatoes of getting anything <laugh>; you need, you know, a basis to build on. So we think that will become really important, and maybe data platforms will have to evolve a little bit to be better at supporting unstructured data like text and images and so on, and to do quality assessment and stuff like that for it. That's one piece. Another piece you need is the MLOps piece: being able to experiment with things, deploy them, A/B test them, and so on, see what does better, and improve it incrementally. And I also think these models will need a good connection to operational systems inside the company to do really powerful things with, like, the latest data.

So, you know, you probably saw the support for tools in ChatGPT; before that there were lots of groups working on language models integrated into search engines, and sometimes into calling other tools as well, like calculators. I think it's still a little bit open-ended. There's one extreme where people say the model will figure out what tools to use on its own. For enterprise use cases, I think that's a little bit more than you really need: you can kind of give it some tools and feed it stuff, and it doesn't have to discover them and, like, read the manual to figure out which one to use. But yeah, I think that's another piece you'll need for really powerful applications. And then I do think infrastructure, just basic training and serving infrastructure, is important too, when you start to care about performance, like about latency and speed. You can see some of the, um, you know, new search engines using these models are not that fast, right? A little bit slow. It would be nice to have them faster, and for automated analytics it's even more important that it's efficient. So I think there'll be a lot of activity there. Yeah.
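The "give it a fixed set of tools" approach Matei prefers for enterprise use can be sketched as a small registry plus a dispatcher: the application hands the model a known tool list, and a model output naming a tool is routed to the matching function. The tool names, output format, and calculator example below are illustrative assumptions, not any real product's API.

```python
# Sketch of fixed-tool dispatch: the app registers tools up front, and a
# model output of the form "tool_name: argument" is routed to that tool.

TOOLS = {
    # eval is restricted to plain arithmetic for this toy example.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "lookup_balance": lambda _: "1240.50",  # stand-in for a real API call
}

def dispatch(model_output):
    """Parse a 'tool: argument' line emitted by the model and run the tool."""
    name, _, arg = model_output.partition(":")
    name = name.strip()
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](arg.strip())

result = dispatch("calculator: 2 + 3")
```

Because the tool set is closed, the model never has to "discover" anything; the prompt simply lists the available tool names and their argument formats.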

SARAH:

Where do you see enterprises getting the most value from investing in, I guess more traditional ml and then like some of the language model stuff to date?

MATEI:

Yeah, great question. So with traditional ML, we're seeing virtually all major enterprises, in all industries, using it. It's changed a lot in the past decade, actually. It's very good for forecasting things in general and for automating certain types of decisions. So, for example, optimizing your supply chain, right? You don't have time to look at exactly everything that's going on and, you know, think about it and have a meeting. But whether you order the right amount of parts to meet your demand this week, or whether you minimize the amount of time an agricultural product sits in a warehouse and, you know, degrades in quality, or stuff like that, it matters a lot, and it can have a huge impact on the profitability of a company. So we're seeing a lot of people applying it to automate supply chains and, basically, their operations in various ways.

And then there are more classic use cases like fraud detection and stuff like that, where also, you know, it's always an arms race, and you're trying to do the best you can, because every percent of accuracy you do better can translate into, you know, huge impact. With language models specifically, and especially kind of conversational ones, I think the really exciting thing is interfaces to people, and I think customer support is a very obvious one. Maybe things like recommendations, or asking questions on a product page in retail; things like search augmented with this stuff is another. And we've also found that just internal apps in a company that have a lot of internal data can benefit from this kind of thing. So, for example, one of the things we've built inside Databricks: we have all these resources for engineers to understand how different parts of the product work, how to operate it, all the APIs, and people used to just ask each other questions in these Slack channels for each team. We could use that data, the questions and answers, plus the data in the actual documentation, to essentially automatically answer many, many such questions and just save people a lot of time.

So I do think that any app that has kind of business data, or stuff written by humans in it, like the issue tracker for your software development or your Salesforce or something like that, could benefit from these kinds of interfaces. Yeah.

ELAD:

Yeah, it seems like any type of forum or anything else instantly becomes data that you can use to fine-tune or train a model that's specific to your customer support use case, or you could use an embedding or something to do interesting things with it. So it seems like there's some really cool stuff to do. Are there any specific areas that Databricks is not focused on that you think would be especially interesting for somebody to build, from a tooling perspective, for enterprises trying to use some of these technologies?

MATEI:

Yeah, um, I think there are a lot of these; I think it's very early on. Probably one of the most obvious ones is just the domain- or vertical-specific models and tools. I actually think even a lot of the enterprises that have a lot of data in various domains might turn into data or model vendors of some form in the future, as they use this to build something that no one else can. So I wouldn't be surprised at all if you see the next wave of companies for, say, security analytics, or biotech, or analyzing financial data, or stuff like that, really build around LLM technology. And I also think, in general, in the app development space, how do you develop apps that incorporate these tools? It's very open, it's not clear what the best way to do it is, and you might end up with really good programming tools that focus on this problem.

I would say, you know, for people thinking about startups and so on: you want your startup to have a long-term defensible moat, ideally something that grows over time too. So anything around a unique data set, for example, or a unique feedback-into-action loop you have, is always good, right? Honestly, even something like adding ML features in your product that just kind of learn from your users and, you know, do better recommendations and so on could eventually become a moat, where others just can't easily catch up. But I think anything that's around a custom data set is sort of safest.

ELAD:

Yeah. When you were working on Spark for your PhD, did you think you'd become a founder? Was your intention to start a company or did you just think it was interesting research to do? Or both?

MATEI:

No, it really wasn't. Yeah, I mean, as a grad student, you know, I've always been interested in just doing things that have an impact, that help people do cool things. I had seen these open source technologies out there for distributed data processing, and I thought, okay, well, I'll try to start one and see how it goes. I wasn't sure that people would really pick it up and use it, but I wasn't looking to be a founder necessarily. I was just looking to do something useful in this, like, emerging space. And honestly, I was at least considering becoming a computer science professor, and I thought, if I'm going to be a professor and all the most exciting computing is happening in data centers today, and I don't know how that works, how am I going to teach computer science to people? So I'd better learn about that stuff. But it turned out to be something more broadly interesting.

ELAD:

What was the most unexpected thing about being a founder?

MATEI:

There are a lot of, uh, challenges along the way. I think just being able to learn about all the aspects of a business, and how much complexity there is in each one. You know, starting out as a more technical person, at first I didn't really know what to expect, but there's a ton of depth in each one. And if you really try to understand them, get to know the culture of the people there, really get to know what they're thinking about, you can make much better decisions across multiple aspects of your company.

ELAD:

Is there anything that you would advise people coming from a similar background to yours? I have a PhD as well, although it's in biology. I feel like there are certain things that I learned in academia that were really valuable, and then there's a bunch of stuff I really needed to unlearn as I went into industry. Are there specific pieces of advice you'd give to technical founders or PhD founders in terms of things that they should unlearn?

MATEI:

Let's see. A lot of research, at least in computer science, the kind of stuff that I've worked on, is mostly prototyping. It's like, can we showcase an idea? But it's not really software engineering, where you build a thing that can be maintained and runs flawlessly in the future and, you know, supports real users. So I think you should unlearn just the focus on short-term stuff and think about how this is going to evolve over time. There is a phase of the company where you're just prototyping to get a good fit, but you should design things so they can evolve into something that's very reliable long term. The other thing to unlearn is trying to invent everything from scratch. You should really be careful about, hey, where am I doing something unique, or, if I'm doing something different from others, why is it right? Don't do it just for kicks. Because in research it's very tempting to say, you know, I did this new thing, I'm going to try all the fanciest new ideas in each component of it.

SARAH:

Was there something that you guys experimented with doing from first principles, something unique, where you then said, you know, there are existing systems for this?

MATEI:

A good one early on was deployment infrastructure: how do we deploy and update our software across, you know, all the clouds and so on. And we soon realized it's better to go with really standard things like Kubernetes and tools like that than to try to do something custom, because they're evolving very quickly. So that's kind of a good example where at the beginning you say, all right, how hard can it be? Let's just build something. But then you realize, wait, every month there's new stuff coming out, and maybe this isn't where we wanna focus.

SARAH:

So maybe just thinking about being CTO now of a very large company, how has your lens as a computer science researcher informed your thinking as a CTO?

MATEI:

I think first of all, as a researcher, you think a lot about the long-term trends, like what could things look like five or 10 years from now, or what are the fundamental things here? So for example, this thing about LLMs being commoditized, or honestly the thing about them kind of maxing out at more parameters. I think many people hadn't really thought about that, but if you think back, there is usually a lot of room to improve efficiency in hardware and software for an application. And this particular application is kind of simple, because it is all basically two or three different types of matrix operations. So it's sort of the hardware designer's dream to do this stuff. And also there are usually diminishing returns from scale in terms of quality of models in general. And you can see it in other areas too. In computer vision, for example, we don't have trillion-parameter models. You get actually pretty small models that you can train for a specific task that are good. Self-driving cars are another example. They've rapidly improved in quality up to a point, and then they kind of plateaued, and they're still not really ready for primetime. Eventually you hit some limits.

SARAH:

There are plenty of people who are researchers in the field who don't really see an asymptote, right, with scaling. So where do you believe that limit comes from? Parameters, compute, data, something else?

MATEI:

I just think a lot of things scale sublinearly in general. Now, it's hard to tell for things like reasoning and so on, but certainly in classical machine learning. For example, if you're trying to learn a function that separates positive and negative examples, as you add more data, your accuracy doesn't really improve linearly. With a few examples you get a pretty good estimate of that boundary, and then with more of them it gets a little bit better, but it doesn't get that much better. So I think it's common; that would be my main reason. Now, with language models specifically, I think the part that does go linearly with more parameters, or should, is the ability to just memorize more stuff. So if you wanted it to tell you who was on the fifth episode of, like, Friends, and what was the second line they said, and stuff like that.
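[Editor's note: the diminishing returns Matei describes can be illustrated with a toy sketch. The setup below, a one-dimensional threshold classifier whose boundary is estimated from random samples, is our own illustration, not something from the episode; it just shows that the estimation error shrinks much more slowly than the data grows.]

```python
import random

random.seed(0)

TRUE_BOUNDARY = 0.5  # hypothetical ground-truth decision threshold


def estimate_boundary(n):
    """Fit a 1-D threshold classifier from n labeled points: take the
    midpoint between the largest negative and smallest positive example."""
    xs = [random.random() for _ in range(n)]
    neg = [x for x in xs if x < TRUE_BOUNDARY]
    pos = [x for x in xs if x >= TRUE_BOUNDARY]
    if not neg or not pos:
        return 0.5  # degenerate sample, guess the middle
    return (max(neg) + min(pos)) / 2


def avg_error(n, trials=200):
    """Average distance between the estimate and the true boundary."""
    return sum(abs(estimate_boundary(n) - TRUE_BOUNDARY)
               for _ in range(trials)) / trials


# 100x more data buys far less than a 100x better estimate per step.
for n in [10, 100, 1000]:
    print(n, round(avg_error(n), 4))
```

Going from 10 to 1,000 examples shrinks the error substantially, but each doubling of data buys less than the one before it, which is the sublinear curve being described.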

Like, yeah, more parameters will get you a neural network that, just by putting in that input, can tell you that stuff. But that wasn't that interesting to me, because I think the right solution for that is to look things up in a database. Like, do retrieval, right? Do a search index. Actually, I think from a computation perspective it's very inefficient to have a trillion parameters and have to load them all and add and multiply by them each time you make an inference, cuz they're just encoding knowledge, most of which you don't need for that inference. So that one I wasn't as excited about. But I think there are people who are just excited about neural networks, the same kind of people who wonder, how do brains work? How do animals learn? Who are just excited about: wait, I only had some neurons and I put in this stuff and it remembered it. But as an engineer I'm not that excited, cuz I'm like, yeah, I could have built a database that did that. But in terms of, hey, I just trained a network with gradient descent and it did it, that is kind of cool.

ELAD:

Yeah, I feel like people are almost the opposite, where we're actually quite bad at memorization. We're very good at inferring things. And so it's interesting to ask what the basis for that is computationally.

MATEI:

Yeah, the other thing that we're learning from this, though, is it does seem that the type of data you put in and the kind of fine-tuning, which is essentially weighing the data, has a lot of impact. So this instruction-tuning stuff: we really have only a few examples of instruction following, but since we do fine-tune the model on them, it's as if we put a very high weight on them and had lots of examples of that in our training set. And I think it's still an open question. For example, if you made a lot of examples of logical puzzles, right, you just generate some problems and solutions, would you get a model that's better at logical reasoning? There are other things you can do. I also think a big problem with current models, and I kind of hinted at this before, is we're just calling them to generate one token at a time.
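[Editor's note: the "fine-tuning as upweighting" analogy can be checked numerically. This toy sketch, our own illustration with a weighted mean standing in for a weighted training loss, shows that weighting an example by w is arithmetically the same as duplicating it w times in the dataset.]

```python
def weighted_mean(values, weights):
    """Weighted average: a stand-in for a weighted training objective."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# One rare "instruction-following" example (10.0) upweighted 5x...
values = [1.0, 2.0, 10.0]
weights = [1, 1, 5]

# ...is equivalent to duplicating it 5 times in the training set.
duplicated = [1.0, 2.0] + [10.0] * 5

assert weighted_mean(values, weights) == sum(duplicated) / len(duplicated)
```

That is why a handful of instruction examples, trained on with full strength during fine-tuning, can behave like a large slice of the pretraining corpus.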

So for example, you've probably seen this chain-of-thought reasoning thing. If you ask a model a math problem and it just tries to answer, like, how many sheep were there, it might say seven or something, and then it tries to make up the explanation, and it's wrong. But if you tell it, do the explanation first, think step by step, and then answer, it's more likely to be right. And you can imagine other versions of that: if it had a scratch pad, if it had a way to backtrack, to say, you know, this is kind of a dead end, it might become better. So I think stuff like that, that's kind of around the model, it's still an AI system, but it's not just one giant DNN, can further improve its abilities.
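[Editor's note: chain-of-thought is purely a prompting change, so it can be sketched as two prompt builders. The exact wording and the `Answer:` convention below are our own choices, not from any specific paper or the episode.]

```python
def direct_prompt(question: str) -> str:
    # Ask for the bare answer; the model commits before any "thinking".
    return f"{question}\nAnswer with only the final number."


def chain_of_thought_prompt(question: str) -> str:
    # Ask for the reasoning first, so the final answer can depend on it.
    return (
        f"{question}\n"
        "Let's think step by step, then give the final answer "
        "on the last line as 'Answer: <number>'."
    )


question = "A farmer has 3 pens with 4 sheep each and sells 5. How many are left?"
print(chain_of_thought_prompt(question))
```

Because the model generates left to right, one token at a time, the second prompt lets the answer tokens condition on the reasoning tokens, which is the whole trick.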

SARAH:

Yeah. And you've seen that work in really complex and impressive ways. Like, we had Noam Brown from the Cicero group on, and you know, they have planning as part of it, right? Versus just one very large model they expect to do all the reasoning. You were actually saying, you know, you basically make sometimes controversial long-term predictions about what's going to happen, like that there's an asymptote in the value of scale.

MATEI:

Yeah. As a researcher

SARAH:

And how does that impact like your decisions as CTO?

MATEI:

So especially as a company grows, right, it actually becomes slower to change direction super dramatically. So you really wanna think about what we'll do long term. You know, our CEO, Ali, has this decision rule: with any decision, I ask, which one am I more likely to regret five years from now? Not five months from now, but, you know, if I don't do this, what's gonna happen? So you try to think about where things are gonna go. Of course, you do wanna collect data and update your thoughts about it, test hypotheses, and that I think is something you can get from research too. In research, we always think, when we have an idea, it's sort of a race to figure out: is it a good idea, and can I publish it?

Because the research community values novelty a lot, being the first to do something, for better or worse. It's not amazing, but if you just reproduce a thing that someone else did, unfortunately you don't get as much credit. So we do think about how we can quickly validate something. But at the same time, even in research, you try to pick topics that will matter. For example, when I was doing my PhD, I didn't do a ton with machine learning. I knew people who did it, I helped them out, I built infrastructure, but I didn't do ML research myself. And then later I decided, yeah, I am gonna do some things, especially around connecting machine learning to external data sources like search engines. And I knew it was gonna take a while to really learn about it and get an intuition and stuff, but I thought this was gonna matter long term, because I think the local part, you know, parsing the semantics of what a sentence means, is kind of solved already, and the interesting thing will be doing this in a bigger system.

SARAH:

Yeah, I, uh, I have four degrees and no PhD. I've never contributed anything to the corpus of the world's knowledge. Elad, gotta ask, does it affect how you do investing?

ELAD:

No, not really. <Laugh>,

SARAH:

The PhD's a moot point.

ELAD:

Nice. I don't know <laugh>. I have a math degree as well, and I feel like that actually was a thing that forced me to think slightly differently, or at least it forced a very logical way of thinking. It felt like there's a groove in your brain for logic that gets carved. So that probably helped. But who knows? I don't know.

SARAH:

You've been working in data and machine learning for a long time. Where do you think we are in this generation of AI? Mm-Hmm.

MATEI:

<Affirmative>. Yeah, I think we're still at the early stages of AI on unstructured data, so things like text and images and so on, really having an impact in applications. I think the ChatGPT-related features that every application is going to add will change the way we work with computing, and they'll also change data analytics to some extent, because you'll be able to use this data. And honestly, I also think that in terms of just basic data infrastructure and ML infrastructure, we're still pretty early. It's still many different tools you have to hook together, a lot of complex integration, and you need a lot of sort of specialized people to do it. And I think over time, especially because of the capabilities of these AI models, every software engineer will need to become an ML engineer and a data engineer as they build their application, and we'll figure out ways of doing that, recipes or abstractions or whatever, that are actually easy enough for everyone to do.

And one analogy I like is, you know, when I was learning programming, which was sort of mid-to-late nineties, I got these books on web applications, and it was very complicated. There was a book on MySQL, there was a book on the Apache web server, or like CGI-bin, all these things you had to hook together. And now most developers can make a web application in like one function, and even non-programmers can make something like Google Forms or Salesforce or whatever that's basically a custom application. So I think we're far away from that in data and ML, but it could sort of look like that. It's harder because it depends on the sort of static data that you've got sitting around. But I do think there are gonna be a lot more of these applications.
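[Editor's note: the "web application in one function" point is quite literal today. Here is a complete, if trivial, WSGI app in a single function, using only the Python standard library; the greeting text and port number are just example choices.]

```python
def app(environ, start_response):
    """A complete web application in a single function (WSGI).

    environ holds the request; start_response sends status and headers;
    the return value is the response body.
    """
    body = b"Hello from a one-function web app!"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# To actually serve it (hypothetical local port 8000):
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, app).serve_forever()
```

Compare that with the MySQL-plus-Apache-plus-CGI stack of the nineties; the hope in the episode is that data and ML pipelines eventually collapse the same way.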

SARAH:

Yeah. Matei, this is a great conversation. Thanks for joining us on No Priors.

MATEI:

Thanks so much. Thanks a lot, Sarah and Elad.