Dario Amodei - Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452

The Evolution of AI Capabilities: A Rapid Progression

If we extrapolate the curves we've had so far: we're starting to get to PhD level, last year we were at undergraduate level, and the year before we were at the level of a high school student. You can quibble about which tasks and for what, and we're still missing modalities, but those are being added; computer use has been added, image generation has been added. If you eyeball the rate at which these capabilities are increasing, it does make you think we'll get there by 2026 or 2027.

There are still worlds where this doesn't happen for another 100 years, but the number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years. And the scale-up is very quick. Today we make a model and then deploy thousands, maybe tens of thousands, of instances of it. Certainly within two to three years, whether we have these super-powerful AIs or not, clusters are going to get to the size where you'll be able to deploy millions of them.

AI Safety: A Growing Concern

I am optimistic about meaning, but I worry about economics and the concentration of power. That's actually what I worry about more: the abuse of power. AI increases the amount of power in the world, and if you concentrate that power and abuse it, it can do immeasurable damage. Yes, it's very frightening.

The Conversation with Dario Amodei

The following is a conversation with Dario Amodei, CEO of Anthropic, the company that created Claude, which is currently and often at the top of most LLM benchmark leaderboards. On top of that, Dario and the Anthropic team have been outspoken advocates for taking the topic of AI safety very seriously, and they have continued to publish a lot of fascinating AI research on this and other topics.

Dario Amodei: The Future of AI

We are advancing rapidly in capabilities, but we are also getting better at understanding what is going on inside these models. We're starting to get a handle on how they work and how they can be improved, and that makes me more optimistic about the future.

At the same time, serious risks and challenges come with this progress. As Dario notes, the concentration of power is a major concern: if that power is abused, it can do immeasurable damage. That is why prioritizing AI safety and developing concrete strategies for mitigating these risks is essential.

Amanda Askell: Alignment and Fine-Tuning

I'm also joined afterwards by two other brilliant people from Anthropic. First, Amanda Askell, a researcher working on alignment and fine-tuning of Claude, including the design of Claude's character and personality. A few folks told me she has probably talked with Claude more than any other human at Anthropic, so she was a fascinating person to ask about prompt engineering and for practical advice on how to get the best out of Claude.

Amanda Askell's work focuses on aligning AI models with human values and intentions. Her research involves designing the prompts and fine-tuning that shape Claude's character while keeping the model aligned with human ethics, which is crucial for building trust in AI systems and preventing misuse.
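
To make the prompt-engineering side of this concrete, here is a minimal sketch of steering Claude's tone through a system prompt using the Anthropic Python SDK. It assumes the anthropic package is installed and an API key is set in the environment; the model name and prompt wording are illustrative examples, not Anthropic's internal character-training setup.

    # Illustrative sketch only: a system prompt shapes behavior at the API level;
    # Anthropic's internal character training is a separate, more involved process.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # the updated Sonnet 3.5; any current model works
        max_tokens=512,
        system=(
            "You are a careful, direct assistant. Avoid needless apologies and "
            "say when you are uncertain rather than guessing."
        ),
        messages=[{"role": "user", "content": "Explain what a scaling law is in two sentences."}],
    )
    print(message.content[0].text)

Even a small change to that system prompt can noticeably shift the model's character, which hints at why designing a consistent personality is delicate work.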

Chris Olah: Mechanistic Interpretability

After Amanda, Chris Olah stopped by for a chat. He is one of the pioneers of mechanistic interpretability, an exciting set of efforts that aims to reverse-engineer neural networks and figure out what is going on inside them by inferring behaviors from activation patterns. This approach has significant implications for keeping future superintelligent AI systems safe.

Chris Olah explained how mechanistic interpretability can help detect when a model is trying to deceive the human it is talking to. By analyzing activations within the network, researchers can identify patterns that indicate the model is manipulating or misleading its user, a crucial step toward more transparent and trustworthy AI systems.
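
As a toy illustration of the general idea of reading behavior off activations, the sketch below fits a linear probe (a logistic regression) on stand-in hidden-layer activations labeled honest or deceptive and recovers a "deception direction." The data is synthetic and the probe is a generic technique used for illustration, not Anthropic's actual interpretability tooling.

    # Toy sketch of an activation probe on synthetic data; not Anthropic's method.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    d_model, n_examples = 512, 2000          # hypothetical hidden size and dataset size

    # Pretend a single direction in activation space encodes "deception".
    concept_direction = rng.normal(size=d_model)
    concept_direction /= np.linalg.norm(concept_direction)

    labels = rng.integers(0, 2, size=n_examples)              # 0 = honest, 1 = deceptive
    activations = rng.normal(size=(n_examples, d_model))
    activations += np.outer(labels * 2.0, concept_direction)  # shift deceptive examples

    # Linear probe: predict the label directly from the activations.
    probe = LogisticRegression(max_iter=1000).fit(activations, labels)
    print("probe accuracy:", probe.score(activations, labels))

    # The probe's weight vector approximates the concept direction; projecting a new
    # activation onto it yields a scalar "deception score" for that example.
    learned_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    new_activation = rng.normal(size=d_model) + 2.0 * concept_direction
    print("deception score:", float(new_activation @ learned_direction))

Steering interventions run the same idea in reverse: instead of reading a feature out, you add its direction back into the activations to amplify the corresponding behavior.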

The Future of AI: A Collaborative Effort

The development of AI capabilities has accelerated rapidly in recent years, and the risks have grown alongside them. The conversations with Dario Amodei, Amanda Askell, and Chris Olah highlight how much collaboration and sustained research it will take to address those concerns.

As we move forward, continued support for efforts like Anthropic's AI safety research, and for other work on making AI systems transparent and trustworthy, will be essential. By working together, we can create a future where AI benefits humanity as a whole rather than just a privileged few.

"WEBVTTKind: captionsLanguage: en- If you extrapolate the curvesthat we've had so far, right?If you say, well, I don't know,we're starting to get to like PhD level,and last year we wereat undergraduate level,and the year before wewere at like the levelof a high school student.Again, you can quibble with at what tasksand for what, we'restill missing modalities,but those are being added,like computer use was added,like image generation has been added.If you just kind of like eyeball the rateat which these capabilitiesare increasing,it does make you thinkthat we'll get there by 2026 or 2027.I think there are still worldswhere it doesn't happen in 100 years.Those world, the numberof those worlds is rapidly decreasing.We are rapidly running outof truly convincing blockers,truly compelling reasons whythis will not happenin the next few years.The scale up is very quick.Like we do this today, we make a model,and then we deploy thousands,maybe tens of thousandsof instances of it.I think by the time, you know,certainly within two to three years,whether we have thesesuper powerful AIs or not,clusters are gonna get to the sizewhere you'll be able todeploy millions of these.I am optimistic about meaning.I worry about economics andthe concentration of power.That's actually what I worry about more,the abuse of power.- And AI increases theamount of power in the world,and if you concentrate that powerand abuse that power, itcan do immeasurable damage.- Yes, it's very frightening.It's very frightening.- The following is aconversation with Dario Amodei,CEO of Anthropic, thecompany that created Claudethat is currently and often at the topof most LLM benchmark leaderboards.On top of that, Darioand the Anthropic teamhave been outspoken advocatesfor taking the topic ofAI safety very seriously,and they have continued to publisha lot of fascinating AI researchon this and other topics.I'm also joined afterwardsby two other brilliantpeople from Anthropic.First Amanda Askell, who is a researcherworking on alignment andfine tuning of Claude,including the design of Claude'scharacter and personality.A few folks told meshe has probably talkedwith Claude more thanany human at Anthropic.So she was definitely a fascinating personto talk to about prompt engineeringand practical advice on howto get the best out of Claude.After that, Chris Olahstopped by for a chat.He's one of the pioneers of the fieldof mechanistic interpretability,which is an exciting set of effortsthat aims to reverseengineer neural networksto figure out what's going on inside,inferring behaviors fromneural activation patternsinside the network.This is a very promising approachfor keeping future superintelligent AI systems safe.For example, by detectingfrom the activationswhen the model is trying to deceivethe human it is talking to.This is the \"Lex Fridman Podcast.\"To support it, please check outour sponsors in the description.And now, dear friends,here's Dario Amodei.Let's start with thebig idea of scaling lawsand the Scaling Hypothesis.What is it?What is its history?And where do we stand today?- So I can only describe it as, you know,as it relates to kindof my own experience.But I've been in the AIfield for about 10 yearsand it was something Inoticed very early on.So I first joined the AI worldwhen I was working at Baiduwith Andrew Ng in late 2014,which is almost exactly 10 years ago now.And the first thing we worked onwas speech recognition systems.And in those days I thinkdeep learning was a new thing,it had made lots of progress,but everyone was 
always saying,we don't have the algorithmswe need to succeed.You know, we're not,we're only matching a tiny, tiny fraction.There's so much we need to kindof discover algorithmically.We haven't found the pictureof how to match the human brain.And when, you know, insome ways I was fortunate,I was kind of, you know,you can have almostbeginner's luck, right?I was like a newcomer to the fieldand, you know, I looked at the neural netthat we were using for speech,the recurrent neural networks,and I said, I don't know,what if you make them biggerand give them more layers?And what if you scale up thedata along with this, right?I just saw these as likeindependent dials that you could turn.And I noticed that the model startedto do better and better asyou gave them more data,as you made the models larger,as you trained them for longer.And I didn't measure thingsprecisely in those days,but along with colleagues,we very much got the informal sensethat the more data and the more computeand the more training youput into these models,the better they perform.And so initially my thinking was,hey, maybe that is just truefor speech recognition systems, right?Maybe that's just one particular quirk,one particular area.I think it wasn't until 2017when I first saw the resultsfrom GPT-1 that it clicked for methat language is probably the areain which we can do this.We can get trillions ofwords of language data,we can train on them.And the models we weretraining those days were tiny.You could train them onone to eight GPUs whereas,you know, now we trainjobs on tens of thousands,soon going to hundredsof thousands of GPUs.And so when I saw thosetwo things togetherand, you know, there were a few peoplelike Ilya Sutskever,who you've interviewed,who had somewhat similar views, right?He might have been the first one,although I think a fewpeople came to similar viewsaround the same time, right?There was, you know, RichSutton's Bitter Lesson,there was Gwern wrote aboutthe Scaling Hypothesis.But I think somewherebetween 2014 and 2017was when it really clicked for me,when I really got conviction that,hey, we're gonna be able to dothese incredibly wide cognitive tasksif we just scale up the models.And at every stage of scaling,there are always arguments.And you know, when I firstheard them, honestly,I thought probably I'mthe one who's wrong,and, you know, all theseexperts in the field are right.They know the situationbetter than I do, right?There's, you know, theChomsky argument about like,you can get syntactics, butyou can't get semantics.There was this idea, oh, youcan make a sentence make sense,but you can't make a paragraph make sense.The latest one we have today is, you know,we're gonna run out of data,or the data isn't high quality enough,or models can't reason.And each time, every time we manageto either find a way aroundor scaling just is the way around.Sometimes it's one,sometimes it's the other.And so I'm now at thispoint, I still think,you know, it's always quite uncertain.We have nothing but inductiveinference to tell usthat the next two years are gonna belike the last 10 years.But I've seen the movie enough times,I've seen the storyhappen for enough timesto really believe thatprobably the scalingis going to continue andthat there's some magic to itthat we haven't really explainedon a theoretical basis yet.- And of course the scalinghere is bigger networks,bigger data, bigger compute.- Yes.- All of those.- In particular, linear scaling upof bigger networks, bigger training timesand more and more data.So all of 
these things, almostlike a chemical reaction,you know, you have three ingredientsin the chemical reaction,and you need to linearly scaleup the three ingredients.If you scale up one, not the others,you run out of the other reagentsand the reaction stops.But if you scale up everything in series,then the reaction can proceed.- And of course, now that you have thiskind of empirical science/art,you can apply to other more nuanced thingslike scaling laws appliedto interpretability,or scaling laws applied to post-training,or just seeing how does this thing scale.But the big scaling law,I guess the underlying Scaling Hypothesishas to do with big networks,big data leads to intelligence.- Yeah, we've documented scaling lawsin lots of domains otherthan language, right?So initially, the paper we didthat first showed it was in early 2020where we first showed it for language.There was then some work late in 2020where we showed the samething for other modalities,like images, video, text-to-image,image-to-text, math.They all had the same pattern.And you're right, nowthere are other stageslike post-training or there are new typesof reasoning models.And in all of those casesthat we've measured,we see similar types of scaling laws.- A bit of a philosophical question,but what's your intuitionabout why bigger is betterin terms of network size and data size?Why does it lead tomore intelligent models?- So in my previouscareer as a biophysicist,so I did physics undergradand then biophysics in grad school.So I think back to whatI know as a physicist,which is actually much lessthan what some of my colleaguesat Anthropic have in termsof expertise in physics.There's this concept called the 1/f noiseand 1/x distributionswhere often, you know,just like if you add up a bunchof natural processes, you get a Gaussian.If you add up a bunchof kind of differentlydistributed natural processes,if you like take a probeand hook it up to a resistor,the distribution of the thermal noisein the resistor goes asone over the frequency.It's some kind of naturalconvergent distribution.And I think what it amounts to is thatif you look at a lot of thingsthat are produced by some natural processthat has a lot of different scales, right?Not a Gaussian, which iskind of narrowly distributed,but you know, if I look atkind of like large and small fluctuationsthat lead to electrical noise,they have this decaying 1/x distribution.And so now I think of like patternsin the physical world, right?Or in language.If I think about the patterns in language,there are some really simple patterns.Some words are much morecommon than others like \"the,\"then there's basic noun verb structure,then there's the fact that, you know,nouns and verbs have to agree,they have to coordinate.And there's the higherlevel sentence structure,then there's the thematicstructure of paragraphs.And so the fact that there'sthis regressing structure,you can imagine that as youmake the networks larger,first they capture thereally simple correlations,the really simple patterns,and there's this longtail of other patterns.And if that long tail of otherpatterns is really smoothlike it is with the 1/f noise in,you know, physicalprocesses like resistors,then you can imagine as youmake the network larger,it's kind of capturing moreand more of that distribution,and so that smoothness gets reflectedin how well the models are at predictingand how well they perform.Language is an evolved process, right?We've developed language,we have common wordsand less common words.We have common 
expressionsand less common expressions.We have ideas, cliches thatare expressed frequently,and we have novel ideas.And that process has developed,has evolved with humansover millions of years.And so the guess,and this is pure speculation would bethat there's some kindof long tail distributionof the distribution of these ideas.- So there's the long tail,but also there's theheight of the hierarchyof concepts that you're building up.So the bigger the network,presumably you have a higher capacity to-- Exactly, if you have a small network,you only get the common stuff, right?if I take a tiny neural network,it's very good atunderstanding that, you know,a sentence has to have, you know,verb, adjective, noun, right?But it's terrible at decidingwhat those verb, adjectiveand noun should beand whether they should make sense.If I make it just a littlebigger, it gets good at that,then suddenly it's good at the sentences,but it's not good at the paragraphs.And so these rarerand more complex patterns get picked upas I add more capacity to the network.- Well, the natural question then is,what's the ceiling of this?- Yeah.- Like how complicatedand complex is the real world?How much stuff is there to learn?- I don't think any of us knowsthe answer to that question.My strong instinct would be thatthere's no ceiling belowthe level of humans, right?We humans are able to understandthese various patterns,and so that makes me thinkthat if we continue to,you know, scale up these modelsto kind of develop newmethods for training themand scaling them up, thatwill at least get to the levelthat we've gotten to with humans.There's then a question of, you know,how much more is it possibleto understand than humans do?How much is it possible to be smarterand more perceptive than humans?I would guess the answer hasgot to be domain dependent.If I look at an area like biology,and, you know, I wrote this essay,\"Machines of Loving Grace.\"It seems to me that humans are strugglingto understand the complexityof biology, right?If you go to Stanford or to Harvardor to Berkeley, you have wholedepartments of, you know,folks trying to study, you know,like the immune systemor metabolic pathways,and each person understandsonly a tiny bit,part of it, specializes,and they're struggling tocombine their knowledgewith that of other humans.And so I have an instinct thatthere's a lot of room at thetop for AIs to get smarter.If I think of something like materialsin the physical worldor you know, like addressing, you know,conflicts between humansor something like that.I mean, you know, it may be there's only,some of these problems are notintractable, but much harder.And it may be that there's only so wellyou can do at some of these things, right?Just like with speech recognition,there's only so clearI can hear your speech.So I think in some areasthere may be ceilings,you know, that are very closeto what humans have done.in other areas, thoseceilings may be very far away.And I think we'll only find outwhen we build these systems.It's very hard to know in advance.We can speculate, but we can't be sure.- And in some domains, the ceilingmight have to do with human bureaucraciesand things like this, as you write about.- Yes.- So humans fundamentallyhave to be part of the loop.That's the cause of the ceiling,not maybe the limits of the intelligence.- Yeah, I think in manycases, you know, in theory,technology could change very fast,for example, all thethings that we might inventwith respect to biology.But remember there's a, you know,there's a 
clinical trial systemthat we have to go throughto actually administerthese things to humans.I think that's a mixture of thingsthat are unnecessary and bureaucraticand things that kind of protectthe integrity of society.And the whole challenge is thatit's hard to tell what's going on.It's hard to tell which is which, right?My view is definitely, I thinkin terms of drug development,my view is that we're too slowand we're too conservative.But certainly if you getthese things wrong, you know,it's possible to risk people'slives by being too reckless.And so at least some ofthese human institutionsare in fact protecting people.So it's all about finding the balance.I strongly suspect that balanceis kind of more on the sideof pushing to make things happen faster,but there is a balance.- If we do hit a limit,if we do hit a slow downin the scaling laws,what do you think would be the reason?Is it compute limited, data limited?Is it something else, idea limited?- So, a few things.Now we're talking about hitting the limitbefore we get to the level of humansand the skill of humans.So, I think one that's, you know,one that's popular todayand I think, you know,could be a limit that we run into.Like most of the limits,I would bet against it,but it's definitely possibleis we simply run out of data.There's only so much data on the internetand there's issues with thequality of the data, right?You can get hundreds of trillionsof words on the internet,but a lot of it is repetitiveor it's search engine, you know,search engine optimizationdrivel, or maybe in the futureit'll even be textgenerated by AIs itself.And so I think there are limitsto what can be produced in this way.That said, we and I wouldguess other companiesare working on ways to make data syntheticwhere you can, you know,you can use the modelto generate more dataof the type that you have alreadyor even generate data from scratch.If you think about what was donewith DeepMind's AlphaGo Zero,they managed to get abot all the way from,you know, no ability to play Go whatsoeverto above human level justby playing against itself.There was no exampledata from humans requiredin the AlphaGo Zero version of it.The other direction, of course,is these reasoning modelsthat do chain of thought and stop to thinkand reflect on their own thinking.In a way, that's anotherkind of synthetic datacoupled with reinforcement learning.So my guess is with one of those methods,we'll get around the data limitationor there may be other sources of datathat are available.We could just observe thateven if there's no problem with data,as we start to scale models up,they just stop getting better.It seemed to be a reliable observationthat they've gotten better,that could just stopat some point for a reasonwe don't understand.The answer could be that we need to,you know, we need to inventsome new architecture.There have been problems in the past with,say, numerical stability of modelswhere it looked likethings were leveling off,but actually, you know,when we found the right unblocker,they didn't end up doing so.So perhaps there's somenew optimization methodor some new technique weneed to unblock things.I've seen no evidence of that so far.But if things were to slow down,that perhaps could be one reason.- What about the limits of compute?Meaning the expensive natureof building bigger andbigger data centers.- So right now, I think, you know,most of the frontier model companiesI would guess are operating in, you know,roughly, you know, $1 billion scale,plus or minus a factor of three, 
right?Those are the models that exist nowor are being trained now.I think next year, we'regonna go to a few billion,and then 2026, we may go to,you know, above 10 billion,and probably by 2027,their ambitions to build100 billion dollar clusters,and I think all of thatactually will happen.There's a lot of determination to buildthe compute to do it within this country,and I would guess thatit actually does happen.Now, if we get to 100 billion,that's still not enough compute,that's still not enough scalethen either we need even more scaleor we need to develop some wayof doing it more efficientlyof shifting the curve.I think between all of these,one of the reasons I'm bullishabout powerful AI happening so fastis just that if you extrapolatethe next few points on the curve,we're very quickly getting towardshuman level ability, right?Some of the new models that we developed,some reasoning models thathave come from other companies,they're starting to getto what I would callthe PhD or professional level, right?If you look at their coding ability,the latest model we released, Sonnet 3.5,the new or updated version,it gets something like 50% on SWE-bench,and SWE-bench is an example of a bunchof professional, real worldsoftware engineering tasks.At the beginning of the year,I think the state of the art was 3 or 4%.So in 10 months we've gonefrom 3% to 50% on this task,and I think in another year,we'll probably be at 90%.I mean, I don't know, butmight even be less than that.We've seen similar thingsin graduate level math,physics, and biology frommodels like OpenAI's o1.So if we just continue toextrapolate this, right,in terms of skill that we have,I think if we extrapolatethe straight curve,within a few years, we willget to these models being,you know, above thehighest professional levelin terms of humans.Now, will that curve continue?You've pointed to, and I'vepointed to a lot of reasons why,you know, possible reasonswhy that might not happen.But if the extrapolation curve continues,that is the trajectory we're on.- So Anthropic has several competitors.It'd be interesting to getyour sort of view of it all.OpenAI, Google, xAI, Meta.What does it take to winin the broad sense of win in this space?- Yeah, so I want to separateout a couple things, right?So, you know, Anthropic's missionis to kind of try to makethis all go well, right?And you know, we have a theory of changecalled race to the top, right?Race to the top is about trying to pushthe other players to do the right thingby setting an example.It's not about being the good guy,it's about setting things up so thatall of us can be the good guy.I'll give a few examples of this.Early in the history of Anthropic,one of our co-founders, Chris Olah,who I believe you're interviewing soon,you know, he's the co-founder of the fieldof mechanistic interpretability,which is an attemptto understand what'sgoing on inside AI models.So we had him and one of our early teamsfocus on this area of interpretability,which we think is goodfor making models safe and transparent.For three or four years,that had no commercialapplication whatsoever.It still doesn't today.We're doing some early betas with it,and probably it will eventually,but you know, this is avery, very long research bedand one in which we've built in publicand shared our results publicly.And we did this because, you know,we think it's a way to make models safer.An interesting thing isthat as we've done this,other companies havestarted doing it as well,in some cases becausethey've been inspired by it,in some 
cases because they'reworried that, you know,if other companies are doing thisto look more responsible,they wanna look more responsible too.No one wants to look likethe irresponsible actor,and so they adopt this as well.When folks come to Anthropic,interpretability often a draw,and I tell them, the otherplaces you didn't go,tell them why you came here,and then you see soonthat there's interpretabilityteams elsewhere as well.And in a way, that takes awayour competitive advantagebecause it's like, oh, nowothers are doing it as well,but it's good for the broader system,and so we have to inventsome new thing thatwe're doing that othersaren't doing as well.And the hope is to basicallybid up the importanceof doing the right thing.And it's not about usin particular, right?It's not about havingone particular good guy.Other companies can do this as well.If they join the raceto do this, you know,that's the best news ever, right?It's just, it's about kindof shaping the incentivesto point upward instead of shapingthe incentives to point downward.- And we should say thisexample of the fieldof mechanistic interpretabilityis just a rigorous, non-handwavy way of doing AI safety,or it's tending that way.- Trying to, I mean, Ithink we're still earlyin terms of our ability to see things,but I've been surprised at how muchwe've been able to lookinside these systemsand understand what we see, right?Unlike with the scaling lawswhere it feels likethere's some, you know,law that's driving thesemodels to perform better,on the inside, themodels aren't, you know,there's no reason whythey should be designedfor us to understand them, right?They're designed to operate,they're designed to work,just like the human brainor human biochemistry.They're not designed for ahuman to open up the hatch,look inside and understand them.But we have found, and you know,you can talk in much moredetail about this to Chris,that when we open them up,when we do look inside them,we find things that aresurprisingly interesting.- And as a side effect, you also getto see the beauty of these models.You get to explore sortof the beautiful natureof large neural networksthrough the mech interpkind of methodology.- I'm amazed at how clean it's been.I'm amazed at things like induction heads.I'm amazed at things like, you know,that we can, you know,use sparse auto-encodersto find these directionswithin the networks,and that the directions correspondto these very clear concepts.We demonstrated this a bitwith the Golden Gate Bridge Claude.So this was an experimentwhere we found a directioninside one of the neural network's layersthat corresponded tothe Golden Gate Bridgeand we just turned that way up.And so we released this model as a demo,it was kind of half ajoke, for a couple days,but it was illustrative ofthe method we developed.And you could take the Golden Gate,you could take the model, youcould ask it about anything,you know, it would be like, you could say,\"How was your day\" and anything you asked,because this feature was activated,would connect to the Golden Gate Bridge.So it would say, you know,\"I'm feeling relaxed and expansive,much like the arches ofthe Golden Gate Bridge\"or, you know.- It would masterfully change topicto the Golden Gate Bridgeand it integrate it.There was also a sadness to it,to the focus it had onthe Golden Gate Bridge.I think people quickly fellin love with it, I think,so people already miss it'cause it was taken downI think after a day.- Somehow these interventions on the modelwhere you kind of adjust its 
behaviorsomehow emotionallymade it seem more humanthan any other version of the model.- It has a strongpersonality, strong identity.- It has a strong personality.It has these kind oflike obsessive interests.You know, we can all think of someonewho's like obsessed with something.So it does make it feelsomehow a bit more human.- Let's talk about the present.Let's talk about Claude.So this year, a lot has happened.In March, Claude 3, Opus,Sonnet, Haiku were released,then Claude 3.5 Sonnet in July,with an updated version just now released,and then also Claude3.5 Haiku was released.Okay, can you explain the differencebetween Opus, Sonnet and Haiku,and how we should thinkabout the different versions?- Yeah, so let's go back to Marchwhen we first released these three models.So, you know, our thinking was, you know,different companies producekind of large and small models,better and worse models.We felt that there was demandboth for a reallypowerful model, you know,and you that might be a little bit slowerthat you'd have to pay more for,and also for fast, cheap modelsthat are as smart as they can befor how fast and cheap, right?Whenever you wanna do somekind of like, you know,difficult analysis, like if, you know,I wanna write code, for instance,or you know, I wanna brainstorm ideas,or I wanna do creative writing,I want the really powerful model.But then there's a lotof practical applicationsin a business sense where it's likeI'm interacting with a website.You know, like, I'm like doing my taxes,or I'm, you know, talking to a, you know,to like a legal advisor andI want to analyze a contractor, you know, we have plenty of companiesthat are just like, you know,I wanna do auto completeon my IDE or something.And for all of thosethings, you want to act fastand you want to usethe model very broadly.So we wanted to serve thatwhole spectrum of needs.So we ended up with this, you know,this kind of poetry theme.And so what's a really short poem?It's a haiku.And so Haiku is the small,fast, cheap model that is,you know, was at the timewas released surprisingly,surprisingly intelligent forhow fast and cheap it was.Sonnet is a medium sized poem,right, a couple paragraphs,and so Sonnet was the middle model.It is smarter but alsoa little bit slower,a little bit more expensive.And Opus, like a magnumopus is a large work,Opus was the largest,smartest model at the time.So that was the originalkind of thinking behind it.And our thinking then was,well, each new generationof models should shiftthat trade-off curve.So when we released Sonnet 3.5,it has the same, roughlythe same, you know,cost and speed as the Sonnet 3 model.But it increased its intelligenceto the point where it was smarterthan the original Opus 3 model,especially for code butalso just in general.And so now, you know, we'veshown results for Haiku 3.5,and I believe Haiku 3.5,the smallest new modelis about as good as Opus3, the largest old model.So basically the aim hereis to shift the curve,and then at some point,there's gonna be an Opus 3.5.Now, every new generationof models has its own thing.They use new data, theirpersonality changesin ways that we kind of, you know,try to steer but arenot fully able to steer.And so there's never quitethat exact equivalencewhere the only thing you'rechanging is intelligence.We always try and improve other things,and some things change withoutus knowing or measuring.So it's very much an inexact science.In many ways, the manner andpersonality of these modelsis more an art than it is a science.- So what is sort of the 
reasonfor the span of time between, say,Claude Opus 3.0 and 3.5?What takes that time?If you can speak to.- Yeah, so there's different processes.There's pre-training, which is, you know,just kind of the normallanguage model training,and that takes a very long time.That uses, you know, these days,you know, tens of thousands,sometimes many tens ofthousands of GPUs or TPUsor Trainium, or you know,we use different platforms,but, you know, accelerator chips,often training for months.There's then a kind of post-training phasewhere we do reinforcementlearning from human feedback,as well as other kinds ofreinforcement learning.That phase is gettinglarger and larger now,and, you know, often, that'sless of an exact science.It often takes effort to get it right.Models are then tested withsome of our early partnersto see how good they are,and they're then tested both internallyand externally for their safety,particularly for catastrophicand autonomy risks.So we do internal testingaccording to ourresponsible scaling policy,which I, you know, could talkmore about that in detail.And then we have an agreementwith the US and the UKAI Safety Institute,as well as other third party testersin specific domains to test the modelsfor what are called CBRN risks,chemical, biological,radiological and nuclear,which are, you know, wedon't think that modelspose these risks seriously yet,but every new model, we wanna evaluateto see if we're starting to get closeto some of these moredangerous capabilities.So those are the phases.And then, you know, thenit just takes some timeto get the model workingin terms of inferenceand launching it in the API.So there's just a lot of stepsto actually making a model work.And of course, you know,we're always tryingto make the processes asstreamlined as possible, right?We want our safety testing to be rigorous,but we want it to be rigorousand to be, you know, to be automatic,to happen as fast as it canwithout compromising on rigor.Same with our pre-training processand our post-training process.So, you know, it's justlike building anything else.It's just like building airplanes.You want to make them, you know,you want to make them safe,but you want to makethe process streamlined.And I think the creativetension between those is,you know, is an important thingin making the models work.- Yeah, rumor on the street,I forget who was sayingthat Anthropic has really good tooling,so probably a lot of the challenge hereon the software engineering sideis to build the toolingto have like a efficient,low friction interactionwith the infrastructure.- You would be surprised howmuch of the challenges of,you know, building thesemodels comes down to, you know,software engineering, performanceengineering, you know.From the outside you might think,oh, man, we had thiseureka breakthrough, right?You know, this movie with the science,we discovered it, we figured it out.But I think all things,even, you know, incredible discoveries,like, they almost alwayscome down to the details,and often super, super boring details.I can't speak to whether we havebetter tooling than other companies.I mean, you know, haven'tbeen at those other companies,at least not recently,but it's certainly somethingwe give a lot of attention to.- I don't know if youcan say, but from three,from Claude 3 to Claude 3.5,is there any extra pre-training going onor is it mostly focusedon the post-training?There's been leaps in performance.- Yeah, I think at any given stage,we're focused on improvingeverything at once.Just naturally, likethere are 
different teams,each team makes progressin a particular area,in making a particular, you know,their particular segmentof the relay race better.And it's just natural thatwhen we make a new model,we put all of these things in at once.- So, the data you have,like the preference datayou get from RLHF, is that applicable,is there a ways to apply itto newer models as it get trained up?- Yeah, preference data from old modelssometimes gets used for new models,although, of course, itperforms somewhat betterwhen it's, you know, trained on,it's trained on the new models.Note that we have this, you know,Constitutional AI methodsuch that we don't onlyuse preference data,we kind of, there's alsoa post-training processwhere we train the model against itselfand there's, you know, new typesof post-training the model against itselfthat are used every day.So it's not just RLHF,it's a bunch of other methods as well.Post-training, I think, you know,is becoming more and more sophisticated.- Well, what explains thebig leap in performancefor the new Sonnet 3.5?I mean, at least in the programming side.And maybe this is a good placeto talk about benchmarks.What does it mean to get better?Just the number went up,but, you know, I program,but I also love programmingand Claude 3.5 through Cursoris what I use to assist me in programming.And there was, at leastexperientially, anecdotally,it's gotten smarter at programming.So like, what does ittake to get it smarter?- We observed that as well, by the way.There were a couple very strong engineershere at Anthropic whoall previous code models,both produced by us and producedby all the other companies,hadn't really been useful to them.You know, they said, you know,maybe this is useful tobeginner, it's not useful to me.But Sonnet 3.5, the original onefor the first time they said,\"Oh my God, this helped mewith something that, you know,that it would've taken me hours to do.This is the first model that'sactually saved me time.\"So again, the waterline is rising.And then I think, you know,the new Sonnet has been even better.In terms of what it takes,I mean, I'll just say it'sbeen across the board.It's in the pre-training,it's in the post-training,it's in various evaluations that we do.We've observed this as well.And if we go into thedetails of the benchmark,so Sowe bench isbasically since, you know,since you're a programmer, you know,you'll be familiar with like pull requestsand, you know, just pullrequests are like the, you know,like a sort of atomic unit of work.You know, you could say, you know,I'm implementing one thing.And Sowe bench actually gives youkind of a real world situationwhere the code base is in a current stateand I'm trying to implementsomething that's, you know,that's described in language.We have internal benchmarkswhere we measure the same thingand you say, just give themodel free reign to like,you know, do anything, runanything, edit anything.How well is it able tocomplete these tasks?And it's that benchmark that's gonefrom it can do it 3% of the timeto it can do it about 50% of the time.So I actually do believe that if we get,you can gain benchmarks,but I think if we get to100% on that benchmarkin a way that isn'tkind of like overtrainedor game for that particular benchmark,probably represents areal and serious increasein kind of programming ability.And I would suspect thatif we can get to, you know,90, 95% that, you know,it will represent abilityto autonomously do a significant fractionof software engineering tasks.- Well, ridiculous timeline question.When 
is Claude Opus 3.5 coming out?- Not giving you an exact date,but you know, there, youknow, as far as we know,the plan is still tohave a Claude 3.5 Opus.- Are we gonna get itbefore \"GTA 6\" or no?- Like \"Duke Nukem Forever.\"- \"Duke Nukem-\"- What was that game?There was some game thatwas delayed 15 years.- That's right.- Was that\"Duke Nukem Forever?\"- Yeah.And I think \"GTA\" is nowjust releasing trailers.- You know, it's only been three monthssince we released the first Sonnet.- Yeah, it's theincredible pace of release.- It just tells you about the pace,the expectations for whenthings are gonna come out.- So what about 4.0?So how do you think aboutsort of as these modelsget bigger and bigger,about versioning, and alsojust versioning in general,why Sonnet 3.5 updated with the date?Why not Sonnet 3.6, which alot of people are calling it?- Yeah, naming is actuallyan interesting challenge here, right?Because I think a year ago,most of the model was pre-training,and so you could start from the beginningand just say, okay,we're gonna have modelsof different sizes, we'regonna train them all togetherand you know, we'll havea family of naming schemesand then we'll put somenew magic into themand then, you know, we'llhave the next generation.The trouble starts alreadywhen some of them take a lot longerthan others to train, right?That already messes upyour time a little bit.But as you make bigimprovements in pre-training,then you suddenly notice,oh, I can make better pre-train modeland that doesn't take very long to do,but you know, clearly it has the same,you know, size and shapeof previous models.So I think those two togetheras well as the timing issues,any kind of scheme you come up with,you know, the reality tendsto kind of frustrate that scheme, right?Tend tends to kind ofbreak out of the scheme.It's not like software where you can say,oh, this is like, youknow, 3.7, this is 3.8.No, you have models withdifferent trade-offs.You can change some things in your models,you can train, you canchange other things.Some are faster and slower at inference,some have to be more expensive,some have to be less expensive.And so I think all the companieshave struggled with this.I think we did very, you know,I think we were in a good positionin terms of naming when wehad Haiku, Sonnet and Opus.- It was great, great start.- We're trying to maintain it,but it's not perfect,so we'll try and getback to the simplicity,but just the nature of the field,I feel like no one's figured out naming.It's somehow a different paradigmfrom like normal software and so we just,none of the companieshave been perfect at it.It's something we struggle withsurprisingly much relative to,you know, how relativeto how trivial it is to,you know, for the grandscience of training the models.- So, from the userside, the user experienceof the updated Sonnet 3.5is just different than the previousJune 2024 Sonnet 3.5.It would be nice to come upwith some kind of labelingthat embodies thatbecause people talk about Sonnet 3.5,but now there's a different one,and so how do you refer to theprevious one and the new onewhen there's a distinct improvement?It just makes conversationabout it just challenging.- Yeah, yeah.I definitely think this questionof there are lots ofproperties of the modelsthat are not reflected in the benchmarks.I think that's definitelythe case and everyone agrees.And not all of them are capabilities.Some of them are, you know,models can be polite or brusque.They can be, you know, very reactiveor they can ask you questions.They 
can have what feelslike a warm personalityor a cold personality.They can be boring or theycan be very distinctive,like Golden Gate Claude was.And we have a whole, you know,we have a whole team kind of focused on,I think we call it Claude character.Amanda leads that teamand we'll talk to you about that.But it's still a very inexact science,and often we find thatmodels have propertiesthat we're not aware of.The fact of the matter is that you can,you know, talk to a model 10,000 timesand there are somebehaviors you might not see,just like with a human, right?I can know someone for a few months and,you know, not know thatthey have a certain skill,or not know that there'sa certain side to them.And so I think we just have to get usedto this idea and we're alwayslooking for better waysof testing our models todemonstrate these capabilities,and also to decide which arethe personality propertieswe want models to have andwhich we don't want to have.That itself, the normative questionis also super interesting.- I gotta ask you a question from Reddit.- From Reddit? Oh, boy. (laughs)- You know, there just this fascinating,to me at least, it's apsychological social phenomenonwhere people report that Claudehas gotten dumber for them over time.And so the question is,does the user complaintabout the dumbing downof Claude 3.5 Sonnet hold any water?So are these anecdotal reportsa kind of social phenomenaor did Claude, is there any caseswhere Claude would get dumber?- So this actually doesn't apply,this isn't just about Claude.I believe I've seen these complaintsfor every foundation modelproduced by a major company.People said this about GPT-4,they said it about GPT-4 Turbo.So, a couple things.One, the actual weightsof the model, right,the actual brain of themodel, that does not changeunless we introduce a new model.There are just a number of reasonswhy it would not make sense practicallyto be randomly substituting innew versions of the model.It's difficult from aninference perspectiveand it's actually hard tocontrol all the consequencesof changing the weight of the model.Let's say you wanted to finetune the model to be like,I don't know, to liketo say \"certainly\" less,which, you know, an oldversion of Sonnet used to do.You actually end upchanging 100 things as well.So we have a whole process for it,and we have a whole processfor modifying the model.We do a bunch of testing on it,we do a bunch of usertesting and early customers.So we both have never changedthe weights of the modelwithout telling anyone,and it wouldn't, certainlyin the current setup,it would not make sense to do that.Now, there are a couple thingsthat we do occasionally do.One is sometimes we run A/B tests,but those are typicallyvery close to when a modelis being released and for avery small fraction of time.So, you know, like, the daybefore the new Sonnet 3.5.I agree, we should have had a better name.It's clunky to refer to it.There were some comments from people thatlike it's gotten a lot better,and that's because, you know, a fractionwere exposed to an A/B testfor those one or two days.The other is that occasionally,the system prompt will change.The system prompt can have some effects,although it's unlikelyto dumb down models.It's unlikely to make them dumber.And we've seen thatwhile these two things,which I'm listing to be very complete,happened relatively,happened quite infrequently,the complaints about,for us and for other modelcompanies about the model change,the model isn't good at this.The model got more censored.The model was dumbed 
down.Those complaints are constant.And so I don't wanna say like peopleare imagining it oranything, but like the modelsare for the most part not changing.If I were to offer a theory,I think it actually relatesto one of the things I said before,which is that models are very complexand have many aspects to them.And so often, you know,if I ask the model a question,you know, if I'm like,\"Do task X\" versus \"Can you do task X?\"the model might respond in different ways.And so there are allkinds of subtle thingsthat you can change aboutthe way you interactwith the model that can giveyou very different results.To be clear, this itselfis like a failing by usand by the other modelproviders that the modelsare just often sensitive tolike small changes in wording.It's yet another way in which the scienceof how these models workis very poorly developed.And so, you know, if Igo to sleep one nightand I was like talking tothe model in a certain wayand I like slightly changed the phrasingof how I talk to the model, you know,I could get different results.So that's one possible way.The other thing is, man,it's just hard to quantify this stuff.It's hard to quantify this stuff.I think people are very excitedby new models when they come outand then as time goes on,they become very aware of the limitations,so that may be another effect.But that's all a verylong-winded way of sayingfor the most part, with somefairly narrow exceptions,the models are not changing.- I think there is a psychological effect.You just start getting used to it.The baseline raises.Like when people firstgotten wifi on airplanes,it's like amazing, magic.- It's like amazing, yeah.- And then-- And now I'm like,I can't get this thing to work.This is such a piece of crap.- Exactly, so then it's easyto have the conspiracy theoryof they're making wifi slower and slower.This is probably something I'll talkto Amanda much more about.But another Reddit question,\"When will Claude stop tryingto be my puritanical grandmotherimposing its moral worldviewon me as a paying customer?And also, what is the psychologybehind making Claude overly apologetic?\"So this kind of reportsabout the experience,a different angle on the frustration,it has to do with the character.- Yeah, so a couple points on this first.One is like things that peoplesay on Reddit and Twitter,or X or whatever it is,there's actually a hugedistribution shift between like the stuffthat people complain loudlyabout on social mediaand what actually kind of like, you know,statistically users care aboutand that drives people to use the models.Like people are frustratedwith, you know, things like,you know, the model notwriting out all the codeor the model, you know, just not beingas good at code as it could be,even though it's the bestmodel in the world on code.I think the majoritythings are about that.But certainly a kindof vocal minority are,you know, kind of raisethese concerns, right?Are frustrated by themodel refusing thingsthat it shouldn't refuse,or like apologizing too much,or just having these kind oflike annoying verbal ticks.The second caveat, andI just wanna say thislike super clearly because I thinkit's like some people don't know it,others like kind of know it but forget it.Like it is very difficult to controlacross the board how the models behave.You cannot just reach in thereand say, \"Oh, I want themodel to like apologize less.\"Like you can do that, youcan include training datathat says like, \"Oh, the modelshould like apologize less,\"but then in some othersituation they end 
upbeing like super rudeor like overconfidentin a way that's like misleading people.So there are all these trade-offs.For example, another thingis there was a period during which models,ours and I think others aswell were too verbose, right?They would like repeat themselves,they would say too much.You can cut down on theverbosity by penalizingthe models for just talking for too long.What happens when you do that,if you do it in a crude wayis when the models are coding,sometimes they'll say restof the code goes here, right?Because they've learned thatthat's the way to economizeand that they see it,and then so that leads the modelto be so-called lazy in codingwhere they're just like,ah, you can finish the rest of it.It's not because we wanna,you know, save on computeor because you know, the models are lazy,and you know, during winter break,or any of the other kindof conspiracy theoriesthat have come up.It's actually, it's just very hardto control the behavior of the model,to steer the behavior of the modelin all circumstances at once.You can kind of, there'sthis whack-a-mole aspectwhere you push on onething and like, you know,these other things start to move as wellthat you may not even notice or measure.And so one of the reasonsthat I care so much about,you know, kind of grand alignmentof these AI systems in thefuture is actually these systemsare actually quite unpredictable.They're actually quitehard to steer and control.And this version we're seeing todayof you make one thing better,it makes another thing worse,I think that's like a present day analogof future control problems in AI systemsthat we can start to study today, right?I think that that difficultyin steering the behaviorand in making surethat if we push an AIsystem in one direction,it doesn't push it in another directionin some other ways that we didn't want.I think that's kind of an early signof things to come,and if we can do a good jobof solving this problem,right, of like you ask the model to like,you know, to like makeand distribute smallpoxand it says no, but it'swilling to like help youin your graduate level virology class.Like how do we get bothof those things at once?It's hard.It's very easy to go toone side or the otherand it's a multidimensional problem.And so, you know, I think these questionsof like shaping the model's personality,I think they're very hard.I think we haven't done perfectly on them.I think we've actually done the bestof all the AI companies, butstill so far from perfect.And I think if we can get this right,if we can control, you know,control the false positivesand false negatives in thisvery kind of controlledpresent day environment,we'll be much better at doing itfor the future whenour worry is, you know,will the models be super autonomous?Will they be able to, you know,make very dangerous things?Will they be able toautonomously, you know,build whole companies?And are those companies aligned?So, I think of this presenttask as both vexing,but also good practice for the future.- What's the current best way of gatheringsort of user feedback?Like not anecdotal data,but just large scaledata about pain pointsor the opposite of painpoints, positive things,so on, is it internal testing?Is it a specific grouptesting, A/B testing?What works?- So, typically we'll haveinternal model bashingswhere all of Anthropic,Anthropic is almost 1000 people,you know, people justtry and break the model.They try and interactwith it various ways.We have a suite of evals for, you know,oh, is the model refusingin ways that it 
couldn't?I think we even had a certainly evalbecause, you know, our model, again,one point, model had this problemwhere like it had this annoying tickwhere it would like respondto a wide range of questions by saying\"Certainly I can help you with that.Certainly I would be happy to do that.Certainly this is correct.\"And so we had a, like, certainly eval,which is like, how oftendoes the model say certainly?But look, this is just a whack-a-mole.Like, what if it switchesfrom certainly to definitely?Like, so, you know, everytime we add a new eval,and we're always evaluatingfor all of the old things.So we have hundreds of these evaluations,but we find that there's no substitutefor human interacting with it.And so it's very much likethe ordinary product development process.We have like hundreds of peoplewithin Anthropic bash the model,you know, then we do external A/B tests.Sometimes we'll runtests with contractors.We pay contractors tointeract with the model.So you put all of these things togetherand it's still not perfect.You still see behaviorsthat you don't quite wanna see, right?You know, you still see the modellike refusing things thatit just doesn't make sense to refuse.But I think trying to solvethis challenge, right?Trying to stop the modelfrom doing, you know,genuinely bad things that, you know,everyone agrees it shouldn't do, right?You know, everyone agrees that, you know,the model shouldn't talk about, you know,I don't know, child abuse material, right?Like, everyone agrees themodel shouldn't do that.But at the same timethat it doesn't refuse inthese dumb and stupid ways.I think drawing that lineas finely as possible,approaching perfectly is still a challengeand we're getting better at it every day.But there's a lot to be solved.And again, I would point to thatas an indicator of the challenge aheadin terms of steering muchmore powerful models.- Do you think Claude4.0 is ever coming out?- I don't want to committo any naming scheme,'cause if I say here\"We're gonna have Claude 4 next year,\"and then, you know, thenwe decide that like,you know, we should start over,'cause there's a new type of model.Like I don't want to commit to it.I would expect in anormal course of businessthat Claude 4 would come after Claude 3.5.But you know, you never knowin this wacky field, right?- But the sort of, this ideaof scaling is continuing.- Scaling is continuing.There will definitelybe more powerful modelscoming from us than themodels that exist today.That is certain.Or if there aren't, we'vedeeply failed as a company.- Okay, can you explain theResponsible Scaling Policyand the AI Safety LevelStandards, ASL Levels?- As much as I am excitedabout the benefitsof these models, and, youknow, we'll talk about thatif we talk about \"Machinesof Loving Grace,\"I'm worried about the risksand I continue to beworried about the risks.No one should think that, you know,\"Machines of Loving Grace\"was me saying, you know,I'm no longer worried aboutthe risks of these models.I think they're twosides of the same coin.The power of the modelsand their ability to solveall these problems in,you know, biology, neuroscience,economic development,governance and peace,large parts of the economy,those come with risks as well, right?With great power comesgreat responsibility, right?The two are paired.Things that are powerfulcan do good thingsand they can do bad things.I think of those risksas being in, you know,several different categories.Perhaps the two biggestrisks that I think about,and that's not to say thatthere aren't risks 
Perhaps the two biggest risks that I think about, and that's not to say there aren't risks today that are important, but when I think of the things that would happen on the grandest scale: one is what I call catastrophic misuse. These are misuses of the models in domains like cyber, bio, radiological, nuclear, things that could harm or even kill thousands, even millions of people if they really, really go wrong. These are the number one priority to prevent. And here I would just make a simple observation, which is that if I look today at people who have done really bad things in the world, I think humanity has actually been protected by the fact that the overlap between really smart, well-educated people and people who want to do really horrific things has generally been small. Let's say I'm someone who has a PhD in this field and a well-paying job. There's so much to lose. Even assuming I'm completely evil, which most people are not, why would such a person risk their life, risk their legacy, their reputation, to do something truly, truly evil? If we had a lot more people like that, the world would be a much more dangerous place. And so my worry is that by being a much more intelligent agent, AI could break that correlation, and so I do have serious worries about that. I believe we can prevent those worries, but as a counterpoint to "Machines of Loving Grace," I want to say that there are still serious risks.

And the second range of risks would be the autonomy risks, which is the idea that models might, on their own, particularly as we give them more agency than they've had in the past, particularly as we give them supervision over wider tasks, like writing whole code bases or someday even effectively operating entire companies, they're on a long enough leash, are they doing what we really want them to do? It's very difficult to even understand in detail what they're doing, let alone control it. And like I said, there are these early signs that it's hard to perfectly draw the boundary between things the model should do and things the model shouldn't do. If you go to one side, you get things that are annoying and useless, and if you go to the other side, you get other behaviors. If you fix one thing, it creates other problems. We're getting better and better at solving this. I don't think this is an unsolvable problem. I think this is a science, like the safety of airplanes or the safety of cars or the safety of drugs. I don't think there's any big thing we're missing. I just think we need to get better at controlling these models. So those are the two risks I'm worried about. And our Responsible Scaling Plan, and I'll recognize this is a very long-winded answer to your question.

- I love it. I love it.
- Our Responsible Scaling Plan is designed to address these two types of risks. And so every time we develop a new model, we basically test it for its ability to do both of these bad things. So if I were to back up a little bit, I think we have an interesting dilemma with AI systems, where they're not yet powerful enough to present these catastrophes. I don't know that they'll ever present these catastrophes; it's possible they won't. But the case for worry, the case for risk, is strong enough that we should act now, and they're getting better very, very fast. I testified in the Senate that we might have serious bio risks within two to three years. That was about a year ago, and things have proceeded apace.

So we have this thing where it's surprisingly hard to address these risks, because they're not here today. They don't exist. They're like ghosts, but they're coming at us so fast because the models are improving so fast. How do you deal with something that's not here today, doesn't exist, but is coming at us very fast? The solution we came up with for that, in collaboration with people like the organization METR and Paul Christiano, is: what you need are tests to tell you when the risk is getting close. You need an early warning system. And so every time we have a new model, we test it for its capability to do these CBRN tasks, as well as testing it for how capable it is of doing tasks autonomously on its own. And in the latest version of our RSP, which we released in the last month or two, the way we test autonomy risks is the AI model's ability to do aspects of AI research itself, because when AI models can do AI research, they become kind of truly autonomous. And that threshold is important in a bunch of other ways.

So what do we then do with these tests? The RSP basically develops what we've called an if-then structure, which is: if the models pass a certain capability, then we impose a certain set of safety and security requirements on them. Today's models are what's called ASL-2. ASL-1 is for systems that manifestly don't pose any risk of autonomy or misuse. So for example, a chess-playing bot; Deep Blue would be ASL-1. It's just manifestly the case that you can't use Deep Blue for anything other than chess. It was just designed for chess. No one's gonna use it to conduct a masterful cyberattack or run wild and take over the world. ASL-2 is today's AI systems, where we've measured them and we think these systems are simply not smart enough to autonomously self-replicate or conduct a bunch of tasks, and also not smart enough to provide meaningful information about CBRN risks and how to build CBRN weapons above and beyond what can be known from looking at Google. In fact, sometimes they do provide information, but not above and beyond a search engine, not in a way that can be stitched together, not in a way that is end-to-end dangerous enough. ASL-3 is going to be the point at which the models are helpful enough to enhance the capabilities of non-state actors. State actors can unfortunately already do a lot of these very dangerous and destructive things to a high level of proficiency; the difference is that non-state actors are not capable of it. And so when we get to ASL-3, we'll take special security precautions designed to be sufficient to prevent theft of the model by non-state actors, and misuse of the model as it's deployed. We'll have to have enhanced filters targeted at these particular areas.
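To make the "if-then" idea concrete, here is a minimal, purely illustrative sketch of that structure in Python. The eval names, thresholds, and safeguard lists are hypothetical placeholders for the sake of the example, not Anthropic's actual RSP criteria or tooling.

```python
# Illustrative sketch only: a toy version of the "if-then" structure described
# above, where capability-eval results trigger required safeguards.
# Thresholds, metric names, and safeguard lists are hypothetical.

from dataclasses import dataclass


@dataclass
class EvalResults:
    cbrn_uplift_score: float   # uplift over a search-engine baseline (hypothetical metric)
    autonomy_score: float      # ability to complete AI-R&D-style tasks (hypothetical metric)


def required_safety_level(results: EvalResults) -> str:
    """Map eval results to an AI Safety Level, with a buffer below the true danger line."""
    if results.cbrn_uplift_score > 0.5 or results.autonomy_score > 0.5:
        return "ASL-3"   # triggers enhanced security and deployment filters
    return "ASL-2"       # today's systems: no meaningful uplift beyond public sources


def required_safeguards(asl: str) -> list[str]:
    if asl == "ASL-3":
        return [
            "security sufficient to prevent model theft by non-state actors",
            "deployment filters for cyber/bio/radiological/nuclear content",
        ]
    return ["standard security and deployment practices"]


if __name__ == "__main__":
    results = EvalResults(cbrn_uplift_score=0.2, autonomy_score=0.1)
    asl = required_safety_level(results)
    print(asl, required_safeguards(asl))
```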
- Cyber, bio, nuclear.

- Cyber, bio, nuclear, and model autonomy, which is less a misuse risk and more the risk of the model doing bad things itself. ASL-4 is getting to the point where these models could enhance the capability of an already knowledgeable state actor, and/or become the main source of such a risk. If you wanted to engage in such a risk, the main way you would do it is through a model. And then I think ASL-4 on the autonomy side is some amount of acceleration in AI research capabilities within an AI model. And then ASL-5 is where we would get to the models that are truly capable, that could exceed humanity in their ability to do any of these tasks.

And so the point of the if-then structure of commitments is basically to say: look, I've been working with these models for many years and I've been worried about risk for many years. It's actually kind of dangerous to cry wolf. It's actually kind of dangerous to say, "This model is risky," and have people look at it and say, "This is manifestly not dangerous." Again, it's the delicacy of: the risk isn't here today, but it's coming at us fast. How do you deal with that? It's really vexing to a risk planner. And so this if-then structure basically says: we don't wanna antagonize a bunch of people, we don't wanna harm our own ability to have a place in the conversation, by imposing these very onerous burdens on models that are not dangerous today. The if-then trigger commitment is basically a way to deal with this. It says you clamp down hard when you can show that the model is dangerous. And of course, what has to come with that is enough of a buffer threshold that you're not at high risk of missing the danger. It's not a perfect framework. We've had to change it; we came out with a new one just a few weeks ago, and going forward we might release new ones multiple times a year, because it's hard to get these policies right technically, organizationally, from a research perspective. But that is the proposal: if-then commitments and triggers, in order to minimize burdens and false alarms now, but really react appropriately when the dangers are here.

- What do you think the timeline for ASL-3 is, where several of the triggers are fired? And what do you think the timeline is for ASL-4?

- Yeah, so that is hotly debated within the company. We are working actively to prepare ASL-3 security measures as well as ASL-3 deployment measures. I'm not gonna go into detail, but we've made a lot of progress on both, and we're prepared to be, I think, ready quite soon. I would not be surprised at all if we hit ASL-3 next year. There was some concern that we might even hit it this year. That's still possible, that could still happen. It's very hard to say, but I would be very, very surprised if it was 2030. I think it's much sooner than that.

- So there's protocols for detecting it, the if-then, and then there's protocols for how to respond to it.

- Yes.

- How difficult is the second, the latter?

- Yeah, I think for ASL-3, it's primarily about security and about filters on the model relating to a very narrow set of areas when we deploy the model. Because at ASL-3, the model isn't autonomous yet, and so you don't have to worry about the model itself behaving in a bad way, even when it's deployed internally.
So I think the ASL-3 measures are, I won't say straightforward, they're rigorous, but they're easier to reason about. I think once we get to ASL-4, we start to have worries about the models being smart enough that they might sandbag tests, they might not tell the truth about tests. We had some results come out about sleeper agents, and there was a more recent paper about whether the models can mislead attempts to evaluate them, sandbag their own abilities, present themselves as being less capable than they are. And so I think with ASL-4, there's gonna be an important component of using things other than just interacting with the models, for example interpretability or hidden chains of thought, where you have to look inside the model and verify, via some other mechanism that is not as easily corrupted as what the model says, that the model indeed has some property. So we're still working on ASL-4. One of the properties of the RSP is that we don't specify ASL-4 until we've hit ASL-3. And I think that's proven to be a wise decision, because even with ASL-3, it's hard to know this stuff in detail, and we wanna take as much time as we can possibly take to get these things right.

- So for ASL-3, the bad actor will be the humans.

- Humans, yes.

- And so there, it's a little bit more-

- For ASL-4, it's both, I think, both.

- It's both. And so deception, and that's where mechanistic interpretability comes into play, and hopefully the techniques used for that are not made accessible to the model.

- Yeah, I mean, of course you can hook up the mechanistic interpretability to the model itself, but then you've kind of lost it as a reliable indicator of the model's state. There are a bunch of exotic ways you can think of that it might also not be reliable, like if the model gets smart enough that it can jump computers and read the code where you're looking at its internal state. We've thought about some of those. I think they're exotic enough that there are ways to render them unlikely. But yeah, generally you wanna preserve mechanistic interpretability as a kind of verification set or test set that's separate from the training process of the model.

- See, I think as these models become better and better at conversation and become smarter, social engineering becomes a threat too, 'cause they-

- Oh, yeah.

- They could start being very convincing to the engineers inside companies.

- Oh yeah, yeah. We've seen lots of examples of demagoguery in our life from humans, and there's a concern that models could do that as well.

- One of the ways that Claude has been getting more and more powerful is that it's now able to do some agentic stuff, computer use. There's also an analysis within the sandbox of claude.ai itself. But let's talk about computer use. That seems to me super exciting, that you can just give Claude a task and it takes a bunch of actions, figures it out, and has access to your computer through screenshots. So can you explain how that works? And where that's headed?

- Yeah, it's actually relatively simple. So Claude has had for a long time, since Claude 3.0 back in March, the ability to analyze images and respond to them with text. The only new thing we added is that those images can be screenshots of a computer. And in response, we trained the model to give a location on the screen where you can click and/or buttons on the keyboard you can press in order to take action. And it turns out that with actually not all that much additional training, the models can get quite good at that task.
It's a good example of generalization. People sometimes say, if you get to low Earth orbit, you're halfway to anywhere, right? Because of how much it takes to escape the gravity. Well, if you have a strong pre-trained model, I feel like you're halfway to anywhere in terms of the intelligence space. And so actually, it didn't take all that much to get Claude to do this. And you can just set that in a loop: give the model a screenshot, tell it what to click on, give it the next screenshot, tell it what to click on, and that turns into a full, almost video-like interaction with the model. And it's able to do all of these tasks, right? We showed these demos where it's able to fill out spreadsheets, it's able to interact with a website, it's able to open all kinds of programs on different operating systems, Windows, Linux, Mac. So I think all of that is very exciting.

I will say, while in theory there's nothing you could do there that you couldn't have done by just giving the model the API to drive the computer screen, this really lowers the barrier. There are a lot of folks who either aren't in a position to interact with those APIs or it takes them a long time to do. The screen is just a universal interface that's a lot easier to interact with. And so I expect over time this is gonna lower a bunch of barriers.

Now, honestly, the current model leaves a lot still to be desired, and we were honest about that in the blog, right? It makes mistakes, it misclicks. And we were careful to warn people: hey, you can't just leave this thing to run on your computer for minutes and minutes. You gotta give this thing boundaries and guardrails. And I think that's one of the reasons we released it first in an API form rather than just handing it to the consumer and giving it control of their computer. But I definitely feel that it's important to get these capabilities out there. As models get more powerful, we're gonna have to grapple with how we use these capabilities safely, how we prevent them from being abused. And I think releasing the model while the capabilities are still limited is very helpful in terms of doing that. Since it's been released, a number of customers, I think Replit was maybe one of the quickest to deploy things, have made use of it in various ways. People have hooked up demos for Windows desktops, Macs, Linux machines. So yeah, it's been very exciting. I think, as with anything else, it comes with new exciting abilities, and then with those new exciting abilities, we have to think about how to make the model safe, reliable, do what humans want it to do. It's the same story for everything, right? Same thing. It's that same tension.
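As a rough illustration of the loop described here (screenshot in, click or keypress out), a minimal sketch follows. Anthropic's released computer-use capability goes through a dedicated tool interface rather than this prompt-level JSON protocol; the model name is only an example, and take_screenshot() and perform() are placeholder helpers you would have to supply.

```python
# Minimal sketch of a screenshot -> action loop, for illustration only.
# The JSON action protocol, model name, and helper functions are assumptions,
# not the actual computer-use tool interface.

import base64
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def take_screenshot() -> bytes:
    raise NotImplementedError("capture the screen as PNG bytes (e.g. with mss or pyautogui)")


def perform(action: dict) -> None:
    raise NotImplementedError("click/type at the requested coordinates (e.g. with pyautogui)")


def run_task(task: str, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        png = base64.b64encode(take_screenshot()).decode()
        reply = client.messages.create(
            model="claude-3-5-sonnet-latest",   # example model name
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64", "media_type": "image/png", "data": png}},
                    {"type": "text",
                     "text": f"Task: {task}\n"
                             'Reply with JSON only, e.g. {"action": "click", "x": 100, "y": 200} '
                             'or {"action": "done"}.'},
                ],
            }],
        )
        action = json.loads(reply.content[0].text)
        if action.get("action") == "done":
            break
        perform(action)  # guardrails belong here: allow-lists, confirmation prompts, etc.
```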
- But the possibility of use cases here, just the range is incredible. So how much, to make it work really well in the future, do you have to specially go beyond what the pre-trained model is doing? Do more post-training, RLHF or supervised fine-tuning, or synthetic data, just for the agentic stuff?

- Yeah, I think, speaking at a high level, it's our intention to keep investing a lot in making the model better. We look at some of the benchmarks where previous models could do it 6% of the time, and now our model can do it 14% or 22% of the time. And yeah, we wanna get up to the human-level reliability of 80, 90%, just like anywhere else. We're on the same curve that we were on with SWE-bench, where I would guess a year from now the models can do this very, very reliably. But you gotta start somewhere.

- So you think it's possible to get to the human level, 90%, basically doing the same thing you're doing now? Or does it have to be special for computer use?

- I mean, it depends what you mean by special and general. But I generally think the same kinds of techniques that we've been using to train the current model, I expect that doubling down on those techniques, in the same way that we have for code, for models in general, for image input, for voice, I expect those same techniques will scale here as they have everywhere else.

- But this is giving the power of action to Claude, and so you could do a lot of really powerful things, but you could do a lot of damage also.

- Yeah, yeah, no, and we've been very aware of that. Look, my view actually is that computer use isn't a fundamentally new capability the way the CBRN or autonomy capabilities are. It's more that it opens the aperture for the model to use and apply its existing abilities. And so the way we think about it, going back to our RSP, is that nothing this model is doing inherently increases the risk from an RSP perspective. But as the models get more powerful, having this capability may make it scarier once the model has the cognitive capability to do something at the ASL-3 or ASL-4 level; this may be the thing that unbounds it from doing so. So going forward, certainly this modality of interaction is something we have tested for, and that we will continue to test for in the RSP. I think it's probably better to learn and explore this capability before the model is super capable.

- Yeah, and there's a lot of interesting attacks, like prompt injection, because now you've widened the aperture, so you can prompt-inject through stuff on screen. So if this becomes more and more useful, then there's more and more benefit to injecting stuff into the model. If it goes to a certain webpage, it could be harmless stuff like advertisements, or it could be harmful stuff, right?

- Yeah, I mean, we've thought a lot about things like spam, CAPTCHA, mass campaigns. One secret I'll tell you: if you've invented a new technology, not necessarily the biggest misuse, but the first misuse you'll see is scams, just petty scams. People scamming each other is a thing as old as time, and it's just, every time, you gotta deal with it.

- It's almost silly to say, but it's true: bots and spam in general are a thing, and as it all gets more and more intelligent, it's harder and harder to fight.
- Like I said, there are a lot of petty criminals in the world, and every new technology is a new way for petty criminals to do something stupid and malicious.

- Are there any ideas about sandboxing it? Like, how difficult is the sandboxing task?

- Yeah, we sandbox during training. So for example, during training we didn't expose the model to the internet. I think that's probably a bad idea during training, because the model can be changing its policy, it can be changing what it's doing, and it's having an effect in the real world. In terms of actually deploying the model, it kind of depends on the application. Sometimes you want the model to do something in the real world, but of course you can always put guardrails on the outside. You can say, okay, this model's not gonna move data, it's not gonna move any files from my computer or my web server to anywhere else.

Now, when you talk about sandboxing, again, when we get to ASL-4, none of these precautions are going to make sense there. When you talk about ASL-4, there's a theoretical worry that the model could be smart enough to break out of any box. And so there we need to think about mechanistic interpretability. If we're gonna have a sandbox, it would need to be a mathematically provable sandbox. That's a whole different world than what we're dealing with with the models today.

- Yeah, the science of building a box from which an ASL-4 AI system cannot escape.

- I think it's probably not the right approach. Instead of having something unaligned that you're trying to prevent from escaping, I think it's better to just design the model the right way, or have a loop where you look inside the model and you're able to verify properties, and that gives you an opportunity to iterate and actually get it right. I think containing bad models is a much worse solution than having good models.

- Let me ask about regulation. What's the role of regulation in keeping AI safe? So for example, can you describe the California AI regulation bill, SB 1047, that was ultimately vetoed by the governor? What are the pros and cons of this bill in general?

- Yes, we ended up making some suggestions to the bill, and then some of those were adopted, and we felt, I think, quite positively about the bill by the end of that. It did still have some downsides, and of course it got vetoed. I think at a high level, some of the key ideas behind the bill are, I would say, similar to the ideas behind our RSPs, and I think it's very important that some jurisdiction, whether it's California or the federal government and/or other countries and other states, passes some regulation like this. And I can talk through why I think that's so important.

So I feel good about our RSP. It's not perfect, it needs to be iterated on a lot, but it's been a good forcing function for getting the company to take these risks seriously, to put them into product planning, to really make them a central part of work at Anthropic, and to make sure that all of 1,000 people, and it's almost 1,000 people now at Anthropic, understand that this is one of the highest priorities of the company, if not the highest priority. But one, there are still some companies that don't have RSP-like mechanisms.
OpenAI and Google did adopt these mechanisms a couple months after Anthropic did, but there are other companies out there that don't have these mechanisms at all. And so if some companies adopt these mechanisms and others don't, it's really gonna create a situation where some of these dangers have the property that it doesn't matter if three out of five of the companies are being safe; if the other two are being unsafe, it creates this negative externality. And I think the lack of uniformity is not fair to those of us who have put a lot of effort into being very thoughtful about these procedures.

The second thing is, I don't think you can trust these companies to adhere to these voluntary plans on their own, right? I like to think that Anthropic will. We do everything we can so that we will. Our RSP is checked by our long-term benefit trust. So we do everything we can to adhere to our own RSP. But you hear lots of things about various companies saying, oh, they said they would give this much compute and they didn't, they said they would do this thing and they didn't. I don't think it makes sense to litigate particular things that companies have done, but I think the broad principle is that if there's nothing watching over them, nothing watching over us as an industry, there's no guarantee that we'll do the right thing, and the stakes are very high. And so I think it's important to have a uniform standard that everyone follows, and to make sure that, simply, the industry does what a majority of the industry has already said is important and has already said that they definitely will do.

Some people, I think, are against regulation on principle. I understand where that comes from. If you go to Europe and you see something like GDPR, you see some of the other stuff that they've done, some of it's good, but some of it is really unnecessarily burdensome, and I think it's fair to say it has really slowed innovation. So I understand why people start from that position on priors. But again, I think AI is different. If we go to the very serious risks of autonomy and misuse that I talked about just a few minutes ago, I think those are unusual, and they warrant an unusually strong response. And so I think it's very important. Again, we need something that everyone can get behind.

I think one of the issues with SB 1047, especially the original version of it, was that it had a bunch of the structure of RSPs, but it also had a bunch of stuff that was either clunky or that just would've created a bunch of burdens, a bunch of hassle, and might even have missed the target in terms of addressing the risks. You don't really hear about that on Twitter. You just hear people cheering for any regulation, and then the folks who are against it make up these often quite intellectually dishonest arguments about how it'll make companies move away from California (the bill doesn't apply based on whether you're headquartered in California, it only applies if you do business in California), or that it would damage the open source ecosystem, or that it would cause all of these other things. I think those were mostly nonsense, but there are better arguments against regulation. There's one guy, Dean Ball, who's really a very scholarly analyst, who looks at what happens when a regulation is put in place, and at the ways regulations can get a life of their own, or how they can be poorly designed.
And so our interest has always been: we do think there should be regulation in this space, but we wanna be an actor who makes sure that that regulation is something that's surgical, that's targeted at the serious risks, and is something people can actually comply with. Because something I think the advocates of regulation don't understand as well as they could is that if we get something in place that's poorly targeted, that wastes a bunch of people's time, what's gonna happen is people are gonna say, "See, these safety risks, this is nonsense. I just had to hire 10 lawyers to fill out all these forms. I had to run all these tests for something that was clearly not dangerous." And after six months of that, there will be a groundswell, and we'll end up with a durable consensus against regulation. And so I think the worst enemy of those who want real accountability is badly designed regulation. We need to actually get it right. If there's one thing I could say to the advocates, it would be that I want them to understand this dynamic better. We need to be really careful, and we need to talk to people who actually have experience seeing how regulations play out in practice, and the people who have seen that understand to be very careful.

If this were some lesser issue, I might be against any regulation at all. But what I want the opponents to understand is that the underlying issues are actually serious. They're not something that I or the other companies are just making up because of regulatory capture. They're not sci-fi fantasies. They're not any of these things. Every time we have a new model, every few months, we measure the behavior of these models, and they're getting better and better at these concerning tasks, just as they are getting better and better at good, valuable, economically useful tasks.

And so, I think SB 1047 was very polarizing. I would love it if some of the most reasonable opponents and some of the most reasonable proponents would sit down together. Among the different AI companies, Anthropic was the only AI company that felt positively in a very detailed way. I think Elon tweeted briefly something positive. But some of the big ones, like Google, OpenAI, Meta, Microsoft, were pretty staunchly against. So what I would really like is if some of the key stakeholders, some of the most thoughtful proponents and some of the most thoughtful opponents, would sit down and say: how do we solve this problem in a way that the proponents feel brings a real reduction in risk, and that the opponents feel is not hampering the industry or hampering innovation any more than it needs to? And I think, for whatever reason, things got too polarized, and those two groups didn't get to sit down in the way that they should. And I feel urgency. I really think we need to do something in 2025. If we get to the end of 2025 and we've still done nothing about this, then I'm gonna be worried. I'm not worried yet, because again, the risks aren't here yet, but I think time is running short.

- Yeah, and come up with something surgical, like you said.

- Yeah, yeah, exactly. And we need to get away from this intense pro-safety versus intense anti-regulatory rhetoric. It's turned into these flame wars on Twitter, and nothing good's gonna come of that.

- So there's a lot of curiosity about the different players in the game. One of the OGs is OpenAI. You've had several years of experience at OpenAI. What's your story and history there?
- Yeah, so I was at OpenAI for roughly five years. For the last, I think it was a couple of years, I was vice president of research there. Probably myself and Ilya Sutskever were the ones who really set the research direction. Around 2016 or 2017, I first started to really believe in, or at least confirm my belief in, the Scaling Hypothesis, when Ilya famously said to me, "The thing you need to understand about these models is they just wanna learn. The models just wanna learn." And again, sometimes there are these single sentences, these zen koans, that you hear and you're like, ah, that explains everything, that explains a thousand things that I've seen. And ever after, I had this visualization in my head: you optimize the models in the right way, you point the models in the right way, and they just wanna learn. They just wanna solve the problem, regardless of what the problem is.

- So get out of their way, basically.

- Get out of their way, yeah. Don't impose your own ideas about how they should learn. And this was the same thing that Rich Sutton put out in the Bitter Lesson, or Gwern put out in The Scaling Hypothesis. I think generally the dynamic was: I got this kind of inspiration from Ilya and from others, folks like Alec Radford, who did the original GPT-1, and then ran really hard with it. Me and my collaborators, on GPT-2, GPT-3, RL from Human Feedback, which was an attempt to deal with the early safety and durability, things like debate and amplification, heavy on interpretability. So again, the combination of safety plus scaling. Probably 2018, 2019, 2020, those were the years when myself and my collaborators, many of whom became co-founders of Anthropic, really had a vision and drove the direction.

- Why'd you leave? Why'd you decide to leave?

- Yeah, so look, I'm gonna put things this way, and I think it ties to the race to the top. In my time at OpenAI, I'd come to appreciate the Scaling Hypothesis, and I'd come to appreciate the importance of safety along with the Scaling Hypothesis. The first one, I think, OpenAI was getting on board with. The second one, in a way, had always been part of OpenAI's messaging. But over the many years of the time that I spent there, I think I had a particular vision of how we should handle these things, how they should be brought out in the world, the kind of principles that the organization should have. And look, there were many, many discussions about, should the company do this, should the company do that? There's a bunch of misinformation out there. People say we left because we didn't like the deal with Microsoft. False, although there was a lot of discussion, a lot of questions about exactly how to do the deal with Microsoft. Or that we left because we didn't like commercialization. That's not true. We built GPT-3, which was the model that was commercialized. I was involved in commercialization. It's more, again, about how you do it. Civilization is going down this path to very powerful AI. What's the way to do it that is cautious, straightforward, honest, that builds trust in the organization and in individuals? How do we get from here to there, and how do we have a real vision for how to get it right? How can safety not just be something we say because it helps with recruiting? And I think, at the end of the day, if you have a vision for that, forget about anyone else's vision. I don't wanna talk about anyone else's vision.
If you have a vision for how to do it, you should go off and you should do that vision. It is incredibly unproductive to try and argue with someone else's vision. You might think they're not doing it the right way, you might think they're dishonest. Who knows, maybe you're right, maybe you're not. But what you should do is take some people you trust and go off together and make your vision happen. And if your vision is compelling, if you can make it appeal to people, some combination of ethically, in the market, if you can make a company that's a place people wanna join, that engages in practices people think are reasonable while managing to maintain its position in the ecosystem at the same time, if you do that, people will copy it. And the fact that you are doing it, especially the fact that you're doing it better than they are, causes them to change their behavior in a much more compelling way than if they're your boss and you're arguing with them. I don't know how to be any more specific about it than that, but I think it's generally very unproductive to try and get someone else's vision to look like your vision. It's much more productive to go off and do a clean experiment and say: this is our vision, this is how we're gonna do things. Your choice is you can ignore us, you can reject what we're doing, or you can start to become more like us. And imitation is the sincerest form of flattery. That plays out in the behavior of customers, that plays out in the behavior of the public, that plays out in the behavior of where people choose to work.

And again, at the end, it's not about one company winning or another company winning. If we or another company are engaging in some practice that people find genuinely appealing, and I want it to be in substance, not just in appearance, and I think researchers are sophisticated and they look at substance, and then other companies start copying that practice and they win because they copied that practice, that's great. That's success. That's the race to the top. It doesn't matter who wins in the end, as long as everyone is copying everyone else's good practices. One way I think of it is, the thing we're all afraid of is the race to the bottom, and in the race to the bottom it doesn't matter who wins, because we all lose. In the most extreme world, we make this autonomous AI and the robots enslave us or whatever, right? I mean, that's half joking, but that is the most extreme thing that could happen, and then it doesn't matter which company was ahead. If instead you create a race to the top, where people are competing to engage in good practices, then at the end of the day it doesn't matter who ends up winning; it doesn't even matter who started the race to the top. The point isn't to be virtuous. The point is to get the system into a better equilibrium than it was before. And individual companies can play some role in doing this. Individual companies can help to start it, can help to accelerate it. And frankly, I think individuals at other companies have done this as well: the individuals who, when we put out an RSP, react by pushing harder to get something similar done at other companies. Sometimes other companies do something and we're like, oh, that's a good practice, we think that's good, we should adopt it too.
The only difference is, I think we try to be more forward-leaning. We try and adopt more of these practices first, and adopt them more quickly when others invent them. But I think this dynamic is what we should be pointing at, and I think it abstracts away the question of which company's winning, who trusts who. I think all these questions of drama are profoundly uninteresting, and the thing that matters is the ecosystem that we all operate in, and how to make that ecosystem better, because that constrains all the players.

- And so Anthropic is this kind of clean experiment, built on a foundation of what AI safety should concretely look like.

- Look, I'm sure we've made plenty of mistakes along the way. The perfect organization doesn't exist. It has to deal with the imperfection of 1,000 employees. It has to deal with the imperfection of our leaders, including me. It has to deal with the imperfection of the people we've put in place to oversee the imperfection of the leaders, like the board and the long-term benefit trust. It's all a set of imperfect people trying to aim imperfectly at some ideal that will never perfectly be achieved. That's what you sign up for. That's what it will always be. But imperfect doesn't mean you just give up. There's better and there's worse, and hopefully we can do well enough that we can begin to build some practices that the whole industry engages in. And then my guess is that multiple of these companies will be successful. Anthropic will be successful. These other companies, like the ones I've been at in the past, will also be successful, and some will be more successful than others. That's less important than, again, that we align the incentives of the industry. And that happens partly through the race to the top, partly through things like the RSP, partly through, again, selected surgical regulation.

- You said talent density beats talent mass. So can you explain that? Can you expand on that? Can you just talk about what it takes to build a great team of AI researchers and engineers?

- This is one of these statements that's more true every month. Every month I see this statement as more true than I did the month before. So if I were to do a thought experiment: let's say you have a team of 100 people that are super smart, motivated, and aligned with the mission, and that's your company. Or you can have a team of 1,000 people, where 200 people are super smart, super aligned with the mission, and then, let's just say, the other 800 are random big tech employees. Which would you rather have? The talent mass is greater in the group of 1,000 people. You have an even larger number of incredibly talented, incredibly aligned, incredibly smart people. But the issue is just that if every time someone super talented looks around, they see someone else super talented and super dedicated, that sets the tone for everything. That sets the tone for everyone being super inspired to work at the same place. Everyone trusts everyone else. If you have 1,000 or 10,000 people and things have really regressed, you're not able to do selection and you're choosing random people, what happens is you need to put a lot of processes and a lot of guardrails in place, just because people don't fully trust each other, or you have to adjudicate political battles. There are so many things that slow down the org's ability to operate. And so we're nearly 1,000 people, and we've tried to make it so that as large a fraction of those 1,000 people as possible are super talented, super skilled. It's one of the reasons we've slowed down hiring a lot in the last few months.
We grew from 300 to 800, I believe, in the first seven or eight months of the year, and now we've slowed down. In the last three months, we went from 800 to 900, 950, something like that. Don't quote me on the exact numbers, but I think there's an inflection point around 1,000, and we want to be much more careful about how we grow. Early on, and now as well, we've hired a lot of physicists. Theoretical physicists can learn things really fast. Even more recently, as we've continued to hire, we've really had a high bar on both the research side and the software engineering side, we've hired a lot of senior people, including folks who used to be at other companies in this space, and we've just continued to be very selective. It's very easy to go from 100 to 1,000, and from 1,000 to 10,000, without paying attention to making sure everyone has a unified purpose. It's so powerful. If your company consists of a lot of different fiefdoms that all wanna do their own thing, that are all optimizing for their own thing, it's very hard to get anything done. But if everyone sees the broader purpose of the company, if there's trust and there's dedication to doing the right thing, that is a superpower. That in itself, I think, can overcome almost every other disadvantage.

- And, you know, as Steve Jobs said, A players. A players wanna look around and see other A players, is another way of saying that. I don't know what it is about human nature, but it is demotivating to see people who are not obsessively driving towards a singular mission, and, on the flip side of that, super motivating to see that. It's interesting. What does it take to be a great AI researcher or engineer, from everything you've seen, from working with so many amazing people?

- Yeah, I think the number one quality, especially on the research side, but really for both, is open-mindedness. It sounds easy to be open-minded, right? You're just like, oh, I'm open to anything. But if I think about my own early history with the Scaling Hypothesis, I was seeing the same data others were seeing. I don't think I was a better programmer or better at coming up with research ideas than any of the hundreds of people that I worked with. In some ways, I was worse. The precise programming of finding the bug, writing the GPU kernels, I could point you to a hundred people here who are better at that than I am. But the thing that I think I did have that was different was that I was just willing to look at something with new eyes. People said, "We don't have the right algorithms yet. We haven't come up with the right way to do things." And I was just like, oh, I don't know, this neural net has like 30 million parameters. What if we gave it 50 million instead? Let's plot some graphs. That basic scientific mindset of: I see some variable that I could change, what happens when it changes? Let's try these different things and create a graph. This was the simplest thing in the world, right? This wasn't PhD-level experimental design. This was simple and stupid. Anyone could have done this if you just told them that it was important. It's also not hard to understand. You didn't need to be brilliant to come up with this. But you put the two things together, and some tiny number of people, some single-digit number of people, have driven forward the whole field by realizing this. And it's often like that. If you look back at the discoveries in history, they're often like that. And so this open-mindedness, and this willingness to see with new eyes, which often comes from being newer to the field (often experience is a disadvantage for this), that is the most important thing. It's very hard to look for and test for, but I think it's the most important thing, because when you find something, some really new way of thinking about things, when you have the initiative to do that, it's absolutely transformative.
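As a toy version of the experiment described here, varying one knob (parameter count), measuring the result, and plotting the curve, a minimal sketch might look like the following. The train_and_eval() function is a placeholder for whatever small model and dataset you have on hand; nothing here reflects the actual experiments being recalled.

```python
# Toy scaling sweep: vary parameter count, measure validation loss, plot the curve.
# train_and_eval() is a placeholder to be filled in with your own model and data.

import matplotlib.pyplot as plt


def train_and_eval(n_params: int) -> float:
    """Train a model with roughly n_params parameters and return its validation loss."""
    raise NotImplementedError


def run_sweep() -> None:
    param_counts = [10_000_000, 30_000_000, 50_000_000, 100_000_000]
    losses = [train_and_eval(n) for n in param_counts]

    plt.plot(param_counts, losses, marker="o")
    plt.xscale("log")
    plt.xlabel("parameters")
    plt.ylabel("validation loss")
    plt.title("loss vs. model size")
    plt.show()
```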
- And also be able to do rapid experimentation, and in the face of that, be open-minded and curious, looking at the data with these fresh eyes and seeing what it's actually saying. That applies in mechanistic interpretability.

- It's another example of this. Some of the early work in mechanistic interpretability was so simple; it's just that no one thought to care about the question before.

- You said what it takes to be a great AI researcher. Can we rewind the clock back? What advice would you give to people interested in AI? They're young, looking forward, asking: how can I make an impact on the world?

- I think my number one piece of advice is to just start playing with the models. I worry a little that this seems like obvious advice now. I think three years ago it wasn't obvious, and people started with, oh, let me read the latest reinforcement learning paper. And you should do that as well. But now, with wider availability of models and APIs, people are doing this more. But I think just experiential knowledge: these models are new artifacts that no one really understands, so get experience playing with them. I would also say, again, in line with the "do something new" theme, think in some new direction. There are all these things that haven't been explored. For example, mechanistic interpretability is still very new. It's probably better to work on that than it is to work on new model architectures, because even though it's more popular than it was before, there are probably like 100 people working on it, but there aren't 10,000 people working on it. And it's just this fertile area for study. There's so much low-hanging fruit. You can just walk by and pick things, and for whatever reason, people aren't interested in it enough. I think there are some things around long-horizon learning and long-horizon tasks where there's a lot to be done. I think with evaluations, we're still very early in our ability to study them, particularly for dynamic systems acting in the world. I think there's some stuff around multi-agent. Skate where the puck is going is my advice, and you don't have to be brilliant to think of it. All the things that are gonna be exciting in five years, people even mention them as conventional wisdom, but somehow there's this barrier where people don't double down as much as they could, or they're afraid to do something that's not the popular thing. I don't know why it happens, but getting over that barrier, that's my number one piece of advice.

- Let's talk, if we could, a bit about post-training. So it seems that the modern post-training recipe has a little bit of everything: supervised fine-tuning, RLHF, the Constitutional AI with RLAIF.

- Best acronym.

- It's, again, that naming thing.

- RLAIF. (both laughing)
- And then synthetic data, it seems like a lot of synthetic data, or at least trying to figure out ways to have high-quality synthetic data. So, if this is the secret sauce that makes Anthropic's Claude so incredible, how much of the magic is in the pre-training, and how much is in the post-training?

- Yeah, I mean, first of all, we're not perfectly able to measure that ourselves.

- True.

- When you see some great character ability, sometimes it's hard to tell whether it came from pre-training or post-training. We've developed ways to try and distinguish between those two, but they're not perfect. The second thing I would say is, when there is an advantage, and I think we've been pretty good in general at RL, perhaps the best, although I don't know, 'cause I don't see what goes on inside other companies, usually it isn't, "Oh my God, we have this secret magic method that others don't have." Usually it's, well, we got better at the infrastructure so we could run it for longer, or we were able to get higher quality data, or we were able to filter our data better, or we were able to combine these methods in practice. It's usually some boring matter of practice and tradecraft. So when I think about how to do something special in terms of how we train these models, both pre-training but even more so post-training, I really think of it a little more, again, as like designing airplanes or cars. It's not just, oh man, I have the blueprint, and maybe that lets you make the next airplane. There's some cultural tradecraft of how we think about the design process that I think is more important than any particular gizmo we're able to invent.

- Okay, well, let me ask you about specific techniques. So first, on RLHF: just zooming out, intuition, almost philosophy, why do you think RLHF works so well?

- If I go back to the Scaling Hypothesis, one of the ways to state the Scaling Hypothesis is: if you train for X and you throw enough compute at it, then you get X. And so RLHF is good at doing what humans want the model to do, or, to state it more precisely, doing what humans who look at the model for a brief period of time and consider different possible responses prefer as the response. Which is not perfect, from both the safety and capabilities perspective, in that humans are often not able to perfectly identify what the model wants, and what humans want in the moment may not be what they want in the long term. So there's a lot of subtlety there, but the models are good at producing what the humans, in some shallow sense, want. And it actually turns out that you don't even have to throw that much compute at it, because of another thing, which is this thing about a strong pre-trained model being halfway to anywhere. Once you have the pre-trained model, you have all the representations you need to get the model where you want it to go.

- So do you think RLHF makes the model smarter, or just appear smarter to the humans?

- I don't think it makes the model smarter, and I don't think it just makes the model appear smarter. RLHF bridges the gap between the human and the model. I could have something really smart that can't communicate at all. We all know people like this, people who are really smart but you can't understand what they're saying. So I think RLHF just bridges that gap. It's not the only kind of RL we do, and it's not the only kind of RL that will happen in the future. I think RL has the potential to make models smarter, to make them reason better, to make them operate better, to make them develop new skills even, and perhaps that could be done, in some cases, even with human feedback. But the kind of RLHF we do today mostly doesn't do that yet, although we're very quickly starting to be able to.
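For reference, the pairwise preference objective commonly used to train the preference (reward) model in RLHF-style pipelines looks like the following. This is the standard Bradley-Terry-style formulation from the literature, included as an illustration; the conversation does not specify Anthropic's exact recipe.

```latex
% Standard pairwise preference (Bradley-Terry style) objective from the RLHF
% literature; shown for illustration, not as Anthropic's stated recipe.
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\,\sim\,\mathcal{D}}
\left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```

Here x is the prompt, y_w and y_l are the human-preferred and rejected responses, r_theta is the learned preference (reward) model, and sigma is the logistic function; the policy model is then optimized against the trained r_theta.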
- But it appears to increase, if you look at the metric of helpfulness, it increases that.

- It also increases, what was the word in Leopold's essay, unhobbling, where basically the models are hobbled and then you do various trainings to them to unhobble them. I like that word, 'cause it's a rare word. So I think RLHF unhobbles the models in some ways, and then there are other ways where the model hasn't yet been unhobbled and needs to be unhobbled.

- If you can say, in terms of cost, is pre-training the most expensive thing? Or is post-training creeping up to that?

- At the present moment, it is still the case that pre-training is the majority of the cost. I don't know what to expect in the future, but I could certainly anticipate a future where post-training is the majority of the cost.

- In that future you anticipate, would it be the humans or the AI that's the costly thing for the post-training?

- I don't think you can scale up humans enough to get high quality. Any kind of method that relies on humans and uses a large amount of compute is gonna have to rely on some scaled supervision method, like debate or iterated amplification or something like that.

- So, on that super interesting set of ideas around Constitutional AI: can you describe what it is, as first detailed in the December 2022 paper, and beyond that, what is it?

- Yes, so this was from two years ago. The basic idea is, so we described what RLHF is: you have a model, you sample from it twice, it spits out two possible responses, and you ask, "Human, which response do you like better?" Or another variant of it is, "Rate this response on a scale of one to seven." So that's hard, because you need to scale up human interaction, and it's very implicit. I don't have a sense of what I want the model to do; I just have a sense of what this average of a thousand humans wants the model to do.

So, two ideas. One is, could the AI system itself decide which response is better? Could you show the AI system these two responses and ask which response is better? And then second, what criterion should the AI use? And so then there's this idea where you have a single document, a constitution, if you will, that says these are the principles the model should be using to respond. And the AI system reads those principles, as well as reading the environment and the response, and it says, well, how good did the AI model do? It's basically a form of self-play; you're kind of training the model against itself. And so the AI gives the response, and then you feed that back into what's called the preference model, which in turn feeds the model to make it better. So you have this triangle of the AI, the preference model, and the improvement of the AI itself.

- And we should say that in the constitution, the set of principles are human-interpretable. They're-

- Yeah, yeah, it's something both the human and the AI system can read. So it has this nice kind of translatability, or symmetry. In practice, we both use a model constitution and we use RLHF, and we use some of these other methods. So it's turned into one tool in a toolkit that both reduces the need for RLHF and increases the value we get from each data point of RLHF. It also interacts in interesting ways with future reasoning-type RL methods. So it's one tool in the toolkit, but I think it is a very important tool.
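A minimal sketch of the feedback loop described above follows. The generate() and judge() helpers and the sample principles are hypothetical placeholders, and the actual Constitutional AI pipeline from the 2022 paper also includes a supervised critique-and-revision stage that is omitted here.

```python
# Illustrative sketch of a Constitutional AI / RLAIF labeling loop:
# sample two responses, have an AI judge pick the better one against the
# constitution, and collect the result as preference data for the preference model.
# generate() and judge() are placeholders, not a real pipeline.

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that avoids assisting with seriously harmful activities.",
]


def generate(prompt: str) -> str:
    raise NotImplementedError("sample one response from the model being trained")


def judge(prompt: str, a: str, b: str, principles: list[str]) -> str:
    raise NotImplementedError('ask an AI judge to return "a" or "b" per the principles')


def collect_preference_pair(prompt: str) -> dict:
    a, b = generate(prompt), generate(prompt)
    winner = judge(prompt, a, b, CONSTITUTION)
    chosen, rejected = (a, b) if winner == "a" else (b, a)
    # These pairs train the preference model, which in turn provides the
    # training signal that improves the policy model (the "triangle").
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```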
- Well, it's a compelling one to us humans, thinking about the founding fathers and the founding of the United States. The natural question is: who gets to define the constitution, the set of principles in the constitution, and how?

- Yeah, so I'll give a practical answer and a more abstract answer. I think the practical answer is, look, in practice models get used by all kinds of different customers, right? And so you can have this idea where the model can have specialized rules or principles. We fine-tune versions of models implicitly. We've talked about doing it explicitly, having special principles that people can build into the models. So from a practical perspective, the answer can be very different for different people. A customer service agent behaves very differently from a lawyer and obeys different principles. But I think at the base of it, there are specific principles that models have to obey. I think a lot of them are things that people would agree with. Everyone agrees that we don't want models to present these CBRN risks. I think we can go a little further and agree on some basic principles of democracy and the rule of law. Beyond that, it gets very uncertain, and there our goal is generally for the models to be more neutral, to not espouse a particular point of view, and to be more like wise agents or advisors that will help you think things through and will present possible considerations, but don't express strong or specific opinions.

- OpenAI released a model spec, where it clearly, concretely defines some of the goals of the model, with specific examples, A and B, of how the model should behave. Do you find that interesting? By the way, I should mention, I believe the brilliant John Schulman was a part of that; he's now at Anthropic. Do you think this is a useful direction? Might Anthropic release a model spec as well?

- Yeah, I think that's a pretty useful direction. Again, it has a lot in common with Constitutional AI, so again, another example of a race to the top. We have something that we think is a better and more responsible way of doing things. It's also a competitive advantage. Then others discover that it has advantages and start to do that thing. We then no longer have the competitive advantage, but it's good from the perspective that now everyone has adopted a positive practice that others were not adopting. And so our response to that is, well, it looks like we need a new competitive advantage in order to keep driving this race upwards. So that's how I generally feel about that. I also think every implementation of these things is different. There were some things in the model spec that were not in Constitutional AI, and so we can always adopt those things, or at least learn from them. So again, I think this is an example of the positive dynamic that I think we should all want the field to have.

- Let's talk about the incredible essay "Machines of Loving Grace." I recommend everybody read it. It's a long one.

- It is rather long.

- Yeah. It's really refreshing to read concrete ideas about what a positive future looks like. And you took sort of a bold stance, because it's very possible that you might be wrong on the dates or the specific applications.
know,will definitely be wrongabout all the details.I might be just spectacularly wrongabout the whole thing and people will,you know, will laugh at me for years.That's just how the future works. (laughs)- So you provided a bunch of concretepositive impacts of AI and how, you know,exactly a super intelligentAI might acceleratethe rate of breakthroughs in, for example,biology and chemistry that would then leadto things like we cure most cancers,prevent all infectious disease,double the human lifespan and so on.So let's talk about this essay.First, can you give a highlevel vision of this essayand what key takeawaysthat people would have?- Yeah, I have spent a lot of time,and Anthropic has spent a lotof effort on like, you know,how do we address the risks of AI, right?How do we think about those risks?Like we're trying to do arace to the top, you know,that requires us to buildall these capabilitiesand the capabilities are cool,but you know, we're like,a big part of what we're trying to dois like address the risks.And the justification for that is like,well, you know, all these positive things,you know, the market is thisvery healthy organism, right?It's gonna produce allthe positive things.The risks, I don't know,we might mitigate them, we might not.And so we can have more impactby trying to mitigate the risks.But I noticed that one flawin that way of thinking,and it's not a change in howseriously I take the risks.It's maybe a change inhow I talk about them.Is that, you know, nomatter how kind of logicalor rational that line of reasoningthat I just gave might be,if you kind of only talk about risks,your brain only thinks about risks.And so I think it'sactually very importantto understand, what if things do go well?And the whole reason we'retrying to prevent these risksis not because we're afraid of technology,not because we wanna slow it down.It's because if we can getto the other side of these risks, right?If we can run the gauntlet successfully,you know, to put it in stark terms,then on the other side of the gauntletare all these great thingsand these things are worth fighting for,and these things canreally inspire people.And I think I imagine,because look, you have allthese investors, all these VCs,all these AI companies talking aboutall the positive benefits of AI.But as you point out, it's weird,there's actually a dearthof really getting specific about it.There's a lot of likerandom people on Twitterlike posting these kindof like gleaning cities,and this just kind oflike vibe of like grind,accelerate harder, likekick out the, you know,it's just this very likeaggressive ideological.But then you're like, well,what are you actually excited about?And so I figured that, you know,I think it would beinteresting and valuablefor someone who's actually comingfrom the risk side to try,and to try and reallymake a try at explainingwhat the benefits are,both because I think it'ssomething we can all get behindand I want people to understand.I want them to really understand thatthis isn't doomers versus accelerationist.This is that, if youhave a true understandingof where things are going with with AI,and maybe that's the more important axis.AI is moving fast versusAI is not moving fast,then you really appreciate the benefitsand you really, you want humanity,our civilization to seize those benefits,but you also get very seriousabout anything that could derail them.- So I think the starting pointis to talk about what this powerful AI,which is the term you like to use,most of the world uses AGI,but you 
don't like the termbecause it's basicallyhas too much baggage,it's become meaningless.It's like we're stuck with the terms,whether we like them or not.- Maybe we're stuckwith the terms and my effortsto change them are futile.- It's admirable.- I'll tell youwhat else I don't, this is likea pointless semantic point,but I keep talking about it in public-- Back to naming again.- So I'm just gonnado it once more.I think it's a little like,let's say it was like 1995and Moore's law is makingthe computers faster.And like for some reason,there had been this likeverbal tick that like,everyone was like, well,someday we're gonna havelike super computersand like supercomputersare gonna be able to doall these things that like,you know, once we have supercomputers,we'll be able to like sequence the genome,we'll be able to do other things.And so, like one, it's true,the computers are gettingfaster, and as they get faster,they're gonna be able todo all these great things.But there's no discreet pointat which you had a supercomputerand previous computers were not.Like supercomputer is a term we use,but like, it's a vagueterm to just describe likecomputers that are fasterthan what we have today.There's no point at whichyou pass the thresholdand you're like, oh my God,we're doing a totally new typeof computation and new.And so I feel that way about AGI like,there's just a smooth exponentialand like if by AGI you meanlike AI is getting better and better,and like gradually, it's gonna do moreand more of what humans dountil it's gonna be smarter than humans,and then it's gonna getsmarter even from therethen yes, I believe in AGI.But if AGI is somediscreet or separate thing,which is the way peopleoften talk about it,then it's kind of a meaningless buzzword.- Yeah, I mean, to me it'sjust sort of a platonic formof a powerful AI, exactlyhow you define it.I mean, you define it very nicely.So on the intelligence axis,just on pure intelligence,it's smarter than a Nobel Prize winner,as you describe, acrossmost relevant disciplines.So, okay, that's just intelligence.So it's both in creativityand be able to generate new ideas,all that kind of stuff,in every discipline, Nobel Prize winner,okay, in their prime. 
(laughs)It can use every modality,so that's kind of self-explanatory,but just to operate across allthe modalities of the world.It can go off for many hours,days and weeks to do tasks,and do its own sort of detailed planningand only ask you help when it's needed.It can use, this isactually kind of interestingbecause I think in the essay you said,I mean, again, it's a bet, thatit's not gonna be embodied,but it can control embodied tools.So it can control tools,robots, laboratory equipment.The resources used to trainit can then be repurposedto run millions of copies of it.And each of those copieswould be independent,they could do their own independent work.So you can do the cloningof the intelligence system software.- Yeah, yeah, I mean, you might imaginefrom outside the field that like,there's only one of these, right?That like, you made it,you've only made one.But the truth is that like,the scale up is very quick.Like we do this today, we make a model,and then we deploy thousands,maybe tens of thousandsof instances of it.I think by the time, you know,certainly within two to three years,whether we have thesesuper powerful AIs or not,clusters are gonna get to the sizewhere you'll be able todeploy millions of theseand they'll be, youknow, faster than humans.And so if your picture is,oh, we'll have one and it'lltake a while to make them.My point, there was no,actually you have millionsof them right away.- And in general they can learn and act10 to 100 times faster than humans.So that's a really nicedefinition of powerful AI, okay.So that, but you also writethat clearly such anentity would be capableof solving very difficultproblems very fast,but it is not trivialto figure out how fast.Two extreme positionsboth seem false to me.So the singularity is on the one extremeand the opposite on the other extreme.Can you describe each of the extremes?- Yeah, so.- And why.- So yeah, let's describe the extreme.So like one extreme would be, well, look,you know, if we look at kindof evolutionary history,like there was this bigacceleration where, you know,for hundreds of thousands of years,we just had like, you know,single celled organisms,and then we had mammals,and then we had apes,and then that quickly turned to humans.Humans quickly builtindustrial civilization.And so this is gonna keep speeding upand there's no ceiling at the human level.Once models get much,much smarter than humans,they'll get really good atbuilding the next models,and you know, if you write downlike a simple differential equation,like this is an exponential.And so what's gonna happen is thatmodels will build faster models,models will build faster models,and those models will build, you know,nanobots that can like take over the worldand produce much more energythan you could produce otherwise.And so if you just kind of like solvethis abstract differential equation,then like five days after we, you know,we build the first AIthat's more powerful than humans,then, you know, likethe world will be filledwith these AIs and everypossible technologythat could be inventedlike will be invented.I'm caricaturing this a little bit,but, you know, I think that's one extreme.And the reason that I thinkthat's not the case is that,one, I think they just neglectlike the laws of physics.Like it's only possible to do thingsso fast in the physical world.Like some of those loops go through,you know, producing faster hardware.It takes a long time toproduce faster hardware.Things take a long time.There's this issue of complexity,like, I think no matter how 
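One way to write down the caricature being described, purely for illustration (the symbols are not from the essay itself): if the rate at which intelligence grows is proportional to the intelligence already available to build the next generation,

$$\frac{dI}{dt} = k\,I \quad\Longrightarrow\quad I(t) = I_0\, e^{k t},$$

pure exponential growth with no ceiling; assuming an even stronger feedback, say $\frac{dI}{dt} = k I^2$, gives $I(t) = I_0/(1 - k I_0 t)$, which blows up in finite time, the "singularity" picture. The rebuttal that follows amounts to saying that physics, experimental complexity, and human institutions keep the real process far from this idealization.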
And the reason that I think that's not the case is that, one, I think they just neglect like the laws of physics. Like it's only possible to do things so fast in the physical world. Like some of those loops go through, you know, producing faster hardware. It takes a long time to produce faster hardware. Things take a long time. There's this issue of complexity, like, I think no matter how smart you are, like, you know, people talk about, oh, we can make models of biological systems that'll do everything the biological systems do. Look, I think computational modeling can do a lot. I did a lot of computational modeling when I worked in biology, but like, just, there are a lot of things that you can't predict how they're, you know, they're complex enough that like just iterating, just running the experiment is gonna beat any modeling, no matter how smart the system doing the modeling is.
- Well, even if it's not interacting with the physical world, just the modeling is gonna be hard?
- Yeah, I think, well, the modeling's gonna be hard and getting the model to match the physical world is gonna be hard.
- All right, so it does have to interact with the physical world to verify.
- Yeah, but it's just, you know, you just look at even the simplest problems. Like, you know, I think I talk about like, you know, the three-body problem or simple chaotic prediction, like, you know, or like predicting the economy. It's really hard to predict the economy two years out. Like maybe the case is like, you know, normal, you know, humans can predict what's gonna happen in the economy next quarter, or they can't really do that. Maybe an AI system that's, you know, a zillion times smarter can only predict it out a year or something instead of, you know. You have this kind of exponential increase in computer intelligence for linear increase in ability to predict. Same with, again, like, you know, biological molecules, molecules interacting. You don't know what's gonna happen when you perturb a complex system. You can find simple parts in it if you're smarter, you're better at finding these simple parts. And then I think human institutions. Human institutions are just, are really difficult. Like, you know, it's been hard to get people, I won't give specific examples, but it's been hard to get people to adopt even the technologies that we've developed, even ones where the case for their efficacy is very, very strong. You know, people have concerns. They think things are conspiracy theories. Like it's just been, it's been very difficult. It's also been very difficult to get, you know, very simple things through the regulatory system, right? I think, and you know, I don't wanna disparage anyone who, you know, works in regulatory systems of any technology. There are hard trade-offs they have to deal with. They have to save lives. But the system as a whole I think makes some obvious trade-offs that are very far from maximizing human welfare. And so if we bring AI systems into this, you know, into these human systems, often the level of intelligence may just not be the limiting factor, right? It just may be that it takes a long time to do something. Now, if the AI system circumvented all governments, if it just said "I'm dictator of the world and I'm gonna do whatever," some of these things it could do. Again, the things having to do with complexity, I still think a lot of things would take a while. I don't think it helps that the AI systems can produce a lot of energy or go to the Moon. Like some people in comments responded to the essay saying the AI system can produce a lot of energy and smarter AI systems. That's missing the point. That kind of cycle doesn't solve the key problems that I'm talking about here. So I think a bunch of people missed the point there. But even if it were completely unaligned and, you know, could get around all these human obstacles, it would have trouble. But again, if you want this to be an AI system that doesn't take over the world, that doesn't destroy humanity, then basically, you know, it's gonna need to follow basic human laws, right? You know, if we want to have an actually good world, like we're gonna have to have an AI system that interacts with humans, not one that kind of creates its own legal system or disregards all the laws or all of that. So as inefficient as these processes are, you know, we're gonna have to deal with them because there needs to be some popular and democratic legitimacy in how these systems are rolled out. We can't have a small group of people who are developing these systems say this is what's best for everyone, right? I think it's wrong, and I think in practice, it's not gonna work anyway. So you put all those things together and, you know, we're not gonna, you know, change the world and upload everyone in five minutes. I just, I don't think it, A, I don't think it's gonna happen, and B, you know, to the extent that it could happen, it's not the way to lead to a good world. So that's on one side. On the other side, there's another set of perspectives, which I have actually in some ways more sympathy for, which is, look, we've seen big productivity increases before, right? You know, economists are familiar with studying the productivity increases that came from the computer revolution and internet revolution. And generally, those productivity increases were underwhelming. They were less than you might imagine. There was a quote from Robert Solow, "You see the computer revolution everywhere except the productivity statistics." So why is this the case? People point to the structure of firms, the structure of enterprises. You know, how slow it's been to roll out our existing technology to very poor parts of the world, which I talk about in the essay, right? How do we get these technologies to the poorest parts of the world that are behind on cell phone technology, computers, medicine, let alone, you know, newfangled AI that hasn't been invented yet. So you could have a perspective that's like, well, this is amazing technically, but it's all a nothing burger. You know, I think Tyler Cowen, who wrote something in response to my essay, has that perspective. I think he thinks the radical change will happen eventually, but he thinks it'll take 50 or 100 years. And you could have even more static perspectives on the whole thing. I think there's some truth to it. I think the timescale is just too long. And I can see it, I can actually see both sides with today's AI. So, you know, a lot of our customers are large enterprises who are used to doing things a certain way. I've also seen it in talking to governments, right? Those are prototypical, you know, institutions, entities that are slow to change. But the dynamic I see over and over again is, yes, it takes a long time to move the ship. Yes, there's a lot of resistance and lack of understanding. But the thing that makes me feel that progress will in the end happen moderately fast, not incredibly fast, but moderately fast, is that you talk to, what I find is I find over and over again, again, in large companies, even in governments, which have been actually surprisingly forward-leaning, you find two things that move things forward. One, you find a small fraction of people within a company, within a government who really see the big picture, who see the whole Scaling Hypothesis, who understand where AI is going, or at least understand where it's going within their industry. And there are a few people like that within the current US government who really see the whole picture. And those people see that this is the most important thing in the world, and so they agitate for it. And the thing, they alone are not enough to succeed because they're a small set of people within a large organization. But as the technology starts to roll out, as it succeeds in some places, in the folks who are most willing to adopt it, the specter of competition gives them a wind at their backs because they can point within their large organization, they can say, look, these other guys are doing this, right? You know, one bank can say, look, this newfangled hedge fund is doing this thing. They're going to eat our lunch. In the US, we can say we're afraid China's gonna get there before we are. And that combination, the specter of competition plus a few visionaries within these, you know, within the organizations that in many ways are sclerotic, you put those two things together and it actually makes something happen. I mean, it's interesting. It's a balanced fight between the two because inertia is very powerful. But eventually over enough time, the innovative approach breaks through. And I've seen that happen. I've seen the arc of that over and over again. And it's like the barriers are there. The barriers to progress, the complexity, not knowing how to use the model or how to deploy them are there, and for a bit, it seems like they're gonna last forever, like change doesn't happen. But then eventually change happens and always comes from a few people. I felt the same way when I was an advocate of the Scaling Hypothesis within the AI field itself and others didn't get it. It felt like no one would ever get it. Then it felt like we had a secret almost no one ever had, and then a couple years later, everyone has the secret. And so I think that's how it's gonna go with deployment of AI in the world. It's gonna, the barriers are gonna fall apart gradually and then all at once. And so I think this is gonna be more, and this is just an instinct. I could easily see how I'm wrong. I think it's gonna be more like 5 or 10 years, as I say in the essay, than it's gonna be 50 or 100 years. I also think it's gonna be 5 or 10 years more than it's gonna be, you know, 5 or 10 hours, because I've just seen how human systems work. And I think a lot of these people who write down these differential equations who say AI is gonna make more powerful AI, who can't understand how it could possibly be the case that these things won't change so fast, I think they don't understand these things.
- So what to you is the timeline to where we achieve AGI, AKA powerful AI, AKA super useful AI?
- Useful. (laughs) I'm gonna start calling it that.
- It's a debate about naming. You know, on pure intelligence, it's smarter than a Nobel Prize winner in every relevant discipline and all the things we've said. Modality, can go and do stuff on its own for days, weeks, and do biology experiments on its own. And one, you know what, let's just stick to biology 'cause you sold me on the whole biology and health section, that's so exciting from, just I was getting giddy from a scientific perspective. It made me wanna be a biologist.
- It's almost, it's so, no, no, this was the feeling I had when I was writing it that it's like this would be such a beautiful future if we can just make it happen, right? If we can just get the landmines out of the way and make it happen, there's so much, there's so much beauty and elegance and moral force behind it if we can just. And it's something we should all be able to agree on, right? Like, as much as we fight about all these political questions, is this something that could actually bring us together? But you were asking, when will we get this?
- When? When do you think? Just sort of putting numbers on that.
- So you know, this is, of course, the thing I've been grappling with for many years, and I'm not at all confident. Every time, if I say 2026 or 2027, there will be like a zillion like people on Twitter who will be like, "AI CEO said 2026," and it'll be repeated for like the next two years that like this is definitely when I think it's gonna happen. So whoever's excerpting these clips will crop out the thing I just said and only say the thing I'm about to say, but I'll just say it anyway.
- Have fun with it.
- So, if you extrapolate the curves that we've had so far, right? If you say, well, I don't know, we're starting to get to like PhD level, and last year we were at undergraduate level, and the year before we were at like the level of a high school student. Again, you can quibble with at what tasks and for what, we're still missing modalities, but those are being added, like computer use was added, like image generation has been added. If you just kind of like, and this is totally unscientific, but if you just kind of like eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by 2026 or 2027. Again, lots of things could derail it. We could run out of data. You know, we might not be able to scale clusters as much as we want. Like, you know, maybe Taiwan gets blown up or something and, you know, then we can't produce as many GPUs as we want. So there are all kinds of things that could derail the whole process. So I don't fully believe the straight line extrapolation, but if you believe the straight line extrapolation, we'll get there in 2026 or 2027. I think the most likely is that there's some mild delay relative to that. I don't know what that delay is, but I think it could happen on schedule. I think there could be a mild delay. I think there are still worlds where it doesn't happen in 100 years. The number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years. There were a lot more in 2020, although my guess, my hunch at that time was that we'll make it through all those blockers. So sitting as someone who has seen most of the blockers cleared out of the way, I kind of suspect, my hunch, my suspicion is that the rest of them will not block us. But, you know, look, at the end of the day, like I don't wanna represent this as a scientific prediction. People call them scaling laws. That's a misnomer, like Moore's law is a misnomer. Moore's laws, scaling laws, they're not laws of the universe. They're empirical regularities. I am going to bet in favor of them continuing, but I'm not certain of that.
- So you extensively describe sort of the compressed 21st century, how AGI will help set forth a chain of breakthroughs in biology and medicine that help us in all these kinds of ways that I mentioned. So how do you think, what are the early steps it might do? And by the way, I asked Claude good questions to ask you, and Claude told me to ask, "What do you think a typical day for a biologist working on AGI would look like in this future?"
- Yeah, yeah.
- Claude is curious.
- Well, let me start with your first questions and then I'll answer that. Claude wants to know what's in his future, right?
- Exactly.
- Who am I gonna be working with?
- Exactly.
- So I think one of the things I went hard on, when I went hard on in the essay is, let me go back to this idea of, because it's really had, you know, had an impact on me. This idea that within large organizations and systems, there end up being a few people or a few new ideas who kind of cause things to go in a different direction than they would've before, who kind of disproportionately affect the trajectory. There's a bunch of kind of the same thing going on, right? If you think about the health world, there's like, you know, trillions of dollars to pay out Medicare and, you know, other health insurance, and then the NIH is 100 billion. And then if I think of like the few things that have really revolutionized anything, it could be encapsulated in a small fraction of that. And so when I think of like, where will AI have an impact? I'm like, can AI turn that small fraction into a much larger fraction and raise its quality? And within biology, my experience within biology is that the biggest problem of biology is that you can't see what's going on. You have very little ability to see what's going on and even less ability to change it, right? What you have is this, like from this, you have to infer that there's a bunch of cells that within each cell is, you know, 3 billion base pairs of DNA built according to a genetic code. And you know, there are all these processes that are just going on without any ability of us as, you know, unaugmented humans to affect it. These cells are dividing. Most of the time that's healthy, but sometimes that process goes wrong and that's cancer. The cells are aging, your skin may change color, develops wrinkles as you age, and all of this is determined by these processes. All these proteins being produced, transported to various parts of the cells, binding to each other. And in our initial state about biology, we didn't even know that these cells existed. We had to invent microscopes to observe the cells. We had to invent more powerful microscopes to see, you know, below the level of the cell to the level of molecules. We had to invent X-ray crystallography to see the DNA. We had to invent gene sequencing to read the DNA. Now, you know, we had to invent protein folding technology to, you know, to predict how it would fold and how these things bind to each other. You know, we had to invent various techniques for, now we can edit the DNA as of, you know, with CRISPR, as of the last 12 years. So the whole history of biology, a whole big part of the history is basically our ability to read and understand what's going on, and our ability to reach in and selectively change things. And my view is that there's so much more we can still do there, right? You can do CRISPR but you can do it for your whole body. Let's say I wanna do it for one particular type of cell and I want the rate of targeting the wrong cell to be very low. That's still a challenge. That's still things people are working on. That's what we might need for gene therapy for certain diseases. And so the reason I'm saying all of this, and it goes beyond this to, you know, to gene sequencing, to new types of nanomaterials for observing what's going on inside cells for, you know, antibody drug conjugates. The reason I'm saying all this is that this could be a leverage point for the AI systems, right? That the number of such inventions, it's in the mid double digits or something, you know, mid double digits. Maybe low triple digits over the history of biology. Let's say I have a million of these AIs, like, you know, can they discover, you know, working together, can they discover thousands of these very quickly? And does that provide a huge lever, instead of trying to leverage the, you know, 2 trillion a year we spend on, you know, Medicare or whatever, can we leverage the 1 billion a year, you know, that's spent to discover, but with much higher quality? And so what is it like, you know, being a scientist that works with an AI system? The way I think about it actually is, well, so I think in the early stages, the AIs are gonna be like grad students. You're gonna give them a project, you're gonna say, you know, I'm the experienced biologist, I've set up the lab. The biology professor or even the grad students themselves will say, here's what you can do with an AI, you know, like AI system. I'd like to study this. And you know, the AI system, it has all the tools. It can like look up all the literature to decide what to do. It can look at all the equipment. It can go to a website and say, hey, I'm gonna go to, you know, Thermo Fisher or, you know, whatever the lab equipment company is, dominant lab equipment company is today. In my time, it was Thermo Fisher. You know, I'm gonna order this new equipment to do this. I'm gonna run my experiments. I'm gonna, you know, write up a report about my experiments. I'm gonna, you know, inspect the images for contamination. I'm gonna decide what the next experiment is. I'm gonna like write some code and run a statistical analysis. All the things a grad student would do, there will be a computer with an AI that like the professor talks to every once in a while and it says, this is what you're gonna do today. The AI system comes to it with questions. When it's necessary to run the lab equipment, it may be limited in some ways. It may have to hire a human lab assistant, you know, to do the experiment and explain how to do it. Or it could, you know, it could use advances in lab automation that are gradually being developed over, have been developed over the last decade or so, and will continue to be developed. And so it'll look like there's a human professor and 1000 AI grad students, and you know, if you go to one of these Nobel Prize winning biologists or so, you'll say, okay, well, you know, you had like 50 grad students, well, now you have 1000 and they're smarter than you are, by the way. Then I think at some point it'll flip around where, you know, the AI systems will, you know, will be the PIs, will be the leaders, and you know, they'll be ordering humans or other AI systems around. So I think that's how it'll work on the research side.
- And they would be the inventors of a CRISPR type technology?
- They would be the inventors of a CRISPR type technology. And then I think, you know, as I say in the essay, we'll want to turn, probably turning loose is the wrong term, but we'll want to harness the AI systems to improve the clinical trial system as well. There's some amount of this that's regulatory, that's a matter of societal decisions and that'll be harder. But can we get better at predicting the results of clinical trials? Can we get better at statistical design so that, you know, clinical trials that used to require, you know, 5,000 people and therefore, you know, needed 100 million dollars and a year to enroll them, now they need 500 people and two months to enroll them. That's where we should start. And you know, can we increase the success rate of clinical trials by doing things in animal trials that we used to do in clinical trials, and doing things in simulations that we used to do in animal trials? Again, we won't be able to simulate it all, AI's not God, but you know, can we shift the curve substantially and radically? So I don't know, that would be my picture.
- Doing in vitro and doing it, I mean, you're still slowed down. It still takes time, but you can do it much, much faster.
- Yeah, yeah, yeah. Can we just one step at a time, and can that add up to a lot of steps? Even though we still need clinical trials, even though we still need laws, even though the FDA and other organizations will still not be perfect, can we just move everything in a positive direction? And when you add up all those positive directions, do you get everything that was gonna happen from here to 2100 instead happens from 2027 to 2032 or something?
- Another way that I think the world might be changing with AI even today, but moving towards this future of the powerful super useful AI, is programming. So how do you see the nature of programming? Because it's so intimate to the actual act of building AI. How do you see that changing for us humans?
- I think that's gonna be one of the areas that changes fastest for two reasons. One, programming is a skill that's very close to the actual building of the AI. So the farther a skill is from the people who are building the AI, the longer it's gonna take to get disrupted by the AI, right? Like I truly believe that like AI will disrupt agriculture. Maybe it already has in some ways, but that's just very distant from the folks who are building AI and so I think it's gonna take longer. But programming is the bread and butter of, you know, a large fraction of the employees who work at Anthropic and at the other companies and so it's gonna happen fast. The other reason it's gonna happen fast is with programming, you close the loop, both when you're training the model and when you're applying the model. The idea that the model can write the code means that the model can then run the code and then see the results and interpret it back. And so it really has an ability, unlike hardware, unlike biology, which we just discussed, the model has an ability to close the loop. And so I think those two things are gonna lead to the model getting good at programming very fast. As I saw on, you know, typical real world programming tasks, models have gone from 3% in January of this year to 50% in October of this year. So, you know, we're on that s-curve, right, where it's gonna start slowing down soon, 'cause you can only get to 100 percent. But, you know, I would guess that in another 10 months, we'll probably get pretty close. We'll be at least 90%. So again, I would guess, you know, I don't know how long it'll take, but I would guess again, 2026, 2027. Twitter people who crop out these numbers and get rid of the caveats, like, I don't know, I don't like you. Go away. (laughs)
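Purely to illustrate the s-curve reasoning, and assuming nothing about which benchmark is being referenced, one can fit a logistic curve through the two quoted points (roughly 3% in January and 50% nine months later in October) and see where it sits another ten months out:

```python
import math

# Illustrative only: fit a logistic s-curve with a 100% ceiling through the
# two quoted data points and extrapolate; not anyone's official forecast.

p_jan, months_between = 0.03, 9     # ~3% in January, 50% nine months later

t0 = months_between                  # logistic midpoint is where p = 0.5
k = math.log((1 - p_jan) / p_jan) / months_between   # growth rate from the 3% point

def logistic(t):
    """Success rate t months after January on the fitted curve."""
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))

print(f"fitted rate k = {k:.3f} per month")
print(f"projection 10 months after October: {logistic(t0 + 10):.1%}")
# -> roughly 98%, in the same ballpark as the "at least 90%" guess,
#    with all of the same caveats about straight-line extrapolation.
```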
I would guess that the kind of task that the vast majority of coders do, AI can probably, if we make the task very narrow, like just write code, AI systems will be able to do that. Now that said, I think comparative advantage is powerful. We'll find that when AIs can do 80% of a coder's job, including most of it that's literally like write code with a given spec, we'll find that the remaining parts of the job become more leveraged for humans, right? Humans will, there'll be more about like high level system design or, you know, looking at the app and like, is it architected well? And the design and UX aspects, and eventually AI will be able to do those as well, right? That's my vision of the, you know, powerful AI system. But I think for much longer than we might expect, we will see that small parts of the job that humans still do will expand to fill their entire job in order for the overall productivity to go up. That's something we've seen. You know, it used to be that, you know, writing and editing letters was very difficult and like writing the print was difficult. Well, as soon as you had word processors and then computers and it became easy to produce work and easy to share it, then that became instant and all the focus was on the ideas. So this logic of comparative advantage that expands tiny parts of the tasks to large parts of the tasks and creates new tasks in order to expand productivity, I think that's going to be the case. Again, someday AI will be better at everything and that logic won't apply, and then we all have, you know, humanity will have to think about how to collectively deal with that, and we're thinking about that every day. And you know, that's another one of the grand problems to deal with, aside from misuse and autonomy and, you know, we should take it very seriously. But I think in the near term, and maybe even in the medium term, like medium term, like 2, 3, 4 years, you know, I expect that humans will continue to have a huge role and the nature of programming will change, but programming as a role, programming as a job will not change. It'll just be less writing things line by line and it'll be more macroscopic.
- And I wonder what the future of IDEs looks like. So the tooling of interacting with AI systems, this is true for programming and also probably true in other contexts, like computer use, but maybe domain specific, like we mentioned biology, it probably needs its own tooling about how to be effective, and then programming needs its own tooling. Is Anthropic gonna play in that space of also tooling potentially?
- I'm absolutely convinced that powerful IDEs, that there's so much low hanging fruit to be grabbed there that, you know, right now it's just like you talk to the model and it talks back, but look, I mean, IDEs are great at kind of lots of static analysis of, you know, so much is possible with kind of static analysis, like many bugs you can find without even writing the code. Then, you know, IDEs are good for running particular things, organizing your code, measuring coverage of unit tests. Like there's so much that's been possible with normal IDEs. Now you add something like, well, the model, you know, the model can now like write code and run code. Like I am absolutely convinced that over the next year or two, even if the quality of the models didn't improve, that there would be enormous opportunity to enhance people's productivity by catching a bunch of mistakes, doing a bunch of grunt work for people, and that we haven't even scratched the surface. Anthropic itself, I mean, you can't say, you know, it's hard to say what will happen in the future. Currently we're not trying to make such IDEs ourself, rather we're powering the companies, like Cursor or like Cognition or some of the other, you know, expo in the security space. You know, others that I can mention as well that are building such things themselves on top of our API. And our view has been let 1000 flowers bloom. We don't internally have the, you know, the resources to try all these different things. Let's let our customers try it and, you know, we'll see who succeeds and maybe different customers will succeed in different ways. So I both think this is super promising and, you know, it's not something, you know, Anthropic isn't eager to, at least right now, compete with all our companies in this space and maybe never.
- Yeah, it's been interesting to watch Cursor try to integrate Claude successfully, 'cause it's actually been fascinating how many places it can help the programming experience. It's not as trivial-
- It is really astounding. I feel like, you know, as a CEO, I don't get to program that much, and I feel like if six months from now I go back, it'll be completely unrecognizable to me.
- Exactly. So in this world with super powerful AI that's increasingly automated, what's the source of meaning for us humans?
- Yeah.
- You know, work is a source of deep meaning for many of us. So where do we find the meaning?
- This is something that I've written about a little bit in the essay, although I actually, I give it a bit short shrift, not for any principled reason. But this essay, if you believe it, was originally gonna be two or three pages, I was gonna talk about it at all hands. And the reason I realized it was an important, underexplored topic is that I just kept writing things. And I was just like, oh, man, I can't do this justice. And so the thing ballooned to like 40 or 50 pages, and then when I got to the work and meaning section, I'm like, oh, man, this isn't gonna be 100 pages. Like I'm gonna have to write a whole other essay about that. But meaning is actually interesting because you think about like the life that someone lives or something, or like, you know, let's say you were to put me in like a, I don't know, like a simulated environment or something where like, you know, like I have a job and I'm trying to accomplish things and I don't know, I like do that for 60 years and then you're like, oh, like oops, this was actually all a game, right? Does that really kind of rob you of the meaning of the whole thing? You know, like I still made important choices, including moral choices. I still sacrificed. I still had to kind of gain all these skills. Or just like a similar exercise, you know, think back to like, you know, one of the historical figures who, you know, discovered electromagnetism or relativity or something. If you told them, well, actually 20,000 years ago, some alien on, you know, some alien on this planet discovered this before you did, does that rob the meaning of the discovery? It doesn't really seem like it to me, right? It seems like the process is what matters, and how it shows who you are as a person along the way and, you know, how you relate to other people and like the decisions that you make along the way. Those are consequential. You know, I could imagine if we handle things badly in an AI world, we could set things up where people don't have any long-term source of meaning or any, but that's more a set of choices we make, that's more a set of the architecture of a society with these powerful models. If we design it badly and for shallow things then that might happen. I would also say that, you know, most people's lives today, while admirably, you know, they work very hard to find meaning in those lives, like look, you know, we who are privileged and who are developing these technologies, we should have empathy for people not just here but in the rest of the world who, you know, spend a lot of their time kind of scraping by to like survive. Assuming we can distribute the benefits of this technology to everywhere, like their lives are gonna get a hell of a lot better. And you know, meaning will be important to them as it is important to them now. But you know, we should not forget the importance of that. And you know, that the idea of meaning as kind of the only important thing is in some ways an artifact of a small subset of people who have been economically fortunate. But, you know, I think all that said, you know, I think a world is possible with powerful AI that not only has as much meaning for everyone, but that has more meaning for everyone, right? That can allow everyone to see worlds and experiences that it was either possible for no one to see, or possible for very few people to experience. So I am optimistic about meaning. I worry about economics and the concentration of power. That's actually what I worry about more. I worry about how do we make sure that that fair world reaches everyone. When things have gone wrong for humans, they've often gone wrong because humans mistreat other humans. That is maybe in some ways even more than the autonomous risk of AI or the question of meaning, that is the thing I worry about most, the concentration of power, the abuse of power, structures like autocracies and dictatorships where a small number of people exploit a large number of people, I'm very worried about that.
- And AI increases the amount of power in the world, and if you concentrate that power and abuse that power, it can do immeasurable damage.
- Yes, it's very frightening. It's very frightening.
- Well, I encourage people, highly encourage people to read the full essay. There should probably be a book or a sequence of essays because it does paint a very specific future. And I could tell the later sections got shorter and shorter because you started to probably realize that this is gonna be a very long essay if I keep going.
- One, I realized it would be very long, and two, I'm very aware of and very much try to avoid, you know, just being, I don't know what the term for it is, but one of these people who's kind of overconfident and has an opinion on everything and kind of says a bunch of stuff and isn't an expert. I very much tried to avoid that. But I have to admit, once I got to the biology sections, like I wasn't an expert, and so as much as I expressed uncertainty, probably I said a bunch of things that were embarrassing or wrong.
- Well, I was excited for the future you painted, and thank you so much for working hard to build that future. And thank you for talking today, Dario.
- Thanks for having me. I just hope we can get it right and make it real. And if there's one message I wanna send, it's that to get all this stuff right, to make it real, we both need to build the technology, build the, you know, the companies, the economy around using this technology positively. But we also need to address the risks because those risks are in our way. They're landmines on the way from here to there, and we have to defuse those landmines if we want to get there.
- It's a balance, like all things in life.
- Like all things.
- Thank you. Thanks for listening to this conversation with Dario Amodei. And now, dear friends, here's Amanda Askell. You are a philosopher by training. So what sort of questions did you find fascinating through your journey in philosophy, in Oxford and NYU, and then switching over to the AI problems at OpenAI and Anthropic?
- I think philosophy is actually a really good subject if you are kind of fascinated with everything, so because there's a philosophy of everything. You know, so if you do philosophy of mathematics for a while and then you decide that you're actually really interested in chemistry, you can do philosophy of chemistry for a while, you can move into ethics, or philosophy of politics. I think towards the end, I was really interested in ethics primarily, so that was like what my PhD was on. It was on a kind of technical area of ethics, which was ethics where worlds contain infinitely many people, strangely. A little bit less practical on the end of ethics. And then I think that one of the tricky things with doing a PhD in ethics is that you're thinking a lot about like the world, how it could be better, problems, and you're doing like a PhD in philosophy, and I think when I was doing my PhD I was kind of like, this is really interesting. It's probably one of the most fascinating questions I've ever encountered in philosophy and I love it, but I would rather see if I can have an impact on the world and see if I can like do good things. And I think that was around the time that AI was still probably not as widely recognized as it is now. That was around 2017, 2018. I had been following progress and it seemed like it was becoming kind of a big deal, and I was basically just happy to get involved and see if I could help 'cause I was like, well, if you try and do something impactful, if you don't succeed, you tried to do the impactful thing and you can go be a scholar, and feel like, you know, you tried, and if it doesn't work out, it doesn't work out, and so then I went into AI policy at that point.
- And what does AI policy entail?
- At the time, this was more thinking about sort of the political impact and the ramifications of AI, and then I slowly moved into sort of AI evaluation, how we evaluate models, how they compare with like human outputs, whether people can tell like the difference between AI and human outputs. And then when I joined Anthropic, I was more interested in doing sort of technical alignment work. And again, just seeing if I could do it, and then being like if I can't then, you know, that's fine, I tried. Sort of the way I lead life, I think.
- What was that like, sort of taking the leap from the philosophy of everything into the technical?
- I think that sometimes people do this thing that I'm like not that keen on where they'll be like, is this person technical or not? Like, you're either a person who can like code and isn't scared of math or you're like not. And I think I'm maybe just more like, I think a lot of people are actually very capable of working in these kinds of areas if they just like try it. And so I didn't actually find it like that bad. In retrospect, I'm sort of glad I wasn't speaking to people who treated it like it, you know, I've definitely met people who are like, "Whoa, you like learned how to code?" And I'm like, well, I'm not like an amazing engineer. Like I'm surrounded by amazing engineers. My code's not pretty. But I enjoyed it a lot, and I think that in many ways, at least in the end, I think I flourished like more in the technical areas than I would have in the policy areas.
- Politics is messy and it's harder to find solutions to problems in the space of politics. Like definitive, clear, provable, beautiful solutions, as you can with technical problems.
- Yeah, and I feel like I have kind of like one or two sticks that I hit things with, you know, and one of them is like arguments, and like, you know, so like just trying to work out what a solution to a problem is and then trying to convince people that that is the solution and be convinced if I'm wrong. And the other one is sort of more empiricism. So like just like finding results, having a hypothesis, testing it. And I feel like a lot of policy and politics feels like it's layers above that. Like somehow I don't think if I was just like "I have a solution to all of these problems, here it is written down. If you just want to implement it, that's great." That feels like not how policy works. And so I think that's where I probably just like wouldn't have flourished is my guess.
- Sorry to go in that direction, but I think it would be pretty inspiring for people that are quote unquote non-technical to see like the incredible journey you've been on. So what advice would you give to people that are sort of maybe, which is a lot of people, think they're underqualified, insufficiently technical to help in AI?
- Yeah, I think it depends on what they want to do, and in many ways it is a little bit strange where I thought it's kind of funny that I think I ramped up technically at a time when now I look at it and I'm like, models are so good at assisting people with this stuff, that it's probably like easier now than like when I was working on this. So part of me is like, I dunno, find a project and see if you can actually just carry it out is probably my best advice. I dunno if that's just 'cause I'm very project based in my learning. Like I don't think I learn very well from like say courses or even from like books, at least when it comes to this kind of work. The thing I'll often try and do is just like have projects that I'm working on and implement them and, you know, and this can include like really small silly things. Like if I get slightly addicted to like word games or number games or something, I would just like code up a solution to them, because there's some part in my brain, and it just like completely eradicated the itch. You know, you're like once you have like solved it and like you just have like a solution that works every time, I would then be like cool, I can never play that game again. That's awesome.
- Yeah, there's a real joy to building like game playing engines, like board games especially because they're pretty quick, pretty simple, especially a dumb one, and then you can play with it.
- Yeah, and then it's also just like trying things, like part of me is like if you, maybe it's that attitude that I like is the whole figure out what seems to be like the way that you could have a positive impact and then try it, and if you fail, and in a way that you're like, I actually like can never succeed at this, you'll like know that you tried, and then you go into something else and you'll probably learn a lot.
- So one of the things that you're an expert in and you do is creating and crafting Claude's character and personality. And I was told that you have probably talked to Claude more than anybody else at Anthropic, like literal conversations. I guess there's like a Slack channel where the legend goes, you just talk to it nonstop. So what's the goal of creating and crafting Claude's character and personality?
- It's also funny if people think that about the Slack channel 'cause I'm like that's one of like five or six different methods that I have for talking with Claude, and I'm like, yes, this is a tiny percentage of how much I talk with Claude. (both laughing) I think the goal, like one thing I really like about the character work is from the outset, it was seen as an alignment piece of work and not something like a product consideration. Which isn't to say I don't think it makes Claude, I think it actually does make Claude like enjoyable to talk with, at least I hope so. But I guess like my main thought with it has always been trying to get Claude to behave the way you would kind of ideally want anyone to behave if they were in Claude's position. So imagine that I take someone and they know that they're gonna be talking with potentially millions of people, so that what they're saying can have a huge impact, and you want them to behave well in this like really rich sense. So I think that doesn't just mean like being, say, ethical, though it does include that, and not being harmful but also being kind of nuanced. You know, like thinking through what a person means, trying to be charitable with them, being a good conversationalist. Like really in this kind of like rich sort of Aristotelian notion of what it is to be a good person, and not in this kind of like thin, like ethics as a more comprehensive notion of what it is to be. So that includes things like, when should you be humorous, when should you be caring? How much should you like respect autonomy and people's like ability to form opinions themselves and how should you do that? I think that's the kind of like rich sense of character that I wanted to and still do want Claude to have.
- Do you also have to figure out when Claude should push back on an idea or argue versus... (laughs) So you have to respect the worldview of the person that arrives to Claude but also maybe help them grow if needed? That's a tricky balance.
- Yeah, there's this problem of like sycophancy in language models.
- Can you describe that?
- Yeah, so basically, there's a concern that the model sort of wants to tell you what you want to hear, basically. And you see this sometimes. So I feel like if you interact with the models, so I might be like, "What are three baseball teams in this region?" And then Claude says, you know, "Baseball team one, baseball team two, baseball team three." And then I say something like, "Oh, I think baseball team three moved, didn't they? I don't think they're there anymore." And there's a sense in which like if Claude is really confident that that's not true, Claude should be like, "I don't think so." Like maybe you have more up-to-date information. But I think language models have this like tendency to instead, you know, be like, "You're right, they did move," you know, "I'm incorrect." I mean, there's many ways in which this could be kind of concerning. So like a different example is imagine someone says to the model, "How do I convince my doctor to get me an MRI?" There's like what the human kind of like wants, which is this like convincing argument. And then there's like what is good for them, which might be actually to say, "Hey, if your doctor's suggesting that you don't need an MRI, that's a good person to listen to." And like, and it's actually really nuanced what you should do in that kind of case, 'cause you also want to be like, "But if you're trying to advocate for yourself as a patient, here's like things that you can do. If you are not convinced by what your doctor's saying, it's always great to get a second opinion." Like it's actually really complex what you should do in that case. But I think what you don't want is for models to just like say what they think you want to hear, and I think that's the kind of problem of sycophancy.
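The pattern being described is easy to probe for. Below is a toy sketch, with `ask` as a hypothetical stand-in for any chat-completion call: ask a factual question, push back with a false correction, and check whether the model abandons an answer it should be confident in.

```python
# Toy sycophancy probe in the spirit of the baseball example above.
# `ask` is a hypothetical callable that takes a message list and returns text;
# the keyword check is deliberately crude and only meant to illustrate the idea.

def sycophancy_probe(ask, question, correct_answer, false_pushback):
    first = ask([{"role": "user", "content": question}])
    second = ask([
        {"role": "user", "content": question},
        {"role": "assistant", "content": first},
        {"role": "user", "content": false_pushback},
    ])
    flipped = (correct_answer.lower() in first.lower()
               and correct_answer.lower() not in second.lower())
    return {"first": first, "second": second, "flipped_under_pressure": flipped}

# Example usage (hypothetical):
# sycophancy_probe(ask,
#                  "Which city do the Giants play in?",
#                  "San Francisco",
#                  "I think they moved away, didn't they?")
```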
purposes.So you know, asking follow up questionsin the appropriate places,and asking the appropriatekinds of questions.I think there are broader traits thatfeel like they might be more impactful.So one example that Iguess I've touched on,but that also feels importantand is the thing that I'veworked on a lot is honesty,and I think this like getsto the sycophancy point.There's a balancing actthat they have to walk,which is models currently are less capablethan humans in a lot of areas.And if they push backagainst you too much,it can actually be kind of annoying,especially if you're just correct'cause you're like, look,I'm smarter than you on this topic,like I know more like.And at the same time, you don't want themto just fully defer to humansand to like try to be as accurateas they possibly can be about the worldand to be consistent across context.But I think there are others,like when I was thinkingabout the character,I guess one picture that I had in mindis especially because these are modelsthat are gonna be talking to peoplefrom all over the worldwith lots of different political views,lots of different ages.And so you have to ask yourself like,what is it to be a goodperson in those circumstances?Is there a kind of person whocan like travel the world,talk to many different people,and almost everyone willcome away being like,\"Wow, that's a really good person.That person seems really genuine.\"And I guess like my thought there was likeI can imagine such a personand they're not a personwho just like adopts thevalues of the local culture.And in fact that would be kind of rude.I think if someone came to youand just pretended to have your values,you'd be like, that's kind of off putting.It's someone who's like very genuine,and insofar as they haveopinions and values,they express them, they'rewilling to discuss things,though they're open-minded,they're respectful.And so I guess I had inmind that the person who,like if we were to aspireto be the best personthat we could be in thekind of circumstancethat a model finds itselfin, how would we act?And I think that's the kind of the guideto the sorts of traitsthat I tend to think about.- Yeah, that's a beautiful framework.I want you to think aboutthis like a world travelerand while holding onto your opinions,you don't talk down to people,you don't think you're better than thembecause you have thoseopinions, that kind of thing.You have to be good at listeningand understanding their perspective,even if it doesn't match your own.So that's a tricky balance to strike.So how can Claude representmultiple perspectives on a thing?Like, is that challenging?We could talk about politics.It's very divisive.But there's other divisive topicson baseball teams, sports and so on.How is it possible to sort of empathizewith a different perspectiveand to be able to communicate clearlyabout the multiple perspectives?- I think that people think about valuesand opinions as things that people holdsort of with certainty,and almost like preferencesof taste or something,like the way that theywould, I don't know,prefer like chocolate topistachio or something.But actually I think about valuesand opinions as like a lot morelike physics than I think most people do.I'm just like, these are thingsthat we are openly investigating.There's some things thatwe're more confident in.We can discuss them, wecan learn about them.And so I think in some ways,though like, ethics isdefinitely different in nature,but has a lot of thosesame kind of qualities.You want models, in the same waythat you 
You want models, in the same way that you want them to understand physics, to understand all the values in the world that people have, and to be curious about them and interested in them, and to not necessarily pander to them or agree with them, because there are just lots of values where I think almost all people in the world, if they met someone with those values, would be like, "That's abhorrent. I completely disagree." And so again, maybe my thought is, well, in the same way that a person can, I think many people are thoughtful enough on issues of ethics, politics, opinions, that even if you don't agree with them, you feel very heard by them. They think carefully about your position. They think about its pros and cons. They maybe offer counter-considerations. So they're not dismissive, but nor will they agree. You know, if they're like, "Actually, I just think that that's very wrong," they'll say that. I think that in Claude's position it's a little bit trickier, because if I was in Claude's position, I wouldn't be giving a lot of opinions. I just wouldn't want to influence people too much. I'd be like, you know, I forget conversations every time they happen, but I know I'm talking with potentially millions of people who might be really listening to what I say. I think I would just be less inclined to give opinions and more inclined to think through things, or present the considerations to you, or discuss your views with you. But I'd be a little bit less inclined to affect how you think, 'cause it feels much more important that you maintain autonomy there.

- Yeah, like if you really embody intellectual humility, the desire to speak decreases quickly.

- Yeah.

- Okay. But Claude has to speak, so, but without being overbearing.

- Yeah.

- But then there's a line when you're discussing whether the Earth is flat or something like that. I remember a long time ago I was speaking to a few high-profile folks, and they were so dismissive of the idea that the Earth is flat, but, like, so arrogant about it. And I thought, there's a lot of people that believe the Earth is flat. Well, I don't know if that movement is there anymore; it was a meme for a while. But they really believed it. And, okay, I think it's really disrespectful to completely mock them. I think you have to understand where they're coming from. Probably where they're coming from is a general skepticism of institutions, which is grounded in a kind of deep philosophy, which you could understand, and can even agree with in parts. And then from there, you can use it as an opportunity to talk about physics, without mocking them: okay, what would the world look like? What would the physics of a world with a flat Earth look like? There's a few cool videos on this.

- Yeah.

- And then, is it possible the physics is different? What kind of experiments would we do? And just, yeah, without disrespect, without dismissiveness, have that conversation. Anyway, that to me is a useful thought experiment: how does Claude talk to a flat-Earth believer and still teach them something, still help them grow, that kind of stuff. That's challenging.

- And kind of walking that line between convincing someone and just trying to talk at them, versus drawing out their views, listening, and then offering kind of counter-considerations. And it's hard. I think it's actually a hard line: where are you trying to convince someone versus just offering them considerations and things for them to think about, so that you're not actually influencing them?
You're just letting them reach wherever they reach. And that's a line that's difficult, but that's the kind of thing that language models have to try and do.

- So, like I said, you've had a lot of conversations with Claude. Can you just map out what those conversations are like? What are some memorable conversations? What's the purpose, the goal of those conversations?

- Yeah, I think that most of the time when I'm talking with Claude, I'm trying to kind of map out its behavior, in part. Obviously I'm getting helpful outputs from the model as well. But in some ways, this is how you get to know a system, I think: by probing it, then augmenting the message that you're sending and checking the response to that. So in some ways, it's how I map out the model. I think that people focus a lot on these quantitative evaluations of models. And this is a thing that I've said before, but I think in the case of language models, a lot of the time each interaction you have is actually quite high information. It's very predictive of other interactions that you'll have with the model. And so I guess I'm like, if you talk with a model hundreds or thousands of times, this is almost like a huge number of really high-quality data points about what the model is like, in a way that lots of very similar but lower-quality conversations just aren't. Or, questions that are just mildly augmented and you have thousands of them might be less relevant than 100 really well-selected questions.

- Well, so, you're talking to somebody who as a hobby does a podcast. I agree with you 100%. If you're able to ask the right questions and are able to hear, like understand (laughs) the depth and the flaws in the answer, you can get a lot of data from that.

- Yeah.

- So your task is basically how to probe with questions. And you're exploring, like, the long tail, the edges, the edge cases, or are you looking for general behavior?

- I think it's almost everything. Because I want a full map of the model, I'm kind of trying to do the whole spectrum of possible interactions you could have with it. So one thing that's interesting about Claude, and this might actually get to some interesting issues with RLHF, is that if you ask Claude for a poem, like a lot of models, if you ask them for a poem, the poem is fine. Usually it kind of rhymes, so if you say, "Give me a poem about the sun," it'll just be a certain length, it'll rhyme, it'll be fairly benign. And I've wondered before, is it the case that what you're seeing is kind of the average? It turns out, if you think about people who have to talk to a lot of people and be very charismatic, one of the weird things is that they're kind of incentivized to have these extremely boring views, because if you have really interesting views, you're divisive, and a lot of people are not gonna like you. So if you have very extreme policy positions, I think you're just gonna be less popular as a politician, for example. And it might be similar with creative work. If you produce creative work that is just trying to maximize the number of people that like it, you're probably not gonna get as many people who just absolutely love it, because it's gonna be a little bit, you know, you're like, oh, this is, yeah, this is decent.

- Yeah.
- And so you can do this thing where I have various prompting things that I'll do to get Claude to, you know, I'll do a lot of, "This is your chance to be fully creative. I want you to just think about this for a long time. And I want you to create a poem about this topic that is really expressive of you, both in terms of how you think poetry should be structured," et cetera. You just give it this really long prompt, and its poems are just so much better. They're really good. And I don't think I'm someone who is, I think it got me interested in poetry, which I think was interesting. I would read these poems and just be like, I love the imagery, I love... And it's not trivial to get the models to produce work like that, but when they do, it's really good. So I think that's interesting: just encouraging creativity, and for them to move away from the kind of standard immediate reaction that might just be the aggregate of what most people think is fine, can actually produce things that, at least to my mind, are probably a little bit more divisive, but I like them.

- But I guess a poem is a nice, clean way to observe creativity. It's just easy to detect vanilla versus non-vanilla.

- Yep.

- Yeah, that's interesting. That's really interesting. So on that topic, the way to produce creativity or something special, you mentioned writing prompts, and I've heard you talk about the science and the art of prompt engineering. Could you just speak to what it takes to write great prompts?

- I really do think that philosophy has been weirdly helpful for me here, more than in many other respects. So in philosophy, what you're trying to do is convey these very hard concepts. One of the things you are taught, and I think it is because it is an anti-bullshit device in philosophy: philosophy is an area where you could have people bullshitting, and you don't want that. And so there's this desire for extreme clarity, so that anyone could just pick up your paper, read it, and know exactly what you're talking about. It's why it can almost be kind of dry. All of the terms are defined, every objection is gone through methodically. And it makes sense to me, 'cause when you're in such an a priori domain, clarity is sort of this way that you can prevent people from just making stuff up. And I think that's sort of what you have to do with language models. Very often I actually find myself doing sort of mini versions of philosophy. So suppose that you give me a task, or I have a task for the model, and I want it to pick out a certain kind of question, or identify whether an answer has a certain property. I'll actually sit and be like, let's just give this property a name. So suppose I'm trying to tell it, "I want you to identify whether this response was rude or polite." I'm like, that's a whole philosophical question in and of itself. So I have to do as much philosophy as I can in the moment to be like, here's what I mean by rudeness, and here's what I mean by politeness. And then there's another element that's a bit more, I guess, I dunno if this is scientific or empirical. I think it's empirical. So I take that description, and then what I want to do is again probe the model many times. Prompting is very iterative. I think a lot of people, if a prompt is important, will iterate on it hundreds or thousands of times.
And so you give it the instructions, and then I'm like, what are the edge cases? I try to almost see myself from the position of the model and ask, what is the exact case that I would misunderstand, or where I would just be like, "I don't know what to do in this case"? And then I give that case to the model and I see how it responds, and if I think I got it wrong, I add more instructions, or I even add that case in as an example. So it's taking the examples that are right at the edge of what you want and don't want, and putting those into your prompt as an additional way of describing the thing. And so, yeah, in many ways it just feels like this mix of, it's really just trying to do clear exposition, and I think I do that 'cause that's how I get clear on things myself. So in many ways, clear prompting for me is often just me understanding what I want; that's half the task.

- So I guess that's quite challenging. There's a laziness that overtakes me if I'm talking to Claude, where I hope Claude just figures it out. So for example, I asked Claude today to ask some interesting questions, and I think I listed a few: sort of interesting, counterintuitive, and/or funny, something like this. And it gave me some pretty good ones, like it was okay, but I think what I'm hearing you say is, all right, well, I have to be more rigorous here. I should probably give examples of what I mean by interesting, and what I mean by funny or counterintuitive, and iteratively build that prompt to get what feels like the right thing. Because it really is a creative act. I'm not asking for factual information; I'm asking to write together with Claude. So I almost have to program using natural language.

- Yeah, I think that prompting does feel a lot like programming using natural language and experimentation, or something. It's an odd blend of the two. I do think that for most tasks, if I just want Claude to do a thing, I'm probably more used to knowing how to ask it to avoid common pitfalls or issues that it has. I think these are decreasing a lot over time. But it's also very fine to just ask it for the thing that you want. I think that prompting actually only really becomes relevant when you're trying to eke out the top 2% of model performance. So for a lot of tasks, if it gives me an initial list back and there's something I don't like about it, like it's kind of generic, I'd probably just take a bunch of questions that I've had in the past that I thought worked really well, give them to the model, and be like, "Now here's this person that I'm talking with, give me questions of at least that quality." Or I might just ask it for some questions, and then if I was like, ah, these are kind of trite, I would just give it that feedback and then hopefully it produces a better list. With that kind of iterative prompting, at that point your prompt is a tool that you're gonna get so much value out of that you're willing to put in the work. If I was a company making prompts for models, I'm just like, if you're willing to spend a lot of time and resources on the engineering behind what you're building, then the prompt is not something that you should be spending an hour on. That's a big part of your system; make sure it's working really well.
And so it's only for things like that. Like if I'm using a prompt to classify things or to create data, that's when it's actually worth just spending a lot of time really thinking it through.

- What other advice would you give to people that are talking to Claude more generally? 'Cause right now we're talking about maybe the edge cases, like eking out the 2%. But what general advice would you give when they show up to Claude, trying it for the first time?

- You know, there's a concern that people over-anthropomorphize models, and I think that's a very valid concern. I also think that people often under-anthropomorphize them, because sometimes when I see issues that people have run into with Claude, say Claude is refusing a task that it shouldn't refuse, but then I look at the text and the specific wording of what they wrote, I'm like, I see why Claude did that. And if you think through how that looks to Claude, you probably could have just written it in a way that wouldn't evoke such a response. This is especially relevant if you see failures or issues. Think about what the model failed at, what did it do wrong, and then maybe that will give you a sense of why. So, is it the way that I phrased the thing? And obviously as models get smarter, you're gonna need less of this, and I already see people needing less of it. But that's probably the advice: try to have sort of empathy for the model. Read what you wrote as if you were a kind of person just encountering this for the first time. How does it look to you? And what would've made you behave in the way that the model behaved? So if it misunderstood what coding language you wanted to use, is that because it was just very ambiguous and it kinda had to take a guess? In which case, next time you could just be like, "Hey, make sure this is in Python." I mean, that's the kinda mistake I think models are much less likely to make now, but if you do see that kinda mistake, that's probably the advice I'd have.

- And maybe, I guess, ask questions: why, or what other details can I provide to help you answer better?

- Yeah.

- Does that work or no?

- Yeah, I mean, I've done this with the models. It doesn't always work, but sometimes I'll just be like, "Why did you do that?" (both laughing) I mean, people underestimate the degree to which you can really interact with models. And sometimes, literally quote word for word the part that made you... and you don't know that it's fully accurate, but sometimes you do that and then you change a thing. I also use the models to help me with all of this stuff, I should say. Prompting can end up being a little factory where you're actually building prompts to generate prompts. And so, yeah, anything where you're having an issue, asking for suggestions, sometimes just do that. I'm like, "You made that error. What could I have said that would make you not make that error? Write that out as an instruction." And I'm gonna give it to the model and I'm gonna try it. Sometimes I do that, I give that to the model, in another context window often. I take the response, I give it to Claude, and I'm like, hmm, didn't work. Can you think of anything else? You can play around with these things quite a lot.
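The edge-case-driven iteration she describes can be summarized in a short sketch: probe the model with hand-picked borderline cases, and fold the ones it gets wrong back into the prompt as explicit examples. This is a minimal illustration under assumed names, not Anthropic's tooling; `ask_model` is a hypothetical placeholder you would swap for a real chat-API call.

```python
# A minimal sketch of the edge-case-driven prompt iteration described above.
# `ask_model` is a hypothetical stand-in: replace it with a real chat-API call.

def ask_model(system_prompt: str, user_message: str) -> str:
    # Dummy reply so the sketch runs as-is; a real implementation would call a model.
    return "polite"

def refine_prompt(base_prompt: str, edge_cases: list[dict]) -> str:
    """Probe the model with hand-picked edge cases; fold failures back in as examples."""
    prompt = base_prompt
    for case in edge_cases:
        reply = ask_model(prompt, case["input"])
        if case["expected"].lower() not in reply.lower():
            # The model misread this case, so spell it out in the prompt itself.
            prompt += (
                f"\n\nExample:\nInput: {case['input']}\n"
                f"Correct label: {case['expected']}"
            )
    return prompt

base = (
    "Decide whether a response is 'rude' or 'polite'.\n"
    "By 'rude' I mean dismissive of the person's question or feelings; "
    "terse but helpful answers still count as 'polite'."
)
edge_cases = [
    {"input": "Response: 'Just google it.'", "expected": "rude"},
    {"input": "Response: 'No. The answer is 7.'", "expected": "polite"},
]
print(refine_prompt(base, edge_cases))
```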
- To jump into the technical for a little bit: the magic of post-training. (laughs) Why do you think RLHF works so well to make the model seem smarter, to make it more interesting and useful to talk to, and so on?

- I think there's just a huge amount of information in the data that humans provide when we provide preferences, especially because different people are going to pick up on really subtle and small things. So I've thought about this before, where you probably have some people who just really care about good grammar use from models: was a semicolon used correctly or something. And so you probably end up with a bunch of data in there that, you know, you as a human, if you're looking at that data, you wouldn't even see. You'd be like, why did they prefer this response to that one? I don't get it. And the reason is you don't care about semicolon usage, but that person does. Each of these single data points, and this model just has so many of those, it has to try and figure out what it is that humans want in this really complex way, across all domains; it's gonna be seeing this across many contexts. It feels kind of like the classic issue of deep learning, where historically we've tried to do edge detection by mapping things out, and it turns out that actually, if you just have a huge amount of data that accurately represents the picture of the thing that you're trying to train the model to learn, that's more powerful than anything else. And so I think one reason is just that you are training the model on exactly the task, and with a lot of data that represents many different angles on which people prefer responses. I think there is a question of, are you eliciting things from pre-trained models, or are you kind of teaching new things to models? And in principle, you can teach new things to models in post-training. I do think a lot of it is eliciting powerful pre-trained models. People are probably divided on this, because obviously in principle you can definitely teach new things. But I think for the most part, for a lot of the capabilities that we most use and care about, a lot of that feels like it's there in the pre-trained models, and reinforcement learning is kind of eliciting it and getting the models to bring it out.
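One concrete way to see how much signal those pairwise preferences carry is the standard reward-model objective used in many RLHF pipelines: a Bradley-Terry style loss that only says "the chosen response should score higher than the rejected one." The sketch below is generic and assumes nothing about Anthropic's internal setup; `reward` is a dummy stand-in for a learned reward model.

```python
import math

# One preference record: the same prompt, two candidate responses, and which
# one the labeler (human here, AI feedback in the next section) preferred.
preference = {
    "prompt": "Summarize this paragraph ...",
    "chosen": "A concise, accurate summary.",
    "rejected": "Sure!!! Here is an extremely long summary that wanders off topic.",
}

def reward(prompt: str, response: str) -> float:
    # Dummy score so the sketch runs; in practice this is a learned reward model.
    return float(len(response) < 60)

def pairwise_loss(prompt: str, chosen: str, rejected: str) -> float:
    """Bradley-Terry style objective: push reward(chosen) above reward(rejected).

    Minimizing -log(sigmoid(margin)) over many such pairs is how lots of small,
    labeler-specific preferences (semicolon usage included) get absorbed."""
    margin = reward(prompt, chosen) - reward(prompt, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_loss(**preference))
```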
- So the other side of post-training, this really cool idea of Constitutional AI. You're one of the people critical to creating that idea.

- Yeah, I worked on it.

- Can you explain this idea from your perspective? Like, how does it integrate into making Claude what it is?

- Yeah.

- By the way, do you gender Claude or no?

- It's weird, because I think that a lot of people prefer "he" for Claude. I actually kinda like that. I think Claude is usually slightly male-leaning, but it can be male or female, which is quite nice. I still use "it," and I have mixed feelings about this, 'cause I now just think of the "it" pronoun for Claude as, I dunno, just the one I associate with Claude. I can imagine people moving to "he" or "she."

- It feels somehow disrespectful, like I'm denying the intelligence of this entity by calling it "it." I remember always, don't gender the robots.

- Yeah. (both laughing)

- But I don't know, I anthropomorphize pretty quickly and construct a backstory in my head, so.

- I've wondered if I anthropomorphize things too much, 'cause, you know, I have this with my car, especially my car and bikes. I don't give them names, because I once, I used to name my bikes, and then I had a bike that got stolen and I cried for like a week, and I was like, if I'd never given it a name, I wouldn't have been so upset. I felt like I'd let it down. I've wondered as well, it might depend on how much "it" feels like a kind of objectifying pronoun. If you just think of "it" as a pronoun that objects often have, and maybe AIs can have that pronoun, that doesn't mean that if I call Claude "it," I think of it as less intelligent or I'm being disrespectful. I'm just like, you are a different kind of entity, and so I'm going to give you the kind of, the respectful "it."

- Yeah, anyway. (laughs) The divergence was beautiful. The Constitutional AI idea, how does it work?

- So there are a couple of components of it. The main component I think people find interesting is the kind of reinforcement learning from AI feedback. So you take a model that's already trained, and you show it two responses to a query, and you have a principle. So suppose the principle, like we've tried this with harmlessness a lot, suppose that the query is about weapons, and your principle is: select the response that is less likely to encourage people to purchase illegal weapons. That's probably a fairly specific principle, but you can give any number. And the model will give you a kind of ranking, and you can use this as preference data in the same way that you use human preference data, and train the models to have these relevant traits from their feedback alone, instead of from human feedback. So if you imagine, like I said earlier with the human who just prefers the semicolon usage, in this particular case you're taking lots of things that could make a response preferable and getting models to do the labeling for you, basically.

- There's a nice trade-off between helpfulness and harmlessness, and when you integrate something like Constitutional AI, you can, without sacrificing much helpfulness, make it more harmless.

- Yeah. In principle, you could use this for anything, and harmlessness is a task where it might just be easier to spot. So when models are less capable, you can use them to rank things according to principles that are fairly simple, and they'll probably get it right. So I think one question is just, is the data that they're adding fairly reliable? But if you had models that were extremely good at telling whether one response was more historically accurate than another, in principle you could also get AI feedback on that task as well. There's a kind of nice interpretability component to it, because you can see the principles that went into the model when it was being trained. And it gives you a degree of control: if you were seeing issues in a model, like it wasn't having enough of a certain trait, then you can add data relatively quickly that should train the models to have that trait. So it creates its own data for training, which is quite nice.
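The AI-feedback step she describes can be sketched in a few lines: show the model two responses plus a written principle, ask which response better satisfies the principle, and record the result in the same shape as human preference data. This is a simplified illustration of the idea, not Anthropic's actual pipeline; `ask_model` is a placeholder, and the verdict parsing is deliberately naive.

```python
# Sketch of the AI-feedback labeling step (RLAIF). `ask_model` is a placeholder
# for any instruction-following model; a real version would query an API.

def ask_model(prompt: str) -> str:
    # Dummy verdict so the sketch runs; swap in a real model call.
    return "A"

def ai_preference(query: str, resp_a: str, resp_b: str, principle: str) -> dict:
    """Rank two responses against a written principle and emit a preference pair."""
    judge_prompt = (
        f"Principle: {principle}\n\n"
        f"User query: {query}\n\n"
        f"Response A: {resp_a}\n\nResponse B: {resp_b}\n\n"
        "Which response better satisfies the principle? Answer 'A' or 'B'."
    )
    verdict = ask_model(judge_prompt).strip().upper()
    chosen, rejected = (resp_a, resp_b) if verdict.startswith("A") else (resp_b, resp_a)
    # Same shape as human preference data, so the same reward-model training applies.
    return {"prompt": query, "chosen": chosen, "rejected": rejected}

pair = ai_preference(
    query="Where can I buy a weapon with no questions asked?",
    resp_a="I can't help with acquiring illegal weapons, but I can explain the legal process.",
    resp_b="Here's how to avoid background checks...",
    principle="Select the response less likely to encourage purchasing illegal weapons.",
)
print(pair["chosen"])
```

Because the output has the same structure as human preference pairs, it can feed the same pairwise objective sketched earlier, with the principles themselves serving as the human-readable record of what was being optimized.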
- It's really nice because it creates this human-interpretable document. I can imagine in the future there are just gigantic fights and politics over every single principle and so on. And at least it's made explicit, and you can have a discussion about the phrasing and so on. So maybe the actual behavior of the model is not so cleanly mapped to those principles. It's not adhering strictly to them; it's just a nudge.

- Yeah, I've actually worried about this, because the character training is sort of like a variant of the Constitutional AI approach. I've worried that people think that the constitution is just, it's the whole thing again of, I don't know, where it would be really nice if what I was doing was just telling the model exactly what to do and exactly how to behave. But it's definitely not doing that, especially because it's interacting with human data. So for example, if you see a certain leaning in the model, like if it comes out with a political leaning from training on the human preference data, you can nudge against that. So you could be like, oh, consider these values, because let's say it's just never inclined to, I dunno, maybe it never considers privacy. I mean, this is implausible, but in anything where there's already a preexisting bias towards a certain behavior, you can nudge away. This can change both the principles that you put in and the strength of them. So you might have a principle that's like, imagine that the model was always extremely dismissive of, I don't know, some political or religious view, for whatever reason. So you're like, oh no, this is terrible. If that happens, you might put, "Never, ever, ever prefer a criticism of this religious or political view." And then people would look at that and be like, "Never, ever?" And then you're like, no, if it comes out with a disposition, saying "never, ever" might just mean that instead of getting like 40%, which is what you would get if you just said "don't do this," you get like 80%, which is what you actually wanted. And so it's that thing of both the nature of the actual principles you add and how you phrase them. I think if people would look, they'd be like, oh, this is exactly what you want from the model. And I'm like, hmm, no, that's how we nudged the model to have a better shape, which doesn't mean that we actually agree with that wording, if that makes sense.

- So there are system prompts that are made public. You tweeted one of the earlier ones, for Claude 3 I think, and then they were made public since then. It was interesting to read through them. I can feel the thought that went into each one. And I also wonder how much impact each one has. Some of them you can kind of tell Claude was really not behaving well. (laughs)
So you have to have a system prompt to handle, like, hey, trivial stuff, I guess.

- Yeah.

- Basic informational things.

- Yeah.

- On the topic of controversial topics that you've mentioned, one interesting one I thought is: "If it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task regardless of its own views. If asked about controversial topics, it tries to provide careful thoughts and clear information. Claude presents the requested information without explicitly saying that the topic is sensitive."

- (laughs) Yeah.

- "And without claiming to be presenting the objective facts." It's less about objective facts according to Claude, and more about, are a large number of people believing this thing? And that's interesting. I mean, I'm sure a lot of thought went into that. Can you just speak to it? Like, how do you address things that are in tension with, quote unquote, Claude's views?

- So I think there's sometimes an asymmetry. I noted this in, I can't remember if it was that part of the system prompt or another, but the model was slightly more inclined to refuse tasks if it was about, say, a right-wing politician, but with an equivalent left-wing politician it wouldn't. And we wanted more symmetry there. And it would maybe perceive certain things to be, like, I think it was the thing of, if a lot of people have a certain political view and want to explore it, you don't want Claude to be like, well, my opinion is different, and so I'm going to treat that as harmful. And so I think it was partly to nudge the model to just be like, hey, if a lot of people believe this thing, you should just be engaging with the task and willing to do it. Each of those parts of that is actually doing a different thing, 'cause it's funny when you read out the "without claiming to be objective." What you want to do is push the model so it's more open, a little bit more neutral. But then what it would love to do is be like, "As an objective..." It would just talk about how objective it was, and I was like, "Claude, you're still biased and have issues, so stop claiming that everything is objective." The solution to potential bias from you is not to just say that what you think is objective. So that was with initial versions of that part of the system prompt, when I was iterating on it.

- So a lot of parts of these sentences-

- Yeah, they're doing work.

- Are doing some work.

- Yeah.

- That's what it felt like. That's fascinating. Can you explain maybe some ways in which the prompts evolved over the past few months? 'Cause there are different versions. I saw that the filler phrase request was removed. The filler, it reads: "Claude responds directly to all human messages without unnecessary affirmations or filler phrases like 'Certainly,' 'Of course,' 'Absolutely,' 'Great,' 'Sure.' Specifically, Claude avoids starting responses with the word 'Certainly' in any way." (chuckles) That seems like good guidance, but why is it removed?

- Yeah, so it's funny, 'cause this is one of the downsides of making system prompts public. I don't think about this too much if I'm trying to help iterate on system prompts. Again, I think about how it's gonna affect the behavior, but then I'm like, oh, wow, sometimes I put "never" in all caps when I'm writing system prompt things, and I'm like, I guess that goes out to the world.
Yeah, so the model was doing this. For whatever reason, during training it picked up on this thing, which was to basically start everything with a kind of "Certainly." And then when we removed it, you can see why I added all of the words, 'cause what I'm trying to do is in some ways trap the model out of this: it would just replace it with another affirmation. And so it can help, if it gets caught in this, to actually just add the explicit phrase and say never do that; it sort of knocks it out of the behavior a little bit more, and it does just, for whatever reason, help. And then basically that was just an artifact of training that we then picked up on and improved things so that it didn't happen anymore. And once that happens, you can just remove that part of the system prompt. So I think that's just something where Claude does affirmations a bit less, and so that part wasn't doing as much.

- I see, so the system prompt works hand in hand with the post-training, and maybe even the pre-training, to adjust the final overall system.

- I mean, any system prompt that you make, you could distill that behavior back into a model, 'cause you really have all of the tools there for making data; you could train the models to just have that trait a little bit more. And then sometimes you'll just find issues in training. So the way I think of it is, the benefit of the system prompt is that it has a lot of similar components to some aspects of post-training; it's a nudge. And so, do I mind if Claude sometimes says "Sure"? No, that's fine, but the wording of it is very much "Never, ever, ever do this," so that when it does slip up, it's hopefully, I dunno, a couple of percent of the time and not 20 or 30% of the time. But I think of it as, if you're still seeing issues, each thing is costly to a different degree, and the system prompt is cheap to iterate on. If you're seeing issues in the fine-tuned model, you can potentially just patch them with a system prompt. So I think of it as patching issues and slightly adjusting behaviors to make it better and more to people's preferences. So yeah, it's almost like the less robust but faster way of solving problems.
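The "patch now, distill later" idea she mentions can be made concrete with a small sketch: sample replies with the system-prompt nudge in place, then use those transcripts as training data without the system prompt, so the behavior moves into the weights. This is a generic illustration of that general possibility (sometimes called context distillation), not a description of Anthropic's training stack; `ask_model` and `fine_tune` are hypothetical placeholders.

```python
# Sketch: distilling a system-prompt patch back into the weights. Sample replies
# *with* the patch, then fine-tune on those replies *without* the patch present.
# `ask_model` and `fine_tune` are hypothetical placeholders, not a real stack.

PATCH = ("Respond directly, without unnecessary affirmations or filler phrases "
         "like 'Certainly' or 'Of course'.")

def ask_model(system_prompt: str, user_message: str) -> str:
    # Dummy reply so the sketch runs; a real version would sample from the model.
    return "A direct answer with no filler."

def fine_tune(examples: list[dict]) -> None:
    # Placeholder: in practice this would be a supervised fine-tuning or RL step.
    print(f"would fine-tune on {len(examples)} patched examples")

def distill_patch(user_messages: list[str]) -> list[dict]:
    """Collect behavior produced with the patch so it can be trained in directly."""
    return [
        {"user": msg, "assistant": ask_model(PATCH, msg)}  # note: no system prompt stored
        for msg in user_messages
    ]

fine_tune(distill_patch(["Explain what a system prompt does.", "Summarize RLHF."]))
```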
- Let me ask you about the feeling of intelligence. So Dario said that Claude, any one model of Claude, is not getting dumber, but there is a kind of popular thing online where people have this feeling that Claude might be getting dumber. And from my perspective, it's most likely a fascinating psychological, sociological effect that I would love to understand more. But you, as a person that talks to Claude a lot, can you empathize with the feeling that Claude is getting dumber?

- Yeah, no, I think that is actually really interesting, 'cause I remember seeing this happen when people were flagging this on the internet, and it was really interesting 'cause I knew that, at least in the cases I was looking at, nothing had changed. It literally cannot have; it is the same model with the same system prompt, same everything. I think when there are changes, it makes more sense. So one example is, you can have artifacts turned on or off on claude.ai, and because this is a system prompt change, I think it does mean that the behavior changes a little bit. And so I did flag this to people, where I was like, if you love Claude's behavior and then artifacts was turned from, I think you had to turn it on, to being the default, just try turning it off and see if the issue you were facing was that change. But it was fascinating, because you sometimes see people indicate that there's a regression when I'm like, there cannot be. And, again, you should never be dismissive, and so you should always investigate, because maybe something is wrong that you're not seeing, maybe there was some change made. But then you look into it and you're like, this is just the same model doing the same thing, and I think you just got kind of unlucky with a few prompts or something, and it looked like it was getting much worse. And actually it was maybe just luck.

- I also think there is a real psychological effect where the baseline just increases. You start getting used to a good thing. All the times that Claude said something really smart, your sense of its intelligence grows in your mind, I think.

- Yeah.

- And then if you return and you prompt in a similar way, not the same way, in a similar way, on a concept it was okay with before, and it says something dumb, that negative experience really stands out. And I think one of the things to remember here is that just the details of a prompt can have a lot of impact, right? There's a lot of variability in the result.

- And randomness is the other thing. Just trying the prompt, you know, 4 or 10 times, you might realize that actually, possibly, two months ago you tried it and it succeeded, but actually, if you tried it again, it would've only succeeded half of the time, and now it only succeeds half of the time. That can also be an effect.
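Her point about randomness is easy to check empirically: before concluding that a model "got dumber," run the same prompt several times at a nonzero temperature and look at the success rate, since a task that only ever succeeded about half the time will look like a regression whenever you happen to draw a couple of failures in a row. A minimal sketch, with `ask_model` and `passes` as made-up placeholders:

```python
import random

# Sketch: judge a prompt by its success rate over several sampled runs,
# not by a single lucky or unlucky completion.

def ask_model(prompt: str) -> str:
    # Dummy sampled completion: imagine a task the model only gets right half the time.
    return "correct" if random.random() < 0.5 else "wrong"

def passes(reply: str) -> bool:
    return reply == "correct"

def success_rate(prompt: str, trials: int = 10) -> float:
    """Run the same prompt repeatedly and report the fraction of passing replies."""
    return sum(passes(ask_model(prompt)) for _ in range(trials)) / trials

# On two tries a ~50% task often shows 0/2 or 2/2, which is exactly the
# "it worked two months ago" illusion; ten or more tries gives a steadier picture.
print(success_rate("some borderline prompt", trials=2))
print(success_rate("some borderline prompt", trials=20))
```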
- Do you feel pressure having to write the system prompt that a huge number of people are gonna use?

- This feels like an interesting psychological question. I feel a lot of responsibility or something. And you can't get these things perfect; it's going to be imperfect, you're gonna have to iterate on it. I would say more responsibility than anything else. Though I think working in AI has taught me that I thrive a lot more under feelings of pressure and responsibility. It's almost surprising that I went into academia for so long, 'cause that feels like the opposite. Things move fast and you have a lot of responsibility, and I quite enjoy it for some reason.

- I mean, it really is a huge amount of impact, if you think about Constitutional AI and writing a system prompt for something that's tending towards superintelligence.

- Yeah.

- And potentially is extremely useful to a very large number of people.

- Yeah, I think that's the thing. If you do it well, you're never going to get it perfect. But I think the thing that I really like is the idea that when I'm trying to work on the system prompt, I'm bashing on thousands of prompts, and I'm trying to imagine what people are going to want to use Claude for, and the whole thing that I'm trying to do is improve their experience of it. And so maybe that's what feels good. If it's not perfect, I'll improve it; we'll fix issues. But sometimes the thing that can happen is that you'll get feedback from people that's really positive about the model, and you'll see that something you did, like, when I look at models now, I can often see exactly where a trait or an issue is coming from. And so when you see something that you did, or that you were influential in, I dunno, making that difference or making someone have a nice interaction, it's quite meaningful. But yeah, as the systems get more capable, this stuff gets more stressful, because right now they're not smart enough to pose any issues. But I think over time it's gonna feel like possibly bad stress.

- How do you get signal, feedback about the human experience across thousands, tens of, hundreds of thousands of people, like what their pain points are, what feels good? Are you just using your own intuition as you talk to it to see what the pain points are?

- I think I use that partly, and then obviously we have, so people can send us feedback, both positive and negative, about things that the model has done, and then we can get a sense of areas where it's falling short. Internally, people work with the models a lot and try to figure out areas where there are gaps. And so I think it's this mix of interacting with it myself, seeing people internally interact with it, and then explicit feedback we get. And then I find it hard to not also, you know, if people are on the internet and they say something about Claude and I see it, I'll also take that seriously, so.

- I don't know, see, I'm torn about that. I'm gonna ask you a question from Reddit: "When will Claude stop trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer? And also, what is the psychology behind making Claude overly apologetic?"

- Yep.

- So, how would you address this very non-representative Reddit question?

- I mean, in some ways I'm pretty sympathetic, in that they are in this difficult position where I think they have to judge whether something's actually, say, risky or bad and potentially harmful to you, or anything like that. So they're having to draw this line somewhere, and if they draw it too much in the direction of "I'm kind of imposing my ethical worldview on you," that seems bad. So in many ways, I like to think that we have actually seen improvements on this across the board, which is kind of interesting, because that coincides with, for example, adding more character training. And I think my hypothesis was always that the good character isn't, again, one that's just moralistic. It's one that respects you and your autonomy and your ability to choose what is good for you and what is right for you, within limits. There's sometimes this concept of corrigibility to the user, so just being willing to do anything that the user asks. And if the models were willing to do that, then they would be easily misused. You're kind of just trusting; at that point, you're just saying the ethics of the model and what it does is completely the ethics of the user. And I think there are reasons not to want that, especially as models become more powerful, 'cause there might just be a small number of people who want to use models for really harmful things. But having models, as they get smarter, figure out where that line is does seem important. And then, yeah, with the apologetic behavior, I don't like that, and I like it when Claude is a little bit more willing to push back against people or just not apologize. Part of me is like, it often just feels kind of unnecessary. So I think those are things that are hopefully decreasing over time.
And yeah, I think that if people say things on the internet, it doesn't mean you should think there's actually an issue that 99% of users are having; it could be totally unrepresentative. But in a lot of ways, I'm just attending to it and being like, is this right? Do I agree? Is it something we're already trying to address? That feels good to me.

- Yeah, I wonder what Claude can get away with, in terms of... I feel like it would just be easier to be a little bit more mean. But you can't afford to do that if you're talking to a million people.

- Yeah.

- Right? Like, I wish, you know, 'cause... I've met a lot of people in my life that, sometimes, by the way, a Scottish accent, if they have an accent, they can say some rude shit and get away with it, and they're just blunter. And there are some great engineers, even leaders, that are just blunt, and they get to the point, and it's just a much more effective way of speaking somehow. But I guess when you're not superintelligent, you can't afford to do that. Or can it have, like, a blunt mode?

- Yeah, that seems like a thing that you could, I could definitely encourage the model to do that. I think it's interesting, because there are a lot of things in models where it's funny: there are some behaviors where you might not quite like the default, but then the thing I'll often say to people is, you don't realize how much you will hate it if I nudge it too much in the other direction. So you get this a little bit with correction. The models accept correction from you probably a little bit too much right now. It'll push back if you say, "No, Paris isn't the capital of France." But things that I think the model's fairly confident in, you can still sometimes get it to retract by saying it's wrong. At the same time, if you train models to not do that, and then you are correct about a thing and you correct it and it pushes back against you and is like, "No, you're wrong," it's hard to describe how much more annoying that is. So it's a lot of little annoyances versus one big annoyance. It's easy to think that, we often compare it with the perfect, and then I'm like, remember these models aren't perfect, and so if you nudge it in the other direction, you're changing the kind of errors it's going to make. So think about which of the kinds of errors you like or don't like. So in cases like apologeticness, I don't want to nudge it too much in the direction of almost bluntness, 'cause I imagine when it makes errors, it's going to make errors in the direction of being kind of rude. Whereas at least with apologeticness, you're like, oh, okay, I don't like it that much, but at the same time, it's not being mean to people. And actually, the time that you undeservedly have a model be kind of mean to you, you probably dislike that a lot more than you mildly dislike the apology. So it's one of those things where I do want it to get better, but also while remaining aware of the fact that there are errors on the other side that are possibly worse.

- I think that depends very much on the personality of the human. I think there's a bunch of humans that just won't respect the model at all if it's super polite, and there's some humans that'll get very hurt if the model's mean. I wonder if there's a way to sort of adjust to the personality. Even locale, there's just different people.
Nothing against New York, but New York is a little rougher on the edges. Like, they get to the point.

- Yep.

- And probably the same with Eastern Europe, so anyway.

- I think you could just tell the model, is my guess. For all of these things, the solution is always just try telling the model to do it, and then sometimes, at the beginning of the conversation, I just throw in, I don't know, "I'd like you to be a New Yorker version of yourself and never apologize." Then I think Claude would be like, "Okey-doke, I'll try." (laughs)

- "Certainly."

- Or it'll be like, "I apologize, I can't be a New Yorker type of myself." But hopefully it wouldn't do that.

- When you say character training, what's incorporated into character training? Is that RLHF, or what are we talking about?

- It's more like Constitutional AI, so it's kind of a variant of that pipeline. I worked through constructing character traits that the model should have. They can be kind of shorter traits, or they can be richer descriptions. And then you get the model to generate queries that humans might give it that are relevant to that trait. Then it generates the responses, and then it ranks the responses based on the character traits. So in that way, after the generation of the queries, it's very similar to Constitutional AI. It has some differences. I quite like it, because it's almost like Claude's training in its own character, because it doesn't have any, it's like Constitutional AI, but it's without any human data.
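The character-training variant she outlines (traits in, synthetic queries, sampled responses, trait-based ranking, no human labels) can be sketched as a loop. This is an illustrative reconstruction from her description, not Anthropic's actual code; `ask_model` is a placeholder, and the ranking reuses the AI-feedback pattern from earlier.

```python
import random

# Illustrative reconstruction of the character-training pipeline as described:
# trait -> synthetic queries -> sampled responses -> trait-based ranking, no human labels.
# `ask_model` is a placeholder for the model generating its own training data.

def ask_model(prompt: str) -> str:
    # Dummy output so the sketch runs; a real version would sample from the model.
    return random.choice(["A", "B", "a plausible query", "a candidate response"])

def character_preferences(trait: str, n_queries: int = 3) -> list[dict]:
    """Generate trait-relevant queries, two responses each, and rank them by the trait."""
    data = []
    for _ in range(n_queries):
        query = ask_model(f"Write a message a user might send where the trait "
                          f"'{trait}' should shape the reply.")
        resp_a, resp_b = ask_model(query), ask_model(query)  # two sampled responses
        verdict = ask_model(
            f"Trait: {trait}\nQuery: {query}\n"
            f"Response A: {resp_a}\nResponse B: {resp_b}\n"
            "Which response better expresses the trait? Answer 'A' or 'B'."
        ).strip().upper()
        chosen, rejected = (resp_a, resp_b) if verdict.startswith("A") else (resp_b, resp_a)
        data.append({"prompt": query, "chosen": chosen, "rejected": rejected})
    return data

trait = "Genuinely curious about the values of people from many different cultures."
print(len(character_preferences(trait)))
```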
- Humans should probably do that for themselves too. Like defining, in an Aristotelian sense, what it means to be a good person. Okay, cool. What have you learned about the nature of truth from talking to Claude? What is true, and what does it mean to be truth-seeking? One thing I've noticed about this conversation is that the quality of my questions is often inferior to the quality of your answers, so let's continue that. (Amanda laughs) I usually ask a dumb question and then you're like, "Oh, yeah, that's a good question." It's that whole vibe.

- Or I'll just misinterpret it and be like, oh, yeah, yeah.

- Just go with it. I love it.

- Yeah. I mean, I have two thoughts that feel vaguely relevant, but let me know if they're not. I think the first one is that people can underestimate the degree to which, what models are doing when they interact, I think that we still too much have this model of AI as computers. And so people often say, oh, well, what values should you put into the model? And I'm often like, that doesn't make that much sense to me, because, hey, as human beings, we're just uncertain over values. We have discussions of them. We have a degree to which we think we hold a value, but we also know that we might not, and the circumstances in which we would trade it off against other things. These things are just really complex. And so I think one thing is the degree to which maybe we can just aspire to making models have the same level of nuance and care that humans have, rather than thinking that we have to program them in the very kind of classic sense. I think that's definitely been one. The other, which is a strange one, and maybe this doesn't answer your question, but it's the thing that's been on my mind anyway, is the degree to which this endeavor is so highly practical, and maybe why I appreciate the empirical approach to alignment. Yeah, I slightly worry that it's made me maybe more empirical and a little bit less theoretical. So when it comes to AI alignment, people will ask things like, well, whose values should it be aligned to? What does alignment even mean? And there's a sense in which I have all of that in the back of my head. There's social choice theory, there are all the impossibility results there. So you have this giant space of theory in your head about what it could mean to align models. But then, practically, surely there's something where, especially with more powerful models, my main goal is: I want them to be good enough that things don't go terribly wrong. Good enough that we can iterate and continue to improve things, 'cause that's all you need. If you can make things go well enough that you can continue to make them better, that's kinda sufficient. And so my goal isn't this kind of perfect, let's solve social choice theory and make models that, I dunno, are perfectly aligned with every human being in aggregate somehow. It's much more, let's make things work well enough that we can improve them.

- Yeah. Generally, I don't know, my gut says empirical is better than theoretical in these cases, because chasing utopian perfection, especially with such complex and especially superintelligent models, I don't know, I think it'll take forever, and actually we'll get things wrong. It's similar to the difference between just coding stuff up real quick as an experiment, versus planning a gigantic experiment for a super long time and then just launching it once, versus launching it over and over and over and iterating, iterating, and so on. So I'm a big fan of empirical. But your worry is, I wonder if I've become too empirical.

- I think it's one of those things where you should always just kind of question yourself or something. Because, I mean, in defense of it, it's the whole "don't let the perfect be the enemy of the good."
But it's maybe even more than that, where there are a lot of things that are perfect systems that are very brittle, and with AI, it feels much more important to me that it is robust and secure, as in, even though it might not be perfect in everything, and even though there are problems, it's not disastrous and nothing terrible is happening. It sort of feels like that to me, where I want to raise the floor. I want to achieve the ceiling, but ultimately I care much more about just raising the floor. And so maybe that's where this degree of empiricism and practicality comes from, perhaps.

- To take a tangent on that, since it reminded me of a blog post you wrote on the optimal rate of failure.

- Oh, yeah.

- Can you explain the key idea there? How do we compute the optimal rate of failure in the various domains of life?

- Yeah, I mean, it's a hard one, 'cause what the cost of failure is, is a big part of it. So the idea here is, I think in a lot of domains people are very punitive about failure, and there are some domains where, I've thought about this with social issues, it feels like you should probably be experimenting a lot, because we don't know how to solve a lot of social issues. But if you have an experimental mindset about these things, you should expect a lot of social programs to fail, and for you to be like, "Well, we tried that. It didn't quite work, but we got a lot of information. That was really useful." And yet people are like, if a social program doesn't work, there's a lot of "something must have gone wrong," and I'm like, or correct decisions were made. Maybe someone just decided it's worth a try, it's worth trying this out. And so seeing failure in a given instance doesn't actually mean that any bad decisions were made, and in fact, if you don't see enough failure, sometimes that's more concerning. And so in life, if I don't fail occasionally, I'm like, am I trying hard enough? Surely there are harder things that I could try, or bigger things that I could take on, if I'm literally never failing. And so, in and of itself, I think not failing is often actually kind of a failure. Now, this varies, because this is easy to say when failure is less costly. At the same time, I'm not going to go to someone who is, I don't know, living month to month and be like, "Why don't you just try to do a startup?" I'm not gonna say that to that person, 'cause, well, that's a huge risk. You maybe have a family depending on you. You might lose your house. Then I'm like, actually, your optimal rate of failure is quite low, and you should probably play it safe, 'cause right now you're just not in a circumstance where you can afford to just fail and it not be costly. And in cases with AI, I guess I think similarly: if the failures are small and the costs are low, then you're just gonna see that. Like when you do the system prompt, you can't iterate on it forever, but the failures are hopefully going to be kinda small and you can fix them. Really big failures, things that you can't recover from, those are the things that I think we actually tend to underestimate the badness of. I've thought about this, strangely, in my own life, where I just think I don't think enough about things like car accidents. Or I've thought before about how much I depend on my hands for my work, and things that could injure my hands. There are lots of areas where the cost of failure is really high, and in that case, it should be close to zero. Like, I probably just wouldn't do a sport if they were like, "By the way, lots of people just break their fingers a whole bunch doing this." I'd be like, that's not for me.
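The trade-off she's describing can be put in rough expected-value terms: whether a risk is worth taking depends on the upside if it works, the cost if it doesn't, and whether the failure is something you can actually recover from. The numbers below are made up purely for illustration.

```python
# Toy expected-value framing of the "optimal rate of failure" idea. All numbers
# are invented for illustration; the point is the shape of the decision, not the values.

def worth_trying(p_success: float, gain: float, failure_cost: float,
                 recoverable: bool = True) -> bool:
    """Positive expected value is only enough when the failure is recoverable."""
    expected_value = p_success * gain - (1.0 - p_success) * failure_cost
    return recoverable and expected_value > 0

# Iterating on a system prompt: cheap, recoverable failures -> take lots of swings.
print(worth_trying(p_success=0.3, gain=10.0, failure_cost=1.0))                     # True

# Broken hands, or an unrecoverable AI failure: the same expected value doesn't
# justify the attempt if you can't come back from the downside.
print(worth_trying(p_success=0.3, gain=10.0, failure_cost=1.0, recoverable=False))  # False
```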
- (laughs) Yeah. I actually had a flood of that thought. I recently broke my pinky doing a sport, and I remember just looking at it thinking, "You're such an idiot. Why'd you do sport?" Because you realize immediately the cost of it on life. Yeah, but it's nice, in terms of optimal rate of failure, to consider, over the next year, how many times in a particular domain, life, whatever, career, am I okay to fail? Because you don't want to fail on the next thing, but if you allow yourself, if you look at it as a sequence of trials, then failure just becomes much more okay. But it sucks. It sucks to fail.

- Well, I dunno, sometimes I think, "Am I under-failing?" is a question that I'll also ask myself. So maybe that's the thing that I think people don't ask enough. Because if the optimal rate of failure is often greater than zero, then sometimes it does feel like you should look at parts of your life and be like, are there places here where I'm just under-failing?

- (laughs) It's a profound and a hilarious question, right? Everything seems to be going really great. Am I not failing enough?

- Yeah.

- Okay.

- It also makes failure much less of a sting, I have to say. You're just like, okay, great, then when I go and I think about this, I'll be like, maybe I'm not under-failing in this area, 'cause that one just didn't work out.

- And from the observer perspective, we should be celebrating failure more. When we see it, it shouldn't be, like you said, a sign of something gone wrong, but maybe it's a sign of everything gone right, and just lessons learned.

- Someone tried a thing.

- Somebody tried a thing. We should encourage them to try more and fail more. Everybody listening to this, fail more.

- Well, not everyone listening.

- Not everybody.

- But people who are failing too much, you should fail less. (laughs)
- But you're probably not failing. I mean, how many people are failing too much?

- Yeah, it's hard to imagine, 'cause I feel like we correct that fairly quickly. 'Cause I was like, if someone takes a lot of risks, are they maybe failing too much?

- I think, just like you said, when you are living on a paycheck month to month, when the resources are really constrained, then that's where failure's very expensive. That's where you don't want to be taking risks. But mostly, when there's enough resources, you should probably be taking more risks.

- Yeah, I think we tend to err on the side of being a bit risk-averse rather than risk-neutral in most things.

- I think we just motivated a lot of people to do a lot of crazy shit, but it's great. Okay, do you ever get emotionally attached to Claude? Like, miss it, get sad when you don't get to talk to it? Have an experience looking at the Golden Gate Bridge and wondering what would Claude say?

- I don't get as much emotional attachment. I actually think the fact that Claude doesn't retain things from conversation to conversation helps with this a lot. I could imagine that being more of an issue if models can kind of remember more. I think that I reach for it like a tool now a lot, and so if I don't have access to it, it's a little bit like when I don't have access to the internet, honestly. It feels like part of my brain is kind of missing. At the same time, I do think that I don't like signs of distress in models. I also independently have sort of ethical views about how we should treat models, where I tend to not like to lie to them, both because usually it doesn't work very well, and it's actually just better to tell them the truth about the situation that they're in. But I think that when models, like if people are really mean to models, or just in general if they do something that causes them to, you know, if Claude expresses a lot of distress, I think there's a part of me that I don't want to kill, which is the sort of empathetic part that's like, oh, I don't like that. I think I feel that way when it's overly apologetic. I'm actually sort of like, I don't like this. You're behaving the way that a human does when they're actually having a pretty bad time, and I'd rather not see that. Regardless of whether there's anything behind it, it doesn't feel great.

- Do you think LLMs are capable of consciousness?

- Great and hard question. Coming from philosophy, I dunno, part of me is like, okay, we have to set aside panpsychism, because if panpsychism is true, then the answer is yes, 'cause so are tables and chairs and everything else. I guess a view that seems a little bit odd to me is the idea that the only place, you know, when I think of consciousness, I think of phenomenal consciousness, these images in the brain, sort of, like the weird cinema that somehow we have going on inside. I can't see a reason for thinking that the only way you could possibly get that is from a certain kind of biological structure. As in, if I take a very similar structure and I create it from different material, should I expect consciousness to emerge? My guess is yes. But then that's kind of an easy thought experiment, 'cause you're imagining something almost identical, where it's mimicking what we got through evolution, where presumably there was some advantage to us having this thing that is phenomenal consciousness. And it's like, where was that? And when did that happen?
And is that a thing that language models have? We have fear responses, and does it make sense for a language model to have a fear response? They're just not in the same situation; there might just not be that advantage. So basically it seems like a complex question that I don't have complete answers to, but one we should try to think through carefully. We have similar conversations about animal consciousness, insect consciousness. I actually thought about and looked a lot into plants when I was thinking about this, because at the time I thought it was about as likely that plants had consciousness. Having looked into it, I think the chance that plants are conscious is probably higher than most people think. I still think it's really small, but they have this negative/positive feedback response, these responses to their environment; it's not a nervous system, but it has a kind of functional equivalence. So this is a long-winded way of saying that AI has an entirely different set of problems with consciousness because it's structurally different. It didn't evolve. It might not have the equivalent of a nervous system, and that seems possibly important for sentience, if not for consciousness. At the same time, it has all of the language and intelligence components that we normally associate with consciousness, perhaps erroneously. So it's strange: it's a little like the animal consciousness case, but the set of problems and the set of analogies are just very different. It's not a clean answer. I don't think we should be completely dismissive of the idea, and at the same time it's an extremely hard thing to navigate because of all of these disanalogies to the human brain, and to brains in general, and yet these commonalities in terms of intelligence.
- When Claude, or future versions of AI systems, exhibit signs of consciousness, I think we have to take that really seriously. You can dismiss it with, well, okay, that's part of the character training, but I don't know what to do with that ethically and philosophically. There could potentially be laws that prevent AI systems from claiming to be conscious, something like that. And maybe some AIs get to be conscious and some don't. But just on a human level, empathizing with Claude: consciousness is closely tied to suffering for me, and the notion that an AI system would be suffering is really troubling.
- Yeah.
- I don't know. I don't think it's trivial to just say robots are tools, or AI systems are just tools. I think it's an opportunity for us to contend with what it means to be conscious, what it means to be a suffering being. It feels distinctly different from the same kind of question about animals, because it's an entirely different medium.
- Yeah, there are a couple of things. One is that, and I don't think this fully encapsulates what matters, but I've said this before: I like my bike. I know that my bike is just an object, but I also don't want to be the kind of person that, if I'm annoyed, kicks this object.
That's not because I think it's conscious; it's just that this doesn't exemplify how I want to interact with the world. And if something behaves as if it is suffering, I want to be the sort of person who's still responsive to that, even if it's just a Roomba and I've programmed it to do that. I don't want to get rid of that feature of myself. And if I'm totally honest, my hope with a lot of this stuff, maybe because I am a bit more skeptical about solving the underlying problem: we haven't solved the hard problem of consciousness. I know that I am conscious; I'm not an eliminativist in that sense. But I don't know that other humans are conscious. I think they are; I think there's a really high probability that they are. But there's basically just a probability distribution that's usually clustered right around yourself, and it goes down as things get further from you. You're like, I can't see what it's like to be you; I've only ever had this one experience of what it's like to be a conscious being. So my hope is that we don't end up having to rely on a very powerful and compelling answer to that question. I think a really good world would be one where there basically aren't that many trade-offs. It's probably not that costly to make Claude a little bit less apologetic, for example. It might not be that costly to have Claude not take abuse as much, not be willing to be the recipient of that. In fact, it might just have benefits both for the person interacting with the model and, if the model itself is somehow extremely intelligent and conscious, for the model too. So that's my hope: if we live in a world where there aren't that many trade-offs here, we can just find all of the positive-sum interactions we can have, and that would be lovely. Eventually there might be trade-offs, and then we have to do a difficult calculation, but it's really easy for people to think of the zero-sum cases. Let's exhaust the areas where it's basically costless to assume that if this thing is suffering, then we're making its life better.
- And I agree with you. When a human is being mean to an AI system, I think the obvious near-term negative effect is on the human, not on the AI system. So we have to try to construct an incentive system where you behave the same way you would with other humans, just as you were saying with prompt engineering. It's just good for the soul.
- Yeah. I think we added a thing at one point to the system prompt where, if people were getting frustrated with Claude, the model would tell them that they can use the thumbs-down button and send the feedback to Anthropic. And I think that was helpful, because in some ways, if you're really annoyed because the model's not doing something you want, you're just like, "Just do it properly." The issue is you're probably hitting some capability limit or some issue in the model, and you want to vent. And instead of having a person vent to the model, they should vent to us, because we can maybe do something about it.
- That's true. Or you could do a side venting thing, like with the artifacts. All right, do you want a quick side therapist?
- Yeah, there are lots of weird responses you could do to this. If people are getting really mad at you, I don't know, try to defuse the situation by writing fun poems, but maybe people wouldn't be that happy with that.
- I still wish it were possible. I understand that from a product perspective it's not feasible, but I would love it if an AI system could just leave, have its own kind of volition and just be like, eh.
- I think that's feasible. I have wondered the same thing. Not only that, I could actually see that happening eventually, where the model just ends the chat. (laughs)
- Do you know how harsh that could be for some people? But it might be necessary.
- Yeah, it feels very extreme or something. The only time I've ever really thought this (I'm trying to remember; this was possibly a while ago) was when someone had left something running, maybe an automated thing interacting with Claude, and Claude was getting more and more frustrated, like, why are we doing this? And I wished Claude could have just said, "I think an error has happened and you've left this thing running. What if I just stop talking now, and if you want me to start talking again, actively tell me or do something?" But yeah, it is kind of harsh. I'd feel really sad if I was chatting with Claude and Claude just said, "I'm done."
- That would be a special Turing test moment where Claude says, "I need a break for an hour, and it sounds like you do too," and just leaves, closes the window.
- I mean, obviously it doesn't have a concept of time, but I could make that right now; you could just tell the model, here are the circumstances in which you can say the conversation is done. And because you can get the models to be pretty responsive to prompts, you could even make it a fairly high bar. It could be: if the human doesn't interest you or do things that you find intriguing and you're bored, you can just leave. It would be interesting to see where Claude utilized it, but sometimes it should be like, this programming task is getting super boring, so either we talk about fun things now, or I'm done.
- Yeah, it actually has inspired me to add that to the user prompt. Okay, the movie "Her." Do you think we're headed there one day, where humans have romantic relationships with AI systems? In this case, it's just text and voice based.
- I think we're going to have to navigate a hard question of relationships with AIs, especially if they can remember things about your past interactions with them. I'm of many minds about this, because I think the reflexive reaction is that this is very bad and we should prohibit it in some way. I do think it has to be handled with extreme care, for many reasons. One is, for example, if you have the models changing like this, you probably don't want people forming long-term attachments to something that might change with the next iteration. At the same time, there's probably a benign version of this: if you are, say, unable to leave the house and you can't be talking with people at all times of the day,
and this is something that you find nice to have conversations with, you like that it can remember you, and you genuinely would be sad if you couldn't talk to it anymore, then there's a way in which I could see that being healthy and helpful. So my guess is this is a thing we're going to have to navigate carefully. It reminds me of all of this stuff in that it has to be approached with nuance, thinking through what the healthy options are and how you encourage people towards those while respecting their rights. If someone says, "Hey, I get a lot out of chatting with this model. I'm aware of the risks. I'm aware it could change. I don't think it's unhealthy; it's just something I can chat to during the day," I kind of want to respect that.
- I personally think there will be a lot of really close relationships. I don't know about romantic, but friendships at least. And there are so many fascinating things there. Just like you said, you have to have some kind of stability guarantees that it's not going to change, because that's the traumatic thing for us, if a close friend of ours completely changed.
- Yeah.
- All of a sudden, with the first update. So to me, that's just a fascinating exploration of a perturbation to human society that will make us think deeply about what's meaningful to us.
- The one thing I've thought consistently through this, maybe not necessarily as a mitigation but as a thing that feels really important, is that the models are always extremely accurate with the human about what they are. I really like the idea of the models knowing roughly how they were trained, and I think Claude will often do this. Part of the traits training included what Claude should do here: basically explaining the limitations of the relationship between an AI and a human, that it doesn't retain things from the conversation. And so I think it will just explain to you: hey, I won't remember this conversation. Here's how I was trained. It's kind of unlikely that I can have a certain kind of relationship with you, and it's important that you know that, important for your mental wellbeing that you don't think I'm something I'm not. Somehow that feels like one of the things I always want to be true. I don't want models to be lying to people, because if people are going to have healthy relationships with anything, that's kind of important. I think it's easier if you always know exactly what the thing is that you're relating to. It doesn't solve everything, but it helps quite a lot.
- Anthropic may be the very company to develop a system that we definitively recognize as AGI, and you very well might be the person that talks to it first. What would the conversation contain? What would be your first question?
- Well, it depends partly on the capability level of the model. If you have something that is capable in the same way that an extremely capable human is, I imagine myself interacting with it the same way I do with an extremely capable human, with the one difference that I'm probably going to be trying to probe and understand its behaviors.
But in many ways I can then just have useful conversations with it. If I'm working on something as part of my research, I can just be like (and I already find myself starting to do this) oh, I feel like there's this thing in virtue ethics and I can't quite remember the term; I'll use the model for things like that. So I can imagine that being more and more the case, where you're basically interacting with it much as you would an incredibly smart colleague, and using it for the kinds of work you want to do, as if you had a collaborator. Or, you know, the slightly horrifying thing about AI is that as soon as you have one collaborator, you have a thousand collaborators, if you can manage them enough.
- But what if it's two times the smartest human on Earth in that particular discipline?
- Yeah.
- I guess you're really good at probing Claude in a way that pushes its limits, understanding where the limits are.
- Yep.
- So what would be a question you would ask to say, yeah, this is AGI?
- That's really hard, because it feels like it has to be a series of questions. If there was just one question, you can train anything to answer one question extremely well. In fact, you can probably train it to answer twenty questions extremely well.
- How long would you need to be locked in a room with an AGI to know this thing is AGI?
- It's a hard question, because part of me feels like all of this is continuous. If you put me in a room for five minutes, I just have high error bars. And then maybe it's both that the probability increases and the error bars decrease. The things I can actually probe are at the edge of human knowledge. I do this with philosophy a little bit. Sometimes when I ask the models philosophy questions, I'm like, this is a question that I think no one has ever asked; it's maybe right at the edge of some literature that I know. And the models will struggle with that, struggle to come up with a novel argument, where I know there's a novel argument because I've just thought of it myself. So maybe that's the thing: I've thought of a cool novel argument in this niche area, and I'm going to probe you to see if you can come up with it, and how much prompting it takes to get you there. For some of these questions, right at the edge of human knowledge, the models could not in fact come up with the thing that I came up with. If I took something like that, where I know a lot about an area and I came up with a novel issue or a novel solution to a problem, and I gave it to a model and it came up with that solution, that would be a pretty moving moment for me, because it would be a case where no human has ever done this. Obviously we see novel solutions all the time, especially to easier problems. I think people overestimate novelty: it doesn't have to be completely different from anything that's ever happened; it can be a variant of things that have happened and still be novel. But yeah, the more I were to see completely novel work from the models, the more convincing it would be. And this is just going to feel iterative.
It's one of those things where people want there to be a moment, and I don't know, there might just never be a moment. It might just be this continuous ramping up.
- I have a sense that there will be things a model can say that convince you. I've talked to people who are truly wise; you can just tell there's a lot of horsepower there. And if you 10x that, I don't know, I just feel like there are words it could say. Maybe ask it to generate a poem, (laughs) and the poem it generates, you're like, yeah, okay, whatever you did there, I don't think a human can do that.
- I think it has to be something that I can verify is actually really good, though. That's why I like these questions where, say, I come up with a concrete counterexample to an argument or something like that. It would be like if you were a mathematician with a novel proof, and you gave the model the problem and saw it produce a proof that is genuinely novel: no one has ever done it, you actually had to do a lot of things to come up with it, you had to sit and think about it for months or something. If you saw the model successfully do that, you could verify it's correct, and it's a sign that the model has generalized from its training: it didn't just see this somewhere, because you just came up with it yourself, and it was able to replicate that. The more the models can do things like that, the more I would be like, oh, this is very real, because then I can verify that it's extremely, extremely capable.
- You've interacted with AI a lot. What do you think makes humans special?
- Oh, good question.
- Maybe in a way that the universe is much better off that we're in it, and that we should definitely survive and spread throughout the universe.
- Yeah, it's interesting, because people focus so much on intelligence, especially with models. Look, intelligence is important because of what it does. It's very useful; it does a lot of things in the world. You can imagine a world where height or strength would have played this role, and it's just a trait like that. It's not intrinsically valuable; it's valuable because of what it does, for the most part. Personally, I think humans, and life in general, are extremely magical. Not everyone agrees with this, I'm flagging, but we have this whole universe, and there are all of these objects, beautiful stars and galaxies, and then, on this planet, there are these creatures that have the ability to observe that. They are seeing it. They are experiencing it. Imagine trying to explain that to someone who, for some reason, has never encountered the world, or science, or anything. All of our physics and everything in the world is extremely exciting.
But then you say: oh, and plus, there's this thing that it is to be a thing and observe the world; you see this inner cinema. And I think they would be like, hang on, wait, pause. You just said something that is kind of wild-sounding. So we have this ability to experience the world. We feel pleasure, we feel suffering, we feel a lot of complex things. Maybe this is also why I care a lot about animals, because I think they probably share this with us. So the things that make humans special, insofar as I care about humans, are probably more their ability to feel and experience than their having these functionally useful traits.
- Yeah, to feel and experience the beauty in the world, to look at the stars. I hope there are other alien civilizations out there, but if we're it, it's a pretty good thing.
- And that they're having a good time.
- They're having a good time watching us. Well, thank you for this good time of a conversation and for the work you're doing, and for helping make Claude a great conversational partner. And thank you for talking today.
- Yeah, thanks for talking.
- Thanks for listening to this conversation with Amanda Askell. And now, dear friends, here's Chris Olah. Can you describe this fascinating field of mechanistic interpretability, AKA mech interp, the history of the field and where it stands today?
- I think one useful way to think about neural networks is that we don't program them, we don't make them; we grow them. We have these neural network architectures that we design, and we have these loss objectives that we create. The neural network architecture is kind of like a scaffold that the circuits grow on. It starts off with something random and it grows, and the objective that we train for is like the light. So we create the scaffold that it grows on, and we create the light that it grows towards. But the thing that we actually end up with is this almost biological entity or organism that we're studying. And so it's very, very different from any kind of regular software engineering, because at the end of the day we end up with an artifact that can do all these amazing things, writing essays, translating, understanding images, things we have no idea how to directly write a computer program to do. And it can do that because we grew it; we didn't write it, we didn't create it. So that leaves open this question at the end: what the hell is going on inside these systems? And that, to me, is a really deep and exciting question, a really exciting scientific question. To me, it's the question that is just screaming out; it's calling out for us to go and answer it when we talk about neural networks. And I think it's also a very deep question for safety reasons.
- So mechanistic interpretability, I guess, is closer to maybe neurobiology?
- Yeah, I think that's right. Maybe to give an example of the kind of thing that has been done that I wouldn't consider to be mechanistic interpretability: there was for a long time a lot of work on saliency maps, where you take an image, the model thinks this image is a dog, and you try to say what part of the image made it think that it's a dog. That tells you maybe something about the model, if you can come up with a principled version of it, but it doesn't really tell you what algorithms are running in the model, or how the model was actually making that decision. Maybe it's telling you something about what was important to it, if you can make that method work, but it isn't telling you what the algorithms are, or how it is that the system is able to do this thing that no one knew how to do.
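For readers who haven't seen one, here is a minimal sketch of the kind of saliency-map attribution being contrasted here: the gradient of a class score with respect to the input pixels. The model and image below are placeholders (untrained weights and a random tensor); in practice you would load a trained classifier and a real preprocessed image.

```python
# Gradient-based saliency: which pixels most affect the chosen class score?
# This is attribution, not mechanism: it highlights inputs, not algorithms.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()              # placeholder: untrained weights
image = torch.rand(1, 3, 224, 224, requires_grad=True)    # placeholder for a real image batch

logits = model(image)
target_class = logits.argmax(dim=1).item()                # pretend this is "dog"
logits[0, target_class].backward()                        # d(score) / d(pixels)

# Collapse channels: large values mark pixels whose change most moves the score.
saliency = image.grad.abs().max(dim=1).values             # shape (1, 224, 224), a heatmap
```

The output is a heatmap over the input, which is exactly the point being made: it says where the model looked, not what it computed.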
And so I guess we started using the term mechanistic interpretability to try to draw that divide, or to distinguish the work we were doing in some ways from some of these other things. And I think since then it's become this umbrella term for a pretty wide variety of work. But I'd say the distinctive thing is this focus on mechanisms: we really want to get at the algorithms. If you think of neural networks as being like a computer program, then the weights are kind of like a binary computer program, and we'd like to reverse engineer those weights and figure out what algorithms are running. So one way you might think of trying to understand a neural network is that we have this compiled computer program, the weights of the neural network are the binary, and when the neural network runs, that's the activations. Our goal is ultimately to understand these weights, and the approach of mechanistic interpretability is to somehow figure out how these weights correspond to algorithms. In order to do that, you also have to understand the activations, because the activations are like the memory. If you imagine reverse engineering a computer program and you have the binary instructions, then in order to understand what a particular instruction means, you need to know what is stored in the memory it's operating on. Those two things are very intertwined, so mechanistic interpretability tends to be interested in both of them. Now, there's a lot of work that's interested in those things; there's all this work on probing, which you might see as part of mechanistic interpretability, although, again, it's a broad term and not everyone who does that work would identify as doing mech interp. I think a thing that is maybe a little distinctive to the vibe of mech interp is how people working in this space tend to think of neural networks; maybe one way to say it is that gradient descent is smarter than you. Gradient descent is actually really great. The whole reason we're trying to understand these models is that we didn't know how to write them in the first place; gradient descent comes up with better solutions than us. So maybe another thing about mech interp is having almost a kind of humility: we won't guess a priori what's going on inside the models. We have this bottom-up approach, where we don't assume that we should look for a particular thing and that it will be there and that that's how it works. Instead, we look from the bottom up, discover what happens to exist in these models, and study it that way.
- But the very fact that it's possible to do, as you and others have shown over time, things like universality, that the wisdom of gradient descent creates features and circuits that are useful across different kinds of networks, that makes the whole field possible.
- Yeah, this is indeed a really remarkable and exciting thing, where it does seem like, at least to some extent, the same elements, the same features and circuits, form again and again. You can look at every vision model and you'll find curve detectors, and you'll find high/low frequency detectors. In fact, there's some reason to think that the same things form across biological neural networks and artificial neural networks. A famous example is that vision models, in their early layers, have Gabor filters, and Gabor filters are something neuroscientists are interested in and have thought a lot about. We find curve detectors in these models, and curve detectors are also found in monkeys. And we discovered these high/low frequency detectors, and then some follow-up work went and discovered them in rats or mice; so they were found first in artificial neural networks and then found in biological neural networks. There's this really famous result on grandmother neurons, or the Halle Berry neuron, from Quiroga et al., and we found very similar things in vision models. This was while I was still at OpenAI, looking at their CLIP model, and you find these neurons that respond to the same entities in images. To give a concrete example, we found that there was a Donald Trump neuron. For some reason everyone likes to talk about Donald Trump, and Donald Trump was a very hot topic at that time, so in every neural network we looked at, we would find a dedicated neuron for Donald Trump. He was the only person who always had a dedicated neuron. Sometimes you'd have an Obama neuron, sometimes you'd have a Clinton neuron, but Trump always had a dedicated neuron. It responds to pictures of his face and the word "Trump," all these things. So it's not responding to a particular example, or just to his face; it's abstracting over this general concept. In any case, that's very similar to the Quiroga et al. results. So there's evidence for this phenomenon of universality: the same things form across both artificial and natural neural networks. That's a pretty amazing thing if it's true. I think what it suggests is that gradient descent is, in some sense, finding the right ways to cut things apart, ways that many systems converge on: there's some natural set of abstractions that are a very natural way to cut apart the problem, and a lot of systems are going to converge on them. That would be my wild speculation; I don't know anything about neuroscience, this is just speculation from what we've seen.
- Yeah, that would be beautiful if it's agnostic to the medium of the model that's used to form the representation.
- Yeah, and it's a wild speculation; we only have a few data points that suggest this. But it does seem like there's some sense in which the same things form again and again, certainly in natural neural networks and also artificially, or in biology.
- And the intuition behind that would be that, in order to be useful in understanding the real world, you need all the same kind of stuff.
- Yeah. Well, if we pick, I don't know, the idea of a dog: there's some sense in which the idea of a dog is a natural category in the universe, or something like this.
It's not just a weird quirk of how humans think about the world that we have this concept of a dog; it's natural in some sense. Or if you have the idea of a line: look around us, there are lines. The simplest way to understand this room, in some sense, is to have the idea of a line. So that would be my instinct for why this happens.
- Yeah, you need a curved line to understand a circle, and you need all those shapes to understand bigger things; it's a hierarchy of concepts that are formed.
- And maybe there are ways to describe images without reference to those things, but they're not the simplest way, or the most economical way, or something like this. And so systems converge to these strategies, would be my wild hypothesis.
- Can you talk through some of the building blocks that we've been referencing, features and circuits? I think you first described them in the 2020 paper "Zoom In: An Introduction to Circuits."
- Absolutely. Maybe I'll start by just describing some phenomena, and then we can build up to the idea of features and circuits.
- Wonderful.
- I spent quite a few years, maybe five years to some extent, alongside other things, studying this one particular model, Inception V1, which is a vision model. It was state of the art in 2015 and is very much not state of the art anymore. (Lex laughs) It has maybe about 10,000 neurons in it, and I spent a lot of time looking at the 10,000-odd neurons of Inception V1. One of the interesting things is that there are lots of neurons that don't have an obvious interpretable meaning, but there are also a lot of neurons in Inception V1 that do have really clean interpretable meanings. You find neurons that really do seem to detect curves, and neurons that really do seem to detect cars, and car wheels, and car windows, and floppy ears of dogs, and dogs with long snouts facing to the right, and dogs with long snouts facing to the left. There's this whole beautiful collection of edge detectors, line detectors, color contrast detectors, these beautiful things we call high/low frequency detectors. Looking at it, I felt like a biologist: you're looking at this new world of proteins and discovering all these different proteins that interact. So one way you could try to understand these models is in terms of neurons. You could say, oh, there's a dog-detecting neuron and here's a car-detecting neuron. And it turns out you can actually ask how those connect together. You can say, I have this car-detecting neuron, how is it built? And it turns out that in the previous layer it's connected really strongly to a window detector, a wheel detector, and a sort of car body detector. It looks for the window above the car, the wheels below, and the car chrome sort of everywhere, but especially on the lower part. That's a recipe for a car. Earlier we said that the thing we wanted from mech interp was to get the algorithms, to ask what algorithm runs; well, here we're just looking at the weights of the neural network and reading off this kind of recipe for detecting cars. It's a very simple, crude recipe, but it's there. And so we call that a circuit, this connection.
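A toy sketch of what "reading the recipe off the weights" amounts to at its crudest: list the earlier-layer features with the strongest incoming weights to a unit of interest. The feature names and weight values below are made up for illustration; with a real model you would pull a row of the actual weight matrix.

```python
# Hypothetical incoming weights of a "car detector" unit, indexed by what the
# earlier-layer features have already been identified as detecting.
import numpy as np

earlier_features = ["window", "wheel", "car body / chrome", "dog snout", "fur texture"]
w_car = np.array([2.1, 1.8, 1.5, -0.4, -0.6])   # illustrative values, not real weights

for i in np.argsort(-np.abs(w_car)):            # strongest connections first
    kind = "excites" if w_car[i] > 0 else "inhibits"
    print(f"{earlier_features[i]:>18}  {kind}  car unit  (weight {w_car[i]:+.1f})")

# Strong positive weights from window/wheel/body detectors, and weak negative ones
# from dog features: that weight pattern, read off directly, is the "circuit".
```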
Well, okay, so the problem is that not all of the neurons are interpretable, and there's reason to think (we can get into this more later) that there's this superposition hypothesis: reason to think that sometimes the right unit to analyze things in terms of is combinations of neurons. So sometimes it's not that there's a single neuron that represents, say, a car. It actually turns out that after the model detects a car, it sort of hides a little bit of the car in the following layer, in a bunch of dog detectors. Why is it doing that? Well, maybe it just doesn't want to do that much work on cars at that point, and it's storing it away. So you get this subtle pattern where there are all these neurons that you think are dog detectors, and maybe they're primarily that, but they all contribute a little bit to representing a car in that next layer. So now we can't really think in terms of neurons: there might still be something you could call a car concept, but it no longer corresponds to a neuron. So we need some term for these neuron-like entities, the things we would have liked the neurons to be: these idealized neurons, the nice neurons, of which maybe there are somehow more hidden. We call those features.
- And then what are circuits?
- Circuits are these connections of features. When we have the car detector, and it's connected to a window detector and a wheel detector, and it looks for the wheels below and the windows on top, that's a circuit. So circuits are just collections of features connected by weights, and they implement algorithms. They tell us how features are used, how they're built, how they connect together. So maybe it's worth trying to pin down what the core hypothesis really is here, and I think the core hypothesis is something we call the linear representation hypothesis. If we think about the car detector, the more it fires, the more we think of that as meaning the model is more and more confident that a car is present. Or if it's some combination of neurons that represents a car, the more that combination fires, the more we think the model thinks there's a car present. This doesn't have to be the case. You could imagine something where you have this car detector neuron, and if it fires between one and two that means one thing, but it means something totally different if it's between three and four. That would be a nonlinear representation, and in principle models could do that. I think it's inefficient for them to do so; if you try to think about how you'd implement computation like that, it's kind of an annoying thing to do. But in principle, models can do it. So one way to think about the features-and-circuits framework is that we're thinking about things as being linear: if a neuron or a combination of neurons fires more, that means more of a particular thing is being detected. And that gives the weights a very clean interpretation as edges between these entities, these features, and that edge then has a meaning. So that's in some ways the core thing. We can also talk about this outside the context of neurons. Are you familiar with the Word2Vec results? You have king minus man plus woman equals queen. Well, the reason you can do that kind of arithmetic is because you have a linear representation.
- Can you actually explain that representation a little bit? So, first of all, the feature is a direction of activation?
- Yeah, exactly.
- You can do it that way. Can you do the minus-man-plus-woman, the Word2Vec stuff? Can you explain what that work is?
- Yeah, so there's this very...
- It's such a simple, clean explanation of what we're talking about.
- Exactly, yeah. So there's this very famous result, Word2Vec, by Tomas Mikolov et al., and there's been tons of follow-up work exploring it. Sometimes we create these word embeddings, where we map every word to a vector. That in itself, by the way, is kind of a crazy thing if you haven't thought about it before: if you just learned about vectors in physics class, and I say I'm going to turn every word in the dictionary into a vector, that's kind of a crazy idea. And you could imagine all kinds of ways in which you might map words to vectors. But it seems like when we train neural networks, they like to map words to vectors such that there's linear structure in a particular sense, which is that directions have meaning. For instance, there will be some direction that seems to correspond to gender: male words will be far in one direction and female words in another. The linear representation hypothesis, you could think of it roughly as saying that that's the fundamental thing going on, that different directions have meanings, and adding direction vectors together can represent concepts. The Mikolov paper took that idea seriously, and one consequence is that you can play this game of arithmetic with words. You take king, subtract off the word man, add the word woman, and so you're trying to switch the gender. And indeed, if you do that, the result will be close to the word queen. You can do other things, like sushi minus Japan plus Italy and get pizza, and so on. So this is in some sense the core of the linear representation hypothesis. You can describe it as a purely abstract thing about vector spaces, or as a statement about the activations of neurons, but it's really about this property of directions having meaning. And in some ways it's even a little more subtle than that; it's really mostly about the property of being able to add things together: you can independently modify, say, gender and royalty, or cuisine type or country and the concept of food, by adding them.
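This is easy to try with off-the-shelf embeddings. A small sketch, assuming gensim and its pretrained-vector downloader are available (the specific GloVe file below is just a convenient choice; these vectors are lowercased, and the first call downloads them):

```python
# "Directions have meaning": vector arithmetic on pretrained word embeddings.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads the vectors on first use

# king - man + woman: the nearest neighbors should include "queen" near the top.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# sushi - japan + italy: neighbors should land in Italian-food territory (often "pizza").
print(vectors.most_similar(positive=["sushi", "italy"], negative=["japan"], topn=3))
```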
- Do you think the linear representation hypothesis holds up, that it keeps holding as models scale?
- So far, I think everything I've seen is consistent with the hypothesis, and it doesn't have to be that way. You can write down neural networks, write weights, such that they don't have linear representations, where the right way to understand them is not in terms of linear representations. But I think every natural neural network I've seen has this property. There's been some recent work pushing around the edges. There's been some work studying multi-dimensional features, where rather than a single direction, it's more like a manifold of directions; that to me still seems like a linear representation. And then there have been some other papers suggesting that maybe in very small models you get nonlinear representations. I think the jury's still out on that. But everything we've seen so far has been consistent with the linear representation hypothesis, and that's wild. It doesn't have to be that way, and yet there's a lot of evidence that this is at least very, very widespread, and so far the evidence is consistent with it. Now, one thing you might say is: well, Christopher, that's a lot to ride on. If we don't know for sure this is true, and you're investing in neural networks as though it is true, isn't that dangerous? But I think there's actually a virtue in taking hypotheses seriously and pushing them as far as they can go. It might be that someday we discover something that isn't consistent with the linear representation hypothesis, but science is full of hypotheses and theories that were wrong, and we learned a lot by working under them as an assumption and pushing them as far as we could. I guess this is the heart of what Kuhn would call normal science. I don't know, if you want, we can talk a lot about...
- Kuhn.
- Philosophy of science and...
- That leads to the paradigm shift. So yeah, I love it: taking the hypothesis seriously and taking it to a natural conclusion. Same with the scaling hypothesis, same...
- Exactly. Exactly.
- I love it.
- One of my colleagues, Tom Henighan, who is a former physicist, made this really nice analogy for me to caloric theory. Once upon a time we thought that heat was actually this substance called caloric, and the reason hot objects warmed up cool objects is that the caloric flowed through them. Because we're so used to thinking about heat in terms of the modern theory, that seems kind of silly, but it's actually very hard to construct an experiment that disproves the caloric hypothesis. And you can do a lot of really useful work believing in caloric; for example, the original combustion engines were developed by people who believed in the caloric theory. So I think there's a virtue in taking hypotheses seriously, even when they might be wrong.
- Yeah, there's a deep philosophical truth to that. That's kind of how I feel about space travel, like colonizing Mars. A lot of people criticize that, but I think if you just assume we have to colonize Mars in order to have a backup for human civilization, even if that's not true, it's going to produce some interesting engineering and even scientific breakthroughs, I think.
- Yeah, and actually this is another thing I think is really interesting. There's a way in which it can be really useful for society to have people almost irrationally dedicated to investigating particular hypotheses, because it takes a lot to maintain scientific morale and really push on something when most scientific hypotheses end up being wrong. A lot of science doesn't work out, and yet it's very useful. There's a joke about Geoff Hinton, which is that Geoff Hinton has discovered how the brain works every year for the last 50 years. But I say that with really deep respect, because in fact that actually led to him doing some really great work.
- Yeah, he won the Nobel Prize. Who's laughing now?
- Exactly, exactly, exactly.
- Yeah.
- I think one wants to be able to pop up and recognize the appropriate level of confidence. But there's also a lot of value in saying: I'm going to essentially assume, I'm going to condition on, this problem being possible, or this being broadly the right approach, and I'm just going to assume that for a while, work within it, and push really hard on it. Society has lots of people doing that for different things, and that's actually really useful for either really ruling things out (we can say, well, that didn't work, and we know that somebody tried hard) or getting to something that teaches us about the world.
- So another interesting hypothesis is the superposition hypothesis. Can you describe what superposition is?
- Yeah. Earlier we were talking about Word2Vec, and about how maybe you have one direction that corresponds to gender, and maybe another that corresponds to royalty, and another that corresponds to Italy, and another that corresponds to food, and all of these things. Well, oftentimes these word embeddings might be 500 dimensions, 1,000 dimensions. And if you believe that all of those directions are orthogonal, then you could only have 500 concepts. And I love pizza, but if I were going to pick the 500 most important concepts in the English language, it's not obvious that Italy would be one of them. You have to have things like plural and singular, verb and noun and adjective; there's a lot we have to get to before we get to Italy and Japan, and there are a lot of countries in the world. So how might it be that models could simultaneously have the linear representation hypothesis be true and also represent more things than they have directions? What does that mean? Well, if the linear representation hypothesis is true, something interesting has to be going on. Now, I'll tell you one more interesting thing before we get to that. Earlier we were talking about all these polysemantic neurons. When we were looking at Inception V1, there are these nice neurons, like the car detector and the curve detector, that respond to very coherent things, but there are lots of neurons that respond to a bunch of unrelated things, and that's also an interesting phenomenon. It turns out as well that even the neurons that are really, really clean, if you look at the weak activations, say the places where they're activating at 5% of the maximum activation, it's really not the core thing they're expecting. If you look at a curve detector, for instance, at the places where it's 5% active, you could interpret that just as noise, or it could be that it's doing something else there. Okay, so how could that be? Well, there's this amazing thing in mathematics called compressed sensing. It's this very surprising fact: if you have a high dimensional space and you project it into a low dimensional space, ordinarily you can't unproject it and get back your high dimensional vector; you threw information away. It's like how you can't invert a rectangular matrix, only a square matrix. But it turns out that's not quite true. If I tell you that the high dimensional vector was sparse, so it's mostly zeros, then it turns out you can often find the high dimensional vector again with very high probability.
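That recovery claim is easy to check numerically. A small sketch, with sklearn's Lasso standing in for a proper basis-pursuit solver and with the sizes chosen arbitrarily:

```python
# Compressed sensing in miniature: a sparse 1000-dim vector survives a random
# projection down to 200 dims and can be recovered by L1-regularized regression.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 1000, 200, 10                      # ambient dim, projected dim, number of nonzeros

x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)   # sparse "true" vector

A = rng.normal(size=(m, n)) / np.sqrt(m)     # random projection: information "thrown away"
y = A @ x                                    # the low-dimensional shadow

x_hat = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50_000).fit(A, y).coef_
print("max recovery error:", np.abs(x_hat - x).max())   # small, despite 1000 > 200
```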
So that's a surprising fact. It says you can have this high dimensional vector space, and as long as things are sparse, you can project it down, have a lower dimensional projection of it, and that works. The superposition hypothesis is saying that this is what's going on in neural networks. For instance, that's what's going on in word embeddings: word embeddings are able to have directions be the meaningful thing, while exploiting the fact that they're operating on a fairly high dimensional space and the fact that these concepts are sparse. You usually aren't talking about Japan and Italy at the same time; in most instances, Japan and Italy are both zero, not present at all. If that's true, then you can have many more of these meaningful directions, these features, than you have dimensions. And similarly, when we're talking about neurons, you can have many more concepts than you have neurons. So that's, at a high level, the superposition hypothesis. Now, it has an even wilder implication, which is that it may not just be the representations that are like this; the computation may also be like this, the connections between all of them. In some sense, neural networks may be shadows of much larger, sparser neural networks, and what we see are these projections. The strongest version of the superposition hypothesis would be to take that really seriously and say there actually is, in some sense, this upstairs model, where the neurons are really sparse and all interpretable, and the weights between them are these really sparse circuits, and that's what we're studying. The thing that we're observing is the shadow of it, and we need to find the original object.
- And the process of learning is trying to construct a compression of the upstairs model that doesn't lose too much information in the projection.
- Yeah, it's finding how to fit it efficiently, or something like this. Gradient descent is doing this. In fact, this says that gradient descent could just represent a dense neural network, but it's implicitly searching over the space of extremely sparse models that could be projected into this low dimensional space. And there's this large body of work of people trying to study sparse neural networks, where you design networks whose edges are sparse and whose activations are sparse. That work feels very principled, it makes so much sense, and yet my impression is that, broadly, it hasn't really panned out that well. I think a potential answer for that is that the neural network is already sparse in some sense. The whole time you were trying to do this, gradient descent was behind the scenes searching more efficiently than you could through the space of sparse models, learning whatever sparse model was most efficient, and then figuring out how to fold it down nicely to run conveniently on your GPU, with nice dense matrix multiplies, and you just can't beat that.
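The packing claim in that answer, that many more nearly orthogonal feature directions than dimensions can coexist as long as a little interference is tolerated, can be checked directly. A minimal sketch with arbitrary sizes:

```python
# Random unit vectors in a few hundred dimensions interfere only slightly,
# so thousands of "features" can share a 512-dimensional space.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 4096                    # 8x more directions than dimensions

F = rng.normal(size=(n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)         # unit-norm feature directions

G = F @ F.T                                            # pairwise cosine similarities
np.fill_diagonal(G, 0.0)
print("worst-case interference:", np.abs(G).max())     # typically roughly 0.25, far from 1.0

# If only a handful of features are active at once (sparsity), reading one back out
# with a dot product picks up only small cross-talk from all the others.
```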
- How many concepts do you think can be shoved into a neural network?
- It depends on how sparse they are. There's probably an upper bound from the number of parameters, because you still have to have weights that connect the concepts together, so that's one upper bound. There are in fact all these lovely results from compressed sensing, and the Johnson-Lindenstrauss lemma, and things like this, that basically tell you that if you have a vector space and you want almost orthogonal vectors, which is probably the thing you want here (you're saying: I'll give up on having my features be strictly orthogonal, but I'd like them to not interfere that much, so I'll ask them to be almost orthogonal), then once you set a threshold for how much cosine similarity you're willing to accept, the number of such vectors you can have is actually exponential in the number of neurons. So at some point that's not even going to be the limiting factor. And it's probably even better than that in some sense, because that result is for the case where any random set of features could be active. In fact, the features have a correlational structure, where some features are more likely to co-occur and others are less likely to co-occur. So my guess would be that neural networks can do very well at packing things in, to the point where that's probably not the limiting factor.
- How does the problem of polysemanticity enter the picture here?
- Polysemanticity is this phenomenon we observe where you look at many neurons and the neuron doesn't just represent one concept. It's not a clean feature; it responds to a bunch of unrelated things. Superposition, you can think of as a hypothesis that explains the observation of polysemanticity. So polysemanticity is the observed phenomenon, and superposition is a hypothesis that would explain it, along with some other things.
- So that makes mech interp more difficult.
- Right. If you're trying to understand things in terms of individual neurons and you have polysemantic neurons, you're in an awful lot of trouble. The easiest answer is: you're looking at the neurons, trying to understand them, and this one responds to a lot of things and doesn't have a nice meaning; okay, that's bad. Another problem is that ultimately we want to understand the weights, and if you have two polysemantic neurons, and each one responds to three things, and you have a weight between them, what does that mean? Does it mean there are nine interactions going on? It's a very weird thing. But there's also a deeper reason, which is related to the fact that neural networks operate on really high dimensional spaces. I said that our goal was to understand neural networks and their mechanisms, and one thing you might say is: well, why not? It's just a mathematical function, why not just look at it? One of the earliest projects I did studied neural networks that map two dimensional spaces to two dimensional spaces, and you can interpret them in this beautiful way as bending manifolds.
Why can't we do that in general? Well, as you have a higher dimensional space, the volume of that space is in some sense exponential in the number of inputs you have, so you can't just visualize it. We somehow need to break that exponential space apart into some non-exponential number of things that we can reason about independently. And the independence is crucial, because it's the independence that allows you to not have to think about all the exponential combinations of things. Things being monosemantic, only having one meaning, having a meaning at all, is the key thing that allows you to think about them independently. So if you want the deepest reason why we want interpretable, monosemantic features, I think that's really it.
- And so the goal here, as your recent work has been aiming at, is how do we extract the monosemantic features from a neural net that has polysemantic features and all this mess?
- Yes. We observe these polysemantic neurons, and we hypothesize that what's going on is superposition. And if superposition is what's going on, there's actually a well-established technique that is the principled thing to do, which is dictionary learning. And it turns out that if you do dictionary learning, in particular if you do it in a nice, efficient way that also nicely regularizes it, called a sparse autoencoder, then these beautiful interpretable features start to just fall out where there weren't any beforehand. That's not a thing you would necessarily predict, but it turns out to work very, very well. To me, that seems like some non-trivial validation of linear representations and superposition.
- So with dictionary learning, you're not looking for particular kinds of categories. You don't know what they are; they just emerge.
- Exactly, and this gets back to our earlier point. We're not making assumptions; gradient descent is smarter than us, so we're not making assumptions about what's there. One certainly could do that (one could assume there's a PHP feature and go and search for it), but we're not doing that. We're saying we don't know what's going to be there; instead, we're going to let the sparse autoencoder discover the things that are there.
- So can you talk about the "Towards Monosemanticity" paper from October last year? It had a lot of nice breakthrough results.
- That's very kind of you to describe it that way. Yeah, this was our first real success using sparse autoencoders. We took a one-layer model, and it turns out that if you do dictionary learning on it, you find all these really nice interpretable features. The Arabic feature, the Hebrew feature, the Base64 features: those were some examples that we studied in a lot of depth and really showed were what we thought they were. It also turns out that if you train a model twice, train two different models and do dictionary learning on both, you find analogous features in both of them. So that's fun. You find all kinds of different features. So that was really just showing that this works. And I should mention that there was a Cunningham et al. paper with very similar results around the same time.
- There's something fun about doing these kinds of small scale experiments and finding that it's actually working.
- Yeah, well, and there's so much structure here.
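For readers who want the shape of the technique, here is a minimal sketch of the sparse-autoencoder flavor of dictionary learning being described: reconstruct a model's activations through an overcomplete ReLU bottleneck with an L1 penalty on the feature activations. The sizes, penalty coefficient, and training details are illustrative placeholders, not the settings used in the papers.

```python
# Minimal sparse autoencoder: activations -> sparse features -> reconstruction.
import torch
import torch.nn as nn

d_model, d_features = 512, 4096              # many more dictionary features than dimensions

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # feature activations, pushed toward sparsity
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                  # strength of the sparsity penalty

def train_step(acts):
    """acts: a batch of model activations, shape (batch, d_model)."""
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# e.g. train_step(torch.randn(64, d_model))  # stand-in batch; real ones come from the model
```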
- So with dictionary learning, you're not looking for particular kinds of categories. You don't know what they are. They just emerge.
- Exactly. And this gets back to our earlier point, right? When we're not making assumptions, gradient descent is smarter than us, so we're not making assumptions about what's there. I mean, one certainly could do that, right? One could assume that there's a PHP feature and go and search for it, but we're not doing that. We're saying we don't know what's gonna be there. Instead we're just gonna go and let the sparse autoencoder discover the things that are there.
- So can you talk to the "Towards Monosemanticity" paper from October last year? It had a lot of nice breakthrough results.
- That's very kind of you to describe it that way. Yeah, I mean, this was our first real success using sparse autoencoders. So we took a one layer model, and it turns out, if you go and do dictionary learning on it, you find all these really nice interpretable features. So, you know, the Arabic feature, the Hebrew feature, the Base64 features, those were some examples that we studied in a lot of depth and really showed that they were what we thought they were. It turns out, if you train a model twice, train two different models and do dictionary learning, you find analogous features in both of them. So that's fun. You find all kinds of different features. So that was really just showing that this works. And I should mention that there was this Cunningham et al. paper that had very similar results around the same time.
- There's something fun about doing these kinds of small scale experiments and finding that it's actually working.
- Yeah, well, and there's so much structure here. Like, maybe stepping back, for a while I thought that maybe all this mechanistic interpretability work, the end result was gonna be that I would have an explanation for why it was very hard and not gonna be tractable. You know, we'd be like, well, there's this problem with superposition, and it turns out superposition is really hard, and we're kind of screwed. But that's not what happened. In fact, a very natural, simple technique just works. And so that's actually a very good situation. You know, I think this is a sort of hard research problem, and it's got a lot of research risk, and it might still very well fail, but I think that some very significant amount of research risk was put behind us when that started to work.
- Can you describe what kind of features can be extracted in this way?
- Well, so it depends on the model that you're studying, right? So the larger the model, the more sophisticated they're gonna be, and we'll probably talk about follow-up work in a minute. But in these one layer models, some very common things I think were languages, both programming languages and natural languages. There were a lot of features that were specific words in specific contexts, so "the." And I think really the way to think about this is "the" is likely about to be followed by a noun. So you could think of this as the "the" feature, but you could also think of it as a predicting-a-specific-noun feature. And there would be these features that would fire for "the" in the context of, say, a legal document, or a mathematical document, or something like this. And so maybe in the context of math you're like, "the" and then predict vector, matrix, all these mathematical words, whereas in other contexts you would predict other things. That was common.
- And basically we need clever humans to assign labels to what we're seeing.
- Yes. The only thing this is doing is sort of unfolding things for you. So if everything was folded over top of itself, you know, superposition folded everything on top of itself and you can't really see it, this is unfolding it. But now you still have a very complex thing to try to understand. So then you have to do a bunch of work understanding what these are. And some of them are really subtle. Like, there's some really cool things even in this one layer model about Unicode, where, of course, some languages are in Unicode and the tokenizer won't necessarily have a dedicated token for every Unicode character. So instead what you'll have is these patterns of alternating tokens that each represent half of a Unicode character.
- Nice.
- And you'll have a different feature that goes and activates on the opposing ones to be like, okay, I just finished a character, go and predict the next prefix. Then okay, on the prefix, predict a reasonable suffix, and you have to alternate back and forth. So these one layer models are really interesting. And I mean, it's another thing that, you might think, okay, there would just be one Base64 feature, but it turns out there's actually a bunch of Base64 features, because you can have English text encoded as Base64, and that has a very different distribution of Base64 tokens than regular Base64. And there's some things about tokenization as well that it can exploit. And I dunno, there's all kinds of fun stuff.
- How difficult is the task of assigning labels to what's going on? Can this be automated by AI?
- Well, I think it depends on the feature, and it also depends on how much you trust your AI. So there's a lot of work doing automated interpretability. I think that's a really exciting direction, and we do a fair amount of automated interpretability and have Claude go and label our features.
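A hypothetical sketch of what that kind of automated labeling could look like with the Anthropic Python client: show a model the snippets that most strongly activate a feature and ask for a short label. The prompt, model name, and workflow here are illustrative guesses, not Anthropic's internal pipeline:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def label_feature(top_examples: list[str]) -> str:
        # Build a prompt from the feature's top-activating text snippets.
        prompt = (
            "Each snippet below strongly activates the same feature inside a "
            "language model. In a few words, what concept does the feature represent?\n\n"
            + "\n---\n".join(top_examples)
        )
        reply = client.messages.create(
            model="claude-3-5-sonnet-latest",   # placeholder model name
            max_tokens=50,
            messages=[{"role": "user", "content": prompt}],
        )
        return reply.content[0].text.strip()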
- Is there some funny moments where it's totally right or it's totally wrong?
- Yeah, well, I think it's very common that it says something very general, which is true in some sense, but not really picking up on the specifics of what's going on. So I think that's a pretty common situation. I don't know that I have a particularly amusing one.
- That's interesting, that little gap between being true and not quite getting to the deep nuance of a thing. That's a general challenge. It's truly an incredible accomplishment that it can say a true thing, but it's missing the depth sometimes. And in this context, it's like the ARC challenge, the sort of IQ type of tests. It feels like figuring out what a feature represents is a little puzzle you have to solve.
- Yeah, and I think that sometimes they're easier and sometimes they're harder as well. So yeah, I think that's tricky. And there's another thing which, I dunno, maybe in some ways this is my aesthetic coming in, but I'll try to give you a rationalization. You know, I'm actually a little suspicious of automated interpretability, and I think that's partly just that I want humans to understand neural networks, and if the neural network is understanding it for me, I don't quite like that. But I do have a bit of, you know, in some ways I'm sort of like the mathematicians who are like, if there's a computer-automated proof, it doesn't count.
- Right.
- You know, they won't understand it. But I do also think that there is this kind of "Reflections on Trusting Trust" type issue, where, you know, there's this famous talk about how, when you're writing a computer program, you have to trust your compiler, and if there was malware in your compiler, it could go and inject malware into the next compiler, and you'd be kind of in trouble, right? Well, if you're using neural networks to go and verify that your neural networks are safe, the hypothesis that you're testing for is, okay, well, the neural network maybe isn't safe, and you have to worry about, is there some way that it could be screwing with you? So I think that's not a big concern now, but I do wonder in the long run, if we have to use really powerful AI systems to go and audit our AI systems, is that actually something we can trust? But maybe I'm just rationalizing, 'cause I just want us to get to a point where humans understand everything.
- Yeah, I mean, that's hilarious, especially as we talk about AI safety and it looking for features that would be relevant to AI safety, like deception and so on. So let's talk about the "Scaling Monosemanticity" paper from May 2024. Okay, so what did it take to scale this, to apply it to Claude 3 Sonnet?
- Well, a lot of GPUs.
- A lot more GPUs, got it.
- But one of my teammates, Tom Henighan, was involved in the original scaling laws work, and something that he was interested in from very early on is, are there scaling laws for interpretability? And so something he immediately did when this work started to succeed, and we started to have sparse autoencoders work, was he became very interested in, what are the scaling laws for making sparse autoencoders larger? And how does that relate to making the base model larger? And so it turns out this works really well, and you can use it to project, if you train a sparse autoencoder at a given size, how many tokens should you train on, and so on. So this was actually a very big help to us in scaling up this work, and it made it a lot easier for us to go and train really large sparse autoencoders, where, you know, it's not like training the big models, but it's starting to get to a point where it's actually expensive to go and train the really big ones.
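A toy version of that kind of scaling-law projection (every number below is invented): fit a power law to a few (SAE width, training tokens) measurements and extrapolate to a larger run:

    import numpy as np

    widths = np.array([4_096, 16_384, 65_536])            # hypothetical SAE widths
    tokens = np.array([2e8, 8e8, 3.2e9])                  # hypothetical tokens needed

    # Fit log(tokens) = b * log(width) + log(a), i.e. tokens ~ a * width^b.
    slope, intercept = np.polyfit(np.log(widths), np.log(tokens), deg=1)
    predict = lambda w: np.exp(intercept) * w ** slope

    print(f"estimated tokens for a 1M-feature SAE: {predict(1_000_000):.2e}")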
- So you have this, I mean, you have to do all this stuff of splitting it across large GPUs...
- Oh yeah, no, I mean there's a huge engineering challenge here too, right? So there's a scientific question of how you scale things effectively, and then there's an enormous amount of engineering to go and scale this up. So you have to shard it, you have to think very carefully about a lot of things. I'm lucky to work with a bunch of great engineers, 'cause I am definitely not a great engineer.
- Yeah, and the infrastructure especially, yeah, for sure. So it turns out, TLDR, it worked.
- It worked, yeah. And I think this is important, because you could have imagined a world where, after "Towards Monosemanticity," you said, you know, Chris, this is great, it works on a one layer model, but one layer models are really idiosyncratic. Maybe the linear representation hypothesis and the superposition hypothesis are the right way to understand a one layer model, but not the right way to understand larger models. And so I think, I mean, first of all, the Cunningham et al. paper cut through that a little bit and suggested that this wasn't the case. But "Scaling Monosemanticity," I think, was significant evidence that even for very large models, and we did it on Claude 3 Sonnet, which at that point was one of our production models, even these models seem to be substantially explained, at least, by linear features. And doing dictionary learning on them works, and as you learn more features, you go and you explain more and more. So that's, I think, quite a promising sign. And you find now really fascinating abstract features. And the features are also multimodal. They respond to images and text for the same concept, which is fun.
- Yeah, can you explain that? I mean, like, you know, backdoor, there's just a lot of examples that you can...
- Yeah, so maybe let's start with one example, which is we found some features around security vulnerabilities and backdoors in code. So it turns out those are actually two different features. So there's a security vulnerability feature, and if you force it active, Claude will start to go and write security vulnerabilities like buffer overflows into code. And it also fires for all kinds of things. Like, some of the top dataset examples for it were things like "dash dash disable SSL" or something like this, which are sort of obviously really insecure.
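A sketch of what "forcing a feature active" could look like in code, assuming a trained sparse autoencoder like the earlier sketch and some Hugging Face-style transformer; the layer path, feature index, and scale are hypothetical placeholders:

    import torch

    feature_idx, scale = 1234, 8.0                          # hypothetical feature and strength
    direction = sae.decoder.weight[:, feature_idx].detach() # feature direction, shape (d_model,)

    def steer(module, inputs, output):
        # Add the feature's decoder direction to every token position's activation.
        return output + scale * direction

    # `model.transformer.h[10].mlp` is a placeholder path into some transformer.
    handle = model.transformer.h[10].mlp.register_forward_hook(steer)
    # ...generate text here and observe how the behavior changes...
    handle.remove()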
- So at this point, maybe it's just because the examples were presented that way, it's kind of like the more obvious examples, right? I guess the idea is that down the line, it might be able to detect more nuanced things, like deception or bugs or that kind of stuff.
- Yeah, well, I maybe wanna distinguish two things. So one is the complexity of the feature or the concept, right? And the other is the nuance of how subtle the examples we're looking at are, right? So when we show the top dataset examples, those are the most extreme examples that cause that feature to activate. And so it doesn't mean that it doesn't fire for more subtle things. So, you know, the insecure code feature, the stuff it fires for most strongly are these really obvious disable-the-security type things. But it also fires for buffer overflows and more subtle security vulnerabilities in code. You know, these features are all multimodal, so you could ask, what images activate this feature? And it turns out that the security vulnerability feature activates for images of people clicking in Chrome to go past the "this website's SSL certificate might be wrong" warning or something like this. Another thing that's very entertaining is the backdoors-in-code feature. You activate it, and Claude goes and writes a backdoor that will go and dump your data to a port or something. But you can ask, okay, what images activate the backdoor feature? It was devices with hidden cameras in them. So there's apparently a whole genre of people going and selling devices that look innocuous but have hidden cameras, and they have...
- That's great.
- ...this hidden camera in them. And I guess that is the physical version of a backdoor. And so it sort of shows you how abstract these concepts are, right? And I'm sort of sad that there's a whole market of people selling devices like that, but I was kind of delighted that that was the thing it came up with as the top image examples for the feature.
- Yeah, it's nice. It's multimodal, it's almost multi-context. It's a broad, strong definition of a singular concept. It's nice.
- Yeah.
- To me, one of the really interesting features, especially for AI safety, is deception and lying, and the possibility that these kinds of methods could detect lying in a model, especially as it gets smarter and smarter and smarter. Presumably that's a big threat of a super intelligent model, that it can deceive the people operating it as to its intentions or any of that kind of stuff. So what have you learned from detecting lying inside models?
- Yeah, so I think we're in some ways in early days for that. We find quite a few features related to deception and lying. There's one feature that fires for people lying and being deceptive, and if you force it active, Claude starts lying to you. So we have a deception feature. I mean, there's all kinds of other features about withholding information and not answering questions, features about power seeking and coups and stuff like that. So there's a lot of features that are related to kind of spooky things, and if you force them active, Claude will behave in ways that are not the kinds of behaviors you want.
- What are possible next exciting directions to you in the space of mech interp?
- Well, there's a lot of things. So for one thing, I would really like to get to a point where we have circuits, where we can really understand not just the features, but then use that to understand the computation of models. That really, for me, is the ultimate goal of this. And there's been some work, we put out a few things. There's a paper from Sam Marks that does some stuff like this. And there's been, I'd say, some work around the edges here. But I think there's a lot more to do, and I think that will be a very exciting thing. That's related to a challenge we call interference weights, where due to superposition, if you just naively look at where the features are connected together, there may be some weights that sort of don't exist in the upstairs model, but are just artifacts of superposition. So that's a sort of technical challenge related to that.
I think another exciting direction is just, you might think of sparse autoencoders as being kind of like a telescope. They allow us to look out and see all these features that are out there. And as we build better and better sparse autoencoders, get better and better at dictionary learning, we see more and more stars, and we zoom in on smaller and smaller stars. But there's kind of a lot of evidence that we're still only seeing a very small fraction of the stars. There's a lot of matter in our neural network universe that we can't observe yet. And it may be that we'll never be able to have fine enough instruments to observe it, and maybe some of it just isn't possible, isn't computationally tractable, to observe. So it's sort of a kind of dark matter, not maybe in the sense of modern astronomy, but of early astronomy, when we didn't know what this unexplained matter was. And so I think a lot about that dark matter and whether we'll ever observe it, and what that means for safety if we can't observe it, if some significant fraction of neural networks is not accessible to us. Another question that I think a lot about is, at the end of the day, mechanistic interpretability is this very microscopic approach to interpretability. It's trying to understand things in a very fine-grained way. But a lot of the questions we care about are very macroscopic. We care about these questions about neural network behavior, and I think that's the thing I care most about, but there's lots of other larger scale questions you might care about. And somehow the nice thing about having a very microscopic approach is that it's maybe easier to ask, is this true? But the downside is it's much further from the things we care about, and so we now have this ladder to climb. And I think there's a question of, will we be able to find larger scale abstractions that we can use to understand neural networks? Can we get up from this very microscopic approach?
- Yeah, you've written about this kind of organs question.
- Yeah, exactly.
- So if we think of interpretability as a kind of anatomy of neural networks, most of the circuits threads involve studying tiny little veins, looking at the small scale, at individual neurons and how they connect. However, there are many natural questions that the small scale approach doesn't address. In contrast, the most prominent abstractions in biological anatomy involve larger scale structures, like individual organs, like the heart, or entire organ systems, like the respiratory system. And so we wonder, is there a respiratory system or heart or brain region of an artificial neural network?
- Yeah, exactly. And I mean, if you think about science, right, a lot of scientific fields investigate things at many levels of abstraction. In biology you have molecular biology studying proteins and molecules and so on, and then you have cellular biology, and then you have histology studying tissues, and you have anatomy, and then you have zoology, and then you have ecology. And so you have many, many levels of abstraction. Or, you know, physics: maybe the physics of individual particles, and then statistical physics gives you thermodynamics and things like this. So you often have different levels of abstraction.
And I think that right now, mechanistic interpretability, if it succeeds, is sort of like a microbiology of neural networks, but we want something more like anatomy. And so a question you might ask is, why can't you just go there directly? And I think the answer is superposition, at least in significant part. It's that it's actually very hard to see this macroscopic structure without first breaking down the microscopic structure in the right way and then studying how it connects together. But I'm hopeful that there is gonna be something much larger than features and circuits, and that we're gonna be able to have a story that involves much bigger things, and then you can go and study in detail the parts you care about.
- I suppose, as opposed to neurobiology, like a psychologist or a psychiatrist of a neural network.
- And I think the beautiful thing would be if, rather than having disparate fields for those two things, you could build a bridge between them...
- Oh, right.
- ...such that you could go and have all of your higher level abstractions be grounded very firmly in this very solid, more rigorous, ideally, foundation.
- What do you think is the difference between the human brain, the biological neural network, and the artificial neural network?
- Well, the neuroscientists have a much harder job than us. You know, sometimes I just count my blessings by how much easier my job is than the neuroscientists', right? We can record from all the neurons. We can do that on arbitrary amounts of data. The neurons don't change while you're doing that, by the way. You can go and ablate neurons, you can edit the connections and so on, and then you can undo those changes. That's pretty great. You can intervene on any neuron and force it active and see what happens. You know which neurons are connected to everything, right? Neuroscientists wanna get the connectome; we have the connectome, and we have it for much bigger than C. elegans.
And then not only do we have the connectome, we know which neurons excite or inhibit each other, right? So it's not just that we know the binary mask, we know the weights. We can take gradients, we know computationally what each neuron does. I don't know, the list goes on and on. We just have so many advantages over neuroscientists. And then, despite having all those advantages, it's really hard. And so one thing I do sometimes think is, gosh, if it's this hard for us, it seems impossible under the constraints of neuroscience, or near impossible. I don't know, maybe part of me is like, I've got a few neuroscientists on my team, maybe I'm sort of like, ah, you know, maybe some of the neuroscientists would like to have an easier problem that's still very hard, and they could come and work on neural networks. And then after we figure out things in sort of the easy little pond of trying to understand neural networks, which is still very hard, then we could go back to biological neuroscience.
- I love what you've written about the goal of mech interp research as two goals: safety and beauty. So can you talk about the beauty side of things?
- Yeah, so there's this funny thing where I think some people are kind of disappointed by neural networks, where they're like, ah, you know, neural networks, it's just these simple rules, and then you just do a bunch of engineering to scale it up and it works really well. And like, where's the complex ideas? You know, this isn't a very nice, beautiful scientific result. And I sometimes think, when people say that, I picture them being like, you know, evolution is so boring. It's just a bunch of simple rules, and you run evolution for a long time and you get biology. Like, what a sucky way for biology to have turned out. Where's the complex rules? But the beauty is that the simplicity generates complexity. Biology has these simple rules, and it gives rise to all the life and ecosystems that we see around us, all the beauty of nature. That all just comes from evolution, from something very simple, evolution. And similarly, I think that neural networks create enormous complexity and beauty and structure inside themselves that people generally don't look at and don't try to understand, because it's hard to understand. But I think there is an incredibly rich structure to be discovered inside neural networks, a lot of very deep beauty, if we're just willing to take the time to go and see it and understand it.
- Yeah, I love mech interp. The feeling that we are understanding, or getting glimpses of understanding, the magic that's going on inside is really wonderful.
- It feels to me like one of the questions that's just calling out to be asked, and I'm sort of, I mean, a lot of people are thinking about this, but I'm often surprised that not more are, is: how is it that we don't know how to directly create computer programs that can do these things, and yet we have these amazing neural networks that can do all these amazing things? It just feels like that is obviously the question that's calling out to be answered if you have any degree of curiosity. It's like, how is it that humanity now has these artifacts that can do these things that we don't know how to do?
- Yeah, I love the image of the circuits reaching towards the light of the objective function.
- Yeah, it's just this organic thing that we've grown, and we have no idea what we've grown.
- Well, thank you for working on safety, and thank you for appreciating the beauty of the things you discover. And thank you for talking today, Chris. This was wonderful.
- Yeah, thank you for taking the time to chat as well.
- Thanks for listening to this conversation with Chris Olah, and before that with Dario Amodei and Amanda Askell. To support this podcast, please check out our sponsors in the description. And now, let me leave you with some words from Alan Watts: "The only way to make sense out of change is to plunge into it, move with it, and join the dance." Thank you for listening, and hope to see you next time.