Tree of Thoughts - Deliberate Problem Solving with Large Language Models (Full Paper Review)
**A General Problem Solver that Explores its Own Thoughts**
The concept of a general problem solver that explores its own thoughts and guides its own exploration with deliberate reasoning as heuristics is an exciting direction in artificial intelligence. This idea involves implementing algorithms that can think critically about their own thought processes, allowing them to improve themselves over time. In this context, the development of a general problem solver is crucial for creating intelligent machines that can solve complex problems.
**Implementation and Vocabulary**
The current implementation of the language model is still in its early stages. While it has shown promise in solving crossword puzzles, there are limitations to its performance. The model relies heavily on explicit prompts and constraints to guide its reasoning, which limits its ability to explore and find innovative solutions. To move forward, the development team needs to focus on creating a more general problem solver that can think critically about its own thoughts.
**The Importance of Heuristics**
Heuristics play a crucial role in the development of a general problem solver. By incorporating heuristics into the algorithm, it is possible to create a model that can reason effectively and make good decisions. This approach allows the model to explore different solutions and evaluate their effectiveness, allowing it to learn from its mistakes and improve over time.
**The Need for Deliberate Reasoning**
Deliberate reasoning is essential for creating a general problem solver that can think critically about its own thoughts. By incorporating deliberate reasoning into the algorithm, it is possible to create a model that can evaluate its own performance and make adjustments accordingly. This approach allows the model to learn from its mistakes and improve over time, making it a more effective problem solver.
**The Role of Meta-Prompts**
Meta-prompts play a critical role in the development of a general problem solver. A meta-prompt is a prompt that encourages the model to think critically about its own thoughts and evaluate its performance. By incorporating meta-prompts into the algorithm, it is possible to create a model that can reason effectively and make good decisions.
**The Future of General Problem Solving**
The future of general problem solving holds much promise for artificial intelligence. With the development of more advanced algorithms and models, it is possible to create machines that can solve complex problems with ease. The integration of language models with classic algorithms will be crucial in creating intelligent machines that can think critically about their own thoughts.
**Ablation Studies**
Ablation studies have shown that pruning and backtracking are essential components of the algorithm. By removing these components, the model's performance degrades significantly. However, despite this degradation, the model still performs better than before, demonstrating the importance of pruning and backtracking in the algorithm.
**The Impact on Total Games Solved**
The impact of ablation studies on total games solved is significant. When pruning is removed, the model solves fewer total games, indicating that it relies heavily on this component to solve problems effectively. This suggests that the development team needs to focus on improving the pruning and backtracking components of the algorithm.
**Conclusion**
In conclusion, the development of a general problem solver that explores its own thoughts and guides its own exploration with deliberate reasoning as heuristics is an exciting direction in artificial intelligence. While there are limitations to the current implementation, the potential for growth and improvement is vast. By incorporating meta-prompts, ablation studies, and deliberate reasoning into the algorithm, it is possible to create a model that can reason effectively and make good decisions. The future of general problem solving holds much promise for artificial intelligence, and the integration of language models with classic algorithms will be crucial in creating intelligent machines that can think critically about their own thoughts.
"WEBVTTKind: captionsLanguage: enhi there today I thought we'd just give a quick look at this paper tree of thoughts deliberate problem solving with large language models in summary this paper proposes a sort of decoding technique like a way to use large language models where you don't just ask them once what they think and try to structure your prompts really smartly like something like Chain of Thought but instead you do an explicit tree search over outputs of the language model with the language model itself valuing these three states and therefore being able to Branch off and backtrack and so on it turns out this can help for tasks where such a a pattern of investigating the task is really helpful so the paper proposes the decoding technique and proposes some new tasks where they expect the decoding technique to work really well and then the decoding technique works really well on those tasks make of that as you will it's an interesting idea I do think it's like a small step into a new Direction Where We mix language models with essentially with programming with algorithms so I'm pretty excited for that and this paper is a step into that direction it's by people from Princeton University and Google deepmind and will go right into it so the main proposition is this one right here it's if you simply prompt a language model even if you prompt it really well you say well you are a super helpful helper you're so helpful you just help me and everything and um I I want you to make to do this task that's sort of called what they call here input output prompting input output means you specify the task and optionally you may also specify an output format so you may say to a language model hey I have this task right here I want to write an email to my boss please write it in the following format first write Dear boss then write the text in between and at the end write the signature more commonly if you want models to for example output Json as a response you say you know don't give me a textual answer only respond using Json formatted output or even more commonly if you want the model to do a classification task for example you say don't answer in text answer with one of the following four options right just output the word and then you give the four classes and then you rely on the model only outputting that there are other techniques like restricted decoding that can enforce this stuff and so on but in general input output prompting simply says you ask the language model what you want it to do and optionally you give a you give a um a specif a specifier for how the output should look like in textual form um that doesn't also optional in here you can do some in-context examples if you want so all of this is like like this this is like prompting this is like standard prompting uh today then you have Chain of Thought Chain of Thought prompting is a different prompting technique where you instruct the model to explicitly make intermediate steps so you say you know um please invent please write a song about a little bird right and please do that in steps so please first make a like an overall plan for the song and output that into individual faults so you would instruct the model not only to have to have a prompt that's too fat you would instruct it not only to have to have a prompt uh followed by followed by an answer but you would instruct it to Output its thoughts so prom goes here and then you say you know write your thoughts on each line starting with like T and then the model is supposed to put its thoughts right here and its thoughts right here and its thoughts right here and then at the end answer so you have to input output prompt for this structure right here um you have to tell it please do this but it turns out if you do that if you do this Chain of Thought prompting your instructor model to explicitly write down the intermediate steps of problem solving the problem is going to be solved better than if you just ask the model to just provide you the answer it turns out and I think that is not there are hypotheses around why that is but it's I believe it's not yet fully understood why exactly the Chain of Thought prompting helps but hypotheses are that it gives the model kind of a scratch Pad um like right here in order to write down its thoughts and then the next thoughts can refer back to the previous explain illicit thoughts and not everything has to happen sort of in the weights the second opinion is that it just gives the model sort of a longer time to compute like it can since it decodes more tokens it can sort of invest more compute into a given problem and you're leading it into the correct Direction with these thoughts multiple hypotheses in any case the next iteration that this paper considers is self-consistency with Chain of Thought this essentially mixes jaina fought with voting so you just do Chain of Thought multiple times you just kind of sample multiple times and then you majority vote on the output that obviously is only possible where you have some sort of a classification task um where you can assemble a majority vote there's also another concept that's not mentioned here but that is sort of iterative refinement which comes in later in the paper what you can always do is you can always just append a prompt that says consider your last answer how how might you improve it or consider your last unders do you think it's correct if not please improve it and you can sort of add that onto any of these techniques so there's lots of these techniques this paper here considers the tree of thoughts and the tree of thoughts is here contraposed to the chain of thoughts as in you can see it is in fact a tree so you have nodes and you have nodes branching out and some notes are kind of abandoned which here in the red and then some nodes are continue to be expanded here in the green so this represents a language model that I ask something and then similarly to Chain of Thought I ask it to Output its faults but I can do that multiple times so three times I go to the language model and just ask it to Output the first thought in the problem solving step just the first one not the whole problem just the first one and those that gives me the three thoughts for example right here one two three then then as a second step so I finish that step as a second step I take the language model and I use it to self-critique all of these thoughts right here with respect to the input prompt so I ask it something like hey what you just output here as a thought as an intermediate step do you think that's a good step towards solving the original problem and this relies on the fact that a lot of these models like the language models but machine learning in general is much better at evaluating whether two things fit together then generating a new thing that's just it just turns out to be like that it kind of makes sense because you only need to Output a number or a value or something like this rather than generating something it is the case so even if you use the same model to self-critique you'll get a better signal for the critic than you get for the generation so that's why it does make sense that after creating something you go and you ask the same model to consider what it just did and how well it fits so the model for example you you'll give it all of these tasks let me grab you'll you'll give it all of these tasks and then you'll ask it to consider and maybe it will say this one here is really bad this one's kinda good and this one's like the best you can do that in multiple ways you can just give it ask it to give you a score or a label from like good medium bad for everyone or you can put everything into context and then say hey which one of these is the best please vote for the best and you just sample that multiple times so you get like a non-noisy signal and then what you can do is you can simply say well this one here the model thinks is really bad so I'll discard that and I'll just continue with the ones where I'm more confident then let's say we we consider the middle one right here so we've eliminated the one on the right hand side and we consider it the middle state right here again I sample maybe this time four different thoughts um and again I ask the model um hey what do you think of those four thoughts and now here the the the drawing is a bit wonky I think in that it doesn't display the algorithm like this this note right here should probably be well I don't know um but let's say we do that and let's assume that this here isn't like full green it's also red so it kind of says well none of these are really good towards the goal so it's like no no no no no let's just assume the critic says all of these are bad in that case what we do is we'd actually backtrack we'd go up here and go down to the next best top level State and go from that and say well okay here is the prompt and here is the thought you output for this particular node Now give me the next thought right as we did over here but we found that none of these continuations made any sense so we try a different branch go over here and now we maybe output here it's just one but we may be again output four of them we prune away three but one of them is actually good then we continue and so on so I hope you can see this is a very classic Tree Search right it's a tree search you can do this this breath first or depth first um but it's a tree search essentially with pruning an ordered tree search if you will um according to the critics value so we always go and expand the node of the tree that has the highest current value assigned to it this is very similar to yeah things like a star search or yeah various various forms of tree search they do say they keep it simple right here and leave more advanced algorithms such as Monte Carlo to research and all of that for later so I hope it's kind of clear what's happening um they themselves formulate it into two different so into two different things um you have a thought generator that's just generating one thought at a time so one intermediate step of the problem solving that you ask the language model based on the input and the previous thoughts you just say please make one intermediate step don't solve the problem completely just wake make one little step and explicitly write down the result of that step that's a thought in in Co in parlance um so they say that's so we can either sample or propose those which means that we can go to the language model three times and Sample three times for three thoughts or what we can do is we can just say please give me three thoughts and then it outputs a list again we might have to do i o prompting to get the correct output format but we can we can do that they say um one for example this one right here you might want to do if you generate some story or something because two samples are extremely unlikely to be equal this one right here is more appropriate when you have short proposals but they're more constrained so you want diversity you don't want stuff to repeat itself but in any case you generate Thoughts by sampling and then you have a state evaluator where you simply ask the model how good do you think that is on either way that's value or vote please do you give it all the all the thoughts that have been output and you say which one of these is the best and then you count the votes with voting it might be a bit more tricky to do sort of the backtrack like to compare nodes globally um you might have to do another voting because in a different branch the the winner node might have a completely different value than in this Branch over here but they they're both winners of their respective votes but you can Implement that in in various ways um again they can mix that with BF oh sorry with BFS or DFS um just the like in what in what order do they consider things for um things for expansion and as we said we can also do that globally but for example in a DFS they would go they would have some kind of Maximum steps that's that's fine um they would sort all the candidates they they have available if one of the candidates is above the value threshold they expanded so they go down down until no node like until all of these nodes in our example were like if all of that these nodes are red they say are no notice above the value threshold so we backtrack um this is it's not a like a global expansion as I mentioned it I guess that would be a step further foreign so I hope the I hope the um the overall picture here is relatively clear now let's go to what they research is on or what they um evaluate this on so this compared to Chain of Thought right Chain of Thought you can Implement in two ways um one way is to explicitly always sample the next thought but you might as well just input in say you know put out all your thoughts in one go and then give me the answer so one prompt one sampling because it's linear anyway right you just want the model to Output a linear sequence of things and therefore you might as well sample it at once and even the the self-consistency right here it's just sampling multiple times in parallel whereas this thing the big difference is that you actively have to stop after each step um like sample three thoughts stop evaluate dot One Stop evaluate thought to stop evaluate thought three star off and then you have to decide which one is the best one and then from that point on you go to the language model again and say you know here is input and one thought number one now give me um three new thought number twos and then and so on so this leads to a lot more evaluations I want to say of the language model but I'm pretty I'm pretty excited that maybe in the future we can just include this into our programming language and say you know this piece here I don't want to call like a function that does something I just want the the language model to take care of this particular small part and the rest I program around it and as we'll see that's kind of the spirit of the experiments even though I think they they what they want to go for is like a general Problem Solver I I don't think this goes into the direction of a general Problem Solver I think this goes into the direction of like including it into programming but the first um there are three tasks to evaluate on one is this game of 24. you have four numbers uh for example this these four numbers right here and you're asked to come up with a mathematical expression that results in 24 right um so they what they have to do is they have to prompt it explicitly right um to to say okay give me possible next steps all right here are possible next steps then they parse that um and they have this prompt right here evaluative given numbers can reach 24 into these things again with examples like with few short examples and then um I I think with few short examples yes and then the model outputs something like good bad impossible so this here is the thought generation and this here is the evaluation and yes you can you can reach like that you can get the language model to solve these things and you can probably do them better however however they um have to prompt it really specifically here in order to do that which is fine for research right but the the prompts are really like like you would program the algorithm except that the one part is really taken care of by the language model but the prompts are so specific to the problems that they almost and this is this is really so here we have okay we have creative writing and in creative writing you can see it's not that big of a difference like if you if you if you look right here the IR prompting versus the trio fault like sure it's higher but this is I think um gpt4 evaluation and this is human evaluation then sure this one here is higher um but it's like not that much of a difference and especially if you have these refine prompts then you get up there too so it probably helps more in these algorithmic tasks and this here is like mini crosswords so you have a grid of five by five let's do just do three by three for demonstration purposes and it's all letters and it's a crossword puzzle so you for example here you'd have a word um like let's say you have ape ape okay and then here you have a clue like the life form I don't know like like human 's are I'm I'm a terrible crossword Q generator or something like this um or like animal with long arms that lives in trees I don't okay there's there's cues right here you know what crossword puzzles are so um and then here is like Poe or something like this and then here it says like poet Edgar Edgar Allan and okay so you have these cues and you can input that into a language model and the language model can output an answer whether it's correct or not right that's the question that's the task but you can clearly see the task would profit a lot from you being able to do this backtracking because you know fill in a word that you think might be correct and you fill in another word and all of a sudden you realize ah that doesn't work out do you kind of cross out the ones again that you previously filled in try some other ones this is extremely handy like in this problem like a backtracking tree search is extremely handy that's why they evaluated on it they're not shy of saying look we evaluate the things on the task where we think it's going to benefit but that that also should tell you that this method um is probably going to to shine well in such tasks and it's yet to be seen in like normal everyday usage how well this does so you can see that in the input output prompting in the baselines they say we provide five example pairs in the Chain of Thought prompt We additionally include intermediate words in the order of this or this we run each prompt 10 samples and average the result so Chain of Thought prompt saying you know the intermediates are you know the intermediate words right here you don't have to Output the full puzzle at one point just intermediate ones and then at the end give me the result the tree of thought setup however is much more integral in intricate um so they use depth first with with tree of thoughts um keeps exploring the most promising subsequent word clue until this state is no longer promising then backtrack to the parent state to explore alternative thoughts that's the DFS tree of thoughts algorithm they presented above they say okay to make search more tractable subsequent thoughts are constrained not to change any filled words or letters so that the Toots that has at most 10 intermediate steps for at each state we translate all existing thoughts for the state into letter constraints for the remaining cues like this one right here and prompt a proposal five times to come up with candidates what to fill in um for State evaluations we translate each state into letter constraints for remaining Clues then evaluate for each clue if it is possible to fill it and then yeah so what I mean right here is that they help a lot right and in fact which isn't bad and you know the only criticism I would have right here is that these these things all the oh we translated into letter constraints and so on um I guess it would be possible to help the Chain of Thought prompting like the Baseline uh a bit by doing that as well I feel like that should be possible and I feel like for a fair evaluation that should probably be attempted but in any case you can see they help a lot they they help this process a lot to make it really sure that you know the the model um here you know you can see like here it goes down this but then the state evaluation says oh no actually this is probably impossible to solve right now not that you filled in these words and you go back over here and try a different route so you can see this this Salon here wasn't was probably not appropriate um However the fact that they help a lot essentially they implement the crossword solving algorithm right they say in the text hey our goal isn't they say it's something like this our goal we know they say we know that there are algorithms to solve this the goal is not just to solve the task as more General crosswords can be readily solved with specialized NLP pipelines that leverage large-scale retrieval instead of language models okay so the they say you know our we want to do a general Problem Solver that explores its own thoughts and guides its own Exploration with deliberate reasoning as heuristics yet still they essentially implement the crossword solving algorithm implementing all the three search and the constraints and and so on um and helping the language model to such a degree that I think well they essentially just replaced the lookup from the vocabulary with the language model so what we have right here is kind of a random word sampler because all of the rest is essentially implemented um by by the the algorithm itself and by the constraints they give again this is not bad but it does mean to me that the way I see this as I said is in programming so in programming I could have my code yada yada do this do that do this and then here instead of calling a function like f uh that function would not be somewhere in my code but that particular function would be sort of maybe a language model doing something somewhere and then I could Implement something like a DFS or so um and try to call that function as part of it but I don't see this at the level yet of what I think they want to do what they want to do is to have this General thing that looks at its own thoughts and explorers and backtracks and so on but for that in order to show that that's possible The Next Step the next paper following up on that really has to go away from these explicit prompts for the problems and really has to do it such that um there is one prompt like I should I should be able to have one prompt initially like one surrounding prompt I can describe the problem a bit sure but I should have the one prompt and then the intermediate steps they should all be governed by one prompt right not by explicit prompts saying hey uh here is the here is the decoding constraints and so on now give thoughts about this and we parse it for you even with the math problems they like parse it intermittently and so on um none of that should happen it should just be one prompt that essentially generically says consider your previous thought you know how good do you think it is and so on um it could even be a meta prompt saying if you were to ask yourself how good this thought is how would you evaluate it and then use that as a prompt right but that really has to be the next step in order to make this a viable General Problem Solver until then I see this as a cool thing that we could use inside of algorithms part of algorithms um but that's a very different direction what is good to see is they do a lot of ablations uh on you know what the pruning and the backtrack and so on gives so this here is the crossword results you can see the i o prompting in Chain of Thought prompting they barely managed to solve full games they sometimes have word and letter successes but not that many the tree of thoughts uh obviously is much better and if you they say this is pretty cool they say hey if we heuristically if we Oracle we know which words go where right so if we always at valuation time so when the model criticizes itself and selects the best thought if at that time we always tell the model what the best um thought is then it goes up even more interestingly it doesn't go up that much right it just gets like that last bit to solve many more games than it could previously solve but the word success and the letter success rate don't go up that much that's pretty interesting um so it just kind of helps it to be extra consistent right and um yeah if they take away the pruning like if they never prune um or if they take away the backtracking so if they always just go down never backtrack and go to another Branch you can see that the performance it decorates again interestingly it doesn't degrade too much in sort of the success rate it does degrade a bit but okay it does the great I guess I guess the numbers are fairly big but the real killer is it solves a lot less total games so the total games are kind of an indication of if you make an a mistake like somewhere then you might get a lot of the words correct but the total game won't be solved um and so the total games is kind of an indication of how well that planning and so on works all right that is what I wanted to say about this paper all in all I think it is it is pretty cool um it is definitely a good direction I feel there's a lot that can be done on top of this to make it more intricate and I'm excited for a future where I'm obviously excited for the auto GPT style that is thought of here but I'm also quite excited for a future where these these things are just part of algorithms like classic algorithms mixed with language models I think that's an interesting World let me know what you think that was it for me bye-byehi there today I thought we'd just give a quick look at this paper tree of thoughts deliberate problem solving with large language models in summary this paper proposes a sort of decoding technique like a way to use large language models where you don't just ask them once what they think and try to structure your prompts really smartly like something like Chain of Thought but instead you do an explicit tree search over outputs of the language model with the language model itself valuing these three states and therefore being able to Branch off and backtrack and so on it turns out this can help for tasks where such a a pattern of investigating the task is really helpful so the paper proposes the decoding technique and proposes some new tasks where they expect the decoding technique to work really well and then the decoding technique works really well on those tasks make of that as you will it's an interesting idea I do think it's like a small step into a new Direction Where We mix language models with essentially with programming with algorithms so I'm pretty excited for that and this paper is a step into that direction it's by people from Princeton University and Google deepmind and will go right into it so the main proposition is this one right here it's if you simply prompt a language model even if you prompt it really well you say well you are a super helpful helper you're so helpful you just help me and everything and um I I want you to make to do this task that's sort of called what they call here input output prompting input output means you specify the task and optionally you may also specify an output format so you may say to a language model hey I have this task right here I want to write an email to my boss please write it in the following format first write Dear boss then write the text in between and at the end write the signature more commonly if you want models to for example output Json as a response you say you know don't give me a textual answer only respond using Json formatted output or even more commonly if you want the model to do a classification task for example you say don't answer in text answer with one of the following four options right just output the word and then you give the four classes and then you rely on the model only outputting that there are other techniques like restricted decoding that can enforce this stuff and so on but in general input output prompting simply says you ask the language model what you want it to do and optionally you give a you give a um a specif a specifier for how the output should look like in textual form um that doesn't also optional in here you can do some in-context examples if you want so all of this is like like this this is like prompting this is like standard prompting uh today then you have Chain of Thought Chain of Thought prompting is a different prompting technique where you instruct the model to explicitly make intermediate steps so you say you know um please invent please write a song about a little bird right and please do that in steps so please first make a like an overall plan for the song and output that into individual faults so you would instruct the model not only to have to have a prompt that's too fat you would instruct it not only to have to have a prompt uh followed by followed by an answer but you would instruct it to Output its thoughts so prom goes here and then you say you know write your thoughts on each line starting with like T and then the model is supposed to put its thoughts right here and its thoughts right here and its thoughts right here and then at the end answer so you have to input output prompt for this structure right here um you have to tell it please do this but it turns out if you do that if you do this Chain of Thought prompting your instructor model to explicitly write down the intermediate steps of problem solving the problem is going to be solved better than if you just ask the model to just provide you the answer it turns out and I think that is not there are hypotheses around why that is but it's I believe it's not yet fully understood why exactly the Chain of Thought prompting helps but hypotheses are that it gives the model kind of a scratch Pad um like right here in order to write down its thoughts and then the next thoughts can refer back to the previous explain illicit thoughts and not everything has to happen sort of in the weights the second opinion is that it just gives the model sort of a longer time to compute like it can since it decodes more tokens it can sort of invest more compute into a given problem and you're leading it into the correct Direction with these thoughts multiple hypotheses in any case the next iteration that this paper considers is self-consistency with Chain of Thought this essentially mixes jaina fought with voting so you just do Chain of Thought multiple times you just kind of sample multiple times and then you majority vote on the output that obviously is only possible where you have some sort of a classification task um where you can assemble a majority vote there's also another concept that's not mentioned here but that is sort of iterative refinement which comes in later in the paper what you can always do is you can always just append a prompt that says consider your last answer how how might you improve it or consider your last unders do you think it's correct if not please improve it and you can sort of add that onto any of these techniques so there's lots of these techniques this paper here considers the tree of thoughts and the tree of thoughts is here contraposed to the chain of thoughts as in you can see it is in fact a tree so you have nodes and you have nodes branching out and some notes are kind of abandoned which here in the red and then some nodes are continue to be expanded here in the green so this represents a language model that I ask something and then similarly to Chain of Thought I ask it to Output its faults but I can do that multiple times so three times I go to the language model and just ask it to Output the first thought in the problem solving step just the first one not the whole problem just the first one and those that gives me the three thoughts for example right here one two three then then as a second step so I finish that step as a second step I take the language model and I use it to self-critique all of these thoughts right here with respect to the input prompt so I ask it something like hey what you just output here as a thought as an intermediate step do you think that's a good step towards solving the original problem and this relies on the fact that a lot of these models like the language models but machine learning in general is much better at evaluating whether two things fit together then generating a new thing that's just it just turns out to be like that it kind of makes sense because you only need to Output a number or a value or something like this rather than generating something it is the case so even if you use the same model to self-critique you'll get a better signal for the critic than you get for the generation so that's why it does make sense that after creating something you go and you ask the same model to consider what it just did and how well it fits so the model for example you you'll give it all of these tasks let me grab you'll you'll give it all of these tasks and then you'll ask it to consider and maybe it will say this one here is really bad this one's kinda good and this one's like the best you can do that in multiple ways you can just give it ask it to give you a score or a label from like good medium bad for everyone or you can put everything into context and then say hey which one of these is the best please vote for the best and you just sample that multiple times so you get like a non-noisy signal and then what you can do is you can simply say well this one here the model thinks is really bad so I'll discard that and I'll just continue with the ones where I'm more confident then let's say we we consider the middle one right here so we've eliminated the one on the right hand side and we consider it the middle state right here again I sample maybe this time four different thoughts um and again I ask the model um hey what do you think of those four thoughts and now here the the the drawing is a bit wonky I think in that it doesn't display the algorithm like this this note right here should probably be well I don't know um but let's say we do that and let's assume that this here isn't like full green it's also red so it kind of says well none of these are really good towards the goal so it's like no no no no no let's just assume the critic says all of these are bad in that case what we do is we'd actually backtrack we'd go up here and go down to the next best top level State and go from that and say well okay here is the prompt and here is the thought you output for this particular node Now give me the next thought right as we did over here but we found that none of these continuations made any sense so we try a different branch go over here and now we maybe output here it's just one but we may be again output four of them we prune away three but one of them is actually good then we continue and so on so I hope you can see this is a very classic Tree Search right it's a tree search you can do this this breath first or depth first um but it's a tree search essentially with pruning an ordered tree search if you will um according to the critics value so we always go and expand the node of the tree that has the highest current value assigned to it this is very similar to yeah things like a star search or yeah various various forms of tree search they do say they keep it simple right here and leave more advanced algorithms such as Monte Carlo to research and all of that for later so I hope it's kind of clear what's happening um they themselves formulate it into two different so into two different things um you have a thought generator that's just generating one thought at a time so one intermediate step of the problem solving that you ask the language model based on the input and the previous thoughts you just say please make one intermediate step don't solve the problem completely just wake make one little step and explicitly write down the result of that step that's a thought in in Co in parlance um so they say that's so we can either sample or propose those which means that we can go to the language model three times and Sample three times for three thoughts or what we can do is we can just say please give me three thoughts and then it outputs a list again we might have to do i o prompting to get the correct output format but we can we can do that they say um one for example this one right here you might want to do if you generate some story or something because two samples are extremely unlikely to be equal this one right here is more appropriate when you have short proposals but they're more constrained so you want diversity you don't want stuff to repeat itself but in any case you generate Thoughts by sampling and then you have a state evaluator where you simply ask the model how good do you think that is on either way that's value or vote please do you give it all the all the thoughts that have been output and you say which one of these is the best and then you count the votes with voting it might be a bit more tricky to do sort of the backtrack like to compare nodes globally um you might have to do another voting because in a different branch the the winner node might have a completely different value than in this Branch over here but they they're both winners of their respective votes but you can Implement that in in various ways um again they can mix that with BF oh sorry with BFS or DFS um just the like in what in what order do they consider things for um things for expansion and as we said we can also do that globally but for example in a DFS they would go they would have some kind of Maximum steps that's that's fine um they would sort all the candidates they they have available if one of the candidates is above the value threshold they expanded so they go down down until no node like until all of these nodes in our example were like if all of that these nodes are red they say are no notice above the value threshold so we backtrack um this is it's not a like a global expansion as I mentioned it I guess that would be a step further foreign so I hope the I hope the um the overall picture here is relatively clear now let's go to what they research is on or what they um evaluate this on so this compared to Chain of Thought right Chain of Thought you can Implement in two ways um one way is to explicitly always sample the next thought but you might as well just input in say you know put out all your thoughts in one go and then give me the answer so one prompt one sampling because it's linear anyway right you just want the model to Output a linear sequence of things and therefore you might as well sample it at once and even the the self-consistency right here it's just sampling multiple times in parallel whereas this thing the big difference is that you actively have to stop after each step um like sample three thoughts stop evaluate dot One Stop evaluate thought to stop evaluate thought three star off and then you have to decide which one is the best one and then from that point on you go to the language model again and say you know here is input and one thought number one now give me um three new thought number twos and then and so on so this leads to a lot more evaluations I want to say of the language model but I'm pretty I'm pretty excited that maybe in the future we can just include this into our programming language and say you know this piece here I don't want to call like a function that does something I just want the the language model to take care of this particular small part and the rest I program around it and as we'll see that's kind of the spirit of the experiments even though I think they they what they want to go for is like a general Problem Solver I I don't think this goes into the direction of a general Problem Solver I think this goes into the direction of like including it into programming but the first um there are three tasks to evaluate on one is this game of 24. you have four numbers uh for example this these four numbers right here and you're asked to come up with a mathematical expression that results in 24 right um so they what they have to do is they have to prompt it explicitly right um to to say okay give me possible next steps all right here are possible next steps then they parse that um and they have this prompt right here evaluative given numbers can reach 24 into these things again with examples like with few short examples and then um I I think with few short examples yes and then the model outputs something like good bad impossible so this here is the thought generation and this here is the evaluation and yes you can you can reach like that you can get the language model to solve these things and you can probably do them better however however they um have to prompt it really specifically here in order to do that which is fine for research right but the the prompts are really like like you would program the algorithm except that the one part is really taken care of by the language model but the prompts are so specific to the problems that they almost and this is this is really so here we have okay we have creative writing and in creative writing you can see it's not that big of a difference like if you if you if you look right here the IR prompting versus the trio fault like sure it's higher but this is I think um gpt4 evaluation and this is human evaluation then sure this one here is higher um but it's like not that much of a difference and especially if you have these refine prompts then you get up there too so it probably helps more in these algorithmic tasks and this here is like mini crosswords so you have a grid of five by five let's do just do three by three for demonstration purposes and it's all letters and it's a crossword puzzle so you for example here you'd have a word um like let's say you have ape ape okay and then here you have a clue like the life form I don't know like like human 's are I'm I'm a terrible crossword Q generator or something like this um or like animal with long arms that lives in trees I don't okay there's there's cues right here you know what crossword puzzles are so um and then here is like Poe or something like this and then here it says like poet Edgar Edgar Allan and okay so you have these cues and you can input that into a language model and the language model can output an answer whether it's correct or not right that's the question that's the task but you can clearly see the task would profit a lot from you being able to do this backtracking because you know fill in a word that you think might be correct and you fill in another word and all of a sudden you realize ah that doesn't work out do you kind of cross out the ones again that you previously filled in try some other ones this is extremely handy like in this problem like a backtracking tree search is extremely handy that's why they evaluated on it they're not shy of saying look we evaluate the things on the task where we think it's going to benefit but that that also should tell you that this method um is probably going to to shine well in such tasks and it's yet to be seen in like normal everyday usage how well this does so you can see that in the input output prompting in the baselines they say we provide five example pairs in the Chain of Thought prompt We additionally include intermediate words in the order of this or this we run each prompt 10 samples and average the result so Chain of Thought prompt saying you know the intermediates are you know the intermediate words right here you don't have to Output the full puzzle at one point just intermediate ones and then at the end give me the result the tree of thought setup however is much more integral in intricate um so they use depth first with with tree of thoughts um keeps exploring the most promising subsequent word clue until this state is no longer promising then backtrack to the parent state to explore alternative thoughts that's the DFS tree of thoughts algorithm they presented above they say okay to make search more tractable subsequent thoughts are constrained not to change any filled words or letters so that the Toots that has at most 10 intermediate steps for at each state we translate all existing thoughts for the state into letter constraints for the remaining cues like this one right here and prompt a proposal five times to come up with candidates what to fill in um for State evaluations we translate each state into letter constraints for remaining Clues then evaluate for each clue if it is possible to fill it and then yeah so what I mean right here is that they help a lot right and in fact which isn't bad and you know the only criticism I would have right here is that these these things all the oh we translated into letter constraints and so on um I guess it would be possible to help the Chain of Thought prompting like the Baseline uh a bit by doing that as well I feel like that should be possible and I feel like for a fair evaluation that should probably be attempted but in any case you can see they help a lot they they help this process a lot to make it really sure that you know the the model um here you know you can see like here it goes down this but then the state evaluation says oh no actually this is probably impossible to solve right now not that you filled in these words and you go back over here and try a different route so you can see this this Salon here wasn't was probably not appropriate um However the fact that they help a lot essentially they implement the crossword solving algorithm right they say in the text hey our goal isn't they say it's something like this our goal we know they say we know that there are algorithms to solve this the goal is not just to solve the task as more General crosswords can be readily solved with specialized NLP pipelines that leverage large-scale retrieval instead of language models okay so the they say you know our we want to do a general Problem Solver that explores its own thoughts and guides its own Exploration with deliberate reasoning as heuristics yet still they essentially implement the crossword solving algorithm implementing all the three search and the constraints and and so on um and helping the language model to such a degree that I think well they essentially just replaced the lookup from the vocabulary with the language model so what we have right here is kind of a random word sampler because all of the rest is essentially implemented um by by the the algorithm itself and by the constraints they give again this is not bad but it does mean to me that the way I see this as I said is in programming so in programming I could have my code yada yada do this do that do this and then here instead of calling a function like f uh that function would not be somewhere in my code but that particular function would be sort of maybe a language model doing something somewhere and then I could Implement something like a DFS or so um and try to call that function as part of it but I don't see this at the level yet of what I think they want to do what they want to do is to have this General thing that looks at its own thoughts and explorers and backtracks and so on but for that in order to show that that's possible The Next Step the next paper following up on that really has to go away from these explicit prompts for the problems and really has to do it such that um there is one prompt like I should I should be able to have one prompt initially like one surrounding prompt I can describe the problem a bit sure but I should have the one prompt and then the intermediate steps they should all be governed by one prompt right not by explicit prompts saying hey uh here is the here is the decoding constraints and so on now give thoughts about this and we parse it for you even with the math problems they like parse it intermittently and so on um none of that should happen it should just be one prompt that essentially generically says consider your previous thought you know how good do you think it is and so on um it could even be a meta prompt saying if you were to ask yourself how good this thought is how would you evaluate it and then use that as a prompt right but that really has to be the next step in order to make this a viable General Problem Solver until then I see this as a cool thing that we could use inside of algorithms part of algorithms um but that's a very different direction what is good to see is they do a lot of ablations uh on you know what the pruning and the backtrack and so on gives so this here is the crossword results you can see the i o prompting in Chain of Thought prompting they barely managed to solve full games they sometimes have word and letter successes but not that many the tree of thoughts uh obviously is much better and if you they say this is pretty cool they say hey if we heuristically if we Oracle we know which words go where right so if we always at valuation time so when the model criticizes itself and selects the best thought if at that time we always tell the model what the best um thought is then it goes up even more interestingly it doesn't go up that much right it just gets like that last bit to solve many more games than it could previously solve but the word success and the letter success rate don't go up that much that's pretty interesting um so it just kind of helps it to be extra consistent right and um yeah if they take away the pruning like if they never prune um or if they take away the backtracking so if they always just go down never backtrack and go to another Branch you can see that the performance it decorates again interestingly it doesn't degrade too much in sort of the success rate it does degrade a bit but okay it does the great I guess I guess the numbers are fairly big but the real killer is it solves a lot less total games so the total games are kind of an indication of if you make an a mistake like somewhere then you might get a lot of the words correct but the total game won't be solved um and so the total games is kind of an indication of how well that planning and so on works all right that is what I wanted to say about this paper all in all I think it is it is pretty cool um it is definitely a good direction I feel there's a lot that can be done on top of this to make it more intricate and I'm excited for a future where I'm obviously excited for the auto GPT style that is thought of here but I'm also quite excited for a future where these these things are just part of algorithms like classic algorithms mixed with language models I think that's an interesting World let me know what you think that was it for me bye-bye\n"