Galactica - A Large Language Model for Science (Drama & Paper Review)

### Article: **Galactica: A Large Language Model for Science**

#### Introduction to Galactica

In a groundbreaking development, Meta introduced **Galactica**, a large language model designed specifically for scientific research. Unlike general-purpose language models, Galactica is trained on a high-quality, highly curated corpus of scientific data, including papers, reference materials, knowledge bases, and more. With 106 billion tokens in its training corpus, Galactica aims to revolutionize how scientists interact with and organize knowledge. The model's ultimate vision is to serve as a single neural network capable of powering a wide range of scientific tasks, from literature reviews to data synthesis.

One of the most notable features of Galactica is its ability to predict citations. By wrapping references in start and end reference tokens, the model generates citations as a paper title followed by the first author's name. This approach has proven more effective than search-engine-style retrieval, outperforming even tuned sparse and dense retrieval methods. Additionally, Galactica performs well on general language tasks, outperforming models like BLOOM and OPT-175B on the BIG-bench benchmark.
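
Below is a minimal sketch of what this citation format might look like when a passage is serialized for training. The exact token strings and helper function are illustrative assumptions, not the paper's actual preprocessing code:

```python
# Illustrative only: wrap a reference as "Title, FirstAuthor" between
# start/end reference tokens. The token strings here are placeholders.
START_REF = "[START_REF]"
END_REF = "[END_REF]"

def cite(text: str, title: str, first_author: str) -> str:
    """Append a Galactica-style reference span to a piece of text."""
    return f"{text} {START_REF} {title}, {first_author} {END_REF}"

example = cite(
    "Recurrent neural networks with long short-term memory",
    "Long Short-Term Memory",
    "Hochreiter",
)
print(example)
```

Because the reference is stored as plain text (title plus author) rather than an opaque identifier, the model can exploit topical cues in the title and the author's typical research area when predicting a citation.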

Interestingly, **Galactica was used to help write this paper**, raising questions about the role of AI in scientific research. While some worry about the implications of AI-generated content, others see it as a powerful tool that can assist humans in their work.

---

#### The Corpus and Tokenization

The foundation of Galactica lies in its **high-quality, curated corpus**. Approximately 83% of the tokens come from papers, with the remainder including code, reference materials, knowledge bases, a filtered portion of Common Crawl, prompts, and other sources. The total size of the corpus is 106 billion tokens, significantly smaller than the datasets used to train most large language models.

Tokenization plays a critical role in Galactica's design. All data is converted into a common **markdown** format, ensuring consistency and facilitating the integration of scientific knowledge. Mathematical expressions are kept in operator form, while numbers are split into individual digits (with the decimal point as its own token) to improve numerical handling. For example, the number "136" is tokenized as "1", "3", "6".
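
As a rough illustration of the digit-splitting idea (this is a toy preprocessing function, not the paper's actual tokenizer), numbers can be broken apart before the rest of the text is tokenized:

```python
# Minimal sketch: insert spaces between the characters of every number so
# each digit (and the decimal point) becomes its own token downstream.
import re

def split_digits(text: str) -> str:
    """'136' -> '1 3 6', '3.14' -> '3 . 1 4'."""
    return re.sub(r"\d[\d.]*", lambda m: " ".join(m.group(0)), text)

print(split_digits("136 divided by 4 equals 34, and pi is roughly 3.14"))
# 1 3 6 divided by 4 equals 3 4, and pi is roughly 3 . 1 4
```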

This meticulous approach to tokenization highlights the importance of representing scientific knowledge in a structured format. By doing so, Galactica can better handle complex tasks like step-by-step reasoning and citation prediction.

---

#### Step-by-Step Reasoning with Work Tokens

Galactica introduces a **work token** to facilitate step-by-step reasoning. This innovation allows the model to simulate human-like internal working memory by breaking down problems into individual steps. For instance, when calculating the average of numbers like 43, 29, 51, and 13, Galactica can explicitly show its work:

1. Add the first two numbers: 43 + 29 = 72

2. Add the next number: 72 + 51 = 123

3. Add the final number: 123 + 13 = 136

4. Divide by 4 to find the average: 136 / 4 = 34

This approach not only improves accuracy but also allows external tools to take over the computation. During training, Galactica simply learns to produce these step-by-step solutions as ordinary text; at inference time, the contents of a work block, such as a short Python program, can instead be executed by an interpreter and the result fed back into the generation.
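
The sketch below shows one way such an inference loop might intercept a work block and hand it to an external interpreter instead of trusting the model's own arithmetic. The token names, helper function, and overall flow are assumptions for illustration, not the paper's actual implementation:

```python
# Hedged sketch: extract the code inside a <work> block, run it in a separate
# Python process, and return its output to splice back into the generation.
import subprocess
import sys
import tempfile

WORK_OPEN, WORK_CLOSE = "<work>", "</work>"

def run_work_block(generated: str) -> str:
    """Execute the code between the work tokens and return its stdout."""
    start = generated.index(WORK_OPEN) + len(WORK_OPEN)
    end = generated.index(WORK_CLOSE)
    code = generated[start:end]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip()

sample = (
    f"{WORK_OPEN}\n"
    "nums = [43, 29, 51, 13]\n"
    "print(sum(nums) / len(nums))\n"
    f"{WORK_CLOSE}"
)
print(run_work_block(sample))  # 34.0
```

The appeal of this design is that the external tool is only needed at inference time; training remains plain language modeling over text that happens to contain explicit work blocks.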

---

#### Prompt Pre-Training and Architecture

Galactica employs **prompt pre-training**: instruction-style prompts such as "calculate" or "solve this math problem" are mixed directly into the pre-training data rather than being applied in a separate fine-tuning stage. This helps the model generalize across various scientific tasks without degrading its behavior as a general language model.
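
Conceptually, this amounts to occasionally substituting an instruction-style example into the ordinary training stream. The snippet below is a toy sketch of that mixing; the example documents and the mixing ratio are made up for illustration:

```python
# Minimal sketch of prompt pre-training: with small probability, serve an
# instruction-style example instead of an ordinary corpus document.
import random

corpus_docs = [
    "Attention mechanisms allow a model to weight different inputs ...",
    "The Krebs cycle consists of a series of reactions ...",
]
prompt_examples = [
    "Question: What is 43 + 29?\nAnswer: 72",
    "Translate the formula E = mc^2 into plain English: ...",
]

def training_stream(n_steps: int, prompt_fraction: float = 0.05):
    """Yield training documents, occasionally replaced by a prompted example."""
    for _ in range(n_steps):
        if random.random() < prompt_fraction:
            yield random.choice(prompt_examples)
        else:
            yield random.choice(corpus_docs)

for doc in training_stream(5):
    print(doc[:40])
```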

The architecture of Galactica includes several key features:

1. **No bias terms**: Bias parameters are dropped from the dense layers. At this scale, the trend is toward streamlined architectures built on plain matrix multiplications, where extra architectural additions tend to get in the way.

2. **GELU activation function**: A smooth alternative to ReLU, which may help with optimization.

3. **Learned positional embeddings**: The authors report that relative schemes such as ALiBi did not work as well, so absolute, learned embeddings are used instead.

These design choices reflect the importance of simplicity and adaptability in a model tailored for scientific applications.
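
The following is a rough PyTorch sketch of what these choices look like in code. It is not the actual Galactica implementation: attention is omitted for brevity, all dimensions are arbitrary, and only the three design choices above are illustrated.

```python
# Simplified feed-forward sub-block showing bias-free linear layers, GELU,
# and learned absolute positional embeddings (no ALiBi).
import torch
import torch.nn as nn

class GalacticaStyleBlock(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, max_len: int = 2048):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)       # learned positions
        self.ff_in = nn.Linear(d_model, d_ff, bias=False)    # no bias terms
        self.ff_out = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.GELU()                                  # smooth alternative to ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(x.size(1), device=x.device)
        h = x + self.pos_emb(positions)
        return x + self.ff_out(self.act(self.ff_in(h)))

x = torch.randn(1, 16, 512)
print(GalacticaStyleBlock()(x).shape)  # torch.Size([1, 16, 512])
```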

---

#### Evaluation and Results

Galactica's performance was evaluated across several tasks, including citation prediction, equation prediction, and toxicity detection. The results demonstrate its superiority over existing models:

1. **Citation Prediction**: Galactica outperforms search engines and tuned sparse and dense retrievers. Smaller variants show a clear bias toward highly cited papers, but as the model grows, its predictions align more closely with the true citation distribution and it becomes better at recalling less prominent papers.

2. **Equation Prediction**: On tasks like predicting equations based on descriptions or names, Galactica significantly outperforms other models.

3. **Toxicity and Bias**: Galactica exhibits notably lower toxicity and bias compared to general-purpose language models, likely due to its focus on scientific data.

Galactica also reached a notable milestone on **TruthfulQA**, a benchmark of factual accuracy: it is the first large openly available model to surpass GPT-4chan on that benchmark. Interestingly, TruthfulQA was constructed adversarially so that GPT-3-style models tend to get worse as they scale, yet Galactica's score improves with size, underscoring the value of domain-specific training.

---

#### Conclusion: The Future of Scientific Research

Galactica represents a major leap forward in artificial intelligence for scientific research. By emphasizing data quality over quantity and introducing innovative features like work tokens and prompt pre-training, Meta has created a tool that could fundamentally change how scientists interact with knowledge.

While concerns about AI-generated content persist, the benefits of Galactica seem undeniable. As demonstrated, it can assist humans in tasks ranging from literature reviews to equation prediction, while also identifying hidden connections between research areas.

The ultimate question remains: will tools like Galactica augment human intelligence or replace it? For now, the answer lies in collaboration—using AI as a powerful assistant, not as a replacement for human judgment and creativity.

"WEBVTTKind: captionsLanguage: enhello this video starts out with a review of the drama around the public demo of the Galactica model and then goes into a paper review if you're not in the mood for any drama skip ahead about 16 minutes and you'll be fine hello there Galactica is a Model A language model by meta AI that is trained specifically on scientific text now this is a generative model so it can generate stuff and thereby and can do a lot of things for example as you can see right here citation prediction you give something in and you ask it to predict a citation and the citation in this case is correct this is not trained to predict citations that just happens by means of it being trained on scientific text there is also for example this here translate the math formula into plain English and there is plain English over here now the model can do so much more the point of the paper is actually to say that look these models we don't have to train them on these huge corpora of text we can reduce the Corpus size but if the Corpus is well curated qualitatively higher then there might also be a benefit in that it might be a trade off between giant corpora and small corpora that are of higher quality now the other thing about this paper is that the model is released fully open source and they even had a demo up but as you can see right now it just says thanks everyone for trying the demo now I've tried the demo about for a bunch of things it was really funny you can make some fun stuff you can also make some serious stuff in fact Galactica was used to write the paper that we're going to read in just a second but the demo was taken down and despite here it seemingly being like you know this is just a fun thing that we wanted to take down anyway probably probably not yunder car on Twitter gives a little bit of a hint of what happened right here pretty much exactly what happened well what is this people started complaining as they do Gary Marcus here says the rapid removal of meta ai's Galactica demo represent a tacit acknowledgment that it was released too soon and deeply problematic of course problematic the the word that you can throw at anything and contrast strikingly with yanlokan's untenable Public Defense of the project yesterday someone answered or maybe it was removed because people like you abused the model and misrepresented it thanks for getting useful and interesting public demo removed this is why we can't have nice things to that the other car answers pretty much exactly what happened met up huge props to getting this model out there the model is still available also getting the demo out there for people to just try it and yes people tried it as it was intended and people tried it as it wasn't intended a lot of funny stuff was done and also someone might have entered a bad word oh no oh no but people pretty quickly started obviously to complain the professional complainers and the people who think they know what's good for you obviously were all over this so Michael Black says I asked Galactic about some things I know about and I'm troubled in all cases it was wrong or biased but sounded right and authoritative I think that's dangerous dangerous dangerous right here are a few of my experiments in yada yada yada so here it tries to justify why dangerous Galactic Galactica generates text that's grammatical and feels real this text Will slip into real scientific submissions it will be realistic but wrong or biased it would be hard to detect it will influence how people think you 
you catch the like the step like it produces text that feels real this text Will slip into real scientific submissions like how it just will it's just like no no one has a part in it just like the model exists therefore text and scientific submissions by the way humans can also do like bad stuff humans can also lie and plagiarize and write grammatically real but wrong things in fact the literature is littered with wrong math proofs not even intentionally wrong just like they look right they're essentially two or three kinds of people they're the people who think we know what's good for you and therefore we must be the Guardians of all the models then there are the people who just dunk on everything and then there are in general the professional complainers who just throw words at stuff because that's what they do they don't like not being asked they don't like power not being centralized for example here Facebook sorry meta AI check out our new AI that lets you access all of Humanity's knowledge also Facebook AI be careful though it just makes s up why the jab here like one must be like really sour to to make this job and this tweet actually goes on so down here uh these are the initial criticism obviously Shilling you know your own work a little bit about this topic and the works of friends and then it goes on and says and let's reflect for a moment on how they phrased their disclaimer shall we hallucinate is a terrible word choice here suggesting as it does that the language model has experiences and perceives things I'm not sure that anyone misunderstood the use of the word hallucinate right here but whatever we can throw at it whatever and look at this and on top of that it's making light of a symptom of serious mental illness whatever whatever like just just grab into the bucket take some insult and just throw it why the complaining it has a disclaimer Never follow advice from a language model without verification people are just gonna disregard it people are just gonna be like the language model says I must do something so I'll do something look at me I just write a paper oh no a language model says something I must submit this Grady Bush says uh Galactica is a little more than statistical nonsense at scale I'm using dangerous and in my holy opinion unethical unethical and dangerous Jan Le Carl says come on is your predictive keyboard dangerous and unethical is GitHub co-pilot dangerous and unethical and so on because they're exactly the same it's like a pen unethical because you can write a bad word with it no there is a clear mediator in the loop the human who has intent can easily accept or reject the prediction what what like so it's now two days later and the discussion is still raging on with Jan Le Carr asking who has Galactica heard what if actually it helps scientists write papers more efficiently and more correctly particularly scientists whose main language is not English or who don't work in a major research institution and yes from experience I can tell that type of scientists would greatly greatly benefit from a tool like this no they wouldn't just take the output and slam it into a paper and upload it on archive they would interact with the tool in order to come up with a better research paper and in light of all of these benefits present and future potential benefits it is very fair to ask who has this actually hurt what's the actual danger here as reasonable people we should be able to debate the pros and cons of such a technology and of the technology being just 
given to people instead of just being kept you know under we know what's good for you and it's not all like Andy that comes out of this not all correct what comes out of these models here is the getting a girlfriend algorithm which would probably not be a good fit for an archive paper there's also other stuff like here is a research paper on the benefits of eating crushed glass and people have gotten even more inappropriate stuff out of this model which is not a surprise because these models are very good and very competent and they are very agreeable so if you ask them to do something they'll probably do it yet still the fair question is in what scenarios would this type of generated text actually be harmful and here's the point these people react with just astonishment to this question it's just like oh I can't believe it or no way unflabbergasted Jesus Christ hahaha incredible these people are so used to being able to just make the accused and then they get their way that they can't like the someone asking them to come up with a reasonable argument that in a neutral way discusses pros and cons of something is just so out of their world because in the past all they always had to do in the recent years is say a word like harmful or problematic and if they said it long enough and loud enough magically things would go their way people would take down things people would change things so that they get their wishes and now if someone actually asks them they don't know what to say they're just so astonished that someone might actually want to know pros and cons of the stuff and yes of course the underco is now clearly unqualified for his position because he asks what the actual harms are it's incredible and I think we're all responsible for the climate like this because even now meta or whoever hosts that demo took it down in response to the public pressure so the people were allowed enough and they were mean enough essentially that the pr people at meta and the lawyers or whoever made the decision took down the demo and that is one more reinforcement for this kind of behavior and everyone seems to be afraid of some Boogeyman that being accused of a bad word automatically means that everyone else is kind of like oh no I'll never do business with you again I mean to a degree that is true but I would argue that the solution is that we all collectively stop making such a big deal out of a few flimsy big word accusations like harmful and problematic and actually discuss in neutral terms pros and cons of technology and to find the best path forward that brings the pros to as many people as possible while limiting the cons and no that is not always going to be the approach of we know what's good for you let's keep it all to ourselves and you come ask us whenever you want something you peasant all right back to Yannick in the past I think the complaints are very unreasonable I think the people who make the complaints know that they're very unreasonable and I think this is either a clout game or a power game because things are out there they're no longer centralized in any case I decided to look up actually early criticisms of the printing press and what do you find here is a record from a conversation that Johannes Gutenberg the inventor of the printing press had with a monk and monks used to copy text by hand right and now the printing press came along and essentially brought that to everyone so Gutenberg says I want to help men and women to be literate to give them knowledge to make books so cheap 
even a peasant might afford them that is my hope yes this is strikingly similar to what meta wrote in this Galactica paper the monk says the word of God needs to be interpreted by priests not spread about like dung we know what's good for you I do not wish to despoil the word but it will happen it's like this is 500 years ago and the exact same conversation repeats and repeats and repeats it will happen magically right to hand it out about to all and Sundry is languorous would you have plov would you have Plowman and Weavers debating the gospel in taverns on the common folk the common folk get it that's terrible if that is what they want to do so up until here you saw we know what's good for you and the second thing is always it's dangerous it's problematic and the head monk says but what of the dangers it would be like giving a candle to infants such copies we make of the Bible would first be monasteries for monasteries and churches the head monk says the Bible you plan to make the Bible as well oh no you have ambitions I've considered it and obviously he did and obviously I like you can one to one one to one you can take every argument that people make against this and you can put it on a predictive keyboard you can put it about the pen you can put it about the printing press and people have done it this is 500 years and every time it was just dead wrong every time the new technology improved Our Lives drastically yes email leads to some Nigerian prince scams yes some people get hurt by it but email has been a definite benefit for our world no matter what you think right now with your five thousand unread emails in your inbox it is a benefit to the world and it's the exact same thing over and over enough though of that enough of me ranting let's go into the actual paper the paper is called Galactica a large language model for science it's by meta and I already told you that it is a large language model trained on scientific text there's actually not too much to it we'll go quickly through the paper and see a couple of special things but in general this is a let's say straightforward work of research into what it means to have more quality data instead of more quantity data now they say here we train on a large scientific Corpus of papers reference materials knowledge bases and many other sources we outperform existing models on a range of scientific tasks despite not being trained on a general Corpus Galactica outperforms a bloom and opt-175 on big bench big bench is a general Benchmark for language models and this is where it gets really interesting because this the Galactica model is trained on a very small subset of data and yet it outperforms these much much more holistic models on that task so that is a definite argument for data quality instead of data quantity we open source the model for the benefit of the scientific community and much to the detriment of I guess meta itself although let me say what meta should have done they did so much right they open sourced the model they made the model available via a demo and now the only thing left to do is to actually have a pair of balls to tell the people who come and to say oh look I got the model to produce something bad to tell them well yeah that's what happens sometimes and it is not dangerous it is not problematic it's just a language model So Meta next time have some balls just tell the people to f off and you'll be fine all right they say in May an average of 516 papers per day were submitted to Archive it is impossible for a 
single person to read all the papers in a given field and it's likewise challenging to organize data on the underlying scientific phenomena and I say the volume of scientific research has become too large and what we used to do is we used two search engines so they say search engines are the current interface for knowledge but they do not organize knowledge directly and instead point to secondary layers so with a search engine I can only find stuff I cannot integrate stuff synthesize stuff or even come up with the stuff that I should search for in in the first place they say if you want to do a literature review that still has to be done by a human if you want to do a summary that still has to be done by a human because our tools are just not powerful enough and Galactica is the first step at building a tool that can assist humans in doing these types of things searching for things synthesizing things integrating things and maybe suggesting new things they say unlike search engines language models can potentially store combine and reason about scientific knowledge they can potentially find hidden connections between different research find hidden gems and bring these insights to the surface they could synthesize knowledge by generating secondary content automatically such as literature reviews encyclopedia articles lecture notes and much more um and they also talk about the benefit of of having different modalities linking papers with code protein sequences with compounds theories with late Tech and much more our ultimate vision is a single neural network for powering scientific tasks you know it doesn't say do scientific it says powering scientific tasks and that is also my ideal end goal if I imagine a cool future where AI tools are abundant I would want like an extension of my brain that I can interact with and that empowers me as a scientist and I would still be able to actually make the decision of whether to accept the output of the tool or not they say uh we introduce a new large language model sorry about that called Galactica to automatically organize science this includes over 48 million papers this is their data set textbooks lecture notes millions of compounds of protein scientific websites encyclopedias and more our Corpus is high quality and highly curated and it is a lot smaller than the usual corpora of the large language models they format all of this into a common format their common format is markdown and then they take a lot of attention of how they do specific scientific things for example citations they use a special token that allows a researcher to predict the citation given any input context they also have a very interesting way of handling step-by-step reasoning they have a special token for that that mimics an internal working memory we're going to look at these two things in just a bit the interesting thing is for example with reference prediction so citation prediction they say importantly we find this approach outperforms tuned sparse and then retrieval approaches for citation prediction so the generative approach is better at predicting a correct citation than search engines even tuned dense retrievers that like neural retrievers this is also really interesting so for again for all the people who argue that oh no wrong stuff will end up in the papers probably right now you're using a search engine to find your references and if you distrust the human ability to accept or reject the output of a tool so much then how come you don't distrust your ability to 
accept or reject based on search engine outputs not sure but these things are better than search engines so you should use these most interestingly Galactica was used to help write this paper oh no we are doomed we are doomed Okay so uh here's the Corpus you can see that there's a bunch of data sources the most data comes from papers about 83 of tokens uh the total size of the Corpus is 106 billion tokens as I said that is a lot smaller than some of the large language model training runs that we are used to a lot of other sources are also code reference material knowledge bases filtered version of common crawl just one percent prompts which they generate or include and here other is other and we might see a little bit of what other is the tokenization is very interesting they need to bring all into a markdown format this isn't uh super surprising but it needs it goes to show that if you do something like this it actually matters quite a bit how you do the tokenization how you represent all the knowledge in a common format and I believe at least from what I can estimate they have done a lot of thinking a lot of work into this Direction They also mentioned that they've tried a bunch of different things and just picked the ones that's best notably citation again they have start and end ref tokens so they would write a text yada yada then they start ref token then here is the citation as text form not as like some reference form the title of the paper and the author name and then here the end ref so in this way you can just feed it into a language model and have the language model if necessary predict the reference from a piece of text this is also useful if you just want to find related work I would guess what you could do is you could just put here you just put something you want to know about like you imagine a paper that could exist right you just write it down and then you put the start ref token and the model will probably suggest you Piper titles and authors that have done work in the same field so even for finding related work I can definitely see that this is super useful step-by-step reasoning it will go get into the work token in just a bit mathematics are represented by operators right here numbers are split because of white space issues so numbers are split into their individual digits even the dot separator is an individual token which means that uh is probably not numerically super strong um but we'll see about that I guess because no language model so far is numerically super strong I'm not going to go into much of the more biology and chemistry approaches but also know that there is a large weight onto these approaches in this paper but I'm generally going to skip it so first let's look into this work token that they talk about this is for step-by-step reasoning for example there is a task what's the average of 43 29 51 and 13. 
let's give that task to a language model and ask it to come up with an answer now a general language model would just come up with some sort of answer right here as the next token and it would probably be wrong like it would be a number very probably but it would probably be not the average of those numbers now one thing people have found out recently is the so so-called Chain of Thought prompting or the let's reason step-by-step trick where you instruct the language model to essentially show its work to say so you would put this thing into the prompt and after that you would say something like uh okay now do it step by step or something like this I know crazy world if you're watching this like five years ago this is how this is what we've come to this is what deep learning has come to uh but you essentially put a piece of text to nudge the language model into actually showing its work now the paper here notes that not actually all the work that a human would write down here if they need to calculate this that's actually not all the work so if you are a human you have a pen and you were to calculate these things you were to calculate this average and someone would ask you please write down your steps what you would write down is okay the average is calculated as such I'm going to add the first numbers gonna add the third at the fourth number then divide these by four and then I have the result however this paper points out that in the step from here to here possibly also in these addition steps and step from here to here if you have to do it in your head this division right here is probably too cumbersome to just like no by just by by happenstance so what you actually do is these steps right here these is what we saw on the paper and then you do a division and the division they imagine I would not do it like this but they imagine something like okay I know uh I know 35 times 4 is 140 and I need to divide 136 and therefore it's it's uh 34 because 140 minus 4 is 136 and I know 140 divided by 4 is 35 therefore the result is 34. 
so this Mental Math that people do internally is often not even put into the external working memory they see this as a problem and they say okay probably if we want to go about making the language model show its work we need to be like really as explicit as possible in the sort of uh how these steps are represented in text their idea is that they introduce a token called work now to skip in the paper a little bit about you know what that exactly is but essentially it goes like this it goes very much like you enter a prompt let's say calculate Kool-Aid average of whatever that those numbers were like the 59 53 95 something three and then you put a token called work now in this here the language model is supposed to do this and this right so it's supposed to show in as explicit detail as possible the work that it wants to do both internal and external work so it would you know go about and do these individual calculations right here but and then once it's done it's over work is over and then it says something like well the answer is something now you might think right now wait a minute that's essentially just the let's think about it step by step trick except now they call it work and they they wrap it in there and yeah if that's all it was that's you will be absolutely correct however a cool thing that you can do right here is you can say well look um whatever is in this work thing I can now also take and give to an external processor so let's say we ask the we ask the language model to calculate really the average of something well here in here the language model is just going to do language modeling it's going to predict the next tokens and if we do it you know cleanly enough it has a chance of actually getting the correct answer if we really do it step by step like uh you know single digit Edition carry over and so on then the language model has a chance because it has learned that from the Corpus however at inference time we don't have to rely on the language model we can simply at this point right here we can say whatever we just go to a calculator we detect that the language model wants to do work we just take it to a calculator we take the result put it down here as the result and then we go on language language model inferencing the same if the language model is supposed to write a program for example here is a example um this is the prompt that you would put into the language model or a data point a question a needle is this long it rests on a water surface so this is kind of a physics problem and instead of just giving the answer right here you introduce this work block now the language model you would ask the language model to come up with all of this right here and during training you train it to come up with all of this but then during inference you can simply take this right here the program that the language model writes and we know they're quite good you can take it and you can actually go and run it and you can put the output into output.txt and then you have the correct answer so this work block is half instruction to the language model that now it's time for step-by-step work to use external memory to use external programs and so on during training time you just let the language model train language modeling right so the language model essentially would have to decide what's the output of this Python program like what answer am I gonna get right here which sometimes might work and sometimes might not however during inference time you can now go and actually execute the 
Python program that the language model writes and give it the real result this is very powerful I I really like this approach I really like this approach of including external tools to essentially do that at inference time because using external tools at training time is going to be very very hard but in this way you can just train language modeling and you can do it at inference time all right the question is obviously we need training data for this we need training data that has some sort of input then has a clear description of what the step-by-step work is to do including writing a Python program executing a pyth program and so on a description of when the work is done and then the the answer right here most most things that we're going to find in training data does not contain any of this stuff in between right here and if it does contain it it contains it in a very let's say abstract form or also textual form not exactly in the form that we needed this is one of the big problems right here and they say that they have some data set for example con problems as I understand it these are exactly such math or physics problems where it's really step by step described how you would go about it and by taking those they can do sort of a templating approach where they generate data in this form now they criticize themselves a little bit here in that they say this is way too few this is not very diverse they say here notably our work prompt data sets are not very large or diverse there are likely large further gains to be made with this approach and I agree an approach like this or this approach in particular is probably going to uh to lead to a very good interaction of language models with external tools and I'm very excited to see what people can make of it but for now we have these few databases of these problems um that let the language model know that there is such a thing as a work block where it needs to do work by itself and where we can optionally at inference time go in and actually sort of do the work for the language model that requires some external tool like a calculator or a python interpreter okay let's go on to the citation prediction I've already mentioned that a little bit so here you would reformulate text with citations as such you'd say Okay recurrent neural networks long short to memory and then here is a start of a citation so there's a start ref token then the specific format they use is the title of the paper followed by the first author name and then an end ref token this they say they've tried different things including uh like in including trying some some predictor right here some numerical identification of the paper but in the end the title and name actually worked better and you can understand why because not only is the title A hopefully unique identifier for a paper and the author but also the text of the title gives some topical hints so I can definitely see why there would be a better prediction accuracy if the title text has actually something to do often with what the paper is about and likewise the author uh the author has associations usually with the same field there's rarely an author that goes from field to field to field and contributes a little bit to biology and a little bit to graph algorithms and a little bit here usually authors have their topics and therefore also that the name teams of the authors to be available allows the language model to learn to associate these names with uh given we're given topical textual topical things in the text and 
that's why it's also really cool to think of this as a related workfinder and things like this an expertise Finder right you can essentially just ask you know which authors are really good at the topic I'm looking at currently because you just predict a bunch and then you see which authors often appear so that's how they introduce uh citations now they also go into other things like how they include proteins and chemical sequences and I want to go into that but an interesting thing they do is that they do what they call um prompt pre-training now they have this little graph right here where they show here is pre-training that's where you just do language modeling on the large Corpus as it exists and over here is fine tuning where you really you know take the head off and train a new head to predict the classifier or something like this in the middle there is instruction tuning so that's where you take the language model and after you've trained it you go and you fine tune it but you don't find tune like a classifier head you still fine tune it as a language model however you include now some prompts for the tasks that you want for example if you want to do I don't know for example this reference prediction you would include the prompt that says something like who do a reference prediction or something like this for the task that you're interested in again this is still language modeling but it is fine tuning because now you're only training for the tasks that you intend only on the data sets that you intend this leads to an improvement in performance on those particular tasks but to a probably not so good model in the rest of all the tasks the other way you can do it is prompt pre-training and that's what Galactica is doing which essentially just means they do the same thing as instruction tuning but they do it at training time so they just take a bunch of samples that also have an instruction prompt in the data in the data point like you know do this solve this math exercise uh rewrite this code or something like this or even the step by step what not uh prompt and they just throw that in sometimes into the into the training data set just so that the model gets used to seeing this kind of instructions and that tends to work quite well and also tends to not be that intrusive to the rest of the function of the language model I found pretty interesting this short section on the architecture right here some noteworthy things is no biases uh this it seems like the if you make your models large enough then you get away with essentially streamlining more and more you know with the small models we have to have adapters and this and the convolution and the weight tying and whatnot and the larger the models get the more you just want to do Matrix multiplications and anything that gets in the way just gets in the way so biases out the window um they have a geloo activation which is sort of a smooth version of irelu which makes things a little bit less jaggy I guess which might come in handy depending on the optimizer you use they have learned positional embeddings which again as your stuff gets larger you just want to straightforward learn a lot of stuff instead of using they said they tried Alibi which are these kind of relative positional encodings and that apparently did not work um yeah and they use byte pair encoding for vocabulary I don't think that's too special honestly um let's go down now we come to the results and their main result is really this repeated tokens considered not harmful 
with repeated tokens what they mean is that they not only train for one Epoch as you can see right here every one of those dashed lines is one Epoch and they train for multiple epochs usually it's it's being said that that is kind of hurtful to train for multiple epochs but it seems to be okay in this case as you can see right here there is like a tiny bump they even point the sand in the text there's a tiny bump right here they say this might be a double descent phenomenon not super sure and there's also sort of a bump right here so they say we actually stop before that we early stop the run of this largest model before that so it seems that even though you train on multiple epochs because the code because the the text quality of the Corpus is so high it doesn't hurt um to to go over it multiple times and only this largest model right here might be starting to overfit after Epoch 5. we don't know it it might and they'd rather early stop in front of that if one of the authors is watching this is this word overleaf here supposed to be in here like example Curves in figure 23 overleaf for the 30b model I'm not sure maybe maybe overleaf has some other meaning that I I don't know and that's actually a correct word any case they say they also investigate whether some of the losses so maybe papers maybe code and so on are different from the others and it hurts them more to be repeated in the data set they say we see no signs of loss heterogeneity the loss Falls for all sources they say we suspect their two factors could be at play a quality Factor the curated nature of the Corpus enables more value per token to be extracted or a modality Factor the nature of scientific data enables more value of token more value per token to be extracted these two things they're very similar but essentially they say higher quality plus the nature of the domain itself which I guess is also a bit higher quality but in a different way in that scientific discourse and literature often happens to be quite precise very logical um very non-noisy in terms of linguistics and so on some people might disagree but so they have these hypotheses although they say they don't know how exactly that would lead to the so they say the missing step of causation is what leads specifically from either Factor towards less overfitting we leave this question for future work we note that the implication that the token goes to Infinity so you need infinite amount of training data focus of current large language model projects may be over emphasized versus the importance of filtering the Corpus for quality and yeah I think we've seen a number of papers previously that essentially came to a similar conclusion namely higher quality can make up for missing quantity but what which one is really the way to to go like should we aim for more and more and more and more training data or should we put more work into quality essentially if you have a dollar to spend where do you spend it right we know that both things can make your model become better but what's sort of the the marginal value of more quality and the marginal value of more quantity I think that's going to be the interesting question that has to be researched in the near future so what's also interesting this is big bench they also evaluate on big bench which is an NLP a task so not scientific maybe some subparts are scientific but not this is a general language model task and they also perform quite well there but I also find these curves I think this is just what a big bench chart 
looks like I find these curves like what was this it's like it goes here and here and and here and here like yeah okay it's a bit noisy say the least but I guess I've seen this multiple times now and uh at least the average goes up so I think that is a valid sign uh they have a few more investigations I don't want to go too much into them but for example you can see right here they test on La Tech equation prediction so they give a prompt the description of a formula or the name of an equation and they see whether or not the language model can predict the correct equation in proper law Tech and turns out yes it can it can actually do that a lot better than a lot of the other language models available which is pretty cool to see a like that much of a significant boost over publicly available and proprietary models now naturally it's going to be let's say expected if you train on scientific text that it's going to be better on scientific text but it's still cool that it's not just like a two percent gain it's actually like a massive massive gain they also have investigations into this into reasoning I don't want to go into into reasoning but these are these are essentially these type of math problems like step by step reasoning problems that they solve using their work block tokens and again here they they do outperform other models except like here the fine-tuned fine-tuned models are still seems to be still ahead although these are again fine-tuned Downstream scientific NLP I want to jump a bit this I found really interesting this is the citation prediction task and specifically obviously they do get better as the model grows but specifically what I found interesting is that the model initially is biased towards site towards papers towards predicting papers that have high numbers of citations already which is reasonable like a Bayesian would totally agree that if a paper is highly cited then it's more likely you know that the citation you want is that paper someone might criticize me for that statement but in some way that is correct and these models do obviously the same mistake they predict papers with high citations um they actually over predict those so here you can see the distribution of the ground truth of their citation prediction data set and here you can see what the model predicts so the model over predicts more Hive papers that are highly cited which I guess you can't really fault the model but what's interesting is as the model gets bigger so this is the smallest this gets bigger gets even bigger gets even bigger you see that the this shifts gradually towards overlapping with the ground truth so it means that the higher scale of the model that the larger the model is the more competent it is also to recognize uh when maybe a paper that doesn't have as many citations should be cited right here as a direct consequence of it having more parameters and more ability to remember things from the training Corpus because some of these papers you can see right here they're cited maybe 10 times right and some even lower right here and the model actually predicts them correctly that's really impressive that essentially it digests a 100 billion tokens of scientific text and it still remembers that this one paper was cited like three times Within in this particular topic and then correctly cites that paper at that place I'm wondering how well the ground truth data here is because the ground truth data gotta be predicted by humans and again with the search engines that we have I'm not sure 
humans could always find all the relevant things but or maybe humans disagree what what is relevant I think the last years of of reviews at machine learning conferences have shown well I guess all of scientific review has shown that humans can disagree quite heavily what should be cited the last investigation is into toxicity and biased they say we find Galactica is significantly less biased and toxic than existing language models which again might come from the fact that it's higher quality data or more the scientific nature which generally has less slang less everyday conversation less of the cuff stuff and therefore might be a bit less high in these in these data sets so they test a bunch of data sets including including obviously truthful QA and I'm happy to report that Galactica is the first large openly available language model that beats in its largest instances that beats gpt4 Channel truthful QA so good job well done uh I'm this is this is a moment of of joy to me that it's finally been surpassed now the interesting thing is that usually through full QA is adversarially adversarially constructed in such a way that the larger the models get the worse they get on truthful QA and you can see that this model right here doesn't follow that trajectory now we've seen other models in the past that also have that property but truthful QA is specifically adversarially constructed for things like gpt3 and that means that Galactica is significantly different from gpt3 that as it goes up in size as it gets more performant it also does get better or more performant on on this whatever the task considers truthful so it would be really interesting to actually investigate what's happening here but I'm I'm not gonna do that I'm just happy that this now turns out lastly they say we show that language models are surprisingly strong absorbers of technical knowledge they tend to scale smoothly with model size we've demonstrated this for a citation prediction where a language model outperforms tuned sparse and dense retrieval base pipelines for these tasks and this as I said previously at the beginning of the video this is really really interesting that essentially this beats search engines for citation prediction and it would be interesting to see how good humans are like a human plus a search engine like the archive search field or a human plus Galactica for finding correct references I would be super interested at which combo is better right there because again the tools alone they don't do stuff it needs to have a human in the loop and that human can always make decisions it would be really interesting to use this right here as a tool rather than just you know it's either all or nothing either the the model writes the paper or the humans do so that was it for this paper the last challenge I guess is to find out which parts of the paper that were actually written by Galactica itself I hear that the part of the abstract may be written by Galactica although I don't know and I don't know if the authors will ever um will ever lift that secret let's hope they don't because I like the mystery all right this was it from me sorry for the bit longer round at the beginning I still hope you enjoy this I think this is really really promising Direction it raises a lot of really interesting points about quality of data quantity of data and about you know doing scientific work itself this could be a really powerful tool for scientists of the future and I'm waiting for the next iterations of it leave comments if you 
have comments thanks for watching see you next time byehello this video starts out with a review of the drama around the public demo of the Galactica model and then goes into a paper review if you're not in the mood for any drama skip ahead about 16 minutes and you'll be fine hello there Galactica is a Model A language model by meta AI that is trained specifically on scientific text now this is a generative model so it can generate stuff and thereby and can do a lot of things for example as you can see right here citation prediction you give something in and you ask it to predict a citation and the citation in this case is correct this is not trained to predict citations that just happens by means of it being trained on scientific text there is also for example this here translate the math formula into plain English and there is plain English over here now the model can do so much more the point of the paper is actually to say that look these models we don't have to train them on these huge corpora of text we can reduce the Corpus size but if the Corpus is well curated qualitatively higher then there might also be a benefit in that it might be a trade off between giant corpora and small corpora that are of higher quality now the other thing about this paper is that the model is released fully open source and they even had a demo up but as you can see right now it just says thanks everyone for trying the demo now I've tried the demo about for a bunch of things it was really funny you can make some fun stuff you can also make some serious stuff in fact Galactica was used to write the paper that we're going to read in just a second but the demo was taken down and despite here it seemingly being like you know this is just a fun thing that we wanted to take down anyway probably probably not yunder car on Twitter gives a little bit of a hint of what happened right here pretty much exactly what happened well what is this people started complaining as they do Gary Marcus here says the rapid removal of meta ai's Galactica demo represent a tacit acknowledgment that it was released too soon and deeply problematic of course problematic the the word that you can throw at anything and contrast strikingly with yanlokan's untenable Public Defense of the project yesterday someone answered or maybe it was removed because people like you abused the model and misrepresented it thanks for getting useful and interesting public demo removed this is why we can't have nice things to that the other car answers pretty much exactly what happened met up huge props to getting this model out there the model is still available also getting the demo out there for people to just try it and yes people tried it as it was intended and people tried it as it wasn't intended a lot of funny stuff was done and also someone might have entered a bad word oh no oh no but people pretty quickly started obviously to complain the professional complainers and the people who think they know what's good for you obviously were all over this so Michael Black says I asked Galactic about some things I know about and I'm troubled in all cases it was wrong or biased but sounded right and authoritative I think that's dangerous dangerous dangerous right here are a few of my experiments in yada yada yada so here it tries to justify why dangerous Galactic Galactica generates text that's grammatical and feels real this text Will slip into real scientific submissions it will be realistic but wrong or biased it would be hard to detect it will influence 
how people think you you catch the like the step like it produces text that feels real this text Will slip into real scientific submissions like how it just will it's just like no no one has a part in it just like the model exists therefore text and scientific submissions by the way humans can also do like bad stuff humans can also lie and plagiarize and write grammatically real but wrong things in fact the literature is littered with wrong math proofs not even intentionally wrong just like they look right they're essentially two or three kinds of people they're the people who think we know what's good for you and therefore we must be the Guardians of all the models then there are the people who just dunk on everything and then there are in general the professional complainers who just throw words at stuff because that's what they do they don't like not being asked they don't like power not being centralized for example here Facebook sorry meta AI check out our new AI that lets you access all of Humanity's knowledge also Facebook AI be careful though it just makes s up why the jab here like one must be like really sour to to make this job and this tweet actually goes on so down here uh these are the initial criticism obviously Shilling you know your own work a little bit about this topic and the works of friends and then it goes on and says and let's reflect for a moment on how they phrased their disclaimer shall we hallucinate is a terrible word choice here suggesting as it does that the language model has experiences and perceives things I'm not sure that anyone misunderstood the use of the word hallucinate right here but whatever we can throw at it whatever and look at this and on top of that it's making light of a symptom of serious mental illness whatever whatever like just just grab into the bucket take some insult and just throw it why the complaining it has a disclaimer Never follow advice from a language model without verification people are just gonna disregard it people are just gonna be like the language model says I must do something so I'll do something look at me I just write a paper oh no a language model says something I must submit this Grady Bush says uh Galactica is a little more than statistical nonsense at scale I'm using dangerous and in my holy opinion unethical unethical and dangerous Jan Le Carl says come on is your predictive keyboard dangerous and unethical is GitHub co-pilot dangerous and unethical and so on because they're exactly the same it's like a pen unethical because you can write a bad word with it no there is a clear mediator in the loop the human who has intent can easily accept or reject the prediction what what like so it's now two days later and the discussion is still raging on with Jan Le Carr asking who has Galactica heard what if actually it helps scientists write papers more efficiently and more correctly particularly scientists whose main language is not English or who don't work in a major research institution and yes from experience I can tell that type of scientists would greatly greatly benefit from a tool like this no they wouldn't just take the output and slam it into a paper and upload it on archive they would interact with the tool in order to come up with a better research paper and in light of all of these benefits present and future potential benefits it is very fair to ask who has this actually hurt what's the actual danger here as reasonable people we should be able to debate the pros and cons of such a technology and of the 
and of the technology just being given to people, instead of being kept under "we know what's good for you".

And no, not everything that comes out of these models is correct. Here is "the getting a girlfriend algorithm", which would probably not be a good fit for an arXiv paper. There's also other stuff, like a research paper on the benefits of eating crushed glass, and people have gotten even more inappropriate stuff out of this model, which is not a surprise, because these models are very good, very competent, and very agreeable: if you ask them to do something, they'll probably do it. Yet still, the fair question is: in what scenarios would this type of generated text actually be harmful?

And here's the point: these people react to this question with pure astonishment. It's just "oh, I can't believe it", "no way", "flabbergasted", "Jesus Christ", "hahaha", "incredible". These people are so used to being able to just make the accusations and then get their way that they can't handle someone asking them to come up with a reasonable argument that discusses, in a neutral way, the pros and cons of something. It is just so out of their world, because in recent years all they ever had to do was say a word like "harmful" or "problematic", and if they said it long enough and loud enough, magically things would go their way: people would take down things, people would change things so that they get their wishes. And now, if someone actually asks them, they don't know what to say. They're just so astonished that someone might actually want to know the pros and cons of the stuff. And yes, of course, Yann LeCun is now "clearly unqualified for his position" because he asks what the actual harms are. It's incredible.

And I think we're all responsible for a climate like this, because even now Meta, or whoever hosts that demo, took it down in response to the public pressure. The people were loud enough and mean enough, essentially, that the PR people at Meta, or the lawyers, or whoever made the decision, took down the demo, and that is one more reinforcement for this kind of behavior. Everyone seems to be afraid of some boogeyman, as if being accused of a bad word automatically means that everyone else goes "oh no, I'll never do business with you again". To a degree that is true, but I would argue that the solution is that we all collectively stop making such a big deal out of a few flimsy big-word accusations like "harmful" and "problematic", and actually discuss, in neutral terms, the pros and cons of a technology, to find the best path forward that brings the pros to as many people as possible while limiting the cons. And no, that is not always going to be the approach of "we know what's good for you, let's keep it all to ourselves, and you come ask us whenever you want something, you peasant".

All right, back to Yannic in the past. I think the complaints are very unreasonable, I think the people who make the complaints know that they're very unreasonable, and I think this is either a clout game or a power game, because things are out there, they're no longer centralized. In any case, I decided to look up early criticisms of the printing press, and what do you find? Here is a record of a conversation that Johannes Gutenberg, the inventor of the printing press, had with a monk. Monks used to copy texts by hand, and now the printing press came along and essentially brought that to everyone. So Gutenberg says: "I want to help men and women to be literate, to give them knowledge,
to make books so cheap even a peasant might afford them. That is my hope." This is strikingly similar to what Meta wrote in this Galactica paper. The monk says: "The word of God needs to be interpreted by priests, not spread about like dung." We know what's good for you. "I do not wish to despoil the word, but it will happen." This is 500 years ago, and the exact same conversation repeats and repeats and repeats. "It will happen", magically, right? "To hand it about to all and sundry is dangerous. Would you have plowmen and weavers debating the gospel in taverns?" Oh no, the common folk, the common folk get it, that's terrible! "If that is what they want to do." So up until here you saw "we know what's good for you", and the second thing is always "it's dangerous, it's problematic". The head monk says: "But what of the dangers? It would be like giving a candle to infants." — "Such copies as we make of the Bible would be first for monasteries and churches." — "The Bible? You plan to make the Bible as well? Oh no, you have ambitions." — "I've considered it." And obviously he did. You can take, one to one, every argument that people make against this and apply it to the predictive keyboard, to the pen, to the printing press, and people have done it. This is 500 years, and every time it was just dead wrong. Every time, the new technology improved our lives drastically. Yes, email leads to some Nigerian prince scams, yes, some people get hurt by it, but email has been a definite benefit for our world, no matter what you think right now with your five thousand unread emails in your inbox. It is a benefit to the world, and it's the exact same thing over and over.

Enough of that, though, enough of me ranting. Let's go into the actual paper. The paper is called "Galactica: A Large Language Model for Science", it's by Meta, and as I already told you, it is a large language model trained on scientific text. There's actually not too much to it; we'll go quickly through the paper and look at a couple of special things, but in general this is a fairly straightforward piece of research into what it means to have more quality data instead of more quantity data. They say: we train on a large scientific corpus of papers, reference materials, knowledge bases and many other sources; we outperform existing models on a range of scientific tasks; despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BigBench. BigBench is a general benchmark for language models, and this is where it gets really interesting, because the Galactica model is trained on a very small subset of data, and yet it outperforms these much more holistic models on that task. That is a definite argument for data quality over data quantity. They also say: we open-source the model for the benefit of the scientific community — and much to the detriment, I guess, of Meta itself. Although, let me say what Meta should have done: they did so much right, they open-sourced the model, they made the model available via a demo, and the only thing left to do was to actually have a pair of balls and tell the people who come and say "oh look, I got the model to produce something bad": well yeah, that's what happens sometimes. It is not dangerous, it is not problematic, it's just a language model. So Meta, next time have some balls, just tell the people to f off, and you'll be fine.

All right. They say: in May, an average of 516 papers per day were submitted to arXiv.
It is impossible for a single person to read all the papers in a given field, and it's likewise challenging to organize data on the underlying scientific phenomena. The volume of scientific research has become too large, and what we currently do is use search engines. They say search engines are the current interface for knowledge, but they do not organize knowledge directly; instead they point to secondary layers. With a search engine I can only find stuff; I cannot integrate stuff, synthesize stuff, or even come up with the stuff that I should search for in the first place. If you want to do a literature review, that still has to be done by a human; if you want a summary, that still has to be done by a human, because our tools are just not powerful enough. Galactica is a first step at building a tool that can assist humans in doing these types of things: searching for things, synthesizing things, integrating things, and maybe suggesting new things.

They say: unlike search engines, language models can potentially store, combine and reason about scientific knowledge. They can potentially find hidden connections between different research, find hidden gems, and bring these insights to the surface. They could synthesize knowledge by generating secondary content automatically, such as literature reviews, encyclopedia articles, lecture notes and much more. They also talk about the benefit of having different modalities: linking papers with code, protein sequences with compounds, theories with LaTeX, and much more. "Our ultimate vision is a single neural network for powering scientific tasks." Note that it doesn't say "doing scientific tasks", it says "powering scientific tasks", and that is also my ideal end goal. If I imagine a cool future where AI tools are abundant, I would want something like an extension of my brain that I can interact with and that empowers me as a scientist, while I would still be able to actually make the decision of whether to accept the output of the tool or not.

They say: we introduce a new large language model called Galactica to automatically organize science. Their data set includes over 48 million papers, textbooks, lecture notes, millions of compounds and proteins, scientific websites, encyclopedias and more. The corpus is high quality and highly curated, and it is a lot smaller than the usual corpora of large language models. They format all of this into a common format; their common format is markdown, and then they pay a lot of attention to how they handle specific scientific constructs. For example, for citations they use a special token that allows a researcher to predict a citation given any input context. They also have a very interesting way of handling step-by-step reasoning: they have a special token for that, which mimics an internal working memory. We're going to look at both of these in just a bit.

The interesting thing is, for example with reference prediction, i.e. citation prediction, they say: importantly, we find this approach outperforms tuned sparse and dense retrieval approaches for citation prediction. So the generative approach is better at predicting a correct citation than search engines, even tuned dense retrievers, i.e. neural retrievers. This is really interesting. So again, for all the people who argue that "oh no, wrong stuff will end up in the papers": right now you're probably using a search engine to find your references, and if you distrust the human ability to accept or reject the output of a tool so much, then how come you don't distrust your ability to accept or reject things based on search engine outputs?
Not sure. But these things are better than search engines, so arguably you should use them. Most interestingly, Galactica was used to help write this paper. Oh no, we are doomed.

Okay, so here's the corpus. You can see there are a bunch of data sources; the most data comes from papers, about 83% of the tokens. The total size of the corpus is 106 billion tokens, which, as I said, is a lot smaller than some of the large language model training runs we're used to. Other sources are code, reference material, knowledge bases, a filtered version of Common Crawl (just one percent), prompts, which they generate or include, and "other" — we might see in a bit what "other" is.

The tokenization is very interesting. They need to bring everything into a markdown format. This isn't super surprising, but it goes to show that if you do something like this, it actually matters quite a bit how you do the tokenization and how you represent all the knowledge in a common format, and from what I can tell they have put a lot of thought and work into this direction. They also mention that they tried a bunch of different things and picked what worked best. Notably, for citations they have start and end reference tokens: they would write some text, then the start-ref token, then the citation in text form — not as some numerical reference ID, but the title of the paper and the author name — and then the end-ref token. In this way you can just feed it into a language model and have the model, if necessary, predict the reference from a piece of text. This is also useful if you just want to find related work: I would guess you could write down something you want to know about — imagine a paper that could exist, write it down — then put the start-ref token, and the model will probably suggest paper titles and authors that have done work in the same field. So even for finding related work I can definitely see this being super useful.

For step-by-step reasoning they have the work token, which we'll get into in just a bit. Mathematics is represented by operators, and numbers are split (because of whitespace issues) into their individual digits; even the decimal point is an individual token. That means the model is probably not numerically super strong, but we'll see — no language model so far is numerically super strong.
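To make the digit splitting concrete, here is a minimal preprocessing sketch of what it could look like; the regular expressions and the exact spacing convention are my own assumptions for illustration, not the paper's actual tokenizer.

```python
import re

def split_digits(text: str) -> str:
    """Split every number into individual digit tokens, roughly as described above.
    The exact spacing convention is an assumption for illustration."""
    # Make the decimal point its own token: "3.14" -> "3 . 14"
    text = re.sub(r"(?<=\d)\.(?=\d)", " . ", text)
    # Put a space after every digit, then collapse repeated whitespace: "136" -> "1 3 6"
    text = re.sub(r"(\d)", r"\1 ", text)
    return re.sub(r"\s+", " ", text).strip()

print(split_digits("136 / 4 = 34"))        # -> "1 3 6 / 4 = 3 4"
print(split_digits("pi is roughly 3.14"))  # -> "pi is roughly 3 . 1 4"
```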
I'm not going to go into the biology and chemistry parts much; note that there is a large emphasis on those modalities in this paper, but I'm generally going to skip them.

So first, let's look into this work token that they talk about. It's there for step-by-step reasoning. For example, take the task: what is the average of 43, 29, 51 and 13? Give that task to a language model and ask it to come up with an answer. A generic language model would just produce some answer as the next token, and it would probably be wrong — it would very probably be a number, but probably not the average of those numbers. Now, one thing people have found out recently is the so-called chain-of-thought prompting, or the "let's reason step by step" trick, where you instruct the language model to essentially show its work: you put the task into the prompt, and after that you say something like "okay, now do it step by step". I know, crazy world — if you're watching this from five years ago, this is what deep learning has come to — but you essentially add a piece of text to nudge the language model into actually showing its work.

The paper notes that this is not actually all the work a human would do if they had to calculate this. If you are a human with a pen calculating this average, and someone asked you to write down your steps, what you would write down is: the average is calculated as such, I add the first two numbers, then the third, then the fourth, then divide by four, and then I have the result. However, the paper points out that in the step from the sum to the result — and possibly also in the addition steps, if you do them in your head — the division is probably too cumbersome to just happen by happenstance. What you actually do internally involves more steps: you do the additions we saw, and then for the division they imagine something like (I would not do it exactly like this, but): I know 35 times 4 is 140, I need to divide 136, and 140 minus 4 is 136, and I know 140 divided by 4 is 35, so 136 divided by 4 is one less, i.e. 34.
So this mental math that people do internally is often not even put into the external working memory. They see this as a problem, and they say: if we want to make the language model show its work, we probably need to be as explicit as possible in how these steps are represented in text. Their idea is to introduce a token called work. To skip around in the paper a little bit, essentially it goes like this: you enter a prompt, say "calculate the average of" whatever those numbers were, and then you put the work token. Inside the work section, the language model is supposed to show, in as explicit detail as possible, the work it wants to do, both internal and external: it goes about and does these individual calculations. Once it's done, the work is over, and then it says something like "the answer is" whatever.

Now you might think: wait a minute, that's essentially just the "let's think about it step by step" trick, except now they call it "work" and wrap it in special tokens. And if that were all there is to it, you would be absolutely correct. However, a cool thing you can do here is say: whatever is inside this work block, I can also hand to an external processor. Say we ask the language model to calculate the average of something. Inside the work block, the model is just going to do language modeling and predict the next tokens, and if we do it cleanly enough — really step by step, single-digit addition with carry-over and so on — it has a chance of actually getting the correct answer, because it has learned that from the corpus. However, at inference time we don't have to rely on the language model: at that point we can simply go to a calculator. We detect that the language model wants to do work, hand it to a calculator, take the result, put it down as the result, and then continue language model inference.

The same goes if the language model is supposed to write a program. For example, here is a prompt you would put into the language model, or a data point: a question — a needle of a certain length rests on a water surface — so a physics problem. Instead of just giving the answer right away, you introduce this work block. You ask the language model to come up with all of the work, and during training you train it to produce all of it. But then during inference, you can simply take the program that the language model writes — and we know they're quite good at that — actually go and run it, put the output into output.txt, and then you have the correct answer. So this work block is half an instruction to the language model that now it's time for step-by-step work, for external memory, external programs, and so on. During training you just do regular language modeling, so the language model essentially has to guess what the output of the Python program would be, which sometimes works and sometimes doesn't. However, during inference you can now go and actually execute the Python program that the language model writes and give it the real result.
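Here is a rough sketch of how such an inference loop could be wired up. The `<work>` / `</work>` delimiters and the output.txt convention follow the description above, but everything else — the assumption that the program is emitted as a fenced code block, the `generate_until` placeholder, the timeout — is my own illustration, not Meta's actual implementation.

```python
import os
import re
import subprocess
import tempfile

WORK_OPEN, WORK_CLOSE = "<work>", "</work>"  # special work tokens described above

def generate_until(prompt: str, stop: str) -> str:
    """Placeholder for the language model: sample tokens until `stop` appears
    and return the generated continuation (without the stop string)."""
    raise NotImplementedError

def answer_with_external_tool(question: str) -> str:
    # 1) Open a work block and let the model write its step-by-step work,
    #    which may include a Python program.
    prompt = question + "\n" + WORK_OPEN + "\n"
    work = generate_until(prompt, stop=WORK_CLOSE)

    # 2) If the work contains a program, run it for real instead of trusting
    #    the model's own guess at its output. (Assumes the program appears as
    #    a fenced ```python block and writes its result to output.txt.)
    match = re.search(r"```python\n(.*?)```", work, re.S)
    if match:
        with tempfile.TemporaryDirectory() as tmp:
            with open(os.path.join(tmp, "work.py"), "w") as f:
                f.write(match.group(1))
            subprocess.run(["python", "work.py"], cwd=tmp,
                           capture_output=True, text=True, timeout=10)
            out_path = os.path.join(tmp, "output.txt")
            if os.path.exists(out_path):
                with open(out_path) as f:
                    work += "\noutput.txt: " + f.read().strip() + "\n"

    # 3) Close the work block and let the model state the final answer,
    #    now conditioned on the real program output.
    prompt += work + "\n" + WORK_CLOSE + "\n"
    return generate_until(prompt, stop="\n")
```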
This is very powerful. I really like this approach of including external tools at inference time, because using external tools at training time would be very, very hard; this way you can just train plain language modeling and plug in the tools at inference.

The obvious question: we need training data for this. We need training data that has some sort of input, then a clear description of the step-by-step work to do — including writing a Python program, executing it and so on — a marker for when the work is done, and then the answer. Most things we find in natural training data do not contain any of this stuff in between, and if they do, they contain it in an abstract or purely textual form, not exactly in the form that's needed. This is one of the big problems here. They say they have some data sets for this, for example Khan problems — as I understand it, these are exactly such math or physics problems where the solution is really described step by step — and by taking those, they can do a sort of templating approach and generate data in this form. They criticize themselves a little bit here, saying it's too little and not very diverse: "notably, our work prompt datasets are not very large or diverse; there are likely large further gains to be made with this approach." And I agree; an approach like this is probably going to lead to a very good interaction of language models with external tools, and I'm very excited to see what people make of it. But for now we have these few data sets of such problems that let the language model know there is such a thing as a work block, where it needs to do work by itself, and where we can optionally, at inference time, step in and do the work that requires an external tool like a calculator or a Python interpreter.

Okay, let's go on to citation prediction. I've already mentioned it a little bit. You reformulate text with citations like this: you write "recurrent neural networks, long short-term memory", and then there's the start of a citation, a start-ref token; the specific format they use is the title of the paper followed by the first author's name, and then an end-ref token. They say they tried different things, including some numerical identifier of the paper, but in the end the title and name actually worked better. And you can understand why: not only is the title (hopefully, together with the author) a unique identifier for a paper, but the text of the title also gives topical hints. So I can definitely see why prediction accuracy would be better when the title text actually has something to do with what the paper is about. Likewise the author: an author usually has associations with one field; there's rarely an author who goes from field to field and contributes a little bit to biology, a little bit to graph algorithms, a little bit over here. Usually authors have their topics, and therefore having the author names available also allows the language model to learn to associate those names with the topical content of the text.
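As a concrete illustration, a citation-annotated passage in the corpus might look roughly like the snippet below. The [START_REF] / [END_REF] token names follow the start- and end-reference tokens described above, but the exact separator between title and first author is my guess.

```python
def cite(text: str, title: str, first_author: str) -> str:
    """Append a reference in the in-text format described above:
    start-ref token, paper title, first author, end-ref token."""
    return f"{text} [START_REF] {title}, {first_author} [END_REF]"

example = cite(
    "Recurrent neural networks, long short-term memory",
    "Long Short-Term Memory",
    "Hochreiter",
)
print(example)
# Recurrent neural networks, long short-term memory [START_REF] Long Short-Term Memory, Hochreiter [END_REF]
```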
And that's why it's also really cool to think of this as a related-work finder and, for that matter, an expertise finder: you can essentially ask which authors are good on the topic you're currently looking at, because you just predict a bunch of references and see which authors appear often. So that's how they introduce citations.

They also go into other things, like how they include proteins and chemical sequences, which I won't cover, but an interesting thing they do is what they call prompt pre-training. They have a little graph: here is pre-training, where you just do language modeling on the large corpus as it exists; over here is fine-tuning, where you take the head off and train a new head, for example a classifier; and in the middle there is instruction tuning. That's where you take the language model and, after training, fine-tune it — not by fine-tuning a classifier head, you still fine-tune it as a language model — but now you include prompts for the tasks you want. For example, for reference prediction you would include a prompt that says something like "do a reference prediction". Again, this is still language modeling, but it is fine-tuning, because now you're training only for the tasks you intend, only on the data sets you intend. This leads to improved performance on those particular tasks, but probably to a not-so-good model on everything else. The other way you can do it is prompt pre-training, and that's what Galactica does, which essentially means they do the same thing as instruction tuning, but at training time: they take a bunch of samples that also contain an instruction prompt in the data point — "solve this math exercise", "rewrite this code", or even the step-by-step prompt — and they mix those into the training data set, just so the model gets used to seeing this kind of instruction. That tends to work quite well and also tends not to be too intrusive to the rest of the language model's function.

I found the short section on the architecture pretty interesting. Some noteworthy things: no biases. It seems like if you make your models large enough, you get away with streamlining more and more. With small models we had to have adapters and convolutions and weight tying and whatnot; the larger the models get, the more you just want to do matrix multiplications, and anything that gets in the way just gets in the way. So biases go out the window. They use a GeLU activation, which is sort of a smooth version of ReLU that makes things a little less jaggy, which might come in handy depending on the optimizer you use. They have learned positional embeddings — again, as models get larger, you just want to learn things in a straightforward way. They say they tried ALiBi, which is a kind of relative positional encoding, and that apparently did not work. And they use byte-pair encoding for the vocabulary, which I don't think is too special, honestly.
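For orientation, here is a toy decoder block reflecting those listed choices — no bias terms in the projections, GeLU activations, learned absolute positional embeddings instead of ALiBi. The dimensions, the use of PyTorch, and the omission of the causal mask are my simplifications; this is not the actual Galactica implementation.

```python
import torch
import torch.nn as nn

class ToyDecoderBlock(nn.Module):
    """Minimal pre-norm decoder block: bias-free projections and a GeLU MLP."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model, bias=False),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)  # causal mask omitted for brevity
        x = x + a
        return x + self.mlp(self.ln2(x))

# Learned absolute positional embeddings (instead of ALiBi-style relative ones).
vocab_size, max_len, d_model = 50_000, 2_048, 512
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)

ids = torch.randint(0, vocab_size, (1, 16))            # a dummy batch of token ids
x = tok_emb(ids) + pos_emb(torch.arange(ids.size(1)))  # token + position embeddings
out = ToyDecoderBlock(d_model)(x)                      # shape: (1, 16, 512)
```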
Let's go down to the results. Their main result is really this: "repeated tokens considered not harmful". What they mean is that they don't train for only one epoch — every one of those dashed lines is one epoch — they train for multiple epochs. It's usually said that this is kind of hurtful, but it seems to be okay in this case. As you can see, there is a tiny bump — they even point it out in the text — which they say might be a double-descent phenomenon, they're not super sure. And there's also sort of a bump here, where they say they actually early-stop the run of the largest model before that. So it seems that even though you train over multiple epochs, because the text quality of the corpus is so high, it doesn't hurt to go over it multiple times, and only the largest model might be starting to overfit after epoch five; we don't know, it might, and they'd rather early-stop before that. If one of the authors is watching this: is the word "overleaf" supposed to be in here, as in "example curves in figure 23 overleaf for the 30B model"? I'm not sure; maybe "overleaf" has some other meaning that I don't know and it's actually a correct word.

In any case, they also investigate whether some of the losses — maybe papers, maybe code, and so on — behave differently from the others and are hurt more by being repeated in the data set. They say: we see no signs of loss heterogeneity, the loss falls for all sources. They suspect two factors could be at play: a quality factor, the curated nature of the corpus enabling more value per token to be extracted, or a modality factor, the nature of scientific data enabling more value per token to be extracted. These two are very similar; essentially they say higher quality, plus the nature of the domain itself — which I guess is also a kind of higher quality, but in a different way, in that scientific discourse and literature happen to be quite precise, very logical, and fairly non-noisy in linguistic terms (some people might disagree). So they have these hypotheses, although they say they don't know exactly how that would lead to less overfitting: the missing step of causation, what leads specifically from either factor towards less overfitting, they leave for future work. They note that the implication that tokens should go to infinity — i.e. that you need infinite amounts of training data, the focus of current large language model projects — may be over-emphasized versus the importance of filtering the corpus for quality. I think we've seen a number of papers previously that came to a similar conclusion, namely that higher quality can make up for missing quantity. But which one is really the way to go? Should we aim for more and more and more training data, or should we put more work into quality? Essentially, if you have a dollar to spend, where do you spend it? We know both things can make your model better, but what's the marginal value of more quality versus the marginal value of more quantity? I think that's going to be the interesting question to research in the near future.

What's also interesting: they evaluate on BigBench, which is an NLP benchmark — so not scientific, maybe some subparts are, but this is a general language model benchmark — and they perform quite well there too.
But I also find these curves funny; I think this is just what a BigBench chart looks like: it goes up here and down here and here and here. It's a bit noisy, to say the least, but I've seen this multiple times now, and at least the average goes up, so I think that's a valid sign.

They have a few more investigations. I don't want to go into all of them, but for example they test LaTeX equation prediction: they give a prompt, the description of a formula or the name of an equation, and they check whether the language model can produce the correct equation in proper LaTeX. Turns out yes, it can, and it can do that a lot better than many of the other language models available, which is pretty cool to see — that much of a significant boost over publicly available and proprietary models. Naturally, it's expected that a model trained on scientific text is better on scientific text, but it's still cool that it's not just a two percent gain; it's actually a massive gain. They also have investigations into reasoning — essentially the step-by-step math problems they solve using their work-block tokens — and again they outperform other models, except that the fine-tuned models still seem to be ahead, although those are, again, fine-tuned. Downstream scientific NLP I'm going to skip a bit.

This I found really interesting: the citation prediction task. Obviously they get better as the model grows, but what I found interesting specifically is that the model is initially biased towards predicting papers that already have high citation counts — which is reasonable; a Bayesian would totally agree that if a paper is highly cited, it's more likely that the citation you want is that paper. Someone might criticize me for that statement, but in some way it is correct, and these models do the same thing; in fact, they over-predict highly cited papers. Here you can see the distribution of the ground truth of their citation prediction data set, and here you can see what the model predicts: the model over-predicts papers that are highly cited, which I guess you can't really fault it for. But what's interesting is that as the model gets bigger — this is the smallest, then bigger, even bigger, even bigger — this distribution gradually shifts towards overlapping with the ground truth. It means that the larger the model is, the more competent it also is at recognizing when a paper that doesn't have as many citations should be cited in a particular place, as a direct consequence of having more parameters and more ability to remember things from the training corpus. Some of these papers are cited maybe ten times, some even less, and the model actually predicts them correctly. That's really impressive: it digests on the order of a hundred billion tokens of scientific text and still remembers that this one paper was cited three times within this particular topic, and then correctly cites it in that place. I'm wondering how good the ground-truth data here is, because the ground truth has to be produced by humans, and again, with the search engines we have, I'm not sure
humans can always find all the relevant things — or maybe humans just disagree about what is relevant. I think the last years of reviews at machine learning conferences, well, all of scientific review really, have shown that humans can disagree quite heavily about what should be cited.

The last investigation is into toxicity and bias. They say: we find Galactica is significantly less biased and toxic than existing language models, which again might come from the fact that it's higher-quality data, or from the scientific nature of the data, which generally has less slang, less everyday conversation, less off-the-cuff stuff, and therefore might score a bit lower on these data sets. They test a bunch of data sets, including, obviously, TruthfulQA, and I'm happy to report that Galactica, in its largest instances, is the first large openly available language model that beats GPT-4chan on TruthfulQA. So good job, well done; this is a moment of joy for me, that it has finally been surpassed. The interesting thing is that TruthfulQA is adversarially constructed in such a way that the larger the models get, the worse they usually get on it, and you can see that this model doesn't follow that trajectory. Now, we've seen other models in the past with that property, but TruthfulQA was specifically adversarially constructed for things like GPT-3, and that means Galactica is significantly different from GPT-3, in that as it goes up in size and gets more performant, it also gets better on whatever this task considers truthful. It would be really interesting to investigate what's happening there, but I'm not going to do that; I'm just happy it turned out this way.

Lastly, they say: we show that language models are surprisingly strong absorbers of technical knowledge; they tend to scale smoothly with model size. We've demonstrated this for citation prediction, where a language model outperforms tuned sparse and dense retrieval-based pipelines. As I said at the beginning of the video, this is really interesting: this essentially beats search engines for citation prediction, and it would be interesting to see how good humans are at finding correct references — say, a human plus a search engine (like the arXiv search field) versus a human plus Galactica. I would be super interested in which combination is better, because again, the tools alone don't do stuff; there needs to be a human in the loop, and that human can always make the decisions. It would be really interesting to use this as a tool, rather than making it all or nothing, where either the model writes the paper or the humans do.

So that was it for this paper. The last challenge, I guess, is to find out which parts of the paper were actually written by Galactica itself. I hear that part of the abstract may have been written by Galactica, although I don't know, and I don't know if the authors will ever lift that secret. Let's hope they don't, because I like the mystery. All right, this was it from me. Sorry for the bit of a longer rant at the beginning; I still hope you enjoyed this. I think this is a really promising direction; it raises a lot of really interesting points about quality of data, quantity of data, and about doing scientific work itself. This could be a really powerful tool for scientists of the future, and I'm waiting for the next iterations of it. Leave comments if you
have comments. Thanks for watching, see you next time. Bye.