Galactica - A Large Language Model for Science (Drama & Paper Review)
### Article: **Galactica: A Large Language Model for Science**
#### Introduction to Galactica
In a groundbreaking development, Meta introduced **Galactica**, a large language model specifically designed for scientific research. Unlike general-purpose language models, Galactica is trained on a high-quality, highly curated corpus of scientific data, including papers, reference materials, knowledge bases, and more. With 106 billion tokens in its training corpus, Galactica aims to revolutionize how scientists interact with and organize knowledge. The model's ultimate vision is to serve as a single neural network capable of powering a wide range of scientific tasks, from literature reviews to data synthesis.
One of the most notable features of Galactica is its ability to predict citations accurately. By using start and end reference tokens, the model can generate citations in the form of paper titles and author names. This approach has proven more effective than traditional search engines, even when compared to tuned sparse and dense retrieval methods. Additionally, Galactica demonstrates impressive performance on general language tasks, outperforming models like BLOOM and OPT-175B on the BIG-bench benchmark.
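To make the citation mechanism concrete, here is a minimal sketch of how the paper's `[START_REF]`/`[END_REF]` tokens could wrap a citation in the training text and how predicted citations could be parsed back out of generated output. The helper functions are illustrative, not from the paper:

```python
import re

# Special tokens the Galactica paper uses to delimit citations.
START_REF, END_REF = "[START_REF]", "[END_REF]"

def wrap_citation(title: str, first_author: str) -> str:
    """Render a citation the way it appears in the training corpus:
    paper title plus first author, wrapped in reference tokens."""
    return f"{START_REF} {title}, {first_author} {END_REF}"

def extract_citations(generated: str) -> list[str]:
    """Pull every predicted citation out of generated text."""
    pattern = re.escape(START_REF) + r"(.*?)" + re.escape(END_REF)
    return [match.strip() for match in re.findall(pattern, generated)]

text = ("Recurrent networks with gating were introduced in "
        + wrap_citation("Long Short-Term Memory", "Hochreiter"))
print(extract_citations(text))  # ['Long Short-Term Memory, Hochreiter']
```

Because the citation is generated as ordinary text between special tokens, no separate retrieval index is needed at inference time.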
Interestingly, **Galactica was used to help write this paper**, raising questions about the role of AI in scientific research. While some worry about the implications of AI-generated content, others see it as a powerful tool that can assist humans in their work.
---
#### The Corpus and Tokenization
The foundation of Galactica lies in its **high-quality, curated corpus**. Approximately 83% of the data comes from papers, with the remaining 17% including code, reference materials, knowledge bases, filtered Common Crawl data, prompts, and other sources. The total corpus size is 106 billion tokens, significantly smaller than the datasets used to train most large language models.
Tokenization plays a critical role in Galactica's design. All data is formatted into **markdown**, ensuring consistency and facilitating the integration of scientific knowledge. Mathematical operations are split into individual characters, and numbers are split into individual digits to improve numerical reasoning. For example, the number "136" is tokenized as "1", "3", "6".
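As a rough illustration, a digit-splitting pre-tokenization step could look like the following. Galactica's actual tokenizer rules may differ; this sketch only shows the idea of separating consecutive digits:

```python
import re

def split_digits(text: str) -> str:
    """Insert a space between consecutive digits so each digit becomes
    its own token: "136" -> "1 3 6"."""
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

print(split_digits("43 + 29 + 51 + 13 = 136"))
# -> "4 3 + 2 9 + 5 1 + 1 3 = 1 3 6"
```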
This meticulous approach to tokenization highlights the importance of representing scientific knowledge in a structured format. By doing so, Galactica can better handle complex tasks like step-by-step reasoning and citation prediction.
---
#### Step-by-Step Reasoning with Work Tokens
Galactica introduces a dedicated **`<work>` token** to facilitate step-by-step reasoning. This innovation allows the model to simulate human-like internal working memory by breaking a problem down into individual steps. For instance, when calculating the average of the numbers 43, 29, 51, and 13, Galactica can explicitly show its work:
1. Add the first two numbers: 43 + 29 = 72
2. Add the next number: 72 + 51 = 123
3. Add the final number: 123 + 13 = 136
4. Divide by 4 to find the average: 136 / 4 = 34
This approach not only improves accuracy but also enables external tools like calculators or Python scripts to verify and execute the steps. During training, Galactica learns to generate these step-by-step solutions, while at inference time, it can rely on external computation for complex tasks.
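A `<work>` block lets the model write out a small program rather than guess at arithmetic, and the host can run that program externally. Here is a toy sketch of that offloading; the `<work>` tags follow the paper, but the extraction-and-execute routine below is illustrative:

```python
import re

# A model emission with an embedded <work> block. The host extracts
# the Python inside it and executes it instead of trusting the model.
emission = """The average is <work>
numbers = [43, 29, 51, 13]
total = sum(numbers)           # 43 + 29 = 72; 72 + 51 = 123; 123 + 13 = 136
result = total / len(numbers)  # 136 / 4 = 34.0
</work> 34."""

code = re.search(r"<work>(.*?)</work>", emission, re.DOTALL).group(1)
scope = {}
exec(code, scope)       # offload the arithmetic to a Python interpreter
print(scope["result"])  # 34.0
```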
---
#### Prompt Pre-Training and Architecture
Galactica employs **prompt pre-training**, in which prompt-formatted task examples, such as instructions to calculate a value or solve a math problem, are mixed into the pre-training data alongside the general corpus. This exposure allows the model to generalize across various scientific tasks without requiring fine-tuning on specific datasets.
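A minimal sketch of the idea follows. The prompt templates and the mixing ratio here are assumptions for illustration, not the paper's actual proportions:

```python
import random

# Raw corpus documents mixed with prompt-formatted task examples.
documents = ["...full text of a paper...", "...a reference article..."]
prompted = [
    "Question: What is the average of 43, 29, 51 and 13?\n\nAnswer: 34",
    "Translate to LaTeX: the square root of x.\n\nAnswer: \\sqrt{x}",
]

def training_stream(p_prompt: float = 0.05):
    """Yield pre-training examples, occasionally sampling a prompt."""
    while True:
        pool = prompted if random.random() < p_prompt else documents
        yield random.choice(pool)

stream = training_stream()
print(next(stream))
```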
The architecture of Galactica includes several key features:
1. **No biases**: following PaLM, bias terms are removed from the dense layers and layer norms, a simplification that has been found to help training stability at scale.
2. **GELU activation function**: A smooth alternative to ReLU, which may help with optimization.
3. **Learned positional embeddings**: relative schemes such as ALiBi were tried at smaller scales but did not show clear gains, so learned absolute embeddings were used.
These design choices reflect the importance of simplicity and adaptability in a model tailored for scientific applications; the sketch below illustrates all three in a single block.
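Here is a minimal PyTorch sketch of those three choices together, with attention omitted and illustrative dimensions. This is not Galactica's actual configuration:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Bias-free feed-forward block with GELU and learned positions.
    Attention is omitted for brevity; dimensions are illustrative."""

    def __init__(self, d_model: int = 512, max_len: int = 2048):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)     # learned positions
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model, bias=False),  # no bias terms
            nn.GELU(),                                    # smooth ReLU alternative
            nn.Linear(4 * d_model, d_model, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.ff(x + self.pos_emb(positions))

x = torch.randn(1, 16, 512)
print(TinyBlock()(x).shape)  # torch.Size([1, 16, 512])
```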
---
#### Evaluation and Results
Galactica's performance was evaluated across several tasks, including citation prediction, equation prediction, and toxicity detection. The results demonstrate its superiority over existing models:
1. **Citation Prediction**: Galactica outperforms search engines and neural retrievers, showing a clear bias toward highly cited papers. However, as the model grows larger, it becomes more adept at predicting less prominent citations.
2. **Equation Prediction**: On tasks like predicting equations based on descriptions or names, Galactica significantly outperforms other models.
3. **Toxicity and Bias**: Galactica exhibits notably lower toxicity and bias compared to general-purpose language models, likely due to its focus on scientific data.
Galactica also outperformed open models such as OPT and BLOOM on **TruthfulQA**, a benchmark that measures factual accuracy. This result underscores the importance of domain-specific training in improving model performance.
---
#### Conclusion: The Future of Scientific Research
Galactica represents a major leap forward in artificial intelligence for scientific research. By emphasizing data quality over quantity and introducing innovative features like work tokens and prompt pre-training, Meta has created a tool that could fundamentally change how scientists interact with knowledge.
While concerns about AI-generated content persist, Galactica's practical benefits are hard to ignore. As demonstrated, it can assist humans in tasks ranging from literature reviews to equation prediction, while also surfacing hidden connections between research areas.
The ultimate question remains: will tools like Galactica augment human intelligence or replace it? For now, the answer lies in collaboration—using AI as a powerful assistant, not as a replacement for human judgment and creativity.