Long Context Language Models and their Biological Applications with Eric Nguyen - 690

Focusing on Architecture Comparisons

We did compare our approach with the other architectures someone trying to address these tasks might typically use. We ran a scaling analysis of pre-training on DNA across the prevailing DNA architectures, comparing against Transformers, Mamba, and Hyena in its hybrid variant. On that front, we compared performance mainly on perplexity and how it scales, and it was interesting to see that the convolution-based models were particularly strong on DNA relative to Transformers. That alludes back to the difference between DNA and natural language: DNA is sparser in terms of information and noisier at the same time.
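To make the perplexity-based comparison above concrete, here is a minimal sketch of how one might estimate perplexity for a causal, character-level DNA language model in PyTorch. The `model` and `dataloader` objects are hypothetical placeholders, not the actual evaluation code from the paper.

```python
import math
import torch
import torch.nn.functional as F

def dna_perplexity(model, dataloader, device="cuda"):
    """Estimate perplexity of a causal DNA language model.

    Assumes `model` returns logits of shape (batch, seq_len, vocab_size)
    over the nucleotide vocabulary, and `dataloader` yields integer token
    tensors of shape (batch, seq_len). Both are hypothetical placeholders.
    """
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for tokens in dataloader:
            tokens = tokens.to(device)
            logits = model(tokens[:, :-1])   # predict the next nucleotide
            targets = tokens[:, 1:]
            nll = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets.reshape(-1),
                reduction="sum",
            )
            total_nll += nll.item()
            total_tokens += targets.numel()
    return math.exp(total_nll / total_tokens)

# The same routine could be run for each architecture (Transformer, Mamba,
# hybrid Hyena) at several model sizes to produce a perplexity-vs-scale curve.
```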

Observations on Zero-Shot versus Few-Shot Performance

We focused initially on zero-shot tasks in this work. Most previous work in biology focuses on fine-tuning, often taking an off-the-shelf language model and fine-tuning it to be useful in a biological context. That isn't necessarily a problem, since many of those models are also trained from scratch, but the previous models we observed didn't focus on generalizability. They typically fine-tuned because, as our hunches and experiments with smaller models suggested, the zero-shot performance wasn't compelling enough to be practical for many applications. In this work, because we had scale and were focused on generalizability, we wanted to see how well the model could generalize zero-shot across different modalities in biology. That being said, we did report some fine-tuning results, which on many of the tasks were also very strong, quite competitive with, if not more competitive than, the zero-shot results we observed.
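As a rough illustration of the two evaluation modes discussed above, the sketch below contrasts zero-shot scoring (using the pretrained model's own likelihoods, with no task-specific training) with fine-tuning a small task head on labeled data. The names, the `return_hidden` argument, and the pooling choice are assumptions for illustration, not the Evo API.

```python
import torch
import torch.nn as nn

def zero_shot_score(model, tokens):
    """Zero-shot: use the pretrained model's log-likelihood of a sequence
    as the prediction signal, with no task-specific training.
    `model` and the tokenization scheme are hypothetical placeholders."""
    with torch.no_grad():
        logits = model(tokens[:, :-1])
        logprobs = torch.log_softmax(logits, dim=-1)
        target_lp = logprobs.gather(-1, tokens[:, 1:, None]).squeeze(-1)
    return target_lp.mean(dim=-1)  # higher = more "natural" under the model

class FineTunedHead(nn.Module):
    """Fine-tuning: add a small task head on top of the pretrained
    backbone and train it (and optionally the backbone) on labels."""
    def __init__(self, backbone, hidden_dim, n_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, tokens):
        h = self.backbone(tokens, return_hidden=True)  # assumed API
        return self.head(h.mean(dim=1))                # mean-pool over sequence
```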

Future Research Agenda

We're certainly interested, from both a biological perspective and a machine learning perspective, in how to improve these models. From the machine learning side, we absolutely think scale is going to be really important. That's why we did a set of DNA scaling laws, to see whether the kind of gains you observe in natural language also appear in DNA, and indeed we see a similar trajectory for DNA, which is something we're going to focus on. From the biological perspective, modeling more complex organisms is another key area. For mammals and humans, we definitely want to build foundation models, because the application to human health and therapeutics is huge. If we can have a language model that understands the grammar of our own DNA as well as we've learned that of prokaryotes and other microbial life, I think the application space is far larger than what we're aware of right now. My philosophy, or hope, is that in the same way groups like OpenAI have merged modalities across language and vision and given us things like Sora and DALL-E, there's a potentially bigger opportunity in biology to merge modalities for designing sequences, therapeutics, and drugs. There are so many different assays and types of sequences that try to measure similar things in our bodies, and being able to merge those signals to design or predict useful things in DNA or other biological sequences could be far greater in potential and far more impactful.
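As a toy illustration of the scaling-law analysis mentioned above, the snippet below fits a simple power law of validation loss against parameter count. The data points are made up for demonstration and are not measurements from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, alpha, c):
    # L(N) ≈ a * N^(-alpha) + c, a form commonly used for LM scaling laws
    return a * n_params ** (-alpha) + c

# Hypothetical (model size, validation loss) pairs -- illustrative only.
sizes = np.array([7e6, 4e7, 2e8, 1e9, 7e9])
losses = np.array([1.10, 0.98, 0.90, 0.84, 0.80])

(a, alpha, c), _ = curve_fit(power_law, sizes, losses,
                             p0=[10.0, 0.1, 0.5], maxfev=10000)
print(f"fitted exponent alpha ≈ {alpha:.3f}, irreducible loss c ≈ {c:.3f}")
```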

Evolution of the Hyena Architecture

Hyena is where we've placed our bets because, in a resource-constrained world, it has given us good results on longer sequences. But is it enough, for example, to build a single enormous biological foundation model that will solve biology? No, it's not enough to get us there on its own, so I do think there's room for innovation on many fronts, including architectures, down the road.

Future Directions

We would like to continue exploring the potential of the Hyena architecture for DNA and its applications. We are excited about the prospect of merging modalities to design sequences, therapeutics, and other drugs, as this could have a significant impact on human health. We believe that scaling is key to improving these models, and we plan to continue studying this aspect further. Additionally, we aim to build foundation models for complex organisms such as mammals and humans, which will enable us to tackle some of the most pressing challenges in biology.

"WEBVTTKind: captionsLanguage: enall right everyone welcome to another episode of the twiml AI podcast I am your host Sam charington and today I'm excited to be joined by Eric Gan Eric is a PhD student at Stanford University before we get going be sure to take a moment to hit that subscribe button wherever you're listening to Today's Show Eric welcome to the podcast thanks for having me Sam really excited to be here I'm excited for our conversation as well we're going to be talking about your research into long sequence Foundation models and their application in biology in particular hyena and hyena DNA and most recently EVO to get us started I'd love to have you share a little bit about your background and research interests sure so I am in the bioengineering department at Stanford but my background is mostly in the machine learning site I started off in computer vision and started working with State space models and seeing if they can be used for some of these vision and language domains uh and then started moving on to see if we can extrapolate them and use them for biology applications in particular for their long context purposes and so I've been exploring these applications of hyena in particular which is a convolutional based language model to the domain of biology in particular DNA you know let's start by talking a little bit about hyena and some of the motivations for that architecture in addressing longer context lengths for language modeling and other sequence models tell us about the Big Challenge there in the motivation so our lab has been interested in this long sequence task for quite some number of years and it started with Albert goo who was interested in in state space models at the time and so initially their work in our lab was applying these models to continuous signals typically so like time series and audio and so when language models started to become really popular we too were interested in applying it to language but we noticed there was a gap in applying some of these early State space models and so the hyen architecture was really focused as an offshoot of these State space models to see if we can get it to work on language and so we focused on getting it to work on these discret signals that you know we think was language so Michael ply the lead author in architecture and Stephan maseri the other co-author on the hyena paper we started focusing on what's the Gap uh and getting this Gap to to close between Transformers and so that was initial work that they focused on and it's when we started getting excited we were able to sort of match similar quality but then explore this component which was being able to fit longer sequences um and so that's what we got started getting excited about your applications so the obvious question is Transformers have proven to work very well why the need for an alternative what is the challenge that is presented with the Transformer architecture in dealing with longer sequences yeah so you know at the time when we were working on it um this was still in the age where not as many folks were thinking about context particular and so um there's this key constraint with Transformers which uh as the sequence length grows the time complexity right the algorithmic efficiency of the operation has this squaring uh law that grows and so when you double the sequence length the computational requirement quadruples and so when the sequence gets very long uh if you care about things like DNA or you know entire code bases then it becomes 
intractable for these models it can become attractable for you know these classic Transformer models and so the the motivation is is there you know a different type of operation or primitive that could be hopefully as expressive but will reduce that complexity in computation for these longer sequences and you mentioned one of the early attempts to try to address that in particular with State space models I refer folks to my interview with Dan Fu from actually just about exactly a year ago language modeling with State space models was the title of that episode and so this work Builds on some of that state space work does it retain some of the kind of hierarchical approach that made that that made the state space work yeah it it uses a a good portion of it in terms of skeleton so the differences that we took from that paper with Dan Fu on state space models for languages was how we parameterize the convolutional kernel so a key way to make convolutions work for long sequences in particular is to move on from this like sort of explicit parameterization of kernels to an implicit version and so that allows you to do is essentially have a a parameter efficient way to use convolutions and so the state space model would use State space models for this kernel and then the original H work we would use we sort of generalized it to just use an MLP so an MLP that would parameterize the filters for a convolution kernel and so the thinking was to have an unconstrained version that can learn perhaps a more expressive kernel for the convolution and your research group published a really interesting blog post last year talking about this scope of research from Deep to long learning was the is the post that I'm referring to and in that post there's a really interesting graphic and we'll link to all of this in the show notes but one of the elements discussed in that post and the Centric to this graphic is the idea of the fft as kind of a new primitive that's driving this alternative family of models you know talk about the role of the fft in hyena so for for hyena we rely on long convolutions and so typical convolutions they'll have uh short filters right and so the time complexity of having a kernel stride across an input is still basically a linear operation but when you have a convolutional kernel like hyena the key difference is that we're going to have have a filter the same length as the input so a global convolution um now the challenge is the time complexity will increase so you you essentially get back uh a Time complexity that's similar to a tension like a n squ time complexity the coral is the same length as the input and so you lose that benefit of a lower time complexity if you just applied a convolution in this standard way that people think of in the space domain instead we'll take advantage of a convolutional the right from for folks famili with signal processing which states that a convolution done through the 4A domain is equivalent as a element wise multiplication in that space to its convolution in the space domain and so what we'll do is we'll use a fft fast for transform to essentially convert the input in kernel to the frequency domain do an element wise multiplication and then bring it back to the space domain and that whole operation allows us to reduce time collect from an N squ to an N log n operation which is for long sequences near linear MH and do you retain the benefit of being able to accelerate using gpus yeah that's a good question so in general gpus are a lot more optimized 
for matrix multiplication and so there's probably some room for improvement in terms of fully utilizing the hardware but in general we can get you know fairly efficient convolutions using gpus and that's that's what we do for language model training for haa now there's follow-on work from Dan Fu which I also helped out on okay called flash fft com which flash fft work exactly bringing the sophistication from Hardware wear algorithm design from flash attention to the fft and so that was really exciting work as well H and going from State space modeling to the long sequence convolution do you lose some um explainability or mechanistic understanding uh in doing that the you know having a state space model that describes explicitly how the state evolves from from one state to the next seems like it would have some of those benefits is that true at all and and do you kind of abstract away from some of that with a convolutional approach yeah that's a good question I haven't focused too much on the explainability of you know what these State space models are learning in terms of their um State space representations so could be but I think what we've focused on more so is using sort of attention Matrix like uh view of the convolutions so even though we use convolutions we can also still produce an attention map you know quote unquote which allows us to essentially see what inputs are being attended to but but more so in that convolutional like what's lit up right in terms of attributing weights essentially so that's not been a constraint so far for interpretability talk about how far you've pushed the hyena model in terms of input size or or context length yeah so in the initial work we we start off with 64,000 tokens for the high hierarchy paper on synthetic tasks and then in a follow on work we applied it to to DNA sequences uh me that um is is constant need of having longrange dependency modeling when we created a model called hyena DNA which we scaled up to 1 million token context which at the time was the the longest language model context uh since then lots of folks have caught on and found this this kind of work also interesting and also get Transformers to work as well and so uh with long context so interesting to see the the work evolv or the space of all can you talk a little bit about um the the landscape of long contexts enable on the Transformer side what are the approaches that folks are taking there and what's working do you have a sense for what Gemini is doing uh to you know get at 1 million 2 million uh context windows and how those approaches will fundamentally differ from um swapping out Transformers for convolutions yeah so I I haven't so I think I've heard some rumors about how Gemini might do it but I yeah I actually don't have a a strong idea myself there's you know talk of using things like ring attention or some kind of sliding window attention where it's attending to a local uh window and then having a hierarchy where these local windows can get aggregated into more meaningful uh groups so I yeah I I'm not sure actually but um I think I suppose I yeah I think there's a lot more that could be explored I think the the sort of my takeaway is that folks have focused a lot on the memory reduction of using attenion for long context but there's sort of no way to get around or at least that I'm aware of of reducing the time complexity it's still going to be an N byn type of operation or N squared operation um and so you can you know reduce the hardware requirement but then the 
amount of time is still going to be pretty at least with the current setups I've been seeing and so I think that's sort of motivated folks to look at these alternative architectures um if they C if they do care about long contexts including you know Mamba obviously very popular um and you know we're sort of placing our bets on a slightly different approach with with hyena um variants but um yeah in general I think quality of these long context model with a Transformer or alternative architecture seems to be something that people are still trying to improve like you can fit it in context but does it you know generate high quality sequences throughout the long context for example um still an open question something we're observing and is hyena comparable to Transformers from a memory utilization perspective or well I guess the implication is that it's it's much more efficient yeah so in terms of you know if we compare to vanilla Transformers van vanilla tension algorithm yeah we're definitely a reduction in terms of uh memory and time the I guess bar is raised when you look at flash attention you know a very clever sophisticated algorithm that's essentially learned how to uh linearize the memory by doing it in blocks right and so I think that is a hard bar like attention super fast and I so I think there's still room for improvement to get convolutional based models to that degree of sophistication but in certain yeah so the short answer is depends on which settings you care about there's certain very long sequences that you still take very long on flash tension or Transformers which it could be faster on the convolutions but the memory side the full benefit is not there yet because there probably have some more room in terms of the systems optimization and you alluded to this a moment ago um in training hyena the training Corpus was synthetic language as opposed to a traditional training data set is that am I remembering that correctly and what does that mean so so in the original H hierarchy paper um we trained on language for I believe uh 2K contexts on the pile so like a pre-training perplexity comparison with Transformers and that essentially was stacking up um favorably for haa and then to push specifically the long context at the time there wasn't a lot of benchmarks for analyzing long context or you know very useful ones that we thought and so we came up with or used synthetic tasks that focused on a specific task which is recall okay so being able to look up portions sort of like a key value Dictionary look up uh look up values earlier in the sequence um and then we use a simple synthetic task of key value scores like letters and numbers essentially just to see if we can do this simple task but over long range and so that that sort of was a prototype for even fitting in Long context but then that really motivated to look for you know useful applications and that kind of segue to the hyena DNA work got it got it so let's jump into the hyena DNA work how did you identify biology as an application for this problem so part of it was I was always looking for it in in impactful application in healthcare or biology when I started my PhD sort of like the right opportunity or right technology to to dig in deep and so I sat in a number of different Labs on campus at Stanford who work at the intersection of biology and computer science and um long sequences and long derange dependencies kept on coming up in the world of DNA where people can model it just like a language like natural language 
in the sense that you have letters you have a vocabulary you can feed them into language models and and then you can ideally learn all the interactions and grammar associated with that language to do useful tasks like prediction or or perhaps generation as well can you talk about how the long range dependencies kind of manifest from a biological perspective and and how does that differ from natural langwood sure so I think the the really neat thing that makes DNA uh an interesting problem area or application to work on for machine learning is that it does like physically have these longrange dependencies so in DNA you know you have this sequence of uh letters essentially right they're actually chemical compounds but we can represent them as letters right the four bases AC and G and in general when you're modeling DNA there's processes that where what you care about are essentially things binding to DNA and kickstarting the process of creating proteins and so the process of having something bind to DNA can be influenced by what we call motifs that are present in the DNA that will essentially alter the probability of things binding or not okay now that the interesting thing for longrange dependencies is that uh some of these interactions can occur over millions of base pairs away from each other and so having a model that could essentially identify the presence of certain motifs and how they effect Downstream binding of certain molecules can mean the difference between someone expressing some kind of disease or altering a functions or their traits essentially and so it's really desirable to to capture these longrange dependencies within a single model so can you talk a little bit about how um one might use a long range genomic Foundation model and how that differs from the way someone might use something like an alpha fold and kind of compare and contrast those approaches sure so Alpha fold Alpha 3 in particular U very powerful very successful model and has probably the most impacted with machine learning and biology so far um now as as powerful as it is it is a supervised task which is uh specifically designed to predict structure right so given a sequence of amino acids and now more recently including DNA U they want to predict the 3D Shape of that sequence because it has a a 3D shape analog in physical space for Alpha fold it's specifically a supervised task and so it could be difficult to modify it for use of different purposes so the nice thing about language models and Foundation models they're meant to be this General learning representation machine that could be modified to be applied to a number of different Downstream tasks and that could be more difficult or perhaps more challenging for an alal model that's restrict and so some of those tasks for a a a biological Foundation model that's you know language based can include the prediction tasks right so let's say you have a sequence what's its function uh effect gene expression or how much of a certain protein you know this is this particular sequence expected to make um all the way to the generative side as well and so these language models right for natural language can generate lots of text and and answer prompts and questions right uh for DNA or proteins even right what you can imagine is and what they're currently being used for um is to design novel drugs or novel molecules proteins or uh systems uh that could be designed just from the sequence space um with these models got it can you talk a little bit about um how the model 
is trained the human genome has what 3.2 billion nucleotides which is quite a bit less than the uh the Corpus size for a state-of-the-art language model nowadays um but the language is a lot simpler in terms of the number of tokens like what are what's the training process and what are the um elements that come into play yeah so you're right so our initial work start off with training on the human genome which is made up of three billion base pairs or or nucleotides and the vocabulary you know depends on how you want to tokenize but for our case we we use the simple vocabulary of just four bases or you can consider like the bite level um and that can make it challenging for a number of ways um but yeah for focusing our initial use case we'll Train by just taking in an arbitrarily long sequence from this Human Genome and then do next token prediction or next nucleotide prediction sort of a causal style uh but we'll do this essentially billions of times just like natural language and so yeah that that is a challenge for I guess scale right so you know initially this could be seen as like a a first attempt or prototype to focus on the human genome but more desirably we're follow one to work for Evo in particular we we grabbed millions of genomes and millions of species to try to train a much larger DNA Foundation model across different species in that case you know the good thing about DNA it's it's the one the most common or most widely available sources of biological data out there and so there's definitely trillions of of tokens also in in in this domain that uh we can reach to to train a foundation model let's maybe take a step back and have you introduce Evo is it simply this idea of taking the hyena DNA approach and architecture and incorporating additional species or uh did the architecture itself evolve to enable you to do that so Evo is is a sort of our next stage in in training large DNA Foundation models taking some of the learnings we had from hyena DNA and scaling it much larger so hyena DNA was about a 7 million parameter model which in language models very small and then Evo is a 7 billion parameter model so it's about a thousand times bigger and to get that to scale effectively we did have some architecture improvements in this this work so a couple of key ones is that it's a hybrid model so we actually mix in reintroduce some attention layers with uh with these high convolutional layers and they thinking there is sort of to get the best of both worlds uh where where convolutions can get you know the efficiency gain and then attention has somewhat stronger abilities in terms of uh recall ability and so we'll take advantage of those uh those as well um and so we also have some modifications in terms of how we parameterize the convolution kernels themselves and then you know change some of the the the MLP blocks as well but um the overall training style and core of this convolutional style language model from hyena is is in large part uh used in Evo as well okay and you talked about Evo uh the Evo data set spanning many species do you retain the human species from hyen DNA and then incorporate the uh Fage and procaryotic and uh you know those types of sequences or is it you know only the lad I'm trying to wrap my head around does it make sense to have all of that uh in one model yeah that's that's really good question so we actually made the sort of strategic choice to focus on just procaryotic and phase genomes and so which you know folks who are not famili familiar with that 
biology uh essentially microbial life and the thinking there was we're going to focus on these genomes that are somewhat shorter they're shorter and they simpler life forms and their DNA the language of their the grammar of their DNA follows more clear rules than than humans or mammals for example humans and mammals are quite complex organisms as as as we can imagine and so having a large model trained on on humans and mammals or what we call ukar um is a a level complexity that uh that we're sort working our way up to and actually in follow on work for Evo 2 is what we plan to do next but we sort of wanted to to focus on simple life forms first because um sort of the thinking so far before we started working on Evo was we were unsure that it would a large DNA model could work in general we we weren't sure if some of the tasks that we were going to do or go after could actually for specifically to be able to generate and design DNA was a key task so we wanted to start with simpler organisms first dig into that task in a little bit more detail in terms of generating and designing DNA what are some examples of reasons why you might want to do that and ways that it might be applied sure so one of the initial questions before we started the evil project was is it possible to have a language model or Foundation model design an entire genome from scratch and the motivation for that was a lot of work previously had you know or a lot of research in general allows us to write DNA read DNA edit any sequence of DNA but to be able to design DNA we still didn't know enough about how DNA worked to be able to design and so we wanted to use language models to see if we can help design through a foundation model and so you know why would you want to do that if you could the thinking was if we could design an entire genome then we can perhaps control the function of these genomes as well and so some of the ideas were can we target cancer cells if we can design organisms to to have that function can we make better biofuels can we make antibiotics these were some of the the sort of the moonshots we thought that could be used for um you know biomanufacturing or climate or or therapeutic purposes for example and now that Transformer based approaches are accommodating longer context lengths could those be as easily applied to these problems we we've seen attempts to apply them to these types of problems you know is it simply quote unquote a choice of kind of training Corpus and going through the the Motions or are there unique adaptations of this approach that lends itself to the biological use case so that's a good question we we got this question a lot um or and I and it depends you know I think people have different answers but I I certainly you know having trained at these DNA models for the past year and a half certainly have my biases in my opinions which um I do believe that convolutional based models have this inherent inductive bias advantage over attention models and I think some of these reasons is that DNA is not natural language for one you know natural langu is a very dense Rich um type of sequence you know type of data distribution in DNA the information is far more sparse and over long range it's noisier uh because there's you know Dr biologists there's there's a lot of talk about junk DNA in your genome where people don't know what it does or if it has it might have no function in your genome and so what we observed is that these convolutional models we hypothesize is that they're better out 
filtering out this noise and being able to parse the signals over the long range is better than attention perhaps because convolutions from signal processing and they're built to filter out information really well and use that analogy but yeah there's other other things about modeling DNA that make it challenging like being able to tokenize at the Single Character level or the single nucleotide level in general we've seen Transformers just do a number of different works that have tried to to model language at the Character level and it usually underperforms and they've not been able to use long sequences for example and so in particular it's important to have this single character tokenization for DNA because a single character change could mean the difference between having a disease or not and so you want that sensitivity and resolution at that level that Transformers I haven't seen I don't think anyone seen a successful large scale language model that's tokenized at the Single Character level yet okay one of the properties that is characteristic of Transformer based models is uh hallucination of course is there anything uh kind of inherent to the hyena models that yields different characteristics in terms of propensity to hallucinate Ah that's that's interesting I I don't think I've played around with natural language enough to be able to identify human interpretable hallucinations I suppos but yeah I mean it's in general it should be SU I imagine to these hallucinations just like Transformers I guess maybe another way to ask the question is is hallucination a function of the architecture you know more so than the probabilistic nature of the approach in general more or the training approach if you have a take on whether approaches like hyena or the hyena approaches you know for some reason less susceptible to it the and I think the the context is that hallucination and language can be bad hallucination in you know DNA generation sounds scary yeah fair enough um yeah I'd say that these machine these systems they're probalistic machines like you're teaching them the probability of the next token and so because it's probabilistic these probabilistic errors could accumulate and then hallucinations can abound and I don't think that's going to change by the architecture or like the layer that we swap in it's still a fundamentally a next token prediction with a probabilistic mechanism um but yeah you're right so in terms of biology this could be scary as one way to think of it but it also could be a source of variability right or like a controlled Evolution which that's usually the analogy that folks in machine learning try to describe these generative models in biology you're controlling it's an opportunity for creativity and to use it for the generative task that you envision is that where you're going yeah exactly at the same time it's still not a deterministic thing right because if if it was then you you wouldn't have much variety right so it could be also viewed as a a feature not not a necessarily a scary thing and so to overcome that fear perhaps we definitely filter out for example in the same way in language you would want to filter out good quality Generations luckily in in biology there's a lot more heris software tools to be able to filter out high quality sequences to perhaps appease some of these concerns in the original hyena work um we talked a little bit about the synthetic tasks uh do you take advantage of uh and you just mentioned some of the existing tools and taking 
advantage of existing biological tools on the filtering side of things do you also use uh existing simulation tools and synthetic data generation tools is that a part of what you've done with hyena DNA or EVO um yeah that's a good question um we haven't done done so much on the synthetic side actually um but that is something that we're interested in in particular to see if we can create some synthetic um tasks for long range abilities I think that's an ongoing search for you know uh ideal tasks on that front but um I think in biology there you know there seems to be so many different applications that we could try um that we you know we're we're trying to be more um thoughtful and choosing the most relevant or impactful tasks so there's there isn't a shortage of tasks is what I mean so um we haven't had to look for synthetic ones it's more so prioritizing which biological meaningful tasks to go after first and in the note of tasks folks listening may be familiar with crisper which is a gene editing tool that's become popular over the past several years is is there a direct application of this work to uh crisper exactly and so the thinking there is that normally let's say like a drug or Pharma company might try to look for existing molecules in nature by just mining a bunch of genomes or DNA from different microbial life uh what they'll do is just take the exact copy they'll just look for it take that exact copy what we want to do instead is to learn what's out there and then essentially conceptually how I think about it is interpolate between what's seen before to hopefully design something not existing in nature right something new with a desired function you know hopefully that that works better that more efficient that enables in our case for example Gene editing tools that can work on different modalities so maybe not just DNA but RNA you know different different types of biological sequences um different levels of on andof targets like you know does it cut DNA in where you want it versus cut DNA in areas that you don't want it um you know being able to engineer and design these new tools um could open up new therapies for example is that's that's why we're interested in finding new ones and to make sure I'm understanding this correctly you we talk about uh you know crisper as a tool and a family of tools and kind of cutting DNA in places it's not like it's using a laser to cut the DNA it's uh a a chemical sequence and it's a particular chemical sequence in a family of chemical sequences and you could use uh generation here to identify chemical sequences in the distribution of that family uh that may have different properties exactly yeah so stepping back to give folks a sense of what these crispr cast systems are so in general they're they're defense systems so they're they're sequences of RNA and proteins MH and bacteria will have these sequences in their own genome and they'll use it as a defense mechanism for viruses and so what happens when a virus tries to come in and inject its own viral DNA into a bacteria this crisper system essentially will recognize uh a pattern of this DNA and then essentially cut up that DNA and render it innocuous and so that's sort of it this defense system that will both identify DNA and then cut DNA and so as humans we've sort of hijacked that system to to use it for tool for Gene editing essentially and as a therapy and so that you know specifically what it's made up of is RNA and and proteins so the RNA is going to be used to recognize where to cut 
and the protein is going to what is going to be the thing that cuts and so it's this complex of RNA and proteins that we're trying to generate because okay it's just DNA DNA as well talk about in the case of both hyena DNA and Evo evaluation and evaluation criteria benchmarks performance what have you seen and what are you comparing against yeah so I I'll I'll focus on the EVO One it's most most tough of mind and and and and relevant in this case for the benchmarks that we look at we breaking them down two two two areas or I guess our tasks are brok into two areas the predictive tasks and the generative tasks okay and the predictive tasks are the more Benchmark oriented ones that machine learning folks are used to seeing which is given a sequence can it predict some kind of function or Fitness score associated with that sequence so for example if they protein sequence it in the lab will have a data set that measures some kind of Fitness score associated with the sequence meaning what's its solubility or or stability or binding Affinity metrics related to sort of its like drug properties and so what we'll do for this Evo uh work is to see if the model is is uh sensitive to these uh mutations or perturbations that we introduce and see if it changes the fitness score in some meaningful way and so we'll take for example sequence of proteins and add a mutation into a portion and see if that mutation can be picked up in terms of how the outputs of the model correlate with the fitness score for that molecule and so specifically the data set called protein gym that we apply this to and the nice thing about this sort of task setup is that we can probe the model to see if it's learned useful features zero shot or without fine tuning and so what we saw for for our work or our performance is that a model like Evo was comparable with state-of-the-art protein models in terms of this pertubation and correlation to Fitness score without that was interesting even though it's just competitive or in some cases a little bit better the the more compelling part is that Evo is trained on Raw unannotated DNA sequences whereas protein models are trained specifically on justess protein sequences so a clear advantage and the main difference there is that DNA has both protein sequences and non-protein sequences so if you train your model on the entire DNA or genome your model has to basically tease apart and figure out what's a protein and what's not a protein and the thinking before this was that most models could not do this they had to focus they had to be one model for proteins one model for DNA one model for RNA to be competitive and what we were trying to show is that you can have a single model that trains across all of DNA that can learn all three modalities okay is there something you can say about or did I miss in there you saying in terms of relative performance to existing things state-of-the-art on some number of things so for uh yep so for proteins we were competitive with state-of-the-art so about the same and then for RNA and DNA modalities we we're able to show Evo has a significant performance boost or state-ofthe-art on these zero shot tasks compared to other domain specific models for for DNA or RNA and so we we focus on those three those those are the central dogma of biology what they're known for as and so we we try to have a little bit of Benchmark on each of these domains for the predictive tasks and so just as strong as proteins and then state-of-the-art on DNA and R okay and in terms of 
compute budget how does this approach compare to the prior State ofth art yeah so when we talk about state-ofthe-art it's more so a comparison of absolute performance versus controlling for parameter size or architecture so they do have different parameter sizes and data sets and so it's hard to make that exact comparison and so yeah it's most yeah it's less so focusing on the architecture comparisons well maybe how would you characterize the efficiency of these models relative to other approaches that someone trying to address these tasks might typically use yeah so so I take that back so we did compare we do a scaling analysis uh of pre-training on DNA across the prevailing DNA architectures so we do comparing with Transformers Mamba and then hyena in its hybrid variant as well and so on that front we do compare performance across architectures mainly on perplexity in terms of how they scale with perplexity and so it was interesting to see that these convolutional based models being particularly strong on DNA over Transformers and again this kind of alludes back to to seeing noting the difference between DNA and natural language you know being sparer in terms of information and also noisy at the same time What Did You observe in terms of zero shot performance versus few shot performance and is fine-tuning a thing that makes sense in this context and if so how so we we focus initially on zero shot tasks in this work because we wanted to Showcase um I guess most work previously in biology usually focuses on fine tuning but we sort of thought that was um uh I don't know taking an off-the-shelf natural language model and trying to fine-tune it to uh be useful in the biological context um I mean that's not necessarily because I think these models still also trained from scratch but I think I think was more so that previous models we observed didn't F didn't focus on generalizability and so they typically would always find tune just because our hunches and what we observe in mini models is that they didn't the zero shot performance wasn't that compelling um and so they they weren't practical for for many applications and so in this particular work because we had scale and because we were focusing on generalizability we wanted to see how well it could generalize zero shot across different modalities in biology um that that being said we did have some fine-tuning results which in in many of the tasks were also very strong quite competitive if not more competitive than the zero shot results we we observed before okay awesome awesome uh talk a little bit about where you see it all going what's next on your research agenda uh in this particular uh line of work and kind of how do you see it evolving more broadly so this is a something we been thinking a lot about um so we're we're certainly interested in both you know biological perspective and a machine learning perspective how to improve Pro and so from a machine learning side we absolutely do think scale is going to be really important that's why we did a set of DNA scaling laws to see do you get some kind of gain observed That You observe in natural language also in DNA and and indeed we do see this similar trajectory for DNA which is something we're going to focus on um from the biological perspective um trying to model more complex organisms is another key area so mammals and humans we definitely want to build Foundation models there because the application for human health and Therapeutics is just huge if we can be able to have a language 
model that understands the grammar of our own GNA as well as we've learned procaryotes or with microbial life I think that the application space is is far more than what we're aware of right now I'd say and so in general my my sort of philosophy or thinking or hope is that in similar ways that we've seen folks like open AI uh merge modalities in language and vision and gave us things like Sora and Dolly I think there's a potentially bigger opportunity in biology um to merge modalities into into designing sequences or Therapeutics or different drugs because there's so many different things in in uh assays or uh types of sequences that try to measure similar things in our body and so being able to merge those signals to design or or predict useful things in DNA or other biological sequences I think the potential is is far greater and potentially far more impactful and does uh do you see the hyena architecture as uh getting you there in the same way that Transformers are uh applied to multimodality or uh will it need significant Evolution to accommodate that uh I mean there's always room for improvement I'd say I think it's a place where we've put our bets because in a resource constrained World we've gotten good results in terms of longer sequences but is it enough to for example have a single ginormous biological Foundation model that will solve biology no it's not enough to get us there and so I do think there's I do think there's room for for Innovation on uh many many fronts including uh architectures down the road awesome awesome well well Eric thanks so much for taking the time to share with us a bit about what you're working on and hyena and hyena DNA and Evo in particular it was a great conversation really appreciate it awesome super fun really fun to be here thanks Samall right everyone welcome to another episode of the twiml AI podcast I am your host Sam charington and today I'm excited to be joined by Eric Gan Eric is a PhD student at Stanford University before we get going be sure to take a moment to hit that subscribe button wherever you're listening to Today's Show Eric welcome to the podcast thanks for having me Sam really excited to be here I'm excited for our conversation as well we're going to be talking about your research into long sequence Foundation models and their application in biology in particular hyena and hyena DNA and most recently EVO to get us started I'd love to have you share a little bit about your background and research interests sure so I am in the bioengineering department at Stanford but my background is mostly in the machine learning site I started off in computer vision and started working with State space models and seeing if they can be used for some of these vision and language domains uh and then started moving on to see if we can extrapolate them and use them for biology applications in particular for their long context purposes and so I've been exploring these applications of hyena in particular which is a convolutional based language model to the domain of biology in particular DNA you know let's start by talking a little bit about hyena and some of the motivations for that architecture in addressing longer context lengths for language modeling and other sequence models tell us about the Big Challenge there in the motivation so our lab has been interested in this long sequence task for quite some number of years and it started with Albert goo who was interested in in state space models at the time and so initially their work in our lab was 
applying these models to continuous signals typically so like time series and audio and so when language models started to become really popular we too were interested in applying it to language but we noticed there was a gap in applying some of these early State space models and so the hyen architecture was really focused as an offshoot of these State space models to see if we can get it to work on language and so we focused on getting it to work on these discret signals that you know we think was language so Michael ply the lead author in architecture and Stephan maseri the other co-author on the hyena paper we started focusing on what's the Gap uh and getting this Gap to to close between Transformers and so that was initial work that they focused on and it's when we started getting excited we were able to sort of match similar quality but then explore this component which was being able to fit longer sequences um and so that's what we got started getting excited about your applications so the obvious question is Transformers have proven to work very well why the need for an alternative what is the challenge that is presented with the Transformer architecture in dealing with longer sequences yeah so you know at the time when we were working on it um this was still in the age where not as many folks were thinking about context particular and so um there's this key constraint with Transformers which uh as the sequence length grows the time complexity right the algorithmic efficiency of the operation has this squaring uh law that grows and so when you double the sequence length the computational requirement quadruples and so when the sequence gets very long uh if you care about things like DNA or you know entire code bases then it becomes intractable for these models it can become attractable for you know these classic Transformer models and so the the motivation is is there you know a different type of operation or primitive that could be hopefully as expressive but will reduce that complexity in computation for these longer sequences and you mentioned one of the early attempts to try to address that in particular with State space models I refer folks to my interview with Dan Fu from actually just about exactly a year ago language modeling with State space models was the title of that episode and so this work Builds on some of that state space work does it retain some of the kind of hierarchical approach that made that that made the state space work yeah it it uses a a good portion of it in terms of skeleton so the differences that we took from that paper with Dan Fu on state space models for languages was how we parameterize the convolutional kernel so a key way to make convolutions work for long sequences in particular is to move on from this like sort of explicit parameterization of kernels to an implicit version and so that allows you to do is essentially have a a parameter efficient way to use convolutions and so the state space model would use State space models for this kernel and then the original H work we would use we sort of generalized it to just use an MLP so an MLP that would parameterize the filters for a convolution kernel and so the thinking was to have an unconstrained version that can learn perhaps a more expressive kernel for the convolution and your research group published a really interesting blog post last year talking about this scope of research from Deep to long learning was the is the post that I'm referring to and in that post there's a really interesting 
graphic and we'll link to all of this in the show notes but one of the elements discussed in that post and the Centric to this graphic is the idea of the fft as kind of a new primitive that's driving this alternative family of models you know talk about the role of the fft in hyena so for for hyena we rely on long convolutions and so typical convolutions they'll have uh short filters right and so the time complexity of having a kernel stride across an input is still basically a linear operation but when you have a convolutional kernel like hyena the key difference is that we're going to have have a filter the same length as the input so a global convolution um now the challenge is the time complexity will increase so you you essentially get back uh a Time complexity that's similar to a tension like a n squ time complexity the coral is the same length as the input and so you lose that benefit of a lower time complexity if you just applied a convolution in this standard way that people think of in the space domain instead we'll take advantage of a convolutional the right from for folks famili with signal processing which states that a convolution done through the 4A domain is equivalent as a element wise multiplication in that space to its convolution in the space domain and so what we'll do is we'll use a fft fast for transform to essentially convert the input in kernel to the frequency domain do an element wise multiplication and then bring it back to the space domain and that whole operation allows us to reduce time collect from an N squ to an N log n operation which is for long sequences near linear MH and do you retain the benefit of being able to accelerate using gpus yeah that's a good question so in general gpus are a lot more optimized for matrix multiplication and so there's probably some room for improvement in terms of fully utilizing the hardware but in general we can get you know fairly efficient convolutions using gpus and that's that's what we do for language model training for haa now there's follow-on work from Dan Fu which I also helped out on okay called flash fft com which flash fft work exactly bringing the sophistication from Hardware wear algorithm design from flash attention to the fft and so that was really exciting work as well H and going from State space modeling to the long sequence convolution do you lose some um explainability or mechanistic understanding uh in doing that the you know having a state space model that describes explicitly how the state evolves from from one state to the next seems like it would have some of those benefits is that true at all and and do you kind of abstract away from some of that with a convolutional approach yeah that's a good question I haven't focused too much on the explainability of you know what these State space models are learning in terms of their um State space representations so could be but I think what we've focused on more so is using sort of attention Matrix like uh view of the convolutions so even though we use convolutions we can also still produce an attention map you know quote unquote which allows us to essentially see what inputs are being attended to but but more so in that convolutional like what's lit up right in terms of attributing weights essentially so that's not been a constraint so far for interpretability talk about how far you've pushed the hyena model in terms of input size or or context length yeah so in the initial work we we start off with 64,000 tokens for the high hierarchy paper on synthetic 
tasks and then in a follow on work we applied it to to DNA sequences uh me that um is is constant need of having longrange dependency modeling when we created a model called hyena DNA which we scaled up to 1 million token context which at the time was the the longest language model context uh since then lots of folks have caught on and found this this kind of work also interesting and also get Transformers to work as well and so uh with long context so interesting to see the the work evolv or the space of all can you talk a little bit about um the the landscape of long contexts enable on the Transformer side what are the approaches that folks are taking there and what's working do you have a sense for what Gemini is doing uh to you know get at 1 million 2 million uh context windows and how those approaches will fundamentally differ from um swapping out Transformers for convolutions yeah so I I haven't so I think I've heard some rumors about how Gemini might do it but I yeah I actually don't have a a strong idea myself there's you know talk of using things like ring attention or some kind of sliding window attention where it's attending to a local uh window and then having a hierarchy where these local windows can get aggregated into more meaningful uh groups so I yeah I I'm not sure actually but um I think I suppose I yeah I think there's a lot more that could be explored I think the the sort of my takeaway is that folks have focused a lot on the memory reduction of using attenion for long context but there's sort of no way to get around or at least that I'm aware of of reducing the time complexity it's still going to be an N byn type of operation or N squared operation um and so you can you know reduce the hardware requirement but then the amount of time is still going to be pretty at least with the current setups I've been seeing and so I think that's sort of motivated folks to look at these alternative architectures um if they C if they do care about long contexts including you know Mamba obviously very popular um and you know we're sort of placing our bets on a slightly different approach with with hyena um variants but um yeah in general I think quality of these long context model with a Transformer or alternative architecture seems to be something that people are still trying to improve like you can fit it in context but does it you know generate high quality sequences throughout the long context for example um still an open question something we're observing and is hyena comparable to Transformers from a memory utilization perspective or well I guess the implication is that it's it's much more efficient yeah so in terms of you know if we compare to vanilla Transformers van vanilla tension algorithm yeah we're definitely a reduction in terms of uh memory and time the I guess bar is raised when you look at flash attention you know a very clever sophisticated algorithm that's essentially learned how to uh linearize the memory by doing it in blocks right and so I think that is a hard bar like attention super fast and I so I think there's still room for improvement to get convolutional based models to that degree of sophistication but in certain yeah so the short answer is depends on which settings you care about there's certain very long sequences that you still take very long on flash tension or Transformers which it could be faster on the convolutions but the memory side the full benefit is not there yet because there probably have some more room in terms of the systems optimization 
And you alluded to this a moment ago: in training Hyena, the training corpus was synthetic language as opposed to a traditional training data set. Am I remembering that correctly, and what does that mean?

So in the original Hyena Hierarchy paper we trained on language, I believe at a 2K context, on the Pile, so a pre-training perplexity comparison with Transformers, and that was essentially stacking up favorably for Hyena. Then, to push specifically on long context, at the time there weren't a lot of benchmarks for analyzing long context, or at least not ones we thought were very useful, so we came up with, or used, synthetic tasks focused on a specific capability, which is recall: being able to look up portions of the sequence, sort of like a key-value dictionary lookup, looking up values earlier in the sequence. We used a simple synthetic task of key-value pairs, letters and numbers essentially, just to see if we could do this simple task over long range. That was sort of a prototype for even fitting long context, but it really motivated us to look for useful applications, and that segued into the HyenaDNA work.

Got it. So let's jump into the HyenaDNA work. How did you identify biology as an application for this problem?

Part of it was that I was always looking for an impactful application in healthcare or biology when I started my PhD, waiting for the right opportunity or right technology to dig in deep. I sat in a number of different labs on campus at Stanford that work at the intersection of biology and computer science, and long sequences and long-range dependencies kept coming up in the world of DNA, where people can model it just like a language, like natural language, in the sense that you have letters, you have a vocabulary, you can feed them into language models, and then you can ideally learn all the interactions and grammar associated with that language to do useful tasks like prediction, or perhaps generation as well.

Can you talk about how the long-range dependencies manifest from a biological perspective, and how does that differ from natural language?

Sure. I think the really neat thing that makes DNA an interesting problem area or application for machine learning is that it physically has these long-range dependencies. In DNA you have this sequence of letters; they're actually chemical compounds, but we can represent them as letters, the four bases A, C, G, and T. In general, when you're modeling DNA, the processes you care about are essentially things binding to DNA and kickstarting the process of creating proteins. The process of having something bind to DNA can be influenced by what we call motifs present in the DNA, which essentially alter the probability of things binding or not. The interesting thing for long-range dependencies is that some of these interactions can occur over millions of base pairs away from each other, and so having a model that can identify the presence of certain motifs, and how they affect downstream binding of certain molecules, can mean the difference between someone expressing some kind of disease or altering their functions or traits. So it's really desirable to capture these long-range dependencies within a single model.
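For readers curious about the synthetic recall task mentioned a moment ago, here is an assumed minimal sketch of a key-value lookup example generator: letters as keys, digits as values, and a query key at the end whose value the model must recall. The exact format used in the Hyena papers may differ; this only illustrates the idea.

```python
import random
import string

def make_recall_example(num_pairs: int, seed: int | None = None) -> tuple[str, str]:
    """Build one associative-recall example.

    The prompt lists key-value pairs (letter -> digit) and then repeats one key;
    the target is the digit originally paired with that key.
    """
    rng = random.Random(seed)
    keys = rng.sample(string.ascii_lowercase, k=min(num_pairs, 26))
    pairs = [(k, str(rng.randint(0, 9))) for k in keys]
    query_key, answer = rng.choice(pairs)
    prompt = " ".join(f"{k} {v}" for k, v in pairs) + f" {query_key}"
    return prompt, answer

prompt, answer = make_recall_example(num_pairs=8, seed=0)
print(prompt)   # key-value pairs followed by the queried key
print(answer)   # the digit that followed the queried key
```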
So can you talk a little bit about how one might use a long-range genomic foundation model, and how that differs from the way someone might use something like AlphaFold? Can you compare and contrast those approaches?

Sure. AlphaFold, AlphaFold 3 in particular, is a very powerful, very successful model, and probably has had the most impact of machine learning in biology so far. Now, as powerful as it is, it's a supervised task that is specifically designed to predict structure: given a sequence of amino acids, and now more recently including DNA, it predicts the 3D shape of that sequence, because that sequence has a 3D-shape analog in physical space. Because AlphaFold is specifically a supervised task, it can be difficult to modify it for different purposes. The nice thing about language models and foundation models is that they're meant to be general learned-representation machines that can be adapted to a number of different downstream tasks, which can be more difficult, or perhaps more challenging, for an AlphaFold-style model that's more restricted. Some of those tasks, for a biological foundation model that's language based, include prediction tasks: say you have a sequence, what's its function, its effect on gene expression, or how much of a certain protein is this particular sequence expected to make? And then all the way to the generative side as well: these language models for natural language can generate lots of text and answer prompts and questions, and for DNA, or even proteins, what you can imagine, and what they're currently being used for, is designing novel drugs, novel molecules, proteins, or systems, just from the sequence space, with these models.

Got it. Can you talk a little bit about how the model is trained? The human genome has, what, 3.2 billion nucleotides, which is quite a bit less than the corpus size for a state-of-the-art language model nowadays, but the language is a lot simpler in terms of the number of tokens. What's the training process, and what are the elements that come into play?

Yeah, so you're right. Our initial work starts off with training on the human genome, which is made up of about three billion base pairs, or nucleotides, and the vocabulary depends on how you want to tokenize, but in our case we use the simple vocabulary of just the four bases, so you can think of it as the byte level, and that can make things challenging in a number of ways. For our initial use case, we train by taking an arbitrarily long sequence from the human genome and doing next-token prediction, or next-nucleotide prediction, in a causal style, and we do this essentially billions of times, just like natural language. And that is a challenge for scale: initially this could be seen as a first attempt or prototype focused on the human genome, but more desirably, in the follow-on work for Evo in particular, we grabbed millions of genomes across millions of species to train a much larger DNA foundation model across different species. The good thing about DNA is that it's one of the most common, most widely available sources of biological data out there, so there are definitely trillions of tokens in this domain that we can reach for to train a foundation model.
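As a hedged sketch of the training setup just described, single-nucleotide tokenization plus next-nucleotide (causal) prediction, here is a generic recipe in PyTorch; the vocabulary, helper names, and the placeholder model are assumptions for illustration, not the HyenaDNA or Evo code.

```python
import torch
import torch.nn.functional as F

# Single-nucleotide vocabulary: each base is its own token (plus an "N" for unknown bases).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def tokenize(seq: str) -> torch.Tensor:
    """Map a DNA string to a tensor of token ids, one id per nucleotide."""
    return torch.tensor([VOCAB[base] for base in seq.upper()], dtype=torch.long)

def next_token_loss(model: torch.nn.Module, ids: torch.Tensor) -> torch.Tensor:
    """Causal (next-nucleotide) prediction loss on one batch of shape (batch, length)."""
    inputs, targets = ids[:, :-1], ids[:, 1:]          # predict base t+1 from bases <= t
    logits = model(inputs)                             # (batch, length-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Usage sketch with a random window from a genome string (toy stand-in data):
genome = "ACGT" * 4096                                 # stand-in for a real chromosome
window = tokenize(genome[:8192]).unsqueeze(0)          # (1, 8192) token ids
# loss = next_token_loss(my_dna_model, window)         # my_dna_model is an assumed model
```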
Let's maybe take a step back and have you introduce Evo. Is it simply this idea of taking the HyenaDNA approach and architecture and incorporating additional species, or did the architecture itself evolve to enable you to do that?

So Evo is our next stage in training large DNA foundation models, taking some of the learnings we had from HyenaDNA and scaling it much larger. HyenaDNA was about a 7 million parameter model, which for language models is very small, and Evo is a 7 billion parameter model, so about a thousand times bigger. To get that to scale effectively we did make some architecture improvements in this work. A couple of key ones: it's a hybrid model, so we actually reintroduce some attention layers mixed in with these Hyena convolutional layers, and the thinking there is to get the best of both worlds, where convolutions give you the efficiency gain and attention has somewhat stronger abilities in terms of recall, so we take advantage of both. We also have some modifications in terms of how we parameterize the convolution kernels themselves, and we changed some of the MLP blocks as well, but the overall training style and the core of this convolutional language model from Hyena is in large part used in Evo as well.

Okay, and you talked about the Evo data set spanning many species. Do you retain the human species from HyenaDNA and then incorporate the phage and prokaryotic sequences, or is it only the latter? I'm trying to wrap my head around whether it makes sense to have all of that in one model.

Yeah, that's a really good question. We actually made the strategic choice to focus on just prokaryotic and phage genomes, which, for folks not familiar with that biology, is essentially microbial life. The thinking there was that we would focus on these genomes because they're shorter, they're simpler life forms, and the language, the grammar, of their DNA follows clearer rules than that of humans or mammals, for example. Humans and mammals are quite complex organisms, as we can imagine, and so a large model trained on humans and mammals, or what we call eukaryotes, is a level of complexity that we're working our way up to; in follow-on work, Evo 2, that's what we plan to do next. But we wanted to focus on simpler life forms first, because the thinking before we started working on Evo was that we were unsure whether a large DNA model could work in general. We weren't sure whether some of the tasks we were going to go after could actually work, and specifically, being able to generate and design DNA was a key task, so we wanted to start with simpler organisms first.
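Returning to the hybrid architecture mentioned earlier in this exchange, a stack that is mostly long-convolution layers with attention layers interleaved, here is a highly simplified, assumed sketch of that interleaving. It reuses the FFT long-convolution idea from the earlier snippet and is not the Evo architecture, which also changes the kernel parameterization and MLP blocks.

```python
import torch
import torch.nn as nn

class LongConvBlock(nn.Module):
    """Toy global-convolution mixing layer: one learned filter per channel, applied via FFT."""
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.filters = nn.Parameter(torch.randn(dim, max_len) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, length, dim)
        L = x.shape[1]
        n = 2 * L
        x_f = torch.fft.rfft(x.transpose(1, 2), n=n)        # (batch, dim, n//2 + 1)
        k_f = torch.fft.rfft(self.filters[:, :L], n=n)      # (dim, n//2 + 1)
        y = torch.fft.irfft(x_f * k_f, n=n)[..., :L]
        return x + y.transpose(1, 2)                        # residual connection

class AttentionBlock(nn.Module):
    """Standard self-attention mixing layer, kept for its recall ability."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

def build_hybrid_stack(dim: int, max_len: int, depth: int, attn_every: int = 4) -> nn.Sequential:
    """Mostly convolution blocks, with an attention block interleaved every few layers."""
    layers = [AttentionBlock(dim) if (i + 1) % attn_every == 0 else LongConvBlock(dim, max_len)
              for i in range(depth)]
    return nn.Sequential(*layers)

stack = build_hybrid_stack(dim=64, max_len=4096, depth=8)
y = stack(torch.randn(2, 1024, 64))     # (2, 1024, 64)
```

The interleaving ratio here is arbitrary; the point is only that efficiency-oriented convolution layers and recall-oriented attention layers can coexist in one stack.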
Dig into that task in a little bit more detail, in terms of generating and designing DNA. What are some examples of reasons why you might want to do that, and ways it might be applied?

Sure. One of the initial questions before we started the Evo project was: is it possible to have a language model, or foundation model, design an entire genome from scratch? The motivation was that a lot of previous work, and a lot of research in general, allows us to write DNA, read DNA, and edit any sequence of DNA, but to design DNA we still didn't know enough about how DNA works. So we wanted to use language models to see if we could help design DNA through a foundation model. And why would you want to do that, if you could? The thinking was that if we could design an entire genome, then we could perhaps control the function of these genomes as well. Some of the ideas were: can we target cancer cells if we can design organisms to have that function, can we make better biofuels, can we make antibiotics? These were some of the moonshots we thought this could be used for, for biomanufacturing, climate, or therapeutic purposes, for example.

And now that Transformer-based approaches are accommodating longer context lengths, could those be as easily applied to these problems? We've seen attempts to apply them to these types of problems. Is it simply, quote unquote, a choice of training corpus and going through the motions, or are there unique adaptations of this approach that lend themselves to the biological use case?

That's a good question; we got this question a lot, and it depends. People have different answers, but having trained these DNA models for the past year and a half, I certainly have my biases and my opinions. I do believe that convolution-based models have an inherent inductive-bias advantage over attention models, and I think some of the reasons are that DNA is not natural language. For one, natural language is a very dense, rich type of sequence, type of data distribution; in DNA the information is far more sparse and, over long range, noisier. Among biologists there's a lot of talk about junk DNA in your genome, where people don't know what it does, or it might have no function at all. What we observed, and what we hypothesize, is that these convolutional models are better at filtering out this noise and parsing the signal over long range than attention, perhaps because convolutions come from signal processing and are built to filter information really well, to use that analogy. There are other things about modeling DNA that make it challenging, like being able to tokenize at the single-character level, the single-nucleotide level. In general, across a number of different works that have tried to model language at the character level, Transformers usually underperform, and they've not been able to use long sequences, for example. And it's particularly important to have single-character tokenization for DNA, because a single character change could mean the difference between having a disease or not, so you want that sensitivity and resolution at that level, which, for Transformers, I haven't seen; I don't think anyone has seen a successful large-scale language model tokenized at the single-character level yet.

Okay. One of the properties characteristic of Transformer-based models is hallucination, of course. Is there anything inherent to the Hyena models that yields different characteristics in terms of propensity to hallucinate?

Ah, that's interesting. I don't think I've played around with natural language enough to be able to identify human-interpretable hallucinations, I suppose, but in general it should be susceptible to hallucinations, I imagine, just like Transformers.

I guess maybe another way to ask the question is: is hallucination a function of the architecture, more so than the probabilistic nature of the approach in general, or the training approach? If you have a take on whether approaches like Hyena are for some reason less susceptible to it. And I think the context is that hallucination in language can be bad; hallucination in DNA generation sounds scary.
Yeah, fair enough. I'd say that these systems are probabilistic machines: you're teaching them the probability of the next token, and because it's probabilistic, these probabilistic errors can accumulate and then hallucinations can abound. I don't think that's going to change with the architecture, or the layer that we swap in; it's still fundamentally next-token prediction with a probabilistic mechanism. But you're right, in terms of biology this could be scary, which is one way to think of it, but it could also be a source of variability, or a kind of controlled evolution, which is usually the analogy folks in machine learning use to describe these generative models in biology.

You're saying it's an opportunity for creativity, and to use it for the generative task that you envision. Is that where you're going?

Yeah, exactly. At the same time, it's still not a deterministic thing, right? Because if it were, you wouldn't have much variety. So it could also be viewed as a feature, not necessarily a scary thing. And to overcome that fear, we definitely filter, in the same way that in language you would want to filter for good quality generations. Luckily, in biology there are a lot more heuristic software tools to filter for high quality sequences, which perhaps appeases some of these concerns.

In the original Hyena work we talked a little bit about the synthetic tasks, and you just mentioned some of the existing tools. Beyond taking advantage of existing biological tools on the filtering side of things, do you also use existing simulation tools and synthetic data generation tools? Is that a part of what you've done with HyenaDNA or Evo?

Yeah, that's a good question. We haven't done so much on the synthetic side, actually, but that is something we're interested in, in particular to see if we can create some synthetic tasks for long-range abilities; I think that's an ongoing search for ideal tasks on that front. But in biology there seem to be so many different applications we could try that we're trying to be thoughtful in choosing the most relevant or impactful tasks. There isn't a shortage of tasks, is what I mean, so we haven't had to look for synthetic ones; it's more about prioritizing which biologically meaningful tasks to go after first.
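On the filtering point above: the specific tools used to screen generations aren't named here, so the following is a purely hypothetical illustration of the kind of cheap heuristic pre-filter one might run on generated DNA before applying proper bioinformatics tooling; the thresholds and checks are assumptions, not Evo's actual pipeline.

```python
def passes_basic_dna_filters(seq: str,
                             min_len: int = 200,
                             gc_range: tuple[float, float] = (0.25, 0.75)) -> bool:
    """Cheap sanity checks on a generated DNA sequence (illustrative only).

    - only canonical bases (no ambiguity codes or stray characters)
    - a minimum length
    - GC content inside a loose plausible range
    """
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    if any(base not in "ACGT" for base in seq):
        return False
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_range[0] <= gc <= gc_range[1]

generations = ["ACGT" * 100, "ACGTX" * 100, "GGGG" * 100]   # toy examples
kept = [s for s in generations if passes_basic_dna_filters(s)]
# Only the first toy sequence survives: the second has an invalid character,
# the third has 100% GC content.
```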
And on the note of tasks, folks listening may be familiar with CRISPR, which is a gene editing tool that's become popular over the past several years. Is there a direct application of this work to CRISPR?

Exactly, and the thinking there is that normally, let's say, a drug or pharma company might look for existing molecules in nature by mining a bunch of genomes, or DNA, from different microbial life, and what they'll do is take the exact copy: they'll just look for it and take that exact copy. What we want to do instead is learn what's out there and then, conceptually, the way I think about it, interpolate between what's been seen before to hopefully design something that doesn't exist in nature, something new with a desired function, something that hopefully works better or is more efficient. In our case, for example, that could mean gene editing tools that work on different modalities, so maybe not just DNA but RNA, different types of biological sequences, or different levels of on- and off-target activity: does it cut DNA where you want it, versus cutting DNA in areas you don't want it? Being able to engineer and design these new tools could open up new therapies, for example, and that's why we're interested in finding new ones.

And to make sure I'm understanding this correctly: we talk about CRISPR as a tool, and a family of tools, and kind of cutting DNA in places, but it's not like it's using a laser to cut the DNA. It's a chemical sequence, a particular chemical sequence in a family of chemical sequences, and you could use generation here to identify chemical sequences within the distribution of that family that may have different properties.

Exactly, yeah. So, stepping back to give folks a sense of what these CRISPR-Cas systems are: in general, they're defense systems. They're sequences of RNA and proteins, and bacteria will have these sequences in their own genome and use them as a defense mechanism against viruses. When a virus tries to come in and inject its own viral DNA into a bacterium, this CRISPR system will recognize a pattern in that DNA and then essentially cut up the DNA and render it innocuous. So it's a defense system that will both identify DNA and then cut DNA, and as humans we've sort of hijacked that system to use it as a tool for gene editing, and as a therapy. Specifically, it's made up of RNA and proteins: the RNA is used to recognize where to cut, and the protein is the thing that does the cutting. It's this complex of RNA and proteins that we're trying to generate, because it's all encoded in DNA as well.

Talk about evaluation in the case of both HyenaDNA and Evo: evaluation criteria, benchmarks, performance. What have you seen, and what are you comparing against?

Yeah, so I'll focus on Evo; it's most top of mind and most relevant in this case. For the benchmarks we look at, our tasks are broken into two areas: the predictive tasks and the generative tasks. The predictive tasks are the more benchmark-oriented ones that machine learning folks are used to seeing: given a sequence, can the model predict some kind of function or fitness score associated with that sequence? For example, for a protein sequence, a lab will have a data set that measures some kind of fitness score associated with that sequence, meaning its solubility, or stability, or binding affinity, metrics related to its drug-like properties. What we do in the Evo work is see whether the model is sensitive to mutations or perturbations that we introduce, and whether that changes the fitness score in some meaningful way. So we'll take, for example, a protein sequence, add a mutation into a portion of it, and see if that mutation is picked up, in terms of how the outputs of the model correlate with the fitness score for that molecule. Specifically, the data set we apply this to is called ProteinGym, and the nice thing about this task setup is that we can probe the model to see if it has learned useful features zero shot, without fine-tuning.
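Here is a hedged sketch of that zero-shot scoring recipe: score each mutant by the shift in its log-likelihood under the language model relative to the wild type, then compute the Spearman correlation of those scores with measured fitness. The helper names are assumptions for illustration, and the actual Evo/ProteinGym evaluation details may differ.

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

def sequence_log_likelihood(model: torch.nn.Module, ids: torch.Tensor) -> float:
    """Mean per-token log-likelihood of one tokenized sequence of shape (1, length)."""
    with torch.no_grad():
        logits = model(ids[:, :-1])                      # predict each next token
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp.mean().item()

def zero_shot_fitness_correlation(model, tokenize, wild_type: str,
                                  mutants: list[str], fitness: list[float]) -> float:
    """Spearman correlation between model scores and measured fitness for mutants."""
    wt_ll = sequence_log_likelihood(model, tokenize(wild_type).unsqueeze(0))
    # Score each mutant by how its likelihood shifts relative to the wild type.
    scores = [sequence_log_likelihood(model, tokenize(m).unsqueeze(0)) - wt_ll
              for m in mutants]
    return spearmanr(scores, fitness).correlation

# Usage (hypothetical model and data):
# rho = zero_shot_fitness_correlation(dna_model, tokenize, wt_seq, mutant_seqs, fitnesses)
```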
What we saw in terms of performance is that a model like Evo was comparable with state-of-the-art protein models in terms of this perturbation analysis and correlation to fitness score, which was interesting. Even though it's just competitive, or in some cases a little bit better, the more compelling part is that Evo is trained on raw, unannotated DNA sequences, whereas protein models are trained specifically on just protein sequences, which gives them a clear advantage. The main difference is that DNA contains both protein-coding and non-protein sequences, so if you train your model on the entire genome, it basically has to tease apart and figure out what's a protein and what's not. The thinking before this was that most models could not do this: you had to have one model for proteins, one model for DNA, and one model for RNA to be competitive. What we were trying to show is that you can have a single model, trained across all of DNA, that learns all three modalities.

Okay. Is there something you can say, or did I miss you saying, about relative performance to existing approaches, state-of-the-art on some number of things?

Yep. So for proteins we were competitive with state-of-the-art, about the same, and then for the RNA and DNA modalities we were able to show that Evo has a significant performance boost, state-of-the-art on these zero-shot tasks compared to other domain-specific models for DNA or RNA. We focus on those three because they're the central dogma of biology, as it's known, and so we try to have at least some benchmark on each of these domains for the predictive tasks. So: just as strong as protein models on proteins, and state-of-the-art on DNA and RNA.

Okay, and in terms of compute budget, how does this approach compare to the prior state of the art?

Yeah, so when we talk about state-of-the-art, it's more a comparison of absolute performance versus controlling for parameter size or architecture. The models do have different parameter sizes and data sets, so it's hard to make that exact comparison; it's less about architecture comparisons.

Well, maybe, how would you characterize the efficiency of these models relative to other approaches that someone trying to address these tasks might typically use?

Yeah, so, I take that back, we did compare: we do a scaling analysis of pre-training on DNA across the prevailing DNA architectures, so we compare with Transformers, Mamba, and then Hyena in its hybrid variant as well. On that front we compare performance across architectures mainly on perplexity, in terms of how they scale with perplexity, and it was interesting to see these convolution-based models being particularly strong on DNA over Transformers. Again, this alludes back to the difference between DNA and natural language, DNA being sparser in terms of information and also noisier at the same time.
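For readers less familiar with perplexity, the metric used in that scaling comparison, it is just the exponential of the average next-token cross-entropy; a small illustrative snippet (not Evo's evaluation code):

```python
import math

def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is the exponentiated average next-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy_nats)

# With a 4-letter DNA vocabulary, a model that has learned nothing beyond a
# uniform guess over A/C/G/T has cross-entropy ln(4) and perplexity 4.0;
# anything the model learns about the genome pushes perplexity below that.
print(perplexity(math.log(4)))   # 4.0, the uninformed baseline
print(perplexity(1.10))          # ~3.0, an example of a model doing better
```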
What did you observe in terms of zero-shot performance versus few-shot performance? And is fine-tuning a thing that makes sense in this context, and if so, how?

So we focused initially on zero-shot tasks in this work because we wanted to showcase generalizability. Most previous work in biology usually focuses on fine-tuning, for example taking an off-the-shelf natural language model and trying to fine-tune it to be useful in the biological context. That's not necessarily the whole story, because these models are often still trained from scratch, but it was more that the previous models we observed didn't focus on generalizability, and so they typically would always fine-tune, and our hunch, and what we observe in smaller models, is that their zero-shot performance wasn't that compelling, so they weren't practical for many applications. In this particular work, because we had scale and because we were focusing on generalizability, we wanted to see how well the model could generalize zero-shot across different modalities in biology. That being said, we did have some fine-tuning results, which on many of the tasks were also very strong, quite competitive if not more competitive than the zero-shot results we observed before.

Okay, awesome. Talk a little bit about where you see this all going. What's next on your research agenda in this particular line of work, and how do you see it evolving more broadly?

This is something we've been thinking a lot about. We're certainly interested, from both a biological perspective and a machine learning perspective, in how to improve these models. From the machine learning side, we absolutely do think scale is going to be really important; that's why we did a set of DNA scaling laws, to see whether you get the kind of gains you observe in natural language also in DNA, and indeed we do see a similar trajectory for DNA, which is something we're going to focus on. From the biological perspective, trying to model more complex organisms is another key area, so mammals and humans. We definitely want to build foundation models there, because the application for human health and therapeutics is just huge. If we can have a language model that understands the grammar of our own DNA as well as we've learned prokaryotes, or microbial life, I think the application space is far bigger than what we're aware of right now, I'd say. In general, my philosophy, or thinking, or hope is that in similar ways that we've seen folks like OpenAI merge modalities in language and vision and give us things like Sora and DALL-E, there's a potentially bigger opportunity in biology to merge modalities for designing sequences, or therapeutics, or different drugs, because there are so many different assays, or types of sequences, that try to measure similar things in our bodies. Being able to merge those signals to design or predict useful things in DNA or other biological sequences, I think the potential is far greater and potentially far more impactful.

And do you see the Hyena architecture as getting you there, in the same way that Transformers are applied to multimodality, or will it need significant evolution to accommodate that?

I mean, there's always room for improvement, I'd say. It's the place where we've put our bets, because in a resource-constrained world we've gotten good results in terms of longer sequences. But is it enough to, for example, have a single ginormous biological foundation model that will solve biology? No, it's not enough to get us there, and so I do think there's room for innovation on many fronts, including architectures, down the road.

Awesome, awesome. Well, Eric, thanks so much for taking the time to share with us a bit about what you're working on, and Hyena and HyenaDNA and Evo in particular. It was a great conversation; really appreciate it.
Awesome, super fun. Really fun to be here. Thanks, Sam.