The Future of Deep Learning Research

As artificial intelligence continues to advance, it's essential to explore approaches that could produce more capable and general machines. In this article, I'll discuss seven research directions that I believe have the potential to reshape our understanding of intelligence and its applications.

Using Evolutionary Strategies to Improve Performance

One area of research that I think holds a lot of promise is the use of evolution strategies to improve performance in machine learning systems. By leveraging these strategies, we can build machines that learn from trial and error, adapt to new situations, and generalize to unseen data. Evolution strategies have already shown competitive results as a gradient-free alternative to standard deep reinforcement learning on several benchmarks.
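To make the idea concrete, here is a minimal sketch of a basic evolution strategy: perturb the parameters with Gaussian noise, score each perturbation, and nudge the parameters toward the noise directions that scored best. The function name, hyperparameters, and toy objective are all my own illustrative choices, not from any particular paper.

```python
import random

random.seed(0)

def evolution_strategy(fitness, dim, pop_size=50, sigma=0.1, lr=0.05, iters=300):
    """Basic evolution strategy: sample Gaussian perturbations of the current
    parameters, evaluate each, and move the parameters toward the
    fitness-weighted average of the perturbation directions."""
    theta = [0.0] * dim
    for _ in range(iters):
        noises, scores = [], []
        for _ in range(pop_size):
            eps = [random.gauss(0, 1) for _ in range(dim)]
            candidate = [t + sigma * e for t, e in zip(theta, eps)]
            noises.append(eps)
            scores.append(fitness(candidate))
        # Normalize scores so the update is invariant to the fitness scale.
        mean = sum(scores) / pop_size
        std = (sum((s - mean) ** 2 for s in scores) / pop_size) ** 0.5 or 1.0
        advantages = [(s - mean) / std for s in scores]
        for i in range(dim):
            grad = sum(a * n[i] for a, n in zip(advantages, noises)) / (pop_size * sigma)
            theta[i] += lr * grad
    return theta

# Toy objective: maximize -(x - 3)^2 - (y + 1)^2, whose optimum is at (3, -1).
best = evolution_strategy(lambda p: -(p[0] - 3) ** 2 - (p[1] + 1) ** 2, dim=2)
```

No gradients are ever computed here; the population of noisy evaluations plays the role that backpropagation plays in gradient-based training.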

Reinforcement Learning: The Key to Unlocking Intelligence

Reinforcement learning is a type of machine learning in which an agent learns to take actions in an environment so as to maximize a reward signal. This approach has proven particularly effective in areas such as robotics and game playing, where the agent must navigate complex environments and make decisions in real time. With reinforcement learning, we can build machines that learn from their mistakes and adapt to new situations.
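The reward-maximization loop can be shown with tabular Q-learning on a toy problem. The corridor environment, state count, and hyperparameters below are all invented for illustration; this is a sketch of the general recipe, not a specific system from the article.

```python
import random

random.seed(1)

# Tabular Q-learning on a tiny corridor: states 0..4, actions 0 (left), 1 (right).
# Reaching state 4 pays reward +1 and ends the episode.
N_STATES = 5
ACTIONS = (0, 1)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

alpha, gamma, epsilon = 0.5, 0.9, 0.2
for _ in range(500):
    s = random.randrange(N_STATES - 1)   # random start states speed up learning
    done = False
    while not done:
        # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        nxt, r, done = step(s, a)
        target = r if done else r + gamma * max(q[(nxt, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])
        s = nxt

# The learned greedy policy should be "always move right" toward the reward.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
```

The agent is never told what the right action is; it discovers the policy purely from the reward signal, which is exactly the "learning from mistakes" the paragraph describes.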

The Limitations of Current Hardware

Another area of research that I believe holds a lot of promise is the development of more advanced hardware for machine learning systems. Traditional computers process information largely serially, executing one instruction after another on transistor-based logic gates. This approach has limitations, particularly when the goal is machines that can think and learn like humans. To overcome these limitations, researchers are exploring neuromorphic chips, which mimic the structure and function of the human brain.

A New Approach: Neural Networks as Hardware

Neural networks have revolutionized the field of machine learning, but they are limited by their reliance on traditional computing hardware. By rethinking neural networks as hardware systems rather than software programs, we may be able to build machines that process information in parallel at previously unattainable speeds. This approach has already shown impressive results, with researchers developing chips that mimic the behavior of large neural networks.

The Importance of Multi-Agent Systems

Multi-agent systems involve multiple agents interacting with each other in complex environments. By creating simulated environments that encourage cooperation and communication between agents, we may be able to build machines that learn from each other and adapt to new situations. This approach has already shown strong results in areas such as robotics and team-based game playing.
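A hypothetical toy example of emergent cooperation: two independent learners repeatedly play a coordination game in which both are rewarded only when their choices match. Neither agent can see the other's values, yet a shared convention emerges. Everything here (the game, the learning rule, the parameters) is my own minimal illustration.

```python
import random

random.seed(0)

# Two independent learners play a repeated coordination game: each picks
# action 0 or 1, and both receive reward 1 only when their choices match.
ACTIONS = (0, 1)
q = [{a: 0.0 for a in ACTIONS}, {a: 0.0 for a in ACTIONS}]  # one table per agent
alpha, epsilon = 0.1, 0.2

for _ in range(2000):
    choices = []
    for agent in range(2):
        if random.random() < epsilon:
            choices.append(random.choice(ACTIONS))        # explore
        else:
            choices.append(max(ACTIONS, key=lambda a: q[agent][a]))  # exploit
    reward = 1.0 if choices[0] == choices[1] else 0.0
    for agent in range(2):
        a = choices[agent]
        q[agent][a] += alpha * (reward - q[agent][a])      # running-average update

# Both agents should settle on the same convention (both 0 or both 1).
conventions = [max(ACTIONS, key=lambda a: q[agent][a]) for agent in range(2)]
```

No agent is told which action to prefer; the convention is a property of the interaction, which is the core appeal of multi-agent training environments.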

The Cognitive Toolkit: A New Framework for Intelligence

Recent research has highlighted the importance of understanding intelligence as a cognitive toolkit, comprising multiple aspects such as attention, working memory, long-term memory, knowledge representation, emotions, and consciousness. By creating simulated environments that incentivize the emergence of this toolkit, we may be able to create machines that are capable of thinking and learning like humans.

The Exploration-Exploitation Dilemma

One of the biggest challenges in AI research is the exploration-exploitation dilemma: agents must balance the drive to explore new options against the need to exploit existing knowledge. By deliberately allocating a larger share of an agent's budget to exploration, we may build machines that discover better solutions instead of settling for the first ones they find.
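The trade-off is easiest to see in a multi-armed bandit. Below is a minimal epsilon-greedy sketch (the arm probabilities and parameter values are invented for illustration): an agent that never explores can lock onto a mediocre arm forever, while a small exploration budget lets it find the best one.

```python
import random

random.seed(42)

# A 3-armed bandit: each arm pays out 1 with a different hidden probability.
ARM_PROBS = [0.2, 0.5, 0.8]

def run_bandit(epsilon, pulls=5000):
    """Epsilon-greedy: with probability epsilon pull a random arm (explore),
    otherwise pull the arm with the best observed average payout (exploit)."""
    counts = [0] * len(ARM_PROBS)
    values = [0.0] * len(ARM_PROBS)   # running average reward per arm
    total = 0.0
    for _ in range(pulls):
        if random.random() < epsilon:
            arm = random.randrange(len(ARM_PROBS))                      # explore
        else:
            arm = max(range(len(ARM_PROBS)), key=lambda i: values[i])   # exploit
        reward = 1.0 if random.random() < ARM_PROBS[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total / pulls

never_explores = run_bandit(epsilon=0.0)     # can get stuck on a bad arm forever
sometimes_explores = run_bandit(epsilon=0.1)
```

With `epsilon=0.0` the agent keeps pulling the first arm it tries; with `epsilon=0.1` it pays a small ongoing exploration cost but reliably finds the 0.8 arm, ending up with a much higher average reward.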

The Future of AI Research

In conclusion, I believe that these seven research directions hold a lot of promise for advancing our understanding of intelligence and its applications. By exploring new approaches to machine learning, developing more advanced hardware, creating multi-agent systems, and understanding intelligence as a cognitive toolkit, we may be able to create machines that are capable of thinking and learning like humans. As researchers, it's essential that we continue to explore new ideas and push the boundaries of what is possible in AI research.

As Andrew Ng has argued, the future of AI is not just about the technology itself but about how it is used to improve human life. By continuing to invest in AI research and development, we can create machines capable of addressing some of humanity's most pressing problems. The possibilities are endless, and I'm excited to see where this journey takes us.


Appendix: Talk Transcript

Hello world, it's Siraj, and the topic for today is: what is the future of deep learning? Inspired by something Geoffrey Hinton recently said, this talk is divided into three parts. I'm going to cover how backpropagation works, where the most popular deep learning algorithms stand right now, and finally seven research directions that I have personally hand-picked.

This whole video was inspired by what Geoffrey Hinton recently said in an article. Hinton is the godfather of neural networks; he co-authored the 1986 paper that popularized the backpropagation algorithm, which is the workhorse of almost all deep learning. Without backpropagation, the great things we're seeing in deep learning would not be possible today: self-driving cars, image classification, language translation, almost all of it. So what Hinton recently said is causing shockwaves in the deep learning community. He said he was "deeply suspicious" of backpropagation, his own algorithm, and that his view is to "throw it all away and start again." I have to say that I agree with him. I know it sounds crazy, because backpropagation has given us so much, but if we really want to get to artificial general intelligence, we've got to do something more sophisticated, or something else entirely. It's not just about stacking layers and then backpropagating an error gradient recursively. That's not going to get us to consciousness, or to systems that can learn a huge variety of tasks, everything from playing games to piloting an aircraft to working out the most complex equations in the universe. It has to be about more than gradient-based optimization. So let's get started.
Let's first talk about how backpropagation works. The billion-dollar question (probably multi-billion-dollar, actually) is this: how does the brain learn so well from sparse, unlabeled data? That's how we learn; we don't have labels for everything we're learning. Look at a baby: it's incredible how much it learns without any supervision. It learns to stack blocks, it learns to speak, and there are no labels in the sense that we use them in deep learning. It all happens unsupervised. And "sparse" means the data isn't dense and descriptive; there are a lot of zeros in it, and yet we can still learn from it.

So how does backpropagation work? First of all, a neural network is a huge composite function: a function consisting of other functions, arranged as layers. You've got an input layer, one or more hidden layers, and finally an output layer. You can look at training as a four-step process. Step one: receive a new observation X and a target label y. This could be an image of a cell together with the label "cancer" or "not cancer." As long as you know the label, you're golden, but you have to know the label. The input image is a series of pixels, just a big group of numbers: a vector. You take that group of numbers and go to step two: you feed it forward through the layers.
What do I mean by feeding it forward? You repeatedly take the input, multiply it by some weight values, add a bias value, and activate it by applying a non-linearity, over and over until you have an output prediction (we'll look at the code in a moment). Step three: once you have that output prediction, compare it to the real label by subtracting one from the other. Because these are all numerical values, that difference is your error. Step four: back-propagate that error. You compute the partial derivative of the error with respect to each weight, recursively for every layer: compute the partial derivative with respect to the last layer, then use that error gradient to compute the partial derivative with respect to the layer before it, and so on. You end up with a set of gradient values that you use to update all the weights in your network. So the process is: input, feed forward, get the error, back-propagate, update the weights, repeat; feed forward, get the error, back-propagate, repeat, over and over, hundreds of thousands or millions of times. That is the backpropagation algorithm at a high level.

The paper I'm talking about, where Hinton introduced this, is linked right here. It's a very old paper, from 1986, and the reason Hinton is regarded as such a genius is that everybody was telling him this approach was not going to work and that he should think of something else, but he held strong to his belief. I think that's the mark of a good researcher: if you really believe in something, stick to it while still listening to other opinions. The reason it works now and didn't work in the '80s is that we now have the computing power and the data necessary for these amazing classification and generative models.

So let's look at the canonical example of a very basic neural network that uses backpropagation. (Scratch out the fourth input in the diagram; there are just three.) It's a three-layer network: input, hidden, and output. I'll go through this fairly fast, and if you already know how backpropagation works, skip forward about five minutes. The goal is to take some input data and predict the output label. The inputs are a series of triplets (0 0 1, 0 1 1, and so on) with an associated set of labels (0, 1, 1, 0), so for the input 0 0 1 the associated label y is 0, et cetera. Those are our inputs and outputs. We also have a non-linearity, which is a sigmoid function.
It's an activation function, and I'll talk about it in a second. We take our input data and our output data, and we want to learn the mapping function between them, so that given some new arbitrary triplet, 0 1 1 or 1 0 1, we can correctly predict the output label. The first step is to initialize our weight values, which here are two randomly initialized matrices. When we have our input triplet, we multiply it by the weights, and those weights are the matrices. This is why we say we need linear algebra in deep learning: linear algebra takes the standard operations like multiplication, division, and addition and applies them to groups of numbers called matrices. The dot product, for example, is used heavily in deep learning; it is the multiplication we use in all types of deep neural networks, a way of multiplying groups of numbers together. So we take our input, multiply it by a weight matrix, and add a bias value to the result. The bias acts as an anchor, a kind of baseline; think of y = mx + b, where the bias plays the role of the y-intercept of the function we're trying to learn. Once we've multiplied the input by the weights, added a bias, and applied the activation function (our non-linearity, the sigmoid), that gives us an output, and we repeat the same process for the next layer, and the next, however many layers we have. Then, once we've computed the error, we back-propagate it.

You don't need all of calculus to understand backpropagation; you really only need three concepts: the derivative, the partial derivative, and the chain rule. First, the derivative: it is the slope of the tangent line of a function at some point. An easy way to compute the derivative of a function like y = x² is the power rule: move the exponent down to the coefficient and subtract one from the exponent. For y = x², the 2 becomes the coefficient and the exponent drops to 1, so the derivative is 2x. The derivative tells us the rate of change: how fast a function is changing. Here is what's happening in gradient-based optimization, which is most of deep learning: if we mapped out the error produced by every possible weight value, we'd get something like a parabola. We want to find the weight values that give the minimum error, that is, the minimum of that parabola. We find it by computing the derivative, which tells us the rate of change wherever we currently are, and using it to update the weights so that we move iteratively, incrementally closer and closer to that minimum point. At that minimum the error is smallest and the weights are at their most optimal values, so the error is smallest every time we make a prediction. That's gradient descent in general, and when we apply this very popular optimization method to deep neural networks, we call it backpropagation, because we are back-propagating an error gradient across every layer.

We need partial derivatives because a neural network doesn't have just one variable; it has many. We want the partial derivative of the error with respect to each weight, meaning with respect to that one weight and none of the others. As a small example, take f(x, y) = y⁴ + 5xy. For the partial derivative with respect to x, we treat y as a constant: the y⁴ term is ignored, and differentiating 5xy with respect to x leaves 5y. For the partial derivative with respect to y, we apply the power rule to y⁴ to get 4y³, and the 5xy term contributes 5x, giving 4y³ + 5x. That's what we're computing, and it gives us our error gradient; the gradient tells us which direction to move on that parabola to reach the minimum point. Gradient descent.

The last piece is the chain rule, because a neural network is a giant composite function. Look at what I just described: take an input, multiply it by the weights, add a bias, activate. That is the function happening at every layer, and these layers are nested, so every layer you add is another function nested inside this giant composite function that is the neural network. The chain rule tells us how to differentiate a composite function: differentiate the outer function, keep the inner function, and multiply by the derivative of the inner function, applied recursively for as many nested functions as you have. That's your calculus primer for backpropagation.

The rest of this canonical example says: for 60,000 iterations, feed the input data forward through each layer. For each layer we take k0, our input, multiply it by the first synapse matrix (and by "multiply" I mean the dot product, thank you NumPy), and apply the activation function, the non-linearity. The reason we apply a non-linearity is that it's what makes a neural network a universal function approximator. (I'm telling you a lot right now, so don't worry if you don't catch everything; there's more to come, and you can re-watch this video.) In this basic case we don't have a bias, but usually we do, so we're doing input times weight, activate. We repeat that for the last layer, and then k2 is our output prediction. We compute the error as the difference between the actual output and the predicted output, and then we perform backpropagation. We take that error, multiply it by the derivative of the activation at the output, and that gives us the gradient value, the delta, the change: that delta is what we use to update the weights. Having computed the delta for the output layer, we compute the gradient for the previous layer recursively: we use the k2 delta to see how much the k1 values contributed to the k2 error, and once we have that k1 error, we do the same process again to compute the k1 gradient, the first layer's gradient. Once we have both gradients, we update both weight matrices with them, and we repeat all of this for 60,000 iterations. That is backpropagation.

I wanted to go on a tangent (no pun intended) about derivatives and gradients and how backpropagation works, because backpropagation is the workhorse of deep learning. There's a great chart, the Neural Network Zoo, that shows many different types of neural networks. There are a lot of them; it's not just one.
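The canonical three-layer example walked through above can be written out in a few lines of NumPy. This is a minimal reconstruction of that example: the k0/k1/k2 and syn0/syn1 names, the 3-4-1 layer sizes, the absence of biases, and the 60,000 iterations follow the walkthrough, while the random seed is my own choice to make the run reproducible.

```python
import numpy as np

np.random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(out):
    # Derivative of the sigmoid, expressed in terms of its output.
    return out * (1.0 - out)

# Input triplets and their target labels, as in the walkthrough.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Randomly initialized weight matrices (the "synapses") for a 3-4-1 network.
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

for _ in range(60000):
    # Feed forward: input times weights, then activate.
    k0 = X
    k1 = sigmoid(k0 @ syn0)
    k2 = sigmoid(k1 @ syn1)

    # Error at the output, then back-propagate gradients layer by layer.
    k2_error = y - k2
    k2_delta = k2_error * sigmoid_deriv(k2)
    k1_error = k2_delta @ syn1.T        # how much each hidden unit contributed
    k1_delta = k1_error * sigmoid_deriv(k1)

    # Update both weight matrices with their gradients.
    syn1 += k1.T @ k2_delta
    syn0 += k0.T @ k1_delta

predictions = sigmoid(sigmoid(X @ syn0) @ syn1)
```

After training, `predictions` is close to the targets `[0, 1, 1, 0]`: the whole loop is exactly the four-step recipe from the transcript (feed forward, get the error, back-propagate, update).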
There's a lot right and back propagation is theOptimization strategy of choice for almost all of them right almost all of them use labeled dataAnd then back and then back propagation has an optimization strategy to learn some mapping function right everything is a function in lifeEverything is a functionlove is a function emotions are a function that the sound of theairplane above and then relating that toHow fast velocity and you know all these different variables?It's all you can represent everything as a function math is everywhere math is all around us math is beautiful. It's beautiful seriously. OhMy god, it's awesome anywayeverything is a function right so we're trying to learn the function andSupervised learning using back propagation is a way for us to do that soHow do artificial and biological neural networks compare so this is a verybasic view of how they compare the ideaIt's such a roughIt's such a rough the initial perceptron the initial neural network were so roughly inspired byBiological neural networks. It wasn't like they were saying well. Let's let's implement a neuro trend let's let's implementYou know dopamine and dendrites in all of their details. I mean neurons are these very complex cellsIt's very basic all the only inspiration is saying you have some neuron it's got a set of dendrites that receive some input itPerforms some kind of activation from some kind of activation on that neuron what what that means is it decides whether or not to?Propagate that that signal onward or not using some function and if it decides to then it sends it out. 
That's itThat's that's the extent ofthe inspiration between artificial and biological neural networks, rightBecause we have some input we compute some activation function like riilu or sigmoidOr you know there's there's many of them out thereAnd then we output the value right so the brain has a hundred billion of these neuronsNumerous dendrites and it commits it uses parallel chaining so each neuron is connected to ten thousand plus othersCompare those two computers right computers don't have neurons in terms of hardwareThey are made of siliconAnd they are serially changedWhich means these transistors on or off switches are each connected to two or three others and they form logic gates?So with and they are great at storage and recall even though. They are not as parallelized as our brainThey are still better than at some things we got to admitThen we are like it's better at calculating numbers in in memory right we can't compute a million times a millionBut uh, but a computer can however what our brain is really good at that computers are not isCreativity right we are able to take some idea that is completely unrelated to another idea and apply it and thenIt results in some amazing innovation, or task and we are great at connecting different concepts togetherWe are great at being able to learn many different things and apply our knowledge to many different tasksAnd that's what we should be trying to do with AIand so there are some really key differences between our brain andArtificial neural networks first of all everything in the brain is recurrent that means there is always some kind of feedback loop happenIn any type of sensory or motor system right not all neural networks are recurrentThere's a lot of lateral inhibition, which means that neurons areInhibiting other neurons in the same layer. 
We haven't seen a lot of that in deep learningThere is no such thing has a fully connected layer in the brain connectivity is usually sparse although not randomUsually we have fully connected layers at the end of our networks like say for convolutional networks, but in the brain there are none, right?Everything is sparsely connected, but it's it's smartly sparsely connectedBrains are born pre-wired to learn without supervisionSo we talked about this a little bit right now babies can know things even though they don'tThey learn there aren't given labels or any kind of supervision and lastly the brain is super low-power at least compared to deep neural networksRight the brain's power consumption is about 20 watts compare that toarguably one of the most advanced AIS today alphago it used about 1200 CPUs and176 GPUs not to trainBut just to run just imagine how much how many watts that takes that's like an order of an order of magnitudeMore power than our brain takes, which is which is annoyingly inefficient right so we canDefinitely definitely definitely improve on thatThere's this great book by this Harvard psychologist Steven PinkerWhich I've read and I would highly recommend it called how the mind works, and this book is from a neuroscience perspectiveNot a machine learning perspective, but we need more of thatWe need more of that because there are certainly a lot of Secrets here that we haven't figured outBut we're trying so this is a great book to read and it's a there's a great quote from that book that I'm gonna readOut to youWhich I particularly like the quote is the brain is not a blank slate of neuronal layers waiting to be pieced together and wired upWe are born with brains already structured for unsupervised learning in a dozen cognitive domains some of whichAlready work pretty well without any learning at all right evolution has primed us to be able to do certain thingseven though we don't have anyReal-time learning happening. 
It's just wired into us right so there is something to be about structure versus learning everythingAnyway, okay?So we've talked about that so where are we today right so that was the first part here the second partAnd then we'll get to the third part research directions, so where are we today in?Unsupervised learning we know where we are with supervised learning that means when we have labelsBut what if we don't have labels well we can divide machine learning into two types besides supervised and unsupervised classificationand generationRight these are two tasks and one meta way of looking at it as is as creativity and discoverywhen everything else is automated for us when all of theYou know all of the brainless labor that we don't care about when all that is automated what's gonna be left for us humans isour two tasksCreativity and discovery right what can we create?What can we discover and we're and we and we can frame those things as classification discovery and creativity?Generation so for classification what is something?clustering right clustering is perhaps the most popular technique when it comes to classification andThere are many ways to cluster data, right if you don't have the labels, but you do have the dataMaybe you can learn clusters for all of these labelsSuch that they're that you'll be able to know what groups each cluster are in so it's like learning without labels right there are severalstrategies to learnclusters from data k-means is perhaps the most populardimensionality reduction techniques like T distributed stochastic neighbor embedding or th t-sne orPrincipal component analysis, there's an anomaly anomaly detectionBut most of them still used some sort of supervised learningAnd the ones that don't use back propagation are not necessarily betterThere are actually very simple algorithms like k-means is just you know these four steps right hereIt's very simple. 
And that's where we are right now with classification. There's also autoencoding: autoencoders are really popular for unsupervised learning. The idea is that, given some input, you learn a dense internal representation and then try to reconstruct the input from it. This is great for dimensionality reduction, learning features, and so on.

Generative Models: GANs, VAEs, and the DNC

For generation, perhaps the most popular model right now is the generative adversarial network (GAN). I met its creator, Ian Goodfellow, and we had a good conversation in San Francisco; he's a really smart guy. The idea is so basic, so intuitive, and yet it's the reason behind a lot of the hype in deep learning right now. You have two networks, and one tries to fool the other: a generator and a discriminator. Say you have a dataset of images, and you want to generate new images that look similar but are new. The generator takes a sample from some latent distribution, a Gaussian for example, and transforms it into an image, which is fake. The discriminator is a classifier: sometimes it is shown a real image and sometimes a fake one, and it tries to tell which is which. Whether it gets it right or wrong, you can compute an error value between its prediction and the truth and backpropagate an error gradient through both networks. The whole system is end-to-end differentiable, because we can differentiate with respect to every weight in it. So even though there are no explicit labels, we are still using backpropagation; it's self-supervised, in the sense that we create the labels (real versus fake) ourselves.

Another great example is the variational autoencoder (VAE), where we embed stochasticity inside the model itself. Inside the layers there is a random variable, which means the network is not deterministic; it's stochastic. You cannot predict what the output will be: feed it some input, and because one of the layers applies a distribution to it, the output is a new, unpredictable sample, which is exactly what you want for generation.

Lastly, and these are the bleeding edge of unsupervised learning models, there is the differentiable neural computer (DNC). I'm going to go out on a limb and say the DNC is the most advanced algorithm out there that uses backpropagation. Maybe AlphaGo is better, but we haven't seen its source code, so I wouldn't know; in terms of openly available source code, the DNC is amazing. It's also highly complex.
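Before unpacking the DNC, it's worth seeing how small the stochastic trick inside a VAE really is. In practice the random layer is implemented with what's called the reparameterization trick, so the randomness stays differentiable. Here's a minimal sketch in plain Python; the mu and log-variance values are made up, where a real VAE would predict them per dimension with an encoder network:

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, with eps drawn from a standard normal.
    mu and log_var enter through plain arithmetic, so gradients could flow
    through them even though the output z itself is random."""
    eps = rng.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps

rng = random.Random(42)
# Same (mu, log_var) in, a different z out on every call: the layer is stochastic.
z1 = reparameterize(1.0, 0.0, rng)
z2 = reparameterize(1.0, 0.0, rng)
```

Feeding the same input twice gives two different samples, which is exactly the unpredictability described above; averaged over many samples, z still centers on mu.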
There are so many moving parts in the differentiable neural computer; I have a video on it (just search "DNC Siraj"). You've got read and write heads, but basically you are separating memory from the network itself. The analogy the authors drew was between DNA and the brain: you have an encoded external memory store, and then you have an internal controller. The controller pulls from the memory through read and write heads, and there are links between different rows in the memory. There are so many differentiable parameters: read and write heads, LSTM cells, and every single one of those matrices is differentiable. It's a gigantic, very complex system, and everything in it is differentiable. To test it, the authors generated random graphs, including a subway map, and used the DNC to predict where someone would go based on a question, which is just incredible. They also trained it on family trees and a bunch of other things.

The point is that the best unsupervised learning methods still require backpropagation. Backpropagation really is the workhorse of deep learning, even in the unsupervised setting.
The Case for Radical Ideas

Another thing I want to say is that a lot of deep learning research is about making small, incremental improvements on existing ideas, and a lot of the time academia pushes us in that direction: tweak one hyperparameter, add some new layer type or cell type like a GRU. But if you think of a radically new idea, you can really shake things up, and the idea doesn't even have to be complex. Think of generative adversarial networks: it's such a simple idea, just two neural networks where one tries to fool the other, and Yann LeCun has called adversarial training the most interesting idea in machine learning in years. The idea was invented just two years ago, and look at the number of GANs that have been inspired by that first paper: there are so many, all in two years, all these different variants.
I could go on; you could make an entire four-month course on all the different types of GANs out there. My point is that anyone can think of a really good idea when it comes to deep learning; the playing field is level for everyone. So let's get to the future research directions.

My thesis is this: unsupervised learning and reinforcement learning must be the primary modes of learning, because labels mean little to a growing child. We need to use more reinforcement learning and more unsupervised learning, and then we'll get somewhere better than where we are right now.

Bayesian Deep Learning

The first research direction is Bayesian deep learning, which doesn't discard backpropagation; it just makes it smarter. Bayesian logic is all about holding prior assumptions about how the world works, versus frequentist methods, which build in no prior assumptions. When you apply Bayesian reasoning to deep learning you can get amazing results, as variational autoencoders have shown, but deep learning still struggles to model uncertainty. What I specifically mean by Bayesian deep learning is smarter weight initialization, and perhaps even smarter hyperparameter initialization. This relates back to the child: evolution has primed us to know certain things before we've learned them in real time. The weights in our heads are not initialized randomly when we start learning; we begin with some smarter initialization. Combining those two fields, Bayesian logic and deep learning, is a great research direction.

Spike-Timing-Dependent Plasticity

The second direction is spike-timing-dependent plasticity (STDP), and there's a nice analogy for it. Say you're trying to predict whether it's going to rain. You can go outside and see for yourself, or you can watch your roommate, who tends to take an umbrella every time he goes out, and every single time he walks out with an umbrella it happens to be raining. So rather than checking the sky yourself, you just look at whether your roommate picks up an umbrella, and if he does, you take one too. The analogy applies to STDP because you can't properly backpropagate weight updates in a brain-like, graph-based network, since it's an asynchronous system; instead, we trust neurons that are faster than we are at the task. It's all about timing: looking at which neurons fire sooner and using their firing as a signal for how we learn. Suppose we have two neurons, A and B, and A synapses onto B. The STDP rule states that if A fires and B fires after a short delay, the synapse will be potentiated, and the magnitude of the weight increase is inversely proportional to the delay between A and B firing. We're taking the timing of firing into consideration, which deep learning currently does not do.

Self-Organizing Maps

The third idea is self-organizing maps. This is not a new idea at all, but that's another point I want to make: there is so much machine learning and deep learning literature out there, and a lot of the time the best ideas are forgotten, lost in the mix, because there's so much hype around certain ideas.
Sometimes it's unnecessary hype, and some of the best ideas were invented 20 or 30 years ago; just look at deep learning itself. It's all about finding those ideas, and self-organizing maps are one of them: an older idea with a lot of potential that not many people know about. A self-organizing map is a type of neural network used for unsupervised learning. The idea is this: we randomize the node weight vectors in a map. Then we pick an input vector from our data and traverse each node in the map, computing the distance between the input vector and each node's weight vector. The node that is closest, the most similar to our input, is the best matching unit (BMU). We then update the weight vectors of the nodes in the neighborhood of the BMU, pulling them closer to the input vector. Repeating this produces a self-organizing map: you can visualize it as patches of different colors, but it's basically clusters of different data points. It's clustering, and I think it's a great idea.
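The loop above is short enough to sketch directly. Here's a minimal one-dimensional self-organizing map in plain Python; the map size, learning rate, neighborhood radius, and toy scalar data are all made up for illustration (real SOMs typically use a 2-D grid and decay the learning rate and radius over time):

```python
import random

def train_som(data, n_nodes=10, epochs=50, lr=0.5, radius=2, seed=0):
    """Minimal 1-D self-organizing map over scalar inputs."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(n_nodes)]   # 1. randomize the node weights
    for _ in range(epochs):
        for x in data:                                 # 2. pick an input vector
            # 3. find the best matching unit: the node closest to the input
            bmu = min(range(n_nodes), key=lambda i: (weights[i] - x) ** 2)
            # 4. pull the BMU and its neighbours closer to the input
            for i in range(n_nodes):
                if abs(i - bmu) <= radius:
                    weights[i] += lr * (x - weights[i])
    return weights

# Inputs drawn from two groups; the map organizes nodes around both.
data = [0.1, 0.12, 0.09, 0.9, 0.88, 0.91]
weights = train_som(data)
```

After training, every input has a nearby node: the node weights have organized themselves around the two clusters present in the data, with no labels and no backpropagation involved.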
It doesn't use backpropagation at all, and we should look more into that.

Synthetic Gradients

The fourth direction is synthetic gradients. Andrew Trask has a great, in-depth blog post on this that I highly recommend, and the idea came out of DeepMind. It's basically a much faster version of backpropagation in which you don't wait as long to update your weights. Individual layers make a best guess for what they think the data will say, then update their weights according to that guess. That best guess is called the synthetic gradient, because it's a prediction of what the gradient will be, not what it actually is, and the data is only used to help update each layer's guess, its synthetic gradient generator. This allows individual layers to learn in isolation, which speeds up training: layers can learn without waiting for a full forward and backward pass. It's strange that synthetic gradients aren't discussed more; even on the machine learning subreddit, people were asking why this hasn't been talked about more. It's a great idea, and you should definitely learn more about it.

Evolutionary Strategies and Deep Reinforcement Learning

The fifth research direction is evolutionary strategies. OpenAI had a great blog post on this, framing evolution strategies as a scalable alternative to reinforcement learning. Evolutionary strategies haven't given us a lot of success so far, but that's okay; intuitively they make a lot of sense. They try to resemble evolution: you have a fitness function that determines how fit an individual is, individuals mate, and there's crossover; it's basically survival of the fittest, with mutation, selection, and crossover driven by fitness. You can apply this to a lot of games: take several neural networks and use evolutionary strategies to let the best one win, or survive longer than the rest. I think there's a lot of potential there, and it's very similar to reinforcement learning.

In fact, if I were to pick the lowest-hanging fruit, the place where revolutionary, radical ideas are most likely to land, it would be deep reinforcement learning. Reinforcement learning is all about learning from trial and error: in the agent-environment loop, an agent performs an action in an environment, receives a reward, updates its state based on that reward, and continues the process. AlphaGo used deep reinforcement learning to get really good at its game, and there are still so many unanswered questions, like how to learn the best policy. Reinforcement learning in general is a great place to focus your research.

Neuromorphic Hardware

The last direction is the most capital-intensive and perhaps the hardest, but I had to mention it. We talked about how transistors are on/off switches chained together serially to form logic gates, whereas neural networks are parallel in their construction; a biological neuron connects to thousands of others. So perhaps, instead of trying to replicate the rules of intelligence in silico on the kinds of chips we have today, we should change the hardware itself. IBM's neuromorphic chips are a good example of going in this direction, as are Google's TPUs (tensor processing units), and the basic idea is to wire up transistors in parallel, like the brain. I really think anyone can work on this.
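To make the evolutionary-strategies direction concrete before moving on, here's a minimal mutate-evaluate-select loop in plain Python. It evolves a real-valued genome toward a made-up fitness target; everything here (population size, mutation scale, the fitness function itself) is an illustrative assumption, and unlike the OpenAI formulation it uses no gradient estimate, just mutation and selection:

```python
import random

def evolve(fitness, genome_len=5, pop_size=20, generations=100,
           sigma=0.1, seed=0):
    """Minimal evolutionary strategy: mutate genomes, score them with a
    fitness function, and let the fittest seed the next generation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 4]               # selection: the top quarter survives
        pop = [[g + rng.gauss(0, sigma) for g in rng.choice(parents)]
               for _ in range(pop_size)]               # mutation: Gaussian noise on a parent
        pop[0] = list(scored[0])                       # elitism: always keep the best so far
    return max(pop, key=fitness)

# Made-up fitness: genomes closer to all-ones are fitter.
fitness = lambda g: -sum((x - 1.0) ** 2 for x in g)
best = evolve(fitness)
```

Crossover could be added by splicing two parents' genomes; this sketch keeps only mutation and selection, which is already enough to climb a simple fitness landscape.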
If you have an idea here, think about it: the brain runs on only 20 watts, so it can't be that expensive in terms of hardware, or wetware for that matter. You can crowdfund whatever hardware you want to build, on Wefunder or Kickstarter, and I think you could even start a hardware company for machine learning.

Conclusion: Simulated Environments and the Cognitive Toolkit

So what is my conclusion? Those are the seven research directions I wanted to talk about, offered as a response to Hinton's comments on backpropagation. I agree with Andrej Karpathy, one of the best deep learning researchers out there and now director of AI at Tesla: let's create multi-agent simulated environments that rely heavily on reinforcement learning and evolutionary strategies. Karpathy gave a great talk at Y Combinator, which I didn't attend, but the slides are online, and one slide said that intelligence, the cognitive toolkit, includes but is not limited to attention, working memory, long-term memory, knowledge representation, emotions, and consciousness. So many different topics encompass learning; it's an orchestra of concepts that work together to define intelligence.

The conclusion is that we need to create environments that incentivize the emergence of this cognitive toolkit. Doing it the wrong way looks like Pong: what does that environment incentivize? A lookup table of correct moves. Doing it right looks like two agents in a world with food and survival pressure, learning to adapt to each other.
That's much more like real life itself, and with more complexity it incentivizes a cognitive toolkit: cooperation, attention, memory, even emotions. It comes down to the exploration-versus-exploitation dilemma from reinforcement learning: how much do we want to exploit existing algorithms like backpropagation by making incremental improvements, versus how much do we want to explore entirely new ideas? We need people doing both. We need people improving existing deep learning algorithms, because there is still a lot to improve, but we also need people exploring entirely new ideas; in fact, I think we need more people focusing on exploration than we currently have. If it were up to me, I'd take 20% of the people working on exploitation and move them to the exploration category.

It's something to think about. I hope this helped you think more about these concepts, about where we're headed and where we should go, and I hope it gave you some ideas about what you might be most interested in. I'm going to keep making videos like this, so please subscribe for more, and for now I've got to evolve, so thanks for watching.