Hello, World: Exploring Variational Autoencoders with Siraj Raval
Hello, world! It's Siraj, and today we're going to explore a really cool model called a variational autoencoder. What happens when we encounter data with no labels? Can we still learn from it? These are the questions we'll be answering as we delve into the world of unsupervised learning.
Most recent successes in deep learning have been due to our ability to train neural networks on huge sets of cleanly labeled data. If you have a set of inputs and their respective target labels, you can try and learn the probability of a particular label for a particular input. Intuitively, this makes sense. We're just learning a mapping of values. But the hottest area of research right now is learning from raw, unlabeled data that the world is absolutely brimming with. Unsupervised learning.
One type of model that can do this is called an autoencoder. Autoencoders sequentially de-construct input data into hidden representations, then use those same representations to sequentially reconstruct outputs that resemble their originals. Autoencoding is considered a data compression algorithm, but they're not perfect and can distort the original data in certain ways.
The concept of unsupervised learning may seem daunting at first, but it's actually quite straightforward once you understand the basics. Unsupervised learning involves training a machine learning model without any labeled data. The goal is to find patterns or relationships within the unlabeled data that can be useful for making predictions or generating new data.
One way to approach unsupervised learning is through the use of autoencoders. An autoencoder is a type of neural network that consists of two parts: an encoder and a decoder. The encoder takes in a input, processes it using a series of transformations, and produces a compressed representation of the input. The decoder then takes this compressed representation and expands it back into its original form.
Variational Autoencoders (VAEs) are a type of autoencoder that adds a new layer of complexity to the traditional autoencoder design. In addition to producing a compressed representation of the input, VAEs also produce a probability distribution over the input space. This distribution is known as the latent space, and it represents all possible versions of the input data.
The key advantage of VAEs is that they can be used for both dimensionality reduction and generative modeling. In dimensionality reduction, the goal is to reduce the number of dimensions in a dataset while preserving its essential characteristics. In generative modeling, the goal is to generate new data that is similar to the existing data.
So, how do VAEs work? The process begins with training the encoder network. This involves feeding the input data into the encoder and adjusting the weights and biases of the layers until the output matches a pre-defined target distribution. Once the encoder is trained, we can use it to generate new data by sampling from the latent space.
The next part of our journey will delve deeper into how VAEs work by examining the components that make them up.
"WEBVTTKind: captionsLanguage: en(siraj raval) Hello, world.It's Siraj.And what happens when weencounter data with no labels?Can we still learn from it?We're going to use areally cool model calleda variationalautoencoder to generateunique images after trainingon a collection of imageswith no labels.Most of the recentsuccesses in deep learninghave been due toour ability to trainneural networks on huge setsof cleanly labeled data.So if you have a set of inputsand their respective targetlabels, you can try andlearn the probabilityof a particular labelfor a particular input.Intuitively, this makes sense.We're just learninga mapping of values.But the hottest areaof research rightnow is learning fromthe raw, unlabeled datathat the world is absolutelybrimming with, unsupervisedlearning.One type of model that can dothis is called an autoencoder.Autoencoders sequentiallyde-construct input datainto hiddenrepresentations, thenuse those same representationsto sequentially reconstructoutputs that resembletheir originals.Autoencoding is considered adata compression algorithm,but they're data specific.So it can only compressdata similar to whatit's been trained on.They're also relatively lossy.So the decompressed outputswill be a bit degraded comparedto the original inputs.Usually a well knowncompression algorithm like jpegdoes better.So what are they useful for?Absolutely nothing.Just kidding.One major use caseis data denoisingwhere we train anautoencoder to reconstructthe input from acorrupted version of itso that given some similardata that's corrupted,it can denoise it.Another is to generatesimilar but unique data, whichis what we'll do.There are a bunch ofdifferent autoencoder types,and each has its own uniqueimplementation details.But let's dive into one Ifind particularly interesting,the variationalautoencoder, or VAE.While deep learningis a great wayto approximatecomplex functions,Bayesian inferenceoffers a unique frameworkto reason about uncertainty.It's an approach tostatistics in whichall forms of uncertaintyare expressedin terms of probability.At any time thereis the evidencefor and against something,which leads to a probability,a chance that it is true.When you learnsomething new, youhave to fold this new evidenceinto what you already knowto create a new probability.Bayesian theory describesthis process mathematically.The idea for the VAE camewhen these two ideas merged.So looking at it througha Bayesian likeness,we treat the inputs,hidden representations,and reconstructedoutputs of a VAEas probabilistic,random variableswithin a directedgraphical model.So it contains a specificprobability model of some data,x, and latent orhidden variables, z.We could write outthe joint probabilityof the model this way.Given just a characterproduced by the model,we don't know which settingof the latent variablesgenerated the characters.We've built variationinto our model.It's inherently stochastic.Looking at it througha deep learning lens,a VAE consists of an encoder,decoder, and loss function.So given some input x--let's say we've got 28by 28 handwritten digitimage, which is 784dimensions whereeach pixel is one dimension.It will encode it into a latentor hidden representation space,which is way less than 784.We can then sample from aGaussian probability densityto get noisy values ofthe representations.Let's start writing thisout programmatically.After importing thelibraries and findingour hyperparameters, we'llinitialize our encoder network.Its job is to map inputs to ourlatent distribution parameters.We'll take the inputdata and send itthrough a densefully connected layerwith our choice of non-linearityto squash dimensionality,which will be ReLU.Then we convert the inputdata into two parametersin a latent space.We predefine the size usingdense, fully connected layers,again z mean and z log sigma.The decoder takesthe z as its inputand outputs the parameters tothe probability distributionof the data.Let's say each pixelis either 1 or 0since it's black and white.We can use aBernoulli distributionsince it defines asuccess as a binary valueto represent a single pixel.So the decoder gets the latentrepresentation of a digitas input and outputs 784Bernoulli parameters,one for each of the pixels.We're going to use thesetwo variables we have hereto randomly sample new similarpoints from the latent normaldistribution by defininga sampling function.Epsilon refers to arandom normal tensor.Once we have z, we canfeed it to our decoder.The decoder will map theselatent space points backto the original input data.We'll initialize it withtwo fully connected layersand their own respectiveactivation functions.And because thedata is extractedfrom a small dimensionalityto a larger one, some of itis lost in thereconstruction process.But how much?We need a measure of this, andthat will be our loss function.I want a generationdata unpredictably,and so my model isstochastic probability.For a VAE, the lossfunction looks like this.The first term measuresthe reconstruction loss.If the decoder output is bad atreconstructing the data well,that would consequentlyincur a large cost here.The next term isthe regularizer.Regularizing in thiscase means keepingthe representations of eachdigit as diverse as possible.If a two was written bytwo different people,we could end up with verydifferent representations.That's a bad thing.We only want one since theyboth wrote the same number.We penalize badbehavior this way,and it makes sure similarrepresentations are closetogether.So our total lossfunction is definedas the sum of our reconstructionterm and the KL divergenceregularization term.So this is the dopest part.The way we normallywant to train this modelis to use gradient descent tooptimize a loss with respectto the parameters of boththe encoder and the decoder.But wait, how are we going totake derivatives with respectto the parameters of astochastic or randomlydetermined variable?We've built randomnessinto the model itself.Gradient descent usually expectsthat a given input alwaysreturns the same output fora fixed set of parameters.So the only source ofrandomness would be the inputs.What do we do, cowboy?We're going to reparameterize.We want toreparameterize samplesso that the randomness isindependent of the parameters.So we define a function thatdepends on the parametersdeterministically, andwe inject randomnessinto the model by introducinga random variable.Instead of theencoder generatinga vector of real values, itwill generate a vector of meansand a vector ofstandard deviations.Then we can take derivativesof the functions involvingz with respect to theparameters of its distribution.We'll define ourmodel's optimizeras rmsprop, which performsgradient descent and its lossfunction as the one we defined.Then we'll train it byimporting our digit imagesand feeding them into our modelfor a given number of epochsand batch size.For our demo, we can plotout the neighborhoods,the different classeson a 2D plane.Each colored cluster representsa digit representation,and close clusters are digitsthat are structurally similar.We can also generate digitsby scanning the latent plane,sampling latent pointsat regular intervals,and generating the correspondingdigit for each of these points.So the three keytakeaways here arethat variationalautoencoders allowus to generatedata by performingunsupervised learning.They are a product of twoideas, Bayesian inferenceand deep learning.And we can backpropagatethrough our networkby performing areparameterizing step thatmakes randomness independentof the parameters wederive our gradients from.Mike McDermott is last week'scoding challenge winner.He cleverly applied themulti-armed bandit problemto the stock market.Each BOC he created used adifferent investment strategybased on reinforcementlearning to profit,and his IPython notebookis a great exampleof what well-documentedcode looks like.Wizard of the week.And the runner-upis S. G. Eshwar.He successfully took stateinto account, which made itthe contextual bandit problem.Dope submissions, guys.I bow to both of you.The coding challengefor this videois to use a VAE togenerate somethingother than digit images.Details are in theread me Github Links,go in the comments,and winners are goingto be announced next week.Please subscribe if you wantto see more programming videos.Check out this related video.And for now, I've gotto deal with my hair.So thanks for watching.\n"