Torch Tutorial (Alex Wiltschko, Twitter)

**The Future of Deep Learning: A Conversation with Alex**

In this conversation, we had the opportunity to sit down with Alex Wiltschko, a deep learning researcher at Twitter who works on Torch, an open-source deep learning framework, and who helped build its torch-autograd package. We discussed various aspects of deep learning, including its applications, challenges, and future directions.

**Using Torch for Deep Learning**

When it comes to choosing a deep learning framework, many developers reach for Torch because of its ease of use and flexibility. However, some may wonder whether there is any reason to use Torch when frameworks like Keras and TensorFlow are already popular choices. According to Alex, the main advantage of Torch is how easy it makes reasoning about performance. "People tend to reach for Torch when they want to be able to reason very easily about performance," he explained. "The kind of compiler infrastructure that gets added to a deep learning environment can make it harder for the end user to reason about why something is slow or not working."
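
To make that concrete, here is a minimal sketch of how a small classifier is defined and trained with Torch's `nn` package; the layer sizes, data, and learning rate are placeholders, not anything from the talk:

```lua
require 'torch'
require 'nn'

-- A small feed-forward classifier built as a sequential container of layers.
local model = nn.Sequential()
model:add(nn.Linear(100, 50))    -- fully connected: 100 inputs -> 50 hidden units
model:add(nn.ReLU())
model:add(nn.Linear(50, 10))     -- 10 output classes
model:add(nn.LogSoftMax())

local criterion = nn.ClassNLLCriterion()

-- One gradient step on a single (random, placeholder) example.
local input, target = torch.randn(100), 3
local output = model:forward(input)
local loss = criterion:forward(output, target)

model:zeroGradParameters()
model:backward(input, criterion:backward(output, target))
model:updateParameters(0.01)     -- plain SGD step, learning rate 0.01

print('loss on this example: ' .. loss)
```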

**Accessing Torch Models**

One common question Alex received from developers was how to access Torch models from other languages, such as Java or on Android. According to Alex, there are several ways to do this. Production web services are usually Python or Java applications, he explained, so the question becomes how to call models that were trained in Torch, and there are a couple of different ways to do it. One approach is to write the inference code in the target language's native deep learning library and load the trained weights into it.
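
As a rough illustration of that first approach, the sketch below dumps each parameter tensor of an already-trained model to its own binary file, which code in another language could then read into whatever library it uses; the file names and the flat float32 layout are assumptions for illustration, not a standard export format:

```lua
require 'torch'
require 'nn'

-- 'model' is assumed to be an already-trained nn container.
local params = model:parameters()   -- table of weight/bias tensors, in layer order
for i, p in ipairs(params) do
  local f = torch.DiskFile(string.format('weights_%02d.bin', i), 'w'):binary()
  f:writeFloat(p:contiguous():float():storage())   -- contiguous float32 dump
  f:close()
end
```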

**Serializing Weights and Loading into Other Languages**

Another way to access Torch models is to serialize them and load them back wherever a Torch runtime is available. "You do basically just serialize your model and then try to read it," Alex said. This approach can be useful when working in constrained environments or in languages that don't support deep learning out of the box.
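
A minimal sketch of that workflow, assuming a trained `nn` model in the variable `model` and Torch's own `.t7` serialization format:

```lua
require 'torch'
require 'nn'

-- Save the trained model in Torch's native serialization format.
torch.save('model.t7', model)

-- Later, in any process that hosts a Lua/Torch runtime, read it back for inference.
local trained = torch.load('model.t7')
trained:evaluate()                          -- inference mode (affects dropout, batchnorm, ...)
local probabilities = trained:forward(torch.randn(100))
print(probabilities)
```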

**The Impact of Latency on Model Deployment**

Latency is an important consideration when deploying models in production environments. According to Alex, the latency in question is the time it takes for a model to make a prediction, not the time it takes to ship the model. "If you're calling Torch from C code, the latency is not appreciable over just running Lua code," he explained. However, going through a wrapper such as the JNI does incur an overhead that increases latency, so it pays to pick interfaces that keep that overhead small, even at some engineering cost.
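
One way to reason about this is to measure the in-process prediction latency first and treat any wrapper overhead as additive on top of it. A rough sketch, where the model and input are placeholders:

```lua
require 'torch'

-- 'model' is assumed to be a trained network; the input size is a placeholder.
local input = torch.randn(100)
local nTrials = 1000

local timer = torch.Timer()
for i = 1, nTrials do
  model:forward(input)
end
print(string.format('mean in-process latency: %.3f ms',
                    1000 * timer:time().real / nTrials))
```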

**The Torch Community and Future Directions**

One question Alex received was about the future of Torch. According to him, the Torch community is not centralized, which means that people could be working on complementary tools, such as a model server, without being aware of each other's work. On where the real bottlenecks lie, he explained: "We're constrained by machine learning model complexity and latency. We are not constrained by the overhead of figuring out how to actually get those predictions." This decentralized structure leaves room for a wide range of approaches to deep learning.

**A Note on the Future of Lua**

According to Alex, the Lua virtual machine has been embedded in applications for 15-20 years and is still widely used today. "Lua is in microwaves, for instance," he explained. The Lua binary is very small, which makes it a good choice for constrained environments. "There's about 10,000 lines of code, so when it compiles down it's very small, on the order of kilobytes," he said.

**The Importance of Reasoning about Performance**

One of the key benefits of Torch is that it makes reasoning about performance easy, which Alex sees as an important factor in the success of deep learning projects. Torch was deliberately designed as a thin layer over C: from the Lua function you call, it is usually only one or two jumps to the C code that actually runs, so understanding why something is fast or slow rarely requires digging through layers of abstraction.

**Debugging and Optimization**

When deploying Torch models, debugging and optimization are engineering-dependent tasks that require careful consideration of how the model is called. "You will incur an overhead if you go through a wrapper like the JNI or something like that," he said, so the interfaces between languages should be chosen to keep that overhead as small as possible.

**The Role of JNI in Model Deployment**

The Java Native Interface (JNI) lets Java code call into native libraries, which makes it possible to invoke Torch models from Java applications. According to Alex, this is how Twitter serves models from its Java services: "We've engineered a system where we actually have Lua virtual machines running inside of Java, and we talk over the JNI," he explained. This kind of embedding is also useful in constrained environments or in languages that don't support deep learning out of the box.
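
A sketch of what the Lua side of such an embedding could look like: a script the host application loads once, exposing a single entry point it can call. The `predict` function, the file name, and the assumption that the model ends in `LogSoftMax` are all illustrative, not Twitter's actual production code:

```lua
require 'torch'
require 'nn'

-- Load the serialized model once when the host creates the Lua VM.
local model = torch.load('model.t7')
model:evaluate()

-- The host (e.g. Java calling through JNI into the embedded Lua VM) passes a
-- flat array of numbers and gets class probabilities back as a plain Lua table.
function predict(values)
  local input = torch.Tensor(values)            -- build a tensor from a Lua array
  local logProbs = model:forward(input)
  return torch.exp(logProbs):totable()          -- assumes the model ends in LogSoftMax
end
```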

**The Need for Interoperability**

As the deep learning community continues to grow, the need for interoperability between frameworks becomes increasingly important. According to Alex, this is an area that requires careful consideration and planning. "If you're using standard model architectures, you might try to serialize your weights and then use the native deep learning library that exists to load up those weights," he suggested.

**Conclusion**

As we conclude our conversation with Alex, it's clear that the future of deep learning holds many exciting possibilities. With its ease of use and flexibility, Torch is an attractive choice for developers working on deep learning projects. Whether you're using Torch or another framework, the key to success lies in understanding the challenges and limitations of deep learning and finding ways to overcome them.

**A Final Note on Torch**

Torch is an open-source deep learning framework that has been gaining popularity in recent years. Although Twitter, Facebook AI Research, and NVIDIA all contribute heavily to it, no single company owns or steers the project, and a large academic community participates alongside industry. Combined with its ease of use, flexibility, and thin layer over C code, that makes it a great choice for developers who want to build deep learning models quickly and reason clearly about how they run.

"WEBVTTKind: captionsLanguage: enso I'm gonna tell you about machine learning with torch and with torture Auto grads so the the description of the talk isn't entirely correct I'm gonna do practical stuff for the first half and then what I want to do is dive into torch Auto grad and some of the concepts that are behind it and those concepts also happen to be shared amongst all deep learning libraries so I really want to give you a perspective of the common thread that links all deep learning software you could possibly use and then also talk a bit about what makes each of the libraries different and why there's I will I will hypothesize why there's so many and the different choices so one thing I want to try there's been a lot of questions and we've gone over time but if there's not questions that go over time in the room there's a lot of people watching online and if there's extra time we'll of course prioritize people here but if you ask a question with the DL school hashtag or if you tweet at me directly I will try to answer those questions from online and I'll certainly answer them offline as well so ask if you're watching at home maybe that will kind of increase you know meaningful participation for people watching through the stream that aren't here today umm a lot of this material was developed with sumus chintala at Facebook he's kind of the Czar of the torch ecosystem these days and Hugo la rochelle who you heard from yesterday and also Ryan Adams who's at Twitter with us and all this some material is available on this github repository that you got actually on a printed sheet for installing torch so all the examples that I'll show you will be in in one notebook and then there's a separate notebook which it actually won't reference in the talk that's a full end-to-end walkthrough of how to train a convolutional neural network on CFR 10 so that's kind of a self-paced tutorial notebook that you can work through on your own time but I'm going to focus on the basics on the fundamentals and hopefully give you some of the concepts and vocabulary that you can use to really dive into torch on your own time so let's let's get going so torch is an array programming language for Lua right so it's like numpy it's like MATLAB but it's in the Lua language so torch is - Lua as numpy is - pi right so what you can do in torch you can do in you know any language this is the absolute minimum basics you can grab strings and print them you can put things in associative data types in Python there's tuples and lists and sets and dictionaries in lua there's just one data type called a table so you'll see that a lot but you can do all those things that I mentioned before with with a table and you got four loops and if statements the core type of torch is the tensor just like in in numpy when you have the ND array which is a way of shaping sets of numbers into matrices or tensors we have the tensor and you can fill it up with random numbers you can multiply them standard stuff but the tensor is the core data type of torch and we've got plotting functionality going over at a very high level I'll show you some more specific code in a moment so you can do all the kind of standard stuff that you'd do in any other array based language there's all the tensor functions that you'd like to like to use including all the linear algebra and convolutions and and you know blast functions and I'm leaving this link here when the slides get uploaded you can follow this and kind of dive into the documentation and see exactly 
what what kind of tools you have at your disposal in in the notebook and the eye torch notebook which is something that seumas put together you can prepend any torch function with a question mark and that gives you the help for that function so it makes it really nice to discover functionality in the torch library in the notebook so why is it in Lua alright it's kind of a maybe a strange maybe esoteric language to write things in Lua is is unreasonably fast for how convenient it is to use especially a flavor of Lua called Lua jet for loops in Lua jet are basically the same speed as C so this for loop here is actually in production code in master and torch it's not C code but this is perfectly fast enough right so that's a really nice aspect of Lua is you can depend on super high-performance c-code and then on top of it you've got this very convenient glue layer but you don't pay much of a speed penalty to use that glue layer so that's one of the reasons why we've used Lua another advantage that some people might see as a plus is the language itself is quite small so there's 10,000 lines of C code that define the whole language of Lua so you can really sit down with the manual in an afternoon and understand most of the language on your own that same day another aspect which is pretty critical for deep learning but also for other fields is that it's really easy to interoperate with C libraries it was designed originally to be embedded so Lua was a language that was designed to run inside of another C program but have a little scripting layer inside of it so it's very easy to call indicee it's very easy for c to call into Lua so this is another reason why it's kind of an appropriate choice for deep learning libraries the FFI for like the FF I call signature and the idea has been copied into many other languages so C FF I and Python is a Python version of the Lua FF I julia has something similar as well and as I mentioned it was originally designed to be embedded and it's in all kinds of crazy places that you maybe wouldn't expect Lua to be so in World of Warcraft all the graphics are in C++ or whatever they wrote it in but like the boss battles or the quests so like when you go give the gem to the blacksmith or whatever and they give you back the magic sword the scripting of those events that happens in Lua and if you write scripts for world of warcraft to make your own quests that's Lua Adobe Lightroom is a photo processing app all the image processing is done in C++ but all the UI and everything was done in Lua so again it was used to bind together high-performance code with a with kind of a scripting layer and Redis and nginx which are kind of workhorses in the field of web development are both scriptable with Lua and in fact if you go to github pages like my page github I oh if somebody's hosting a web page on github that's served in part by Lua the apocryphal story of why I was originally chosen maybe you could correct me is klimova Oh BAE was trying to build an embedded machine learning application some device he could whereas helmut and classify the world with the CNN when he was a young student and he was trying to do this with Python and it's incredibly frustrating to get Python to run on embedded chips maybe it's easier now with raspberry pi but that just wasn't the case and then he stumbled upon Lua and turns out people had been building Lua into embedded applications for years before that and so that kind of was the snowballing effect so that's that's the hearsay for how we 
arrived at Lua but maybe there's there's another story another really nice feature of torch is we have first-class support for GPU computation interactive GPU computation so it's very very easy to get some data from the CPU to the GPU and then everything that you do with that data happens on the GPU without you having to worry about writing CUDA kernels right so this has been a feature of Lua torch which is becoming maybe a little bit less unique now but this was this was a pretty solid feature when it first came out so interactive GPU computing and I'll go very quickly over some of the basic features and all of these examples again are in a notebook which you can do kind of at your own pace if you'd like so there's all the basic arithmetic like creating matrices and and doing arithmetic between them taking maxes of numbers and arrays clamping building tensors out of ranges boolean operations over entire arrays special functions this is supported through a wrapper around the Cepheus library this is what numpy uses to support things like 10h and atan2 and other kinds of functions that I guess are in the special class and then sumif again has wrapped the Bocage a/s library which is originally just for python but it provides really nice and beautiful plots in the eye torch notebook and so we can you know draw random numbers from our favorite distributions and make nice histograms of these so you can do nice data exploration in the eye torch notebook along with deep learning so one feature that is attractive to some folks but just an interesting feature of the torch ecosystem is that although there's a lot of industries support it is not industry owned so at Twitter and at Facebook air research in at Nvidia we all contribute a lot to the torch community but we don't own it we can't really steer it to go one way or the other definitively and there's a ton of other people that participate academically in this ecosystem and that's a really nice feature and along with I guess because of the really nice habits of people in deep learning when a paper comes out there's often a high quality code implementation that follows it not not always but but very often at least compared with with other fields and torch is one of the environments in which you'll often see high quality implementations of really cutting-edge stuff so if you just browsed through github and you kind of follow researchers on github you can see really high quality implementations of image captioning of neural style transfer so you can just clone this github repository and run this yourself seek to seek models kind of the what is whatever is the state of the art there's usually a torch implementation of it some of the recent work in generating very realistic synthetic images with generative adversarial networks also has great torch code implementing it so given that there's this active community on github in deep learning for torch how does that stack up against other communities just to give you some context so the Python data science community is is pretty enormous and its focuses are also very very varied if you enter into the data science community in torch and lua you'll likely find deep learning people but not a lot of other people so it's strengthened deep learning compared to its size is actually quite enormous and for those that are kind of thinking of switching between Python and Lua and giving torch a try the effort to switch from Python to Lua you can probably do that in a day if you've tried some Python programming so I 
was a Python programmer for a while and getting started on Lua took took me maybe a couple days and I was you know actually productive at work and maybe a week or so but you can actually run your code and understand and write new things pretty quickly if you've worked in a scripting language like MATLAB or or Python so if you were intimidated or waiting to try it you should just dive in so how does torch compared to other deep learning libraries specifically as opposed to languages and the first thing I'll say is there's really no silver bullet right now there are a lot of deep learning libraries out there I say tensorflow is by far the largest and this is a plot that was made by a colleague of SU myths and I wish it kind of had confidence intervals on it because it's not strictly that these are like you know points in in deep learning space but maybe this is a good guess of where things kind of fit it seems as if tensorflow was engineered to be very good in an industrial production setting and it seems like it's really fulfilling that Theano seems to have always had a research goal in mind and has been really awesome in the research community for some time Torche tends to be more towards research than industry I think Twitter maybe has pulled it a little bit towards production we maybe are the only example I'd love to learn of others but were maybe the only example of a large company that uses torch in production to serve models so every piece of media that comes in to Twitter goes through a torch model at this point so we're really dealing with an enormous amount of data in a live setting the development of torch just to give you a sense of how we think about how it was built and how we're extending it is there's some kind of tenets of our core philosophy and if really the first is things should be as not to this isn't necessarily good or bad this but this is our choice whenever you hit enter on a particular line and your I torch notebook or on the command line you should get an answer back and this is something that we've we've tried to stick to pretty pretty tightly so no compilation time imperative programming right so just write your code and you know each each line of code executes something and passes it to the next line and minimal abstraction what I mean by minimal abstraction is if you want a reason about how your code is performing it shouldn't take you that many jumps to go to the C code that's actually being run in fact it usually is one or two jumps from the file that defines the function that you care about to the actual C code so if you want a reason about performance or really understand what's going on it's it's it's quite easy to do so in torch I want to take a little bit of a detour and tell you about how torch thinks about its objects how it thinks about the tensor because this can help you also reason about performance a lot of the reason why people come to torch is to build high-performance models very quickly and easily so I mentioned tensors before so attentional tensor a tensor is an N dimensional array and a tensor is actually just a pointer it's a view into your member into your data that's sitting in memory all right so it's just a it's a shape it's um it's a view into into what's actually being stored in your RAM and it's stored in a row major way so that means if I go to the first element of my tensor in memory and I move over one I'm moving over one in a row and not one in a column column major memory storage does exist it's just less common today so you 
often see row major so this tensor is defined by its link to some storage and it's size 4 by 6 and it's tried six by one and six by one means if I move one down in the column direction I actually have to skip six elements in memory right whereas the one here means if I move over one in the second axis the row axis I have to go over one in memory so if I take a slice of this tensor using the Select command so I select along the first dimension the third element what he gives me back is a new tensor it doesn't give me a new memory this is a thing that that happens a lot in torch is you'll deal with views into memory you won't do memory copies right so usually working with kind of the raw data in RAM and so this creates a new tensor with the size of six because there's six elements astride of one because we've pulled out a row not a column and an offset of 13 that means I have to go 13 elements from the beginning of the original storage to find that piece of memory so if I pull out a column then something different happens which is they still have or I have a size of four here and my stride is now six because in order to grab each element of the column I have to skip six and then the offset of three is because I grab the third element there all right so that's kind of a view of the of the memory model and if we act run something like this like we instantiate a double-a tensor of double of foot double values inside of the tensor and fill it with you know uniform uniform distribution and print it we can see the values here and then we grab a slice B and print it it's just this row and then we can fill B with just some number and print it now it's filled with that number now if we go back and print a we've actually overwritten the values there so this is something you see a lot in torches is working on one big piece of shared memory and as I mentioned before working with CUDA is really really easy so if you just require ku torch which is installed automatically if you have a CUDA GPU using the instructions on the github repository you can instantiate a tensor on the GPU and do the same thing and it will just work so now I want to talk a bit about the frameworks that you'll use to actually train neural networks in torch so this is a schematic kind of cartoon of how we of the pieces we typically need to train a neural network so we've got our data stored on you know hard drive or on a big distributed file system and we have some system for loading that data off of that file system which goes into a nice queue and then some training code which orchestrates a neural network so the thing actually making the prediction a cost function which is a measure of how good our neural network is at any point in our training and an optimizer which is going to take the gradient of the cost with respect to the parameters in the neural network and try to make the neural network better so in the torch ecosystem we've got some packages that tackle each one of these separately so I won't talk about threads here there's actually several different libraries that will do this there's actually several different libraries that will do each one of these things but this one is maybe the most common or the easiest to start with and and then here we'll cover both the specification of the neural network and the cost function as well as the mechanisms to push data through the neural network in the cost function and pull the gradients back from the cost to the parameters and then the optimizer which is we've heard mentioned 
several times today is to cast a gradient descent or we're outta grad so let me talk about NN first give you a flavor of kind of how it works and what the pieces are so NN is a package for building feed-forward neural networks mostly feed-forward neural networks but kind of clicking Lego blocks together right so you might start with your input and then click together a fully connected layer and then another fully connected layer and then maybe some output right so here I've defined a sequential container which is going to be a container for all my Lego blocks and then I might click in a spatial convolution so I'm going to be working with images maybe a non-linearity some max pooling some other layers as well to kind of complete the whole neural network and then I might add a log softmax at the end to to compute class probabilities so this this kind of the structure that you'll build neural networks with in NN is define a container and then one by one add pieces down a processing hierarchy and I mentioned the sequential container which is starting from inputs and then proceeding linearly there's two other types of containers that you might use but generally NN shines when your architecture is linear right not when it's got some crazy branches or anything like that the there's not a lot of API to the NN package so if you if you learn these couple functions which will be in the slides for later if you want to refer to them back you will understand all the mechanisms that you need to know to push data through a neural network and then to push it through a criterion or a loss function and then to pull those gradients back in order to make a gradient update to your model so these are really the API is the levers that you need to know to kind of drive your neural network and of course we have a CUDA back-end for n n so in the same way that you'll just call CUDA on some data you can call CUDA on a container and that will move the whole model onto the GPU and then anything that you do with that model will occur on the GPU so it's kind of a one-liner to start training models on a graphics processor so for doing feed-forward neural networks n n is pretty great but for starting too weirder architectures like richard social yesterday mentioned a pretty complicated NLP model that starts with glove vectors which are kind of like shallow neural networks and then a recursive neural network and then an attention mechanism and all these things were interacting in strange ways that's actually pretty hard to specify in NN at Twitter we have a package called torch Auto grab which makes these kinds of gluing different model pieces together really easy and in fact the pieces can be as small as addition division multiplication and subtraction so you can glue together any size piece of computation and still get a correct model out and I'll talk more about that in a moment the optin package is what you need in order to train models with like stochastic gradient descent or a degrade or out of delta whatever your optimizer is that you that's your favor the API is pretty straightforward but maybe a little bit different for people kind of coming from the Python world it's got a bit of a functional approach where it will actually you'll you'll pass a function to opt in that will evaluate your neural network and pass back the gradients so that's just something to be aware of it's a little bit of a different style another gotcha with optin that you might run into and you'll see in some of the notebooks that are online is 
your parameters should be linear in memory so if you want to optimize to neural networks that are interacting in some way you actually need to first bring their parameters together into one tensor and then pass that to opt in there's just something to be aware of so I want to talk for the rest of the talk about torch Auto grad but also about some of the ideas that are behind torch Auto grad and how those link all the deep learning libraries that you possibly could choose so first I want to take a step back and say that just appreciate the wonderful stable abstractions that we have in scientific computing right so Fortran you know back in 57 I don't think anybody uses Fortran 57 but people might actually still use Fortran 90 the idea of an array was didn't exist on a computer and it really took some pretty crazy thinking I think to build a system that made arrays something we take for granted same with linear algebra over about a 20-year period starting in the late 70s people decided oh maybe we should think about linear algebra in a systematic way and now we don't really worry about this if you want to multiply two matrices that used to be you know a phd's worth of work to do that at scale and now we just you know we don't even actually import Blas there's so many wrappers of blasts that we don't even think about this anymore so this is another abstraction and also the idea that we should have all of the routines that we would possibly want to call in one place available that we don't have to write that was kind of invented I would say by MATLAB in the mid-80s and then really popularized in the open-source community by numpy and we should take them for granted we should totally forget about them that because they make us faster they make us better for us to assume these things will work so machine learning has other abstractions besides these computational ones that we take for granted all gradient based optimization that includes neural nets as a subset relies on automatic differentiation to calculate those gradients right and and I like this definition from Barack Perlmutter automatic differentiation mechanically calculates derivatives as functions expressed as computer programs right so it doesn't derive things are right on a piece of paper with a pencil it derives computer programs app machine precision and with complexity guarantees those last two clauses differentiate it from finite differences where you take the input to a program you perturb it slightly and you measure the gradient that way that's a very bad way to measure gradients it's it's numerically very unstable and it's not symbolic differentiation so it's not writing down the symbolic expression of a neural network putting it in Mathematica or maple and then it asking for the the derivative because your expression might go from this to this so you get expressions well when you do naive symbolic differentiation and you don't get that with automatic differentiation so automatic differentiation I would say is the abstraction for gradient based machine learning it's been rediscovered several times there's a review by Woodrow and there I think the first implementation where it actually operates on a computer program was by Bert's bill pending in 1980 although it has been described back you know in 1964 by Wengert in in neural networks rumble heart is the one that I suppose popularized it as back propagation although back propagation is a special case of auto-da-fé this this I think is important in nuclear science and 
computational fluid dynamics and in weather modeling these people have been using auto-da-fé for years decades and their tools in many ways are much more sophisticated than we have in machine learning there's a lot of ideas that we have yet to import from people that model the weather that would really benefit our ability to train larger and larger models and I would clarify that our abstraction and machine learning is actually reverse mode automatic differentiation there's two different types two extremes I should say forward mode in Reverse mode you never hear about forward mode and you never hear about forward mode of machine learning because it's a very bad idea to try forward mode and machine learning and I'll show you why so here is a cat picture from the internet and my job at my job is to decide that that is in fact a cat picture this is actually something that we do do at Twitter what I am doing is passing this cat through successive layers of transformations than eventually producing a probability over classes I'm getting it wrong my classifier thinks it's a dog so I'd like to train my neural net to think it's a cat so I have a loss a gradient of my loss and I have it with respect to my parameters and this is my gradient that will let me update my parameters and it is composed of multiple pieces and using the chain rule I know that I can fold this together to actually compute the loss I want which is the gradient of the law through the respect to the parameters the issue is I can do it either left to right or right to left so going from left to right looks like this whoops that was very fast okay I'll do two big matrix matrix multiplies so this is bad this is not good because we had these huge matrix matrix products that we're keeping around it's actually worse than this and I'll show you in another view of forward node so see I have a computer program so no longer a symbolic representation of a neural net this is just some computer program and let's say I'd like to optimize a write a is the single parameter of my neural net it's a very silly trivial example but I think it will help illustrate the point so I can execute this program and look at all of the arithmetic operations that occur and build what's called a trace so I'll define say a is 3 I'll define B is to C is 1 and then I'll start executing the code I'm actually going to look if B is greater than C and choose a branch to operate on but then ignore it in my trace so I've chosen one of those traces that one of those branches which is the first because B is greater than C and I have some output value D and I'll return the output value all right so this is a trace execution of my program given some inputs so to calculate in forward mode the derivative of my output D with respect to a I'll define a is 3 and then initialize a gradient of a with respect to itself and the idea is I eventually want the derivative of D with respect to a and I'll build it up sequentially da da and then I'll do D be da and then Dissidia in ddd a so I'm moving from the left to the right building up my gradient I can't do much about the derivative of B with respect to a right now so I'll define C and remove C with respect to a and then I have my value D and then I can define my target value which is the gradient of D with respect to a so if I wanted the gradient of D with respect to B so if I had a two parameter neural network and I wanted optimize both at once I would have to execute this whole thing again and initialize this guy here as DB DB has 
one right so if you have a million parameters in your neural network or tens of millions if you have to do a million evaluations of forward mode or tens of millions of evaluations of fort mode so it is a very bad idea to try forward mode automatic differentiation on neural network and that's why you probably never heard of it so now you can forget about it but the alternative is reverse mode and that's starting from the right to the left so now I've got this nice matrix that your products which are much smaller and the complexity is much better and there's an interesting difference when I actually go to do this in computer code and you'll see these words are closer together and that's because for reverse mode I actually have to evaluate the whole program before I can start deriving because I'm starting with the derivative of D with respect to D and then decrementing derivative of D with respect to C with respect to D with respect to a so I'm going the other way but I have to have all the information first before I start that so now I can initialize derivative of D with respect to D and I can walk backwards and return both the value and get gradient what's really nice about this is you'll notice here I actually have all the information I need to calculate the derivatives of D with respect to these other parameters so that's why we really like reverse mode auto-da-fé aka back propagation for neural nets is if you have a million of these guys you really want to be ready to compute them all at once right and doing these with matrices is very efficient thing to do on the computer so we've implemented this trace based automatic differentiation in a package called Auto grad and this is the entirety of a neural network so this is how you would specify and train a neural network and autocrat so I'll initialize my parameters we'll just be some random numbers and then here is my neural network function I'm multiplying my you know image that I'm passing in by my white matrix and adding a bias non-linearity doing it again and then returning some probabilities and I have a loss which will take in an image and return a prediction so just using this function and then I'll just take the mean squared error or it's the sum squared error in order to get the gradients of this function the derivative of the loss with respect to these parameters all I have to do is import this autograph package and then call grad on this function this returns a new function that returns the gradients of my original function so it's a what's called a higher-order function it's inputs and its outputs are a function so whenever you see that Noblet that upside-down triangle grad triangle this is the coding equivalent of that and then to Train we'll just call our D loss function on our parameters our image and our label which I'm just pretending like you already have a system to get here when we have our gradients and then we're updating with stochastic gradient descent here all right so it's a very thin it's it's really just this this is the interface with which you talk with Auto grad so what's actually happening so here's my simple function as we evaluate it we're actually keeping track of everything that you're doing in order to be able to reverse it so we're actually building that trace list that I described before and keeping track of it internally so we'll start online I guess that's five so we'll multiply some things we'll keep track of the fact you multiplied and the inputs will keep track of the addition and the inputs and 
also the output of addition will keep track of inputs outputs in the function every time and we'll kind of walk down this function and build your compute graph just in time so as you're running your code we're learning what you've done and the way we track that and I won't go into details we actually replace every function and torch with like a like a spy function so instead of just running torch dot some our spy function says oh I hear you're running torch dot some let me remember the parameters you gave me let me run some on those parameters remember the output and then return it like nothing happened but internally we're remembering all those things and the way we do this to actually compute the gradients is we're walking back this list like I described before and every time we get to a point where we need to calculate a partial derivative we look it up so we've written all of the partial derivatives for Torche functions and it really every neural network library is going to do this at some level of granularity so let me walk you through another couple examples just to show you what it could do so this is kind of a pretty vanilla one we can you know add and multiply scalars and get the correct gradient this is where things get a little bit more interesting if there's an if statement all right so this control flow can be a little bit difficult or awkward and a lot of existing deep learning libraries because we just listen to what era medic functions get run we ignore control flow so we just go right through this stuff all right so we can get the correct gradient even with if statements we actually care about tensors when we're doing optimization or machine learning so everything I've shown you that works with scalars also works with tensors just as easily this is in the notebook that is on the github repository if you want to play with it this is where things get a little bit interesting for loops also work just fine and not just for loops that have a fixed length which is something that is perhaps easy to unroll but for loops whose duration can depend on data you just computed right or while loops whose stopping condition can depend on a computation that occurs in the while loop we don't really care we're building your graph dynamically and when it's done and when you return some value will calculate the derivative derivatives of the graph that we have you can turn any for loop into a recursive function this is kind of wacky I mean I don't know how you would actually use this in practice but you can cook up a lot of crazy things you might try with autograph and they just work so here we have a function f if B is at some stopping condition will return a otherwise we'll call F and we're gonna differentiate this right so we're gonna differentiate a fully recursive function and it works just fine another aspect which is coming up more and more as papers are coming out that basically disrespect the sanctity of the partial you know of the derivative of the gradient and people are computing synthetic gradients they're you know adding they're clipping two gradients or people are messing with kind of the the internals of back propagation or of auto-da-fé it's actually pretty easy to start to engage with in Auto grad so say I'm going to sum the floor of a to the third power so the floor operation is piecewise constant so the derivative is zero almost everywhere except for where it's undefined why would I want to do this for instance if you wanted to build a differentiable JPEG encoder or 
differentiable MPEG encoder in compression algorithms like that there's often a quantization step that will floor around or truncate numbers and if you wanted to differentiate through that to build like a neural Jake algorithm or something you need to pass gradients through something that ordinarily does not and so if we look at what the gradient is at zero everywhere I won't go into the details but you can ask Auto grad to use your own gradient for anything so if you have a new module that you want to define and either you've written high-performance code for it and you want to use it or you want to redefine or overwrite you know the gradients that we have there's a pretty easy mechanism for doing that and then when you call your special dot floor you can propagate gradients through it right and here I was just saying basically ignore the gradient of floor so this is a toy example but there are real places where you have a non differentiable bottleneck inside of your computer off and you want to either hop over it or find some approximation and auto grad has a mechanism for very easily plugging those types of things in so that's a bit of what auto grad is and what it can do and I want to turn our attention to how autograph relates to other deep learning libraries and maybe how they're common and how they're similar and how they're different so one big difference that I found between different deep learning libraries is the level of granularity at which you are allowed to specify your neural network so there's a lot of libraries where you say you get a confident or you get a feed-forward neural network and that's it right so the menu is two items long and that's fine I think Andre I really hit it on the head where if you want to solve a problem don't be a hero use somebody else's network so maybe this is vgg that you've downloaded from from the model Zoo or something like that right so this is the don't be a hero regime on the left in the middle there's a lot of really convenient neural net specific libraries like torch and n and Karras and lasagna and you get to put together big layers and you don't really get to see what's inside those layers but you get to click together linear layers or convolutions and usually that's kind of what you want to do and on the far end of the spectrum the things you can click together are the function the the numeric functions in your kind of host scientific computing library right like add multiply subtract and these are features of projects like Otto grad and Theano and tensor flow and the reason why these boundaries are made is because the developers have chosen to give you partial derivatives at these interfaces all right so this is how they've defined their api's and these are the interfaces with you know across which you as a user cannot pass if you want to new one of these modules for the type on the left or the type in the middle you have to go in and build a whole new model and actually implement the partial derivatives but with the types of libraries on the right you can build your own models by modules by composing primitive operations all right so that's one difference that you can find in practice how these things are implemented under the hood usually means this is the totally shrink-wrap stuff and maybe they implemented this whole thing by hand usually these guys in the middle are rappers they're rapping some other library and the guys on the right are usually actually implementing automatic differentiation so Auto grad in theano and 
tensorflow all implement auto death and the guys in the middle are taking advantage of that to make more convenient wrappers so another aspect that's different is how these graphs are built so I'll remind you in Auto grad we build these things just in time by listening to what you're doing and recording it but that's not how all neural network libraries are built and this is an axis along which I think that they are differentiated meaningfully so there's a lot of libraries that build these graphs explicitly where you say I'm going to click this Lego block into this Lego block where I'm going to give you this yamo specification file the graph is totally static and you really have no opportunity for compiler optimizations there and then there are the just-in-time library so Auto grad and chain ER is another one where you get any graph the graph can be anything it can change from sample to sample it can be you know to the length of the graph can be determined by the compute that occurs in the graph you have very little opportunity for compiler optimizations there so speed can be an issue sometimes and in the middle there's a head of time libraries like tensorflow and Theano where you construct your graph using a domain-specific language you hand it off to their runtime and then they can do crazy stuff to make it faster the problem with that is it can be awkward to work with I guess that got cut off it can be awkward to work with control flow and I think there's a reason why it can be awkward to work with control flow and it's because of the types of graphs that these libraries are actually manipulating so we say compute graph a lot we say data flow graph a lot data flow graph has a pretty restricted meaning and it means that the nodes in your graph do computation and the edges are data and there's no room for control flow in a graph that is a data flow graph right so static data flow is the type of graph that N and n Cafe use because all the ops are the nodes and the edges are just the data and the graph can't change get data flow just in time compiled data flow like Auto grad and chain ER has the same characteristics but the graph can change from iteration to iteration because we wait until you're done computing the forward pass to build the graph in the middle there's kind of a hybrid and I don't know what to call that graph type the ops are nodes the edges are data but then there's special information that the runtime gets in order to expand control flow or for loops so scan is in Theano is an instance of this where the Theano runtime has special information that allows it to make scan work but it's kind of it's it's it's conspiring with the graph data type to do that there's actually another graph type that naturally expresses control flow and data flow together that I haven't seen implemented in a deep learning library it's called see of nodes from cliff clicks thesis in the mid-90s it seems like a really natural thing to try and man maybe that's something that comes up in the future but that's kind of a big question marks maybe one of you will we'll try that out and see how well it works so in practice this level of granularity can sometimes slow us down having to work with addition and multiplication can be nice if you want to try crazy stuff but if you know you want to make a confident why don't you just rush all the way over to the left if you want to take you know inception and add another layer where you want to use the type in the middle an autograph allows you to do that so I'll 
just kind of walk through writing a neural net three ways very quickly and then and then close questions shortly thereafter so using the fully granular approach there's a lot of text on the screen but the top half is basically let's instantiate our parameters the way that we want to and then here just like I've showed you in previous slides let's do a multiply and let's do an addition and put it through non-linearity we're being very explicit right so we're breaking all the abstraction boundaries and we're just using primitive operations we can use the layer based approach so in Auto grad we have a facility to turn all of the N and modules of which there are a lot may be an exhaustive list for what you'd want to use for standard deep learning applications you can turn them into functions and then just use them so linear one on the linear parameters and your input and some activation you can go through your neural network this way so you can use a layer based approach if you want and if you just want your network just a feed-forward neural network we've got a couple of these kind of standard models just ready to go so you can just say give me a neural network give me log softmax and a loss and let me blow these guys together so you can do it any of those three ways Auto grad at Twitter has had a pretty cool impact we use NN for a lot of stuff when we use Auto grat as well but being able to reach for autograph to try something totally crazy and just knowing that you're going to get the right gradients has really accelerated the pace of high risk potentially high payoff attempts that we make so one crazy thing you might want to try is experiment with loss functions so instead of I have a hundred image classes and I want to have my convolutional neural network be good at classifying this hundred image classes maybe you have a taxonomy of classes maybe you have a vehicle and then a bus a car and a motorcycle and if you guess any one of those you kind of want partial credit for vehicle or if you guess motorcycle you want partial credit for for car so building that kind of a tree loss is actually really straightforward an auto grad and you can do that in in just one sitting but might be more complicated to do that in other libraries we have to crack open the abstraction barrier write your own partial derivatives glue it back together and then use that module that you've built we've trained models that are in production in auto grad so this is something that's a battle-tested to a sense and is running on large amount of media Twitter in a sense Auto grad doesn't actually matter when you're running in production because you just you have your function definition for your prediction of your neural network and then the gradient part just goes away or so all the fancy stuff where we play Storch with our secret you know listener functions all that just goes away and you just have some numerical code so there's actually no speed penalty a test time at all and we have an optimized mode which does a little bit of compiler stuff still work in progress but for the average model it's as fast sometimes faster than n N and for really complicated stuff if you wrote that by hand you'd probably be faster but the time to first model fit using Auto grad is dramatically reduced because you don't have to worry about correctness so this is a big wall of text but it's meant to put in your head some ideas of things from automatic differentiation from that world that we don't have yet that we really want right to be able 
to train models faster and better so the first is checkpointing this does not check pointing where you save your model every 10 iterations this is check pointing where on your forward pass you might you in normal reverse mode automatic differentiation you have to remember every single piece of computation you do because you might need it to calculate the derivatives and checkpointing you just delete them you let them go away because you think that some of those might actually be easier to recompute than to store alright so for point wise nonlinearities for instance it might be easier once you've loaded your data just to recompute the reloj as opposed to saving the result of reloj and loading that back in again mixing forward and reverse mode is something that you can imagine being important for kind of complicated architectures although I don't really know how much impact that would have so in the chain rule you can either go from left to right or you could start in the middle and go out you can do all kinds of crazy stuff if you want and we really just do reverse mode for diamond shape graphs where your computation explodes out and it comes back in that might be useful to start with forward mode and then finish with the reverse mode or an hourglass you might want to start with reverse mode and end with forward mode stencils are a generalization of convolutions that people use a lot in computer graphics automatically calculate really efficient derivatives of image processing just general image processing algorithms is under active investigation in the graphics world and in the computer vision world so these are two references that are kind of neat papers source to source transformations is something that hasn't really made it it basically has kind of been dormant for about ten or fifteen years so the gold standard used to be you take a piece of code as text and you output another piece of code as text what we're doing now in deep learning is we're always building runtimes we're always building some domain-specific layer that depends on you actually running code it used to be that you just read that text and kind of like a compiler spit out the gradient this this was the gold standard it might not be now but I think it's worth three investigating and then higher order gradients so Hessian vector products and kind of Hessian based optimization maybe doesn't always have full payoff I actually don't recall hearing anything about this at this school so far because it's very expensive and difficult to do expensive computationally fashion is just if you take the grad of F it gives you the gradients if you want the second derivative right so you take grad a grad of F so there's efficient ways to do this it's still kind of an open problem but there are libraries out there the Python version of autograph dust as well diff sharp and hype both also do this as well so to kind of close out you should just try it out it's really easy to get it if you have anaconda if you use Python we've made it so that Lua is fully installable with anaconda so if you're already using it it's very very easy to get all of the tools that I've showed you today and that's kind of the single line to interface with it and if you have any questions you can find me on Twitter or email or github but I'm happy to to answer any questions that you have oh yeah I have no idea thanks thanks for the great talk oh yeah I was wondering what's the state of the data visualization facilities in Lua compared to say Python if I'm Frank it's 
it's not as good python has been at this for you know five ten years really actively building matplotlib and you know Seabourn and all these other libraries and in Lua were importing other people's work so book ajs is really the best that i've seen so far and that's something you can use in a notebook so you have the full suite of that of that particular library yeah hey thanks for the luck is it possible to convert a model train with torch in into a C model that's deployable in you know production we just run torch in production we use a little model but you want to run it and see so the whole layer of torch that's actually doing the work is in C and calling torch from C I don't have a specific website I can point you to but you can very easily call and execute a Lua script from C it's like three or four lines of code in C thank you the follow-up the question about see just now just like if I'm gonna compile I mean I want to have Tosh into my sequence passcode what kind of overhead do I see I see just animations yourself like I have a 10,000 line - what just-in-time compiler I need to put that in there right oh I can I avoid that because for example I think about if I'm going to put the one in an embedded system they have a mouth resource of anything during inference time so I'm sorry during yet during inference time there's there's no appreciable overhead if I'm understanding your question right so you you are importing a Louis so in your C code you're going to basically say Lua please run this Lua script and that's going to call out into other C code so all this overhead I talked about with autograph that's training time that doesn't exist at test time at all so so during test time but the thing is I still need to have Lua compile into my C code right yeah so this is something people have been doing for like 15 20 years it's pretty mature so Lua is in like microwaves for instance people have done very embedded applications of Lua yeah I think the binary for Lu is like I don't want to it's like a round it's a kilobytes it's very very small there's 10,000 lines of code so when it compiles down on small so there's a question from the twitters says i'm using a combination of Karros and tensor flow why should I use torture auto grad if you're happy then you know that's great I guess so people tend to reach for torch when they would like to be able to reason very easily about performance the kind of the more of a compiler infrastructure that gets added to a deep learning environment the harder it can be for the end user right away from the people that originally made the library can be harder for the end user to reason why is this slow why is this not working you might eventually see some github issue later my network is slow in these conditions and then it gets closed a year after you had to have shipped your project right I mean these things can happen it's not the fault of anybody it's just that torch was designed to basically be very thin a thin layer over C code so if that's something that you care about torch is a really good thing to work for if careless and tensorflow is working great for you then keep deep learning you know that's awesome so I'm trying to see it's hard to filter where will the slides be posted it's not a deep learning question but they will be posted that's the answer to that question I have a question now how do I access through so normally all the web services production generally are another you know fast based application in Python or you know Java based Web 
So I'm going to tell you about machine learning with Torch and with Torch Autograd. The description of the talk isn't entirely correct: I'm going to do practical stuff for the first half, and then I want to dive into Torch Autograd and some of the concepts behind it. Those concepts also happen to be shared amongst all deep learning libraries, so I really want to give you a perspective on the common thread that links all the deep learning software you could possibly use, and then also talk a bit about what makes each of the libraries different — I'll hypothesize why there are so many and why they made different choices.
One thing I want to try: there have been a lot of questions and we've gone over time, but if there aren't questions in the room, there are a lot of people watching online, and if there's extra time we'll of course prioritize people here — but if you ask a question with the DL school hashtag, or if you tweet at me directly, I will try to answer those questions from online, and I'll certainly answer them offline as well. So ask if you're watching at home; maybe that will increase meaningful participation for people watching through the stream who aren't here today.

A lot of this material was developed with Soumith Chintala at Facebook — he's kind of the czar of the Torch ecosystem these days — and Hugo Larochelle, who you heard from yesterday, and also Ryan Adams, who's at Twitter with us. All of this material is available on the GitHub repository that you got on a printed sheet for installing Torch. All the examples I'll show you are in one notebook, and there's a separate notebook, which I actually won't reference in the talk, that's a full end-to-end walkthrough of how to train a convolutional neural network on CIFAR-10. That's a self-paced tutorial notebook you can work through on your own time, but I'm going to focus on the basics, on the fundamentals, and hopefully give you some of the concepts and vocabulary you can use to really dive into Torch on your own.

So let's get going. Torch is an array programming language for Lua — it's like NumPy, it's like MATLAB, but in the Lua language. Torch is to Lua as NumPy is to Python. What you can do in Torch you can do in any language; these are the absolute minimum basics: you can grab strings and print them, and you can put things in associative data types. In Python there are tuples and lists and sets and dictionaries; in Lua there's just one data type, called a table — you'll see that a lot, but you can do all of the things I mentioned before with a table — and you've got for loops and if statements. The core type of Torch is the tensor. Just like NumPy has the ndarray, which is a way of shaping sets of numbers into matrices or tensors, we have the tensor: you can fill it up with random numbers, you can multiply them — standard stuff — but the tensor is the core data type of Torch. And we've got plotting functionality. I'm going over this at a very high level; I'll show you more specific code in a moment. You can do all the standard stuff you'd do in any other array-based language: all the tensor functions you'd like to use, including linear algebra, convolutions, and BLAS functions. I'm leaving this link here — when the slides get uploaded you can follow it into the documentation and see exactly what kind of tools you have at your disposal. In the iTorch notebook, which is something Soumith put together, you can prepend any Torch function with a question mark and get the help for that function, so it's really nice for discovering functionality in the Torch library from the notebook.
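To make those basics concrete, here is a minimal sketch of the kind of thing you'd type into the `th` REPL or an iTorch notebook — the sizes and values are illustrative, not from the talk:

```lua
-- Minimal Torch/Lua basics
local torch = require 'torch'

-- Lua's one associative data type: the table
local config = {learningRate = 0.01, layers = {100, 50, 10}}
for k, v in pairs(config) do print(k, v) end

-- The core Torch type: the Tensor
local a = torch.rand(4, 6)      -- 4x6 matrix of uniform random numbers
local b = torch.randn(6, 3)     -- 6x3 matrix of Gaussian random numbers
local c = torch.mm(a, b)        -- matrix multiply, 4x3 result
print(c:size())

-- In an iTorch notebook, prepend "?" to any function to see its help:
-- ?torch.mm
```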
So why is it in Lua? It's maybe a strange, maybe esoteric language to write things in. Lua is unreasonably fast for how convenient it is to use, especially a flavor of Lua called LuaJIT. For loops in LuaJIT are basically the same speed as C — the for loop here is actually in production code in Torch master; it's not C code, but it's perfectly fast enough. That's a really nice aspect of Lua: you can depend on super high-performance C code, and on top of it you've got this very convenient glue layer, but you don't pay much of a speed penalty to use that glue layer. That's one of the reasons we've used Lua. Another advantage, which some people might see as a plus, is that the language itself is quite small: there are about 10,000 lines of C code that define the whole language, so you can really sit down with the manual in an afternoon and understand most of the language on your own that same day. Another aspect, which is pretty critical for deep learning but also for other fields, is that it's really easy to interoperate with C libraries. Lua was originally designed to be embedded — to run inside of another C program as a little scripting layer — so it's very easy to call into C, and very easy for C to call into Lua. This is another reason it's an appropriate choice for deep learning libraries. The FFI — the FFI call signature and the idea — has been copied into many other languages: CFFI in Python is a Python version of the LuaJIT FFI, and Julia has something similar as well.

As I mentioned, it was originally designed to be embedded, and it's in all kinds of crazy places you maybe wouldn't expect Lua to be. In World of Warcraft, all the graphics are in C++ or whatever they wrote it in, but the boss battles and the quests — when you go give the gem to the blacksmith and they give you back the magic sword — the scripting of those events happens in Lua, and if you write scripts for World of Warcraft to make your own quests, that's Lua. Adobe Lightroom is a photo-processing app: all the image processing is done in C++, but all the UI was done in Lua. Again, it was used to bind together high-performance code with a scripting layer. Redis and nginx, which are workhorses in the field of web development, are both scriptable with Lua, and in fact GitHub Pages — like mypage.github.io — is served in part by Lua. The apocryphal story of why Lua was originally chosen — maybe somebody could correct me — is that Clément Farabet was trying to build an embedded machine learning application, some device he could wear as a helmet and classify the world with a CNN, when he was a young student. He was trying to do this with Python, and it's incredibly frustrating to get Python to run on embedded chips — maybe it's easier now with the Raspberry Pi, but that just wasn't the case then — and then he stumbled upon Lua, and it turns out people had been building Lua into embedded applications for years. So that kind of snowballed. That's the hearsay for how we arrived at Lua, but maybe there's another story.
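The C-interoperability point is easiest to see with the classic example from the LuaJIT documentation — declare a C signature, then call it as if it were Lua. This snippet assumes you're running under LuaJIT:

```lua
-- Calling into C from LuaJIT via the FFI (printf is the standard example
-- from the LuaJIT docs).
local ffi = require 'ffi'

ffi.cdef[[
int printf(const char *fmt, ...);
]]

ffi.C.printf("Hello from C, called out of %s!\n", "LuaJIT")

-- And plain Lua loops are close to C speed under LuaJIT:
local s = 0
for i = 1, 1e7 do s = s + i end
print(s)
```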
Another really nice feature of Torch is first-class support for GPU computation — interactive GPU computation. It's very, very easy to get some data from the CPU to the GPU, and then everything you do with that data happens on the GPU without you having to worry about writing CUDA kernels. This has been a feature of Lua Torch which is becoming maybe a little bit less unique now, but it was a pretty solid feature when it first came out: interactive GPU computing.

I'll go very quickly over some of the basic features, and all of these examples are in a notebook you can work through at your own pace. There's all the basic arithmetic, like creating matrices and doing arithmetic between them, taking maxes of numbers and arrays, clamping, building tensors out of ranges, and boolean operations over entire arrays. Special functions are supported through a wrapper around the Cephes library — this is what NumPy uses — so you get things like tanh and atan2 and the other functions in that special class. And Soumith has wrapped the Bokeh.js library, which was originally built for Python, and it provides really nice plots in the iTorch notebook, so we can draw random numbers from our favorite distributions and make nice histograms of them. You can do nice data exploration in the iTorch notebook along with deep learning.
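A few of those array-language conveniences in one place — again just an illustrative sketch:

```lua
-- Elementwise operations, clamping, boolean masks, and special functions
local x = torch.linspace(-3, 3, 7)   -- a tensor built from a range
print(torch.clamp(x, -1, 1))         -- clamp every element to [-1, 1]
print(torch.tanh(x))                 -- special functions apply elementwise
print(x:gt(0))                       -- boolean (ByteTensor) mask of x > 0
print(torch.atan2(torch.ones(3), torch.ones(3)))
```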
One feature that is attractive to some folks — just an interesting feature of the Torch ecosystem — is that although there's a lot of industry support, it is not industry-owned. At Twitter, at Facebook AI Research, and at NVIDIA we all contribute a lot to the Torch community, but we don't own it; we can't really steer it one way or the other definitively, and there are a ton of other people who participate academically in this ecosystem. That's a really nice feature. And because of the really nice habits of people in deep learning, when a paper comes out there's often a high-quality code implementation that follows it — not always, but very often, at least compared with other fields — and Torch is one of the environments where you'll often see high-quality implementations of really cutting-edge stuff. If you browse through GitHub and follow researchers there, you can find really high-quality implementations of image captioning and neural style transfer — you can just clone the repository and run it yourself — seq2seq models, whatever is the state of the art, there's usually a Torch implementation of it, and some of the recent work on generating very realistic synthetic images with generative adversarial networks also has great Torch code implementing it.

So given that there's this active deep learning community on GitHub for Torch, how does that stack up against other communities, just for context? The Python data science community is pretty enormous, and its focuses are also very varied. If you enter the data science community in Torch and Lua, you'll likely find deep learning people but not a lot of other people — so its strength in deep learning, relative to its size, is actually quite enormous. And for those thinking of switching between Python and Lua and giving Torch a try: the switch from Python to Lua is something you can probably do in a day if you've done some Python programming. I was a Python programmer for a while, and getting started with Lua took me maybe a couple of days before I was actually productive at work, and maybe a week or so to really write new things — but you can run your code and understand it pretty quickly if you've worked in a scripting language like MATLAB or Python. So if you were intimidated or waiting to try it, you should just dive in.

So how does Torch compare to other deep learning libraries specifically, as opposed to languages? The first thing I'll say is that there's really no silver bullet right now; there are a lot of deep learning libraries out there. I'd say TensorFlow is by far the largest. This is a plot that was made by a colleague of Soumith's, and I wish it had confidence intervals on it, because these aren't strictly points in deep-learning space, but maybe it's a good guess of where things fit. It seems as if TensorFlow was engineered to be very good in an industrial production setting, and it seems like it's really fulfilling that. Theano seems to have always had a research goal in mind and has been really awesome in the research community for some time. Torch tends to be more towards research than industry; I think Twitter has maybe pulled it a little bit towards production — we are maybe the only example, and I'd love to learn of others, of a large company that uses Torch in production to serve models. Every piece of media that comes in to Twitter goes through a Torch model at this point, so we're really dealing with an enormous amount of data in a live setting.

On the development of Torch — just to give you a sense of how we think about how it was built and how we're extending it — there are some tenets of our core philosophy. The first is that things should be as interactive as possible; this isn't necessarily good or bad, but it's our choice: whenever you hit enter on a particular line in your iTorch notebook or on the command line, you should get an answer back, and that's something we've tried to stick to pretty tightly — no compilation time, imperative programming, where each line of code executes something and passes it to the next line. And minimal abstraction: what I mean by that is, if you want to reason about how your code is performing, it shouldn't take many jumps to get to the C code that's actually being run. In fact it's usually one or two jumps from the file that defines the function you care about to the actual C code. So if you want to reason about performance, or really understand what's going on, it's quite easy to do so in Torch.
I want to take a little detour and tell you how Torch thinks about its objects — how it thinks about the tensor — because this can help you reason about performance; a lot of the reason people come to Torch is to build high-performance models very quickly and easily. I mentioned tensors before. A tensor is an N-dimensional array, and a tensor is actually just a pointer: it's a view into your data that's sitting in memory. It's a shape, a view onto what's actually stored in your RAM, and it's stored in a row-major way. That means if I go to the first element of my tensor in memory and I move over one, I'm moving over one in a row and not one in a column. Column-major memory storage does exist; it's just less common today, so you'll usually see row-major. So this tensor is defined by its link to some storage, its size — 4 by 6 — and its stride — 6 by 1. Six by one means that if I move down one in the column direction, I actually have to skip six elements in memory, whereas the one means that if I move over one along the second axis, the row axis, I move over one in memory.

If I take a slice of this tensor using the select command — selecting the third element along the first dimension — what it gives me back is a new tensor. It doesn't give me new memory. This happens a lot in Torch: you'll deal with views into memory, not memory copies; you're usually working with the raw data in RAM. So this creates a new tensor with a size of six, because there are six elements; a stride of one, because we've pulled out a row, not a column; and an offset of 13, which means I have to go 13 elements from the beginning of the original storage to find that piece of memory. If I pull out a column instead, something different happens: I have a size of four, my stride is now six, because in order to grab each element of the column I have to skip six, and the offset of three is because I grabbed the third element.

So that's a view of the memory model. If we actually run something like this — we instantiate a DoubleTensor, a tensor of double values, fill it from a uniform distribution and print it — we can see the values. Then we grab a slice b and print it, and it's just this row; we fill b with some number and print it, and now it's filled with that number; and if we go back and print a, we've actually overwritten the values there. This is something you see a lot in Torch: you're working on one big piece of shared memory. And as I mentioned before, working with CUDA is really, really easy: if you just require cutorch — which is installed automatically if you have a CUDA GPU and use the instructions on the GitHub repository — you can instantiate a tensor on the GPU, do the same thing, and it will just work.
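Here is a small sketch of that size / stride / offset story in code — the numbers simply mirror the 4-by-6 example above:

```lua
-- Views share storage: slicing returns a new Tensor header, not new memory.
local a = torch.DoubleTensor(4, 6):uniform()

print(a:size())                  -- 4 6
print(a:stride())                -- 6 1 (row-major: moving down a row skips 6 elements)

local row = a:select(1, 3)       -- third row:    size 6, stride 1, offset 13
local col = a:select(2, 3)       -- third column: size 4, stride 6, offset 3
print(row:storageOffset(), col:storageOffset())

row:fill(7)                      -- writes through the view into a's storage
print(a[3])                      -- the third row of a is now all 7s

-- With a CUDA GPU and cutorch installed, the same thing works on the device:
-- require 'cutorch'
-- local g = torch.CudaTensor(4, 6):uniform()
```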
So now I want to talk a bit about the frameworks you'll use to actually train neural networks in Torch. This is a schematic, a kind of cartoon, of the pieces we typically need: we've got our data stored on a hard drive or on a big distributed file system; we have some system for loading that data, which goes into a nice queue; and then some training code which orchestrates a neural network — the thing actually making the prediction — a cost function, which is a measure of how good our neural network is at any point in training, and an optimizer, which takes the gradient of the cost with respect to the parameters and tries to make the network better. In the Torch ecosystem we've got packages that tackle each of these separately. I won't talk about threads here — there are actually several different libraries that do each of these things, but these are maybe the most common or the easiest to start with. nn covers both the specification of the neural network and the cost function, as well as the mechanisms to push data through the network and the cost function and pull the gradients back from the cost to the parameters. And then the optimizer, which we've heard mentioned several times today — stochastic gradient descent or Adagrad — lives in the optim package.

Let me talk about nn first and give you a flavor of how it works. nn is a package for building mostly feed-forward neural networks by clicking Lego blocks together: you might start with your input, click together a fully connected layer, then another fully connected layer, and then maybe some output. Here I've defined a sequential container, which is going to be a container for all my Lego blocks, and then I might click in a spatial convolution — I'm going to be working with images — maybe a nonlinearity, some max pooling, some other layers to complete the whole network, and then a log-softmax at the end to compute class probabilities. That's the structure you'll build neural networks with in nn: define a container and then, one by one, add pieces down a processing hierarchy. I mentioned the sequential container, which starts from the inputs and proceeds linearly; there are two other types of containers you might use, but generally nn shines when your architecture is linear, not when it's got crazy branches. There's not a lot of API to the nn package: if you learn these couple of functions — which will be in the slides if you want to refer back to them — you'll understand all the mechanisms you need to push data through a neural network and through a criterion, or loss function, and to pull the gradients back in order to make a gradient update to your model. These really are the levers you need to know to drive your neural network. And of course we have a CUDA back end for nn: in the same way that you call cuda on some data, you can call cuda on a container, and that moves the whole model onto the GPU; anything you do with that model then happens on the GPU, so it's a one-liner to start training on a graphics processor.

For feed-forward neural networks nn is pretty great, but for weirder architectures — Richard Socher yesterday mentioned a pretty complicated NLP model that starts with GloVe vectors, which are kind of like shallow neural networks, then a recursive neural network, then an attention mechanism, with all these pieces interacting in strange ways — that's actually pretty hard to specify in nn. At Twitter we have a package called torch-autograd which makes gluing different model pieces together really easy, and in fact the pieces can be as small as addition, division, multiplication and subtraction — you can glue together a piece of computation of any size and still get a correct model out. I'll talk more about that in a moment.

The optim package is what you need to train models with stochastic gradient descent or Adagrad or Adadelta — whatever your favorite optimizer is. The API is pretty straightforward, but maybe a little different for people coming from the Python world: it has a bit of a functional approach, where you pass optim a function that evaluates your neural network and passes back the gradients. That's just something to be aware of; it's a slightly different style. Another gotcha with optim, which you'll see in some of the notebooks that are online, is that your parameters should be linear in memory. So if you want to optimize two neural networks that are interacting in some way, you first need to bring their parameters together into one tensor and then pass that to optim. It's just something to be aware of.
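To tie nn and optim together, here is a minimal sketch of that functional training step — the layer sizes, fake data, and learning rate are all illustrative:

```lua
-- A tiny classifier trained with nn + optim
require 'nn'
require 'optim'

local model = nn.Sequential()
model:add(nn.Linear(10, 32))
model:add(nn.ReLU())
model:add(nn.Linear(32, 3))
model:add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()

-- optim wants one flat view of the parameters and their gradients
local params, gradParams = model:getParameters()

local x = torch.randn(10)          -- fake input
local y = 2                        -- fake target class

local feval = function(p)
  if p ~= params then params:copy(p) end
  gradParams:zero()
  local out  = model:forward(x)
  local loss = criterion:forward(out, y)
  local dout = criterion:backward(out, y)
  model:backward(x, dout)
  return loss, gradParams
end

local optimState = {learningRate = 0.01}
optim.sgd(feval, params, optimState)

-- To train on a GPU you'd move everything over first, e.g.:
-- require 'cunn'; model:cuda(); criterion:cuda(); x = x:cuda()
```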
So I want to talk for the rest of the talk about torch-autograd, but also about some of the ideas behind torch-autograd and how those link all the deep learning libraries you could possibly choose. First I want to take a step back and just appreciate the wonderful, stable abstractions we have in scientific computing. In Fortran, back in '57 — I don't think anybody uses Fortran 57, but people might actually still use Fortran 90 — the idea of an array didn't exist on a computer, and it took some pretty crazy thinking to build a system that made arrays something we take for granted. Same with linear algebra: over about a twenty-year period starting in the late seventies, people decided that maybe we should think about linear algebra in a systematic way, and now we don't really worry about it — multiplying two matrices at scale used to be a PhD's worth of work, and now we don't even import BLAS directly; there are so many wrappers over BLAS that we don't think about it anymore. And the idea that we should have all of the routines we'd possibly want to call in one place, so we don't have to write them ourselves — that was kind of invented by MATLAB in the mid-eighties and then really popularized in the open-source community by NumPy. We should take these abstractions for granted — totally forget about them — because they make us faster; they make us better when we can assume these things will just work.

Machine learning has another abstraction besides these computational ones. All gradient-based optimization — which includes neural nets as a subset — relies on automatic differentiation to calculate those gradients. I like this definition from Barak Pearlmutter: automatic differentiation mechanically calculates derivatives of functions expressed as computer programs — so it doesn't derive things written on a piece of paper with a pencil, it derives computer programs — at machine precision and with complexity guarantees. Those last two clauses differentiate it from finite differences, where you take the input to a program, perturb it slightly, and measure the gradient that way — a very bad way to measure gradients, because it's numerically very unstable. And it's not symbolic differentiation: it's not writing down the symbolic expression of a neural network, putting it into Mathematica or Maple and asking for the derivative, because your expression might blow up — you get expression swell when you do naive symbolic differentiation, and you don't get that with automatic differentiation. So automatic differentiation, I would say, is the abstraction for gradient-based machine learning.

It's been rediscovered several times — there's a review of that history — and I think the first implementation that actually operates on a computer program was by Speelpenning in 1980, although the idea had been described back in 1964 by Wengert. In neural networks, Rumelhart is the one who popularized it as backpropagation, although backpropagation is a special case of autodiff. This matters: in nuclear science, in computational fluid dynamics, and in weather modeling, people have been using autodiff for decades, and their tools in many ways are much more sophisticated than what we have in machine learning. There are a lot of ideas we have yet to import from the people who model the weather that would really benefit our ability to train larger and larger models.
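To make the earlier point about finite differences concrete, here is a small sketch that checks an autodiff gradient against a central difference — it assumes torch-autograd is installed and required as 'autograd'; the function and step size are illustrative:

```lua
-- Gradient check: autodiff vs. finite differences
local grad = require 'autograd'

local f = function(x) return torch.sum(torch.cmul(x, x)) end   -- f(x) = sum(x^2)
local df = grad(f)

local x = torch.randn(5)
local g = df(x)                      -- autodiff gradient, exactly 2*x

-- central finite difference for one coordinate
local h = 1e-5
local xp, xm = x:clone(), x:clone()
xp[1] = xp[1] + h; xm[1] = xm[1] - h
local approx = (f(xp) - f(xm)) / (2 * h)

print(g[1], approx)   -- close, but the estimate degrades as h gets too small or too large
```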
I should clarify that our abstraction in machine learning is actually reverse-mode automatic differentiation. There are two different types — two extremes, I should say — forward mode and reverse mode, and you never hear about forward mode in machine learning because it's a very bad idea to try forward mode for machine learning. I'll show you why. Here is a cat picture from the internet, and my job — at my job — is to decide that it is in fact a cat picture; this is actually something we do at Twitter. I'm passing this cat through successive layers of transformations and eventually producing a probability over classes, and I'm getting it wrong: my classifier thinks it's a dog, so I'd like to train my neural net to think it's a cat. I have a loss, and I want the gradient of the loss with respect to my parameters — that's the gradient that will let me update my parameters. It's composed of multiple pieces, and using the chain rule I know I can fold them together to compute what I want, the gradient of the loss with respect to the parameters. The issue is that I can do it either left to right or right to left. Going from left to right looks like this — whoops, that was very fast — I do two big matrix-matrix multiplies. This is bad, because we're keeping these huge matrix-matrix products around.

It's actually worse than that, and I'll show you with another view of forward mode. Say I have a computer program — no longer a symbolic representation of a neural net, just some computer program — and I'd like to optimize a; a is the single parameter of my neural net. It's a silly, trivial example, but I think it illustrates the point. I can execute this program, look at all of the arithmetic operations that occur, and build what's called a trace. I'll define, say, a is 3, b is 2, c is 1, and then I start executing the code: I check whether b is greater than c and choose a branch to operate on, but the branch I didn't take is ignored in my trace. So I've chosen one of the branches — the first, because b is greater than c — I have some output value d, and I return it. That's a traced execution of my program given some inputs. To calculate, in forward mode, the derivative of my output d with respect to a, I define a as 3 and initialize the gradient of a with respect to itself. The idea is that I eventually want the derivative of d with respect to a, and I build it up sequentially: da/da, then db/da, then dc/da, and finally dd/da — I'm moving from left to right, building up my gradient. I can't do much about the derivative of b with respect to a right now, so I define c and the derivative of c with respect to a, then I have my value d, and then I can compute my target, the gradient of d with respect to a. But if I wanted the gradient of d with respect to b — if I had a two-parameter neural network and wanted to optimize both at once — I would have to execute this whole thing again, initializing db/db as one. So if you have a million parameters in your neural network, or tens of millions, you'd have to do a million evaluations of forward mode, or tens of millions. It is a very bad idea to try forward-mode automatic differentiation on a neural network, and that's why you've probably never heard of it — so now you can forget about it.
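Written out as code, that hand-built forward-mode trace looks something like this — the little program is made up purely to illustrate the bookkeeping:

```lua
-- Forward-mode accumulation by hand, following the a/b/c/d trace idea:
-- push d(value)/da forward through the trace, one input at a time.
local a, b, c = 3, 2, 1
local da = 1          -- d a / d a
local db = 0          -- b does not depend on a
local dc = 0

local d, dd
if b > c then
  d  = a * b          -- the branch actually taken
  dd = da * b + a * db
else
  d  = a + c
  dd = da + dc
end

print(d, dd)          -- value of d, and dd/da = 2
-- To get dd/db you would have to run the whole trace again with db = 1,
-- which is why forward mode is hopeless for millions of parameters.
```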
The alternative is reverse mode, and that's starting from the right and going to the left. Now I've got these nice matrix-vector products, which are much smaller, and the complexity is much better. There's an interesting difference when I actually go to do this in computer code, and you'll see the two phases are closer together: for reverse mode I have to evaluate the whole program before I can start deriving, because I start with the derivative of d with respect to d, then the derivative of d with respect to c, and so on back to the derivative of d with respect to a. I'm going the other way, but I have to have all the information first. So I initialize the derivative of d with respect to d, walk backwards, and return both the value and the gradient. What's really nice about this is that I have all the information I need to calculate the derivatives of d with respect to the other parameters too. That's why we really like reverse-mode autodiff — aka backpropagation — for neural nets: if you have a million of these guys, you really want to compute them all at once, and doing it with matrix-vector products is a very efficient thing to do on a computer.

We've implemented this trace-based automatic differentiation in a package called autograd, and this is the entirety of a neural network — this is how you would specify and train a neural network in autograd. I initialize my parameters, which are just some random numbers, and then here is my neural network function: I multiply the image I pass in by my weight matrix, add a bias, apply a nonlinearity, do it again, and return some probabilities. I have a loss, which takes in an image, makes a prediction using that function, and then takes the sum-squared error. To get the gradients of this function — the derivative of the loss with respect to those parameters — all I have to do is import the autograd package and call grad on the function. This returns a new function that returns the gradients of my original function; it's what's called a higher-order function, because its input and its output are functions. Whenever you see that nabla — the upside-down triangle — grad is the coding equivalent of it. Then to train, we just call our dloss function on our parameters, our image and our label — I'm pretending you already have a system to get those — we get our gradients, and we update with stochastic gradient descent. So it's very thin: this really is the entire interface with which you talk to autograd.
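A minimal sketch of that network-as-a-function idea, in the spirit of the slide — the layer sizes, learning rate, and fake data are illustrative, not the talk's actual example:

```lua
-- The whole model is just a Lua function; grad(...) returns its gradient function.
local grad = require 'autograd'

local params = {
  W1 = torch.randn(100, 784) * 0.01, b1 = torch.zeros(100),
  W2 = torch.randn(10, 100)  * 0.01, b2 = torch.zeros(10),
}

local function predict(p, x)
  local h = torch.tanh(p.W1 * x + p.b1)
  return p.W2 * h + p.b2
end

local function loss(p, x, y)
  local d = predict(p, x) - y
  return torch.sum(torch.cmul(d, d))      -- sum-squared error
end

local dloss = grad(loss)                  -- the "nabla" of the slide

local x, y = torch.randn(784), torch.randn(10)   -- pretend data
local grads, l = dloss(params, x, y)

-- plain stochastic gradient descent
for k, g in pairs(grads) do
  params[k]:add(-0.01, g)
end
```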
So what's actually happening? Here's my simple function, and as we evaluate it we're keeping track of everything you do in order to be able to reverse it — we're building that trace list I described before and tracking it internally. We start on line — I guess that's line five — we multiply some things, and we keep track of the fact that you multiplied, along with the inputs; we keep track of the addition, its inputs, and also its output; we record inputs and outputs of every function, and we walk down your function and build your compute graph just in time. As you're running your code, we're learning what you've done. The way we track that — I won't go into details — is that we replace every function in Torch with, like, a spy function: instead of just running torch.sum, our spy function says "oh, I hear you're running torch.sum; let me remember the parameters you gave me, let me run sum on those parameters, remember the output, and then return it like nothing happened" — but internally we're remembering all of those things. The way we actually compute the gradients is that we walk back along this list, like I described before, and every time we need a partial derivative, we look it up: we've written all of the partial derivatives for Torch functions, and really every neural network library is going to do this at some level of granularity.

Let me walk you through a couple more examples to show what it can do. This one is pretty vanilla: we can add and multiply scalars and get the correct gradient. Here is where things get a little more interesting: there's an if statement. Control flow can be difficult or awkward in a lot of existing deep learning libraries, but because we just listen to which arithmetic functions get run, we ignore control flow — we go right through it, and we get the correct gradient even with if statements. Of course we actually care about tensors when we're doing optimization or machine learning, and everything I've shown you with scalars works with tensors just as easily — this is in the notebook on the GitHub repository if you want to play with it. And here's where it gets really interesting: for loops also work just fine — not just for loops with a fixed length, which are perhaps easy to unroll, but for loops whose duration depends on data you just computed, or while loops whose stopping condition depends on a computation that occurs inside the loop. We don't really care: we're building your graph dynamically, and when it's done and you return some value, we calculate the derivatives of the graph we have. You can turn any for loop into a recursive function — this is kind of wacky, and I don't know how you'd actually use it in practice, but you can cook up a lot of crazy things with autograd and they just work: here we have a function f that returns a when b hits some stopping condition and otherwise calls f again, and we're going to differentiate this fully recursive function, and it works just fine.

Another aspect that's coming up more and more, as papers come out that basically disrespect the sanctity of the derivative: people are computing synthetic gradients, they're clipping gradients, they're messing with the internals of backpropagation, of autodiff. That's actually pretty easy to engage with in autograd. Say I'm going to sum the floor of a to the third power. The floor operation is piecewise constant, so its derivative is zero almost everywhere, except where it's undefined. Why would I want to do this? If you wanted to build a differentiable JPEG or MPEG encoder — in compression algorithms like that there's often a quantization step that floors or rounds or truncates numbers — and if you wanted to differentiate through that to build a neural JPEG algorithm or something, you'd need to pass gradients through something that ordinarily does not pass them. So you can ask autograd to use your own gradient for anything: if you have a new module you want to define, and either you've written high-performance code for it and want to use it, or you want to redefine or overwrite the gradients we ship, there's a pretty easy mechanism for doing that, and then when you call your special floor you can propagate gradients through it — here I was basically saying, ignore the gradient of floor. This is a toy example, but there are real places where you have a non-differentiable bottleneck inside your computation and you want to either hop over it or find some approximation, and autograd has a mechanism for plugging those kinds of things in very easily.
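As a sketch of that dynamic-graph behaviour — a toy example, not the notebook's actual code; here the loop length comes in as plain data, while the notebook's examples go further and branch on computed values:

```lua
-- The graph is traced as the code runs, so it can differ from call to call.
local grad = require 'autograd'

-- loop length passed in as plain data: each call traces a different graph
local function repeated(x, nsteps)
  local h = x
  for i = 1, nsteps do
    h = torch.tanh(h) * 2
  end
  return torch.sum(h)
end

local df = grad(repeated)
print(df(torch.randn(5), 3))    -- gradient through 3 unrolled steps
print(df(torch.randn(5), 7))    -- same function, a 7-step graph this time
```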
So that's a bit of what autograd is and what it can do, and now I want to turn our attention to how autograd relates to other deep learning libraries — how they're similar and how they're different. One big difference I've found between deep learning libraries is the level of granularity at which you're allowed to specify your neural network. There are a lot of libraries where you say "give me a convnet" or "give me a feed-forward neural network" and that's it — the menu is two items long — and that's fine. I think Andrej really hit it on the head: if you want to solve a problem, don't be a hero — use somebody else's network. Maybe that's VGG, downloaded from the model zoo or something like that. So that's the don't-be-a-hero regime, on the left. In the middle there are a lot of really convenient neural-net-specific libraries, like Torch's nn, and Keras, and Lasagne, where you put together big layers: you don't really get to see what's inside those layers, but you get to click together linear layers or convolutions, and usually that's what you want to do. And at the far end of the spectrum, the things you can click together are the numeric functions of your host scientific computing library — add, multiply, subtract — and these are features of projects like Autograd and Theano and TensorFlow. The reason these boundaries exist is that the developers have chosen to give you partial derivatives at those interfaces; that's how they've defined their APIs, and those are the interfaces across which you, as a user, cannot pass. If you want a new module of the type on the left or in the middle, you have to go in, build a whole new module, and actually implement its partial derivatives; with the type of library on the right, you can build your own modules by composing primitive operations. That's one difference you'll find in practice. As for how these things are implemented under the hood: the ones on the left are usually totally shrink-wrapped, and maybe they implemented the whole thing by hand; the ones in the middle are usually wrappers — they're wrapping some other library; and the ones on the right usually actually implement automatic differentiation — Autograd and Theano and TensorFlow all implement autodiff, and the ones in the middle take advantage of that to make more convenient wrappers.
Another aspect that differs is how these graphs are built. I'll remind you that in autograd we build these graphs just in time, by listening to what you're doing and recording it, but that's not how all neural network libraries are built, and this is an axis along which I think they're meaningfully differentiated. There are a lot of libraries that build these graphs explicitly, where you say "I'm going to click this Lego block into this Lego block," or where you give a YAML specification file: the graph is totally static, and you really have no opportunity for compiler optimizations there. Then there are the just-in-time libraries — autograd, and Chainer is another one — where the graph can be anything: it can change from sample to sample, its length can be determined by the compute that occurs in the graph, but you have very little opportunity for compiler optimizations, so speed can sometimes be an issue. And in the middle there are the ahead-of-time libraries like TensorFlow and Theano, where you construct your graph using a domain-specific language, hand it off to their runtime, and then they can do crazy stuff to make it faster. The problem is that it can be awkward to work with — I guess that got cut off — it can be awkward to work with control flow, and I think there's a reason for that: it's because of the types of graphs these libraries are actually manipulating. We say "compute graph" a lot, we say "dataflow graph" a lot; a dataflow graph has a pretty restricted meaning: the nodes in the graph do computation and the edges are data, and there's no room for control flow in a graph that is a dataflow graph. Static dataflow is the type of graph that nn and Caffe use: the ops are the nodes, the edges are just the data, and the graph can't change. Just-in-time-compiled dataflow, like autograd and Chainer, has the same characteristics, but the graph can change from iteration to iteration, because we wait until you're done computing the forward pass to build it. In the middle there's kind of a hybrid, and I don't know what to call that graph type: the ops are nodes, the edges are data, but the runtime gets special information in order to expand control flow or for loops. scan in Theano is an instance of this — the Theano runtime has special information that allows it to make scan work, but it's conspiring with the graph data type to do it. There's actually another graph type that naturally expresses control flow and data flow together that I haven't seen implemented in a deep learning library: it's called sea of nodes, from Cliff Click's thesis in the mid-nineties. It seems like a really natural thing to try, and maybe that's something that comes up in the future — it's a big question mark; maybe one of you will try it out and see how well it works.
In practice this level of granularity can sometimes slow us down: having to work with addition and multiplication is nice if you want to try crazy stuff, but if you know you want a convnet, why not rush all the way over to the left? And if you want to take, say, Inception and add another layer, you want the type in the middle — and autograd allows you to do all of these. I'll quickly walk through writing a neural net three ways, and then close; questions shortly thereafter. Using the fully granular approach — there's a lot of text on the screen, but the top half is basically instantiating our parameters the way we want — and then, just like I showed you in the previous slides, we do a multiply, an addition, and put it through a nonlinearity. We're being very explicit: we're breaking all the abstraction boundaries and just using primitive operations. Or we can use the layer-based approach: in autograd we have a facility to turn all of the nn modules — of which there are a lot, maybe an exhaustive list of what you'd want for standard deep learning applications — into functions, and then just use them: linear1 on the linear parameters and your input, then some activation, and so on through your neural network. So you can use a layer-based approach if you want. And if you just want a feed-forward neural network, we've got a couple of these standard models ready to go, so you can just say: give me a neural network, give me log-softmax and a loss, and let me glue these together. You can do it any of those three ways.

Autograd at Twitter has had a pretty cool impact. We use nn for a lot of stuff, and we use autograd as well, but being able to reach for autograd to try something totally crazy, knowing that you're going to get the right gradients, has really accelerated the pace of the high-risk, potentially high-payoff attempts that we make. One crazy thing you might want to try is experimenting with loss functions. Instead of "I have a hundred image classes and I want my convolutional neural network to be good at classifying these hundred classes," maybe you have a taxonomy of classes: vehicle, and underneath it bus, car, and motorcycle, and if you guess any one of those you want partial credit for vehicle, or if you guess motorcycle you want partial credit for car. Building that kind of tree loss is really straightforward in autograd — you can do it in one sitting, as sketched below — but it might be more complicated in other libraries, where you'd have to crack open the abstraction barrier, write your own partial derivatives, glue it back together, and then use the module you've built. We've trained models that are in production with autograd, so it's battle-tested in a sense, and it's running on a large amount of media at Twitter. In a sense autograd doesn't actually matter when you're running in production, because you just have your function definition for the prediction of your neural network, and the gradient part goes away — all the fancy stuff where we replace Torch functions with our secret listener functions, all of that goes away and you just have some numerical code — so there's actually no speed penalty at test time at all. And we have an optimized mode, which does a little bit of compiler stuff — still a work in progress — but for the average model it's as fast as, sometimes faster than, nn; for really complicated stuff, if you wrote it by hand you'd probably be faster, but the time to first model fit using autograd is dramatically reduced, because you don't have to worry about correctness.
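Picking up that taxonomy idea, here is one way a "partial credit" loss could be written as a plain function for autograd to differentiate — the loss itself is made up for illustration, not the one used at Twitter:

```lua
-- A made-up "partial credit up the taxonomy" loss, written as ordinary code;
-- grad(...) differentiates it like any other function.
local grad = require 'autograd'

-- One-hot masks: targetMask marks the true class, parentMask marks its parent
-- (e.g. target = motorcycle, parent = vehicle). Both are constant tensors.
local function treeLoss(params, x, targetMask, parentMask)
  local scores = params.W * x + params.b
  local credit = torch.sum(torch.cmul(scores, targetMask))
               + 0.5 * torch.sum(torch.cmul(scores, parentMask))
  -- keep the scores from blowing up, and reward the credited classes
  return 0.01 * torch.sum(torch.cmul(scores, scores)) - credit
end

local params = {W = torch.randn(100, 64) * 0.01, b = torch.zeros(100)}
local dtree  = grad(treeLoss)

local x = torch.randn(64)
local targetMask, parentMask = torch.zeros(100), torch.zeros(100)
targetMask[3], parentMask[17] = 1, 1     -- illustrative class indices

local grads, loss = dtree(params, x, targetMask, parentMask)
```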
This is a big wall of text, but it's meant to put in your head some ideas from the automatic differentiation world that we don't have yet and that we really want, to be able to train models faster and better. The first is checkpointing — not the checkpointing where you save your model every ten iterations, but checkpointing on the forward pass. In normal reverse-mode automatic differentiation you have to remember every single piece of computation you do, because you might need it to calculate the derivatives; with checkpointing you delete some of them — you let them go away — because you think they might actually be easier to recompute than to store. For pointwise nonlinearities, for instance, it might be easier, once you've loaded your data, to just recompute the ReLU rather than saving the result of the ReLU and loading it back in again. Mixing forward and reverse mode is something you can imagine being important for complicated architectures, although I don't really know how much impact it would have: in the chain rule you can go from left to right, or you could start in the middle and go out — you can do all kinds of crazy stuff — and we really just do reverse mode. For diamond-shaped graphs, where your computation explodes out and comes back in, it might be useful to start with forward mode and finish with reverse mode; for an hourglass, you might want to start with reverse mode and end with forward mode. Stencils are a generalization of convolutions that people use a lot in computer graphics; automatically calculating really efficient derivatives of general image processing algorithms is under active investigation in the graphics and computer vision worlds, and these are two references that are kind of neat papers. Source-to-source transformation is something that hasn't really made it over; it's basically been dormant for about ten or fifteen years. The gold standard used to be that you take a piece of code as text and output another piece of code as text; what we're doing now in deep learning is always building runtimes — some domain-specific layer that depends on you actually running code. It used to be that a tool just read that text and, like a compiler, spat out the gradient. That was the gold standard — it might not be now, but I think it's worth re-investigating. And then higher-order gradients: Hessian-vector products and Hessian-based optimization. Maybe it doesn't always have full payoff — I actually don't recall hearing anything about it at this school so far — because it's very expensive and difficult to do, computationally. The idea is just: if you take the grad of f, it gives you the gradients; if you want the second derivative, you take the grad of the grad of f. There are efficient ways to do this; it's still kind of an open problem, but there are libraries out there — the Python version of Autograd does this, and DiffSharp and Hype both do it as well.

So, to close out: you should just try it out. It's really easy to get if you have Anaconda — if you use Python, we've made it so that Lua is fully installable with Anaconda, so if you're already using it, it's very easy to get all of the tools I've shown you today, and that's the single line to install and interface with it. If you have any questions you can find me on Twitter or email or GitHub, and I'm happy to answer anything you have.

Q: [inaudible]

A: Oh — yeah, I have no idea.

Q: Thanks for the great talk. I was wondering, what's the state of the data visualization facilities in Lua compared to, say, Python?

A: If I'm frank, it's not as good. Python has been at this for five or ten years, really actively building matplotlib and Seaborn and all these other libraries, and in Lua we're importing other people's work. Bokeh.js is really the best that I've seen so far, and that's something you can use in a notebook, so you have the full suite of that particular library.
Q: Thanks for the talk. Is it possible to convert a model trained with Torch into a C model that's deployable in production?

A: We just run Torch in production — we use the Lua model. But if you want to run it in C: the whole layer of Torch that's actually doing the work is in C, and you can call Torch from C. I don't have a specific website I can point you to, but you can very easily call and execute a Lua script from C; it's like three or four lines of code in C.

Q: A follow-up on the question about C just now. If I want to compile Torch into my C++ code, what kind of overhead do I see? As you mentioned yourself, there's a 10,000-line language, plus a just-in-time compiler, that I need to put in there, right? Or can I avoid that? For example, if I'm going to put the model in an embedded system, there aren't many resources for anything at inference time.

A: During inference time there's no appreciable overhead, if I'm understanding your question right. You are embedding Lua, so in your C code you basically say "Lua, please run this Lua script," and that's going to call out into other C code. All the overhead I talked about with autograd is training-time overhead; it doesn't exist at test time at all.

Q: But the thing is, I still need to have Lua compiled into my C code, right?

A: Yeah, and this is something people have been doing for fifteen or twenty years; it's pretty mature. Lua is in microwaves, for instance — people have done very embedded applications of Lua. I think the binary for Lua is around — it's kilobytes; it's very, very small. It's 10,000 lines of code, so when it compiles down it's small.

Q (from Twitter): I'm using a combination of Keras and TensorFlow — why should I use Torch Autograd?

A: If you're happy, then that's great, I guess. People tend to reach for Torch when they would like to be able to reason very easily about performance. The more compiler infrastructure that gets added to a deep learning environment, the harder it can be for the end user — anyone a step removed from the people who originally made the library — to reason about why something is slow or why it's not working. You might eventually see some GitHub issue, "my network is slow in these conditions," and then it gets closed a year after you had to have shipped your project. These things can happen; it's not the fault of anybody. It's just that Torch was designed to be a very thin layer over C code, so if that's something you care about, Torch is a really good thing to work with. If Keras and TensorFlow are working great for you, then keep deep learning — that's awesome.

Q (from Twitter): Where will the slides be posted?

A: I'm trying to see — it's hard to filter. It's not a deep learning question, but they will be posted; that's the answer to that question.
Q: I have a question on how to access these models. Normally all the web services in production are Flask-based applications in Python, or Java-based web services, or maybe on a cellphone through Android, which is also Java. So how do you call these models that were trained in Torch — how would you actually access them?

A: There are a couple of different ways you can do that. If you're using a feed-forward neural network, writing the Java code to do the matrix multiplies can be pretty straightforward, and we've actually done that before — or it's just simpler to write the deep learning code, load in the weights, and serialize them however they need to be loaded. That's one approach; it's kind of a short-term hack. At Twitter we've engineered a system where we actually have Lua virtual machines running inside of Java, and we talk over the JNI, so we have a more permanent solution for that. But if you're using standard model architectures, you might try to serialize your weights and then use a native deep learning library that exists on that platform to load those weights and run forward — with some debugging, I think that's a perfectly fair approach if you have this split between training and deployment where you're constrained by language or environment. Generally, you basically just serialize your model and then try to read it.

Q: What about the latency? Related to this — when you serialize it in that hackish way, at least you get the latency thing sorted out. But is there any plan to have interfaces available for other languages, so you don't have to do this extra step of serializing and then loading it into the other language? Like in your case — you mentioned that at Twitter you have Torch available inside your Java JVM, or access to it using the JNI — what impact does that have on the latency?

A: By latency you mean time to ship the model, not the latency of how long it takes to make predictions? That's going to be very engineering-dependent. If you're calling Torch from C code, the latency is not appreciable over just running Lua code, and that can be extremely fast. If you're going through some wrapper, like the JNI or something like that, you will incur an overhead, and you should just try to pick the interfaces that reduce that as much as possible, even if you incur engineering overhead to do so. I don't know if that answers your question — I'm a little bit distant from the server side, so I just don't know. But generally, what I think is fair to say is that we're constrained by machine learning — by model complexity latency; we are not constrained by the overhead of figuring out how to actually get those predictions out to an HTTP request, for instance.

Q: Something like Serving, which is kind of sort of solving this problem?

A: Yeah — not that I'm aware of. Again, the Torch community is not centralized, so people could be working on a totally awesome complement to the TensorFlow server, but I am not aware of it. Thank you.

OK, we're going to take a short break of fifteen minutes. Let's thank Alex again.