SLIDE - Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning with Beidi Chen...
The Research on Optimizing Computational Efficiency in Deep Learning
One year ago, our team predicted links that would lead to significant advancements in deep learning. We were particularly interested in exploring optimization techniques for improving computational efficiency in neural networks. Our research focused on developing a novel approach that leverages smart algorithms and sparse matrix operations to accelerate training processes.
Our initial results showed promising outcomes, with the proposed method achieving convergence times that were 2.5 times faster than those achieved using TensorFlow's GPU implementation. However, we also observed that accuracy was maintained but not significantly improved. This suggests that our technique is well-suited for applications where computational efficiency is a top priority, such as in extreme classification tasks.
While our initial results are encouraging, it's essential to note that this approach is still a prototype and requires further refinement. Our team is actively exploring ways to adapt this method to other types of problems that exhibit high degrees of sparsity. We believe that the fundamental principles underlying our technique can be applied broadly, even in scenarios where the input data is not as sparse.
In fact, we are currently working on integrating this approach with popular frameworks such as BERT, which requires significant computational resources for training. Our goal is to identify the bottlenecks in these systems and determine whether our technique can provide a meaningful speedup without compromising accuracy. This will require careful consideration of the specific architecture of each system and a thorough understanding of the trade-offs between computational efficiency and model performance.
One of the key challenges we face when applying this approach to new scenarios is identifying the optimal level of sparsity to target. Simply replacing the entire matrix with only the top K or N active nodes may lead to significant accuracy losses, while overly aggressive pruning may not provide sufficient speedup. Our team is working on developing more sophisticated methods for selecting the most informative nodes and adjusting the level of sparsity accordingly.
Another important aspect of our research involves optimizing the hashing algorithm used in our technique for GPU implementation. Currently, this step relies on CPU-based implementations due to the need for significant memory resources. However, recent breakthroughs in dynamic hash tables have provided a promising direction forward. Our team is exploring the possibility of implementing locality-sensitive hashing on GPUs, which could further accelerate computation and enable even more efficient use of modern hardware.
Overall, our research aims to push the boundaries of what is possible in deep learning by developing novel optimization techniques that can be applied broadly across various architectures and problem domains. While there are challenges to overcome, we believe that our work has the potential to make a significant impact in the field and pave the way for future breakthroughs.
"WEBVTTKind: captionsLanguage: enall right everyone I am on the line with baby Chen baby is a PhD candidate in the Department of Computer Science at Rice University baby welcome to the twill a eye podcast Thanks so I had the opportunity to hear baby's presentation at the system's workshop and mal was I always get confused mo for system systems for ML this was systems for ML correct yeah they changed recently too we so made this paper to the conference the first half it called M else's and then later change it to sis mo because of the trademarks yeah well I think there are two different multiple different workshops with very similar with all the same words but in different orders but in any case you you presented at the workshop on your paper slide which we will go into but before we do that share with us a little bit about your background what sparked your interest in machine learning particular on kind of this intersection of hardware and algorithms you know how did this all come about sure so before I study my PhD arise I had my anugrah in University of California Berkeley and I was actually doing some networks research it's kind of still computer size for the little bit irrelevant I did my grad my grad work and networks also like stochastic modeling and all that kind of stuff you're doing no not the same different kind of network okay yeah so we're kind of doing like a software-defined buting's yeah so that's what's kind of close to the AI part because you want you automated some process and you predict something for people yeah yeah then when I come to rise I got interested and machine learning algorithms because I saw its I always have this in mind all the problems were trying to attack in computer science not all like most of the bottle acts are usually the computation not all the problems are showing this part but if you think about it we usually have a brute force way of doing things and if we do have the computation of power or we have the time to achieve it then the problem self but the problem is we usually don't have that time and computational power so for the algorithm part what is good is reducing from Big O and square to B go in or Big O and to calcium is usually making that possible so on the way on doing research is kind of I always think about what is the brute force way what is the naive way I can chip this and can I speed it up either in timewise or reducing the memory any part that is blocking this happening to produce the optimal solution and we can do approximation and that's why I kind of doing all the randomized algorithms in my PhD because that's gonna help with the memory and time maybe the algorithms are take advantage of randomization or you just are working on a bunch of random different projects locality-sensitive hashing all kind of applications to tackle different problems got it so this locality hashing is kind of well obviously at the core of the paper you presented at the emesis workshop slide what is the full name of that paper obviously we change a bunch of times I think the first version we're trying to saying that we're defending for the smart algorithms over specialized harbors and then we found a kind of offensive to some people because we're not saying how we're development is not good we're just saying that we also need more progress on the algorithms time so we're just changing the name back to we're beating like psychic trying to beat GPU over like CPU stuff so easily without effort so nobody gets offended by GPS beating up on CPS except Intel right yeah we actually have three more collaborators after the first version of the paper came out they also help us all further speeded up and by using all those Intel tricks before we get into the kind of the core innovation of the paper what is the motivation you've hinted at this also everything takes too long but maybe you talk a little bit more about the specific problem that you're trying to solve with this paper right so we all know that now all the applications like vision NLP and they all kind of have the state of art by using the newer networks but the problem one bottleneck of the newer network is the computation that's why the inspiring GPUs were invented and then people have the computational power to kind of do this all complicated computations so the major bahala is still the matrix multiplication and if you can do like large matrix multiplication really fast then that's kind of solving half of the computational problem in your networks and that's why we started from those the problem which is like concerts extremely classification so basically for example you have your Amazon user so you're typing something like I want like key shoes and then in a background hopefully you're gonna have a black boss which can make you can help you decide which products to show to you then it's kind of reduced to like extreme classification because you're typing a query that's your crate belong to for 1 million classes which one is it one class can be added as shoes one class can be like Nike run issues like that so for this kind of problem the major computation is the last layer because you have 1 million classes then you have to do all this the major computation over there and then we're thinking about is GPU the only way to solve this problem and because we realize this kind of problem even GPU the speed-up is not enough because like the high memory access and then we're thinking about all the computations necessary then the answer is actually no because majority of it is redundant because in the neural networks it's like the hidden layers usually you care about the high activations which is like to inner product producing high results and for last layer is is like even worse because for those kind of extreme classification what you care is about like issues re does choose but you don't care about a headphone so fast those irrelevant stuff like the computation is kind of wasted but the problem is before you do the computation you don't know which part are important which part does not and then the hashing algorithms is exactly good for these kind of situations kind of predict which computation are going to give you like high density so that you just compute those to go to approximation of the few competition and then in this way all the computations in the neural networks will be sparse and including the backwards so that's how we come up with the problem and find the solution so you mentioned hashing algorithms are useful here and telling you which of your computations are going to be useful what why and how is that because usually locality-sensitive hashing is very useful for the search or ranking industry the reason why it is good for finding the nearest neighbors and in the in the Norn our space what is near neighbors is like if two vectors attend to have the similar direction then the inner product will tend to be higher let's assume a uniform norm here so in this way vector a and B are neighbors if they can produce high in a product that hashing is kind of if you know vector a they will the work has been hasn't time find all the bees which are close to a neighbors of a and producing those bees which has the higher end a product with a and that's why I call it like a useful algorithm exactly for this kind of situation got it got it so locality sensitive hashing that's something that has been around that's not something you created years around for twenty years and you know what are some of the other applications in which it tends to come up you mentioned search yeah usually so for a local hands of Akashi one big application is if you have let's say 1 million images or 1 billion images you can't and then you have images of query on hand for example a cat you cannot search linearly for the 1 billion images in a database because that's kind of called Cosme yeah so one the only smart way is you save the one Mila images based on some kind of structures which we call a like neighbors goes to the same bucket so that way whenever you have a query they will quickly use the same way to compute the hash code and go to the bucket which contains all the neighbors so that your search time is bigger one otherwise when you type something on Google is probably take you ages to search for all the information currently we have now so we have to reduce the search time to beagle why instead of we go in and that's all the hashing algorithms are useful producing the hash is it is it kind of a deterministic process or is there a formula or is it a learned process are you learning the hashes across the data set so two categories why is the learnt one and we're actually on the other side is like data independent hash functions are not deterministic is join from some distribution and all the hash functions are all in guarantee that the collet like the for example you have two items exact we the same and then the probability that searching that our element will be one but if the element has some minor differences then the probability will be different so by calling probability is each time you generate the hash function is probably different ones so we can all these eight others with high probability we're gonna retrieve the neighbors seems like if the hash function is going to determine the proximity of different pieces of data it needs to know something structural about the data and I'm wondering then is does that mean that you know you have to do a lot of work upfront to you know create this hash function based on a given data set and then you know anytime you want to look at a different type of data do you have to redo that no but we actually provide the freedom for users to implement their own hash functions so there are four things we're providing for different kind of similarities I think for hash functions there are two lines of work the first one is the learned hash functions we're currently not going to that line but people have freedom to adding those solutions to our framework as well but the second one is we're focusing on the data independent one by data independent I'm saying that not only in based on the data but also based on different similarities for example the most useful ones in our framework currently is cosine similarity I would just talk about if you want to vectors to determine which vectors are around one query vector then cosine similarity might be the most useful one because that's kind of determining the angle which is the direction and another wise like the ranking so they don't care about the normal Direction equals a vector but they care about the rank like for a given vector does the first one the first dimension is larger than the second the second is larger than the for like this kind of preserved alder does is extremely useful for images and there are there a third which is called Ming hash we just kind of preserve the Jocasta molarity is like four to give and set is preserved the similarity between so is actually four different similarity and several different kinds of data you can use different kind of hash functions to have to achieve the same guarantees and I recently also get into the learn hash function that part but the one particular problem that extremely trouble our framework is our framework is not we're gonna change the hash functions or review the hash tables once in a while because we want to preserve the randomness but the read beauty and part for those learned hash functions will be very long yeah so that's a good thing about the data independent hash because it is not time consuming but the interest in the learned hash functions is that they might be more accurate or allow you to express different types of similarity yes that is correct that's why we are seriously considering this case in our framework because our frame ultimately wants to spend very the last time predicting which neurons should be active and then do all the computation on CPU there's actually very hard task you have to do it very quickly otherwise that overhead is gonna kill you so that's why I'm talking about like for the learn hash function although is very accurate it's gonna help us like spends maybe 10 times if the current time of computing hash functions and that's gonna make the framework may be comparatively ok with GPU but not obvious speed-up so the solution is doable it's just I think there's a time and accuracy trade-off here but I we think on the independent hash functions are along the sweet spots and so what's the relationship between the hash function and the idea of adaptive sparsity that's a concept that you mentioned in the paper right so just considering the random drop out it's actually like each time you're just giving sparsity to the neural network randomly so that you're is highly possible that you're gonna meet some important parts if you do that randomly so by adaptive sparsity here is that's why I liked for the random drop out this varsity can maximum goes to like 50% or so otherwise you will decrease your accuracy but we do another experiments and is actually finding a lot of like previous papers to if we can compute all the computations and we only take maybe 5% or 1% of nose as the New York of the active nose then the accuracy will not drop but the problem is you have to make the Hugh the huge computation to know which ones we're going to produce you the top ones so that's inspiring us like we do approximate like 5% or 10% but so the speed is kind of close to the random dropout but the effectiveness of it is close to the adaptive top K so that's why we find the sweet spot in this - does that mean that you're applying the location sensitive hashing in two places one at the very end and for your classifier and then also more deeply in the dropout so Joe Paul is actually in parallel with our framework I'm just giving an example of how the hashing is gonna help if hashing is doing a random selection of the nose is exactly the same as the dropout okay yeah is computing the real top K then that's called the adaptive sparsity but it's somewhere in between is using the same time it's very efficient as the random drop out because the cry we only take Big O one time right but it's gonna retrieve you the nose that is highly possible in that top K so I'm just like giving intuition of what Hashem was selecting it's nothing but just selecting adaptively the active notes for each layer so what is very useful in the case of extreme classification is because of the full computing like the majority of the computation is in the last layer because of the number of classes so that's going to reduce a lot of computation but for the usual fully connected neural network each layer we can build different hash tables and we can do as a sparsity over each layers so that's not a restricted to the last layer there are other elements of this approach that we should be sure to cover there's the application of the hashing do you have to do anything special in terms of your your data collection or preparation or should you just be able to apply this technique so we're actually exploring two directions of extension to this word so this is our current framework support the fully connected neural networks but people are more interested and when they have like the NLP or vision problem they want the CNO ostium currently what we can do is at least for last layer we can apply the same tactic but the problem is for seeing or lost them in the middle like those hidden layers we have to investigate there to see what advantage is the hashing approach and for another line of work we're thinking about distributed because it's well known that for the distributed setting the computational bottleneck is the communication and our slide has as natural advantage here because of the we're only selecting for example 1% or 5% of the active notes for each layer so that in the backpropagation the gradient updates is also very sparse so that for the distributed setting this is natural you fishing for for the same reasons that we've talked about previously intelligently using the hashing to kind of identify and strip out the sparsity have you kind of measured the results and compared this approach to others and then what kind of results have you seen yeah for the results part that's the most impact for people care about so for GPU comparison we're using the current state of our to framework tensorflow and also use also run the same program on via 100 GPU which is currently kind of the state of our people use to run to perform all the experiments for those applications and also for tens of low CPU we also try to see that for all those cpu tricks they're using our repeating them and obviously you will be slower than the tens of flow GPU but will still do some comparison there to see like there's three lines which ours is the most efficient and then tensorflow GPU on VR hundred and tens of low CPU on the same cpu machine we're using because we want to still compare with the CPU otherwise we can use a super-powerful CPU meshing and then that's the using that to beat the GPU is not necessary for people because we want people to do the same thing on CPU which is not that costly as using like a very hundred or miss kind of expensive GPU machines and also there's a lice finding about how many cores we include in our CPU computation so that we can beat the test flow GPU it seems like eight cores to 16 cores is enough to out perform the same task of on GPU we are hundreds so although we're in the in a paper where come using like a 44 corner machine but we do the obligation study to see where it intersects the performance is actually 8 to 16 course somewhere in the middle ok we done so with 8 to 16 cores you're able to see what outperform tensorflow on a GPU yes ok what what's the problem that you're that you're benchmarking this on its those are extremely classification tasks extreme classification and can you talk a little bit about it there are there established data sets and benchmarks for extreme classification can you talk a little bit about those yes so there's a whole repo about all those for example the example that I gave earlier for Amazon data it's usually like yeah you have Curie and you have all the products that's like extreme classification or sometimes you have user typing query and then you have a search results for the finally the website you are clicking that's like another case yeah so all those benchmarks are including one repo or the extreme classification field like people use those are stuff benchmarks for the Amazon you said how many classes are there for the data set we're using is 670 K okay and they use another delicious data set that's like the one year where you're predicting links yeah it was like 200 K so you've got the first set of diagrams where you've got the three lines and that one is looking at the slide accuracy and then you have time on the bottom what is the time is that like convergence time or else yes like training time training time that's looks go yeah okay so you've got pretty strong results here do you have you had anyone take this and implement it in practice kind of an industry or for a kind of a live application yeah I reserve a couple of emails of requesting we have a license there and then people can implement it and you know is extreme really of course there will be super interesting yeah they're working on it to include to there some kind of libraries in a future to there's one thing that I want to share so the oldest P does we're getting here is not about all those simple flaws or implementation tricks we don't even use cool glass or this kind of libraries for speeding up the computation which is use all the primitives and all the commutations were saving here because of the smart algorithms so because we're only in choosing 5% of the nose active nose so that the commutation was saved so I think there's framework if you'd like we can have further speed-up if those techniques tricks are applying to this but this is just like prototype which proved that we can do this with smart algorithms kind of going back to the results you found that the I found the number here the convergence time required is over two and a half times less than with tensorflow GPU and then I was looking for your accuracy number here how does it compare in terms of accuracy there's no accuracy job okay so you're able to maintain accuracy but converts more quality so you've done your benchmarking for this extreme classification task which has inherent sparsity to it are there other types of problems that you think that this would easily lend itself to do too because they also have high degrees of sparsity or do you see this as you know are there ways to kind of adapt this approach so that it you can apply it to use cases where they're not quite as inherently sparse so you're talking about the approach we're having approach you're taking and the applicability and you know what are the requirements of the scenarios in which it would applies you know would they have to be as sparse as the extreme classification tasks or can you see some ways that they can apply more broadly you can be applied broadly as well as I just mentioned for example we're currently working on if we can use this speeding up on the training of Bert which is very popular recently so the difficulty of applying this to an existing framework is we have to profile where is the computation bottleneck because speeding up the LOM bottleneck won't help and because we're trained we're producing some overheads as well and that's not gonna suite the whole process of any way so we have to find the bottleneck of the training and if this technique can intersect in this dispersity so if we do an experiment saying that we're replacing our algorithm with like the real top K and is decreasing the accuracy too much then it's not worth investigating but if we're for example choose the real top 5% of the active nodes and the accuracy still maintains they in this case there's hope for us to do this approximation because like them kind of like the ground truth works but now we just need to speed it up yeah yeah yeah so it depends yeah it depends on the architectures of the network and yes yeah so it's not something that you would envision ever being kind of a generic tool on in the toolkit that you would apply to different types of problems you have to kind of hand tailor the application of this technique to the different two different algorithms and where they're bottlenecked why I think this is a still generic is because in all those neural networks or architectures anyway reduce to the matrix multiplication problem once that one is there this should be useful but it's just like we need to see which where do we where the they were sent to apply to okay got it and we're also looking for like a GPU implementation of the same masses well because the major bottleneck for this technique currently going to GPU is because of the hashing algorithm because for those many years hashing only has the cpu implementation because it requires a lot of memory because it's remembering something right we save something in the hash table and recently we see a step forward to the dynamic hash tables and GPUs from UC Davis folks so for those I kind of were saying that it might be possible to implementing like a locality sensitive hashing on GPU and then this can further speed of GPU and that's like also a possible future direction of the framework awesome awesome well baby thanks so much for taking the time to share a little bit about what you're working on it's very cool stuff thanksall right everyone I am on the line with baby Chen baby is a PhD candidate in the Department of Computer Science at Rice University baby welcome to the twill a eye podcast Thanks so I had the opportunity to hear baby's presentation at the system's workshop and mal was I always get confused mo for system systems for ML this was systems for ML correct yeah they changed recently too we so made this paper to the conference the first half it called M else's and then later change it to sis mo because of the trademarks yeah well I think there are two different multiple different workshops with very similar with all the same words but in different orders but in any case you you presented at the workshop on your paper slide which we will go into but before we do that share with us a little bit about your background what sparked your interest in machine learning particular on kind of this intersection of hardware and algorithms you know how did this all come about sure so before I study my PhD arise I had my anugrah in University of California Berkeley and I was actually doing some networks research it's kind of still computer size for the little bit irrelevant I did my grad my grad work and networks also like stochastic modeling and all that kind of stuff you're doing no not the same different kind of network okay yeah so we're kind of doing like a software-defined buting's yeah so that's what's kind of close to the AI part because you want you automated some process and you predict something for people yeah yeah then when I come to rise I got interested and machine learning algorithms because I saw its I always have this in mind all the problems were trying to attack in computer science not all like most of the bottle acts are usually the computation not all the problems are showing this part but if you think about it we usually have a brute force way of doing things and if we do have the computation of power or we have the time to achieve it then the problem self but the problem is we usually don't have that time and computational power so for the algorithm part what is good is reducing from Big O and square to B go in or Big O and to calcium is usually making that possible so on the way on doing research is kind of I always think about what is the brute force way what is the naive way I can chip this and can I speed it up either in timewise or reducing the memory any part that is blocking this happening to produce the optimal solution and we can do approximation and that's why I kind of doing all the randomized algorithms in my PhD because that's gonna help with the memory and time maybe the algorithms are take advantage of randomization or you just are working on a bunch of random different projects locality-sensitive hashing all kind of applications to tackle different problems got it so this locality hashing is kind of well obviously at the core of the paper you presented at the emesis workshop slide what is the full name of that paper obviously we change a bunch of times I think the first version we're trying to saying that we're defending for the smart algorithms over specialized harbors and then we found a kind of offensive to some people because we're not saying how we're development is not good we're just saying that we also need more progress on the algorithms time so we're just changing the name back to we're beating like psychic trying to beat GPU over like CPU stuff so easily without effort so nobody gets offended by GPS beating up on CPS except Intel right yeah we actually have three more collaborators after the first version of the paper came out they also help us all further speeded up and by using all those Intel tricks before we get into the kind of the core innovation of the paper what is the motivation you've hinted at this also everything takes too long but maybe you talk a little bit more about the specific problem that you're trying to solve with this paper right so we all know that now all the applications like vision NLP and they all kind of have the state of art by using the newer networks but the problem one bottleneck of the newer network is the computation that's why the inspiring GPUs were invented and then people have the computational power to kind of do this all complicated computations so the major bahala is still the matrix multiplication and if you can do like large matrix multiplication really fast then that's kind of solving half of the computational problem in your networks and that's why we started from those the problem which is like concerts extremely classification so basically for example you have your Amazon user so you're typing something like I want like key shoes and then in a background hopefully you're gonna have a black boss which can make you can help you decide which products to show to you then it's kind of reduced to like extreme classification because you're typing a query that's your crate belong to for 1 million classes which one is it one class can be added as shoes one class can be like Nike run issues like that so for this kind of problem the major computation is the last layer because you have 1 million classes then you have to do all this the major computation over there and then we're thinking about is GPU the only way to solve this problem and because we realize this kind of problem even GPU the speed-up is not enough because like the high memory access and then we're thinking about all the computations necessary then the answer is actually no because majority of it is redundant because in the neural networks it's like the hidden layers usually you care about the high activations which is like to inner product producing high results and for last layer is is like even worse because for those kind of extreme classification what you care is about like issues re does choose but you don't care about a headphone so fast those irrelevant stuff like the computation is kind of wasted but the problem is before you do the computation you don't know which part are important which part does not and then the hashing algorithms is exactly good for these kind of situations kind of predict which computation are going to give you like high density so that you just compute those to go to approximation of the few competition and then in this way all the computations in the neural networks will be sparse and including the backwards so that's how we come up with the problem and find the solution so you mentioned hashing algorithms are useful here and telling you which of your computations are going to be useful what why and how is that because usually locality-sensitive hashing is very useful for the search or ranking industry the reason why it is good for finding the nearest neighbors and in the in the Norn our space what is near neighbors is like if two vectors attend to have the similar direction then the inner product will tend to be higher let's assume a uniform norm here so in this way vector a and B are neighbors if they can produce high in a product that hashing is kind of if you know vector a they will the work has been hasn't time find all the bees which are close to a neighbors of a and producing those bees which has the higher end a product with a and that's why I call it like a useful algorithm exactly for this kind of situation got it got it so locality sensitive hashing that's something that has been around that's not something you created years around for twenty years and you know what are some of the other applications in which it tends to come up you mentioned search yeah usually so for a local hands of Akashi one big application is if you have let's say 1 million images or 1 billion images you can't and then you have images of query on hand for example a cat you cannot search linearly for the 1 billion images in a database because that's kind of called Cosme yeah so one the only smart way is you save the one Mila images based on some kind of structures which we call a like neighbors goes to the same bucket so that way whenever you have a query they will quickly use the same way to compute the hash code and go to the bucket which contains all the neighbors so that your search time is bigger one otherwise when you type something on Google is probably take you ages to search for all the information currently we have now so we have to reduce the search time to beagle why instead of we go in and that's all the hashing algorithms are useful producing the hash is it is it kind of a deterministic process or is there a formula or is it a learned process are you learning the hashes across the data set so two categories why is the learnt one and we're actually on the other side is like data independent hash functions are not deterministic is join from some distribution and all the hash functions are all in guarantee that the collet like the for example you have two items exact we the same and then the probability that searching that our element will be one but if the element has some minor differences then the probability will be different so by calling probability is each time you generate the hash function is probably different ones so we can all these eight others with high probability we're gonna retrieve the neighbors seems like if the hash function is going to determine the proximity of different pieces of data it needs to know something structural about the data and I'm wondering then is does that mean that you know you have to do a lot of work upfront to you know create this hash function based on a given data set and then you know anytime you want to look at a different type of data do you have to redo that no but we actually provide the freedom for users to implement their own hash functions so there are four things we're providing for different kind of similarities I think for hash functions there are two lines of work the first one is the learned hash functions we're currently not going to that line but people have freedom to adding those solutions to our framework as well but the second one is we're focusing on the data independent one by data independent I'm saying that not only in based on the data but also based on different similarities for example the most useful ones in our framework currently is cosine similarity I would just talk about if you want to vectors to determine which vectors are around one query vector then cosine similarity might be the most useful one because that's kind of determining the angle which is the direction and another wise like the ranking so they don't care about the normal Direction equals a vector but they care about the rank like for a given vector does the first one the first dimension is larger than the second the second is larger than the for like this kind of preserved alder does is extremely useful for images and there are there a third which is called Ming hash we just kind of preserve the Jocasta molarity is like four to give and set is preserved the similarity between so is actually four different similarity and several different kinds of data you can use different kind of hash functions to have to achieve the same guarantees and I recently also get into the learn hash function that part but the one particular problem that extremely trouble our framework is our framework is not we're gonna change the hash functions or review the hash tables once in a while because we want to preserve the randomness but the read beauty and part for those learned hash functions will be very long yeah so that's a good thing about the data independent hash because it is not time consuming but the interest in the learned hash functions is that they might be more accurate or allow you to express different types of similarity yes that is correct that's why we are seriously considering this case in our framework because our frame ultimately wants to spend very the last time predicting which neurons should be active and then do all the computation on CPU there's actually very hard task you have to do it very quickly otherwise that overhead is gonna kill you so that's why I'm talking about like for the learn hash function although is very accurate it's gonna help us like spends maybe 10 times if the current time of computing hash functions and that's gonna make the framework may be comparatively ok with GPU but not obvious speed-up so the solution is doable it's just I think there's a time and accuracy trade-off here but I we think on the independent hash functions are along the sweet spots and so what's the relationship between the hash function and the idea of adaptive sparsity that's a concept that you mentioned in the paper right so just considering the random drop out it's actually like each time you're just giving sparsity to the neural network randomly so that you're is highly possible that you're gonna meet some important parts if you do that randomly so by adaptive sparsity here is that's why I liked for the random drop out this varsity can maximum goes to like 50% or so otherwise you will decrease your accuracy but we do another experiments and is actually finding a lot of like previous papers to if we can compute all the computations and we only take maybe 5% or 1% of nose as the New York of the active nose then the accuracy will not drop but the problem is you have to make the Hugh the huge computation to know which ones we're going to produce you the top ones so that's inspiring us like we do approximate like 5% or 10% but so the speed is kind of close to the random dropout but the effectiveness of it is close to the adaptive top K so that's why we find the sweet spot in this - does that mean that you're applying the location sensitive hashing in two places one at the very end and for your classifier and then also more deeply in the dropout so Joe Paul is actually in parallel with our framework I'm just giving an example of how the hashing is gonna help if hashing is doing a random selection of the nose is exactly the same as the dropout okay yeah is computing the real top K then that's called the adaptive sparsity but it's somewhere in between is using the same time it's very efficient as the random drop out because the cry we only take Big O one time right but it's gonna retrieve you the nose that is highly possible in that top K so I'm just like giving intuition of what Hashem was selecting it's nothing but just selecting adaptively the active notes for each layer so what is very useful in the case of extreme classification is because of the full computing like the majority of the computation is in the last layer because of the number of classes so that's going to reduce a lot of computation but for the usual fully connected neural network each layer we can build different hash tables and we can do as a sparsity over each layers so that's not a restricted to the last layer there are other elements of this approach that we should be sure to cover there's the application of the hashing do you have to do anything special in terms of your your data collection or preparation or should you just be able to apply this technique so we're actually exploring two directions of extension to this word so this is our current framework support the fully connected neural networks but people are more interested and when they have like the NLP or vision problem they want the CNO ostium currently what we can do is at least for last layer we can apply the same tactic but the problem is for seeing or lost them in the middle like those hidden layers we have to investigate there to see what advantage is the hashing approach and for another line of work we're thinking about distributed because it's well known that for the distributed setting the computational bottleneck is the communication and our slide has as natural advantage here because of the we're only selecting for example 1% or 5% of the active notes for each layer so that in the backpropagation the gradient updates is also very sparse so that for the distributed setting this is natural you fishing for for the same reasons that we've talked about previously intelligently using the hashing to kind of identify and strip out the sparsity have you kind of measured the results and compared this approach to others and then what kind of results have you seen yeah for the results part that's the most impact for people care about so for GPU comparison we're using the current state of our to framework tensorflow and also use also run the same program on via 100 GPU which is currently kind of the state of our people use to run to perform all the experiments for those applications and also for tens of low CPU we also try to see that for all those cpu tricks they're using our repeating them and obviously you will be slower than the tens of flow GPU but will still do some comparison there to see like there's three lines which ours is the most efficient and then tensorflow GPU on VR hundred and tens of low CPU on the same cpu machine we're using because we want to still compare with the CPU otherwise we can use a super-powerful CPU meshing and then that's the using that to beat the GPU is not necessary for people because we want people to do the same thing on CPU which is not that costly as using like a very hundred or miss kind of expensive GPU machines and also there's a lice finding about how many cores we include in our CPU computation so that we can beat the test flow GPU it seems like eight cores to 16 cores is enough to out perform the same task of on GPU we are hundreds so although we're in the in a paper where come using like a 44 corner machine but we do the obligation study to see where it intersects the performance is actually 8 to 16 course somewhere in the middle ok we done so with 8 to 16 cores you're able to see what outperform tensorflow on a GPU yes ok what what's the problem that you're that you're benchmarking this on its those are extremely classification tasks extreme classification and can you talk a little bit about it there are there established data sets and benchmarks for extreme classification can you talk a little bit about those yes so there's a whole repo about all those for example the example that I gave earlier for Amazon data it's usually like yeah you have Curie and you have all the products that's like extreme classification or sometimes you have user typing query and then you have a search results for the finally the website you are clicking that's like another case yeah so all those benchmarks are including one repo or the extreme classification field like people use those are stuff benchmarks for the Amazon you said how many classes are there for the data set we're using is 670 K okay and they use another delicious data set that's like the one year where you're predicting links yeah it was like 200 K so you've got the first set of diagrams where you've got the three lines and that one is looking at the slide accuracy and then you have time on the bottom what is the time is that like convergence time or else yes like training time training time that's looks go yeah okay so you've got pretty strong results here do you have you had anyone take this and implement it in practice kind of an industry or for a kind of a live application yeah I reserve a couple of emails of requesting we have a license there and then people can implement it and you know is extreme really of course there will be super interesting yeah they're working on it to include to there some kind of libraries in a future to there's one thing that I want to share so the oldest P does we're getting here is not about all those simple flaws or implementation tricks we don't even use cool glass or this kind of libraries for speeding up the computation which is use all the primitives and all the commutations were saving here because of the smart algorithms so because we're only in choosing 5% of the nose active nose so that the commutation was saved so I think there's framework if you'd like we can have further speed-up if those techniques tricks are applying to this but this is just like prototype which proved that we can do this with smart algorithms kind of going back to the results you found that the I found the number here the convergence time required is over two and a half times less than with tensorflow GPU and then I was looking for your accuracy number here how does it compare in terms of accuracy there's no accuracy job okay so you're able to maintain accuracy but converts more quality so you've done your benchmarking for this extreme classification task which has inherent sparsity to it are there other types of problems that you think that this would easily lend itself to do too because they also have high degrees of sparsity or do you see this as you know are there ways to kind of adapt this approach so that it you can apply it to use cases where they're not quite as inherently sparse so you're talking about the approach we're having approach you're taking and the applicability and you know what are the requirements of the scenarios in which it would applies you know would they have to be as sparse as the extreme classification tasks or can you see some ways that they can apply more broadly you can be applied broadly as well as I just mentioned for example we're currently working on if we can use this speeding up on the training of Bert which is very popular recently so the difficulty of applying this to an existing framework is we have to profile where is the computation bottleneck because speeding up the LOM bottleneck won't help and because we're trained we're producing some overheads as well and that's not gonna suite the whole process of any way so we have to find the bottleneck of the training and if this technique can intersect in this dispersity so if we do an experiment saying that we're replacing our algorithm with like the real top K and is decreasing the accuracy too much then it's not worth investigating but if we're for example choose the real top 5% of the active nodes and the accuracy still maintains they in this case there's hope for us to do this approximation because like them kind of like the ground truth works but now we just need to speed it up yeah yeah yeah so it depends yeah it depends on the architectures of the network and yes yeah so it's not something that you would envision ever being kind of a generic tool on in the toolkit that you would apply to different types of problems you have to kind of hand tailor the application of this technique to the different two different algorithms and where they're bottlenecked why I think this is a still generic is because in all those neural networks or architectures anyway reduce to the matrix multiplication problem once that one is there this should be useful but it's just like we need to see which where do we where the they were sent to apply to okay got it and we're also looking for like a GPU implementation of the same masses well because the major bottleneck for this technique currently going to GPU is because of the hashing algorithm because for those many years hashing only has the cpu implementation because it requires a lot of memory because it's remembering something right we save something in the hash table and recently we see a step forward to the dynamic hash tables and GPUs from UC Davis folks so for those I kind of were saying that it might be possible to implementing like a locality sensitive hashing on GPU and then this can further speed of GPU and that's like also a possible future direction of the framework awesome awesome well baby thanks so much for taking the time to share a little bit about what you're working on it's very cool stuff thanks\n"