MeshCNN - A Network with an Edge @ TWiML Online Meetup EMEA

The Art and Science of Generative Models: A Conversation with Rana

When it comes to presenting results from generative models, many people assume that displaying the rendered output takes a lot of time. With the rapid development of tools and techniques, however, this is changing: as came up in the conversation, "there will be more and more tools to help you do that."

One of the most exciting research directions for generative models is recreating 3D models from single-view images. This has interested many researchers, including Rana, whose work explores deep learning directly on 3D shapes. In this conversation, we touched on the idea that generative models could be used to recreate 3D models from images, and how this might be achieved through conditional generation.

Conditional generation is an interesting concept, and Rana highlighted its potential for creating new shapes and objects. It involves training a model that takes some input as a condition and produces output consistent with that condition, rather than sampling unconditionally from a noise distribution. In the case of 3D models, this could mean generating a mesh from a single-view image, or even creating an entire 3D scene.

One approach to generative modeling is the variational autoencoder (VAE). A VAE encodes each input into a lower-dimensional latent representation and learns a decoder that maps latent vectors back into the data space. Sampling the latent space then yields new examples similar to the training data, and this has been shown to be effective in a number of applications.
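To make that concrete, here is a minimal PyTorch sketch of a VAE. It is an illustration only: the class name, layer sizes, and the flattened 784-dimensional input are assumptions made for the example, not anything from the conversation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Encode an input to a latent Gaussian, decode back to data space."""
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 128)
        self.mu = nn.Linear(128, latent_dim)       # latent mean
        self.logvar = nn.Linear(128, latent_dim)   # latent log-variance
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    """Reconstruction error plus KL divergence to the unit Gaussian prior."""
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

After training, new samples come from decoding random latent vectors, for example `model.dec(torch.randn(10, 16))`.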

Another approach is the generative adversarial network (GAN). A GAN trains a generator, which maps random noise vectors to candidate samples, against a discriminator, which is trained to distinguish generated samples from real data; the goal is a generator whose samples are indistinguishable from real data. Rana highlighted the potential of combining GANs with other techniques, such as VAEs, to achieve even better results.
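A matching sketch of the adversarial training step, under the same caveats: every name and dimension here is an illustrative assumption, and a real setup would add batching, device handling, and stronger architectures.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 784
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, data_dim))                  # generator
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1))                         # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real):                      # real: (batch, data_dim)
    n = real.size(0)
    fake = G(torch.randn(n, latent_dim))
    # Discriminator: push real toward 1, fake toward 0. The fake batch is
    # detached so this step never updates the generator.
    loss_d = bce(D(real), torch.ones(n, 1)) + \
             bce(D(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: try to make the discriminator output 1 on fakes.
    loss_g = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

Detaching the fake batch in the discriminator step is the standard trick that keeps the two updates separate: the discriminator's loss never pushes gradients into the generator.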

In addition to generative models, Rana also touched on visualizing the internal workings of these networks, an important part of understanding how they behave. One approach is to use color-coding to highlight particular features or activations within the model. In Rana's work, this technique was used to visualize the model's output, including the edges and the segmentation between different parts of the mesh.

The field of generative models is evolving rapidly, and the potential applications are exciting. As Rana noted, one of the most promising areas of research is 3D data: generating new shapes and objects from scratch, or recreating existing ones from a single-view image.

In short, generative models offer a powerful tool for creating new shapes and objects, as well as recreating existing ones. Through techniques such as conditional generation, VAEs, and GANs, researchers are making rapid progress, and these methods will continue to shape the future of computer vision and 3D data.

The conversation also touched on visualizing CNNs more broadly, including color-coding to highlight different features or activations within a model, a topic that underlines how important it is to understand what these networks actually learn.

Another thread of the conversation was the potential for using generative models in conjunction with other techniques. Rana mentioned the idea of combining GANs with VAEs, and also the use of noise vectors to generate new shapes and objects. This is an area of ongoing research, and it will be exciting to see how these techniques continue to evolve.

In terms of future work, Rana noted that many different approaches are being explored, including voxel-based methods for generating 3D models. While recreating existing models from a single-view image is an interesting research direction, we are also likely to see more focus on creating entirely new shapes and objects with generative models.

Overall, the conversation with Rana provided a fascinating look into the world of generative models, and highlighted the potential for these techniques to shape the future of computer vision and 3D data. As researchers continue to explore and develop these ideas, we can expect even more exciting applications in the years to come.

Finally, Rana mentioned thinking about a couple of different ways to combine generative models with other types of data, not necessarily images. This is ongoing research, and it will be interesting to see how it evolves.

"WEBVTTKind: captionsLanguage: enwelcome to the ninth EMEA online meetup this time the main presentation is going to be about mesh CNN and network with an edge so it's all about applying convolutional nets on 3d data and that's going to be presented by Rana so if there's someone attending the first time we'll have a little introduction to the meetup and a small community discussion in ten minutes where we can discuss stuff you've seen or things you find interesting found interesting and then Rana is going to do the presentation and usually it's a 30 minute presentation 15 minutes question answers but normally we kind of mix this up so if someone has a question then maybe he can just ask a question via the chat or interrupt Rana if it's important to for you to understand something and yeah okay so for the next meetup we are still looking for a presenter so if you want to present a paper or propose a text special we recently did some pretty nice tech specials for example on Dhaka or on our general tools and so if you want presents something here then just reach out to me or Christian you don't have to be an author of the paper so Rana is really one of the authors of this paper but you don't have to be one if you want to just read through a paper and you don't have to be an expert this we threw pepper and presented to others you don't have to be an expert yeah at the moment there is a study group progressing on full-stack deep learning someone from the audience attending this study group I am personally I'm not attending no on ok so just reach out from slack there is the deep learning study group channel and there you can just ask how the progress is and what they are doing at the moment yeah just like I said if you have a question I'll mute yourself and ask grana or if you're catching some other time just submitted via the chat and I'm sure that cerana will also answer questions when you want to ask her after the presentation I have psych or LinkedIn or wherever yeah okay so does anyone have a topic talked about or something he saw in the last months that he found interesting just a mute yourself jump in I do have one or two things like always I found this paper very interesting the this is a paper where the office showed that the convolutional networks were used for image classification or image tasks are heavily relying on texture and not on on the shape of objects so he tried or they tried to to show the network pictures like this here for example so this picture would be classified as an elephant because of the texture texture and this would be classified as cat but while a human could see the cat in this picture we the work would be pretty sure that this is an elephant because of the texture and he so they showed that they're kind of bias towards is textured texture detection and he tried they tried some from more things like this for example so and they always compared this to to human recognition and as we all know modern CNN's are better for example on classifying images from the imagenet data set and better than humans and when when you start to show the network it just like this here for example only the silhouette of the of the cat or only the edges then it's the human again outperforms the those networks very very strongly that's very interesting I first thing I did not knew that I didn't know that that those nets are really good in finding textures you can see that in this visualizations of the intermediate features in the network but I didn't did not knew that this is so that 
that they rely so heavily on those textures. The authors also proposed a method to overcome this, which is also very interesting. They use style transfer to change the images into more shape-oriented images: you use it as a kind of augmentation method, you change the pictures with style transfer, and you essentially destroy the texture information in the images, so the model becomes more biased towards the shape of the objects and not the texture. They actually found that they can get better results on ImageNet this way than with just the original images. I found that very interesting; it's a new insight I didn't have. There are also some very interesting figures on the fraction of shape-based versus texture-based decisions of the network. I recommend reading that paper.

This one was also very interesting, I think: the Open Images V5 dataset released by Google. They added an enormous number of new segmentation masks for images, 2.8 million object instances, which is really a lot. So for anyone doing object segmentation, this could be interesting.

Also, I was really amazed at how good this deepfake stuff has become. (Do you hear the music in the background? Ah, OK, it's the video, sorry.) So this is a deepfake; they did it for a museum, and I recommend watching it. It's really interesting how well this works now, and kind of scary too.

And for all of you who are using PyTorch: Facebook open-sourced some of their internal tools. I personally did not have a look into them yet, but it seems pretty interesting; they are open-sourcing some tools they use for adaptive experimentation with PyTorch. For everyone who's using PyTorch, that could be very interesting.

OK, are there any questions so far, or anything you want to discuss? (Could you post the links later on?) Yes, I will post them in the thread; that's probably the best. OK, if there are no more points, then I'll stop sharing my screen, and Rana, if you're ready, you can click the share button and start with the main presentation. Can you hear me? Yes. OK, and if anyone wants to interrupt, that's fine.

Awesome. OK, so I'm Rana, I'm a PhD student, and I'll be talking about our work MeshCNN, which was recently accepted; I'm hoping to present it next month. So, just as an intro: convolutional neural networks work really well on images for tasks like classification and segmentation, and the reason they work so well is that they have convolution and pooling layers, which lead to a robust framework. Images are densely and uniformly sampled on a grid, which means that at evenly spaced intervals we have a pixel with an RGB value, and that makes it almost trivial to apply convolution and pooling on them. But our world is 3D; everything around us is complex geometry, tables and chairs, and many different industries, from robotics to autonomous vehicles to medical imaging and 3D printing, are interested in doing what's called 3D deep learning. In 3D deep learning we also train our networks to classify and segment 3D shapes. The problem is that the shapes are irregular, so applying convolution and pooling is not so straightforward. So, in a nutshell, here is what has happened in 3D deep learning.
I obviously can't talk about everything, but on the left-hand side are the first works, which tried to do 3D deep learning not directly on the shapes but by converting them to more regular representations. The first type are the multi-view works: take a 3D object, image it from many different views, and pass those views through parallel convolutional neural networks to get a classification. Another type, the volumetric approach, takes the 3D shape and converts it to a voxel volume; you can see there's a 3x3 grid here where every single cell is zero or one, occupied or unoccupied. This is really computationally expensive, because you have a lot of convolutions applied in 3D space which are essentially empty convolutions, and the resolution is really poor.

I think almost all the new works are in the class on the right-hand side, the direct approaches. There are some works on manifolds, which essentially look at local descriptors on non-rigid shapes like people (things that can move, not static rigid objects) and try to learn some type of descriptor that is invariant to different poses, so that people who are sitting or standing map to about the same feature vector. Then there has been a lot of work on graph convolutional neural networks, and that's probably the closest existing work to ours; those works usually show problems on generic data like social networks, general graphs, not necessarily specific to 3D. And the last type of works are point clouds. The point cloud representation is just a list of 3D coordinates in space, and the idea behind a point cloud is generally to sample a shape relatively uniformly over the entire surface, so that we can get some type of understanding of the shape structure. This is obviously very sparse and irregular, so there has been a lot of work to develop irregular operators for point clouds.

Point clouds are great, but in this work we want to develop irregular operators for meshes. A mesh is essentially a representation for a 3D object, and I'll explain exactly what that is, but in general it's something that is sparse, irregular, and not uniform, so it's also difficult to think about how to apply convolutions on it. So what is a mesh? It's essentially a list of vertices (3D coordinates in space) and faces. In this case we're talking about triangular meshes, so each face is three vertex indices that make up a triangle. This list should be permutation invariant, which means it doesn't matter in which order the vertices or the faces are stored; we should still get the same 3D object in space. The problem is, if you look at this vertex here on the left, it has three neighbors, and this other one has six; there are infinite possibilities for the number of neighbors, and there is also no canonical order, so it's not very clear how to define a convolution on vertices.

So our idea was to use edges. There are three edges in a triangle, and an edge is just given by its two vertex IDs. The nice thing about edges is that every single edge in a triangular mesh is incident to exactly two faces, and each of those faces contributes two other edges, so each edge has a nice fixed-size neighborhood of four edges.
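(To make this concrete, here is a small Python sketch, reconstructed from the description above rather than taken from the MeshCNN code, that builds each interior edge's four-edge neighborhood from a triangle list; the function name is hypothetical.)

```python
import numpy as np

def edge_one_rings(faces):
    """From an (F, 3) array of triangle vertex indices, return the list of
    unique edges (as sorted vertex pairs) and, for each interior edge,
    its four one-ring neighbor edges (two from each incident triangle)."""
    edge_id = {}     # sorted vertex pair -> edge index
    edge_faces = {}  # edge index -> indices of incident triangles
    for f, tri in enumerate(faces):
        for i in range(3):
            key = tuple(sorted((int(tri[i]), int(tri[(i + 1) % 3]))))
            e = edge_id.setdefault(key, len(edge_id))
            edge_faces.setdefault(e, []).append(f)
    neighbors = {}
    for key, e in edge_id.items():
        ring = []
        for f in edge_faces[e]:
            tri = faces[f]
            for i in range(3):
                other = tuple(sorted((int(tri[i]), int(tri[(i + 1) % 3]))))
                if other != key:
                    ring.append(edge_id[other])
        if len(ring) == 4:   # boundary edges have one face, so only two
            neighbors[e] = ring
    return list(edge_id), neighbors

# Example: two triangles sharing edge (1, 2); that edge has four neighbors.
faces = np.array([[0, 1, 2], [1, 3, 2]])
edges, rings = edge_one_rings(faces)
```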
The exception is edges on the boundary, which we pad with zeros. (Rana, just one quick question about edges: does an edge mean that there has to be an angle between two triangles? When you have an angle between two triangles, that's clearly an edge; but what about here, on a flat surface, where there is no angle between the two triangles? Is that also an edge?) Yes; the edges are defined by the data structure of the mesh, so for example here there is no angle between these two faces, and that's fine, this is still an edge.

OK, so why use meshes over point clouds? I think there are two main advantages to the mesh. The first is that meshes are really efficient: we can represent this large flat surface with only two triangles, whereas with a point cloud we would need to sample the surface uniformly with a lot of points. On the other hand, if we have finely detailed regions, we can use a lot more triangles there to represent the details more accurately. So it's an efficient representation, and it also explicitly contains the surface information. For example, if you saved this shape as a point cloud, you might not necessarily know which points belong to the left versus the right leg of the camel; but with a mesh, if you look at this point here versus this one, you can see that the distance along the surface is very large. That is what's called the geodesic distance, the distance on the surface, and it lets you easily separate the left and the right legs.

So, now that you understand why meshes are awesome: we want to run a convolutional neural network directly on the mesh, and for that we need to develop irregular operators. In our work, edges are like pixels in images. In images, every pixel starts with an RGB value; with meshes we had to make something up, because there's nothing specifically defined for that. I don't think it really matters much, so we just went with something that seemed like a reasonable, simple geometric feature. For every edge we extract five geometric features: the dihedral angle between the two incident faces, the two inner angles opposite the edge (one in each triangle), and, for each of the two triangles, the ratio between the edge's length and the triangle's height. These features are invariant to similarity transformations, which means we can apply rotation, translation, or uniform scaling to the mesh and we'll still get the same five-dimensional feature vector, which is nice.
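(As a rough illustration of those five input features, here is a NumPy sketch. It is one interpretation of the description above; the exact angle and ratio conventions in the released MeshCNN code may differ, and the helper names are hypothetical.)

```python
import numpy as np

def vec_angle(u, v):
    """Angle between two 3D vectors, in radians."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def edge_input_features(verts, edge, opposite):
    """Five similarity-invariant features for an interior edge.
    verts: (V, 3) vertex positions; edge = (i, j) endpoint indices;
    opposite = (c, d), the vertices opposite the edge in its two
    incident triangles. Returns [dihedral angle, inner angle 1,
    inner angle 2, edge/height ratio 1, edge/height ratio 2]."""
    a, b = verts[edge[0]], verts[edge[1]]
    p, q = verts[opposite[0]], verts[opposite[1]]
    feats = [vec_angle(np.cross(b - a, p - a),   # dihedral angle between
                       np.cross(q - a, b - a))]  # the two face normals
    for v in (p, q):
        feats.append(vec_angle(a - v, b - v))    # inner angle opposite edge
    edge_len = np.linalg.norm(b - a)
    for v in (p, q):
        # Triangle height onto the edge: twice the area over the edge length.
        height = np.linalg.norm(np.cross(b - a, v - a)) / edge_len
        feats.append(edge_len / height)
    return np.array(feats)
```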
OK, so as a refresher: how does convolution work on images? You have this blue input image and this gray convolution kernel, and you take a dot product, an element-wise multiplication and summation of the kernel and the spatial support in the image, and out comes this feature activation map. The reason this is easy to do is that there is always a consistent number of pixels (nine here) and always a consistent order in which we can apply the dot product. Actually, this reminds me of the work that was presented earlier: convolution aggregates a lot of local information about the image into these feature activations, and that's part of what has made CNNs so strong for so long. We want to do the same thing for meshes.

So we look at every single edge, as I said before, with what's called its one-ring neighbors. Each edge has a feature vector; it starts out as the five-dimensional geometric feature vector, but as we learn convolutional filters it will grow in size. We want to apply a convolutional filter on the edge's feature vector and the four neighboring edges' feature vectors. The problem is that we don't want to be sensitive to the initial ordering of the vertices and faces, so we needed a way to apply the convolution in a consistent way. In meshes there is the face normal direction, which means each face's vertices are ordered counter-clockwise, so every single face has an order. The problem is that we don't know which face to start with; that is ambiguous. Essentially, if we start at edge e and look at the four neighbors, we could either start with a or with c; those two are an ambiguous pair. And the next feature would be b if we started with a, or d if we started with c; so the blue pair and the red pair are ambiguous with each other. The solution is that instead of applying the convolution directly on the features of, say, a, we apply it on the features of a plus c: all the features of edge a and all the features of edge c, added together element-wise. That gives a new symmetric feature which doesn't care whether the order was (a, b, c, d) or (c, d, a, b), and we do the same thing with subtraction followed by absolute value. So from these four neighboring edges we build four symmetric features which are invariant to the original ordering of the mesh.

Just to summarize everything I've said so far: we have a mesh; we go through the list of edges; we extract a simple five-dimensional geometric feature for every single edge; we look at the one-ring neighbors and use the normal direction to build a convolution that is order-invariant; we learn a certain number of filters, apply them on the edges, and continue this process. We started with a five-dimensional feature vector for each edge, and now we have a k-dimensional one.

(I think it's interesting that it works so well on these geometric features, because it's not intuitive: the convolutions applied to images learn these cascades of filters that intuitively make sense, so it's very interesting that it also works on geometric features of the triangles.) Yes. Some people take images and convert them to a different color space or some other basis before the convolutional nets, and maybe that gives some very small boost in performance, but I think most people really focus on the architecture of the network and say that the network will learn the important features; it doesn't really matter what it gets as input, the network will learn some abstract representation of whatever is happening at the input. I think the idea here was to show something similar: for meshes we don't necessarily need to extract very complex, high-dimensional features that are very powerful and descriptive. Instead we want the network to learn what's powerful and descriptive, and we input something really simple.
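(Here is a PyTorch sketch of that symmetric convolution, assuming edge features of shape (batch, channels, n_edges) and a precomputed (n_edges, 4) neighbor index tensor; the shapes and the module name are illustrative, not the authors' implementation.)

```python
import torch
import torch.nn as nn

class SymmetricEdgeConv(nn.Module):
    """Convolution over an edge e and its one-ring (a, b, c, d), made
    invariant to the two ambiguous orderings by replacing the ambiguous
    pairs with symmetric functions: |a-c|, a+c, |b-d|, b+d."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # One 1x5 kernel slides over [e, |a-c|, a+c, |b-d|, b+d].
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 5))

    def forward(self, x, nbrs):
        # x: (batch, in_ch, n_edges); nbrs: (n_edges, 4) long tensor.
        a, b, c, d = (x[:, :, nbrs[:, k]] for k in range(4))
        stacked = torch.stack(
            [x, (a - c).abs(), a + c, (b - d).abs(), b + d], dim=-1
        )                                      # (batch, in_ch, n_edges, 5)
        return self.conv(stacked).squeeze(-1)  # (batch, out_ch, n_edges)
```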
OK, so in images we do pooling, and pooling as it's defined and used right now works for regular images, but it's not obvious how to do it for an irregular structure like a mesh. So we wrote down a general definition of pooling, such that image pooling is a special case of it and our mesh pooling is another special case. The three core operations of pooling are: first, define the pooling regions (in this image example the region is given by a 2x2 window); second, merge the features in each region (for max pooling, take the max); and third, redefine adjacencies. In images this last step is trivial: this 7 and this 9 are far apart before the max pooling, and afterwards they are neighbors in the new image grid, but we never need to define that explicitly. With meshes, we do need to redefine adjacency. The reason pooling is important is that it makes the network learn a stronger, more robust representation.

The inspiration for our mesh pooling was taken from a classic technique in computer graphics called mesh simplification. The goal of mesh simplification is to reduce the number of mesh elements while preserving the appearance of the overall original shape. There are several ways to do it, but the most common is probably quadric-error edge collapse, which essentially iterates over the edges and removes the edge whose removal creates the least distortion to the overall shape. The way that works: say we want to delete this red edge. The red edge collapses to a single point, the two edges in one of its faces collapse to a single edge, and likewise the two edges in the other face; so in one edge collapse, three edges are removed. Instead of a geometric criterion for the edge collapse, we wanted the network to drive the edge collapse. This is interesting for a couple of reasons. First, it allows us to visualize what the network learned, what the network decided was important, which edges to delete; that gives us some cues about how robust the model really is, in the spirit of work on adversarial attacks and model interpretability. And of course, as I said before, pooling allows the network to strengthen the features it has learned.

The way it works is essentially that we delete edges with a small feature activation: for every single edge of the mesh we compute the norm of its feature vector, and the edge with the minimum norm is the one we decide to delete. The deletion has two main steps. The first is to aggregate the features: in this case we pool this edge, so the features of edges a, b and e are averaged element-wise and we get a new edge feature p, and the same happens on the bottom. Then we need to update the mesh, because now we have a different mesh; you can see that this edge here, for example, now has new adjacencies for the next convolution, so we update the mesh data structures to get those new convolutional neighbors.
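(A much simplified sketch of one such pooling step: pick the edge with the smallest feature norm and average each incident triangle's three edges into one. The real layer also rebuilds the mesh connectivity and repeats this until a target edge count is reached, which is omitted here.)

```python
import torch

def mesh_pool_step(x, nbrs):
    """One learned edge collapse on features x: (channels, n_edges),
    with nbrs: (n_edges, 4) one-ring indices. Collapses the edge whose
    feature vector has the smallest norm; each incident triangle's three
    edges are averaged element-wise into one surviving edge."""
    e = int(x.norm(dim=0).argmin())              # edge with minimal activation
    a, b, c, d = nbrs[e].tolist()
    x[:, a] = (x[:, e] + x[:, a] + x[:, b]) / 3  # triangle 1 merges into a
    x[:, c] = (x[:, e] + x[:, c] + x[:, d]) / 3  # triangle 2 merges into c
    removed = [e, b, d]                          # three edges disappear
    return x, removed
```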
We also developed an unpooling layer, which is pretty much the same as how it's done in images, with a couple of small changes; the main idea is the same. We restore the topology: in image unpooling you remember the indices of the max, and here we remember what the topology was before the pooling layer and restore it completely. Then we need to create the unpooled features. In images these usually start as zeros; we thought it might be more sensible to use a weighted average of the pooled feature vectors, though I don't know that it matters much, because a convolution comes right afterwards.

Something I mentioned before: because the network chooses which edges to delete based on their feature activations, if we train one network to classify whether vases have handles, and another one to classify whether they have a neck, the simplifications, the mesh poolings, will look different. This is a visualization of what that might look like: in the top example most of the handles are left, and in the other example the necks are preserved.

So we take the convolution and pooling layers and put them together sequentially, with non-linear activations (ReLU) and normalization in between; we're using batch norm or group norm. Then it depends on the task. For classification we apply what was originally proposed in the PointNet work: they have a list of points, and we have a list of edges, so in order to be invariant to the global order of the edges we apply a global symmetric function, a max or an average across each feature channel over all the edges. So for the first feature channel you take the average or the max over all edges, and so on; then you can apply fully connected layers and a cross-entropy loss to classify (a minimal sketch of this head follows below). Segmentation is fully convolutional, so there are no fully connected layers: after the pooling layers we have unpooling layers, a U-Net type architecture, and we predict the probability of each edge belonging to a particular segmentation class, whereas classification predicts the overall class of the shape.
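(And a minimal sketch of that PointNet-style global pooling head: a symmetric reduction over the edge dimension followed by fully connected layers. Dimensions and names are illustrative.)

```python
import torch
import torch.nn as nn

class GlobalEdgeClassifier(nn.Module):
    """Order-invariant classification head: a symmetric function (here a
    mean, could equally be a max) over all edges, then fully connected
    layers producing class logits for a cross-entropy loss."""
    def __init__(self, in_ch, n_classes):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_ch, 64), nn.ReLU(),
                                nn.Linear(64, n_classes))

    def forward(self, x):        # x: (batch, channels, n_edges)
        pooled = x.mean(dim=2)   # invariant to the global edge ordering
        return self.fc(pooled)
```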
In the interest of time I'm going to skip over this; it's not so interesting. We showed a couple of applications. The first is on the SHREC dataset, which has 30 different classes of shapes: people, ants, aliens, things like that. What was interesting: here you see the people, and these are the intermediate mesh simplifications, the mesh pooling results. You can see the network actually removes the heads of the people, which is kind of funny, while for the ants it removes the legs. So we started to notice that the network, especially when it did well, was doing something similar within each of the classes: for the same class it simplified the shapes in a similar way.

(It's very interesting that you can have this insight, and that you can reconstruct the shapes from these intermediate representations. It's much better than for images, where it's often not obvious what you're looking at when you inspect reconstructed intermediates; for this application it's better.) Yeah. We're actually working on something now where we're converting images to meshes and trying to fit them, and because there's no geometry in an image, we have a similar problem there.

Here are some results for segmentation, where we also observed the same kind of thing we saw in classification (these are all supervised, of course). What we noticed, and this is the test set, is that the back of the chair, for example, was the first thing the network started to collapse; after that the seat; and it usually didn't really get to the base of the chair. So it was doing this similar parsing, consistent over a particular class, which was interesting. We also applied this to humans. And this is really cool: this is a result someone posted on our GitHub page; they were using this for real 3D body scans. So even though this looks academic, they were able to use our pre-trained network, and it worked well on their body scans, which is pretty cool.

I don't have that much time left, so maybe I'll skip over this, but the general idea is that we built a specific dataset, because the previous datasets weren't necessarily emphasizing the power of what meshes can really do. In this dataset we engraved simple icons into cubes; there are 23 different classes of icons, the heart is one class and the apple is another. Because the engraving is so shallow, when you sample it (as a point cloud, for example) you can hardly detect the difference visually. We did very well on this classification, and these are some of the intermediate simplifications: you can see that mostly the background edges are removed, and most of the inside of the icon is preserved.

This slide shows that for the same exact shape we can create what are called different triangulations, different meshings; the same exact shape can be represented in several different ways, probably infinitely many. We showed that even with some simple data augmentations we can be robust to these variations. (And you're using these different triangulations for augmentation? Did I understand that correctly?) For augmentation we do something called edge flips, and we also shift the vertices around, so we're not actually generating entirely new triangulations. This result was pretty impressive: we trained on this type of triangulation and tested on this one. Both of these are from the test set, but this one was triangulated in a way similar to the training set, and this one really differently. What we did is, for example, we took the vertices here and moved them around a little bit, and we did these edge flips, which means taking this pair of triangles and flipping the shared edge around.
Essentially, what that does is create a different meshing. It's nothing too complicated, very simple; but if we could easily generate a ton of different meshings for the same object, that would be a great way to do augmentation.

A couple of interesting things I noticed about this work. We wanted to create an example where we could really show the power of the mesh; this was before the cubes. We wanted to classify humans based on their identity, but in this dataset, which has 10 different people, the body structures are really different: you have tall people and short people and muscular people, and it's really easy to tell who's who just from the global body geometry. So we wanted to force the classifier to look at the small details in the face. We swapped the heads of all the different people onto all of the different bodies, and then we trained our network, and also the point-based networks, to classify the shapes based on the face. So the networks got a face, which sometimes sat on any of the bodies, and they were supposed to classify which person it was. The result was that we did really well, and the point-based networks also didn't do badly; they got something like ninety percent, and ours got something like 99 percent. It was really easy, and we sort of gave up on it; we didn't really understand why that was the case. But when I looked at the simplifications, I realized what the issue was. The networks were simplifying the body, and at some point the network simplified the features in the face as well. So the network completely ignored what we wanted it to look at; it was probably just looking at the contour of the face, which was enough to determine who the person was, and it didn't need the fine-grained details. It was pretty cool to see that explanation behind what we were observing.

So, in conclusion: we built this mesh convolution, which is invariant to similarity transformations (rotation, translation, scale) and to the ordering of the vertices and faces; and we built this mesh pooling layer, which strengthens the learned representation but also gives us visual insight, clues about what's happening inside the network. We have a few future directions for this work. One is to make it generative, maybe with GANs; another is to make it work on general graphs; and another, like I said before, is bringing images to meshes.

Thank you, Rana, very interesting. Does anyone have a question? (I was super intrigued when the topic came up in the Slack; I was immediately reading the paper and thinking how I could use it in my work, because it seems so different and so impressive in its possibilities, but I'm not quite sure how I could apply it. I was wondering about lidar data, basically having topography scans: could you use this to classify certain landscape features, things that are probably not obvious from images but rather from radar-based or lidar-based topography scans? Could that be useful for classifying features in the landscape, things like that?)
So, is the lidar data given to you as a point cloud? (Yes; the immediate step would be to somehow generate surfaces from it, I guess.) OK. There are some classic works that take point clouds and convert them to meshes, so that's one possibility. Those 3D body scans I mentioned must also be sampling some surface, so they presumably also start as a point cloud and are then converted; I'm not exactly sure. So I think it could work. Also, with lidar (it's been a while since I thought about it), I think it also gives you a signal in terms of the angle of the surface the beam bounced off, so there might be some more information there other than just a point at some distance. (I think so too. I'm no expert in this field at all, I'm mostly doing optical remote sensing, but there are also the different phases for SAR radar and satellite data. So maybe if one finds an intermediate step, that could work; that would be interesting.) I'm also not a 3D expert, but I guess you can make some simplifications and assumptions when you do something like a body scan: you can assume that this is one surface from the body. If you really want to do landscape classification, it's probably hard to tell which surfaces belong together. (I think you basically have to do some kind of spatial smoothing, right? Neighborhood lookups to check whether it's continuous; in some places it can bounce around wildly. But yeah, thanks again, Rana, for this super interesting talk. Super cool. For me, one of the most interesting parts was that you can look into all those intermediate simplifications; and you said yourself that that's how you found out why the classification on the faces did not work.) Yeah. In general I feel like with networks (actually, in the first paper that was presented as well) you can see that we don't really understand why CNNs work the way they work. We can tell it's a cat because of the contour, but the network is looking at something completely different, and I think it actually shows that they're not as robust as we think they are, and that they didn't actually learn something super meaningful; I don't know that they're really generalizing. (Yeah, it's just like the example I showed with the textures. People did not know that.)

Can I ask another one? Maybe I missed it in the paper, but I was wondering about the computational complexity, versus for example the multi-view approaches you mentioned, where objects are imaged from around and then it's regular 2D classification. It depends on the number of vertices, I suppose, but this update of the mesh after the pooling: is that a huge computational drag on the overall system? Just a ballpark number. OK, so in the paper I think we wrote that it takes roughly 0.1 seconds for a forward and backward pass of one mesh. That's for sure slower than what we'd see with images, but it's on par with, if not faster than, most 3D deep learning approaches today.
OK, and is there (I forgot to ask this earlier, sorry) a maximum number of vertices you can handle, where otherwise it's just too much? Do you simplify down to a maximum number of nodes? Yes: in the beginning (this is the part I skipped) we downsample, just like everyone downsamples images to roughly the same size to make training uniform; we downsample all the meshes to roughly the same size, a couple of thousand edges. It doesn't need to be exact, because our pooling layer can handle different resolutions. The convolution itself is just as fast as a regular image convolution; the pooling is the bottleneck for speed, because it's a sequential operation: we look for the smallest edge, collapse it, and continue. So there's definitely room for improving and optimizing it. (OK, thanks. I did not realize that the pooling operation can't be parallelized.) Yes; we did a couple of things to make it work pretty fast. Instead of collapsing an edge and then updating all the data structures each time, we only update the data structures once, at the end of every pooling layer, so it's a bit faster than the naive iterative approach, but it could definitely be done faster.

Any more questions? I'm just looking at the chat. Alvaro is asking: would adding different hairstyles and putting on different hats help make the models focus more on the faces? I don't think so. Basically, the network was learning to ignore everything that wasn't important to it; in this case it could ignore the entire body, the eyes and the nose, and really just cared about the contour. So I think if you added a hat, it would just ignore that as well.

OK. I think this is going to have huge potential in the future, really, because (and I'm really not an expert by any means on 3D data) all the approaches I've seen so far are basically infeasible because of the computational power they need at this point, and I think this is a very interesting step forward in this direction. Do you see any implications for computed tomography data, for example? So, I'm not an expert on medical imaging; I think the way CT works is that it takes thin slices of something and rebuilds the volume. I don't know; certainly, if you can convert that volume to a mesh, which is actually pretty straightforward (easier than starting from a point cloud), and I guess the resolution there is pretty good, then I guess it could work pretty well. Obviously, right now these are supervised tasks, so you need labels and data and so on.

Can I ask about data? Again, I'm also not an expert on 3D data at all, but I imagine that there are far fewer datasets for training models, labeled 3D datasets, compared to image data. Yes; everything that happens with images takes a couple of years to happen in 3D. ImageNet came out, and then AlexNet (I don't remember which was first); AlexNet was in 2012, and the equivalent of AlexNet for 3D deep learning was in 2015. We're a couple of years behind on everything, so also with the datasets: we have fewer datasets and less
labeled data, but it's also growing. There are big repositories like 3D Warehouse, and ShapeNet from Stanford, and something very recently came out called the ABC dataset, which has millions of small parts, related to things like 3D printing.

I might add something about the datasets. One important distinction is that with images we have a pretty consistent data structure, while with 3D models, even within the datasets, we have a lot of variation, even in quality and in the way the models are built. There are a couple of existing datasets, like ShapeNet for example, but a lot of the research is done on models downloaded from something like 3D Warehouse, and that's just, you know, people: sometimes someone using Blender for the first time, sometimes someone making a very nice watertight 3D model for printing, and sometimes it's just a cube with a few spheres. That's another reason why the research is quite difficult: you have so much added complexity and variation in the data. And the number of models in a certain category is also small. It's easy to get millions of photos of cats from the internet; it's not as easy to get even thousands of models of different cats.

That's what I was wondering: whether there's any kind of driver. For images it's obviously the big search engines that are generating these huge datasets, or harvesting this information; is there anything similar on the horizon for 3D? Though 3D printing kind of makes sense. There are so many different industries that use 3D data, and as a result there are some places where you can find 3D models created with photogrammetry, for example, which produces one type of 3D model; then you have people doing 3D models for gaming, which again is a different model structure. So this makes it difficult to feed it all through a network in a consistent way; whatever you put into the network, the results may vary. Cool, interesting.

And Rana, one last question for myself: are you coming from the field of computer graphics and 3D imaging, or from the field of deep learning? I guess I started out doing computer vision before deep learning, and then deep learning caught on, but I was doing classic machine learning too. So I originally started out with images and regular machine learning, then images with deep learning, and then 3D. Interesting. I always like it when people do this kind of foundational work, because it's much harder. (I mean, I always thought 3D was really cool, so...) It absolutely is. When you do something on images, it's so easy: you can choose from a hundred different datasets to try your ideas and develop your stuff, whereas here you have to do so much work on the foundations first. That's really cool. Even something as simple as presenting the results: you'd think it takes a lot of time to display the rendered output, but I guess as this kind of work goes on, there will be more and more tools to help you do that.

I also have a question, about your last slide on future work and generative models.
What type of work were you considering with generative models, and could it be used for recreating 3D models from single-view images? I've seen some work with voxels doing that, retrieving models from a 2D image or sometimes from multi-view data. Could this be used for recreating 3D models from images?

OK, so when I wrote generative models, I was thinking about GANs. (You mean conditional GANs, for example?) OK, conditional; that's a personal thing of mine, that a lot of people say GAN when it's really just a one-to-one mapping, really just a network, like CycleGAN and similar works. So, in the case of taking an image and converting it to a shape: our method works directly on the 3D mesh, so if the input is an image rather than a mesh... (With conditional GANs you have two sources of data and you produce the image, so you could have something similar; the question is whether you can combine one source of 2D data with the generated model.) Yeah, there are definitely different ways; I'm thinking of a couple of different ways right now to combine it with something, not necessarily images but something else. But when I wrote generative models here, I meant generating from a noise vector: having an actual mesh come out of some type of black box, not necessarily conditioned on some specific image, but sampling some noise distribution and having good meshes come out of that black box. (I think there are some voxel GANs; it would be interesting to see it with a mesh GAN.) Yeah, I think it's going to be hard, much harder than voxels, because really small inconsistencies in the angles of neighboring faces can make a mesh look really bad.

(So you're talking more about things like variational autoencoders?) A variational autoencoder is one approach to generation, right: you have an input mesh, you collapse it to some latent vector, and you decode that back to the original mesh; then at inference time I guess you can sample that latent space and get different meshes at the output. So that's one approach. Another approach is just taking a random vector and inputting it into a generator, which generates a mesh, and then you have a discriminator which says whether it's real. So there are many different ways we're thinking about; I'm not exactly sure how we're going to do it, but it's sure going to be interesting.

OK, if there are no more questions... oh, one more, about the way to visualize CNNs: do you resort to some visualization techniques between the layers? OK, so if I understood correctly: when I showed these visualizations, that is what's happening between the layers, but in terms of the mesh itself, the topology; I'm not specifically visualizing the actual feature activations themselves. (Right, I guess he means also visualizing how the feature maps look themselves.) We did try to plot the feature activations on these meshes, but we couldn't really get visualizations that were informative; it was not
really clear, it was a bit confusing. (So you just color-coded the edges?) Yes; actually, for segmentation, I think we did do something, I had forgotten about that. We color-coded the edges with the segmentation in between, to see the high-value versus the low-value activations, and it was pretty similar to what we saw with the simplification visualizations, which is why we didn't include it. But I think if we had had more time before the submission, I would have added it.

OK, if there are no more questions, I think we'll close for now. Rana, thanks a lot, that was really very interesting. I'm really looking forward to seeing more work on 3D data; I think that's really going to be important in the future. Thanks everyone, and see you next time!
textured texture detection and he tried they tried some from more things like this for example so and they always compared this to to human recognition and as we all know modern CNN's are better for example on classifying images from the imagenet data set and better than humans and when when you start to show the network it just like this here for example only the silhouette of the of the cat or only the edges then it's the human again outperforms the those networks very very strongly that's very interesting I first thing I did not knew that I didn't know that that those nets are really good in finding textures you can see that in this visualizations of the intermediate features in the network but I didn't did not knew that this is so that they're heavily relying on those textures and they also proposed a method to overcome this which is also very interesting so they are using star transfer to kind of change the images - yeah more shape oriented images so you always use this kind of an augmentation method and you change the pictures with this time transfer and you kind of destroy the texture information in these images and so you get more bias towards the shape of the objects and not the texture and actually they found out that they can get better results on on image net than which the original just with the original images and yeah that found that very interesting it's a new insight I did not knew yeah there are some interest some very interesting figures on the fraction of yeah shape based decisions and texture based decisions of the network you can kind of test that yeah I recommend reading that paper also this one was very interesting I think the open image v5 dataset released by Google and they added an enormous amount of new segmentation masks for images yeah 2.8 million object instances that's really really a lot so for anyone doing object segmentation this could be interesting also yeah I was really amazed how this deep fake stuff that someone is do you guys hear the music in the background ah okay it's the video sorry start at the video so I'm really amazed on how good the deep fakes has become so this is a deep fake sorry and they are they did this for a museum and that's I recommend watching this that's really interesting it's really interesting how how good this works now and kind of scary also and for all of you who are using PI torch Facebook open sourced some of the internal tools and I didn't personally did not have a look into this but that seems pretty interesting they're open sourcing some tools they use for adaptive experimentations with with by torch so for everyone who's using patterns that could be very interesting okay are there any questions so far or anything you want to discuss can you trust like later on just post the links to your yes the tread yeah I will post that in it's like it's probably the best yeah okay so there are no more points then I would stop sharing my screen and Rana if you already you can click the share button resume chat and start with the main presentation span you hear me yes okay so yeah if you if anyone wants to interrupt so that's fine awesome okay so yeah i'm i'm rana I'm a PhD student at University and he's talking about her work my CNN which was accepted years ago hoping for something it next month so just sort of but an intro right convolutional neural networks work really well on images for a task like classification and segmentation and the reason that they work so well is they have a convolution and pooling layers which lead to a 
Also very interesting, I think: the Open Images V5 dataset released by Google. They added an enormous amount of new segmentation masks for images, 2.8 million object instances, which is really a lot. For anyone doing object segmentation this could be interesting.

I was also really amazed at how good deepfakes have become. Do you guys hear the music in the background? Ah, okay, it's the video, sorry, it auto-started. So, I'm really amazed at how good deepfakes have become; this is a deepfake they made for a museum, and I recommend watching it. It's really interesting how well this works now, and kind of scary too.

And for all of you who are using PyTorch: Facebook open-sourced some of their internal tools. I personally haven't had a look into them yet, but it seems pretty interesting; they're open-sourcing tools they use for adaptive experimentation with PyTorch (Ax and BoTorch). For everyone who's using PyTorch, that could be very interesting.

Okay, are there any questions so far, or anything you want to discuss? Can you post the links later on? Yes, I'll post them in the thread, that's probably the best. Okay, if there are no more points, then I'll stop sharing my screen, and Rana, if you're ready, you can click the share button and start with the main presentation. Can you hear me? Yes. Okay, and if anyone wants to interrupt, that's fine.

Awesome. So, I'm Rana, a PhD student at Tel Aviv University, and I'll be talking about our work MeshCNN, which was accepted to SIGGRAPH; I'm hoping to present it there next month. Just as an intro: convolutional neural networks work really well on images for tasks like classification and segmentation, and the reason they work so well is that they have convolution and pooling layers, which lead to a robust framework. Images are densely and uniformly sampled on a grid, meaning that at evenly spaced intervals we have a pixel with an RGB value, and that makes it almost trivial to apply convolution and pooling on them.

But our world is 3D; everything around us has complex geometry, tables and chairs, and many different industries, from robotics to autonomous vehicles to medical imaging and 3D printing, are interested in doing what's called 3D deep learning. In 3D deep learning we also train networks to classify and segment 3D shapes. The problem is that the shapes are irregular, so applying convolution and pooling is not so straightforward.

In a nutshell, here's what has happened in 3D deep learning (I obviously can't talk about everything). On the left-hand side are the first works, which tried to do 3D deep learning not directly on the shapes but by converting them to more regular representations. The first type are multi-view works: take a 3D object, image it from many different views, and pass the images through parallel convolutional networks to get a classification. Another type, the volumetric approach, converts the 3D shape into a voxel volume; you can see there's a three-by-three grid here where every cell is zero or one, occupied or unoccupied. This is really computationally expensive, because a lot of convolutions are applied in 3D space over mostly empty cells, and the resolution is really poor.

I think almost all the new works fall into the class on the right-hand side, the direct approaches. There are some works on manifolds, which essentially look at local descriptors on non-rigid shapes like people, things that can move rather than static man-made objects, and try to learn a descriptor that is invariant to different poses, so that people sitting or standing get about the same feature vector. Then there has been a lot of work on graph convolutional neural networks, which is probably the closest work to ours; these works usually show results on graph data such as social networks, really just general graphs, not necessarily specific to 3D. The last type of works are on point clouds. The point cloud representation is just a list of 3D coordinates in space, and the idea is to sample the shape relatively uniformly all over, so we get some understanding of the shape's structure. This is obviously very sparse and irregular, so there have been many works developing irregular operators for point clouds.

Point clouds are great, but in this work we want to develop irregular operators for meshes. A mesh is essentially a representation of a 3D object; it's sparse, irregular, and non-uniform, so it's also not obvious how to apply convolutions on it. So what is a mesh? It's a list of vertices, 3D coordinates in space, and a list of faces, each given by three vertex indices; we're talking about triangular meshes here, so the faces are triangles. These lists should be permutation invariant: it doesn't matter in which order the vertices or the faces are stored, we should still get the same 3D object in space.
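As a minimal sketch of that representation (not the authors' code), a triangular mesh and its edge list might look like this:

```python
import numpy as np

# A triangular mesh: vertices are 3D coordinates, faces are triples of
# vertex indices. The object is unchanged under any permutation of the
# rows of either array.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2],
                  [0, 1, 3],
                  [0, 2, 3],
                  [1, 2, 3]])   # a tetrahedron

def extract_edges(faces):
    """Collect the unique undirected edges; each face contributes three."""
    edges = set()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    return sorted(edges)

print(extract_edges(faces))   # the 6 edges of a tetrahedron
```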
The problem is, if you look at this vertex here on the left, it has three neighbors, and this other one has six; there are infinitely many possibilities for the number of neighbors, and there is also no canonical order, so it's not at all clear how to define a convolution on vertices. Our idea was to use edges. There are three edges in a triangle, and each edge is just given by its two vertex IDs. The nice thing about edges is that every single edge in a triangular mesh is incident to exactly two faces, and each of those faces contributes two other edges, so each edge has this nice fixed-size neighborhood of four edges, except for edges on the boundary, where we pad with zeros.

Rana, there's just one question: does an edge require an angle between the two triangles, or is it also an edge when, for example on a flat surface like here, there is no angle between the two triangles?

Yes, it's still an edge. The edges are defined by the data structure of the mesh; for example, here there's no angle between these two faces, but this is still an edge.

Okay, so why use meshes over point clouds? I think the mesh has two main advantages. The first is that meshes are really efficient: we can represent this large flat surface with only two triangles, whereas with a point cloud we would need to sample the surface uniformly with a lot of points. On the other hand, in finely detailed regions we can use many more triangles to represent the details more accurately. So it's an efficient representation. Second, it explicitly contains the surface information. For example, if you represented this shape with a point cloud, you might not necessarily know which points belong to the left versus the right leg of the camel; but with a mesh, if you look at a point here versus here, you can see that the distance along the surface, the geodesic distance, is very large, so you can easily separate the left and the right legs.

So now that you understand why meshes are awesome: we want to run a convolutional neural network directly on the mesh, and for that we need to develop irregular operators. In our work, edges are sort of like pixels in images. With images, every pixel starts with an RGB value; with meshes we had to make something up, because nothing specific is defined for that. I don't think it matters much, so we went with something that seemed like a reasonable, simple geometric feature. For every edge we extract these five geometric features: the dihedral angle between the two incident faces, the two inner angles opposite the edge (this angle and this angle), and, for each of the two triangles, the ratio of its height to the base edge. These features are invariant to similarity transformations, which means we can apply a rotation, a translation, or a uniform scale to the mesh and we'll get the same five-dimensional feature vector, which is nice.
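A rough sketch of five such similarity-invariant features for a single edge; the exact definitions and orderings in the paper may differ in detail:

```python
import numpy as np

def edge_features(v0, v1, a, b):
    """Five similarity-invariant features for the edge (v0, v1), whose
    two incident triangles have opposite vertices a and b."""
    e = v1 - v0
    n1 = np.cross(v0 - a, v1 - a)          # normal of triangle (a, v0, v1)
    n2 = np.cross(v1 - b, v0 - b)          # normal of triangle (b, v1, v0)
    cosd = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    dihedral = np.arccos(np.clip(cosd, -1.0, 1.0))

    def opposite_angle(p):                 # inner angle at p, opposite the edge
        u, w = v0 - p, v1 - p
        cosa = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
        return np.arccos(np.clip(cosa, -1.0, 1.0))

    def height_ratio(p):                   # triangle height divided by edge length
        area2 = np.linalg.norm(np.cross(e, p - v0))   # twice the triangle area
        return area2 / np.dot(e, e)                    # = height / |e|

    return np.array([dihedral,
                     opposite_angle(a), opposite_angle(b),
                     height_ratio(a), height_ratio(b)])

# A perfectly flat pair of triangles gives a dihedral angle of 0 here.
print(edge_features(np.array([0.0, 0, 0]), np.array([1.0, 0, 0]),
                    np.array([0.5, -1, 0]), np.array([0.5, 1, 0])))
```

Angles are unchanged by rotation, translation, and uniform scaling, and the height-to-base ratio cancels the scale factor, which is where the invariance comes from.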
As a refresher, how does convolution work on images? You have this blue input image and this gray kernel; the convolution is a dot product, an element-wise multiplication and summation of the kernel with its spatial support in the image, and out comes this feature activation map. The reason this is so easy to do is that there is a consistent number of pixels (always nine for a 3x3 kernel) and a consistent order in which to apply the dot product. Actually, this reminds me of the work presented earlier: the convolution aggregates a lot of local information about the image into these feature activations, and that's part of what makes CNNs so strong. We want to do the same thing for meshes.

So we look at every single edge together with what's called its one-ring neighbors. Each edge has a feature vector; it starts out as the five-dimensional vector, but as we learn convolutional filters it will grow in size, and we want to apply a convolutional filter on the edge's feature vector and the four neighboring edge feature vectors. The problem is that we don't want to be sensitive to the initial ordering of the vertices and faces, so we needed a way to apply the convolution in a consistent manner. In meshes there is the face normal direction: each face is ordered counterclockwise, so every face has an order. The problem is we don't know which face to start with; that is ambiguous. Essentially, if we start at edge e and look at the four neighbors, we could start with either a or c; these two are ambiguous, and the next edge would be b if we started from a, or d if we started from c. So the blue pair and the red pair are ambiguous with each other.

What we do instead of applying the convolution directly on the neighbor features is to build symmetric features. We take a plus c, meaning all the features of edge a and all the features of edge c added together element-wise, and we get a new symmetric feature which doesn't care whether the order was (a, c) or (c, a). We do the same thing with the subtraction and absolute value, |a - c|, and likewise b + d and |b - d|. So from these four neighboring edges we build four symmetric features which are invariant to the original ordering of the mesh.
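A minimal sketch of that symmetric stencil, assuming edge features are stored as (num_edges, channels) tensors:

```python
import torch

def symmetric_neighborhood(e, a, b, c, d):
    """Order-invariant convolution stencil for edge e with one-ring
    neighbors (a, b) in one face and (c, d) in the other.

    Swapping which face we start from maps (a, b, c, d) -> (c, d, a, b),
    and each of the four constructed features is unchanged by that swap.
    """
    return torch.stack([e,
                        a + c, torch.abs(a - c),
                        b + d, torch.abs(b - d)], dim=-1)

e, a, b, c, d = (torch.randn(750, 16) for _ in range(5))
stencil = symmetric_neighborhood(e, a, b, c, d)
print(stencil.shape)   # torch.Size([750, 16, 5])

# A learned filter can then act across the 5 stencil entries (e.g. a
# convolution with a 1x5 kernel), analogous to an image convolution
# over a fixed-size window.
```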
Just to summarize everything so far: we have a mesh; we go through the list of edges; we extract the simple five-dimensional geometric feature for every edge; we look at the one-ring neighbors and use the face normal direction to build a convolution that is invariant to the ordering; we learn a certain number of filters, apply them on the edges, and continue this process. We started with a five-dimensional feature vector for each edge, and now we have a k-dimensional one.

I think it's interesting that this works so well on these geometric features, because it's not intuitive. Those convolution kernels applied to images learn cascades of filters that intuitively make sense; it's very interesting that it also works on these geometric features of the triangles.

Yes. Some people will take images and convert them to a different color space or some other hand-crafted input representation, and maybe that gives some very small boost in performance, but I think most people really focus on the architecture of the network and say that the network will learn the important features; it doesn't really matter what it gets as input, the network will learn some abstract representation of whatever is happening in the input. I think the idea here was to show something similar: for meshes we don't necessarily need to extract very complex, high-dimensional features that may be very powerful and descriptive. Instead, we want the network to learn what's powerful and descriptive, and we input something really simple.

Okay, so in images we do pooling, and pooling as it's defined and used right now works for regular images, but it's not clear how to do it for an irregular structure like a mesh. So we formulated a general definition of pooling, such that image pooling is one special case of it and our mesh pooling is another. The three core operations of pooling are: first, define the pooling regions; in the image example the region is given by this 2x2 filter. Second, merge the features in each region, for example by taking the max, for max pooling. Third, redefine the adjacencies. In images this last step is trivial: this 7 and this 9 are far apart before the max pooling, but after it, in the new image grid, they are neighbors, and we don't need to define anything explicitly. With meshes we do need to redefine the adjacencies explicitly. The reason pooling is important is that it makes the network learn a stronger, more robust representation.
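To make the three-step framing concrete, here is ordinary image max pooling written in those terms (a sketch, not from the paper):

```python
import numpy as np

def max_pool_2x2(img):
    """Image pooling as the special case of the three generic steps:
    (1) the pooling regions are the fixed 2x2 blocks, (2) merging takes
    the max of each block, and (3) the new adjacency falls out of the
    grid for free, with no explicit bookkeeping, unlike on a mesh."""
    h, w = img.shape
    blocks = img.reshape(h // 2, 2, w // 2, 2)   # step 1: regions
    return blocks.max(axis=(1, 3))               # step 2: merge; step 3 implicit

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))   # [[ 5  7] [13 15]]
```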
The inspiration for our mesh pooling was taken from a classic technique in computer graphics called mesh simplification. The goal of mesh simplification is to reduce the number of mesh elements while preserving the appearance of the overall original shape. There are several ways to do it, but the most common is probably the edge collapse, which essentially iterates over all the edges and removes the edge that creates the least distortion to the overall shape. The way that works: say we want to delete this red edge. The red edge collapses to a single point, the two edges in this face collapse to a single edge, and the two edges in the other face collapse to a single edge. So in one edge collapse, three edges are removed.

Instead of a geometric edge collapse, we wanted the network to drive the edge collapse. This is interesting for a couple of reasons. First, it allows us to visualize what the network learned, what the network decided was important and which edges to delete, which gives us some cues about how robust the model really is; this relates to work on adversarial attacks and model interpretability. And of course, as I said before, pooling allows the network to strengthen the features it has learned.

The way it works is that we delete the edges with the smallest feature activations: we compute the norm of every edge's feature activation, and the edge with the minimum norm is the one we delete. The deletion has two main steps. The first is to aggregate the features: in this case, when we collapse this edge, the features of edges a, b, and e are averaged element-wise and we get a new edge feature p, and the same happens on the bottom. Then we need to update the mesh, because we now have a different mesh; you can see that this edge, for example, now has new adjacencies for the next convolution, so we update the mesh data structures to obtain those new convolutional neighbors.

We also developed an unpooling layer, which is pretty much the same as how it's done for images, with a couple of small changes; the main idea is the same. We restore the topology: in image unpooling you remember the indices of the max elements and route the features back to them, and here we remember what the topology was before the pooling and restore it completely. Then we need to create the unpooled features. In images these are usually filled with zeros; we thought it might be more sensible to use a weighted average of the pooled feature vector, but I don't know that it matters so much, because a convolution comes right after.
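A sketch of the selection rule and the feature merge; the real layer also has to collapse the triangles and rebuild the mesh data structures, which is omitted here:

```python
import torch

def pooling_order(edge_feats):
    """Edges sorted by the L2 norm of their feature activation; mesh
    pooling repeatedly collapses the edge with the smallest norm.

    edge_feats: (num_edges, channels) tensor.
    """
    norms = edge_feats.norm(dim=1)   # one scalar per edge
    return torch.argsort(norms)      # collapse candidates, smallest first

def merge_triple(fa, fb, fe):
    """One side of a collapse: the features of edges a, b and the
    collapsed edge e are averaged element-wise into the single
    surviving edge feature p."""
    return (fa + fb + fe) / 3.0

feats = torch.randn(600, 64)
print(pooling_order(feats)[:5])   # the first five edges to collapse
```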
One thing I mentioned before: because the network chooses which edges to delete based on their feature activations, if we train one network to classify whether vases have handles and another to classify whether they have a neck, the simplifications, the mesh poolings, will look different. This is a visualization of what that might look like: in the top example most of the handles are left intact, and in the other example the necks are preserved.

I'll just show a couple of applications; there's of course more information in the paper. We essentially take the convolution and pooling layers and put them together sequentially, with nonlinear activations (ReLU) and normalization layers in between; we're using batch norm or group norm. Then it depends on the task. For classification we apply what was originally proposed in PointNet: they have a list of points, we have a list of edges, and in order to be invariant to the global order of the edges we apply a global symmetric function, which is a max or an average across each feature channel of all the edges. So for every feature channel we take the average or the max over all edges, then apply fully connected layers, and then a cross-entropy loss to classify. Segmentation is fully convolutional, with no fully connected layers: after the poolings we have some unpoolings, a sort of U-Net type architecture, and we predict the probability of each edge belonging to a particular segmentation class, whereas classification predicts a single class for the whole shape. In the interest of time I'm going to skip over this next part; it's not so interesting.
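A sketch of such a classification head; the dimensions and layer sizes here are made up for illustration and are not the reference architecture:

```python
import torch
import torch.nn as nn

class EdgeClassificationHead(nn.Module):
    """After the conv/pool blocks we have a (batch, channels, num_edges)
    tensor. A global symmetric function over the edge axis (average here;
    max works too) makes the prediction invariant to edge order, as in
    PointNet; fully connected layers then classify."""

    def __init__(self, channels, num_classes):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, 128), nn.ReLU(),
                                nn.Linear(128, num_classes))

    def forward(self, edge_feats):          # (B, C, E)
        pooled = edge_feats.mean(dim=2)     # (B, C), order-invariant
        return self.fc(pooled)              # logits for cross-entropy

head = EdgeClassificationHead(channels=256, num_classes=30)
logits = head(torch.randn(4, 256, 750))
print(logits.shape)   # torch.Size([4, 30])
```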
So, we showed a couple of applications. The first is on the SHREC dataset, which has 30 different classes of shapes: people, ants, aliens, things like that. What was interesting: here you see people, and these are the intermediate mesh simplifications, the mesh pooling results. The network actually removes the heads of the people, which is kind of funny, but for the ants it removes the legs. We started to notice that the network, especially when it did well, was behaving similarly within each class; for the same class it performed a similar simplification.

It's very interesting that you can have this insight, and that you can reconstruct the shapes from these intermediate representations. It's much better than for images, where it's often not obvious what you're looking at in the reconstructed intermediates; for this application it's much clearer.

Yeah. We're actually working on something now where we convert images to meshes and try to fit them, and because there's no geometry in an image we have a similar problem there. Here, because there is geometry, you can actually see something interesting happening.

Some results for segmentation: we observed the same behavior we observed in classification (these are all supervised tasks, of course). What we noticed, and this is from the test set, is that the back of the chair, for example, was the first thing the network started to collapse, then the seat, and it usually didn't really get to the base of the chair. So it was doing this similar parsing, consistent over the particular class, which was interesting. We also applied this to humans, and this is really cool: someone posted on our GitHub page that they were using this for real 3D body scans. So even though this looks sort of academic, they were able to use our pre-trained network and it worked well on their body scans, which is pretty cool.

I don't have that much time, so maybe I'll skip over this, but the general idea here is that we built a specific dataset because we felt the previous datasets and problems weren't emphasizing the power of what meshes can really do. In this dataset we engraved simple shapes into cubes; there are 23 different classes of shapes, the heart is one class and the apple is another, and because the engraving is so shallow, if you sample it as a point cloud you can barely detect the difference. We did very well on this classification task, and these are some of the intermediate simplifications: you can see that mostly the background edges are removed and most of the engraved icon is preserved.

This next part shows that the same exact shape can be given different triangulations, different meshings; the same shape can be represented in several different ways, probably infinitely many, and we showed that even with some simple data augmentations we can be robust to these kinds of variations.

And you're using these different triangulations for augmentation, did I understand that correctly?

For the augmentations we're doing something called edge flips, and we're shifting around the vertices; we're not actually generating entirely new triangulations. What was pretty impressive: we trained on this type of triangulation and tested on this one. Both of these are from the test set, but this one was triangulated in a way similar to the training set, and this one really differently. What we did for augmentation is take, for example, these vertices here and move them around a little bit, plus something called edge flips, which takes the shared edge of two adjacent triangles and flips it around. Essentially it creates a different mesh, but it's nothing too complicated. It would be nice if we could easily generate a ton of different meshings for the same object; that would be a great way to do augmentation.
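A sketch of the vertex-shift half of that augmentation; the noise scale `sigma` is an arbitrary illustrative choice, and the edge-flip half (rewiring the shared edge of two adjacent triangles) is only described in the comment:

```python
import numpy as np

def jitter_vertices(vertices, sigma=0.01):
    """Shift each vertex slightly in a random direction: the overall
    shape is almost unchanged, but the exact geometry (and hence the
    per-edge input features) differs between epochs.

    Edge flips, the second half of the augmentation described in the
    talk, instead change the triangulation itself by replacing the
    shared edge of two adjacent triangles with the opposite diagonal.
    """
    return vertices + np.random.normal(scale=sigma, size=vertices.shape)

verts = np.random.rand(100, 3)
augmented = jitter_vertices(verts)
print(np.abs(augmented - verts).max())   # small perturbation per coordinate
```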
A couple of other interesting things I noticed about this work. We wanted to create an example where we could really show the power of the mesh, similar to the cubes. So we wanted to classify humans based on their identity. We used a dataset which has 10 different people, but their body structures are really different: tall people, short people, muscular people, so it's really easy to tell who's who just from the global body geometry. We wanted to force the classifiers to look at the small details in the face, so we swapped the heads of all the different people onto all of the different bodies, and then we trained our network, and also the point-cloud networks, to classify the shapes based on the face. Each network got a face which sometimes sat on a different body, and it was supposed to classify which person it was.

The result was that we did really well, and the point networks also didn't do badly; they got something like ninety percent, we got something like 99 percent. It was really easy, and we sort of gave up on it because we didn't understand why that was the case. But when I looked at the simplifications I realized what the issue was: the networks were simplifying the body, and at some point the network simplified the features in the face too. So the network completely ignored what we wanted it to look at; it was probably just looking at the contour. The contour of the face was enough to determine which person it was; it didn't need the fine-grained details. It was pretty cool to see that explanation behind what we were observing.

So, in conclusion: we built this sort of convolution which is invariant to similarity transformations (rotation, translation, scale) and to the ordering of the vertices and faces, and we built this mesh pooling layer, which strengthens the learned representation but also gives us visual insight and clues about what's happening inside the network. We have a few future directions for this work: one is to make it generative, maybe with GANs; another is to make it work on general graphs; and another, like I said before, is bringing images to meshes.

Thank you, Rana, very interesting. Does anyone have a question?

Yeah, I was super intrigued when the topic came up in this Slack, and I was immediately reading the paper and thinking how I could use it in any way for my work, because it just seems so different and so impressive in its possibilities, but I'm not quite sure how I could apply it. I was wondering, for lidar data, basically having topography scans: could you use this to classify certain landscape features, stuff that's probably not obvious from images but rather from radar-based or lidar-based topography scans? Could that be useful for classifying features in the landscape?

Is the lidar data given to us as a point cloud, or how does it come?

Yeah, the intermediate step would be to somehow generate surfaces, I guess.

Okay, so there are some works, even classic works, which take point clouds and convert them to meshes, so that's one possibility. The 3D body scans I mentioned are also sampling a surface, so they must also come somehow as a point cloud and then get converted; I'm not exactly sure. So I think it could work. Also, with lidar (it's been a while since I thought about lidar) I think it also gives you a signal in terms of the angle of the surface the beam bounced off, so there might be some more information than just a point at a given distance.

Yeah, I think so too. I'm no expert in this field at all, I'm mostly doing optical remote-sensing stuff, but there is also the phase information from SAR radar and satellite data. So,
maybe if one finds an intermediate step, that could work; that would be interesting.

Yeah, I'm also not a 3D expert, but I guess you can make some simplifications and assumptions when you do something like a body scan: you can assume it's one surface of the body. If you really want to do landscape classification, it's probably harder to tell the different surfaces apart.

I think you'd basically have to do some kind of spatial smoothing, neighborhood lookups to check whether it's continuous, because in places it can bounce around wildly. But yeah, thanks again, Rana, for this super interesting talk, super cool. Maybe the most interesting and coolest part was that you can look into all those intermediate stages, and you said yourself that this is how you found out why the classification on the faces did not work.

Yeah, and I think this is more general. Like with the first paper that was presented: we don't really understand why CNNs work the way they work. We can tell it's a cat because of the contour, but the networks are looking at something completely different, and I think it actually shows that they're not as robust as we think they are; they didn't actually learn something super meaningful, and I don't know that they're really generalizing.

Yeah, it's just like the example I showed with the textures; people did not know that.

Okay, can I ask another one? Maybe I missed it in the paper, but I was wondering about the computational complexity versus these other approaches; you mentioned the previous works where objects were imaged from different viewpoints and then it's regular 2D classification. Is it dependent on the number of vertices, I suppose? And this update of your mesh after the pooling, is that a huge computational drain on the overall system? Just a ballpark number.

Okay, so in the paper I think we wrote that it takes roughly 0.1 seconds for a forward and backward pass of a mesh, which is for sure slower than what we see with images, but it's on par with, if not faster than, most 3D deep learning approaches today.

Okay. And is there, I forgot, sorry, a maximum number of vertices you can handle; do you simplify down to a maximum number of nodes?

Yes, in the beginning, and that's the part I skipped: in order to make training feasible, just like everyone downsamples images to roughly the same size, we downsample all the meshes to roughly the same size, a couple of thousand edges. It doesn't need to be exact, because our pooling layer can handle different resolutions. The convolution itself is just as fast as a regular image convolution; the pooling is the bottleneck speed-wise, since it's a sequential operation where we look at the smallest edge, collapse it, and continue, so there's definitely room for optimizing it further.

Okay, thanks. All right, I did not realize that the pooling operation is not parallel.

Yeah, so we did a couple of things to make it work pretty fast: instead
of collapsing an edge and then updating all the data structures each time, we only update the data structures once, at the end of every pooling layer, so it's a bit faster than the naive iterative approach; but yeah, it could definitely be done faster.

Any more questions? I'm just looking at the chat. Alvaro is asking: would adding different hairstyles and putting on different hats help make the models focus more on the faces?

I don't think so. You could include hairstyles, but basically the network was learning to ignore everything that wasn't important to it. In this case it ignored the entire body, the eyes and the nose, and really just cared about the contour; I think if you added a hat it would just ignore that as well.

Okay. I think this is going to have huge potential in the future, really, because (and I'm really not an expert on 3D data by any means) all the approaches I've seen so far are basically infeasible because of the computational power required at this point, and I think this is a very interesting step forward in this direction. Do you see any implications for computed tomography data, for example?

I'm not an expert on medical stuff; I think the way CT works is it takes thin slices of something and rebuilds the volume, right? So I don't know, but if you can convert that volume to a mesh, which is actually pretty straightforward, easier than starting from a point cloud, and I guess the resolution there is pretty good, then I guess it could work pretty well. Obviously right now these are supervised tasks, so you need labels and data and so on.

Yeah, can I ask about data? Again, I'm also not an expert on 3D data at all, but I imagine there are far fewer datasets for training models, labeled 3D datasets, than image datasets.

Yes. Everything that happens with images takes a couple of years until it happens in 3D, right? AlexNet was in 2012, and the equivalent of AlexNet for 3D deep learning came in 2015; we're a couple of years behind in everything. The same goes for the datasets: we have fewer datasets and less labeled data, but it's also growing. There are efforts like 3D Warehouse, and ShapeNet at Stanford, building these huge repositories, and very recently the ABC dataset came out, which has something like a million CAD models of small parts related to 3D printing.

Maybe I can add something about the datasets. One important distinction is that with images we have a pretty consistent data structure, whereas with 3D models, even within the datasets, we have a lot of variation, even in quality and in the way the models are built. There are a couple of existing datasets, like ShapeNet, but a lot of the research is done on stuff downloaded from something like 3D Warehouse, and there it's sometimes someone using Blender for the first time, sometimes someone making a very nice watertight 3D model for printing, and sometimes just a cube with a few spheres. That's another reason why the research is quite difficult: you have so much added complexity and variation in the data. And the number of models in a certain category
is also small: it's easy to get millions of photos of cats from the internet, but it's not as easy to get even thousands of models of different cats.

That's what I was wondering, whether there's any kind of driver. For images it's obviously the big search engines that are auto-generating these huge datasets, or harvesting this information. Is there anything similar on the horizon for 3D? 3D printing kind of makes sense.

There are so many different industries that use 3D data, and there are places where you can find 3D models created with photogrammetry, for example. But they all produce different types of 3D models; people doing 3D models for gaming again have a different data structure. That makes it difficult to feed everything through a network in a consistent way: whatever you put into the network, the results may vary.

Yeah, cool, interesting. And Rana, one last question from myself: are you coming from the field of computer graphics and 3D imaging, or from the field of deep learning?

I guess I started out doing computer vision before deep learning, and then deep learning caught on; I was doing classic machine learning too. So I originally started out with images and regular machine learning, then images and deep learning, and then 3D.

Interesting. I always like it when people do this kind of foundational work, because it's so much harder.

I always thought 3D was really cool.

It absolutely is, but when you do something for images, it's so easy: you can choose from a hundred different datasets to try your ideas and develop your stuff, whereas here you have to do so much foundational work first. That's really cool.

Yeah, even something as simple as presenting the results: you think it takes a lot of time to display the rendered output, but I guess as this work goes on there will be more and more tools to help you do that, right.

Yeah. I also have a question about your last slide, about future work and generative models. What type of work were you considering with generative models, and could it be used for recreating 3D models from single-view images? I've seen some work with voxels doing the same, retrieving models from a 2D image or sometimes from multi-view data. Could this be used for recreating 3D models from images?

Okay, so when I wrote generative models I was thinking about GANs, conditional GANs for example. Well, that's a personal thing of mine: a lot of people say GAN when it's really just a one-to-one mapping, really just a network, like CycleGAN. In the case of taking an image and converting it to a shape, I'm not sure, because our network works directly on the 3D mesh, so the question is what happens if the input is an image rather than a mesh.

With conditional GANs you have two sources of data and you produce the image, so you could have something similar; the question is whether you can combine one source of 2D data with the generative model.

Yeah,
there are definitely different ways; I'm thinking of a couple of ways right now to combine it with, not necessarily images, but something else. When I wrote generative models here, though, I meant maybe generating from a noise vector: having an actual mesh come out of some type of black box, not necessarily conditioned on a specific image, but just sampling some noise distribution and having good meshes come out of that black box.

I think there are some voxel GANs; it would be interesting to see that with your meshes.

Yeah, I think it's going to be hard, because really small inconsistencies in the angles of neighboring faces can make the mesh look really bad.

So you're talking more about stuff like variational autoencoders?

Yeah, a variational autoencoder is one approach to generation, right: you have an input mesh, you collapse it to some latent vector, and then decode that back to the original mesh; at inference you can sample that latent space and get different meshes at the output. Another approach is just taking a random vector, inputting it into a generator which generates a mesh, and then having a discriminator which says whether it's real. So there are many different ways we're thinking about; I'm not exactly sure how we're going to do it, but it's sure going to be interesting.

Okay, if there are no more questions... actually, one more: for CNNs there are visualization techniques for what happens between the layers. If I understood correctly, do you do something similar?

So, when I showed these visualizations, that is what's happening between the layers, but only in terms of the topology, the mesh itself; I'm not specifically visualizing the actual feature activations themselves. I did try to plot the feature activations on these meshes, but we couldn't really get a visualization that was informative; it was a bit confusing.

So you just color-coded the edges?

Actually, for segmentation I think we did do something, I'd actually forgotten: we color-coded the edges between the layers to see the high-value versus low-value activations for the segmentation. It was pretty similar to what we saw with the simplifications, which is why we didn't include it, but if we'd had more time before the submission I would have added it.

Okay, if there are no more questions, I think we'll close for now. Rana, thanks a lot, that was really very interesting. I'm really looking forward to seeing more work on 3D data; I think that's really going to be important in the future. Thanks everyone, and see you next time!