Generating Ground-Level Images From Overhead Imagery Using GANs with Yi Zhu - TWiML Talk #172

The Power of Generative Models: Exploring the Frontiers of AI Art and Image Generation

Recently, there has been an influx of interest in generative models, particularly those using Generative Adversarial Networks (GANS) to generate realistic images. These models have shown remarkable promise in producing high-quality images that are difficult to distinguish from real photographs. However, despite their capabilities, GANS also present several challenges and limitations. One major challenge is the difficulty of training these models, which requires significant computational resources and expertise.

In order to overcome these challenges, researchers have been exploring various strategies for improving the performance of GANS. One approach involves using techniques such as conditional GANS, which allows the model to generate images that are conditioned on specific attributes or characteristics. This can be particularly useful in applications where the desired output has a clear structure or pattern. For example, if we want to generate an image of a forest with a single house inside it, we can provide the model with the attribute "forest" and "house," and it will produce an image that meets these criteria.

Another approach involves using features from a latent space to improve the quality and coherence of generated images. Since GANS do not have an encoder or explicit data distribution, they rely on interpolating between different feature spaces to generate new images. However, this can result in images that are lacking in semantic meaning or context. In order to overcome this limitation, researchers have been exploring techniques for generating features that are semantically related, such as using clustering algorithms or visualizing the feature space.

One of the key benefits of using features from a latent space is that they provide a smooth and continuous representation of the input data. This can be particularly useful in applications where we need to manipulate or modify existing images. For example, if we want to generate an image of a forest with a larger body of water, we can start with a smaller version of the image and gradually increase the size using the features from the latent space.

Despite the challenges and limitations of GANS, researchers continue to explore new techniques for improving their performance. One promising direction involves using information retrieval techniques to guide the generation process. For example, if we want to generate an image of a forest with a specific type of tree or animal, we can provide the model with relevant keywords or metadata that will help it produce an accurate and informative image.

In addition to GANS, researchers have also been exploring other types of generative models, such as those using Variational Autoencoders (VAEs). These models rely on learning a probabilistic representation of the input data, which can be used for tasks such as dimensionality reduction or feature extraction. However, VAEs are typically more computationally intensive than GANS and may require significant amounts of training data to achieve good performance.

Another area of research involves exploring the use of geospatial information in generative models. In this approach, researchers aim to generate images that are not only realistic but also semantically meaningful and contextually relevant. For example, if we want to generate an image of a forest with a specific type of tree or animal, we can provide the model with geospatial metadata such as the location and altitude of the region.

The training of GANS has been shown to be a challenging task, particularly when compared to other types of machine learning models. However, researchers have identified several strategies for improving the performance of these models. One key technique involves using techniques such as batch normalization or data augmentation to improve the stability and robustness of the training process.

In addition to the challenges of training GANS, researchers have also identified several limitations and drawbacks. For example, some GANS can produce images that are overly realistic but lack coherence or context. In order to overcome this limitation, researchers have been exploring techniques for generating features that are semantically related, such as using clustering algorithms or visualizing the feature space.

The use of generative models has significant implications for a wide range of applications, from computer vision and image processing to art and design. These models have the potential to revolutionize fields such as photography, filmmaking, and advertising, by enabling the creation of realistic and high-quality images that are tailored to specific needs and requirements.

As researchers continue to explore new techniques for improving the performance of GANS and other generative models, we can expect significant advances in the field. With their ability to generate realistic and semantically meaningful images, these models have the potential to transform industries and revolutionize the way we create and interact with visual content.

In recent years, there has been a growing interest in using Generative Adversarial Networks (GANS) for image generation. These models have shown remarkable promise in producing high-quality images that are difficult to distinguish from real photographs. However, despite their capabilities, GANS also present several challenges and limitations. One major challenge is the difficulty of training these models, which requires significant computational resources and expertise.

One approach involves using techniques such as conditional GANS, which allows the model to generate images that are conditioned on specific attributes or characteristics. This can be particularly useful in applications where the desired output has a clear structure or pattern. For example, if we want to generate an image of a forest with a single house inside it, we can provide the model with the attribute "forest" and "house," and it will produce an image that meets these criteria.

The power of generative models lies in their ability to produce high-quality images that are tailored to specific needs and requirements. With their capacity for image generation, these models have significant implications for a wide range of applications, from computer vision and image processing to art and design.

"WEBVTTKind: captionsLanguage: enhello and welcome to another episode of tormal talk the podcast why interview interesting people doing interesting things in machine learning and artificial intelligence I'm your host Sam Carrington a quick update for our faithful listeners and fans we recently asked you to help us in our bid to secure our nomination in the People's Choice podcast Awards well thanks to you we're a finalist in the best technology podcast category along with other noteworthy shows like reco decode and the verge cast the award ceremony will be streamed live on September 30th which turns out to be international podcast day keep your fingers crossed for us and a huge thanks to everyone who voted you have our utmost gratitude today we're joined by a zoo a PhD candidate at UC Merced focused on geospatial image analysis in our conversation en I discussed his recent paper what is it like down there generating dense ground-level views and image features from overhead imagery using conditional generative adversarial networks he and I discussed the goals of this research which is to Train effective land-use classifiers on proximate or ground-level images and how he uses conditional gans along with images source from social media to generate artificial ground-level images for this task we also explore future research directions such as the use of reversible generative networks as proposed in the recently released open AI glow paper to produce higher resolution images enjoy alright everyone I am on the line with ease uu e is a PhD candidate at UC Merced a welcome to this week in machine learning in AI hi Sam hi everybody thank you for having me it's really exciting to be here absolutely I'm really looking forward to digging into your work today to get us started why don't you tell us a little bit about your background and how you got started in machine learning so actually my major at first is awareness communication because my undergraduate study was on like signal processing and information theory but then at a year of 2012 when deep learning first to show his great potential on image net challenge so I was fascinated by simplicity and surprisingly good results so I started like kygo challenges using deep learning including the famous dog and cat classification challenge so although at that time most the challenge winners are still using random forests of XG boost but later I found myself like attractive to this machine learning field and especially deep learning so I'm thinking so why not change a major so my understanding so images are steel signals captured by optical sensors so they are not that different from my previous study so I switched my major to computer vision and start my PhD study at UC Merced in 2014 so basically I have two lines of researchers directions right now one is geospatial image analysis the other is a video analysis but I think today I would just a focus first at one you mentioned that you started competing in cattle competitions what were some of the competitions that you competed in so the first one will be the dog and cat challenge so that's a less object like a recognition problems like a binary thing and later I also competing the challenge of the so the eye contact of the driver so to try to understand if the driver is sleepness or is it a safe driver or something so that's also like autonomous driving challenge and then also for the insurance to see a car's image and so whether it is like a damage it or not is it's a new car so - so how much price should it sell for a second-hand car so that's a for the insurance company so there are several more but I cannot remember the details yeah okay awesome I don't know the last two the driver I contact or the insurance one but the dog and cat one is one that we have gone through I'm working with a group of folks to kind of go through the fast AI course and they talk about that dog and cats one in there quite extensively how'd you do in the Kaggle challenges you did so firstly I just try like a linear regression so that's the most basic model like I think most of people still use it today for like normal questions and then I also try like a random forest and XG boost so as times and most of winners are using those two techniques and and then modeling assembling to get to the bestest score they can but then I also try elective learning right so convolutional neural networks for this image recognition task and at first they cannot compete with random forests because the images are not that large so you know like a deep learning is a data hungry model right so if you have more data and if you have like a clean data so you can get a very good result but if you don't have enough data so maybe sometimes it doesn't compete with those traditional algorithms yes that basically the for algorithms I'm using for cargo challenges how did you rank in the competition's did you rank highly in any of them I didn't rank like a top one or top ten something but I'm usually in the top 10% okay nice and so your research now is on understanding images using deep learning and you recently published a paper the paper is called what is it like down there generating dense ground-level views and image features from overhead imagery using conditional gans that there's a ton in there the title is quite long I mean maybe let's get started I think you know we've talked about Ganz quite a bit on the podcast but why don't you start us out by talking about conditional Ganz okay yeah so yeah I have seen those like against episode here and there so it's very hot topic like last year and there's a trend is still going on so so as we know so again consists of two components so one generator or one discriminator but the thing is when you generate something so you're generating from a random noise so each time you will get different results different output but sometimes you really want to have some like fixed output right so that is what conditional can for for like for example sometimes a at a fashion like a shoe designer they want to design using again to design a new shoe but if you did the output sometimes it will begin something really random so if we can give the model some information on prying information like some texturing bearings like on I want a shoot so then the output will definitely be a shoe image so that's what can for so see again so the condition again learn the distribution condition on some auxiliary information so those are the delivery information and can be class labels or which is the most standard form or texting bearings for generating images or like image to image translation and so here in our work so we use condition again because we know what we want we want to ground level images corresponding to the overhead image so given me a overhead image I want to generate a similar-looking ground level images so the overhead image it will be considered as a prior information yeah so that's why we choose like condition again and what you're going for kind of visually is if you send in to the system the image of you know what an area that looks like farmland you want to generate ground level pictures that look like farmland similarly if you're sending in overhead images of an urban setting you want it to generate urban images yes exactly tell us about the different data sources that you use to make all this happen okay so the ground level image is a problem there is there sparse so on what we can have is a satellite images right satellite images stance so in everywhere so it has all the testified images at every spatial location so that is our even image input source so the ground truth and cover map is from the LCM data set so our our study region is a 771 by 71 kilometer in length and so the so geo graphic images are labeled as urban or rural based so then we can do a very simple like a binary classification problems so our input sources are like an a lot so from like Google Maps statistics and until graph API and also the satellite images so those are the three image sources we're using the LCM data set is that your overhead images are those your ground images those are the overhead images okay and the ground level images is from the Geo graph to where the that is a website by some London researchers so those are for the whole United Kingdom so the lengths our classes are provided on a one kilometer grade so for every grade they have the user provided for on the level images so so that's pretty accurate so that is our ground level images and the LCM is overhead images so we have the corresponding relationship to train our models to do some experiments and so I realized that what you're doing here with Gantt is fundamentally generative but part of the the way the problem is set up sounds like an information retrieval problem like you have this overhead image and you have this corpus of ground-level images kind of find the best one are those related in any way oh so there are there is difference there so for like a image retrieval problem is so given me overhead images I want to find the best matching or like ground level images right so there is like there's a database of the ground level images so we're trying to find it a most imagine one so but for generate models it's not where we want some specific images so we don't want some specific ground level images to corresponding to the overhead image we want the whole data distribution to fit so the we want the generator to generate some real looking images to fit the data distribution so we don't care about so which images we're generating we just want it to look realistic for that category so that's the only difference and do you do any kind of information retrieval as part of the solution in other words is your again conditioned only on the overhead image or is it conditioned on the subset of images that you retrieve from a database based on the over image that's a good question so actually right now we only like like a based our model own again so so we're not we're not using any information retrieval thing so but in the future we want to use that because the land use classification problem is a really challenging it's really really hard so maybe later I will introduce my another work so it's a it's also about large-scale land use classification but to the classification accuracy is only like a twenty percent thirty percent so it's pretty challenging so we need to have some like a human prior knowledge inside it so like the information retrieval inside information you mentioned about so we will do that in the future work so in this work what are some of the big challenges that you had to overcome in in this paper and this research okay so can I like a back up a little bit to talk about the motivation of this work so so that would be more clear okay yeah so so the motivation to like all of our work is because although we have like enormous amount of images like from Flickr Instagram so all those images or those videos we could use it to do some geospatial analysis but if we want to do a detailed land use classification map so land use I mean is for examples of whether a building is a hospital or whether it's a shopping mall or something so so how do we use the land so it whether it's residential or for office use so but overhead images cannot handle such case because overhead images is from like above right so from above you can only see a building so you cannot see inside of the building so you don't know what's that designed for so that's what we come up with so we want to use social multimedia to do this kind of a Colonials classification because social media is captured by phones or cameras so they can see inside the building to easily infer what it is of this building so where it is but the challenge the most big challenge of using social media is is uneven spatial distribution it's so simple as speaking because most images are coming from the famous landmarks right so now to the general spatial locations their returns of people having the Golden Gate Bridge in San Francisco like after tower in Paris so it's very easy to tell these landmarks from the photographs but for like residential areas or privacy pre preserved regions which so we don't have enough images so there's a very serious problems if we don't have images so how can we infer the location and all those land use classification information so that's the challenge we really face about so traditional there are several ways to do that so one simple way is to interpolate the features and so if we have for example we have several images at location a and we have several images at another location B and these two locations are close to each other but between these two locations there is no images at all so in this case we just computed the image features of these other images at those two locations and interpolate them along this line but but here we made assumption the assumption is we hope the image features will change smoothly in the spatial domain but actually in most cases is not a case so if we have several images and form like a residential and at location a and a residential and location B then if you do the like interpolation things it will interpolate so between these two locations all the areas are residential based but actually in most cases then the the space in between is like forest or park or just a river or something so it's so I can that natural things so the other kind of method is so we try to use other information like remote sensing images or Google Street wheels because those two information sources are like a dance so everywhere in the on earth so there's a work from that professor Mason Jacobs lab and sSAB 2017 so they just use Google Street wheels to do this kind of things but we find that so all those techniques are based on image features and are based on the assumption I just said so the image features change smoothly in the spatial domain but usually it that's not the case so that's why we do this work so if we cannot use the image features why not we just use images but so the images are missing at those locations so how do we come up with new images so fortunately we have Ganz yeah so that's the motivation of the work you mentioned that early in your research career you spent some time looking at information theory and the like and it strikes me in that context that this is a pretty difficult problem and that there's just not enough information in these satellite images to do a very good job coming up with ground level images how do you get around that yeah so so both like overhead imageries and grownup images has this like advantages and drawbacks so as I said so the overhead imagery is very accurate and it is dancing everywhere so we can totally use this economy for information but if you want to see inside a building so overhead images cannot do that but ground level images we can see inside a building but the biggest problem is the uneven distribution so from the like information theory side of the contour wheel so we should use both information sources to how to combine them in the best way to get the best result yeah this is a good point yeah maybe that maybe to make my question more concrete it has to do with what you considered your error function to be or something along those lines and I guess the the thing that I am trying to articulate here that strikes me as being particularly interesting and challenging here is with a ground level image you know beyond just trying to generate like Hylian you know whether a particular image looks kind of urban or looks like you know greenery or forests or things like that you know there are things that you're trying to generate that aren't at all represented in your in your input so for example your satellite image you know has no building facades but if you're trying to generate imagery around an urban area those ground level images will have building facades so it's just kind of making that stuff up and I'm wondering does that affect play into kind of how you build out the model and what the loss function is and stuff like that does that make any sense yeah yeah yeah it totally makes sense yeah yeah that's a good point so but actually our work is in the very like initial stage so currently we're just using a very traditional conditional again to do that and the last function is just like a generating realistic looking images so we're not considering like whether there is a discrepancy between the growing level images or overhead images or any other like objective function we're just using conditional game without other stuff so the the loss function only thinking about realism versus not realism is that kind of axiomatic for conditional Ganz yes yeah for most of conditional games it's just so for the discriminative Nader its job is just to say so whether this image is fake or real so it's a binary problem and the objective function is it's just that so entropy loss okay got it so you you built this Gann based system to produce these images we were just talking about loss functions like what did you find in terms of its performance how did the system do yeah so in terms of the image quality so I I don't like have a monitor to show the quality here but basically it makes sense so we can generate like some realistic images like ground level images according to the overhead images and but we don't have like a like a evaluation metric so how real they are or how accurate they are so we use another task so that's a land use classification right because if we can do a better land use classification given these like two fake images so we can create fake images all over the ground right so if we can use these images to like a do better land use classification so that both accuracy can be a good indicator of how our model works so let me share the performance of our land use classification problem so if you're if we are using the condition again generated features to do the Lanier's classification we can achieve like a land color land cover classification accuracy with like 73 percent accuracy so actually is kind of like it's a reasonable it's not high but it's reasonable the problems we're thinking here is because the generating our generated images are not realistic enough so there's some of our future work so we have like a three future directions to go so I can talk about later if you like what in the land use classification how many classes are there it's it's only binary classification so like as I mention it's a rural and urban areas so because that's that's only the ground truth we have okay so you've got a bunch of satellite data you feed it into your conditional again it generates an image that is meant to be representative of a ground level view of the area that you indicate and then you are sending that into a classifier that is meant to determine whether it's urban or rural and the is the seventy three percent is that the accuracy of your classifier or the accuracy of the generator based on a trained classifier oh it's a the classifier it's a liar okay the classifier yourself that's what I thought you were saying and so you know how did the the the Gannet self perform relative to with that classifier okay so the baseline of that is so we're compared to like a traditional approaches right so the first approach is I'm talking about is interpolated features so if we have some images we just do the feature extraction first and interpolate the features on those regions without images so the baseline is like sixty five percent okay so sixty five so if we're using Ganz without conditional like we're getting like a like - a 3% improvement like to the almost a 68% and if we're using conditional again we're having like 70% and if we're using conditional Gann generated features we're having the best accuracy like 73 percent so that's the progress of our work and I just want to make sure I understand this I thought what you were saying was that the the classifier itself like totally separate from the Gann part of the system had a 73 percent accuracy did you model the the land use classifier separately and that was 73 percent yes so the land use classifier is totally separate from again so again again is about generator discriminator the job is to like create really great these images yes yes and then we use the features from the discriminate later as the input to the land use classifier right I guess what strikes me as odd is that or at least curious is that if I understood you correctly your ultimate accuracy of your Gann turned out to be exactly the same as the accuracy of your classifier 70% maybe I'm saying wrong so because for again there's no like accuracy right so again for the discriminator it's just a real or fake so we don't care about that accuracy or not so it's usually very high right like 80% or 90% it's where high but what we are care about is an annual specifier yeah so what I'm saying the accuracy is all for ten years classifier is not for can classify okay I think I'm confusing the issue here and not I'm having a good job explaining so I guess I'm thinking that there are as we established there are two separate systems there's the generating system input is an overhead image and its output is a ground-level image and there's a classifying system and its input is a ground-level image and his output is a land-use classify marry land use class vacation yes and so the performance of the the began is a subset of that first system and it's responsible for generating these images and it's kind of judged on whether the images are realistic or not but that whole first system the generator system as a whole right you're giving it a satellite image and it's spitting out a ground a level image we can measure that its accuracy with respect to producing kind of the correct land-use right right yes so that that's one kind of accuracy measure and then there's another which is given any kind of image whether it's generated from our general generator or not you know is this land-use classifier model accurate in classifying the input image correctly and then there's like a third performance metric which is the end-to-end right given a satellite image can we identify the yen the land use correctly and so which of these is the 73% it's a second one so for the third one we're not doing end-to-end right now so we're just we're doing two stage so first stage is scan the second stage at the end is we're not doing end to end at this point yeah I guess if you're not doing end to end you're not really looking at the first the performance of the generator either because that really is danton uh-huh okay got it is the main focus of the research around modeling and testing this the classifier then or is it the generator system as a classifier right because we're so for a geospatial analysis we're more caring about the land use system so basically it's a owning system right so usually the government and city will make a zoning map so every year or so so that is the most helpful things so for the generator it's just a technique we use so if we don't use scale we can use other generator to generate images it's kind of funny in that like you know gans are so popular and quote-unquote sexy term it's almost like a head-fake that kind of pulls your attention away from what you're actually trying to do in this paper yeah yeah yeah actually it is okay so I'm not sure you've even talked much about the classification model so for land use is a pretty basic I can see a convolutional network so like we use ResNet 101 to do that so actually we have like a previous work like a three years ago so is it it's also doing land use classification with the convolution and your network I think at that time we're the first batch of work that use deep learning for this kind of work we also receive the best poster word in the ACM six spatial conference yeah so basically is you at that time it's also the similar problems we're using social media we use deep learning we do the land use classification but the biggest problem is we don't have the ground to us right if we don't have grown shoes we cannot evaluate our model so at that time which is the chicken like a university campus to do the land use classification tasks it's a trivial no example it's a toy example but do we get like a very good accuracy so since that time we start to build a very large data set and actually we took like two years to build a data set and continue this line of land use classification system yeah but in this work in this game paper we're just using a very standard resident 101 network to do the land use classification there's the social media images come into play in this paper oh yes so for the ground level images from the geographic data set so those are all like the ground level images so social media right those social media are contributed by just rest in the United Kingdom so anybody can submit image to the website and the website will show it so where the photo is taken and when it's about so-so yeah for that data set it is totally like social multimedia and remind me how where that comes into play in the system the system is when you do the discriminator right you need to tell and this image is fake or real right but how do you determine whether it's real effect you have to do can some ground level images to know it is real or fake right okay yeah that's what they come into play so although the input is overhead imagery but the ground truth like it's something you're compared to is a ground level images so that's when the social multimedia coming to play and those images are they're not labeled at all with regard to land use or anything like that it's just that's the training the discriminator to understand real versus fake images yes yes that's a beauty Afghan right so we don't care about the labeling we don't need to know about land use classes we just need to know this is a real image that is a fake one okay cool I think I think I've got it now so the to sum it all up right you've got this this challenge of being able to develop accurate land use classifiers but you've got this problem of sparse data right so you've got all of this area that you might want to classify based on satellite images but that you don't have specific ground level data for so you generate some realistic looking ground level data using the conditional gain and then use that to train your classifier and you've been able to kind of incrementally improve the performance of the classifier using this kind of data as opposed to previous data sources that are trying to approximate this data that you don't have yes yes correct yeah got it exactly got it okay awesome yeah I don't know why this was so difficult for me to wrap my head around but I think part of it had to do with this condition head-fake so most people just focus on again so what can can structure you're using are you using the state of art again so yeah so actually we're doing something like for geospatial analysis of the land use classification so that's how we're finally came awesome are there other interesting aspects of this paper that we haven't touched on yet yes so for this paper there's no not but for the future work right I want to talk about a future work a little bit so that's very interesting so the first thing about a future work is so right now our generated ground level images like is not really enough so they lack the image details so for some like houses or like animals they don't they don't look real enough so there's like a plenty of room for improvement so currently we're thinking so where we want to use a attack mate called a progressive gun so from a media though so that master is the key idea is to grow both a generator and discriminator progressively so from a small network from a small resolution we add new layers to the model and increase in Inked win the model so you know progressive manner so this can both speed up the twinning procedure and stabilize a model because care is really hard to train so sometimes it's just a model claps so this could eventually lead us to a very good image resolution because currently our generated image is only like 32 by 32 or 64 by 64 so it's very coarse so but the eventually we'll want to have some images like a 1k by 1k or 2k or even 2k so because most of the remote sensing imagery is it's like a 2k by like 3k so it's very large and it's very details so we won't ever generated ground level images is also large and details so I think that will bring up our performance by a large margin okay then yeah so that's a first direction and there's a second direction is so there's also a recent work by open a I cut glow so if they're using reversible generative models they're not using gas but reversible generative models so that model the most the good thing about it is they're late in space I mean the features so the features are useful for downstream tasks because in gas so the data points can usually not be directly represented in a latent space so because they have no encoder and they don't have all the data distribution but for reverse of generative models they can either they can interpolate between this feature space so it's very smooth so in that case we can directly use the features we don't need to use images so I hope that can be a better solution and then I think like a lot of people are interested in this paper as well so that the glow one it can generate very realistic images and the third direction will be using more as you talked about so maybe using like an information retrieval thing we're using some like a text text information right because so for example if we want on a generate like forest like image so if we just give him the overhead imagery so maybe he cannot like it give us a good image but if we can say so I want a forest a wheel with like with a with one house inside it so maybe in our generated grown-up images we will see exactly one house in a forest like a scenery so that's very promising so that's the third directions to go so so this works this can work is our initial attempt to do this Stan's best interpolations in this direction so we have a lot of work coming up and so hopefully we can get better results in the future you mentioned with regard to the low paper I've seen it but I haven't looked at it in any detail at all you mentioned that part of what it allows you to do is to get these smooth representations in a feature space yes and are you would you then be trying to use that feature space or is the only benefit to you of that that it produces better images I think we will try both so better images and other features I think maybe the image maybe the features makes more sense because for Glo work they're reversible right so the reversible means if you have some input and go into the output it can also use output to get you the input so the the reversible makes the features more robust so it's make sense it is expensive like explainable so in that case the features might be more powerful than the gain features are the features in this feature space are they semantically related like is it kind of an embedding in the feature space where you can get these semantic relationships between these different types of generated images yeah definitely I think they're the features should be like semantically related so if we do like a like a feature space visualization we can see so like like a tree or forests are clustered together and the river water lake those images are clustered together so the features are definitely cementing related and so do you think you'll get to a point where you can start with a ground truth image of a field and say like I want a little bit more you know rivers and then get a stream and a little bit more rivers and get like a bigger body of water or something like that well you're really like smart so that that's something where we're trying to do right now oh okay that's yeah that's more like a manipulative of the image right so yeah if you if we can do that so that's very interesting too that geospatial communities just to kind of wrap things up I'm curious you mentioned how difficult training the the Gans has been can you maybe share with us some things that you've you've learned as you've tried to to work with Gans and conditional Gans okay sure so for guests so because we're using a very standard condition again so which is proposed like 2015 or 16 so it's very earlier stage of the Ken so the training is like it's tables I have a lot of problems so as mentioning another paper I forgot the exact name but they should be like like good practices during like 20 incrementing games so they proposed like oh we should use try date convolution not cooling because cooling can hurt like an image resolution thing during the town sampling and we should like using a smaller crop size and a smaller learning rate something like that but but I don't think that's a major problem right now because most of game models is like easier to Train at this moment so the because the last matrix changes to the versus 10 lost so that is a very stabilized and lost function and also the network architecture change so right now we can trim very deep networks using games and for example I can rest at 101 and that the output can be several hundred resolutions or even 1k resolutions so the training of the game is not that hard right now awesome well II thank you so much for taking the time to share with us what you're working on is really interesting stuff yeah no problem all right everyone that's our show for today for more information on e or any of the topics covered in this episode head over to Twilley Icom slash talks last 172 as always thanks so much for listening and catch you next timehello and welcome to another episode of tormal talk the podcast why interview interesting people doing interesting things in machine learning and artificial intelligence I'm your host Sam Carrington a quick update for our faithful listeners and fans we recently asked you to help us in our bid to secure our nomination in the People's Choice podcast Awards well thanks to you we're a finalist in the best technology podcast category along with other noteworthy shows like reco decode and the verge cast the award ceremony will be streamed live on September 30th which turns out to be international podcast day keep your fingers crossed for us and a huge thanks to everyone who voted you have our utmost gratitude today we're joined by a zoo a PhD candidate at UC Merced focused on geospatial image analysis in our conversation en I discussed his recent paper what is it like down there generating dense ground-level views and image features from overhead imagery using conditional generative adversarial networks he and I discussed the goals of this research which is to Train effective land-use classifiers on proximate or ground-level images and how he uses conditional gans along with images source from social media to generate artificial ground-level images for this task we also explore future research directions such as the use of reversible generative networks as proposed in the recently released open AI glow paper to produce higher resolution images enjoy alright everyone I am on the line with ease uu e is a PhD candidate at UC Merced a welcome to this week in machine learning in AI hi Sam hi everybody thank you for having me it's really exciting to be here absolutely I'm really looking forward to digging into your work today to get us started why don't you tell us a little bit about your background and how you got started in machine learning so actually my major at first is awareness communication because my undergraduate study was on like signal processing and information theory but then at a year of 2012 when deep learning first to show his great potential on image net challenge so I was fascinated by simplicity and surprisingly good results so I started like kygo challenges using deep learning including the famous dog and cat classification challenge so although at that time most the challenge winners are still using random forests of XG boost but later I found myself like attractive to this machine learning field and especially deep learning so I'm thinking so why not change a major so my understanding so images are steel signals captured by optical sensors so they are not that different from my previous study so I switched my major to computer vision and start my PhD study at UC Merced in 2014 so basically I have two lines of researchers directions right now one is geospatial image analysis the other is a video analysis but I think today I would just a focus first at one you mentioned that you started competing in cattle competitions what were some of the competitions that you competed in so the first one will be the dog and cat challenge so that's a less object like a recognition problems like a binary thing and later I also competing the challenge of the so the eye contact of the driver so to try to understand if the driver is sleepness or is it a safe driver or something so that's also like autonomous driving challenge and then also for the insurance to see a car's image and so whether it is like a damage it or not is it's a new car so - so how much price should it sell for a second-hand car so that's a for the insurance company so there are several more but I cannot remember the details yeah okay awesome I don't know the last two the driver I contact or the insurance one but the dog and cat one is one that we have gone through I'm working with a group of folks to kind of go through the fast AI course and they talk about that dog and cats one in there quite extensively how'd you do in the Kaggle challenges you did so firstly I just try like a linear regression so that's the most basic model like I think most of people still use it today for like normal questions and then I also try like a random forest and XG boost so as times and most of winners are using those two techniques and and then modeling assembling to get to the bestest score they can but then I also try elective learning right so convolutional neural networks for this image recognition task and at first they cannot compete with random forests because the images are not that large so you know like a deep learning is a data hungry model right so if you have more data and if you have like a clean data so you can get a very good result but if you don't have enough data so maybe sometimes it doesn't compete with those traditional algorithms yes that basically the for algorithms I'm using for cargo challenges how did you rank in the competition's did you rank highly in any of them I didn't rank like a top one or top ten something but I'm usually in the top 10% okay nice and so your research now is on understanding images using deep learning and you recently published a paper the paper is called what is it like down there generating dense ground-level views and image features from overhead imagery using conditional gans that there's a ton in there the title is quite long I mean maybe let's get started I think you know we've talked about Ganz quite a bit on the podcast but why don't you start us out by talking about conditional Ganz okay yeah so yeah I have seen those like against episode here and there so it's very hot topic like last year and there's a trend is still going on so so as we know so again consists of two components so one generator or one discriminator but the thing is when you generate something so you're generating from a random noise so each time you will get different results different output but sometimes you really want to have some like fixed output right so that is what conditional can for for like for example sometimes a at a fashion like a shoe designer they want to design using again to design a new shoe but if you did the output sometimes it will begin something really random so if we can give the model some information on prying information like some texturing bearings like on I want a shoot so then the output will definitely be a shoe image so that's what can for so see again so the condition again learn the distribution condition on some auxiliary information so those are the delivery information and can be class labels or which is the most standard form or texting bearings for generating images or like image to image translation and so here in our work so we use condition again because we know what we want we want to ground level images corresponding to the overhead image so given me a overhead image I want to generate a similar-looking ground level images so the overhead image it will be considered as a prior information yeah so that's why we choose like condition again and what you're going for kind of visually is if you send in to the system the image of you know what an area that looks like farmland you want to generate ground level pictures that look like farmland similarly if you're sending in overhead images of an urban setting you want it to generate urban images yes exactly tell us about the different data sources that you use to make all this happen okay so the ground level image is a problem there is there sparse so on what we can have is a satellite images right satellite images stance so in everywhere so it has all the testified images at every spatial location so that is our even image input source so the ground truth and cover map is from the LCM data set so our our study region is a 771 by 71 kilometer in length and so the so geo graphic images are labeled as urban or rural based so then we can do a very simple like a binary classification problems so our input sources are like an a lot so from like Google Maps statistics and until graph API and also the satellite images so those are the three image sources we're using the LCM data set is that your overhead images are those your ground images those are the overhead images okay and the ground level images is from the Geo graph to where the that is a website by some London researchers so those are for the whole United Kingdom so the lengths our classes are provided on a one kilometer grade so for every grade they have the user provided for on the level images so so that's pretty accurate so that is our ground level images and the LCM is overhead images so we have the corresponding relationship to train our models to do some experiments and so I realized that what you're doing here with Gantt is fundamentally generative but part of the the way the problem is set up sounds like an information retrieval problem like you have this overhead image and you have this corpus of ground-level images kind of find the best one are those related in any way oh so there are there is difference there so for like a image retrieval problem is so given me overhead images I want to find the best matching or like ground level images right so there is like there's a database of the ground level images so we're trying to find it a most imagine one so but for generate models it's not where we want some specific images so we don't want some specific ground level images to corresponding to the overhead image we want the whole data distribution to fit so the we want the generator to generate some real looking images to fit the data distribution so we don't care about so which images we're generating we just want it to look realistic for that category so that's the only difference and do you do any kind of information retrieval as part of the solution in other words is your again conditioned only on the overhead image or is it conditioned on the subset of images that you retrieve from a database based on the over image that's a good question so actually right now we only like like a based our model own again so so we're not we're not using any information retrieval thing so but in the future we want to use that because the land use classification problem is a really challenging it's really really hard so maybe later I will introduce my another work so it's a it's also about large-scale land use classification but to the classification accuracy is only like a twenty percent thirty percent so it's pretty challenging so we need to have some like a human prior knowledge inside it so like the information retrieval inside information you mentioned about so we will do that in the future work so in this work what are some of the big challenges that you had to overcome in in this paper and this research okay so can I like a back up a little bit to talk about the motivation of this work so so that would be more clear okay yeah so so the motivation to like all of our work is because although we have like enormous amount of images like from Flickr Instagram so all those images or those videos we could use it to do some geospatial analysis but if we want to do a detailed land use classification map so land use I mean is for examples of whether a building is a hospital or whether it's a shopping mall or something so so how do we use the land so it whether it's residential or for office use so but overhead images cannot handle such case because overhead images is from like above right so from above you can only see a building so you cannot see inside of the building so you don't know what's that designed for so that's what we come up with so we want to use social multimedia to do this kind of a Colonials classification because social media is captured by phones or cameras so they can see inside the building to easily infer what it is of this building so where it is but the challenge the most big challenge of using social media is is uneven spatial distribution it's so simple as speaking because most images are coming from the famous landmarks right so now to the general spatial locations their returns of people having the Golden Gate Bridge in San Francisco like after tower in Paris so it's very easy to tell these landmarks from the photographs but for like residential areas or privacy pre preserved regions which so we don't have enough images so there's a very serious problems if we don't have images so how can we infer the location and all those land use classification information so that's the challenge we really face about so traditional there are several ways to do that so one simple way is to interpolate the features and so if we have for example we have several images at location a and we have several images at another location B and these two locations are close to each other but between these two locations there is no images at all so in this case we just computed the image features of these other images at those two locations and interpolate them along this line but but here we made assumption the assumption is we hope the image features will change smoothly in the spatial domain but actually in most cases is not a case so if we have several images and form like a residential and at location a and a residential and location B then if you do the like interpolation things it will interpolate so between these two locations all the areas are residential based but actually in most cases then the the space in between is like forest or park or just a river or something so it's so I can that natural things so the other kind of method is so we try to use other information like remote sensing images or Google Street wheels because those two information sources are like a dance so everywhere in the on earth so there's a work from that professor Mason Jacobs lab and sSAB 2017 so they just use Google Street wheels to do this kind of things but we find that so all those techniques are based on image features and are based on the assumption I just said so the image features change smoothly in the spatial domain but usually it that's not the case so that's why we do this work so if we cannot use the image features why not we just use images but so the images are missing at those locations so how do we come up with new images so fortunately we have Ganz yeah so that's the motivation of the work you mentioned that early in your research career you spent some time looking at information theory and the like and it strikes me in that context that this is a pretty difficult problem and that there's just not enough information in these satellite images to do a very good job coming up with ground level images how do you get around that yeah so so both like overhead imageries and grownup images has this like advantages and drawbacks so as I said so the overhead imagery is very accurate and it is dancing everywhere so we can totally use this economy for information but if you want to see inside a building so overhead images cannot do that but ground level images we can see inside a building but the biggest problem is the uneven distribution so from the like information theory side of the contour wheel so we should use both information sources to how to combine them in the best way to get the best result yeah this is a good point yeah maybe that maybe to make my question more concrete it has to do with what you considered your error function to be or something along those lines and I guess the the thing that I am trying to articulate here that strikes me as being particularly interesting and challenging here is with a ground level image you know beyond just trying to generate like Hylian you know whether a particular image looks kind of urban or looks like you know greenery or forests or things like that you know there are things that you're trying to generate that aren't at all represented in your in your input so for example your satellite image you know has no building facades but if you're trying to generate imagery around an urban area those ground level images will have building facades so it's just kind of making that stuff up and I'm wondering does that affect play into kind of how you build out the model and what the loss function is and stuff like that does that make any sense yeah yeah yeah it totally makes sense yeah yeah that's a good point so but actually our work is in the very like initial stage so currently we're just using a very traditional conditional again to do that and the last function is just like a generating realistic looking images so we're not considering like whether there is a discrepancy between the growing level images or overhead images or any other like objective function we're just using conditional game without other stuff so the the loss function only thinking about realism versus not realism is that kind of axiomatic for conditional Ganz yes yeah for most of conditional games it's just so for the discriminative Nader its job is just to say so whether this image is fake or real so it's a binary problem and the objective function is it's just that so entropy loss okay got it so you you built this Gann based system to produce these images we were just talking about loss functions like what did you find in terms of its performance how did the system do yeah so in terms of the image quality so I I don't like have a monitor to show the quality here but basically it makes sense so we can generate like some realistic images like ground level images according to the overhead images and but we don't have like a like a evaluation metric so how real they are or how accurate they are so we use another task so that's a land use classification right because if we can do a better land use classification given these like two fake images so we can create fake images all over the ground right so if we can use these images to like a do better land use classification so that both accuracy can be a good indicator of how our model works so let me share the performance of our land use classification problem so if you're if we are using the condition again generated features to do the Lanier's classification we can achieve like a land color land cover classification accuracy with like 73 percent accuracy so actually is kind of like it's a reasonable it's not high but it's reasonable the problems we're thinking here is because the generating our generated images are not realistic enough so there's some of our future work so we have like a three future directions to go so I can talk about later if you like what in the land use classification how many classes are there it's it's only binary classification so like as I mention it's a rural and urban areas so because that's that's only the ground truth we have okay so you've got a bunch of satellite data you feed it into your conditional again it generates an image that is meant to be representative of a ground level view of the area that you indicate and then you are sending that into a classifier that is meant to determine whether it's urban or rural and the is the seventy three percent is that the accuracy of your classifier or the accuracy of the generator based on a trained classifier oh it's a the classifier it's a liar okay the classifier yourself that's what I thought you were saying and so you know how did the the the Gannet self perform relative to with that classifier okay so the baseline of that is so we're compared to like a traditional approaches right so the first approach is I'm talking about is interpolated features so if we have some images we just do the feature extraction first and interpolate the features on those regions without images so the baseline is like sixty five percent okay so sixty five so if we're using Ganz without conditional like we're getting like a like - a 3% improvement like to the almost a 68% and if we're using conditional again we're having like 70% and if we're using conditional Gann generated features we're having the best accuracy like 73 percent so that's the progress of our work and I just want to make sure I understand this I thought what you were saying was that the the classifier itself like totally separate from the Gann part of the system had a 73 percent accuracy did you model the the land use classifier separately and that was 73 percent yes so the land use classifier is totally separate from again so again again is about generator discriminator the job is to like create really great these images yes yes and then we use the features from the discriminate later as the input to the land use classifier right I guess what strikes me as odd is that or at least curious is that if I understood you correctly your ultimate accuracy of your Gann turned out to be exactly the same as the accuracy of your classifier 70% maybe I'm saying wrong so because for again there's no like accuracy right so again for the discriminator it's just a real or fake so we don't care about that accuracy or not so it's usually very high right like 80% or 90% it's where high but what we are care about is an annual specifier yeah so what I'm saying the accuracy is all for ten years classifier is not for can classify okay I think I'm confusing the issue here and not I'm having a good job explaining so I guess I'm thinking that there are as we established there are two separate systems there's the generating system input is an overhead image and its output is a ground-level image and there's a classifying system and its input is a ground-level image and his output is a land-use classify marry land use class vacation yes and so the performance of the the began is a subset of that first system and it's responsible for generating these images and it's kind of judged on whether the images are realistic or not but that whole first system the generator system as a whole right you're giving it a satellite image and it's spitting out a ground a level image we can measure that its accuracy with respect to producing kind of the correct land-use right right yes so that that's one kind of accuracy measure and then there's another which is given any kind of image whether it's generated from our general generator or not you know is this land-use classifier model accurate in classifying the input image correctly and then there's like a third performance metric which is the end-to-end right given a satellite image can we identify the yen the land use correctly and so which of these is the 73% it's a second one so for the third one we're not doing end-to-end right now so we're just we're doing two stage so first stage is scan the second stage at the end is we're not doing end to end at this point yeah I guess if you're not doing end to end you're not really looking at the first the performance of the generator either because that really is danton uh-huh okay got it is the main focus of the research around modeling and testing this the classifier then or is it the generator system as a classifier right because we're so for a geospatial analysis we're more caring about the land use system so basically it's a owning system right so usually the government and city will make a zoning map so every year or so so that is the most helpful things so for the generator it's just a technique we use so if we don't use scale we can use other generator to generate images it's kind of funny in that like you know gans are so popular and quote-unquote sexy term it's almost like a head-fake that kind of pulls your attention away from what you're actually trying to do in this paper yeah yeah yeah actually it is okay so I'm not sure you've even talked much about the classification model so for land use is a pretty basic I can see a convolutional network so like we use ResNet 101 to do that so actually we have like a previous work like a three years ago so is it it's also doing land use classification with the convolution and your network I think at that time we're the first batch of work that use deep learning for this kind of work we also receive the best poster word in the ACM six spatial conference yeah so basically is you at that time it's also the similar problems we're using social media we use deep learning we do the land use classification but the biggest problem is we don't have the ground to us right if we don't have grown shoes we cannot evaluate our model so at that time which is the chicken like a university campus to do the land use classification tasks it's a trivial no example it's a toy example but do we get like a very good accuracy so since that time we start to build a very large data set and actually we took like two years to build a data set and continue this line of land use classification system yeah but in this work in this game paper we're just using a very standard resident 101 network to do the land use classification there's the social media images come into play in this paper oh yes so for the ground level images from the geographic data set so those are all like the ground level images so social media right those social media are contributed by just rest in the United Kingdom so anybody can submit image to the website and the website will show it so where the photo is taken and when it's about so-so yeah for that data set it is totally like social multimedia and remind me how where that comes into play in the system the system is when you do the discriminator right you need to tell and this image is fake or real right but how do you determine whether it's real effect you have to do can some ground level images to know it is real or fake right okay yeah that's what they come into play so although the input is overhead imagery but the ground truth like it's something you're compared to is a ground level images so that's when the social multimedia coming to play and those images are they're not labeled at all with regard to land use or anything like that it's just that's the training the discriminator to understand real versus fake images yes yes that's a beauty Afghan right so we don't care about the labeling we don't need to know about land use classes we just need to know this is a real image that is a fake one okay cool I think I think I've got it now so the to sum it all up right you've got this this challenge of being able to develop accurate land use classifiers but you've got this problem of sparse data right so you've got all of this area that you might want to classify based on satellite images but that you don't have specific ground level data for so you generate some realistic looking ground level data using the conditional gain and then use that to train your classifier and you've been able to kind of incrementally improve the performance of the classifier using this kind of data as opposed to previous data sources that are trying to approximate this data that you don't have yes yes correct yeah got it exactly got it okay awesome yeah I don't know why this was so difficult for me to wrap my head around but I think part of it had to do with this condition head-fake so most people just focus on again so what can can structure you're using are you using the state of art again so yeah so actually we're doing something like for geospatial analysis of the land use classification so that's how we're finally came awesome are there other interesting aspects of this paper that we haven't touched on yet yes so for this paper there's no not but for the future work right I want to talk about a future work a little bit so that's very interesting so the first thing about a future work is so right now our generated ground level images like is not really enough so they lack the image details so for some like houses or like animals they don't they don't look real enough so there's like a plenty of room for improvement so currently we're thinking so where we want to use a attack mate called a progressive gun so from a media though so that master is the key idea is to grow both a generator and discriminator progressively so from a small network from a small resolution we add new layers to the model and increase in Inked win the model so you know progressive manner so this can both speed up the twinning procedure and stabilize a model because care is really hard to train so sometimes it's just a model claps so this could eventually lead us to a very good image resolution because currently our generated image is only like 32 by 32 or 64 by 64 so it's very coarse so but the eventually we'll want to have some images like a 1k by 1k or 2k or even 2k so because most of the remote sensing imagery is it's like a 2k by like 3k so it's very large and it's very details so we won't ever generated ground level images is also large and details so I think that will bring up our performance by a large margin okay then yeah so that's a first direction and there's a second direction is so there's also a recent work by open a I cut glow so if they're using reversible generative models they're not using gas but reversible generative models so that model the most the good thing about it is they're late in space I mean the features so the features are useful for downstream tasks because in gas so the data points can usually not be directly represented in a latent space so because they have no encoder and they don't have all the data distribution but for reverse of generative models they can either they can interpolate between this feature space so it's very smooth so in that case we can directly use the features we don't need to use images so I hope that can be a better solution and then I think like a lot of people are interested in this paper as well so that the glow one it can generate very realistic images and the third direction will be using more as you talked about so maybe using like an information retrieval thing we're using some like a text text information right because so for example if we want on a generate like forest like image so if we just give him the overhead imagery so maybe he cannot like it give us a good image but if we can say so I want a forest a wheel with like with a with one house inside it so maybe in our generated grown-up images we will see exactly one house in a forest like a scenery so that's very promising so that's the third directions to go so so this works this can work is our initial attempt to do this Stan's best interpolations in this direction so we have a lot of work coming up and so hopefully we can get better results in the future you mentioned with regard to the low paper I've seen it but I haven't looked at it in any detail at all you mentioned that part of what it allows you to do is to get these smooth representations in a feature space yes and are you would you then be trying to use that feature space or is the only benefit to you of that that it produces better images I think we will try both so better images and other features I think maybe the image maybe the features makes more sense because for Glo work they're reversible right so the reversible means if you have some input and go into the output it can also use output to get you the input so the the reversible makes the features more robust so it's make sense it is expensive like explainable so in that case the features might be more powerful than the gain features are the features in this feature space are they semantically related like is it kind of an embedding in the feature space where you can get these semantic relationships between these different types of generated images yeah definitely I think they're the features should be like semantically related so if we do like a like a feature space visualization we can see so like like a tree or forests are clustered together and the river water lake those images are clustered together so the features are definitely cementing related and so do you think you'll get to a point where you can start with a ground truth image of a field and say like I want a little bit more you know rivers and then get a stream and a little bit more rivers and get like a bigger body of water or something like that well you're really like smart so that that's something where we're trying to do right now oh okay that's yeah that's more like a manipulative of the image right so yeah if you if we can do that so that's very interesting too that geospatial communities just to kind of wrap things up I'm curious you mentioned how difficult training the the Gans has been can you maybe share with us some things that you've you've learned as you've tried to to work with Gans and conditional Gans okay sure so for guests so because we're using a very standard condition again so which is proposed like 2015 or 16 so it's very earlier stage of the Ken so the training is like it's tables I have a lot of problems so as mentioning another paper I forgot the exact name but they should be like like good practices during like 20 incrementing games so they proposed like oh we should use try date convolution not cooling because cooling can hurt like an image resolution thing during the town sampling and we should like using a smaller crop size and a smaller learning rate something like that but but I don't think that's a major problem right now because most of game models is like easier to Train at this moment so the because the last matrix changes to the versus 10 lost so that is a very stabilized and lost function and also the network architecture change so right now we can trim very deep networks using games and for example I can rest at 101 and that the output can be several hundred resolutions or even 1k resolutions so the training of the game is not that hard right now awesome well II thank you so much for taking the time to share with us what you're working on is really interesting stuff yeah no problem all right everyone that's our show for today for more information on e or any of the topics covered in this episode head over to Twilley Icom slash talks last 172 as always thanks so much for listening and catch you next time\n"

Generating Ground-Level Images From Overhead Imagery Using GANs with Yi Zhu - TWiML Talk #172

Random Videos