Applied Machine Learning for Publishers with Naveed Ahmad - TWiML Talk #182

A Project-Based Approach to Content Recommendation: Lessons from the San Francisco Chronicle Relaunch and Video Recommendations

As part of the San Francisco Chronicle's user interface rewrite, we released a revamped version of the site a few months ago. The project was an opportunity to deploy content-to-content recommendation technology alongside the redesign. The goal was to extract the entities recognized by Google's Natural Language API and recommend related content based on those shared entities.

We started by adding a field for each entity recognized by Google NLP to our BigQuery data. We then generate recommendations from a given piece of content to other pieces of content based on the entities they share. Note that this project didn't use BigQuery ML; it relies on a plain SQL query that counts the overlapping entities between two pieces of content and scores how relevant one piece is to the other.
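The overlap scoring can be sketched in plain Python. This is a simplified stand-in for the actual BigQuery SQL, with hypothetical field names and data; the production query also incorporates saliency scores and additional editorial rules.

```python
def score_related(entities_a, entities_b):
    """Score relatedness of two articles by entity overlap.

    entities_a / entities_b map entity name -> salience (0..1),
    mirroring the fields the Google NLP service attaches to content.
    """
    shared = set(entities_a) & set(entities_b)
    # Weight each shared entity by its salience in the candidate article,
    # so entities central to that article count for more.
    return sum(entities_b[e] for e in shared)

def recommend(source_entities, candidates, k=3):
    """Rank candidate articles against a source article's entities."""
    scored = [(score_related(source_entities, ents), slug)
              for slug, ents in candidates.items()]
    scored.sort(reverse=True)
    return [slug for score, slug in scored[:k] if score > 0]

# Hypothetical articles and entity/salience data for illustration.
articles = {
    "giants-win": {"San Francisco Giants": 0.9, "baseball": 0.5},
    "wine-country": {"Napa Valley": 0.8, "wine": 0.7},
    "ballpark-food": {"baseball": 0.4, "garlic fries": 0.6},
}
source = {"San Francisco Giants": 0.7, "baseball": 0.6}
print(recommend(source, articles))
```

Because the production version runs as a single SQL job over all article pairs, the same logic becomes a self-join on an article-to-entity table grouped by article pair.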

To serve the data, we run the computation every 15 minutes, which is roughly the frequency at which new content arrives, and write the results to a table in a Postgres database. The web service layer uses memcache in front of this Postgres database to serve the recommendations.

Another project we've done is video content recommendation. We extract the audio from each video, transcribe it to text, and apply Google NLP to the transcript to extract entities. Since our articles carry the same NLP tags, and our videos now have entities derived from voice transcription, we can recommend videos alongside textual content using the same entity-overlap technology.
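The serving path for these recommendations is a standard cache-aside pattern: check memcache, fall back to Postgres on a miss, repopulate the cache. A minimal sketch, with an in-memory stand-in for both memcache and the Postgres lookup (all names here are illustrative, not the actual service):

```python
import json

class RecommendationService:
    """Serve precomputed recommendations: cache first, database on miss."""

    def __init__(self, cache, db_lookup, ttl=900):
        self.cache = cache          # any memcache-like client
        self.db_lookup = db_lookup  # slug -> list of recommended slugs
        self.ttl = ttl              # align TTL with the 15-minute refresh

    def get(self, slug):
        key = f"recs:{slug}"
        cached = self.cache.get(key)
        if cached is not None:
            return json.loads(cached)
        recs = self.db_lookup(slug)            # hit Postgres on cache miss
        self.cache.set(key, json.dumps(recs))  # a real client would pass ttl
        return recs

# Demo with an in-memory stand-in for memcache and the Postgres table.
class FakeCache(dict):
    def set(self, k, v):
        self[k] = v

table = {"giants-win": ["ballpark-food", "stadium-guide"]}
svc = RecommendationService(FakeCache(), lambda slug: table.get(slug, []))
print(svc.get("giants-win"))  # first call: cache miss, reads the table
print(svc.get("giants-win"))  # second call: served from the cache
```

Setting the cache TTL at or below the recompute interval keeps served recommendations at most one refresh cycle stale.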

We transcribed the audio using the Google Speech-to-Text API, which converts sound into text. Our full objective was also to have captioning on the videos; the transcription worked well most of the time, but it failed often enough in certain cases that we haven't used it for captions. The text is good enough for NLP, though: we apply it to extract categories and tags, and those tags are what eventually let us recommend textual content alongside the videos.
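The key point is that tagging tolerates transcription noise better than captioning does: a few misheard words rarely change which entities dominate a transcript. A deliberately crude sketch of transcript tagging (the real system calls the Google Natural Language API; this capitalized-phrase heuristic is purely illustrative):

```python
import re
from collections import Counter

def extract_tags(transcript, top_n=5):
    """Very rough stand-in for entity extraction on a transcript:
    treat repeated capitalized phrases as candidate tags.
    The production system uses the Google NLP entity API instead."""
    # Find runs of capitalized words ("Golden Gate Bridge", "Houston").
    phrases = re.findall(r"(?:[A-Z][a-z]+ ?)+", transcript)
    counts = Counter(p.strip() for p in phrases)
    return [tag for tag, _ in counts.most_common(top_n)]

transcript = ("Houston weather update. Houston schools closed early. "
              "Rain is expected in Houston through Friday.")
print(extract_tags(transcript))
```

Even if the speech API garbled a sentence or two, the most frequent entity would likely still surface, which is why the tags remained usable for recommendations while the raw text was not usable as captions.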

Looking forward, we want to apply deep learning to recommendations. A lot of research has been done in the past couple of years on deep learning for recommendation systems, and we're doing our own research with the goal of launching a deep-learning-based recommender soon. Our advice: use existing frameworks like TensorFlow, or higher-level services such as the NLP, image, and video APIs, BigQuery ML, and AutoML, wherever a problem is already solved. It's a trade-off between hiring data scientists to rebuild something from scratch and paying for a service and applying it; building in-house makes sense only when the problem or the data set is novel.

We're also exploring more use cases for BigQuery ML, such as churn prediction and its converse, propensity modeling: the likelihood that a user will subscribe to one of our services. Overall, we're looking forward to implementing these technologies across our system.
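Churn prediction here is binary classification, which BigQuery ML expresses as a `logistic_reg` model trained directly in SQL. The underlying idea can be sketched in stdlib-only Python; the feature names and data below are made up for illustration, and BigQuery ML additionally handles feature normalization and tuning for you.

```python
import math

def train_logistic(rows, labels, lr=0.1, epochs=500):
    """Tiny SGD logistic regression: the same model family BigQuery ML's
    'logistic_reg' option trains, minus all the conveniences."""
    n = len(rows[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss w.r.t. z
            b -= lr * err
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w, b

def predict(w, b, x):
    """Probability the subscriber cancels next month."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical monthly features: [sessions last month, newsletter signups]
X = [[20, 3], [1, 0], [15, 2], [0, 0], [30, 4], [2, 1]]
y = [0, 1, 0, 1, 0, 1]  # 1 = canceled the following month
w, b = train_logistic(X, y)
print(predict(w, b, [25, 3]))  # engaged reader
print(predict(w, b, [1, 0]))   # disengaged reader
```

The point of the sketch is the shape of the problem: a month of usage attributes in, a cancellation label out. In BigQuery ML the equivalent is a single `CREATE MODEL` statement over the same feature columns, followed by `ML.PREDICT`.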

"WEBVTTKind: captionsLanguage: enhello and welcome to another episode of tormal talk the podcast why interview interesting people doing interesting things in machine learning and artificial intelligence I'm your host Sam Carrington in today's show we're joined by the Vidocq mod senior director of data engineering and machine learning at Hearst Newspapers a few months ago na'vi gave a talk at the google cloud next conference on how publishers can take advantage of machine learning in this conversation we dive into the role of ml at Hearst including their motivations for implementing it and some of their early projects the challenges of data acquisition within a large organization and the benefits they enjoy from using Google's bigquery as their data warehouse enjoy the show all right everyone I am on the line with Naveed Ahmad Navid is senior director of data engineering and machine learning at Hearst Navid welcome to this week in machine learning in AI thank you thank you for having me on your show so you've been in the publishing industry for about 10 years now can you tell us a little bit about both your current role and some of the past things you've done in publishing sure so I'm currently a senior director for data engineering and machine learning at Hearst being here for about two years of my rule has been building the data warehouse building personalization and also doing predictive analysis using that data and before Hearst I was at you know times I worked in the subscription space where they love working on their CRM system and also some of the aspects of machine learning like churn modeling and doing email content detection and before that I was in Thomson Reuters where I all built their data used a distribution platform as well was part of their CMS team a contributor to recommendation systems and before that I was mostly in telecommunication and so you recently presented at the Google cloud next conference on applied machine learning or publishers 
Before we jump into that topic, maybe you should take a second to provide an overview of Hearst for those who aren't familiar with the company.

Naveed: I work specifically in Hearst's newspaper division. Hearst is a very large organization with more than 300 businesses, including magazines, investments, and television channels. Within newspapers alone there are about 40-plus websites, including names like the San Francisco Chronicle, the Houston Chronicle, and the Times Union. Hearst headquarters is located here on 57th Street in New York. That's some background about Hearst.

Sam: One of the things you mentioned in describing your background, and in our conversation before the interview started, was the role of the data warehouse in enabling you to perform the types of machine learning that you want to perform at Hearst. Can you talk a little bit about the data warehouse and the process for establishing it?

Naveed: One of the first tasks for me when I joined Hearst was to build a data warehouse. I used Google BigQuery as the de facto data warehouse. The idea was that we need all the data sets in one place to be able, for example, to look at the relationship between newsletters, the web, and subscriptions. The first use case was to build business intelligence, regular reporting, on top of this data; the same data that's used to report on the current state can then be used for predictive modeling. Our data sets include Google Analytics, our content, and our newsletters: all sorts of data related to our business is in BigQuery, and this forms the foundation for machine learning.

Sam: So the data warehouse is central to your ability to perform machine learning and analytics, and it was one of the first things you established at Hearst. Can you talk a little bit about any challenges you experienced trying to
centralize all of the data from these various sources?

Naveed: The challenge was that our data was sitting in different formats. Before the warehouse, people would get data from individual systems, or there were Excel worksheets going around. So it was really about figuring out the data sources and then building a platform for ETL. Some of the bigger data sources, like Google Analytics and DFP, were easier to get into BigQuery, but for some of the others we had to build specific code to ETL the data into BigQuery.

Sam: And what is DFP?

Naveed: DoubleClick for Publishers.

Sam: OK, so that's your advertising system.

Naveed: Yes, this is all the log data for each advertisement impression.

Sam: So the data warehouse has your analytics, the clickstream data for people visiting the site, and it's got information about the advertising interactions, maybe tied to those clickstreams. Does it also contain content information?

Naveed: Yes, all our CMS content is in BigQuery, and all our newsletter delivery data is in BigQuery.

Sam: Is that data replicated to BigQuery for analytics, or does it sit in its native place with you publishing directly to and from BigQuery?

Naveed: It is replicated. We have a separate CMS system with its own database, so we're replicating that to BigQuery.

Sam: These are very different types of data. What's the goal of pulling this all into a single place?

Naveed: Each of these data sets has identifiers that link it up with the other data sets. For example, a newsletter has a hashed email address that you can look up in the subscription data, and from the subscription data we can link it up to Google Analytics data. The idea is to be able to connect all the aspects of a user using these identifiers.

Sam: It sounds like the primary focus, from a predictive analytics and machine learning perspective, is based on this user-centric linked data.

Naveed: Yes, absolutely.

Sam: So maybe let's talk through some of the
different things that you've done with the data from a machine learning perspective. What are the different challenges you're trying to address?

Naveed: We've built a few different machine-learning-based products and predictive models using the data we have in BigQuery. The first one, currently, is a churn model: churn prediction means being able to predict how likely a subscriber is to cancel their subscription. It's very relevant in the media space because it's a challenging environment in which to keep subscribers. We have subscription data, the different attributes of each subscriber's usage behavior, how many newsletters they have signed up for, and the customer service data associated with each subscriber. The way it works is that we take about a year's worth of data, take the subscribers' attributes month by month, and feed them into a machine learning model, and we know who canceled in the following month versus who didn't. So this is a binary classification problem: using six months of data we have training data of subscribers who canceled versus those who didn't, and we can build a predictive model on it. One of the things we've done, since we're on BigQuery, is use the newly launched feature called BigQuery ML, which was released at Google Next back in August. The great thing about BigQuery ML is that you can do model training and prediction right in SQL syntax, i.e. you don't have to write Python code, and it handles machine learning goodies like normalization of features and fine-tuning the model. One thing I'd emphasize is that this enables people who don't even have a machine learning background, people in business intelligence or with a background in SQL, to easily write machine learning code just using SQL syntax. We built our first model for San Francisco, and we're just working on integrating it with
our marketing and CRM systems. So this was one of the use cases for machine learning.

Sam: If I could jump in before you go to the next one: one of the things that really struck me at the Google Next conference, and not to turn this into a BigQuery commercial, but I was really surprised by the enthusiasm for BigQuery. People seem to really love that database. Can you net out for me why folks are excited about it, and about the BigQuery ML piece they just announced?

Naveed: In general, BigQuery is a fully managed data warehouse, so you don't have to have a DBA to fine-tune or optimize the database. It's based on Google's internal Dremel technology, which from what I've heard is heavily used inside Google, and they've exposed that as BigQuery. Their philosophy is full scan: any query basically spawns off machines, or compute, in the backend and gets your results very quickly. Even if you have large data sets at terabyte or petabyte scale, it only takes a few minutes to run SQL. So it's very easy in terms of maintenance, it's based on SQL syntax, and that's why it's a very attractive option for data mining. With BigQuery ML they've built machine learning on top of BigQuery, which goes even further: it's very fast because it's using the same compute infrastructure, and it abstracts the machine learning as SQL. I foresee that it will enable machine learning for a lot of people, especially those already on BigQuery.

Sam: So is the idea that, just as you have aggregators in SQL, like AVG or MAX, you can apply some kind of model? Or is it different?

Naveed: It's exactly that. Just like those functions, they've introduced a few different functions to train, predict, and even get machine learning metrics; once the machine
learning model is built, you can get, for example, the AUC of the model, or precision and recall. Again, you do all of those things just with a function call in SQL syntax, as compared to using a tool like scikit-learn or TensorFlow. The other advantage is that you don't have to take the data outside of the system; you do everything in one place. If you were using scikit-learn or TensorFlow, there's this whole process of extracting the data, massaging it into a format the machine learning framework can understand, building the model, and pulling the results back into your data warehouse. That whole cycle gets reduced because you're doing everything inline in BigQuery.

Sam: You were about to go into another use case, but actually, back on churn: this isn't necessarily a machine learning question, and I've talked about churn prediction across many different industries and understand broadly how it's applicable, but I'm curious, in the case of a publisher, what specifically is Hearst going to do on the business side once it's able to predict that a user has a high likelihood to churn?

Naveed: That's a good question, and we're working with marketing and people in the subscription business on the different options. One is that in a marketing platform we're able to do messaging and send emails. Typically, a user who is not very engaged has one of the attributes that indicate a person is going to cancel their subscription. So an email to nudge somebody, "hey, did you know there is a certain feature," or educating our print subscribers that they have free digital benefits, are some of the things we can employ to get fewer people to cancel their subscriptions.

Sam: You built this on BigQuery ML using this new SQL-like interface. Can you talk about how the experience of building these models
using a SQL type of interface differed from traditional approaches? You mentioned scikit-learn; how are they different? I'm particularly curious about the different skill level involved, but also about any differences in the way you need to manage the models or productize them, those kinds of things.

Naveed: That's a good question. One thing is that getting started with BQ ML is really fast; I actually built the first prototype within a day and got a very basic churn model up and running. I've done this before in my previous job using scikit-learn, and the difficulty really was aggregating data from different sources, like getting data from an Oracle database and a Hadoop database. That took most of the time. Typically the machine learning technology piece is the easier part, and more time is spent on getting the data and then figuring out the right features to use in your machine learning model. With BigQuery ML, anybody who knows SQL can be trained to use it within, I think, half a day. The tricky part for someone new is learning some of the machine learning concepts, for example, once you train a model, how do you measure that it succeeded: basic concepts like precision, recall, accuracy, and AUC, area under the curve. That's where more time is spent, figuring out the features and then measuring the results, and I feel that part doesn't differ between scikit-learn and BigQuery ML. But getting used to BigQuery ML is very quick, you don't have to use any programming language, and the cycle I was talking about, data export, train model, data import, gets really reduced. So in terms of the number of experiments you can do using BigQuery
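The evaluation metrics named here (precision, recall, AUC) are worth making concrete, since they are what BigQuery ML's evaluation functions return for a model like this. A stdlib-only sketch with made-up labels and scores:

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = churned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

def auc(y_true, scores):
    """Area under the ROC curve via its rank interpretation: the
    probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative model scores for six subscribers.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(precision_recall(y_true, y_pred))
print(auc(y_true, scores))
```

Precision answers "of the subscribers we flagged, how many really churned?", recall answers "of those who churned, how many did we flag?", and AUC summarizes ranking quality across all thresholds, which is why it's the headline metric for a churn score.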
ML, you can do a lot more experiments with this technology.

Sam: And is that because they're presumably scaling out the training behind the scenes and it's just a lot faster, or are there other aspects to reducing that cycle time?

Naveed: It's a lot faster because of the BigQuery technology. The second thing is what I mentioned before: exporting data for a machine learning model outside of the system, training it, and fetching the results back gets significantly reduced. And the third is that in our case, since our data was all sitting in one place, we didn't have to incur the ETL cost of getting data from different sources; everything was sitting in BigQuery.

Sam: What are some of the other types of machine learning projects you've done at Hearst?

Naveed: Another one was an application of natural language processing. There's a case study published back in November; if you just Google "Hearst and Google," that will probably be the first link. We've applied Google's natural language processing to all our content, so each article is tagged. We're using two features of the natural language service. One is classification of content, putting it into broad categories, like whether content is about food and wine or about real estate. The other, more detailed one tags entities in the content; the entities can be proper nouns or common nouns, with metadata like how salient they are to that article, and Wikipedia links for popular entities. We did that for all our content from all our websites, and it's stored in our CMS and also replicated into BigQuery for further analysis. Some of the use cases for this NLP data: we build business intelligence reports. For example, if you want to see how a particular personality is trending over time, using Google NLP data, Google Analytics data, and our content data in BigQuery, we can build reports. Also, if
we want to see which content categories get more traffic relative to the content we publish, like which categories we should be focusing on as publishers, we can build all sorts of reports using these three data sets. So that's one use case for Google NLP. The other, more interesting one is that we've integrated this with DoubleClick for Publishers when we render our ads. Whenever an ad is rendered, we also pass a key-value pair indicating which category the ad belongs to, the category of the content on which the ad is being displayed. Over a month this builds a database where we can say: show me all the ads that were displayed on, for example, Olympics content, or ads on our food content. So if a new advertiser comes on board and says, "I want to run a campaign on, let's say, basketball or Olympics content," we've built this capability using our tagging technology. They can just specify a criterion in DoubleClick for Publishers: for this campaign, advertise on all the content related to, let's say, tennis or basketball. That's another use case for natural language processing.

Sam: I know you've done some work with Google's AutoML for NLP, which allows you to fine-tune some of their models, but the classification was their off-the-shelf NLP service, the one where, if I'm not mistaken, you're not able to train with your own data. Is that right?

Naveed: That's a good question. AutoML for NLP, which is another recently released tool, works on top of the classification. They provide a taxonomy of about 700 categories by default, but say somebody wants to train their own category for some unique or novel content; they can use AutoML for natural language to train their own custom categories, and that works on top of the existing categories. So if you make a web service call to this AutoML, it not only returns the default categories
but also the trained version as well.

Sam: And so, for the types of articles you were initially trying to categorize, were the 700 built-in categories sufficient? What was the experience of trying to map your business to this pre-canned AI service?

Naveed: Most of the time it works very well. For most of our use cases those categories work fine. There are a few categories that don't work well, especially sensitive content, which they don't really split out: any article about crime or violence gets categorized as "sensitive content." We would ideally have liked that to be split into something more granular, but Google has some restrictions; they just don't want to split that out. One category I wanted, and did a POC on, was detecting evergreen content, because we had labeled several thousand articles with an evergreen label. Evergreen means content with a shelf life of more than a few days or a few weeks, for example an article reviewing a museum, or an article about real estate. Those articles have a longer shelf life, and it's very important for a newspaper to be able to detect that type of content, because it can be used in recommendations. Even an article written two years ago can be reused, which otherwise would just sit there with nobody reading it. So using AutoML for text I built a classifier that detects evergreen versus non-evergreen content. Our editorial staff helped us label this content; we went to all the different markets to take their content and label it. Initially I did a very quick prototype using TensorFlow, and then I had a hunch that Google was working on this AutoML feature for NLP. Once the feature came out, doing a POC was really quick: within a day I was able
to train an AutoML model that can differentiate evergreen content. The hard part in this is really getting the labeled data. The interesting thing is that once the model was trained, I even tried it on CNN and New York Times content, and it was able to differentiate evergreen from non-evergreen content. Right now we're incorporating that into one of our recommendation systems.

Sam: One of the things I noticed a couple of times in our conversation: you're the senior director of data engineering and ML, but it sounds like on a couple of these POCs you just kind of played around and put some stuff together. You make it sound really easy. It can be easy to experiment, but if you actually want to use this in a way the business is going to depend on, there's, at least traditionally, a lot more that needs to go into it in terms of engineering. There are maybe several questions in here, but part of it is: does the cloud change that dynamic, in your perspective? Or are these projects that you built and then kind of threw over the wall to some team that then had to maintain them?

Naveed: Is your question specifically about AutoML, or about recommendations and the other things I've been talking about?

Sam: Specifically, with churn prediction and AutoML, the impression I got was that it almost sounded like you got bored one weekend and came up with these models in a day or so.

Naveed: They weren't done in a day. For the churn model, we started when BQ ML was in alpha state, only available to certain customers; I think it was about May when we started building it, and it took a couple of months of refining and fine-tuning, and now we're working on another market, doing this prediction for Houston. AutoML was done in phases. We started last year; we said we need to
collect the data, since we wanted to be able to do this. So one mini-project was figuring out how to get this data and get it labeled, and once we had the labels, the data just sat there for a while until I did some prototyping work using TensorFlow. Then, when AutoML came out, since we had already done the hard work of labeling the data, it was something very quick to do. So one piece is that the project wasn't done in a weekend; there were different phases, and they happened at different times. The other thing is that, in general, the newer features in the cloud, especially the higher-level APIs, make it much quicker to do prototyping and get things out. AutoML is a feature meant for people to do things much more quickly: you don't have to learn TensorFlow to use it. You just upload a labeled data set, it takes a few hours to train a model, and you can start using it.

Sam: You've done some work on recommendation systems as well.

Naveed: Yes. As I was saying, one of the use cases for the NLP data sitting in our BigQuery data warehouse is a recommendation system. The idea is that if two pieces of content have overlapping NLP entities, they're highly likely to be related to each other. This is content-to-content recommendation. Since we had this in our BigQuery database, we built a recommendation system that works using a big SQL query: it takes entities from one set of articles, matches them up with another, incorporates the saliency score, and applies a whole set of rules we built out in SQL, producing a table of recommendations. That's fronted by a web service layer. Our website, the San Francisco Chronicle, and I think three websites in total, are using that content-to-content recommendation, and this was released as part of San Francisco's user interface rewrite. A few months ago we released a revamp of
the SF Chronicle, and this content-to-content recommendation was part of that deployment.

Sam: So you've got this content in BigQuery, you're using Google NLP to add fields for each of the entities that are recognized, and then you're using BigQuery ML to generate recommendations from a given piece of content to other pieces of content based on these shared entities recognized by the NLP service?

Naveed: Yes, with one minor correction: this one doesn't use BQ ML. It's plain SQL that looks at the number of overlapping entities between two pieces of content.

Sam: Got it, so it's basically scoring, based on the overlap, how relevant one piece of content is to another. And then you said you're fronting that with a website. Are you serving the data directly from BigQuery, or do you push it out to some cache or database?

Naveed: We push it out. The computation is done on a periodic basis, about every 15 minutes, which is about the frequency at which we're getting new content. We build out a table in a Postgres database, and the web service layer uses memcache as well as this Postgres database to serve the recommendations.

Sam: You've also done some video content recommendations.

Naveed: Yes. Another project we've done is taking our video, converting it to sound, extracting text from that sound, and then applying Google's NLP to extract entities from it. Our articles have the NLP tags, and for our video, by using voice transcription and applying NLP to that text to extract entities, we can recommend videos alongside content based on this technology.

Sam: And to transcribe the audio, did you use the Google Speech-to-Text API?

Naveed: Yes, Google has a speech API which, given any sound, can convert it to text.

Sam: What was your experience getting reliable results from that?

Naveed: Our full objective was to also have captioning on the videos. Most of
the time it works very well at sound transcription, but there were a few cases where it didn't work very well, so we haven't used it for captioning. But we took this text, and it seems to work well if we apply NLP to it, if we just want to extract categories and tags; for that purpose it works well, and those tags are used for eventually being able to recommend textual content along with the videos.

Sam: Any final thoughts or words of wisdom that you shared as you were wrapping up your presentation on these use cases?

Naveed: Some of the advice I gave: one of the things we want to do in the future is apply deep learning to recommendations. There's been a lot of research in the past couple of years on the application of deep learning to recommendation systems; we're actually doing some research, and we want to launch very soon with a recommendation system that applies deep learning. So I recommended that everybody should look into deep learning and TensorFlow. But you don't have to build everything from scratch: if a problem is already solved by higher-level APIs, such as the NLP, image, and video APIs, or BQ ML and AutoML, we should be using those. It's really a trade-off: do you want to hire data scientists to rebuild a thing, or do you want to just pay for that service and apply it? Only in cases where the problem is novel, or the data set is novel, do you want to build your in-house machine learning model; for example, we want to be able to do recommendations using deep learning, and that's something we're building in-house. Other things we're exploring are more use cases for NLP: we've identified that we can use it with our newsletters to generate newsletters automatically, and that's something we're looking at too. And also more uses for BigQuery ML: just like churn, we can do the converse, propensity modeling, what's the
likelihood of a user to subscribe to our whatever websites so this is overall our our future plans awesome well if you thanks so much for taking the time to share what you've been up to with us has been really interesting and I appreciate it ok thank you very much alright everyone that's our show for today for more information on the bead or any of the topics covered in this episode visit twin will a Icom slash talks last one eighty two if you're a fan of the podcast we'd like to encourage you to visit your Apple or Google podcast app and leave us a 5-star rating and review your reviews help inspire us to create more and better content and they help new listeners find the show as always thanks so much for listening and catch you next time youhello and welcome to another episode of tormal talk the podcast why interview interesting people doing interesting things in machine learning and artificial intelligence I'm your host Sam Carrington in today's show we're joined by the Vidocq mod senior director of data engineering and machine learning at Hearst Newspapers a few months ago na'vi gave a talk at the google cloud next conference on how publishers can take advantage of machine learning in this conversation we dive into the role of ml at Hearst including their motivations for implementing it and some of their early projects the challenges of data acquisition within a large organization and the benefits they enjoy from using Google's bigquery as their data warehouse enjoy the show all right everyone I am on the line with Naveed Ahmad Navid is senior director of data engineering and machine learning at Hearst Navid welcome to this week in machine learning in AI thank you thank you for having me on your show so you've been in the publishing industry for about 10 years now can you tell us a little bit about both your current role and some of the past things you've done in publishing sure so I'm currently a senior director for data engineering and machine learning at 
Hearst. I've been here for about two years; my role has been building the data warehouse, building personalization, and also doing predictive analysis using that data. Before Hearst I was at The New York Times, where I worked in the subscription space on their CRM system and on some aspects of machine learning, like churn modeling and email content detection. Before that I was at Thomson Reuters, where I built their data distribution platform, was part of their CMS team, and contributed to recommendation systems. And before that I was mostly in telecommunications.

Sam Charrington: You recently presented at the Google Cloud Next conference on applied machine learning for publishers. Before we jump into that topic, maybe you should take a second to provide an overview of Hearst, for those who aren't familiar with the company.

Naveed Ahmad: I work specifically in the Hearst newspapers division. Hearst is a very large organization with more than 300 businesses, which include magazines, investments, and television channels. Within newspapers alone there are about 40-plus websites, including names like the San Francisco Chronicle, the Houston Chronicle, and the Times Union. Hearst headquarters is located at 57th Street in New York. So that's some background about Hearst.

Sam Charrington: One of the things you mentioned in describing your background, and in our conversation before the interview started, was the role of the data warehouse in enabling the types of machine learning you want to perform at Hearst. Can you talk a little bit about the data warehouse and the process of establishing it?

Naveed Ahmad: One of the first tasks for me when I joined Hearst was to build a data warehouse. I used Google BigQuery as the data warehouse, and the idea was that we need all the datasets in one place to be able to look at things like the relationship between newsletters, the web, and subscriptions. The first use case was to build business intelligence — regular reporting on top of this data — and the same data used for reports about the current state can also be used for predictive modeling. Our datasets include Google Analytics, our content, and our newsletters — all sorts of data related to our business is in BigQuery, and this forms the foundation for machine learning.

Sam Charrington: So the data warehouse is central to your ability to perform machine learning and analytics, and it was one of the first things you established at Hearst. Can you talk a little bit about any challenges you experienced trying to centralize all the data from these various sources?

Naveed Ahmad: The challenges were that our data was sitting in different formats. Before the warehouse, people would get data from individual systems, or there were Excel worksheets going around. So it was really about figuring out the data sources and then building a platform for ETL. Some of the bigger data sources, like Google Analytics and DFP, were easier to get into BigQuery, but for some of the others we had to build specific code to ETL the data into BigQuery.

Sam Charrington: And what is DFP?

Naveed Ahmad: DoubleClick for Publishers.

Sam Charrington: OK, so that's your advertising system.

Naveed Ahmad: Yes — it's all the log data for each advertisement impression.

Sam Charrington: So the data warehouse has your analytics — the clickstream data for people visiting the site — and information about advertising interactions. Does it also contain content information?

Naveed Ahmad: Yes, all our CMS content is in BigQuery, and all our newsletter data is in BigQuery as well.

Sam Charrington: And is that data replicated to BigQuery for analytics, or does it sit in its native place, with you publishing directly to and from BigQuery?

Naveed Ahmad: It is replicated. We have a separate CMS system with its own database, so we're
replicating that to BigQuery.

Sam Charrington: And these are very different types of data. What's the goal of pulling this all into a single place?

Naveed Ahmad: Each of these datasets has identifiers that link it up with the other datasets. For example, a newsletter has a hashed email address that you can look up against subscription data, and from subscription data we can link up to Google Analytics data. The idea is to be able to connect all the aspects of a user using these identifiers.

Sam Charrington: It sounds like the primary focus, from a predictive analytics and machine learning perspective, is this user-centric linked data.

Naveed Ahmad: Yes, absolutely.

Sam Charrington: So maybe let's talk through some of the different things you've done with the data from a machine learning perspective. What are the different challenges you're trying to address?

Naveed Ahmad: We've built a few different machine-learning-based products and predictive models using the data we have in BigQuery. Number one currently is a churn model — churn prediction, being able to predict how likely a subscriber is to cancel their subscription. It's very relevant in the media space, because it's a challenging environment in which to keep subscribers. Since we have subscription data, we have the different attributes of each subscriber's usage behavior — how many newsletters they've signed up for — as well as the customer service data associated with each subscriber. The way it works is we take about a year's worth of data, take each subscriber's attributes for a month, and feed them into a machine learning model, and we know which of those subscribers canceled in the next month versus which didn't. So this is a binary classification problem: using six months of data, we have training data of subscribers who canceled versus subscribers who didn't, and we can build a predictive model on it. One of the things we've done, since we're on BigQuery, is use the newly launched feature called BigQuery ML, which was released at Google Next back in August. The great thing about BigQuery ML is that you can do model training and prediction right in SQL syntax — you don't have to write Python code — and it handles some of the machine learning housekeeping, like normalization of features and fine-tuning the model. One of the things I emphasize is that this enables people who don't have a machine learning background — people in business intelligence, or anyone with a background in SQL — to easily write machine learning code just by using SQL syntax. We built out our first model for San Francisco, and we're now working on integrating it with our marketing and CRM systems. So that was one of the use cases for machine learning.

Sam Charrington: If I could jump in before you go to the next one — one of the things that really struck me at the Google Next conference, and not to turn this into a BigQuery commercial, but I was really surprised by the enthusiasm for BigQuery. People seem to really love that database. Can you net out for me why folks are excited about it, and about the BigQuery ML piece they just announced?

Naveed Ahmad: In general, BigQuery is a fully managed data warehouse, so you don't have to have a DBA to fine-tune or optimize the database. It's based on Google's internal Dremel technology — what I've heard is that Dremel is heavily used inside Google — and they've exposed that as BigQuery. Their philosophy is full scan: any query basically spawns off machines, or compute, in the backend and is able to get your results very quickly. Even if you have large datasets at terabyte or petabyte scale, it only takes a few minutes to run the SQL. It's very easy in terms of maintenance, it's based on SQL syntax, and that's why it's a very attractive option for data mining.
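As an illustrative sketch of the kind of churn model described here — all dataset, table, column names, and dates below are hypothetical, not Hearst's actual schema:

```sql
-- Train a logistic regression churn model on monthly subscriber snapshots.
-- One row per subscriber per month; canceled_next_month is a 0/1 label.
CREATE OR REPLACE MODEL `warehouse.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['canceled_next_month']
) AS
SELECT
  visits_last_30d,
  newsletter_signups,
  customer_service_contacts,
  months_subscribed,
  canceled_next_month
FROM `warehouse.subscriber_monthly_features`
WHERE snapshot_month < '2018-07-01';

-- Inspect quality metrics with a function call.
SELECT precision, recall, roc_auc
FROM ML.EVALUATE(MODEL `warehouse.churn_model`);

-- Score the latest month of subscribers.
SELECT *
FROM ML.PREDICT(
  MODEL `warehouse.churn_model`,
  (SELECT visits_last_30d, newsletter_signups,
          customer_service_contacts, months_subscribed
   FROM `warehouse.subscriber_monthly_features`
   WHERE snapshot_month = '2018-07-01'));
```

Because training, evaluation, and prediction are all SQL statements, the whole workflow runs where the data already lives.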
And with BigQuery ML, they've built machine learning on top of BigQuery. It's very fast, because it's using the same compute infrastructure, and it abstracts the machine learning out as SQL. I foresee it enabling machine learning for a lot of people, especially those already on BigQuery.

Sam Charrington: So is the idea that, just as you have aggregators in SQL — average, or max, or something like that — you can apply some kind of model? Or is it different?

Naveed Ahmad: It's absolutely that. Just like those functions, they've introduced a few different functions to train, to predict, and even to get machine learning metrics — once the model is built, to get, for example, the AUC, or precision and recall. You do all of those things just by a function call in SQL syntax.

Sam Charrington: As compared to using a tool like scikit-learn or TensorFlow.

Naveed Ahmad: The other advantage is that you don't have to take the data outside of the system — you do everything in one place. If you were using scikit-learn or TensorFlow, there's this whole process of extracting the data, massaging it into a format that the machine learning framework can understand, building the model, and pulling the results back into your data warehouse. That whole cycle gets reduced, because you're doing everything inline in BigQuery.

Sam Charrington: You were about to go into another use case, but actually, back on churn — and this isn't necessarily a machine learning question — I've talked about churn prediction in many cases across many different industries and understand broadly how it's applicable, but I'm curious, in the case of a publisher, what specifically is Hearst going to do on the business side once it's able to predict that a user has a high likelihood to churn?

Naveed Ahmad: That's a good question, and we're working with marketing and people on the subscription business side on the different ways. One is that, in a marketing platform, we're able to do messaging and send emails. Typically, a user who is not very engaged — that's one of the attributes indicating a person is going to cancel their subscription. So an email to nudge somebody — "hey, did you know about this feature?" — or educating our print subscribers that they have free digital benefits, are some of the things we can employ to get fewer people to cancel their subscriptions.

Sam Charrington: You built this on BigQuery ML using this new SQL-like interface. Can you talk about how the experience of building models this way differed from traditional approaches — you mentioned scikit-learn — how are they different? I'm particularly curious about the different skill level involved, but also any differences in how you manage the models, or how you productize them, those kinds of things.

Naveed Ahmad: That's a good question. One thing is that getting started with BQML is really fast — I actually built the first prototype within a day; I had a very basic churn model up and running. I've done this before, in my previous job, using scikit-learn, and the difficulty really was aggregating data from different sources — in my previous job, getting data from an Oracle database and a Hadoop database. Things like that took most of the time. Typically the machine learning technology piece is the easier part, and more of the time is spent on getting the data and then figuring out the right features to use in your machine learning model. Using BigQuery ML, anybody who knows SQL can get used to the syntax within, I think, half a day. The tricky part for someone new is learning some of the machine learning concepts — for example, if you train a machine learning model, how do you measure that it succeeded? Basic concepts like precision, recall, accuracy, and AUC, area under the curve. That's where more of the time is spent: figuring out the features and then measuring the results. And I feel that doesn't differ between scikit-learn and BigQuery ML — the feature engineering and the measurement. But getting used to BigQuery ML is very quick, you don't have to use any other language, and the cycle I was talking about — data export, train model, data import — gets really reduced. So you can do a lot more experiments with this technology.

Sam Charrington: And is that because they're presumably scaling out the training behind the scenes and it's just a lot faster? Or are there other aspects to reducing that cycle time?

Naveed Ahmad: One is that it's a lot faster because of the BigQuery technology. The second is what I mentioned before: exporting data for a machine learning model outside the system, training it, and fetching the results back — that gets significantly reduced. And the third, in our case, is that since our data was all sitting in one place, we didn't have to incur the ETL cost of getting data from different sources. Everything was sitting in BigQuery.

Sam Charrington: What are some of the other types of machine learning projects you've done at Hearst?

Naveed Ahmad: Another one was an application of natural language processing. There's a case study published back in November — if you just google "Hearst" and "Google," that will probably be the first link. We've applied Google's natural language processing to all our content, so each article is tagged. We're using two features of the natural language service. One is classification of content — putting it into broad categories, like whether the content is about food and wine, or about real estate.
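To make this concrete, a category rollup over NLP-enriched content might look like the following sketch — the table and column names are invented for illustration:

```sql
-- Which NLP-assigned categories do we publish most?
SELECT nlp_category, COUNT(*) AS articles
FROM `warehouse.cms_content`
GROUP BY nlp_category
ORDER BY articles DESC
LIMIT 20;
```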
Then a more detailed version of it tags the entities in the content. The entities can be proper nouns or common nouns, with metadata — how salient they are to that article, and, for the popular entities, their Wikipedia link. We did that for all our content from all our websites, and it's stored in our CMS as well as replicated into BigQuery for further analysis.

Some of the use cases for this NLP data: first, we build business intelligence reports on it. For example, if you want to see how a particular personality is trending over time, then using the Google NLP data, Google Analytics data, and our content data in BigQuery, we can build those reports. Also, if we want to see which content categories get more traffic — which categories should we be focusing on as publishers — we build all sorts of reports using these three datasets.

The other use case, which is more interesting, is that we've integrated this with DoubleClick for Publishers, so whenever an ad is rendered we also pass a key-value pair with the category of the content the ad is being displayed on. Over a month this builds a database where we can say: show me all the ads that were displayed on, for example, Olympics content, or ads on our food content. So if a new advertiser comes on board and says, "I want to run a campaign on, let's say, basketball or Olympics content," we've built this capability using our tagging technology. They can just specify a criterion, and DoubleClick for Publishers will run that campaign on all the content related to, let's say, tennis or basketball. That's another use case for natural language processing.
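As an illustrative sketch of the entity-trending report described here — all table, column, entity names, and thresholds are hypothetical:

```sql
-- Weekly pageviews for articles tagged with a given entity,
-- joining NLP entity tags to analytics data in BigQuery.
SELECT
  DATE_TRUNC(a.visit_date, WEEK) AS week,
  SUM(a.pageviews) AS pageviews
FROM `warehouse.content_entities` e
JOIN `warehouse.analytics_daily` a
  ON a.article_id = e.article_id
WHERE e.entity_name = 'Some Personality'
  AND e.salience > 0.1   -- keep only entities central to the article
GROUP BY week
ORDER BY week;
```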
Sam Charrington: I know you've done some work with Google's AutoML for NLP, which allows you to fine-tune some of their models. But the classification — that was the off-the-shelf NLP service, and if I'm not mistaken, you're not able to train that one using your own data. Is that right?

Naveed Ahmad: That's a good question. AutoML for NLP, which is another tool released recently, works on top of the classification. They give you a taxonomy of about 700 categories by default, but let's say somebody wants to train their own category, for some unique or novel content — they can use AutoML for natural language to train their own custom categories. And that works on top of the existing categories, so if you make a web service call to AutoML, it returns not only the default categories but the trained version as well.

Sam Charrington: And you found that for the types of articles you were initially trying to categorize, were the 700 built-in categories sufficient? What was the experience of trying to map your business to this pre-canned AI service?

Naveed Ahmad: Most of the time it works very well. For most of our use cases those categories work fine. There are a few categories that don't work well, especially around sensitive content, which they don't really split out — any article about crime or violence gets categorized into "sensitive content." We would ideally have liked that to be split into something more granular, but Google has some restrictions; they just don't want to split that out. But one of the categories we wanted, and that I did a POC on, was detecting evergreen content, because we had labeled several thousand articles with an evergreen label. Evergreen means content with a shelf life of more than a few days or a few weeks — for example, an article reviewing a museum, or an article about real estate. Those articles have a longer shelf life, and it's very important for a newspaper to be able to detect that type of content, because it can be used in recommendations. Even an article written two years ago can be reused, which otherwise would just sit there with nobody reading it. So using AutoML for text, I built a classifier that detects evergreen versus non-evergreen content. Our editorial teams helped us label this content — across all the different markets, they took their content and labeled it with the evergreen label. Initially I did a very quick prototype using TensorFlow, and then I had a hunch that Google was working on this AutoML feature for NLP. Once the feature came out, doing a POC was really quick — within a day I was able to train an AutoML model that can differentiate evergreen content. Really, the hard part is getting the labeled data. And the interesting thing is, once this model was trained, I even tried it on CNN and New York Times content, and it was able to differentiate evergreen and non-evergreen content there too. Right now we're incorporating it into one of our recommendation systems.

Sam Charrington: One of the things I noticed a couple of times in our conversation — you're the senior director of data engineering and ML, but it sounds like for a couple of these POCs, you just kind of played around and put some stuff together. You make it sound really easy. It can be easy to experiment, but if you actually want to use this in a way the business is going to depend on, there's, at least traditionally, a lot more that needs to go into it in terms of engineering. There are maybe several questions in here, but part of it is: does the cloud change that dynamic, in your perspective? Or are these projects you built and then kind of threw over the wall to some team that then had to maintain them?

Naveed Ahmad: Is your question especially about AutoML, or about recommendations and the other stuff I've
been talking about?

Sam Charrington: Specifically, I think, churn prediction and AutoML — the impression I got was that it almost sounded like you got bored one weekend and came up with these models in a day or so.

Naveed Ahmad: They weren't done in a day; they happened in phases. The churn model — we started when BigQuery ML was in an alpha state, only available to certain customers. I think it was about May when we started building it, and it took a couple of months of refining and fine-tuning. Now we're working on another market, building out this prediction for Houston. The AutoML work was also in phases. We started last year: we said we need to collect the data, since we want to be able to do this. So one mini-project was figuring out how to get the data and get it labeled. Once we had the labels, they sat there for a while until I did some prototyping work using TensorFlow, and then when AutoML came out — since we'd already done the hard work of labeling the data — it was something very quick to do. So one piece is that the project wasn't done in a weekend; there were different phases, and they happened at different times. The other thing is that, in general, the newer features in cloud, especially the higher-level APIs, make it much quicker to prototype and get things out. AutoML is a feature meant for letting people do things much more quickly — you don't have to learn TensorFlow to be able to do it. You just upload a labeled dataset, it takes a few hours to train a model, and you can start using it.

Sam Charrington: You've done some work on recommendation systems as well.

Naveed Ahmad: Yes. As I was saying, one of the use cases for the NLP data sitting in our BigQuery data warehouse is a recommendation system. The idea is that if two pieces of content have overlapping NLP entities, they're highly likely to be related to each other. This is content-to-content recommendation, and since we had this data in our BigQuery database, we built a recommendation system that works using a big SQL query: it takes entities from one set of articles, matches them up with another, and incorporates the saliency score. There's a whole bunch of rules we built out in our SQL, and it produces a table of recommendations that's fronted by a web service layer. I think three of our websites are using this content-to-content recommendation, and it was released as part of the San Francisco Chronicle's user interface rewrite — a few months ago we released a revamp of SF Chronicle, and this content-to-content recommendation was part of that deployment.

Sam Charrington: So you've got this content in BigQuery, you're using Google NLP to essentially add fields for each of the entities that are recognized, and then you're using BigQuery ML to generate recommendations from a given piece of content to other recommended pieces of content, based on these shared entities that have been recognized by the NLP service.

Naveed Ahmad: Yes — with just a minor correction: this one doesn't use BQML. It's plain SQL that basically looks at the number of overlapping entities between two pieces of content.

Sam Charrington: Got it. So it's basically scoring, based on the overlap, how relevant one piece of content is to another. And you said you're fronting that with a website — are you serving the data directly out of BigQuery, or do you push it out to some cache or database?

Naveed Ahmad: We push it out. The computation is done on a periodic basis, about every 15 minutes, which is about the frequency at which we're getting new content. We build out a table in a Postgres database, and the web service layer uses memcache as well as that Postgres database to serve the recommendations.
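A minimal sketch of the kind of entity-overlap scoring query described here — table and column names are hypothetical, and the real system layers many more rules on top of the overlap count:

```sql
-- For each article, find related articles by matching shared NLP entities,
-- weighting each match by the entities' saliency scores on both sides.
CREATE OR REPLACE TABLE `warehouse.content_recommendations` AS
SELECT
  a.article_id,
  b.article_id AS recommended_article_id,
  COUNT(*) AS shared_entities,
  SUM(a.salience * b.salience) AS relevance_score
FROM `warehouse.content_entities` a
JOIN `warehouse.content_entities` b
  ON a.entity_name = b.entity_name
 AND a.article_id != b.article_id
GROUP BY a.article_id, b.article_id;
```

Recomputing a table like this on a schedule, then copying it into Postgres, matches the 15-minute serving setup described below it.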
Sam Charrington: You've also done some video content recommendations.

Naveed Ahmad: Yes, that's another project we've done: we've taken our video, converted it to sound, extracted text from that sound, and then applied Google's NLP to extract entities from it. So our articles have the NLP tags, and our videos — via voice transcription — have text from which we also extract entities, and we can recommend videos alongside content based on this technology.

Sam Charrington: And to transcribe the audio, did you use the Google Speech-to-Text API for that?

Naveed Ahmad: Yes — Google has a speech API that, given any sound, can convert it to text.

Sam Charrington: And what was your experience getting reliable results from that?

Naveed Ahmad: Our full objective was to also have captioning on the videos. Most of the time the sound transcription works very well, but there were a few cases where it didn't work well, so we haven't used it for captioning. But we took this text, and it seems to work well when we apply NLP on it — if we just want to extract categories and tags, for that purpose it works well. Those tags are used for eventually being able to recommend textual content along with the videos.

Sam Charrington: Any final thoughts, or words of wisdom that you shared as you were wrapping up your presentation on these use cases?

Naveed Ahmad: Some of the advice I gave: one of the things we want to do in the future is apply deep learning to recommendations. There's been a lot of research in the past couple of years on the application of deep learning to recommendation, and we're actually doing some research and want to launch very soon with a recommendation system that uses deep learning. So I recommended that everybody look into deep learning and TensorFlow. But really, you don't have to build everything from scratch. If a problem is already solved by higher-level APIs, such as the NLP, image, and video APIs, or BQML and AutoML, we should be
using that. It's really a trade-off: do you want to hire data scientists to rebuild a thing, or do you want to just pay for that service and apply it? Only in cases where the problem is novel, or the dataset is novel, do you want to build your in-house machine learning model — for example, we want to be able to do recommendations using deep learning, and that's something we're building in-house. Other things we're exploring are more use cases for NLP — we've identified that we can use it with our newsletters, to be able to generate newsletters automatically; that's something we're looking at too. And also more uses for BigQuery ML: just like churn, we can do the converse and do propensity modeling — what's the likelihood of a user subscribing on our websites? So those are, overall, our future plans.

Sam Charrington: Awesome. Well, Naveed, thanks so much for taking the time to share what you've been up to with us. It's been really interesting, and I appreciate it.

Naveed Ahmad: OK, thank you very much.

Sam Charrington: All right everyone, that's our show for today. For more information on Naveed or any of the topics covered in this episode, visit twimlai.com/talk/182. If you're a fan of the podcast, we'd like to encourage you to visit your Apple or Google podcast app and leave us a 5-star rating and review. Your reviews help inspire us to create more and better content, and they help new listeners find the show. As always, thanks so much for listening, and catch you next time.