**The Evolution of Text Analysis: An Update on DeText**
DeText, LinkedIn's open-source deep learning framework for natural language processing, takes an end-to-end approach to text understanding, producing models for ranking, classification, and language generation. In this update, we look at how the framework is used in practice and where its development is headed.
**Practical Applications and Deployment Models**
From an engineer's perspective, DeText supports two deployment models. It can be incorporated as a library into an existing offline training stack, added as a workflow step, which makes integration straightforward for teams that do not need extensive customization. It can also be used interactively, for example in a Jupyter notebook, for early offline analysis and quick exploration of results.
In either mode, users organize their training data into the format DeText requires and configure the model through its parameters. Starter parameters provide defaults that work for most use cases, while advanced parameters let machine learning experts tune performance for more specialized applications, as the sketch below illustrates.
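As a concrete illustration of the starter-versus-advanced split, here is a minimal Python sketch. All names here (`TextModelConfig`, `train_model`, the parameter names) are hypothetical stand-ins for illustration, not DeText's actual API.

```python
from dataclasses import dataclass

@dataclass
class TextModelConfig:
    # "Starter" parameters: defaults intended to cover most use cases.
    encoder: str = "bert"        # or "cnn", "lstm"
    task: str = "ranking"        # or "classification"
    # "Advanced" parameters: knobs an ML expert might tune.
    learning_rate: float = 1e-4
    batch_size: int = 256
    dropout: float = 0.1

def train_model(train_data_path: str, config: TextModelConfig):
    """Placeholder entry point: load data in the required format and train."""
    raise NotImplementedError  # stand-in for the real training loop

# Quick exploration (e.g., in a notebook): accept the starter defaults.
starter = TextModelConfig()

# Expert workflow: override advanced parameters explicitly.
tuned = TextModelConfig(learning_rate=3e-5, batch_size=512)
```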
**Performance Differences and Customization**
Offline comparisons among encoder choices, for example LinkedIn's in-domain LiBERT versus externally pretrained BERT variants, do show performance differences. Drawing conclusions about which model is better is hard, however, because the models are pretrained on different corpora. DeText therefore lets users customize models further and compare different variants side by side.
**Future Developments and Research Directions**
Looking ahead, the team is developing advanced gradient-based optimizers in the gradient descent and Adam family, for example layer-wise methods such as LAMB, to make training less sensitive to settings like batch size and learning rate. A separate, earlier-stage direction applies common hyperparameter-tuning knowledge across domains, extracting patterns from past tuning so that future workflows can be tuned more efficiently.
The optimizer work is currently research oriented: basic research aimed at next-generation optimization techniques, with a longer-term goal of specialized optimizers for the particular challenges posed by different domains and applications.
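For reference, the Adam update rule named above can be sketched in a few lines of NumPy. This is the generic textbook update, not DeText's optimizer code.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first and second moments of the gradient."""
    m = beta1 * m + (1 - beta1) * grad          # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMS scale (second moment)
    m_hat = m / (1 - beta1 ** t)                # bias correction, t >= 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```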
**Conclusion**
The continued evolution of DeText highlights the importance of ongoing research and development in natural language processing and machine learning. Exploring new approaches to text analysis, from compact domain-specific pretraining to end-to-end ranking, can unlock more efficient and more accurate models. It will be exciting to see how these advances shape the future of text analysis and its real-world applications.
"WEBVTTKind: captionsLanguage: enall right everyone i am here with huiji gao huiji is a senior engineering manager for mln ai at linkedin fuigi welcome to the choma ai podcast thank you it's my pleasure to be here hey i'm really looking forward to digging into our conversation we're going to talk quite a bit about nlp and some of the software and models that your team is building but before we do that i'd love to have you jump in and share a little bit about your background and how you came to work in ml yeah thank you i have always been interested in this machine learning and the applications back to my college i remember kind of like 15 years ago i was in the uh studying the information retrieval and i was using ci for the conditional random field model to develop name the entity coordination then in my phd study i was in this machine learning and the data mining lab starting the recommender systems so in my major research topic at that time is how to develop advanced recommender system with spatial tempo social and accounting information and in addition to several papers and also a book called mining human mobility in location based social networks we also developed several algorithms to help do better recommendation for the poi for locations during my study i had an internship at linkedin and that time i was working on how to model users interests in companies to help job recommendations and after that after my graduation i joined the linkedin as a full time that is about six years ago i started yeah i started my journey in ads in the s domain with as ctr prediction and targeting problems i was quite amused by amazed by the linkedin's rich data set we have a large amount of members with their profiles and members they connect to other members they also consume content and they also uh say search or looking at learning courses while different members they have different lawyers for example a member could be a job seeker looking for jobs could it be a recruiter looking for good candidates and it could be a salesman and the member could also be a contender generator or consumer so um with all that all that say comprehensive data set at linkedin that was also my first time to get into this large scale distribution training area to learn how to train a model with distributed training and how to process the data um about two years after i stayed in the ads domain i moved to search problems so that time i was developing ranking algorithms for uh people recommendation appeal for people search so the product is is called people search um about three years ago i started to manage a team called the ai algorithm foundation team and this team is to develop advanced ai technologies on national language processing ranking and personalization to power all the linkedin search and recommender systems in addition to the product impact we have so far we have also open sourced a set of technologies one of the most recently released open source packages called dtext det xt it is a intelligent text understanding framework for people to for our engineers to easily generate models for ranking classification and language generation so it has also uh received several stars on github so far we have about 1 000 stars we also have published several papers on the related work and one of them gets the best paper award last year in cikm nice nice uh so let's maybe dig into dtext and start with what are the main motivations for the tool what were you trying to accomplish when you started down that road yes um i think 
**Huiji:** We started thinking about this idea about two and a half years ago. As you know, at LinkedIn we have rich data sets in all kinds of search and recommender systems, organized around the entities users are interested in. For people, we have people search and people recommendation, called PYMK (People You May Know); for jobs, job search and job recommendation, called JYMBII (Jobs You May Be Interested In); for content, content search and content recommendation, like the feed and ads. We also have other entities, like schools and groups. All of this information is largely represented as text data, and on the member side we also have profiles, interests, and past behavior from interacting with that text. Understanding this data is crucial for improving all of LinkedIn's search and recommender systems. At the time, the majority of the systems leveraged NLP in the form of keyword matching or skip-gram embeddings, which don't consider much of the contextual information among words when generating embeddings. And there is a large number of applications in search and recommendation: ranking, user intent classification, sentiment analysis, spam detection, autocomplete, machine translation. All of these can get better performance from NLP. So we thought about building an end-to-end framework targeting two goals: first, provide better semantic understanding with advanced embedding technology; second, provide a framework that can leverage this embedding technology for all kinds of applications. That is DeText: an offline framework that helps our engineers train models with these NLP technologies for ranking, classification, and language generation.

**Sam:** Got it. What was the landscape for using embeddings for these kinds of ranking problems at LinkedIn before DeText?

**Huiji:** Before DeText, we started with keyword matching. For example, if you search for "huiji software engineer linkedin," we look for documents containing the keywords "huiji," "software engineer," or "linkedin." But sometimes there are synonyms: "software engineer" and "software developer" have similar meanings, and if we rely on keyword matching, the job posting that says "software developer" will not be returned. That was the early stage. Then we started looking at embedding technologies, for example skip-gram, which was one of the state-of-the-art approaches. The shortcoming of that kind of technology is that it cannot capture the context among words very well, like the context between "software engineer" and "linkedin," or between "huiji" and "software engineer." A lot of new techniques came out in the past few years, like CNN-based and LSTM-based embeddings, and about two years ago the most promising technology was BERT, which enhances contextual modeling between any two words in an input sentence and generates a better embedding to represent that sentence. That was our target when we developed DeText, and it pretty much captures our journey in text modeling.
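To make the keyword-matching limitation concrete, here is a toy Python sketch contrasting exact token match with embedding similarity. The vectors are hand-picked stand-ins, not the output of any real model.

```python
import numpy as np

def keyword_match(query: str, doc: str) -> bool:
    """Exact-token retrieval: every query token must appear in the document."""
    doc_tokens = set(doc.lower().split())
    return all(tok in doc_tokens for tok in query.lower().split())

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for trained embeddings (skip-gram, CNN/LSTM, BERT).
emb = {
    "software engineer":  np.array([0.90, 0.10, 0.40]),
    "software developer": np.array([0.85, 0.15, 0.38]),  # near-synonym: close vector
}

print(keyword_match("software engineer", "senior software developer"))  # False
print(cosine(emb["software engineer"], emb["software developer"]))      # ~0.998
```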
**Sam:** So is DeText based on BERT? What's the underlying model, and the relationship between the framework and the model?

**Huiji:** DeText is based on BERT, and it also extends to support CNNs and LSTMs. There are roughly two ways to use BERT embeddings (or other deep embeddings). One is to extract the embedding as a first step and then use it as features in your current ranking framework, which could be a tree model or a linear model like logistic regression. That kind of two-stage modeling may lose information, because the embedding is generated by a different algorithm, without the context of the other information used in the ranking model. DeText instead builds ranking and classification models through an end-to-end process: we start from the raw text data, get the embedding with BERT or CNN/LSTM, combine it with other existing handcrafted features, and everything shares the same loss function, with all parameters updated simultaneously.

**Sam:** Talk a little about the broader context for using DeText. You mentioned you did some work on distributed systems earlier at LinkedIn. Are you using it in that kind of distributed context, or is it a standalone system?

**Huiji:** It is in a distributed context, and to ensure good training efficiency we did a lot of work to power that. One challenge, using BERT as an example, is that because it models the contextual information among words, inference itself takes a lot of time. One thing we looked at is starting with a more flexible BERT structure. The original BERT released by Google has a 12-layer version and a 24-layer version, and there are gaps in using those models directly at LinkedIn. First, with that many layers, the inference and training time would be quite long, and some of it cannot meet our production requirements.

**Sam:** When you say it can't meet your production requirements, is that latency and inference time?

**Huiji:** Latency, and also offline training. When you want to retrain a model periodically, training has to meet certain requirements: if a model needs to be retrained every four hours but training itself takes 40 hours, that doesn't work. That's the first gap. The second gap is the model itself. The original model is trained on Wikipedia data, and at LinkedIn we have a very domain-specific taxonomy: skills, titles, users' experience, company names. Incorporating that information into the BERT model is a challenge. So in the end we decided to pretrain a BERT model using LinkedIn data; we call this model LiBERT. There are several benefits. One, as mentioned, is better semantic representation of LinkedIn data. Second, because we train from scratch, we can start with a more flexible structure, like three layers or six layers, starting with an embedding size of 256. That is indeed what we do; the one currently used in production is, I believe, a six-layer LiBERT model. This significantly shortens the time to pretrain the language model, and it also shortens the time for online inference.
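LiBERT itself is not shown in code in the interview; as an illustration of the compact from-scratch configuration it describes (six layers, embedding size 256), here is a sketch using the Hugging Face `transformers` library, which is an assumption of this writeup rather than LinkedIn's actual stack.

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30000,        # would be built from the in-domain (LinkedIn) corpus
    hidden_size=256,         # embedding size mentioned in the interview
    num_hidden_layers=6,     # production LiBERT reportedly uses six layers
    num_attention_heads=4,   # must divide hidden_size: 256 / 4 = 64 per head
    intermediate_size=1024,
)
model = BertForMaskedLM(config)  # randomly initialized, i.e., trained from scratch
print(sum(p.numel() for p in model.parameters()))  # far smaller than BERT-base
```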
**Sam:** It sounds like the general idea is that, because you've got the resources and a very significant data set, you're giving up the advantage of an externally pretrained BERT model with some fine-tuning, training a simplified model from scratch instead, and you found that gives you better results.

**Huiji:** That's right.

**Sam:** When you think about the attempts to get standard BERT to work versus LiBERT, how do you compare the two? What performance metrics did you use when evaluating the new approach?

**Huiji:** That relates to another challenge of using BERT in DeText. In the beginning, the model performance was not good compared to other state-of-the-art ranking models using CNNs or LSTMs, and the reason, we realized, was the way BERT fine-tuning was done for ranking tasks. As mentioned, we pretrain a BERT model with LinkedIn data; after pretraining, the typical way to use the model is to fine-tune it on a specific task, in this case ranking. The challenge at the time was that in the original Google BERT paper, the listed way to use the model for ranking is to concatenate the information from the member side and the document side together as one sentence. Imagine I'm a user searching for something: I have a query, and to use BERT we would need to concatenate this query with every document's information, send it to BERT to get a representation for the whole sentence, and then decide how relevant that specific document is. But this becomes a big challenge for online latency, because we are not able to do much pre-computation: only after we see a query can we start concatenating it with every document candidate and do the scoring, which usually takes beyond a second, or even tens of seconds. That is not acceptable in production.

**Sam:** And the reason this comes about is that you have to include the user, and you have so many users that you can't pre-compute, right?

**Huiji:** That's right. Because of this coupling with user-typed information, we could no longer pre-compute a lot of the information we used to pre-compute before using BERT. Of course, the reason for concatenating everything is that BERT is good at capturing contextual information among words, and concatenation gives it a good opportunity to capture that, but it makes it a challenge to get into production. So we tried a lot of different framework settings, and the approach we currently use in DeText is this: we split the information from the member side and the query side into different input fields, and we still send each field to BERT to get embeddings; however, we combine them only after we get the embeddings, generating similarities among the different fields based on the embeddings extracted from BERT. In this way we break the context early but model it again later, after we get the embeddings.
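A minimal sketch of that decoupling, with a deterministic toy encoder standing in for BERT: each field is embedded independently, and cross-field similarities (rather than a joint concatenated encoding) become features for the ranking layer.

```python
import numpy as np

def toy_encoder(text: str, dim: int = 8) -> np.ndarray:
    """Deterministic stand-in for a per-field BERT/CNN/LSTM encoder."""
    return np.random.RandomState(abs(hash(text)) % 2**32).rand(dim)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def interaction_features(query: str, doc_fields: dict) -> list:
    """Embed each side separately, then couple them via similarities."""
    q = toy_encoder(query)
    return [cosine(q, toy_encoder(text)) for text in doc_fields.values()]

feats = interaction_features(
    "software engineer",
    {"headline": "senior software developer", "company": "linkedin"},
)
# `feats` would feed the ranking layer alongside handcrafted features.
```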
**Huiji:** There was a risk, because it was uncertain whether this framework could still capture the contextual information BERT is supposed to capture. We ran several experiments and eventually found that, the way we use it in DeText, it indeed can.

**Sam:** Can you go through that again? What specifically are you doing differently from the standard approach of concatenating the inputs?

**Huiji:** In the standard approach, you concatenate the raw text and then get the embedding for ranking. In DeText, we get the embedding from each text field before any concatenation.

**Sam:** The search query and the document, say.

**Huiji:** Yes. Then we compute the similarities among these different embeddings, which learns their contextual coupling after the embeddings are produced, and those similarities become new features used later in the ranking model.

**Sam:** Does that allow you to do more pre-computation on the document side?

**Huiji:** Exactly. The embedding is no longer based on the concatenated sentence; it is based on each individual field. The query side is computed separately, and the member side is computed separately, so after the model is trained we can pre-compute the member embeddings and store them where they can be fetched online. The query side is short, so real-time inference with BERT is doable there. Online, when a user searches with a query, we run real-time inference for the query's BERT embedding, fetch the member embeddings from the pre-computed store, and then do the inference for ranking.

**Sam:** At the end of the day, what kind of performance properties does this have, compared to what you saw with the original BERT?

**Huiji:** Our metrics depend on the product application. For people search and job search we look at NDCG, a typical evaluation metric for search ranking. When we use DeText for feed and ads we look at CTR AUC, roughly how likely a user is to click on a specific piece of content given an impression. That's offline. Online we have different measures for different products; for search we have search sessions, revenue, and online CTR. One example I can share: when we used this model for people search, we observed a significant lift in the online search success rate.

**Sam:** I was also interested in inference performance in terms of time. How did that compare, before and after?

**Huiji:** Good point. I'm not able to provide that comparison, because the baseline model could not be shipped to production due to the latency concern; it simply wasn't doable.

**Sam:** So you didn't even get that far with the baseline model.

**Huiji:** Right. For the current model, the online latency is within 20 milliseconds, whereas, as you can imagine, the baseline would have been more like ten seconds or more.
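The serving split described above can be sketched as follows; the names and the dot-product scorer are illustrative stand-ins for the real ranking layer.

```python
import numpy as np

member_store = {}  # member_id -> embedding, populated by an offline job

def index_members(encode, members: dict) -> None:
    """Offline: precompute and store document/member embeddings."""
    for member_id, profile_text in members.items():
        member_store[member_id] = encode(profile_text)

def serve_query(encode, query: str, candidate_ids: list) -> list:
    """Online: one real-time query encoding, then cheap lookups and scoring."""
    q = encode(query)  # the only real-time model inference
    scored = [(cid, float(q @ member_store[cid])) for cid in candidate_ids]
    return sorted(scored, key=lambda s: s[1], reverse=True)

toy = lambda text: np.array([len(text), text.count("engineer")], dtype=float)
index_members(toy, {1: "software engineer at linkedin", 2: "accountant"})
print(serve_query(toy, "senior engineer", [1, 2]))
```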
**Sam:** Are there other ways you have optimized it, besides the high-level architecture?

**Huiji:** Yes, we optimized the production stack. Even with the DeText structure, with the user side and document side decoupled, we still encounter challenges in online scoring because of the large number of candidates: some search systems have hundreds of thousands of candidates to rank in real time. So we set up several ranking stages. We have a first-pass ranker that uses a very simple model to quickly filter out the clearly irrelevant documents, and then we use DeText to do more fine-grained ranking on that smaller set of candidates.

**Sam:** Is the pre-filtering model also a learned model?

**Huiji:** It is a learned model. There are several options: for example a tree model with a simple structure, or a logistic regression model, with a very small feature set, a few tens to hundreds of features. The purpose at this stage is not better ranking; it is mainly to filter out irrelevant documents so the next ranking stage with DeText can run without latency concerns.

**Sam:** What do the features look like for that pre-model?

**Huiji:** They could be, for example, a user's region; the match between the query and the documents in terms of keywords; some contextual information, like whether the user is searching at a certain time (day of the week, hour of the day); and the user's social connections. Those are some examples.

**Sam:** The first one is very intuitive: if the user is issuing a query in English, you might want to limit the search to English, and it sounds like this pre-filter is one of the mechanisms you might use to do that.

**Huiji:** Exactly.
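A minimal sketch of that two-stage cascade; the scoring callables are placeholders for, respectively, the small tree or logistic-regression first-pass model over cheap features and the DeText model.

```python
def two_pass_rank(candidates, cheap_score, expensive_score, keep=1000):
    """First pass prunes a huge candidate set with a cheap model; the
    expensive model reranks only the survivors, keeping latency bounded."""
    survivors = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    return sorted(survivors, key=expensive_score, reverse=True)

# Hypothetical usage with stand-in scorers:
# two_pass_rank(docs, cheap_score=lr_model.score, expensive_score=detext.score)
```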
**Sam:** In past conversations with folks at LinkedIn, graph models come up a lot in how you build machine learning on the user graph and content. In what ways do these models, DeText and LiBERT, interoperate with or get used in a graph context?

**Huiji:** We are still exploring that direction. DeText is not using graph embedding technology, but we do see the motivation for using graph embeddings together with text information to generate better embeddings. For example, we can leverage users' interactions: their actions on content (click, like, share) and their social connections to other people. Those are the link information in the graph. There is also node information, node properties: say, the text of a job post a user clicked on. If, in addition to the link information used for graph neural network modeling, you also bring the node information, the text from members and job posts, into a joint model, you can imagine generating much more comprehensive embeddings from LinkedIn's huge Economic Graph. One benefit is that we might eventually eliminate certain filters. For example, on the retrieval side we may currently filter by social connections first and then rank within that specific range of connections; with graph neural embeddings that incorporate text information, the representations may be good enough that we no longer need those up-front filters. These ideas are still at an early discussion stage, but we do have plans to improve DeText in that direction.

**Sam:** Along the same lines, is DeText tightly coupled to BERT, or could you swap in another type of language model to replace it?

**Huiji:** You can. You can switch the BERT model to other models, like a CNN, LiBERT, or an LSTM. And the BERT component itself is quite compatible with different kinds of BERT models: Google's BERT, Microsoft's variant, or the RoBERTa model developed by Facebook. These are all options offered to clients.

**Sam:** With clients using those different models, is there any intuition you can share about when they tend to use one versus another, and how they compare for your use cases?

**Huiji:** So far all of that exploration is more on the offline side. For production, because of our product constraints, LiBERT is still the first choice, due to its flexible structure: three layers, six layers. Performance-wise, a fair comparison is very hard, because LiBERT is trained on LinkedIn data and the other BERT models are trained on other data. We do observe some offline performance differences, but under this kind of setup it is often hard to conclude which model is better.

**Sam:** Are there other models you hope to customize the way you have with BERT and LiBERT?

**Huiji:** Yes. Because these models are open source, we have the opportunity to customize them further, compare them, and offer them to our other LinkedIn clients, as long as we have enough engineering resources to support that.
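One way to picture that pluggable-encoder design is a simple registry; this is a hypothetical illustration, not DeText's real interface, and the encoders are deterministic stand-ins for trained models.

```python
from typing import Callable, Dict
import numpy as np

Encoder = Callable[[str], np.ndarray]

def _toy(dim: int) -> Encoder:
    """Deterministic stand-in for a trained text encoder of width `dim`."""
    return lambda text: np.random.RandomState(abs(hash(text)) % 2**32).rand(dim)

ENCODERS: Dict[str, Encoder] = {
    "cnn": _toy(128),
    "lstm": _toy(128),
    "bert": _toy(768),      # e.g., BERT-base hidden width
    "libert": _toy(256),    # compact in-domain model from the interview
}

def build_encoder(kind: str) -> Encoder:
    if kind not in ENCODERS:
        raise ValueError(f"unknown encoder: {kind!r}")
    return ENCODERS[kind]
```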
**Sam:** Talk a little about the practical use of this from an engineer's perspective. Is it a library they include in their notebooks? Is it an online system they query? What's the deployment model?

**Huiji:** There are two ways to use it. One, as you mentioned, is in a Jupyter notebook, which suits early offline analysis and quick exploration of results: users organize their training data into the format required by DeText. DeText has a lot of parameters, but we provide starter parameters, defaults we have already set up for most use cases, as well as advanced parameters that machine learning experts can use to tune performance further. The other way is in our offline training stack: DeText can be used as a library that you incorporate into your current stack, added as a workflow with this parameter setup; you run it and get the model.

**Sam:** Can you give us a sense of how often users stick with the off-the-shelf basic parameters versus the other end of the spectrum, folks who tweak the advanced parameters or fine-tune heavily? With these types of models and your use cases, does everybody have to fine-tune quite a bit to get acceptable performance, or is it good enough off the shelf for a lot of people?

**Huiji:** That's a very good question. First of all, fine-tuning itself takes some effort; working with these parameters requires some expertise from our engineers. But it also depends on the application. Some of the search applications share similar tuning patterns for these parameters, while other applications, like ads and feed, have very large-scale data and very sparse features, so tuning can take some time. That's why one of our current efforts is to develop an advanced optimizer for DeText, which can also be used generically by other deep learning algorithms. This optimizer tries to help our AI engineers find better settings for key parameters, say batch size, learning rate, and the number of iterations for model selection, and to make the model less sensitive to those parameters from the user's perspective. That work is ongoing, and we are seeing quite promising results. One example: we are exploring the LAMB layer-wise optimizer.

**Sam:** What can you say about the general approach to optimization? Is it along the lines of Bayesian optimization, or more like heuristics: if your problem looks like this, then your parameters should look like that?

**Huiji:** The optimizer I'm referring to here is in the family of gradient descent and Adam.

**Sam:** Got it, so not specifically hyperparameter optimization.

**Huiji:** Right. That is another direction we are looking at: for example, how to use common knowledge about hyperparameter tuning from different domains to extract patterns, so we can more efficiently tune a new workflow in a different domain. That hyperparameter work is also at an early stage of exploration.

**Sam:** At the lower level, do you think of the optimizer work as basic research toward a next-generation optimizer, or is it more that something structural about the types of problems you're solving creates an opportunity to specialize the optimizer for those problems?

**Huiji:** Currently it is more research oriented.

**Sam:** Okay, awesome. Well, Huiji, thanks so much for taking the time to share a bit about what you are up to.

**Huiji:** Thank you, Sam. It's a great pleasure to be here.
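The interview mentions exploring a layer-wise optimizer; assuming that refers to LAMB (layer-wise adaptive moments), here is a minimal NumPy sketch of one LAMB-style step for a single layer. This is a generic illustration, not LinkedIn's implementation.

```python
import numpy as np

def lamb_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update: an Adam update rescaled, per layer, by the
    trust ratio ||w|| / ||update||."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust * update, m, v
```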