#44 Project Jupyter and Interactive Computing (with Brian Granger)

The Importance of JupyterLab and Open Source Contributions in Data Science

Brian, my guest on today's podcast, is a physicist and a co-founder and co-lead of Project Jupyter, the open-source project behind JupyterLab and the Jupyter Notebook. Brian shared his expertise on how to use JupyterLab and its benefits over the classic Jupyter Notebook. He also highlighted the importance of engaging with open-source projects and contributing to their growth.

For individuals interested in using JupyterLab, it's essential to understand its capabilities and limitations. Brian emphasized that JupyterLab is a more powerful tool than the classic Notebook, offering features such as advanced visualization, data manipulation, and collaboration tools. He also noted that JupyterLab is designed to be user-friendly, making it accessible to users of all skill levels.

Beyond tooling, the conversation covered how data science informs decision-making in organizations. Brian discussed how data science can support robust decisions, but also highlighted the challenges many organizations face when trying to integrate data science into their workflows. He emphasized the importance of verifying results and ensuring that data is split correctly to avoid overfitting.

In addition to its technical capabilities, JupyterLab has a strong focus on community building and collaboration. Brian noted that open-source projects like JupyterLab rely heavily on contributions from users and developers, and he encouraged listeners to engage with these communities, provide feedback, and contribute to the projects' growth. By doing so, individuals can help shape the future of data science tools and ensure that they continue to evolve and improve.

For organizations considering JupyterLab or other open-source data tools, Brian offered some valuable advice. He emphasized the importance of understanding the long-term sustainability of these projects and of supporting core contributors: providing financial backing, offering resources and expertise, and advocating for the project's growth within the organization. By taking these steps, organizations can help ensure that their data science tools continue to evolve and improve over time.

In conclusion, JupyterLab is a powerful tool that offers many benefits for data scientists and organizations. Its advanced visualization capabilities, collaboration tools, and support for exploratory, decision-oriented workflows make it an attractive option for those looking to enhance their workflows. By engaging with open-source projects like JupyterLab and contributing to their growth, individuals can help shape the future of data science tools.

Furthermore, Brian's advice on supporting core contributors and promoting long-term sustainability is crucial for organizations adopting these tools. By understanding the challenges open-source projects face and taking steps to address them, organizations can help keep their data science tools robust and effective over time. As Brian noted, many open-source projects struggle with financial sustainability, and it's essential to find ways to support these efforts.

Finally, for those looking to switch from the classic Notebook to JupyterLab, Brian encouraged making the transition. JupyterLab offers many benefits over the classic Notebook, including advanced visualization capabilities, collaboration tools, and a more flexible, user-friendly interface.

Cassie Kozyrkov: Data Science, Decision-Making, and Decision Intelligence

In our next episode, we'll be talking to Cassie Kozyrkov, Chief Decision Scientist at Google Cloud. Cassie will share her insights on data science decision-making and decision intelligence. She'll discuss the different models for integrating data science into decision-making processes and weigh in on the pros and cons of each approach.

Cassie will also emphasize the importance of verifying results and ensuring that data is split correctly to avoid overfitting. This is a critical aspect of data science, as it can have a significant impact on the accuracy and effectiveness of predictions.
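To make the splitting point concrete, here is a minimal sketch of a held-out test split on synthetic data. This is an illustration only: it assumes NumPy and scikit-learn are installed, and the dataset is invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 samples, 3 features, and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Hold out 20% of the data *before* any model fitting or tuning.
# stratify=y preserves the class balance in both splits; the held-out
# set should only be touched once, for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

Evaluating only on the held-out split, after all modeling decisions are made, is what guards against the kind of overfitting Cassie warns about.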

Beyond the technical side, Cassie will highlight the social and managerial aspects of decision intelligence. She'll emphasize the importance of considering the broader context in which decisions are being made and ensuring that data-driven insights are integrated into the decision-making process.

Cassie will also discuss best practices for working with data, including how to avoid overfitting, how to handle missing data, and how to communicate complex results effectively.
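As one hedged illustration of the missing-data point, here is a tiny pandas sketch contrasting two common strategies. The DataFrame is invented for the example, and neither strategy is universally correct; the right choice depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Toy table with one missing value per column.
df = pd.DataFrame({
    "age":    [34.0, np.nan, 29.0, 41.0],
    "income": [72.0, 58.0, np.nan, 90.0],
})

# Strategy 1: drop incomplete rows -- simple and honest, but discards data.
complete = df.dropna()

# Strategy 2: impute with per-column medians -- keeps every row, but can
# bias estimates if values are not missing at random.
imputed = df.fillna(df.median(numeric_only=True))
```

In practice the imputation choice is itself a modeling decision, and (echoing the splitting advice above) it should be fit on the training data only.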

Overall, our conversation with Cassie Kozyrkov promises to be informative and insightful. We'll explore the intersection of data science, decision-making, and decision intelligence, and examine the ways in which these fields can be used to drive better outcomes in organizations.

The Importance of Sustainability in Open Source Projects

As we continue to discuss the importance of open-source projects like JupyterLab, it's essential to touch on the issue of sustainability. Brian noted that many open-source projects struggle with financial sustainability, and this can have a significant impact on their ability to evolve and improve over time.

For organizations considering open-source tools, it's essential to think about how to support these efforts, whether through financial backing, contributed time and expertise, or internal advocacy for the project's growth.

By taking these steps, organizations can help ensure that their data science tools continue to evolve and improve over time. They can also help promote long-term sustainability and create a more robust ecosystem of open-source projects.

In conclusion, our conversation with Brian highlights the importance of Project Jupyter and JupyterLab and their role in advancing the field of data science. By engaging with open-source projects like JupyterLab and contributing to their growth, individuals can help shape the future of data science tools and ensure that they continue to evolve and improve over time.

"WEBVTTKind: captionsLanguage: enin this episode of data framed a date account podcast I'll be speaking with Bryan Granger co-founder and co-lead of project Jupiter physicist and co-creator of the Altair packaged for statistical visualization in Python will speak about data science interactive computing open source software and project Jupiter with over 2.5 million public Jupiter notebooks on github alone project Jupiter is a force to be reckoned with what is interactive computing and why is it important for data science work what are all the moving parts of the Jupiter ecosystem from notebooks to Jupiter lab to Jupiter hub and binder and why are they so relevant as more and more institutions adopt open source software for interactive computing and data science from Netflix running around a hundred thousand Jupiter notebook batch jobs a day to Lagos Nobel prize-winning discovery of gravitational waves and publishing all their results reproducibly using notebooks project Jupiter is everywhere I'm Hugo Bern Anderson a data scientist the data camp and this is data frame welcome to data framed a weekly data cam podcast exploring what data science looks like on the ground for working data scientists and what problems are console I'm your host Hugo Bound Anderson you can follow me on Twitter that's you go back and data camp at data camp you can find all our episodes and show notes at data camp comm slash community slash podcast before we get started I just wanted to let you all know that we have something new for our podcast listeners this week a trial of data camp for business now what is data camp for business but basically all the content you get as an individual subscriber plus tons of amazing tools for you and your colleagues to learn data science skills together all you need to do is email sales at data count comm with the subject podcast and you can redeem a free two week trial of data camp for you and up to 25 of your colleagues that's sales at data camp comm 
with the subject podcast and you can redeem a free two week trial of data camp for you and up to 25 of your colleagues we look forward to hearing from you hi there Brian and welcome to data framed it's such a pleasure to have you on the show and we're here today to talk about project Jupiter about interactive computing and in fact you sent me a great slide deck today of yours that you've been giving recently and something we're going to be focusing in on is actually a slide that you have there and I'm just going to quote this before we get started you wrote we are entering an era where large complex organizations need to scale interactive computing with data to their entire organization in a manner that is collaborative secure and human centered now these are all touch points we're going to be speaking about during this conversation but before we get into all of this and before we get into a conversation about Jupiter Jupiter notebooks Jupiter lab and all of these things I'd like to know a bit about you so first maybe you could tell me a bit about what you're known for in the data community yeah problem so I'm a professor at Cal Poly for the last close to 15 years I've been involved in a number of open source projects in the scientific computing and data science space in the early years I was involved in Simpa which is a symbolic computer algebra library for python and then also ipython defacto interactive shell for python and then the more recent years was one of the cofounders of project Jupiter and in the last few years I've also co-founded altair with Jake Vander Plaats which is a statistical visualization library so the big theme though is open source tools for humans working with coded data and speaking of Altair you're actually currently working on an LTI cause for data camp right yes the you're being a little bit optimistic about your verb tense there it's been a little bit stalled with all the different activities we have going on in the Jupiter world but 
yeah, I think I'm around maybe two-thirds to three-quarters done with the DataCamp course for Altair.

Hugo: And so, as a project lead for Project Jupyter, I'm wondering what type of skills come into play there, because I know you have a very strong background — you're a physicist, you have a lot of data analytic skills — and a lot of design, engineering, and entrepreneurial skills presumably come into this role as well. So I'm just wondering what type of things you need in order to do this job.

Brian: Yeah, it certainly has evolved over the years. In the early days of IPython and Jupyter, we were spending most of our time doing software engineering. There was a very small amount of design work, UI/UX design work, and when it's only a handful of people, in principle there's organizational work and community work to be done, but it's at a very small scale and in the background relative to the software engineering. As Jupyter has grown, though, I would say the demand for more time and effort on the organizational and community side, as well as the design aspects of the project, has really increased. One of the challenges in working on open source is that projects like Jupyter or Altair tend to attract really top-notch developers and software engineers, and so that aspect of the project tends to be reasonably well staffed. That doesn't mean we all have as much time to put into the projects on the software engineering side as we would like. However, as these projects get big, there's nothing in particular that attracts top-notch UI/UX designers, for example, to Jupyter. That continues to be a challenge for us and other open-source projects: how do we build design into the process and figure out how to engage designers in the development of projects?

Hugo: So in terms of — I mean, you're speaking about a number of things here that include, you know, design, but also hiring, structuring an organization. I know that you think a lot about getting funding for the project, and you talk about community development, which you know, these are all things we think about at DataCamp a lot as well. So it sounds somewhat similar in several ways to running a company.

Brian: It probably is. I've never run a company, but when I talk to other people who are in different roles leading companies, there's a lot of overlap. Our business model doesn't involve selling things to people in the traditional sense, but most certainly we have customers, and our interaction with those customers is very similar to that of a company with paying customers, in that we exist in a very dynamic, fast-paced part of the economy. It's the type of thing where, if Jupyter were to relax and begin to coast, there are hundreds of other open-source projects and for-profit companies building products that would quickly put Jupyter in a position of becoming outdated. And so there's a lot of thinking and work that we do around looking ahead at our three-to-five-year roadmap — where we see data science, machine learning, and interactive computing going — and how we build the resources to tackle those ambitious things on those time frames while also building a sustainable community along the way.

Hugo: So how did you get interested in or involved in data science? As opposed to being a physics professor and researcher, how did you get into data science initially?

Brian: Yeah, that's a great question. So we began working on interactive computing as part of IPython. I was a classmate of Fernando Pérez back in grad school at the University of Colorado, and Fernando had created IPython in the early 2000s. At the time, interactive computing was really something that was done in the scientific realm, in academic research and education. I'm sure there were some companies at the time doing little bits of it here or there, but it wasn't pervasive like it is today. And so as we started to build IPython and Jupyter in the 2000s, initially we felt like we had what we imagined was a very grand vision: that everyone in the world of
academic research and scientific computing would be using Python and the tools that we and many other people were building. What we didn't see is that the whole world was about to discover data, and that really opened up a whole new audience and set of users for open-source data science and scientific computing tools that we never imagined. So honestly, my own journey is more that we were doing what we had always done in terms of scientific computing and then woke up to realize that we were right in the middle of the data science community that was forming, both on the academic research side and on the commercial industry side as well.

Hugo: I want to delve a bit more into the general nature of Project Jupyter in a minute, but before that I'd like to speak a bit more about interactive computing. So I'm going to quote Project Jupyter, which states that it exists "to develop open source software, open standards, and services for interactive computing across dozens of programming languages." I'm wondering what a general working definition of interactive computing is, and why it is important.

Brian: Yeah, this is a great question. I think in the history of computer science, interactive computing has not really been acknowledged as a topic worthy of study — something worth thinking about carefully and clarifying. It's something that we've been doing over the years, and really the Jupyter architecture is an expression of our thinking about interactive computing. I would say that the core idea of interactive computing is that there's a computer program running with a human in the loop. As the program runs, that human is both writing and running code on the fly, then looking at the output or result of running that code and making decisions about what code to write and run subsequently. And so there's this interactive mode of going back and forth between the human authorship of the code, the computer running it, and the human interacting with the result.

Hugo: And this is ideal for so many aspects of the scientific research process, right? From exploratory data analysis to writing code embedded with inline results and images and text and that type of stuff.

Brian: Absolutely. This is something that we really think a lot about: when humans are working with code and data, eventually, for their work to be meaningful and impactful, the code and data need to be embedded into what we think of as a narrative or a story that enables humans to interact with it, make decisions based on it, and understand it. And it's really that human application towards decision-making — it really makes a difference to have a human in the loop when you're working with data.

Hugo: And for our listeners out there who may not know what IPython is and how it differs from Python, would you mind spelling it out for them?

Brian: Yeah. So Python is the programming language; IPython stands for Interactive Python, and it originally was a terminal-based interactive shell for working with Python interactively. It had, and continues to have, a lot of nice features that you want when you're working interactively, such as nice tab completion, easy integration with the system shell, and inline help — the features of a rich interactive shell. The relationship between IPython and Jupyter today is that originally, when we built the notebook, we called it the IPython Notebook because it only worked with Python. Over the years we realized that the same user interface and architecture would also work with other programming languages. The core developers of IPython then spawned another project, namely Project Jupyter, which is the home for the language-independent aspects of the architecture, and IPython continues to exist today as the main way that people use Python within Project Jupyter. So it continues to be a project that we're still working on.

Hugo: So could
you give us a high-level overview of what Project Jupyter is and what it entails?

Brian: Yeah, so you've read a good summary of Project Jupyter, in that it's focused around open source software, open standards, and services for interactive computing. I think a lot of people are familiar with some of the software projects that we have created, namely the Jupyter Notebook — and I'm sure we'll get to talk more about that in this conversation — but underneath the Jupyter Notebook is a set of open standards for interactive computing, and when we think about Jupyter and its impact, it's really those open standards that are at the core of it. One way to think about it is that it's a similar situation to the modern Internet: yes, there are individual web browsers and websites, but underneath all of that there's a set of open standards — HTTP, TCP/IP, HTML — that enable all of those things to work together. The open standards that Jupyter has built for interactive computing serve a similar role to those other protocols in the context of the broader Internet.

Hugo: So I want to find out a bit about the scale and reach of Project Jupyter, but I want to preface this with a story that you and I both know, for our listeners. Recently I attended a talk you gave at JupyterCon here in New York City, and before you started speaking you asked people to put up their hand if 10 or fewer people in their organization used some aspect of Project Jupyter. You then asked people to put their hand up for 50 or fewer, 100 or fewer, 500 or fewer, a thousand or fewer, and so on, and people put their hands up at every point. Then you asked whether anybody was in an organization with over 10,000 people using some aspect of Project Jupyter, and a number of people put their hands up. That was a really large aha moment for me, thinking about the scale and reach of the project as a whole. So with that as a kind of intro, I'm wondering what you can tell us about the scale and reach of the project.

Brian: Yeah, so this is something that's been really fun to be part of over the last few years: seeing the usage of Jupyter literally explode and take off in ways that we never imagined. There are a number of different ways of thinking about this. Being an open-source project, we don't have an accurate, precise way of tracking how many users we have; our users obtain and install Jupyter through a number of different means, and that makes it a challenge. One nice thing that we're watching is the number of notebooks on GitHub, which can be obtained by querying the GitHub API. We have an open-source project where we're tracking that over time, and as of this summer the total number of public notebooks is on the order of two and a half million. From talking with the GitHub folks that we know, it looks like there's roughly another similar number of Jupyter notebooks that are not visible to the world. The interesting thing there — obviously the absolute number is interesting, but I think what's more telling is that over time we're seeing an exponential increase in the number of notebooks, and the doubling period right now is around nine or ten months. That points to very strong current numbers as well as growth. And you know, it's difficult to put a number on the total number of people using Jupyter worldwide. Some of what makes it challenging is that most of our staff is in the US and Europe, and yet we know from our Google Analytics traffic that Asia right now is one of the most popular continents using Jupyter. We don't have many contacts with people there, and we don't know how Jupyter is being used, but we see a very strong signal that it's being used heavily.

Hugo: And how about in terms of contributions and the number of developers working on the project?

Brian: Yeah, so along with the usage, there's definitely been an increase in the number of contributors. I think the total number of contributors is somewhere over 500, and it's a
fairly large-scale open source project: we have over a hundred different repositories on GitHub spread across a number of different orgs. The core team right now — our core maintainers of the project, many of whom work mostly full-time on it — is around 25 people. The Jupyter Steering Council is a key part of the leadership of the project, and I think there are currently 15 steering council members; a number of new people joined the steering council this summer, so I don't remember the precise number. One thing I want to emphasize with this is: what is the right narrative to have about the different contributions people make to Project Jupyter? I want to make an analogy to Hollywood: if Jupyter were a movie, what type of movie would it be? I think it's important to note that it would not be a movie where a single superhero comes and saves the day — a Superman narrative really doesn't fit the reality of how Jupyter has been built. A movie that would be a better analogy is something like Infinity War, where you have a bunch of different superheroes, all with very diverse skills and strengths, contributing to the overall project. Yes, I'm the one here talking to you today, but I am one among many people who've done absolutely amazing work on the project.

Hugo: And for our listeners out there who would like to get involved in perhaps contributing to the project, what are good ways to get involved, and — I hesitate to say bad ways — what are less good ways to get involved?

Brian: Yeah, so this is one thing that I talked about at JupyterCon; in that context it was more about what are healthy and productive ways for large companies to engage with open source. For individuals, I would say one of the best ways would be to find a part of the project that you're interested in and then come on GitHub and begin to interact with us. A lot of our popular GitHub repos have issues that are tagged for first-time contributors, and we're working hard to make the project a welcoming place for new contributors, so we welcome people to come and talk to us there. We also have public chat rooms on Gitter — an online, web-based chat platform integrated with GitHub — so, for example, Jupyter Notebook, JupyterLab, JupyterHub, and Jupyter widgets all have chat rooms on Gitter, and both the core contributors and the broader community hang out there. That's a great way for people to get involved.

Hugo: And how about organizations that want to contribute to the project, Brian?

Brian: I think in this case it's really helpful to have a good mental model of how open source projects work and how contributions function. My favorite mental model is actually from Brett Cannon, who's one of the core Python devs and works at Microsoft. In a tweet he said something like: submitting a pull request to an open-source project is like getting someone a puppy — you have to understand that the person accepting the pull request is essentially agreeing to care for that puppy for the rest of its life. One of the patterns we see in organizations and companies that want to contribute to open source is that they're interested in particular features, so they have their employees contribute in a way that generates a lot of large new pull requests for those features. This is where the puppy mental model really helps: oftentimes the core maintainers of open source projects are completely overwhelmed with just the maintenance of the project — bug fixes, releases, issue triage and managing issues, but also reviewing other contributors' pull requests. And so one of the most helpful things we're trying to cultivate is basically a balanced
perspective of contributions: one that includes not just submitting pull requests with new features, but also reviewing other people's pull requests, helping other users with particular issues, and even fixing bugs. One really nice thing GitHub has done recently is a new user interface for expressing someone's contributions to a particular GitHub repository. There's sort of an XY coordinate system with four directions, and it shows someone's contributions — I think it's code review, pull requests, issues, and one other — and from that you can get a perspective on how balanced someone's contributions to an open source project are. So a simple way of putting it is: we encourage people to have a balanced way of contributing to open source. Now, we also want to specifically address first-time or new contributors, and there the idea again is balance, but in a way where the core contributors and existing people working on the project can help new contributors come along and begin to contribute in different ways. Even for new contributors, those contributions don't necessarily have to be pull requests; even checking out existing pull requests and just testing them locally is really, really helpful for all of these projects. So that balanced perspective is something that we're trying to cultivate.

Hugo: And it's incredible that, you know, GitHub now has the feature you discussed, which kind of facilitates figuring out this balance, right?

Brian: Oh, absolutely. I was thrilled to see them release that, and I think it happened in the last month.

Hugo: We'll jump right back into our interview with Brian Granger after a short segment. It's well known that humans can't multitask, but apparently machines can. I'm here today with Friederike Schüür, research engineer at Cloudera Fast Forward Labs, and we're going to talk about machines that multitask, or multitask learning. Friederike, what is multitask learning?

Friederike: Hugo, it's nice
to be back on the DataFramed podcast. Thanks for having me. We humans may not excel at doing too many things at once — the definition of multitasking — but we do benefit from learning, or having learned, many related tasks. That's multitask learning. For example, several years ago I moved from Germany to the Netherlands. As a native German speaker, it was relatively easy for me to pick up Dutch: as Germanic languages, German and Dutch share semantic and syntactic properties that a German-speaking Dutch language learner like me can benefit from. I was multitask learning — learning Dutch was easier for me because I had learned German before.

Hugo: And how about machines?

Friederike: Well, machines struggle to learn multiple tasks. Today's supervised machine learning algorithms — algorithms that learn from labeled data — tend to learn one and only one task. They achieve mastery through laser-like focus. Since they only learn one task, they cannot benefit from the relations between tasks. Multitask learning allows machines to learn multiple tasks at once, so that they can benefit from the relationships between tasks.

Hugo: Great. Would you mind giving us an example?

Friederike: Of course. Let's get into the nitty-gritty with natural language processing, or NLP. Part-of-speech tagging, syntactic parsing, and textual entailment are all tasks that assign labels to parts of sentences. Part-of-speech tagging assigns labels such as noun, verb, adjective, and so on. Syntactic parsing identifies the syntactic relations between the parts of a sentence. Textual entailment identifies whether one sentence is the consequence of another, contradicts another, or whether there is no such relationship between the sentences. Part-of-speech tags improve syntactic parsing; part-of-speech tags and the parse tree improve textual entailment. Now, multitask learning enables us to train one model to master all three tasks, so that textual entailment, for example, can benefit from the part-of-speech tags. Importantly, a model trained to master all three tasks is likely to outperform models trained to master one and only one task. That is one of the benefits of multitask learning: it allows machines to benefit from related tasks for better performance. It's no surprise that the natural language processing library spaCy — I'm a huge fan — uses multitask learning for some of its models.

Hugo: That sounds great. So when would you use multitask learning?

Friederike: Multitask learning is an approach to training supervised machine learning algorithms; it is not an algorithm. So in principle there's a multitask version of every supervised algorithm, and you could use multitask learning all the time. But you really want to use multitask learning if you're looking to solve complex problems. Single-task learning does a fine job on simpler problems, which is why it's been so popular until now. Complex problems benefit from the power of the multitask learning approach; they benefit from learning the relations between tasks.

Hugo: And can you give us some examples?

Friederike: Absolutely. Image segmentation algorithms, for example, identify object boundaries in images: the boundaries of tissue types in medical images, or building types in satellite images. Image segmentation is hard; results are plagued by blurry edges because algorithms struggle to really delineate objects — they assign different labels to pixels at object boundaries. Image segmentation benefits from joint training with boundary detection or shape identification. Most buildings are composed of rectangles; their edges are straight lines. Using multitask learning, we can exploit these regularities to sharply delineate object boundaries during image segmentation.

Hugo: So what about natural language processing?

Friederike: Yeah, that's a pretty sensible choice too. A couple of months ago, Salesforce released one model that mastered many NLP tasks, from sentiment analysis, semantic role labeling, and machine translation to common sense reasoning and question answering. Multitask learning is only effective, of course, if the model is asked to learn tasks that are actually related. In this case,
For the Salesforce model, it's clear intuitively how common sense reasoning and question answering support each other: common sense reasoning may help us get the answer we need. Thanks, Friederike, for this enlightening introduction to multitask learning. After that interlude, it's time to jump back into our chat with Brian. As we've been discussing, the scale and reach of Project Jupyter is massive, so I'm sure there are so many different uses of notebooks and the project in general, but I'm wondering, to your mind, what are the main uses of Jupyter notebooks for data science and related work? Yeah, in terms of numbers of people using Jupyter for a particular purpose, I would say interactive computing with data, so data science, machine learning, AI, is one of the most popular ways that Jupyter is being used, both by practitioners, so people who are working in industry on data science and machine learning teams, but also in educational contexts: within universities, with online programs, with boot camps, you have instructors and students doing those activities around data science and machine learning in an educational context. I think that really captures the bigger picture of Jupyter's usage. Yeah, and in fact we use them at DataCamp for our projects infrastructure; as you know, we teach a lot of skills in our courses, and in our projects we teach kind of end-to-end data science workflows using Jupyter notebooks. That project-style workflow is something that I've seen when I've taught data science at Cal Poly, my university, in that oftentimes it's helpful to start with students in a very highly scripted manner, where the exercises are very small-scale and focus on a particular aspect, a particular concept, and then eventually transition to more open-ended, project-based work. And I know in those courses, towards the end of the quarter, when the students have an opportunity to do sort of end-to-end data science that's a little more open-ended, the learning really increases a lot; the students get a lot out of it. So
I'm thrilled to see that DataCamp has had that type of experience as well. Now, of course, we see notebooks pop up everywhere. In the slide deck that we discussed earlier you have a great slide on the Large Synoptic Survey Telescope, and on top of that, of course, the gravitational waves were discovered by the LIGO project, and they've actually published all of their Jupyter notebooks. So this is in basic scientific research too, right? And there's a lot of stuff happening at Netflix with notebooks now, so it's across the board, right? Yeah, and this is really a pattern that we've seen emerge in the last two years, and that is the transition from ad-hoc usage by individuals in organizations to official, organization-wide deployments at scale. We're starting to see a lot more organizations adopt Jupyter in a similar way to LIGO or LSST or Netflix, where it is officially deployed and maintained by the organization, and many, many users are using Jupyter on a regular basis. Some of the larger deployments that we're aware of are on the order of many thousands, or even 10,000 or more people, so the scale is definitely getting large in these. And I'm going to state two things that I think are facts, and correct me if I'm wrong: one, Netflix runs over a hundred thousand automated notebook jobs a day, and two, at least two contributors or core contributors to Project Jupyter work full-time at Netflix as well. Yes, absolutely. So Kyle Kelley and M Pacer are on the notebook team, I don't know if that's exactly the name of their team, but they're one of the tools teams at Netflix, and they work both with us on the core Project Jupyter projects, but they also have a number of other open-source projects that work with the different Jupyter protocols. One of those is nteract, which is another user interface for working with Jupyter notebooks that has a focus on simplicity, and on personas where individuals do want to work some with code, but they're not living and breathing in
code all the time, so business analysts would be a great example of a persona that nteract is targeting. And then, as you mentioned, Netflix has really innovated in the area of using notebooks in a programmatic way, running them as batch jobs every day, and I think your number of around a hundred thousand batch jobs that are notebooks a day sounds about right from what I remember. They have a number of open source projects out to help with those types of workflows; one of those is papermill, the other is commuter. And I think one of the things that I love about what's going on at Netflix, and I think this really comes from the leadership of Kyle Kelley there, is a deep understanding of the value of Jupyter's open protocols. That is sort of the recognition that the different software projects that we've built on top of those protocols are sort of like a Lego set: you bring them home from the store and there's a default instruction set to build something out of the box, but then you realize that the same pieces can be reassembled in different ways to build whatever your organization needs. I love how that thinking has really sort of seeped into all the different ways that Netflix is working with data, and I think they're doing really interesting things as a result. And the notebook team at Netflix actually published a really interesting blog post recently, which we'll link to in the show notes along with a lot of other things that we're talking about. Yeah, and they also gave a number of talks at JupyterCon, and those talks will be posted on the JupyterCon YouTube channel in the coming month, I think. Okay, fantastic. So, with 2.5 million public notebooks on GitHub, I'm sure there's a lot of surprising stuff happening out there, and this may be a bit of a curveball, but I'm wondering if you've seen any uses of a Jupyter notebook that have surprised you, where you've been like, oh wow, that's interesting.
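The scheduled-notebook pattern Brian describes is typically driven with papermill, which executes a notebook with injected parameters and writes one output notebook per run. The sketch below only builds the per-job command lines rather than executing anything, so it stays self-contained; the notebook name, output directory, and parameter names are hypothetical, invented for illustration.

```python
from datetime import date

def papermill_jobs(notebook, out_dir, regions, run_date):
    """Build one papermill command per region for a daily batch run."""
    stem = notebook.replace(".ipynb", "")
    jobs = []
    for region in regions:
        out = f"{out_dir}/{stem}-{region}-{run_date}.ipynb"
        # `papermill <input> <output> -p <name> <value>` injects parameters
        # into the notebook's tagged "parameters" cell before executing it.
        jobs.append(f"papermill {notebook} {out} -p region {region} -p run_date {run_date}")
    return jobs

jobs = papermill_jobs("daily_report.ipynb", "runs", ["us", "eu", "apac"], date(2018, 10, 1))
for job in jobs:
    print(job)
```

A scheduler like cron or Airflow would then run each command, giving you an immutable, inspectable output notebook per job, which is a big part of why the pattern scales to Netflix-sized workloads.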
So, what is the most surprising use of a notebook you've seen? Yeah, I mean, one of the fun things about working on a project is to follow all the amazing things that our users are doing, and, you know, I think seeing the impact that Jupyter is having in the world of scientific research is something that we're really proud of. To see large-scale science such as LIGO and Virgo winning the Nobel Prize in Physics, and as part of that publishing Jupyter notebooks that anyone in the world can use to completely reproduce their analysis, all the way from the raw data to the final publication-ready visualizations, that makes us really proud. I don't know that surprise is the right word to use there; some of that is that that's the community that we came out of, and so we've always worked really hard to make sure that Jupyter was useful for those usage cases. In terms of surprise, the most surprising or shocking usage of Jupyter was by Cambridge Analytica and SCL Elections to build machine learning models to manipulate the 2016 election. Right, and I do think, you know, surprising is one word; disturbing is another word. I actually remember the first I saw of it was a tweet, and I think it was Wes McKinney who tweeted words to that effect, and we saw a screenshot of a Jupyter notebook with a pandas DataFrame, with some scikit-learn fit and predicts or something like that. That was a moment where I also stepped back and really thought, you know, all these tools can be used for all types of purposes, right? So Cambridge Analytica, all of their web presence, their GitHub presence, is gone. SCL Elections, which worked with them, hasn't taken all their stuff down from GitHub, and so there's a project called Jupyter stream, and you could tell that the people working at SCL Elections were typical data scientists who were excited to use these tools to do data science. The thing that's scary is, if you look in the demo subdirectory, there's a notebook there, and it's
very clear the type of things they were doing. Now, in this particular case, it's nothing particularly sensitive; it looks like they're tracking voter registration counts by calendar week and working with a pandas DataFrame with that. But we were certainly, again, I'm not quite sure what the right word is, surprise doesn't quite capture it. I think it was really a moment of waking up for us, realizing that having an open-source project with a very free and liberal open-source license is very similar to free speech: a liberally licensed open-source project can and will be used by just about anyone, and that includes people doing really good things but also people doing really evil things. It's creepy how this repo also says "Jupyter to the rescue!", exclamation point. It's really a trip. I mean, there was the original interview that Christopher Wylie did, so he was one of the data scientists at Cambridge Analytica, and in that first interview that came out, I think it was in The Guardian, he used the phrase "build models," and immediately I thought, wait, hold on a second, this is a data scientist talking about building models; what they're really saying there is import scikit-learn in a Jupyter notebook. Initially it was sort of like, yeah, they might have used it, or maybe they used RStudio, but then over time it's become very clear that they certainly were using Jupyter at some point. So this is a nice segue into my next question, which may seem like it has an answer too, but I suppose it isn't necessarily: adoption of notebooks, as we've discussed, has been huge, and I'm wondering if there are places you see notebooks used where they shouldn't be used. Yes. I'm not quite sure I would phrase it as where they shouldn't be used, but certainly I think there's a little bit of an effect where the notebook is a hammer, and so everything starts to look like a nail. In particular, the type of workflow where notebooks begin to be rather
painful is when exploratory data science and machine learning becomes more about software engineering and more about data engineering, and in those usage cases it's not a fantastic software engineering environment; it's really not designed for that purpose. This is something we're hearing from our users: that right now there's sort of a very steep incline between working interactively in a notebook and software engineering, and as someone moves across that transition, at some point they get to the point where really the right thing for them to do is stop using Jupyter and open up their favorite IDE and start to do traditional software engineering. That can be rather painful in the sense that most of the IDEs that people love are not web-based, and so, you know, if anyone is working with significant amounts of data and running in the cloud, those IDEs may not even really be a great option. So we are working a lot to improve the experience at that boundary between interactive data science and software engineering. We don't envision that Jupyter is ever going to replace full-blown IDEs, but it's really at that boundary where we're seeing a lot of user pain currently, and where notebooks themselves start to be not the best tool. Yeah, and that dovetails nicely into my next question, which is around the fact that there are several common criticisms of notebooks, such as that they may encourage bad software engineering practices, and, I suppose most famously recently, JupyterCon accepted Joel Grus's talk "I Don't Like Notebooks" to be presented at JupyterCon. I'm just wondering what you consider the most important or relevant or valuable or insightful criticisms that can help move the project forward. Yeah, so I really appreciate the talk that Joel gave at JupyterCon; it was a really well-received talk, and we want to hear things like this, it's really important for us. Some of the criticisms that he had about Project Jupyter are in the category of
things that we could fix, so existing user experiences or features that we offer, don't offer, or could improve, and on most of the things he brought up in that category, I think the whole core team is more or less on the same page. The other aspect that he was bringing up gets more to the heart of interactive computing with Jupyter notebooks, and, you know, I think it's helpful to bring those things up, as it really forces us to clarify the value proposition of that type of workflow in a notebook compared to traditional software engineering. So I think the discussion that's emerged out of that has been really helpful, and something that is helping us to clarify when you should use Jupyter notebooks, why you would use them, and why you should not use them. We'll jump right back into our interview with Brian Granger after a short segment. I'm back here with Friederike to talk about machines that multitask. Friederike, you said that multitask learning is only effective if the model is asked to learn tasks that are related. What if they're not? Well, if tasks aren't related but you're asking a single model to learn them all, then multitask learning may actually lead to worse performance. We call this negative transfer. Okay, so how do I know if tasks are related? Yes, so mathematicians have tried and failed to develop a formal measure of task relatedness, and in the absence of such a formal measure, you just give it a try and see if multitask learning leads to improved model performance; if so, it's likely that the tasks are related. As a rule of thumb, if tasks are drawn from the same domain, say natural language processing, then they're likely related. You see, intuition, even in the age of machines, is still extremely valuable. Can you give us an intuition for why multitask learning works? Well, if you're providing a model with two instead of one set of labels, you're giving it more information that it can learn from; that is one of the reasons.
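As Friederike says, without a formal relatedness measure you just try it: compare held-out error with and without sharing. Below is a deliberately extreme numpy sketch, entirely synthetic and not a real multitask model: "sharing" is crudely simulated by pooling both tasks' data into one ridge regression, with the second task's true weights either identical to task 1 (related) or exactly opposed (unrelated), to make positive and negative transfer visible.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 15  # few samples per task relative to dimension

def fit_ridge(X, y, lam=1.0):
    # Closed-form ridge regression.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def experiment(related):
    w1 = rng.normal(size=d)        # true weights for task 1
    w2 = w1 if related else -w1    # task 2: identical vs opposed (extreme case)
    X1, X2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    y1 = X1 @ w1 + 0.1 * rng.normal(size=n)
    y2 = X2 @ w2 + 0.1 * rng.normal(size=n)
    w_single = fit_ridge(X1, y1)   # task 1 trained on its own n samples
    # Crude "multitask": pool both tasks' data into one model.
    w_multi = fit_ridge(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    Xt = rng.normal(size=(2000, d))  # held-out inputs, scored against task 1
    err = lambda w: np.mean((Xt @ w - Xt @ w1) ** 2)
    return err(w_single), err(w_multi)

related_errs = experiment(related=True)
unrelated_errs = experiment(related=False)
print(related_errs)    # sharing helps: more data for the same signal
print(unrelated_errs)  # negative transfer: pooling opposed tasks hurts task 1
```

The same "train it both ways and compare held-out error" recipe is exactly the empirical relatedness check Friederike recommends, just with a real multitask model in place of the pooled ridge.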
Also, while single-task algorithms struggle with noisy labels, multitask algorithms are more robust to noise. Usually, measurement errors are independent, so algorithms using multitask learning can average out noise during training. Finally, multitask learning acts as a regularizer: since the model has to learn multiple tasks at once, it is less likely to overfit to one task and more likely to generalize to previously unseen data. The reasons for this remarkable effectiveness of multitask learning are rooted in machine learning fundamentals; it benefits you to know how and why algorithms work, not just how to use them. Besides accuracy improvements, are there other benefits of multitask learning? Yes, multitask learning allows you to use not just one but multiple data sets to train one single model. Researchers have gone as far as training one model to master 259 tasks based on 259 data sets; here, multitask learning was used to predict the likely interaction between medical targets and promising compounds, as part of so-called virtual drug screening. Data on target-compound interaction is produced in medical laboratories that investigate related but distinct questions, under circumstances that vary from lab to lab. Multitask learning allows you to take all this data to train one single model while respecting likely differences between data sets, by treating each one as a distinct task. To be clear, these tasks are related because the data is related: it is drawn from the same domain to answer the same or similar research questions. Furthermore, multitask models also tend to be more efficient in production: a multitask model can give you multiple predictions while it needs to process the input data only once. And finally, multitask models can give you insight into task relations, which is pretty neat. Got it, so what did you do with multitask learning? We built a prototype, we call it Newsie, that puts current news in perspective. We trained one multitask neural network on a body of news articles to predict a news article's category: politics, sports, entertainment, you
name it. News articles were taken from sensationalist tabloids, think the New York Daily Post, and the more buttoned-up broadsheets, think the New York Times. Newsie was trained to predict the news category of broadsheet articles, that's task one, and tabloid ones, that's task two. So what does the prototype look like in its final form? It's quite interactive: you can take any article you're interested in and analyze it as a broadsheet or tabloid article. Our multitask neural network reads the article like a human would, word by word from left to right, and incrementally predicts the article's news category. Predictions are overlaid on the article word by word, so you gain insight into the words associated with news categories, given the unique context of the article you decided to look at. The user can switch between a tabloid and a broadsheet lens to study differences in language use and style across publication types. Crucially, this model, since it's been trained on 1.6 million articles, surfaces large-scale reporting tendencies and other biases that a human reader may not, or may only unconsciously, pick up on. We found, for example, that when reporting on the daily news, tabloid newspapers are more likely to use language that signals factual statements, such as "reported" and "said"; this isn't true for broadsheet articles. Tabloids also tend to report on politics and world news as opinion, which is an interesting choice, since this may affect how we consume the news. It's good to surface these tendencies, and Newsie does exactly that, using the power of multitask learning. Thanks, Friederike, for that informative dive into the benefits of multitask learning. Time to get straight back into our chat with Brian. So we've discussed notebooks, of course, but something I'm really excited about, and I know something you're incredibly excited about, is the next-generation user interface for Project Jupyter, which is JupyterLab. So maybe you can tell us what JupyterLab is and why working data scientists would find it useful. Yeah, JupyterLab is definitely
something that I and many other people on the core team are excited about. You know, JupyterLab is the next-generation user interface for Project Jupyter, and we've been working on JupyterLab for around four years now, and just in the last month it left beta, so it is ready for production. Congratulations! Yeah, thank you very much, we're really pleased to get through that hurdle. It's still not at a 1.0 release, because some of the developer-oriented extension APIs are still stabilizing. One of the big things we heard from users of the classic notebook is that people wanted the ability to customize and extend and embed aspects of the notebook in other applications, and the original code base of the classic notebook just wasn't designed in a way that made that easy. So one of the core design ideas in JupyterLab is that everything in JupyterLab is an extension, and all those extensions are npm packages. The idea there is that the core Jupyter team can't build everything for everyone, and a lot of different individuals and organizations want to come along and add new things to JupyterLab, and those extension APIs, which are the public, developer-oriented APIs of JupyterLab, enable those extensions to be built. We're still in the process of stabilizing some of those APIs, but I want to emphasize that from a user perspective, for people who are using Jupyter on a daily basis, JupyterLab is fully stable and production-ready, and in many ways at this point I would say it's a better user experience, and more stable, than the classic notebook. Great, and what type of features do you have in JupyterLab that you don't get in the classic notebook? One of them is the ability to work with multiple different activities, or building blocks for interactive computing, at the same time. In the classic notebook, each notebook or terminal or text editor would have a separate browser tab, and that made it very difficult for us to integrate those different activities with each other. So
an example of how that integration works in JupyterLab is, if you have multiple notebooks open side by side, you can just drag cells between those two notebooks. Another example would be, if you have a markdown file open, you can right-click on the markdown file and open a live markdown preview, and then also open a code console attached to that markdown file and start running code in any of the different languages that Jupyter supports, in a manner that's more similar to an experience like RStudio. So we have the different building blocks, places to type code, outputs, terminals, notebooks, integrated in different ways to support some of these other workflows that come up. And also a CSV viewer, right? Yes. Another big design idea in JupyterLab is the idea of more direct-manipulation user interfaces. In many cases, writing code is the most effective way of interacting with data; however, there are many situations where writing code is a bit painful, and a great example of that is if you have a new CSV file and you don't know what's in it and you simply want to look at it. Of course, you can open up a notebook and import pandas and stuff to look at the CSV file, but in many cases more direct modes of interaction are highly productive and useful. So JupyterLab's file handling is based around the idea of the possibility of multiple viewers and editors for a given file type. For example, for a CSV file, you can open it in a text editor and edit it as a plain-text CSV file, or you can open it in this new grid, or sort of tabular, viewer that we have, and that viewer is the default for CSV files. So you can just double-click on a CSV file in JupyterLab and it will immediately open in a form that looks like a table; we try to detect the delimiters, say if it's a TSV file, and even allow a drop-down to select those. And I recall from one demonstration that it supports wildly large CSV files as well, right? Yeah, one of our core contributors, Chris Colbert, spent a lot of time on this.
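Delimiter detection of the kind Brian describes is a classic problem. JupyterLab's grid viewer is implemented in TypeScript, so this is not its actual code, but Python's standard library illustrates the same idea with `csv.Sniffer`, which infers the dialect from a sample of the file:

```python
import csv
import io

# A small sample whose delimiter (";") is unknown a priori.
sample = "name;score;team\nada;91;blue\ngrace;88;red\n"

# Sniff the dialect from the sample, restricting the candidate delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
print(dialect.delimiter)

# Parse the data with the detected dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows)
```

A viewer built on this pattern can open the table with the sniffed delimiter by default while still offering a drop-down override, which matches the behavior Brian describes.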
He built a well-designed data model and a viewer on top of it. The data model for this grid viewer does not assume that all of the data is loaded into memory; it has an API that allows you to request data from the model on an as-needed basis. Where that's used is in the view that sits on top of it: if you have a really large CSV or tabular data model, the view is only going to request the portions of the data that are visible to the user at a given time. So, for example, right now in some of the demos that we're doing, you can just double-click on a CSV file that has over a million rows; those files are big enough that they don't open successfully in Microsoft Excel on the same laptop, and they open just fine in JupyterLab. The renderer that Chris wrote uses canvas, so it's a very high-performance tabular data viewer, and to keep ourselves honest, we've tested it with synthetic data sets, so these are not concrete data sets, they're generated on the fly, that have a trillion rows, and the tabular data viewer works really well; you can scroll through the data set just fine. And I think another side effect of direct interaction with data is that when you make it easy for users to interact with data in those ways, they're going to do it, right? So if you can double-click on a CSV file, a user is going to find, wow, that's useful, and they're going to do that, and you have to spend a lot of time making sure the underlying architecture doesn't let them do things that are going to have adverse side effects. We're trying to build an architecture that has both a good user experience but also can deal with the realities of large data. And there are, you know, several other features that we could discuss, but I'm just going to pick one which I think is very attractive and fantastic, which is the ability to collaboratively work on Jupyter notebooks with colleagues and collaborators. Yes, so this is
something that we've been working on for a while now. Our first take on this was a JupyterLab extension that a postdoc at UC Berkeley, Ian Rose, wrote. This provided integration with Google Drive and the Google Realtime API to enable multiple people to open a single notebook and collaborate in real time on that notebook, and see the other people working on and editing the notebook at the same time. Then, in the last year and a half, we've started a new effort to build a real-time data model and datastore for JupyterLab, for two reasons: one is that the Google Realtime API has been discontinued, and the other is that we've heard very clearly from our users that there are many organizations for whom sending all their data to Google APIs is a no-go. So it's become really important for us to have a high-performance, really well-designed real-time data store. We've been working on that for the last 18 months, and again, Chris Colbert, who did the data grid, is the person working on that. Great, and listeners out there, this has been kind of a whirlwind introduction to a bunch of features in JupyterLab, and I'd urge you to go and play with it yourself and check out some of the demos online as well if you haven't yet. And I want to clarify: the version of JupyterLab that's out today does not yet have the real-time collaboration. Okay, that's right, not yet. So, we've discussed IPython, we've discussed Jupyter notebooks, we've discussed JupyterLab; what else exists in the Jupyter ecosystem? Could you give us just a brief rundown of a couple of the other things? Yeah, absolutely. Probably the biggest other constellation of projects is the JupyterHub project. JupyterHub is its own organization on GitHub, and there are a number of different separate repos and projects there. JupyterHub provides basically the ability for organizations to deploy Jupyter at scale to many users, and, you know, with the sort of patterns of adoption that we're seeing right now, that usage case is
really, really important, and as a result of that, JupyterHub is seeing both a lot of interest from people contributing and also organizations using it. We discussed earlier the talk you recently gave at JupyterCon, in which you stated that Project Jupyter is undergoing a phase transition from having individual users in organizations to having large-scale institutional adoption, and I'm wondering what the unique challenges are that the project is now facing due to this transition. Yes, so there are both organizational and technical challenges we're facing. On the organizational side, I would say the big challenge is that we're seeing an increasing number of organizations coming to us and wanting to interact with us, rather than just individuals in those organizations, and that really changes the type of people you're talking to. In many cases in the past it may have been data scientists or machine learning researchers, and increasingly it's managers, project managers, and other decision-makers who are thinking about the broader data strategy at their organizations. From a technical perspective, it brings out a lot of new usage cases, in particular in JupyterHub, to address the needs of large organizations. One example is security: security is a really important thing for larger organizations, in particular when there's sensitive data in the mix. Another aspect is that in these large organizations there is typically a wide range of different skill sets, responsibilities, roles, access permissions, and priorities among the people working with Jupyter, so it's not necessarily just people who are living and breathing code all day long, but a lot of other people in the organization working with data who don't necessarily want to look at code all the time, or even most of the time. So a lot of the work we're doing is thinking about how Jupyter needs to evolve to address those usage cases. Absolutely. So, Brian, for my last question, I'm wondering if you have a final
call to action for all our listeners out there, who may have used Jupyter notebooks, may not have, but may be interested in doing so. Yeah, so I think there are a couple of different calls to action. One is for people to engage with open source as individuals: if you're a data scientist or someone doing machine learning at a company, or a student learning about these tools and techniques, engage with the open source projects. Find an open source project you're interested in, understand more about the project, maybe help with documentation. A lot of what we've found is that innovation happens when diverse groups of people come together and talk to each other and work towards common goals, and so the more people we have joining the projects, contributing, and helping us think about these things, the better off and more healthy the open source projects will be, but also the users of those projects will be better served. Agreed. So a second call to action would be for people working in larger organizations that are using open source tools in the space. I think it's important to know that many of the open source projects, in particular those that are community-driven, like Jupyter and many of the other similarly focused projects, continue to struggle with long-term sustainability, and there are many core contributors to these projects that continue to lack long-term funding and the ability to focus on working on the projects. So if you're at an organization using these tools, I would really encourage you to talk to the people in the organization to think about and understand how you can support the core contributors and the broader sustainability, both in the sense of community but also, in particular, the financial sustainability of these efforts. That would be really, really helpful. And I'll add one other final call to action there, which is: I started using JupyterLab all the time instead of notebooks mid last year, I think, and I would urge anyone out there who still uses the classic notebook
to jump into JupyterLab. I think you have no reason not to these days; it's such a wonderful place to use notebooks, among many, many other things. Yes, that's a great point, and even for the part of my job where I get to use the Jupyter notebook, I transitioned, in particular in teaching and research, to using JupyterLab back in January, and it's worked really, really well in this context, and now I'm using JupyterLab basically all the time, so I can just echo what you said, and I appreciate the kind words. Brian, thank you so much for coming on the show; I always love our conversations, and it's been an absolute pleasure finally doing this and putting it out there. Yes, and thank you so much, Hugo, for working on this podcast; I know a lot of people really appreciate it. Thanks for having me. Thanks for joining our conversation with Brian. We saw how much work necessarily goes into open source projects of the scale of Jupyter, such as software engineering, UX and UI design, large-scale organizational considerations, and community building, to name a few. We also saw the unique challenges faced by open source data tooling platforms in general, due to the fact that we're now at a phase transition from having individual users in organizations to having large-scale institutional adoption; Netflix is one example of many, as is the LIGO project's Nobel Prize-winning discovery of gravitational waves. And lastly, don't forget to go give JupyterLab a test drive; you've got everything to win and nothing to lose. Also make sure to check out our next episode, a conversation with Cassie Kozyrkov, chief decision scientist at Google Cloud. Cassie and I will be talking about data science, decision making, and decision intelligence, which Cassie likes to think of as data science++, augmented with the social and managerial sciences. We'll talk about the different and evolving models for how the fruits of data science work can be used to inform robust decision making, along with the pros and cons of all the
models for embedding data scientists in organizations relative to the decision function. We'll tackle head-on why so many organizations fail at using data to robustly inform decision making, along with best practices for working with data, such as not verifying your results on the data that inspired your models; as Cassie says, split your damn data. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter @hugobowne and DataCamp at @datacamp. You can find all our episodes and show notes at datacamp.com/community/podcast. In this episode of DataFramed, a DataCamp podcast, I'll be speaking with Brian Granger, co-founder and co-lead of Project Jupyter, physicist, and co-creator of the Altair package for statistical visualization in Python. We'll speak about data science, interactive computing, open source software, and Project Jupyter. With over 2.5 million public Jupyter notebooks on GitHub alone, Project Jupyter is a force to be reckoned with. What is interactive computing, and why is it important for data science work? What are all the moving parts of the Jupyter ecosystem, from notebooks to JupyterLab to JupyterHub and Binder, and why are they so relevant as more and more institutions adopt open source software for interactive computing and data science? From Netflix running around a hundred thousand Jupyter notebook batch jobs a day to LIGO's Nobel Prize-winning discovery of gravitational waves and publishing all their results reproducibly using notebooks, Project Jupyter is everywhere. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed. Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems they're solving. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter @hugobowne and DataCamp at @datacamp. You can find all our episodes and show notes at datacamp.com/community/podcast. Before we get started, I just wanted to let
We want to let you all know that we have something new for our podcast listeners this week: a trial of DataCamp for business. Now, what is DataCamp for business? It's basically all the content you get as an individual subscriber, plus tons of amazing tools for you and your colleagues to learn data science skills together. All you need to do is email sales@datacamp.com with the subject "podcast" and you can redeem a free two-week trial of DataCamp for you and up to 25 of your colleagues. That's sales@datacamp.com with the subject "podcast" to redeem a free two-week trial of DataCamp for you and up to 25 of your colleagues. We look forward to hearing from you. Hi there, Brian, and welcome to DataFramed. It's such a pleasure to have you on the show. We're here today to talk about Project Jupyter and about interactive computing, and in fact you sent me a great slide deck of yours today that you've been giving recently, and something we're going to focus in on is actually a slide that you have there. I'm just going to quote it before we get started. You wrote: we are entering an era where large, complex organizations need to scale interactive computing with data to their entire organization in a manner that is collaborative, secure, and human-centered. Now, these are all touch points we're going to be speaking about during this conversation, but before we get into all of this, and before we get into a conversation about Jupyter, Jupyter notebooks, JupyterLab, and all of these things, I'd like to know a bit about you. So first, maybe you could tell me a bit about what you're known for in the data community. Yeah, no problem. So I'm a professor at Cal Poly, and for the last close to 15 years I've been involved in a number of open source projects in the scientific computing and data science space. In the early years I was involved in SymPy, which is a symbolic computer algebra library for Python, and then also IPython, the de facto interactive shell for Python. Then in more recent years I was one of the
cofounders of Project Jupyter, and in the last few years I've also co-founded Altair with Jake VanderPlas, which is a statistical visualization library. The big theme, though, is open source tools for humans working with code and data. And speaking of Altair, you're actually currently working on an Altair course for DataCamp, right? Yes, though you're being a little bit optimistic about your verb tense there. It's been a little bit stalled with all the different activities we have going on in the Jupyter world, but yeah, I think I'm around maybe two-thirds to three-quarters done with the DataCamp course for Altair. And so, as a project lead for Project Jupyter, I'm wondering what type of skills come into play there, because I know you have a very strong background: you're a physicist, you have a lot of data analytic skills, and a lot of design, engineering, and entrepreneurial skills presumably come into this role as well. So I'm just wondering what type of things you need in order to do this job. Yeah, it certainly has evolved over the years. In the early days of IPython and Jupyter we were spending most of our time doing software engineering. There was a very small amount of design work, UI/UX design work, and when it's only a handful of people, in principle there's organizational work and community work to be done, but it's at a very small scale; it's in the background relative to the software engineering. As Jupyter has grown, though, I would say the demand for more time and effort on the organizational and community side, as well as on the design aspects of the project, has really increased. One of the challenges in working on open source is that projects like Jupyter or Altair tend to attract really top-notch developers and software engineers, and so that aspect of the project tends to be reasonably well staffed. That doesn't mean that we all have as much time to put into the projects on the software engineering side as we would like. However, as these projects get big, there's nothing in particular that
attracts top-notch UI/UX designers, for example, to Jupyter. That continues to be a challenge for us and for other open source projects: how do we build design into the process and figure out how to engage designers in the development of projects? So, I mean, you're speaking about a number of things here that include, you know, design, but also hiring and structuring an organization; I know that you think a lot about getting funding for the project, and you talk about community development. These are all things we think about at DataCamp a lot as well, so it sounds somewhat similar in several ways to running a company. It probably is. I've never run a company, but when I talk to other people who are in different roles leading companies, there's a lot of overlap. Our business model doesn't involve selling things to people in the traditional sense, but we most certainly have customers, and our interaction with those customers is very similar to that of a company with paying customers. We exist in a very dynamic, fast-paced part of the economy, and it's the type of thing where, if Jupyter were to relax and begin to coast, there are hundreds of other open source projects and for-profit companies building products that would quickly put Jupyter in a position of becoming outdated. So there's a lot of thinking and work that we do around looking ahead at our three-to-five-year roadmap, where we see data science, machine learning, and interactive computing going, how we build the resources to tackle those ambitious things on those time frames, and how we build a sustainable community along the way. So how did you get interested in, or involved in, data science? As opposed to being a physics professor and researcher, how did you get into data science initially? Yeah, that's a great question. So we began working on interactive computing as part of IPython. I was a classmate of Fernando Pérez back in grad school at the University of Colorado, and Fernando had created IPython in the early 2000s. At the time, the world of interactive computing was really something that was done in the scientific realm, in academic research and education. I'm sure there were some companies at the time that were doing little bits of it here or there, but it wasn't something that was pervasive like it is today. And so as we started to build IPython and Jupyter in the 2000s, initially we felt like we had what we imagined was a very grand vision: that everyone in the world of academic research and scientific computing would be using Python and the tools that we and many other people were building. What we didn't see is that the whole world was about to discover data, and that really opened up a whole new audience and set of users for open source data science and scientific computing tools that we never imagined. So honestly, my own journey is more that we were doing what we had always done in terms of scientific computing, and then woke up to realize that we were right in the middle of the data science community that was forming, both on the academic research side and on the commercial industry side as well. I want to delve a bit more into the general nature of Project Jupyter in a minute, but before that I'd like to speak a bit more about interactive computing. So I'm going to quote Project Jupyter: Project Jupyter states that it exists to develop open source software, open standards, and services for interactive computing across dozens of programming languages. I'm wondering what a general working definition of interactive computing is, and why it's important. Yeah, this is a great question. I think in the history of computer science, interactive computing has not really been acknowledged as a topic worthy of study, as something worth thinking about carefully and clarifying. It's something that we've been doing over the years, and really the Jupyter
architecture is an expression of our thinking about interactive computing. I would say that the core idea of interactive computing is that there's a computer program running with a human in the loop. As the program runs, that human is both writing and running code on the fly, and then looking at the output or result of running that code and making decisions about what code to write and run subsequently. So there's this interactive mode of going back and forth between the human authorship of the code, the computer running it, and the human interacting with the result. And this is ideal for so many aspects of the scientific research process, right, from exploratory data analysis to writing code embedded with inline results and images and text and that type of stuff? Absolutely. This is something that we really think a lot about: when humans are working with code and data, eventually, at some point, for their work to be meaningful and impactful, the code and data need to be embedded into what we think of as a narrative, or a story, around the code and data that enables humans to interact with it, make decisions based on it, and understand it. And it's really that human application towards decision making; it really makes a difference to have a human in the loop when you're working with data. And for our listeners out there who may not know what IPython is and how it differs from Python, would you mind spelling it out for them? Yeah. So Python is the programming language; IPython stands for interactive Python, and it originally was a terminal-based interactive shell for working with Python interactively. It had, and continues to have, a lot of nice features that you want when you're working interactively, such as nice tab completion, easy integration with the system shell, and inline help features, like a rich interactive shell should. And the relation between IPython and Jupyter today is that originally, when we built the notebook, we called it the IPython notebook.
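The features Brian lists here, tab completion, shell integration, and inline help, look like this in a terminal IPython session (a hypothetical session shown purely for illustration):

```
In [1]: import pandas as pd

In [2]: pd.read_<TAB>          # tab completion lists read_csv, read_json, ...

In [3]: pd.read_csv?           # a trailing '?' shows inline help for an object

In [4]: !ls data/              # a leading '!' runs a system shell command in place
```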
That's because it only worked with Python, and then over the years we realized that the same user interface and architecture would also work with other programming languages. The core developers of IPython then spawned another project, namely Project Jupyter, that's the home for the language-independent aspects of the architecture, and IPython continues to exist today as the main way that people use Python within Project Jupyter. So it continues to be a project that we're still working on. So could you give us a high-level overview of what Project Jupyter is and what it entails? Yeah, so you've read a good summary of Project Jupyter, in that it's focused around open source software, open standards, and services for interactive computing. I think a lot of people are familiar with some of the software projects that we have created, namely the Jupyter Notebook, and I'm sure we'll get to talk more about that in this conversation, but underneath the Jupyter Notebook is a set of open standards for interactive computing, and when we think about Jupyter and its impact, it's really those open standards that are at the core of it. One way to think about it: it's a similar situation to the modern Internet, where yes, there are individual web browsers and websites, but underneath all of that there's a set of open standards, HTTP, TCP/IP, HTML, that enable all of those things to work together. The open standards that Jupyter has built for interactive computing serve a similar role to those other protocols in the context of the broader Internet. So I want to find out a bit about the scale and reach of Project Jupyter, but I want to preface this by saying it's a story both you and I know. For our listeners: recently I attended a talk you gave at JupyterCon here in New York City, and before you started speaking you asked people to put up their hand if 10 or fewer people in their organization used some aspect of Project Jupyter. You then asked
people to put their hand up if it was 50 or fewer, 100 or fewer, 500 or fewer, a thousand or fewer, and so on, and people put their hands up at every point. Then you asked whether anybody was in an organization with over 10,000 people using some aspect of Project Jupyter, and a number of people put their hands up. That was a really large aha moment for me, thinking about the scale and reach of the project as a whole. So with that as a kind of intro, I'm wondering what you can tell us about the scale and reach of the project. Yeah, so this is something that's been really fun to be part of over the last few years, to see the usage of Jupyter literally explode and take off in ways that we never imagined. There are a number of different ways of thinking about this. Being an open source project, we don't have an accurate, precise way of tracking how many users we have; our users obtain and install Jupyter through a number of different means, and that does make it a challenge. One nice thing that we're watching is the number of notebooks on GitHub, and this can be obtained by querying the GitHub API. We have an open source project where we're tracking that over time, and as of this summer the total number of public notebooks is on the order of two and a half million. From talking with the GitHub folks that we know, it looks like there's roughly a similar number of Jupyter notebooks that are not visible to the world. The interesting thing there: obviously the absolute number currently is interesting, but I think what's more telling is that over time we're seeing an exponential increase in the number of notebooks, and the doubling period right now is around nine or ten months. So that really points to very strong current numbers as well as growth. And you know, it's difficult to put a number on the total number of people using Jupyter worldwide. Some of what makes it challenging is that most of our staff is in the US and Europe, and yet our Google Analytics traffic tells us a different story about where usage is.
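For context on the counting method Brian describes: the number of public notebooks can be estimated by querying GitHub's code-search API for `.ipynb` files. A minimal sketch of building such a query follows; the exact query the Jupyter tracking project uses, and the authentication a real request needs, are not shown, so treat the details as assumptions:

```python
from urllib.parse import urlencode

GITHUB_CODE_SEARCH = "https://api.github.com/search/code"

def notebook_search_url(page=1):
    """Build a GitHub code-search URL matching public .ipynb files.

    The response's `total_count` field gives the approximate number of
    matches; real use requires an authenticated request and pagination.
    """
    params = {"q": "extension:ipynb nbformat", "page": page}
    return GITHUB_CODE_SEARCH + "?" + urlencode(params)
```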
Asia right now is one of the most popular continents using Jupyter, and we don't have many contacts with people there; we don't know exactly how Jupyter is being used, but we see a very strong signal that it's being used heavily. And how about in terms of contributions and the number of developers working on the project? Yeah, so along with the usage there's definitely been an increase in the number of contributors. I think the total number of contributors is somewhere over 500, and it's a fairly large-scale open source project: we have over a hundred different repositories on GitHub, spread across a number of different orgs, and the core team right now, our core maintainers of the project, many of whom work mostly full-time on the project, is around 25 people. The Jupyter steering council is a key part of the leadership of the project, and I think there are currently 15 steering council members; a number of new people joined the steering council this summer, so I don't remember the precise number. One thing that I want to emphasize with this is: what is the right narrative to have about the different contributions of people to Project Jupyter? I want to make an analogy to Hollywood: if Jupyter were a movie, what type of movie would it be? I think it's important to note that it would not be a movie where there's a single superhero who comes and saves the day; a Superman narrative really doesn't fit the reality of how Jupyter has been built. A movie that would be a better analogy to how Jupyter is built would be something like Infinity War, where you have a bunch of different superheroes, all with very diverse skills and strengths, contributing to the overall project. And I think it's really important to note that yes, I'm the one that's here talking to you today, but I am one among many people who've done absolutely amazing work on the project. And for our listeners out there who would
like to get involved in perhaps contributing to the project, what are good ways to get involved, and what are, I hesitate to say bad ways to get involved, but what are less good ways to get involved? Yeah, so this is one thing that I talked about at JupyterCon; in that context it was more about what are healthy and productive ways for large companies to engage with open source. For individuals, I would say one of the best ways would be to find a part of the project that you're interested in and then come on GitHub and begin to interact with us. A lot of our popular GitHub repos have issues that are tagged for first-time contributors, so we're working hard to try to make the project a welcoming place for new contributors, and we welcome people to come and talk to us there. We also have chat rooms that are public on Gitter; this is an online, web-based chat platform that is integrated with GitHub, and so, for example, the Jupyter Notebook, JupyterLab, JupyterHub, and Jupyter widgets all have chat rooms on Gitter, and both the core contributors and the broader community hang out in those contexts. So that's a great way for people to get involved. And how about organizations that want to contribute to the project, Brian? I think in this case it's really helpful to have a good mental model of how open source projects work and how contributions function, and my favorite mental model is actually from Brett Cannon, who's one of the core Python devs and works at Microsoft. In a tweet he said something like: submitting a pull request to an open source project is like giving someone a puppy, in that you have to understand that the person accepting the pull request is essentially agreeing to care for that puppy for the rest of its life. One of the patterns that we see in organizations and companies that want to contribute to open source is that they're interested in particular features, and so they have their employees contribute to open source in a way that
generates a lot of large new pull requests for those new features, and I think this is where the puppy mental model really helps: oftentimes the core maintainers of open source projects are completely overwhelmed with just the maintenance of the project, and that can include bug fixes, releases, issue triage, and managing issues, but also reviewing other contributors' pull requests. So one of the most helpful things we're trying to cultivate is basically a balanced perspective of contributions, one that includes not just submitting pull requests that have new features, but also reviewing other people's pull requests, helping other users with particular issues, and even fixing bugs. And one really nice thing that GitHub has done recently: in their contributor user interface they have a new way of expressing someone's contributions to a particular GitHub repository. There's an XY coordinate system with four directions around it, and it shows someone's contributions across, I think, code review, pull requests, issues, and one other, so you can get a perspective on how balanced someone's contributions to an open source project are. So a simple way of putting it is encouraging people to have a balanced way of contributing to open source. Now, we also want to specifically address first-time or new contributors, and there I think the idea again is balance, but in a way where the core contributors, the existing people working on the project, can help new contributors come along and begin to contribute in different ways. And so even for new contributors, those contributions don't necessarily have to be pull requests: even checking out existing pull requests and just testing them locally is really, really helpful for all these projects. So that balanced perspective is something that we're trying to cultivate. And it's incredible that, you know, GitHub now has the feature you discussed, which kind of facilitates just figuring out this
balance, right? Oh, absolutely. I was thrilled to see them release that, and I think it happened in the last month. We'll jump right back into our interview with Brian Granger after a short segment. It's well known that humans can't multitask, but apparently machines can. I'm here today with Friederike Schüür, research engineer at Cloudera Fast Forward Labs, and we're going to talk about machines that multitask, or multitask learning. Friederike, what is multitask learning? Hi Hugo, it's nice to be back on the DataFramed podcast; thanks for having me. We humans may not excel at doing too many things at once, the definition of multitasking, but we do benefit from learning, or having learned, many related tasks; that's multitask learning. For example, several years ago I moved from Germany to the Netherlands. As a native German speaker, it was relatively easy for me to pick up Dutch: as Germanic languages, German and Dutch share semantic and syntactic properties that a German-speaking Dutch language learner like me can benefit from. I was multitask learning: learning Dutch was easier for me because I had learned German before. And how about machines? Well, machines struggle to learn multiple tasks. Today's supervised machine learning algorithms, algorithms that learn from labeled data, tend to learn one and only one task. They achieve mastery through laser-like focus. Since they only learn one task, they cannot benefit from the relations between tasks. Multitask learning allows machines to learn multiple tasks at once, so that they can benefit from the relationships between tasks. Great. Would you mind giving us an example? Of course. Let's get into the nitty-gritty with natural language processing, or NLP. Part-of-speech tagging, syntactic parsing, and textual entailment are all tasks that assign labels to parts of sentences. Part-of-speech tagging assigns labels such as noun, verb, adjective, and so on. Syntactic parsing identifies the syntactic relations between the parts of a sentence. Textual entailment identifies whether one
sentence is the consequence of another, or contradicts another, or whether there is no such relationship between the sentences. Part-of-speech tags improve syntactic parsing; part-of-speech tags and the parse tree improve textual entailment. Now, multitask learning enables us to train one model to master all three tasks, so that textual entailment, for example, can benefit from the part-of-speech tags. Importantly, a model trained to master all three tasks is likely to outperform models trained to master one and only one task. That is one of the benefits of multitask learning: it allows machines to benefit from related tasks for better performance. It's no surprise that the natural language processing library spaCy, I'm a huge fan, uses multitask learning for some of its models. That sounds great. So when would you use multitask learning? Multitask learning is an approach to training supervised machine learning algorithms; it is not an algorithm. So in principle there's a multitask version of every supervised algorithm, and you could use multitask learning all the time, but you really want to use multitask learning if you're looking to solve complex problems. Single-task learning does a fine job when you're looking to solve simpler problems, which is why it's been so popular until now. Complex problems benefit from the power of the multitask learning approach; they benefit from learning the relations between tasks. And can you give us some examples? Absolutely. Image segmentation algorithms, for example, identify object boundaries in images: the boundaries of tissue types in medical images, or building types in satellite images. Image segmentation is hard; results are plagued by blurry edges, because algorithms struggle to really delineate objects and assign different labels to pixels at object boundaries. Image segmentation benefits from joint training with boundary detection or shape identification. Most buildings are composed of rectangles, and their edges are straight lines; with multitask learning, we can exploit such regularities to sharply delineate object boundaries during image segmentation.
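The idea running through this segment, several tasks trained jointly so they share what they learn, is often implemented as hard parameter sharing. A toy, stdlib-only sketch of the principle: two regression tasks forced to share one slope while keeping their own intercepts. This illustrates the mechanism, not a production multitask learner:

```python
def fit_shared_slope(tasks, lr=0.01, steps=5000):
    """Jointly fit y = w*x + b_t for several tasks t.

    The slope w is shared across tasks (hard parameter sharing);
    each task keeps its own intercept b_t. Plain gradient descent
    on mean squared error per task.
    """
    w, biases = 0.0, [0.0] * len(tasks)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = [0.0] * len(tasks)
        for t, (xs, ys) in enumerate(tasks):
            n = len(xs)
            for x, y in zip(xs, ys):
                err = (w * x + biases[t]) - y
                grad_w += err * x / n
                grad_b[t] += err / n
        # Average the shared-parameter gradient over tasks, so every
        # task's data pulls on the same slope.
        w -= lr * grad_w / len(tasks)
        for t in range(len(tasks)):
            biases[t] -= lr * grad_b[t]
    return w, biases
```

Because both tasks pull on the same slope, each task effectively gets more data for that parameter, which is the sense in which related tasks help each other.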
So what about natural language processing? Yeah, that's a pretty sensible choice too. A couple of months ago Salesforce released one model that mastered many NLP tasks, from sentiment analysis, semantic role labeling, and machine translation to summarization, common-sense reasoning, and question answering. Multitask learning is only effective, of course, if the model is asked to learn tasks that are actually related; in this case it's intuitively clear how common-sense reasoning and question answering support each other: common-sense reasoning may help us get the answer we need. Thanks, Friederike, for this enlightening introduction to multitask learning. After that interlude, it's time to jump back into our chat with Brian. As we've been discussing, the scale and reach of Project Jupyter is massive, so I'm sure there are many different uses of notebooks and the project in general, but I'm wondering, to your mind, what are the main uses of Jupyter notebooks for data science and related work? Yeah, in terms of numbers of people using Jupyter for a particular purpose, I would say interactive computing with data, so data science, machine learning, AI, is one of the most popular ways that Jupyter is being used, both by practitioners, people who are working in industry on data science and machine learning teams, but also in educational contexts: within universities, with online programs, with boot camps, you have instructors and students doing those activities around data science and machine learning in an educational context. I think that really captures the bigger picture of how Jupyter is used. Yeah, and in fact we use notebooks at DataCamp for our projects infrastructure; as you know, we teach a lot of skills in our courses, and in our projects we teach kind of end-to-end data science workflows using Jupyter notebooks. That project-style workflow is something that I've seen when I've taught data science at Cal Poly, my university,
in that oftentimes it's helpful to start students in a very highly scripted manner, where the exercises are very small-scale and focused on a particular aspect, a particular concept, and then eventually transition to more open-ended, project-based work. I know in those courses, towards the end of the quarter, when the students have an opportunity to do sort of end-to-end data science that's a little more open-ended, the learning really increases a lot; the students get a lot out of it. So I'm thrilled to see that DataCamp has that type of experience as well. And of course we see notebooks pop up everywhere: in the slide deck that we discussed earlier you have a great slide on the Large Synoptic Survey Telescope; on top of that, of course, the gravitational waves were discovered by the LIGO project, and they've actually published all of their Jupyter notebooks. So this spans basic scientific research right through to what's happening now, I mean, there's a lot of stuff happening at Netflix with notebooks, so it's across the board, right? Yeah, and this is really a pattern that we've seen emerge in the last two years: the transition from ad hoc usage by individuals in organizations to official, organization-wide deployments at scale. We're starting to see a lot more organizations adopt Jupyter in a similar way to LIGO or LSST or Netflix, where it is officially deployed and maintained by the organization and many, many users are using Jupyter on a regular basis. Some of the larger deployments that we're aware of have many thousands, or even on the order of 10,000 or more, people, so the scale is definitely getting large. And I'm going to state two things that I think are facts, and correct me if I'm wrong: Netflix runs over a hundred thousand automated notebook jobs a day, and at least two either contributors or core contributors to Project Jupyter work full-time at Netflix as well. Yes, absolutely. So Kyle Kelley and M Pacer are on the notebook
team, I don't know if that's exactly the name of their team, but they're on one of the tools teams at Netflix, and they work with us on the core Project Jupyter projects, but they also have a number of other open source projects that work with the different Jupyter protocols. One of those is nteract, which is another user interface for working with Jupyter notebooks that is focused on simplicity and on personas where individuals do want to work some with code, but they're not living and breathing in code all the time; business analysts would be a great example of a persona that nteract is targeting. And then, as you mentioned, Netflix has really innovated in the area of using notebooks in a programmatic way, running them as batch jobs every day, and I think your number of around a hundred thousand batch jobs that are notebooks a day sounds about right from what I remember. They have a number of open source projects out to help with those types of workflows: one of those is papermill, the other is commuter. I think one of the things that I love about what's going on at Netflix, and I think this really comes from the leadership of Kyle Kelley there, is a deep understanding of the value of Jupyter's open protocols: the recognition that the different software projects that we've built on top of those protocols are sort of like a Lego set. You bring them home from the store, and there's a default instruction set to build something out of the box, but then you realize that the same pieces can be reassembled in different ways to build whatever your organization needs. I love how that thinking has really seeped into all the different ways that Netflix is working with data, and I think they're doing really interesting things as a result. And the notebook team at Netflix actually published a really interesting blog post recently, which we'll link to in the show notes, along with a lot of the other things we're talking about.
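For a sense of what running notebooks programmatically looks like: papermill takes an input notebook, a parameter dictionary, and an output path. A stdlib-only sketch of how one parameterized notebook might be expanded into a day's batch jobs follows; the file layout and parameter names here are made up for illustration, and only the `papermill.execute_notebook` call named in the docstring is papermill's actual API:

```python
from datetime import date

def build_notebook_jobs(notebook, run_date, regions):
    """Expand one parameterized notebook into per-region batch jobs.

    Each (input, output, params) triple would then be run with
    something like:
        papermill.execute_notebook(input, output, parameters=params)
    """
    jobs = []
    for region in regions:
        # One dated output notebook per region, keeping every run's results.
        out_path = f"runs/{run_date.isoformat()}/{region}/{notebook}"
        params = {"run_date": run_date.isoformat(), "region": region}
        jobs.append((notebook, out_path, params))
    return jobs
```

The design point is that the executed notebooks are themselves the job artifacts: each output notebook records its parameters, code, and results together.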
Yeah, and they also gave a number of talks at JupyterCon, and those talks will be posted on the JupyterCon YouTube channel in the coming month, I think. Okay, fantastic. So, with 2.5 million public notebooks on GitHub, I'm sure there's a lot of surprising stuff happening out there, and this may be a bit of a curveball, but I'm wondering if you've seen any uses of a Jupyter notebook that have surprised you, where you've been like, oh wow, that's interesting. What is the most surprising use of a notebook you've seen? Yeah, I mean, one of the fun things about working on the project is following all the amazing things that our users are doing, and, you know, I think seeing the impact that Jupyter is having in the world of scientific research is something that we're really proud of. So to see large-scale science, such as LIGO and Virgo winning the Nobel Prize in Physics and, as part of that, publishing Jupyter notebooks that anyone in the world can use to completely reproduce their analysis, all the way from the raw data to the final publication-ready visualizations, that makes us really proud. I don't know that surprise is the right word to use there; some of that is that that's the community we came out of, and so we've always worked really hard to make sure that Jupyter was useful for those usage cases. In terms of surprise, the most surprising or shocking usage of Jupyter was by Cambridge Analytica and SCL Elections, to build machine learning models to manipulate the 2016 election. Right, and I do think, you know, surprising is one word there, and shocking is another. I actually remember the first time I saw a tweet, I think it was Wes McKinney who tweeted words to that effect, and we saw a screenshot of a Jupyter notebook with a pandas DataFrame and some scikit-learn fit and predict, or something like that. That was a moment where I also stepped back and really thought, you know, all these tools can be used for all types of purposes.
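The "fit and predict" pattern just mentioned is the standard scikit-learn estimator workflow. As a stdlib-only illustration of what that interface looks like (a toy nearest-centroid classifier written in the scikit-learn style; any real notebook would use scikit-learn's own estimators):

```python
class NearestCentroid:
    """Minimal fit/predict estimator in the scikit-learn style.

    An illustration of the interface convention, not scikit-learn itself.
    """

    def fit(self, X, y):
        # Average the feature vectors belonging to each label.
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, v in enumerate(row):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {
            label: [v / counts[label] for v in acc]
            for label, acc in sums.items()
        }
        return self  # fit returns self, so fit(...).predict(...) chains

    def predict(self, X):
        def dist2(a, b):
            # Squared Euclidean distance between two feature vectors.
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [
            min(self.centroids_, key=lambda c: dist2(row, self.centroids_[c]))
            for row in X
        ]
```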
Right. So all of Cambridge Analytica's web presence and GitHub presence is gone; SCL Elections, which worked with them, hasn't taken all their stuff down from GitHub, and so there's a project called Jupyter stream, and you can tell that the people working at SCL Elections were typical data scientists who were excited to use these tools to do data science. The thing that's scary is, if you look in the demo subdirectory of that repo, there's a notebook there, and it's very clear the type of things they were doing. Now, in this particular case it's nothing particularly sensitive; it looks like they're tracking voter registration counts by calendar week and working with a pandas DataFrame with that. But we were certainly, again, I'm not quite sure what the right word is, surprise doesn't quite capture it, I think it was really a moment of waking up for us, realizing that having an open source project with a very free and liberal open source license is very similar to free speech: a liberally licensed open source project can and will be used by just about anyone, and that includes people doing really good things but also people doing really evil things. It's creepy how this repo also says, Jupyter to the rescue, exclamation point. It's really a trip. I mean, in that original interview that Christopher Wylie did, he was one of the data scientists at Cambridge Analytica, and in that first interview that came out, I think it was in The Guardian, he used the phrase build models, and immediately I thought: wait, hold on a second, this is a data scientist talking about building models; what they're really saying there is import scikit-learn in a Jupyter notebook. Initially it was sort of like, yeah, they might have used it, or maybe they used RStudio, but over time it's become very clear that they certainly were using Jupyter at some point. So this is a nice segue into my next question, which may seem like
that's an answer too, but I suppose it isn't necessarily. My next question is: adoption of notebooks, as we've discussed, has been huge, and I'm wondering if there are places you see notebooks used where they shouldn't be used.

Yes. I'm not quite sure I would phrase it as where they shouldn't be used, but certainly I think there's a little bit of an effect where the notebook is a hammer and so everything starts to look like a nail. In particular, the type of workflow where notebooks begin to be rather painful to use is when exploratory data science and machine learning becomes more about software engineering and data engineering, and in those usage cases it's not a fantastic software engineering platform; it's really not designed for that purpose. This is something we're hearing from our users: that right now there's sort of a very steep incline between working interactively in a notebook and software engineering, and as someone moves across that transition, at some point today they get to the point where really the right thing for them to do is stop using Jupyter, open up their favorite IDE, and start to do traditional software engineering. That can be rather painful, in the sense that most of the IDEs that people love are not web-based, so if anyone is working with significant amounts of data and running in the cloud, those IDEs may not even really be a great option. So we are working a lot to improve the experience at that boundary between interactive data science and software engineering. We don't envision that Jupyter is ever going to replace full-blown IDEs, but it's really at that boundary that we're seeing a lot of user pain currently, and where notebooks themselves start to be not the best tool.

Yeah, and that dovetails nicely into my next question, which is around the fact that there are several common criticisms of notebooks, such as that they may encourage bad software engineering practices, and, I suppose most famously, recently JupyterCon
accepted Joel Grus's talk "I Don't Like Notebooks" to be presented at JupyterCon, and I'm just wondering what you consider the most important, relevant, valuable, or insightful criticisms that can help move the project forward.

Yeah. So I really appreciate the talk that Joel gave at JupyterCon; it was a really well-received talk, and we want to hear things like this; it's really important for us. Some of the criticisms that he had about Project Jupyter are in the category of things that we could fix, so existing user experiences, or features that we offer, don't offer, or could improve, and on most of the things he brought up in that category I think the whole core team is more or less on the same page. The other aspect he was bringing up gets more to the heart of interactive computing with Jupyter notebooks, and I think it's helpful to bring those things up, as it really forces us to clarify the value proposition of that type of workflow in a notebook compared to traditional software engineering. So the discussion that's emerged out of that has been really helpful, and it's something that is helping us clarify when you should use Jupyter notebooks, why you would use them, and why you should not use them.

We'll jump right back into our interview with Brian Granger after a short segment.

I'm back here with Friederike to talk about machines that multitask. Friederike, you said that multitask learning is only effective if the model is asked to learn tasks that are related. What if they're not?

Well, if tasks aren't related but you're asking a single model to learn them all, then multitask learning may actually lead to worse performance. We call this negative transfer.

Okay, so how do I know if tasks are related?

Yes, so mathematicians have tried and failed to develop a formal measure of task relatedness, and in the absence of such a formal measure, you just give it a try and see if multitask learning leads to improved model performance.
If so, it's likely that the tasks are related. As a rule of thumb, if tasks are drawn from the same domain, say natural language processing, then they're likely related.

You see, intuition: even in the age of machines, it's still extremely valuable. Can you give us an intuition for why multitask learning works?

Well, if you're providing a model with two instead of one set of labels, you're giving it more information that it can learn from; that is one of the reasons. Also, while single-task algorithms struggle with noisy labels, multitask algorithms are more robust to noise. Usually, measurement errors are independent, so algorithms using multitask learning can average out noise during training. Finally, multitask learning acts as a regularizer: since the model has to learn multiple tasks at once, it is less likely to overfit to one task and more likely to generalize to previously unseen data. The reasons for this remarkable effectiveness of multitask learning are rooted in machine learning fundamentals; it benefits you to know how and why algorithms work, not just how to use them.

Besides accuracy improvements, are there other benefits of multitask learning?

Yes. Multitask learning allows you to use not just one but multiple data sets to train one single model. Researchers have gone as far as training one model to master 259 tasks based on 259 data sets. Here, multitask learning was used to predict the likely interactions between drug targets and promising compounds, a part of so-called virtual drug screening. Data on target-compound interactions is produced in medical laboratories that investigate related but distinct questions, and the circumstances vary from lab to lab. Multitask learning allows you to take all this data to train one single model while respecting likely differences between data sets by treating each one as a distinct task. To be clear, these tasks are related because the data is related: it is drawn from the same domain to answer the same or similar research questions. Furthermore, multitask
models also tend to be more efficient in production: a multitask model can give you multiple predictions while it needs to process the input data only once. And finally, multitask models can give you insight into task relations, which is pretty neat.

Got it. So what did you do with multitask learning?

We built a prototype; we call it Newsie. It puts current news in perspective. We trained one multitask neural network on a body of news articles to predict news article category: politics, sports, entertainment, you name it. News articles were taken from sensationalist tabloids, think the New York Daily News, and the more buttoned-up broadsheets, think the New York Times. Newsie was trained to predict the news category of broadsheet articles, that's task one, and tabloid articles, that's task two.

So what does the prototype look like in its final form?

It's quite interactive. You can take any article you're interested in and analyze it as a broadsheet or tabloid article. Our multitask neural network reads the article like a human would, word by word from left to right, and incrementally predicts the article's news category. Predictions are overlaid on the article word by word, so you gain insight into the words associated with news categories given the unique context of the article you decided to look at. The user can switch between a tabloid and a broadsheet lens to study differences in language use and style across publication types. Crucially, this model, since it's been trained on 1.6 million articles, surfaces large-scale reporting tendencies and other biases that a human reader may not, or may only unconsciously, pick up on. We found, for example, that in reporting on daily news, tabloid newspapers are more likely to use language that signals factual statements, such as "reported" and "said"; this isn't true for broadsheet articles. Tabloids also tend to report on politics and world news as opinion, which is an interesting choice, since this may affect how we consume the news. It's good to surface these tendencies, and Newsie does exactly that.
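The setup Friederike describes, one shared representation feeding several task-specific heads (often called hard parameter sharing), can be sketched in a few lines. This is a toy illustration, not the prototype's actual code: the function name `fit_multitask` and the linear model are invented for the example, and a real system like the one described would use a neural network over text.

```python
import random

# Toy illustration of hard parameter sharing: a single shared weight w maps
# the input to a hidden feature, and each task has its own head (a1, a2) on
# top of that shared representation. Gradients from BOTH tasks update w.
def fit_multitask(xs, y1s, y2s, lr=0.01, steps=5000):
    w, a1, a2 = 0.5, 0.5, 0.5
    for _ in range(steps):
        gw = ga1 = ga2 = 0.0
        for x, y1, y2 in zip(xs, y1s, y2s):
            h = w * x                 # shared representation
            e1 = a1 * h - y1          # task-1 error
            e2 = a2 * h - y2          # task-2 error
            gw += 2 * (e1 * a1 + e2 * a2) * x   # both tasks pull on w
            ga1 += 2 * e1 * h
            ga2 += 2 * e2 * h
        n = len(xs)
        w, a1, a2 = w - lr * gw / n, a1 - lr * ga1 / n, a2 - lr * ga2 / n
    return w, a1, a2

random.seed(0)
xs = [random.uniform(0.1, 1.0) for _ in range(50)]
y1s = [2 * x for x in xs]             # task 1: slope 2
y2s = [3 * x for x in xs]             # task 2: slope 3
w, a1, a2 = fit_multitask(xs, y1s, y2s)
# After training, the composed slopes a1*w and a2*w approach 2 and 3:
# the shared weight is shaped by evidence from both tasks at once.
```

Because the shared weight receives gradient signal from every task, independent label noise tends to average out in it, which is the noise-robustness argument made above.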
Thanks, Friederike, for that informative dive into the power and benefits of multitask learning. Time to get straight back into our chat with Brian.

So we've discussed notebooks, of course, but something I'm really excited about, and I know it's something you're incredibly excited about, is the next-generation user interface for Project Jupyter, which is JupyterLab. Maybe you can tell us what JupyterLab is and why working data scientists would find it useful.

Yeah, JupyterLab is definitely something that I and many other people on the core team are excited about. JupyterLab is a next-generation user interface for Project Jupyter. We've been working on JupyterLab for around four years now, and just in the last month it left beta, so it is ready for production.

Congratulations.

Yeah, thank you very much; we're really pleased to get through that hurdle. It's still not at a 1.0 release, because some of the developer-oriented extension APIs are still stabilizing. One of the big things we heard from users of the classic notebook is that people wanted the ability to customize and extend and embed aspects of the notebook within other applications, and the original code base of the classic notebook just wasn't designed in a way that made that easy. So one of the core design ideas in JupyterLab is that everything in JupyterLab is an extension, and all those extensions are npm packages. The idea there is that the core Jupyter team can't build everything for everyone, and a lot of different individuals and organizations want to come along and add new things to JupyterLab, and those extension APIs, which are the public developer-oriented APIs of JupyterLab, enable those extensions to be built. We're still in the process of stabilizing some of those APIs, but I want to emphasize that from a user perspective, for people who are using Jupyter on a daily basis, JupyterLab is fully stable and production-ready, and in many ways at
this point I would say it's a better user experience and more stable than the classic notebook.

Great. And what type of features do you have in JupyterLab that you don't get in the classic notebook?

One of them is the ability to work with multiple different activities, or building blocks for interactive computing, at the same time. In the classic notebook, each notebook or terminal or text editor has a separate browser tab, and that made it very difficult for us to integrate those different activities with each other. An example of how that integration works in JupyterLab is, if you have multiple notebooks open side by side, you can just drag cells between those two notebooks. Another example would be, if you have a markdown file open, you can right-click on the markdown file to open a live markdown preview, and then also open a code console attached to that markdown file and start running code in any of the different languages that Jupyter supports, in a manner that's more similar to an experience like RStudio. So we have the different building blocks, places to type code, outputs, terminals, notebooks, integrated in different ways to support some of these other workflows that come up.

And also a CSV viewer, right?

Yes. So another big design idea in JupyterLab is the idea of more direct-manipulation user interfaces. In many cases, writing code is the most effective way of interacting with data; however, there are many situations where writing code is a bit painful, and a great example of that is if you have a new CSV file, you don't know what's in it, and you simply want to look at it. Of course, you can open up a notebook and import pandas and stuff to look at the CSV file, but in many cases more direct modes of interaction are highly productive and useful. So JupyterLab's file handling is based around the idea of the possibility of multiple viewers and editors for a given file type. For example, for a CSV file, you can open it in a text
editor and edit it as a plain-text CSV file, or you can open it in this new grid, or sort of tabular, viewer that we have, and that viewer is the default for CSV files. So you can just double-click on a CSV file in JupyterLab and it will immediately open in a form that looks like a table; we try to detect the delimiters in the CSV file, and we even allow a drop-down to select those.

And I recall from one demonstration that it supports wildly large CSV files as well, right?

Yeah. So one of our core contributors, Chris Colbert, spent a lot of time building a well-designed data model and a viewer on top of that. The data model for this grid viewer does not assume that all of the data is loaded into memory, so it has an API that allows you to request data from the model on an as-needed basis. Where that's used is in the view that sits on top of it: if you have a really large CSV or tabular data model, the view is only going to request the portions of the data that are visible to the user at a given time. So for example, right now, in some of the demos that we're doing, you can just double-click on a CSV file that has over a million rows; those files are big enough that they don't open successfully in Microsoft Excel on the same laptop, and they open just fine in JupyterLab. The viewer there, the renderer that Chris wrote, uses canvas, so it's a very high-performance tabular data viewer. And to keep ourselves honest, we've tested it with synthetic data sets, so these are not concrete data sets, they're generated on the fly, but they have a trillion rows and a trillion columns, and the tabular data set viewer works really well; you can scroll through the data set just fine. I think another side effect of direct interaction with data is that when you make it easy for users to interact with data in those ways, they're going to do that, right? If you can double-click on a CSV file, a user is going to find, wow, that's useful, and they're going to do that.
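The on-demand data model Brian describes can be sketched roughly as follows. This is a hypothetical Python stand-in, not JupyterLab's actual API (the real grid is written in TypeScript); the class `LazyCSVModel` and its methods are invented for illustration. The key idea is that the model indexes row byte-offsets in one pass, and the view then fetches only the rows currently in the viewport:

```python
import os
import tempfile

class LazyCSVModel:
    """Serve cell values on demand instead of loading the whole file into memory."""

    def __init__(self, path):
        self.path = path
        self._offsets = []            # byte offset of each row, built in one pass
        with open(path, "rb") as f:
            while True:
                pos = f.tell()
                if not f.readline():  # EOF
                    break
                self._offsets.append(pos)

    @property
    def row_count(self):
        return len(self._offsets)

    def rows(self, first, last):
        """Fetch only the rows currently visible in the viewport."""
        out = []
        with open(self.path, "rb") as f:
            for r in range(first, last):
                f.seek(self._offsets[r])       # jump straight to row r
                line = f.readline().decode("utf-8").rstrip("\r\n")
                out.append(line.split(","))
        return out

# Demo: a million-row file would work the same way; the viewer only ever
# parses the handful of rows on screen.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    for i in range(1000):
        f.write(f"{i},value{i}\n")

model = LazyCSVModel(path)
visible = model.rows(100, 105)        # parses just five rows
```

This request-driven design is why the real viewer can scroll through synthetic trillion-row models: the cost scales with the viewport, not the file.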
And you have to spend a lot of time making sure the underlying architecture doesn't let them do things that are going to have adverse side effects. We're trying to build an architecture that has both a good user experience but can also deal with the realities of large data.

And there are several other features that we could discuss, but I'm just going to pick one, which I think is very attractive and fantastic, which is the ability to collaboratively work on Jupyter notebooks with colleagues and collaborators.

Yes. So this is something that we've been working on for a while now. Our first take on this was a JupyterLab extension that a postdoc at UC Berkeley, Ian Rose, wrote. This provided integration with Google Drive and the Google Realtime API to enable multiple people to open a single notebook, collaborate in real time on that notebook, and see the other people working on and editing the notebook at the same time. Then, in the last year and a half, we've started a new effort to build a real-time data model and datastore for JupyterLab, for two reasons. One is that the Google Realtime API has been discontinued, and the other is that we've heard very clearly from our users that there are many organizations for whom sending all their data to Google APIs is a no-go. So it's become really important for us to have a high-performance, really well-designed real-time datastore. We've been working on that for the last 18 months, and again, Chris Colbert, who did the data grid, is the person working on that.

Great. And listeners out there, this has been kind of a whirlwind introduction to a bunch of features in JupyterLab, and I'd urge you to go and play with it yourself and check out some of the demos online as well, if you haven't yet.

And I want to clarify: the version of JupyterLab that's out today does not yet have the real-time collaboration.

Okay, that's right, not yet. So we've discussed IPython, we've discussed Jupyter notebooks, we've discussed JupyterLab. What else
exists in the jupiter ecosystem could you give us just a brief rundown of a couple of the other things yeah absolutely probably the biggest other constellation of projects is that the jupiter hub project jupiter hub is its own organization on github and there's a number of different separate repos and projects there in jupiter hub provides basically the ability for organizations to deploy at scale to many users and you know with the sort of patterns of adoption that we're seeing right now that usage case is really really important in as a result of that Jupiter had the seeing both a lot of interest from people contributing and also organizations using it we discussed earlier the talk you recently gave it a Jupiter con in which you stated that project Jupiter is undergoing a phase transition from having individual users in organizations to having a large-scale institutional adoption and I'm wondering what the unique challenges the project is now facing due to this transition yes so there's both organizational and tactical challenges we're facing on the organizational side I would say the big challenge is that we're seeing an increasing number of organizations coming to us and wanting to interact with us rather than just individuals and those organizations and that really changes the type of people you're talking to in the organizations so in many cases in the past it may have been data scientists or machine learning researchers and increasingly it's managers project managers and other decision-makers who are thinking about the broader data strategy at the organizations from a technical perspective it brings out a lot of new usage cases in particular in Jupiter hub to address the the needs of large organizations and some examples of those are security security is a really important thing for larger in particular when their sensitive data in the mix another aspect of that is that in these large organizations there are typically a wide range of different skill sets 
responsibilities, roles, access permissions, and priorities among the people working with Jupyter. So it's not necessarily just people who are living and breathing code all day long; there are a lot of other people in the organization working with data who don't necessarily want to look at code all the time, or even most of the time. So a lot of the work we're doing is thinking about how Jupyter will need to evolve to address those usage cases.

Absolutely. So Brian, as my last question, I'm wondering if you have a final call to action for all our listeners out there, who may have used Jupyter notebooks, may not have, but may be interested in doing so.

Yeah. So I think there are a couple of different calls to action. One is for people to engage with open source as individuals. If you're a data scientist or someone doing machine learning at a company, or a student learning about these tools and techniques, engage with the open-source projects: find an open-source project you're interested in, understand more about the project, maybe help with documentation. A lot of what we've found is that innovation happens when diverse groups of people come together, talk to each other, and work towards common goals, and so the more people we have joining the projects, contributing, and helping us think about these things, the better off and more healthy the open-source projects will be, and the users of those projects will be better served as well.

Agreed.

A second call to action would be for people working in larger organizations that are using open-source tools in this space. I think it's important to know that many of the open-source projects, in particular those that are community-driven, like Jupyter and many of the other NumFOCUS projects, continue to struggle with long-term sustainability, and there are many core contributors to these projects who continue to lack long-term funding and the ability to focus on working on the projects. So if you're at an organization using these tools, I
would really encourage you to talk to the people in your organization to think about and understand how you can support the core contributors and the broader sustainability, both in the sense of community but also, in particular, the financial sustainability of these efforts. That would be really, really helpful.

And I'll add one other final call to action there, which is: I started using JupyterLab all the time instead of notebooks mid last year, I think, and I would urge anyone out there who still uses the classic notebook to jump into JupyterLab. I think you have no reason not to these days; it's such a wonderful place to use notebooks, among many, many other things.

Yes, that's a great point. Even for the part of my job where I get to use Jupyter notebooks, I transitioned, in particular in teaching and some research, to using JupyterLab back in January, and it's worked really, really well in those contexts, and now I'm using JupyterLab basically all the time. So I just echo what you said, and I appreciate the kind words.

Brian, thank you so much for coming on the show. I always love our conversations, and it's been an absolute pleasure formalizing this and putting it out there.

Yes, and thank you so much, Hugo, for working on this podcast. I know a lot of people really appreciate it. Thanks for having me.

Thanks for joining our conversation with Brian. We saw how much work necessarily goes into open-source projects of the scale of Jupyter, such as software engineering, UX and UI design, large-scale organizational considerations, and community building, to name a few. We also saw the unique challenges faced by open-source data tooling platforms in general, due to the fact that we're now at a phase transition from having individual users in organizations to having large-scale institutional adoption; Netflix is one example of many, as is the LIGO project's Nobel Prize-winning discovery of gravitational waves. And lastly, don't forget to go give JupyterLab a test drive; you've got
everything to win and nothing to lose. Also, make sure to check out our next episode, a conversation with Cassie Kozyrkov, Chief Decision Scientist at Google Cloud. Cassie and I will be talking about data science, decision-making, and decision intelligence, which Cassie likes to think of as data science++, augmented with the social and managerial sciences. We'll talk about the different and evolving models for how the fruits of data science work can be used to inform robust decision-making, along with the pros and cons of the models for embedding data scientists in organizations relative to the decision function. We'll tackle head-on why so many organizations fail at using data to robustly inform decision-making, along with best practices for working with data, such as not verifying your results on the data that inspired your models; as Cassie says, split your damn data. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at @DataCamp. You can find all our episodes and show notes at datacamp.com/community/podcast.
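As a coda to Cassie's "split your damn data" advice: keeping a held-out test set is only a few lines in plain Python. The helper name here is invented for illustration (libraries such as scikit-learn ship an equivalent `train_test_split`); the point is simply that the model is never scored on the rows that fit it.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle, then split, so the model is never verified on data that inspired it."""
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    shuffled = rows[:]                 # copy; leave the caller's data untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)   # 80 rows to fit on, 20 to verify on
```

Fit on `train` only, and report performance on `test` only; touching the test rows during model development is exactly the overfitting trap discussed above.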